Data verification methods, devices, equipment, and storage media based on OCR and NLP
By using OCR and NLP technologies to automatically identify and review documents in the insurance industry, the problem of low efficiency in traditional manual review has been solved, and an automated and accurate document review process has been achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CHINA PING AN LIFE INSURANCE CO LTD
- Filing Date
- 2022-07-21
- Publication Date
- 2026-06-30
AI Technical Summary
Traditional insurance industry document review relies on manual review, which is inefficient and cannot guarantee accuracy. In particular, when erroneous documents are uploaded, they need to be initially screened and adjusted manually, resulting in low work efficiency.
The system uses OCR technology to recognize characters in document images and performs semantic recognition through NLP models. Combined with image reference information such as the proportion of face regions, Euclidean distance, and edge detection, it automatically determines the validity and review results of the documents.
It enables automated document review without manual verification, reducing labor costs and improving the efficiency and accuracy of document review.
Figure CN115205883B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of artificial intelligence, and in particular relates to a data review method, apparatus, device, and storage medium based on OCR and NLP.
Background Technology
[0002] In the insurance industry, document review is an important process for insurance application or claims. Traditional document review methods mainly rely on staff to manually review the documents provided by customers and make review results based on personal experience. This not only requires staff to have extensive work experience, but also cannot guarantee accuracy under heavy workloads, and the work efficiency is relatively low.
[0003] With the emergence of Optical Character Recognition (OCR) and Natural Language Processing (NLP) technologies, some solutions use OCR to extract text from document images and then use NLP to perform semantic recognition on that text, thereby automating the review. However, if an incorrect document is uploaded, subsequent recognition fails. The document must then be screened manually first, and the OCR recognition area and NLP keywords adjusted according to the content of the materials, resulting in very low work efficiency.
Summary of the Invention
[0004] The following is an overview of the subject matter described in detail herein. This overview is not intended to limit the scope of the claims.
[0005] This invention provides a method, apparatus, device, and storage medium for document review based on OCR and NLP, which can automatically determine the validity of document files and automatically complete OCR and NLP recognition, thereby improving the efficiency of document review.
[0006] In a first aspect, embodiments of the present invention provide a data review method based on OCR and NLP, comprising:
[0007] Acquire a data file image and image reference information of the underwriting business, wherein the image reference information includes a preset reference ratio range;
[0008] Image recognition is performed on the data file image to determine the face region and character region of the data file image;
[0009] When the proportion of the face region in the data file image meets the reference proportion range, the data file image is determined to be a valid data image;
[0010] The target character is identified from the character region of the valid data image using a preset OCR model;
[0011] The title character and content character are determined from the target characters, and the target dictionary is determined from the preset candidate dictionary based on the title character;
[0012] The target dictionary and the content characters are input into a preset NLP model for semantic recognition to obtain the data review results.
[0013] Additionally, in some embodiments, the image reference information further includes a reference Euclidean distance, and the method for determining the data file image as a valid data image further includes:
[0014] Obtain the RGB value of each pixel in the data file image;
[0015] Determine the first Euclidean distance between the RGB value of each pixel and the first RGB reference value, wherein the color corresponding to the first RGB reference value is black;
[0016] Determine the second Euclidean distance between the RGB value of each pixel and the second RGB reference value, wherein the color corresponding to the second RGB reference value is white;
[0017] The average Euclidean distance of the data file image is determined based on the first Euclidean distance and the second Euclidean distance of all the pixels.
[0018] When the average Euclidean distance of the data file image is less than the reference Euclidean distance, the data file image is determined to be the valid data image.
[0019] Additionally, in some embodiments, obtaining the RGB value of each pixel in the data file image includes:
[0020] The data file image is subjected to edge detection using a preset edge detection algorithm to determine the image edges;
[0021] The image detection region is determined based on the image edges, and the RGB value of each pixel is obtained in the image detection region.
[0022] In addition, in some embodiments, the step of recognizing the target character from the character region of the valid data image using a preset OCR model includes:
[0023] The valid data image is binarized to obtain a binarized image;
[0024] Character projection is performed on the binarized image, and character boundary points are determined based on the character projection results.
[0025] Based on all the character boundary points, multiple characters to be identified are segmented from the valid data image;
[0026] Each of the characters to be identified is subjected to character recognition to obtain the target character.
[0027] Additionally, in some embodiments, the image reference information further includes title position information, and the step of determining the title character and content character from the target character includes:
[0028] The title area is determined based on the title position information, and the remaining area is determined as the content area.
[0029] The characters identified from the title area are determined as the title characters, and the characters identified from the content area are determined as the content characters.
[0030] In addition, in some embodiments, the image reference information further includes data style information, which includes the arrangement of data indicators in the data file image. The target dictionary includes multiple candidate keywords. The step of inputting the target dictionary and the content characters into a preset NLP model for semantic recognition to obtain the data review result includes:
[0031] Based on the data style information, at least one indicator recognition region is determined from the data file image, the indicator recognition region including an indicator name region and an indicator value region;
[0032] The content characters in the indicator name area are determined as the characters to be matched, and the content characters in the indicator value area are determined as the target indicator value, wherein the target indicator value includes at least one;
[0033] The target keyword is matched from the candidate keywords based on the character to be matched;
[0034] Each target keyword and its corresponding at least one target indicator value are input into the NLP model to obtain indicator analysis results;
[0035] The document review result is determined based on the analysis results of all the aforementioned indicators.
[0036] In addition, in some embodiments, the step of matching the target keyword from the candidate keywords based on the character to be matched includes:
[0037] Determine the third Euclidean distance between the character to be matched and the candidate keyword;
[0038] When the third Euclidean distance is less than a preset threshold, the corresponding candidate keyword is determined as the target keyword.
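The matching rule in [0037] and [0038] can be sketched as follows. The text does not say how characters are turned into vectors before the Euclidean distance is taken, so this sketch assumes a simple character-count vector; `char_vector` and `match_keyword` are hypothetical names, and returning the closest candidate under the threshold is likewise an assumption.

```python
import math
from collections import Counter


def char_vector(text, alphabet):
    """Count how often each symbol of `alphabet` occurs in `text` (assumed encoding)."""
    counts = Counter(text)
    return [counts[ch] for ch in alphabet]


def match_keyword(to_match, candidates, threshold):
    """Return the closest candidate keyword whose distance is below `threshold`, else None."""
    best, best_dist = None, None
    for cand in candidates:
        # Build a shared alphabet so both vectors have the same dimensions.
        alphabet = sorted(set(to_match) | set(cand))
        dist = math.dist(char_vector(to_match, alphabet), char_vector(cand, alphabet))
        if dist < threshold and (best_dist is None or dist < best_dist):
            best, best_dist = cand, dist
    return best
```

With this encoding, a slightly misread OCR string such as "net profl" stays close to "net profit" and far from unrelated keywords; the threshold value would need tuning for real data.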
[0039] In a second aspect, embodiments of the present invention provide a document review device based on OCR and NLP, comprising:
[0040] The information acquisition unit is used to acquire images of documents and image reference information for underwriting business, wherein the image reference information includes a preset reference ratio range;
[0041] The image recognition unit is used to perform image recognition on the data file image and determine the face region and character region of the data file image;
[0042] The image detection unit is used to determine the data file image as a valid data image when the proportion of the face region in the data file image meets the reference proportion range.
[0043] A character recognition unit is used to identify target characters from the character region of the valid data image using a preset OCR model;
[0044] A dictionary determination unit is used to determine title characters and content characters from the target characters, and to determine a target dictionary from a preset candidate dictionary based on the title characters;
[0045] The semantic recognition unit is used to input the target dictionary and the content characters into a preset NLP model for semantic recognition to obtain the data review results.
[0046] In a third aspect, embodiments of the present invention provide an electronic device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the data review method based on OCR and NLP as described in the first aspect.
[0047] In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium storing a computer program for performing the OCR and NLP-based document review method as described in the first aspect.
[0048] This invention includes: acquiring a document image of the underwriting business and image reference information, the image reference information including a preset reference proportion range; performing image recognition on the document image to determine its face region and character region; determining the document image to be a valid document image when the proportion of the face region in the document image falls within the reference proportion range; recognizing target characters from the character region of the valid document image using a preset OCR model; determining title characters and content characters from the target characters, and determining a target dictionary from a preset candidate dictionary based on the title characters; and inputting the target dictionary and the content characters into a preset NLP model for semantic recognition to obtain the document review result. According to the technical solution of this embodiment, since underwriting documents are usually highly standardized certification documents, the validity of a document image can be initially screened by its face proportion, which eliminates manual verification and reduces labor costs. Furthermore, once the document image is determined to be valid, the OCR and NLP models are used to derive the document review result, achieving automatic review of document images and improving work efficiency.
[0049] Other features and advantages of the invention will be set forth in the description which follows, and will be apparent in part from the description, or may be learned by practicing the invention. The objects and other advantages of the invention may be realized and obtained by means of the structures particularly pointed out in the description, claims, and drawings.
Description of the Drawings
[0050] The accompanying drawings are provided to further understand the technical solutions of the present invention and constitute a part of the specification. They are used together with the embodiments of the present invention to explain the technical solutions of the present invention, and do not constitute a limitation on the technical solutions of the present invention.
[0051] Figure 1 is a flowchart of a document review method based on OCR and NLP provided in one embodiment of the present invention;
[0052] Figure 2 is a flowchart for determining the validity of a data file image provided in another embodiment of the present invention;
[0053] Figure 3 is a flowchart for determining image edges provided in another embodiment of the present invention;
[0054] Figure 4 is a flowchart of OCR recognition provided in another embodiment of the present invention;
[0055] Figure 5 is a flowchart of title recognition provided in another embodiment of the present invention;
[0056] Figure 6 is a flowchart of NLP recognition provided in another embodiment of the present invention;
[0057] Figure 7 is a flowchart of determining target keywords provided in another embodiment of the present invention;
[0058] Figure 8 is a structural diagram of a document review device based on OCR and NLP provided in another embodiment of the present invention;
[0059] Figure 9 is a device diagram of an electronic device provided in another embodiment of the present invention.
Detailed Implementation
[0060] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention.
[0061] It should be noted that although functional modules are divided in the device schematic diagram and a logical order is shown in the flowchart, in some cases the steps shown or described may be performed with a different module division or in a different order from that shown in the flowchart. Terms such as "first" and "target" in the specification, claims, or the aforementioned drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.
[0062] This invention provides a document review method, apparatus, device, and storage medium based on OCR and NLP. The method includes: acquiring a document image of the underwriting business and image reference information, wherein the image reference information includes a preset reference proportion range; performing image recognition on the document image to determine its face region and character region; determining the document image to be a valid document image when the proportion of the face region in the document image falls within the reference proportion range; identifying target characters from the character region of the valid document image using a preset OCR model; determining title characters and content characters from the target characters; determining a target dictionary from a preset candidate dictionary based on the title characters; and inputting the target dictionary and the content characters into a preset NLP model for semantic recognition to obtain the document review result. According to the technical solution of this embodiment, since underwriting documents are usually highly standardized certification documents, the validity of a document image can be initially screened by its face proportion, which eliminates manual verification and reduces labor costs. Furthermore, once the document image is determined to be valid, the OCR model and NLP model are used to derive the document review result, realizing automatic review of document images and improving work efficiency.
[0063] The embodiments of this application can compile, acquire, and process relevant data based on artificial intelligence technology. Artificial intelligence (AI) refers to the theories, methods, technologies, and application devices that use digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results.
[0064] Foundational artificial intelligence technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operating / interactive devices, and mechatronics. AI software technologies mainly encompass computer vision, robotics, biometrics, speech processing, natural language processing, and machine learning / deep learning.
[0065] NLP is an important field within computer science and artificial intelligence. It studies the theories and methods for enabling effective communication between humans and computers using natural language. Natural Language Processing is a science that integrates linguistics, computer science, and mathematics. Therefore, research in this field involves natural language—the language people use in daily life—and thus it has a close connection with linguistic research. Natural Language Processing techniques typically include text processing, semantic understanding, machine translation, question answering, and knowledge graphs.
[0066] Computer vision (CV) is a science that studies how to enable machines to "see." More specifically, it refers to machine vision, which uses cameras and computers to replace human eyes in recognizing, tracking, and measuring targets, and then performs image processing to create images more suitable for human observation or transmission to instruments. As a scientific discipline, computer vision studies related theories and technologies, attempting to build artificial intelligence systems capable of extracting information from images or multidimensional data. Computer vision technologies typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content / behavior recognition, 3D object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping (SLAM), autonomous driving, intelligent transportation, and common biometric recognition technologies such as facial recognition and fingerprint recognition.
[0067] As shown in Figure 1, which is a flowchart of a document review method based on OCR and NLP provided in an embodiment of the present invention, the method includes, but is not limited to, the following steps:
[0068] Step S110: Obtain the data file image and image reference information of the underwriting business. The image reference information includes a preset reference ratio range.
[0069] Step S120: Perform image recognition on the document image to determine the face region and character region of the document image;
[0070] Step S130: When the proportion of the face region in the data file image meets the reference proportion range, the data file image is determined as a valid data image.
[0071] Step S140: Identify the target character from the character region of the valid data image using a preset OCR model;
[0072] Step S150: Determine the title character and content character from the target characters, and determine the target dictionary from the preset candidate dictionary based on the title character;
[0073] Step S160: Input the target dictionary and content characters into the preset NLP model for semantic recognition to obtain the data review results.
[0074] It should be noted that the documents can be any documents related to the underwriting business, such as identity documents, financial documents, diagnostic certificates, etc. This embodiment does not limit the specific type of documents.
[0075] It should be noted that because underwriting services are diverse, the composition of document images varies. For example, identity documents include both a face region and a content region, while financial documents typically have only a content region. These documents usually have a relatively uniform format: the face proportion in an identity document is known in advance, while in a financial document it is 0%. Therefore, a preset range for the face proportion of each type of document image can be used as image reference information to verify the validity of the document image. For example, suppose the reference proportion range for identity documents is 5% to 10%. If image recognition determines that the face region accounts for 9% of the document image, the document is considered valid and subsequent operations can be performed. If image recognition determines that the face region accounts for 15% of the document image, the document is not treated as an identity document; a prompt message is generated and the recognition process exits. Similarly, if the document required for underwriting is a financial document, whose reference proportion is 0, then the document is considered invalid and mistakenly uploaded as soon as any face region is detected in the document image.
[0076] It should be noted that image recognition can be accomplished using common image recognition technologies. It is sufficient to identify the face region and character region from the document image. After determining the face region and character region, their respective areas can be calculated. The proportion of the face region can then be calculated by combining the area of the face region with the area of the document image.
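The proportion check described in [0075] and [0076] can be sketched as below. The sketch assumes the face detector has already returned non-overlapping bounding boxes as `(left, top, right, bottom)` pixel coordinates; the function names are hypothetical and the detector itself is out of scope.

```python
def box_area(box):
    """Area of a (left, top, right, bottom) bounding box in pixels."""
    left, top, right, bottom = box
    return max(0, right - left) * max(0, bottom - top)


def face_ratio(image_size, face_boxes):
    """Fraction of the image covered by face regions (boxes assumed non-overlapping)."""
    width, height = image_size
    total = width * height
    if total <= 0:
        raise ValueError("image must have a positive area")
    return sum(box_area(box) for box in face_boxes) / total


def is_valid_document(image_size, face_boxes, ratio_range):
    """True when the face proportion lies inside the preset reference range."""
    low, high = ratio_range
    return low <= face_ratio(image_size, face_boxes) <= high
```

For a 1000 x 600 image, a 240 x 225 face box covers 9% of the image and passes an identity-document range of `(0.05, 0.10)`, while any face at all fails a financial-document range of `(0.0, 0.0)`.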
[0077] It should be noted that a document file may include multiple images. Because documents in underwriting business are usually uploaded in a fixed order (for example, an identity document followed by a financial document), multiple reference ratio ranges can be preset in the image reference information when the order of the images is known, and the images are then checked one by one. For example, the first image is compared with the first reference ratio range to determine whether it is an identity document, and the second image is compared with the second reference ratio range to determine whether it is a financial document. Those skilled in the art may adjust how the image reference information is set according to actual needs, and no further restrictions are imposed here.
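A minimal sketch of the ordered comparison, assuming the face proportion of each uploaded image has already been computed and that exactly one reference range is configured per upload position (`check_upload_order` is a hypothetical name):

```python
def check_upload_order(face_ratios, reference_ranges):
    """Compare the i-th image's face ratio with the i-th reference range."""
    if len(face_ratios) != len(reference_ranges):
        raise ValueError("expected one reference range per uploaded image")
    return [
        low <= ratio <= high
        for ratio, (low, high) in zip(face_ratios, reference_ranges)
    ]
```

Returning one flag per position makes it possible to tell the user which upload slot holds the wrong document, for example when the identity and financial documents were swapped.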
[0078] It should be noted that common OCR models can be used, and this embodiment does not limit the specific model selection, as long as it can achieve OCR recognition. The target character can be any character in the character region; in this embodiment, the characters include letters and numbers, which will not be further limited here.
[0079] It should be noted that title characters are typically used to distinguish the specific type of document images. For example, financial documents may include personal financial statements or company financial statements; the specific type of document can be determined by identifying the title character, which can be achieved through simple semantic recognition. Since different document types correspond to different metrics, the focus of semantic recognition also differs. To improve the accuracy of NLP model recognition, different dictionaries can be set for different document images. Matching by title characters ensures that the NLP model uses the correct dictionary for semantic recognition, avoiding the use of large, general-purpose dictionaries and effectively improving the efficiency of semantic recognition.
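As an illustration of choosing a per-document dictionary from the title, the sketch below uses plain substring matching as a stand-in for the "simple semantic recognition" mentioned above. The dictionary contents and the names `CANDIDATE_DICTIONARIES` and `select_target_dictionary` are invented for the example and are not taken from the source.

```python
# Hypothetical candidate dictionaries, keyed by document type.
CANDIDATE_DICTIONARIES = {
    "personal financial statement": ["annual income", "total assets", "liabilities"],
    "company financial statement": ["revenue", "net profit", "total assets"],
}


def select_target_dictionary(title_text, candidate_dictionaries):
    """Return (document type, keywords) for the first type named in the title."""
    # Normalize case and whitespace so OCR spacing noise does not block a match.
    normalized = " ".join(title_text.lower().split())
    for doc_type, keywords in candidate_dictionaries.items():
        if doc_type in normalized:
            return doc_type, keywords
    return None, []
```

An unrecognized title returns `(None, [])`, which a caller could treat as a signal to stop and prompt the user rather than fall back to a general-purpose dictionary.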
[0080] It should be noted that the NLP model can be any common model. This embodiment does not restrict the specific structure of the NLP model, as long as it can achieve semantic recognition based on the target dictionary. It is worth noting that the NLP model can identify the meaning represented by different indicators in data file images, such as whether each value in a financial certificate meets the underwriting standards. The specific semantic recognition method can be adjusted according to the meaning represented by the specific indicators.
[0081] Additionally, in one embodiment, the image reference information further includes a reference Euclidean distance. Referring to Figure 2, step S130 of the embodiment shown in Figure 1 further includes, but is not limited to, the following steps:
[0082] Step S210: Obtain the RGB value of each pixel in the data file image;
[0083] Step S220: Determine the first Euclidean distance between the RGB value of each pixel and the first RGB reference value, wherein the color corresponding to the first RGB reference value is black;
[0084] Step S230: Determine the second Euclidean distance between the RGB value of each pixel and the second RGB reference value, wherein the color corresponding to the second RGB reference value is white;
[0085] Step S240: Determine the average Euclidean distance of the data file image based on the first Euclidean distance and the second Euclidean distance of all pixels;
[0086] Step S250: When the average Euclidean distance of the data file image is less than the reference Euclidean distance, the data file image is determined to be a valid data image.
[0087] It should be noted that during the actual uploading of data file images, incorrect images may be uploaded due to operational errors. For example, the uploaded image may be unrelated to underwriting. Such images usually have rich colors, while images for underwriting are relatively simple in color or even black and white. Therefore, the validity of data file images can be judged by the richness of their colors. If the image has a high degree of color richness, it may be an incorrect image. This allows for an initial screening of images before semantic recognition, avoiding the identification of incorrect images and ensuring the accuracy of underwriting.
[0088] To determine the color richness of an image, the Euclidean distance between the color of each pixel and the black and white reference values can be calculated. Euclidean distance, also known as the Euclidean metric, is a commonly used distance definition: the true distance between two points in m-dimensional space, or the natural length of a vector (i.e., the distance from the point to the origin). In two-dimensional and three-dimensional space, the Euclidean distance is the actual distance between two points. The smaller the Euclidean distance between the RGB value of a pixel and a reference RGB value, the more similar the two colors are. Therefore, the first Euclidean distance, calculated against the first RGB reference value, measures the similarity between the pixel and black, while the second Euclidean distance, calculated against the second RGB reference value, measures the similarity between the pixel and white. When the average Euclidean distance of the data file image is less than the reference Euclidean distance, the data file image can be considered a valid image.
[0089] It is worth noting that the reference Euclidean distance can be different for different document images. For example, identity documents have a certain color, so the reference Euclidean distance can be set to a larger value, while financial documents are black and white, so the reference Euclidean distance can be set to a smaller value, as long as it meets the accuracy requirements of image recognition.
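Steps S210 to S250 can be sketched as follows. The text does not specify how the first and second distances are combined into one average, so this sketch assumes each pixel contributes the smaller of its two distances (its distance to the nearer of black and white); that choice and the function names are assumptions.

```python
import math

BLACK = (0, 0, 0)        # first RGB reference value
WHITE = (255, 255, 255)  # second RGB reference value


def average_bw_distance(pixels):
    """Mean over all pixels of the distance to the nearer of black and white."""
    total, count = 0.0, 0
    for rgb in pixels:
        d_black = math.dist(rgb, BLACK)  # first Euclidean distance
        d_white = math.dist(rgb, WHITE)  # second Euclidean distance
        total += min(d_black, d_white)   # assumed way of combining the two
        count += 1
    if count == 0:
        raise ValueError("image contains no pixels")
    return total / count


def passes_color_check(pixels, reference_distance):
    """True when the image is close enough to black-and-white to count as valid."""
    return average_bw_distance(pixels) < reference_distance
```

Under this assumption a scanned black-and-white page scores near 0, while a saturated photograph scores in the hundreds, so a smaller reference distance suits financial documents and a larger one suits colored identity documents.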
[0090] It is worth noting that the specific calculation method of Euclidean distance is a technique well known to those skilled in the art, and will not be elaborated upon here.
[0091] Additionally, in one embodiment, referring to Figure 3, step S210 of the embodiment shown in Figure 2 further includes, but is not limited to, the following steps:
[0092] Step S310: Perform edge detection on the data file image using a preset edge detection algorithm to determine the image edges;
[0093] Step S320: Determine the image detection area based on the image edge, and obtain the RGB value of each pixel in the image detection area.
[0094] It should be noted that since the data file images can be scanned images, there may be many blank areas in the images. In order to improve the accuracy of image recognition, edge detection can be performed on the data file images first using edge algorithms. The area within the image edge is used as the image detection area, and subsequent recognition is performed within the image detection area, which can improve the accuracy of image recognition.
[0095] It should be noted that the edge detection algorithm can be the common Canny algorithm or other algorithms. This embodiment does not limit the specific type of edge detection algorithm.
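To illustrate steps S310 and S320 without external dependencies, the sketch below swaps in a plain forward-difference gradient threshold in place of Canny, then takes the bounding box of the detected edge pixels as the image detection area. The threshold value and function names are assumptions; a production system would more likely call an established Canny implementation.

```python
def edge_bounding_box(gray, threshold=50):
    """Bounding box (left, top, right, bottom), inclusive, of edge pixels.

    `gray` is a 2D list of 0-255 intensities. A pixel counts as an edge when
    the sum of its forward differences to the right and downward neighbours
    reaches `threshold`. Returns None when no edge is found.
    """
    height, width = len(gray), len(gray[0])
    rows, cols = [], []
    for y in range(height):
        for x in range(width):
            right = gray[y][min(x + 1, width - 1)]
            down = gray[min(y + 1, height - 1)][x]
            gradient = abs(right - gray[y][x]) + abs(down - gray[y][x])
            if gradient >= threshold:
                rows.append(y)
                cols.append(x)
    if not rows:
        return None
    return (min(cols), min(rows), max(cols), max(rows))


def pixels_in_region(rgb_image, box):
    """Collect the RGB values inside the inclusive bounding box."""
    left, top, right, bottom = box
    return [
        rgb_image[y][x]
        for y in range(top, bottom + 1)
        for x in range(left, right + 1)
    ]
```

Restricting the later color check to `pixels_in_region` keeps large blank scanner margins from diluting the average distance.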
[0096] Additionally, in one embodiment, referring to Figure 4, step S140 of the embodiment shown in Figure 1 further includes, but is not limited to, the following steps:
[0097] Step S410: Binarize the valid data image to obtain a binarized image;
[0098] Step S420: Perform character projection on the binarized image and determine character boundary points based on the projection results;
[0099] Step S430: Based on all character boundary points, segment multiple characters to be recognized from the valid data image;
[0100] Step S440: Perform character recognition on each character to be recognized to obtain the target character.
[0101] It should be noted that since most documents and certificates consist of text characters, OCR recognition can be performed using projection methods. Common projection methods include vertical projection and horizontal projection. These methods utilize the histogram of pixel distribution in a binarized image for analysis, identifying the character boundaries between adjacent characters to segment multiple characters for recognition. Then, simple image recognition is performed on these characters to obtain the target character, thus completing the OCR recognition. The specific implementation process of projection methods is well-known to those skilled in the art and will not be elaborated upon here.
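A minimal sketch of vertical-projection segmentation for a single text line (steps S410 to S430), assuming dark ink on a light background and a fixed binarization threshold of 128. The function names are hypothetical, and the final per-character recognition of step S440 is left out.

```python
def binarize(gray, threshold=128):
    """Map a 2D grayscale image to 1 (ink, darker than threshold) or 0 (background)."""
    return [[1 if value < threshold else 0 for value in row] for row in gray]


def vertical_projection(binary):
    """Number of ink pixels in each column."""
    return [sum(column) for column in zip(*binary)]


def character_spans(projection):
    """Column spans (start, end), end exclusive, of consecutive non-empty columns."""
    spans, start = [], None
    for x, ink in enumerate(projection):
        if ink > 0 and start is None:
            start = x                  # left boundary point of a character
        elif ink == 0 and start is not None:
            spans.append((start, x))   # right boundary point of a character
            start = None
    if start is not None:
        spans.append((start, len(projection)))
    return spans


def segment_characters(gray, threshold=128):
    """Cut the binarized line image into one sub-image per character."""
    binary = binarize(gray, threshold)
    spans = character_spans(vertical_projection(binary))
    return [[row[start:end] for row in binary] for start, end in spans]
```

Empty columns in the projection histogram act as the boundary points between adjacent characters; touching or overlapping glyphs would need a more tolerant rule than a strict zero.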
[0102] Additionally, in one embodiment, the image reference information further includes title position information. Referring to Figure 5, step S150 of the embodiment shown in Figure 1 further includes, but is not limited to, the following steps:
[0103] Step S510: Determine the title area based on the title position information, and determine the remaining area as the content area;
[0104] Step S520: Determine the characters identified from the title area as title characters, and the characters identified from the content area as content characters.
[0105] It should be noted that different supporting documents have different title areas. For example, the title of a common financial statement is usually on the left or top of the document. Since the supporting documents required for underwriting are predictable, the title position information can be pre-set, and the area corresponding to the title position information can be used as the title area for title character recognition. The remaining area can be determined as the content area for content recognition.
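The split in steps S510 and S520 can be sketched as below, assuming the OCR stage yields each character together with its position and that the title position information is a single rectangle `(left, top, right, bottom)`. Both the data layout and the function name are assumptions.

```python
def split_title_and_content(recognized, title_box):
    """Partition recognized characters by whether they fall inside the title area.

    `recognized` is a list of (character, x, y) tuples in reading order.
    """
    left, top, right, bottom = title_box
    title, content = [], []
    for char, x, y in recognized:
        inside = left <= x < right and top <= y < bottom
        (title if inside else content).append(char)
    return "".join(title), "".join(content)
```

Everything outside the preset title rectangle is treated as content, which mirrors the "remaining area" rule in the text.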
[0106] Additionally, in one embodiment, the image reference information further includes data style information, which includes the arrangement of data indicators in the data file image, and the target dictionary includes multiple candidate keywords. Referring to Figure 6, step S160 of the embodiment shown in Figure 1 further includes, but is not limited to, the following steps:
[0107] Step S610: Determine at least one indicator recognition region from the data file image based on the data style information. The indicator recognition region includes an indicator name region and an indicator value region.
[0108] Step S620: Determine the characters in the indicator name region as the characters to be matched, and determine the characters in the indicator value region as the target indicator values, of which there is at least one;
[0109] Step S630: Match the target keyword from the candidate keywords based on the characters to be matched;
[0110] Step S640: Input each target keyword and its corresponding at least one target indicator value into the NLP model to obtain the indicator analysis results;
[0111] Step S650: Determine the data review results based on the analysis results of all indicators.
[0112] It should be noted that different supporting documents arrange their data differently. For example, in asset certificates, each row or column represents a different financial indicator. To better perform semantic recognition, multiple indicator recognition regions can be determined based on the data style information. The characters to be matched are identified in the indicator name region and matched against the candidate keywords to determine the specific indicator type, such as a particular financial indicator in a financial certificate. The values in the corresponding indicator value region are then identified as the values of that financial indicator, from which the indicator analysis results are determined.
[0113] It should be noted that the data style can be the distribution of indicator values and characters to be matched, such as the indicator values being located to the right of the characters to be matched, etc., which can be adjusted according to different data files.
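The arrangement described above can be illustrated with a minimal sketch, assuming one indicator recognition region has already been recognized as a row of text cells and that the data style information states on which side of the name the values lie. The function `split_indicator_row` and its `value_side` parameter are hypothetical.

```python
def split_indicator_row(cells, value_side="right"):
    """Split one recognized row into (characters to be matched, indicator values).

    value_side="right" means the indicator values are located to the
    right of the indicator name; "left" means the opposite arrangement.
    """
    if not cells:
        raise ValueError("an indicator recognition region must contain cells")
    if value_side == "right":
        name, raw_values = cells[0], cells[1:]
    elif value_side == "left":
        name, raw_values = cells[-1], cells[:-1]
    else:
        raise ValueError("unknown value_side: " + value_side)
    # Strip thousands separators before converting to numbers.
    values = [float(v.replace(",", "")) for v in raw_values]
    return name, values


name, values = split_indicator_row(["Net profit", "1,200", "1,500"])
```

Column-oriented documents could be handled by transposing the recognized cells before calling the same function.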
[0114] It should be noted that there can be multiple target indicator values, such as financial data recorded in financial documents over several years. This embodiment does not limit the number of target indicator values. It should also be noted that when there are multiple target indicator values, the indicator analysis result can be determined by calculating the quantitative relationship between two adjacent target indicator values. For example, a year-on-year calculation can be performed based on financial data from several consecutive years to obtain the annual growth rate as the indicator analysis result. The specific calculation method can be adjusted in the NLP model according to the specific indicator type and underwriting requirements.
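As one possible form of the quantitative relationship between adjacent target indicator values, the sketch below computes year-on-year growth rates from values listed in chronological order. The function name `year_on_year_growth` and the handling of a zero base value are assumptions made for illustration, not part of the embodiment.

```python
def year_on_year_growth(values):
    """Return the growth rate between each pair of adjacent yearly values.

    A rate is (current - previous) / |previous|. When the previous value
    is zero the rate is undefined and None is returned for that pair.
    """
    rates = []
    for previous, current in zip(values, values[1:]):
        if previous == 0:
            rates.append(None)
        else:
            rates.append((current - previous) / abs(previous))
    return rates


# Three consecutive years of one financial indicator.
rates = year_on_year_growth([100.0, 120.0, 90.0])
```

With fewer than two target indicator values the function returns an empty list, since no adjacent pair exists.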
[0115] After obtaining the analysis results of multiple indicators, the data review results can be determined through NLP models. For example, the financial risk of users can be determined based on the analysis results of different financial indicators. The specific assessment method can be adjusted according to the type of data document.
[0116] Additionally, in one embodiment, referring to Figure 7, step S630 of the embodiment shown in Figure 6 further includes, but is not limited to, the following steps:
[0117] Step S710: Determine the third Euclidean distance between the character to be matched and the candidate keyword;
[0118] Step S720: When the third Euclidean distance is less than the preset threshold, the corresponding candidate keyword is determined as the target keyword.
[0119] It should be noted that since the documents are usually relatively rigorous, the characters to be matched are typically highly descriptive. The text can be segmented using a forward maximum matching algorithm to identify keywords. After obtaining the text output by the OCR model, the characters to be matched can be converted into semantic vectors. These semantic vectors are then compared with the candidate keywords in the target dictionary by calculating the Euclidean distance between the characters to be matched and each candidate keyword. If the Euclidean distance is less than the preset threshold, the match is considered successful; otherwise, the characters are skipped as an unknown new word. This effectively improves the accuracy of target keyword recognition.
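The two operations mentioned above, forward maximum matching and threshold matching by Euclidean distance, can be sketched as follows. This is a minimal illustration under stated assumptions: the semantic vectors are taken as given (how they are produced is not specified here), the nearest candidate is chosen when several fall below the threshold, and the names `forward_maximum_matching` and `match_keyword` are hypothetical.

```python
import math


def forward_maximum_matching(text, vocabulary):
    """Greedy left-to-right segmentation: at each position take the
    longest vocabulary word; fall back to a single character."""
    max_len = max([len(word) for word in vocabulary] + [1])
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens


def match_keyword(vector, candidates, threshold):
    """Return the candidate keyword whose vector is nearest to `vector`,
    provided the Euclidean distance is less than `threshold`; else None."""
    best_word, best_distance = None, math.inf
    for word, candidate_vector in candidates.items():
        distance = math.dist(vector, candidate_vector)
        if distance < best_distance:
            best_word, best_distance = word, distance
    return best_word if best_distance < threshold else None


tokens = forward_maximum_matching(
    "netprofitmargin", {"net", "netprofit", "profit", "margin"}
)

# Toy two-dimensional vectors standing in for real semantic vectors.
candidates = {"revenue": (1.0, 0.0), "net profit": (0.0, 1.0)}
matched = match_keyword((0.1, 0.9), candidates, threshold=0.5)
unknown = match_keyword((5.0, 5.0), candidates, threshold=0.5)
```

Returning `None` corresponds to skipping the characters as an unknown new word.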
[0120] Additionally, referring to Figure 8, this invention provides a document review device based on OCR and NLP. The document review device 800 includes, but is not limited to, the following units:
[0121] The information acquisition unit 810 is used to acquire the data file image and image reference information of the underwriting business, wherein the image reference information includes a preset reference proportion range;
[0122] The image recognition unit 820 is used to perform image recognition on the data file image to determine the face region and character region of the data file image;
[0123] The image detection unit 830 is used to determine the data file image as a valid data image when the proportion of the face region in the data file image meets the reference proportion range.
[0124] The character recognition unit 840 is used to recognize target characters from the character region of a valid data image using a preset OCR model;
[0125] The dictionary determination unit 850 is used to determine the title character and content character from the target characters, and to determine the target dictionary from the preset candidate dictionary based on the title character;
[0126] The semantic recognition unit 860 is used to input the target dictionary and content characters into a preset NLP model for semantic recognition to obtain the data review results.
[0127] Additionally, referring to Figure 9, an embodiment of the present invention also provides an electronic device 900, which includes: a memory 910, a processor 920, and a computer program stored in the memory 910 and executable on the processor 920.
[0128] The processor 920 and memory 910 can be connected via a bus or other means.
[0129] The non-transitory software programs and instructions required to implement the OCR and NLP-based data review method of the above embodiments are stored in the memory 910. When executed by the processor 920, they perform the OCR and NLP-based data review method of the above embodiments, for example, method steps S110 to S160 in Figure 1, method steps S210 to S250 in Figure 2, method steps S310 to S320 in Figure 3, method steps S410 to S440 in Figure 4, method steps S510 to S520 in Figure 5, method steps S610 to S650 in Figure 6, and method steps S710 to S720 in Figure 7.
[0130] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs.
[0131] Furthermore, one embodiment of the present invention provides a computer-readable storage medium storing a computer program that is executed by a processor or controller, for example, by a processor in the above-described electronic device embodiment, causing the processor to perform the OCR and NLP-based data review method described above, for example, method steps S110 to S160 in Figure 1, method steps S210 to S250 in Figure 2, method steps S310 to S320 in Figure 3, method steps S410 to S440 in Figure 4, method steps S510 to S520 in Figure 5, method steps S610 to S650 in Figure 6, and method steps S710 to S720 in Figure 7. Those skilled in the art will understand that all or some of the steps and apparatus in the methods disclosed above can be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all physical components can be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software can be distributed on a computer-readable storage medium, which can include computer storage media (or non-transitory storage media) and communication storage media (or transitory storage media). As is known to those skilled in the art, the term computer storage medium includes volatile and non-volatile, removable and non-removable storage media implemented in any method or technology for storing information (such as computer-readable instructions, data structures, program modules, or other data).
Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical disc storage, magnetic cartridges, magnetic tape, disk storage or other magnetic storage devices, or any other storage medium that can be used to store desired information and is accessible by a computer. Furthermore, as is known to those skilled in the art, communication storage media typically contain computer-readable instructions, data structures, program modules, or other data in modulated data signals such as carrier waves or other transmission mechanisms, and may include any information delivery storage medium.
[0132] This embodiment can be used in numerous general-purpose or special-purpose computer device environments or configurations. Examples include: personal computers, server computers, handheld or portable electronic devices, tablet-type electronic devices, multiprocessor devices, microprocessor-based devices, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments including any of the above devices or electronic devices, etc. This application can be described in the general context of a computer program executed by a computer, such as a program module. Generally, a program module includes routines, programs, objects, components, data structures, etc., that perform a specific task or implement a specific abstract data type. This application can also be practiced in distributed computing environments where tasks are performed by remote processing electronic devices connected via a communication network. In a distributed computing environment, program modules can reside in local and remote computer storage media, including storage electronic devices.
[0133] The units described in the embodiments of this application can be implemented in software or hardware, and the described units can also be located in a processor. The names of these units do not necessarily limit the specific unit itself.
[0134] It should be noted that although several modules or units for the electronic device used to perform actions have been mentioned in the detailed description above, this division is not mandatory. In fact, according to the embodiments of this application, the features and functions of two or more modules or units described above can be embodied in one module or unit. Conversely, the features and functions of one module or unit described above can be further divided and embodied by multiple modules or units.
[0135] Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein can be implemented by software or by combining software with necessary hardware. Therefore, the technical solutions according to the embodiments of this application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, USB flash drive, external hard drive, etc.) or on a network, including several instructions to cause a computing electronic device (such as a personal computer, server, touch terminal, or network electronic device, etc.) to execute the method according to the embodiments of this application.
[0136] The electronic device in this embodiment may include components such as: radio frequency (RF) circuitry, memory, input unit, display unit, sensor, audio circuitry, wireless fidelity (WiFi) module, processor, and power supply. The RF circuitry can be used for receiving and transmitting signals during information transmission or calls, specifically receiving downlink information from the base station and processing it with the processor; additionally, it can send uplink data to the base station. Typically, the RF circuitry includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low-noise amplifier (LNA), and a duplexer. Furthermore, the RF circuitry can also communicate wirelessly with networks and other devices. The aforementioned wireless communication can use any communication standard or protocol, including but not limited to Global System for Mobile Communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, and Short Messaging Service (SMS). The memory can be used to store software programs and modules. The processor executes various functional applications and data processing of the electronic device by running the software programs and modules stored in the memory. The memory can mainly include a program storage area and a data storage area. The program storage area can store the operating system, applications required for at least one function (such as sound playback function, image playback function, etc.), etc.; the data storage area can store data created according to the use of the electronic device (such as audio data, telephone book, etc.). In addition, the memory can include high-speed random access memory, and can also include non-volatile memory, such as at least one disk storage device, flash memory device, or other volatile solid-state storage device. 
The input unit can be used to receive input numeric or character information, and generate key signal input related to the settings and function control of the electronic device. Specifically, the input unit can include a touch panel and other input devices. The touch panel, also known as a touch screen, can collect touch operations on or near it (such as operations using fingers, styluses, or any suitable object or accessory on or near the touch panel) and drive the corresponding connected devices according to a pre-set program. Optionally, the touch panel can include two parts: a touch detection device and a touch controller. The touch detection device detects the touch location and the signal generated by the touch operation, transmitting the signal to the touch controller. The touch controller receives touch information from the touch detection device, converts it into touch point coordinates, sends it to the processor, and can receive and execute commands from the processor. Furthermore, various types of touch panels can be implemented, including resistive, capacitive, infrared, and surface acoustic wave touch panels. Besides the touch panel, the input unit can also include other input devices. Specifically, other input devices can include, but are not limited to, one or more of physical keyboards, function keys (such as volume control buttons, power buttons, etc.), trackballs, mice, and joysticks. The display unit can be used to display input or provided information and various menus of the electronic device. The display unit can include a display panel, optionally configured as a Liquid Crystal Display (LCD), Organic Light-Emitting Diode (OLED), or similar display panel. Further, the touch panel can cover the display panel. When the touch panel detects a touch operation on or near it, it transmits the information to the processor to determine the type of touch event.
Subsequently, the processor provides corresponding visual output on the display panel based on the type of touch event. The touch panel and display panel are two separate components for implementing input and output functions of the electronic device. However, in some embodiments, the touch panel and display panel can be integrated to achieve the input and output functions of the electronic device. The electronic device may also include at least one sensor, such as a light sensor, motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor. The ambient light sensor can adjust the brightness of the display panel according to the ambient light level, and the proximity sensor can turn off the display panel and / or backlight when the electronic device is moved to the ear. As a type of motion sensor, an accelerometer sensor can detect the magnitude of acceleration in various directions (generally three axes). When stationary, it can detect the magnitude and direction of gravity, and can be used for applications that identify the posture of the electronic device (such as landscape / portrait switching, related games, magnetometer posture calibration), vibration recognition functions (such as pedometers, tapping), etc. Other sensors that may be configured in the electronic device, such as gyroscopes, barometers, hygrometers, thermometers, and infrared sensors, will not be elaborated here. Audio circuitry, speakers, and microphones provide an audio interface. The audio circuit can convert the received audio data into electrical signals and transmit them to the speaker, where the speaker converts them into sound signals for output. On the other hand, the microphone converts the collected sound signals into electrical signals, which are then received by the audio circuit and converted into audio data. 
The audio data is then output to the processor for processing and then sent via the RF circuit to, for example, another electronic device, or output to the memory for further processing.
[0137] Other embodiments of this application will readily occur to those skilled in the art upon consideration of the specification and practice of the embodiments disclosed herein. This application is intended to cover any variations, uses, or adaptations of this application that follow the general principles of this application and include common knowledge or customary techniques in the art not disclosed herein.
[0138] It should be understood that this application is not limited to the precise structure described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope. The scope of this application is limited only by the appended claims.
[0139] The above is a detailed description of the preferred embodiments of the present invention. However, the present invention is not limited to the above embodiments. Those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the present invention. All such equivalent modifications or substitutions are included within the scope defined by the claims of the present invention.
Claims
1. A document review method based on OCR and NLP, characterized in that it comprises:
acquiring a data file image and image reference information of an underwriting business, wherein the image reference information includes a preset reference proportion range;
performing image recognition on the data file image to determine a face region and a character region of the data file image;
when the proportion of the face region in the data file image meets the reference proportion range, determining the data file image to be a valid data image;
identifying target characters from the character region of the valid data image using a preset OCR model;
determining title characters and content characters from the target characters, and determining a target dictionary from preset candidate dictionaries based on the title characters;
inputting the target dictionary and the content characters into a preset NLP model for semantic recognition to obtain a data review result;
wherein the image reference information further includes a reference Euclidean distance, and determining the data file image to be a valid data image further includes:
obtaining the RGB value of each pixel in the data file image;
determining a first Euclidean distance between the RGB value of each pixel and a first RGB reference value, wherein the color corresponding to the first RGB reference value is black;
determining a second Euclidean distance between the RGB value of each pixel and a second RGB reference value, wherein the color corresponding to the second RGB reference value is white;
determining the average Euclidean distance of the data file image based on the first Euclidean distance and the second Euclidean distance of all the pixels;
when the average Euclidean distance of the data file image is less than the reference Euclidean distance, determining the data file image to be the valid data image.
2. The document review method based on OCR and NLP according to claim 1, characterized in that the step of obtaining the RGB value of each pixel in the data file image includes:
performing edge detection on the data file image using a preset edge detection algorithm to determine the image edges;
determining an image detection region based on the image edges, and obtaining the RGB value of each pixel in the image detection region.
3. The document review method based on OCR and NLP according to claim 1, characterized in that the step of identifying the target characters from the character region of the valid data image using a preset OCR model includes:
binarizing the valid data image to obtain a binarized image;
performing character projection on the binarized image, and determining character boundary points based on the character projection results;
segmenting multiple characters to be identified from the valid data image based on all the character boundary points;
performing character recognition on each of the characters to be identified to obtain the target characters.
4. The document review method based on OCR and NLP according to claim 1, characterized in that the image reference information further includes title position information, and the step of determining the title characters and content characters from the target characters includes:
determining a title area based on the title position information, and determining the remaining area as a content area;
determining the characters identified from the title area as the title characters, and determining the characters identified from the content area as the content characters.
5. The document review method based on OCR and NLP according to claim 1, characterized in that the image reference information further includes data style information, which includes the arrangement of data indicators in the data file image, the target dictionary includes multiple candidate keywords, and the step of inputting the target dictionary and the content characters into a preset NLP model for semantic recognition to obtain the data review result includes:
determining at least one indicator recognition region from the data file image based on the data style information, the indicator recognition region including an indicator name region and an indicator value region;
determining the content characters in the indicator name region as the characters to be matched, and determining the content characters in the indicator value region as the target indicator values, wherein there is at least one target indicator value;
matching a target keyword from the candidate keywords based on the characters to be matched;
inputting each target keyword and its corresponding at least one target indicator value into the NLP model to obtain indicator analysis results;
determining the data review result based on all the indicator analysis results.
6. The document review method based on OCR and NLP according to claim 5, characterized in that the step of matching the target keyword from the candidate keywords based on the characters to be matched includes:
determining a third Euclidean distance between the characters to be matched and the candidate keyword;
when the third Euclidean distance is less than a preset threshold, determining the corresponding candidate keyword as the target keyword.
7. A document review device based on OCR and NLP, characterized in that it comprises:
an information acquisition unit, used to acquire a data file image and image reference information of an underwriting business, wherein the image reference information includes a preset reference proportion range;
an image recognition unit, used to perform image recognition on the data file image and determine a face region and a character region of the data file image;
an image detection unit, used to determine the data file image to be a valid data image when the proportion of the face region in the data file image meets the reference proportion range;
a character recognition unit, used to identify target characters from the character region of the valid data image using a preset OCR model;
a dictionary determination unit, used to determine title characters and content characters from the target characters, and to determine a target dictionary from preset candidate dictionaries based on the title characters;
a semantic recognition unit, used to input the target dictionary and the content characters into a preset NLP model for semantic recognition to obtain a data review result;
wherein the image reference information further includes a reference Euclidean distance, and determining the data file image to be a valid data image includes:
obtaining the RGB value of each pixel in the data file image;
determining a first Euclidean distance between the RGB value of each pixel and a first RGB reference value, wherein the color corresponding to the first RGB reference value is black;
determining a second Euclidean distance between the RGB value of each pixel and a second RGB reference value, wherein the color corresponding to the second RGB reference value is white;
determining the average Euclidean distance of the data file image based on the first Euclidean distance and the second Euclidean distance of all the pixels;
when the average Euclidean distance of the data file image is less than the reference Euclidean distance, determining the data file image to be the valid data image.
8. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the OCR and NLP-based document review method according to any one of claims 1 to 6.
9. A computer-readable storage medium storing a computer program, characterized in that the computer program is used to execute the OCR and NLP-based document review method according to any one of claims 1 to 6.