Artificial intelligence-based picture recognition method and device, electronic equipment and medium
By using artificial intelligence methods to flip and correct tilted images, the traditional recognition schemes' reliance on high resolution and upright images is eliminated, achieving high-accuracy recognition of images at any angle.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- PING AN TECH (SHENZHEN) CO LTD
- Filing Date
- 2023-03-24
- Publication Date
- 2026-06-26
AI Technical Summary
Traditional multimodal document recognition solutions require user-uploaded images to be high-resolution and upright, and cannot accurately recognize images at tilted angles.
The system uses artificial intelligence to acquire images for recognition, perform target detection, determine the tilt angle and perform flip correction, error correction, text merging and preprocessing, and extract the recognition results.
It improves the recognition accuracy of images from any angle, enhances the flexibility and accuracy of the recognition results, and meets user needs.
Smart Images

Figure CN116363649B_ABST
Description
Technical Field
[0001] This invention relates to the field of artificial intelligence technology, specifically to an image recognition method, apparatus, electronic device, and medium based on artificial intelligence.
Background Technology
[0002] Natural language processing and computer vision are both important research directions in the field of artificial intelligence. Multimodal document recognition refers to the process of understanding, classifying, extracting, and summarizing information such as text and formatting contained in web pages, images, or scanned documents through artificial intelligence technology.
[0003] Traditional multimodal document recognition solutions require the document information in user-uploaded images to be high-resolution, preferably PDF scans, and require the user to upload the images upright. They cannot accurately detect and recognize objects in tilted images.
Summary of the Invention
[0004] In view of the above, it is necessary to propose an image recognition method, device, electronic device and medium based on artificial intelligence, which improves the image recognition accuracy by flipping the user-uploaded image from any angle.
[0005] A first aspect of the present invention provides an image recognition method based on artificial intelligence, the method comprising:
[0006] In response to a received image recognition request, obtain the image to be recognized;
[0007] The recognition image is subjected to target detection using OCR, and the target detection result for each bounding box is obtained;
[0008] Based on the target detection results of each bounding box, it is determined whether the recognized image has a tilt angle;
[0009] If the identified image has a tilt angle, the identified image is flipped and corrected to obtain the target image;
[0010] The target detection results of each bounding box are corrected to obtain the first text of the corresponding bounding box;
[0011] Multiple bounding boxes are sorted according to the first position information of each bounding box in the target image, and the multiple first texts of the sorted bounding boxes are merged to obtain the second text.
[0012] The second text is preprocessed to obtain the third text;
[0013] The third text is extracted, and the recognition result is returned.
[0014] Optionally, the step of performing flip correction on the identified image to obtain the target image includes:
[0015] Calculate the tangent value of the length direction of each bounding box in the recognized image, and average the tangent values in the length direction to obtain the first tangent value in the length direction;
[0016] Calculate the tangent value in the width direction of each bounding box in the recognized image, and average the tangent values in the width direction to obtain a second tangent value in the width direction;
[0017] Identify the image and obtain the target bounding box corresponding to the title;
[0018] Using the center of the target bounding box as the coordinate point, the recognition image is flipped according to the first tangent value in the length direction;
[0019] The flipped recognition image is corrected based on the second tangent value in the width direction to obtain the target image.
[0020] Optionally, the step of performing error correction processing on the target detection results of each bounding box to obtain the first text of the corresponding bounding box includes:
[0021] The target text in the target detection result of each bounding box is segmented into words to obtain the character encoding of each character;
[0022] Based on the context of the target text, obtain the paragraph code and position code of each character;
[0023] The character encoding, the paragraph encoding, and the position encoding are superimposed to form a first character embedding vector;
[0024] The first character embedding vector is input into a preset error detection model to obtain the error probability of each character;
[0025] The first character embedding vector of each character is weighted and summed with the error probability of the corresponding character. The weighted sum is then input into a preset error correction model to obtain the first text of the corresponding bounding box.
[0026] Optionally, the step of sorting the multiple bounding boxes according to the first position information of each bounding box in the target image, and merging the multiple first texts of the sorted multiple bounding boxes to obtain the second text includes:
[0027] Obtain the second position information of each bounding box in the recognition image;
[0028] Update the second position information of each bounding box to the first position information of each bounding box in the target image;
[0029] Sort the multiple bounding boxes according to the first position information of each bounding box in the target image;
[0030] Based on the first position information of each bounding box, calculate the distance between each bounding box and its adjacent bounding boxes in the target image one by one;
[0031] When the distance between each bounding box and its adjacent bounding box meets the preset distance requirement, the first text in each bounding box is merged with the first text in the corresponding adjacent bounding box to obtain the second text.
[0032] Optionally, the preprocessing of the second text to obtain the third text includes:
[0033] Obtain the identification code of the image being identified;
[0034] The image type of the identified image is obtained based on the identification code;
[0035] If the image type is text, obtain the regular expression corresponding to the identification code; use the regular expression to perform anomaly detection on the second text to obtain the third text;
[0036] If the image type is image, the second text is detected by a DICOM object detector, and the attributes and attribute values of the second text are extracted; the attributes and attribute values are combined into attribute information according to a predetermined format to obtain the third text.
[0037] Optionally, the step of using OCR to perform target detection on the recognition image to obtain the target detection result for each bounding box includes:
[0038] The image to be identified is then subjected to target detection using OCR.
[0039] The position coordinates of each target object are detected, and the corresponding bounding box is drawn based on the position coordinates of each target object;
[0040] The first location information and target text of each bounding box are obtained and determined as the target detection result of the corresponding bounding box.
[0041] Optionally, the step of extracting the third text and returning the recognition result includes:
[0042] Obtain user requirements and the type of user requirements from the image recognition request;
[0043] If the user's requirement type is a user extraction requirement, the third text is extracted based on the extraction conditions of the user extraction requirement, and the recognition result is returned.
[0044] If the user's requirement type is a user profile requirement, perform profile tag extraction on the third text and return the user's profile tags.
[0045] A second aspect of the present invention provides an image recognition device based on artificial intelligence, the device comprising:
[0046] The acquisition module is used to acquire the image to be recognized in response to the received image recognition request;
[0047] The target detection module is used to perform target detection on the recognition image using OCR to obtain the target detection result for each bounding box;
[0048] The judgment module is used to determine whether the recognized image has a tilt angle based on the target detection results of each bounding box;
[0049] The flip correction module is used to flip the recognized image to correct the tilt angle if the recognized image has a tilt angle, so as to obtain the target image.
[0050] The error correction processing module is used to perform error correction processing on the target detection results of each bounding box to obtain the first text of the corresponding bounding box;
[0051] The text merging module is used to sort multiple bounding boxes according to the first position information of each bounding box in the target image, and to merge the multiple first texts of the sorted bounding boxes to obtain the second text.
[0052] The preprocessing module is used to preprocess the second text to obtain the third text;
[0053] The extraction module is used to extract the third text and return the recognition result.
[0054] A third aspect of the present invention provides an electronic device comprising a processor and a memory, wherein the processor is configured to implement the artificial intelligence-based image recognition method by executing a computer program stored in the memory.
[0055] A fourth aspect of the present invention provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the aforementioned artificial intelligence-based image recognition method.
[0056] In summary, the AI-based image recognition method, device, electronic device, and medium described in this invention can promote the construction of smart cities and be applied in fields such as smart buildings, smart security, smart communities, smart living, and the Internet of Things. By using OCR to perform target detection on the image, and based on the target detection results of each bounding box, images with tilt angles are flipped and corrected to obtain the target image. This eliminates the need for users to upload accurate PDF scans and properly aligned images, improving the flexibility of uploaded recognition images. Error correction processing is performed on the target detection results of each bounding box to obtain the first text of the corresponding bounding box. Multiple bounding boxes are sorted according to the first position information of each bounding box in the target image, and the multiple first texts of the sorted bounding boxes are merged to obtain the second text. This avoids the problem of inaccurate subsequent recognition results due to text breaks, improving the accuracy of text recognition results. The second text is preprocessed, and the resulting third text is extracted and the recognition result is returned. During the extraction process, recognition results are extracted according to different requirement types, improving the accuracy of the returned recognition results and user satisfaction.
Attached Figure Description
[0057] Figure 1 is a flowchart of an image recognition method based on artificial intelligence provided in Embodiment 1 of the present invention.
[0058] Figure 2 is a schematic diagram of an image with a tilt angle provided in Embodiment 1 of the present invention.
[0059] Figure 3 is a structural diagram of the image recognition device based on artificial intelligence provided in Embodiment 2 of the present invention.
[0060] Figure 4 is a schematic diagram of the structure of the electronic device provided in Embodiment 3 of the present invention.
Detailed Implementation
[0061] To better understand the above-mentioned objects, features, and advantages of the present invention, the present invention will be described in detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, unless otherwise specified, the embodiments of the present invention and the features thereof can be combined with each other.
[0062] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
[0063] Embodiment 1
[0064] Figure 1 is a flowchart of an image recognition method based on artificial intelligence provided in Embodiment 1 of the present invention.
[0065] In this embodiment, the AI-based image recognition method can be applied to electronic devices. For electronic devices that require AI-based image recognition, the AI-based image recognition function provided by the method of this invention can be directly integrated into the electronic device, or it can run in the electronic device as a software development kit (SDK).
[0066] The embodiments of this invention can acquire and process relevant data based on artificial intelligence technology. Artificial intelligence (AI) refers to the theories, methods, technologies, and application systems that utilize digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results.
[0067] Foundational technologies for artificial intelligence generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operating / interactive systems, and mechatronics. AI software technologies mainly encompass computer vision, robotics, biometrics, speech processing, natural language processing, as well as machine learning and deep learning.
[0068] As shown in Figure 1, the image recognition method based on artificial intelligence specifically includes the following steps. Depending on different needs, the order of the steps in this flowchart can be changed, and some steps can be omitted.
[0069] 101. In response to the received image recognition request, obtain the image to be recognized.
[0070] In this embodiment, in the field of digital healthcare, hospitals, doctors, users, and medical-related companies all need to analyze users' medical records, test results, and reports. Based on the analysis results, they extract the user's specific diseases and physical conditions, and then provide relevant drug recommendations, insurance recommendations, and rehabilitation guidance. For example, users can simply take a photo of their examination report and upload it. The electronic device can recognize the uploaded image and extract all valuable information. This saves time for users, hospitals, and other relevant professionals in reviewing and understanding the information, allowing users to achieve satisfactory diagnosis and follow-up services. In this embodiment, when a user takes a picture of their examination report, they send an image recognition request to the electronic device. When the electronic device receives the image recognition request, it obtains the message information in the request and retrieves the image for recognition based on that message information.
[0071] 102. OCR is used to perform target detection on the recognition image to obtain the target detection result for each bounding box.
[0072] In this embodiment, OCR text recognition refers to the process of scanning text documents, analyzing and processing image files, and obtaining text and layout information.
[0073] In this embodiment, OCR mainly includes one or more of the following components: Image input: different storage and compression methods are used for images of different formats; Binarization: color images are processed, defining the foreground information as black and the background information as white; Noise removal: noise removal is performed using corresponding noise features for different images; Tilt correction: text tilt correction is performed using text recognition software; Layout analysis: document images are segmented; Character segmentation: connected characters are segmented; Character recognition.
[0074] In this embodiment, when detecting objects in an image, a bounding box is typically used to describe the spatial location of the object. For example, the bounding box can be a rectangle determined by the coordinates of its top-left and bottom-right corners, or it can be determined by the coordinates of its center together with the width and height of the box.
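The two bounding-box representations described above carry the same information and can be converted into each other. The following is a minimal sketch for axis-aligned rectangles; the class and function names (`CornerBox`, `CenterBox`, `corner_to_center`, `center_to_corner`) are illustrative assumptions and do not come from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CornerBox:
    """Axis-aligned box given by its top-left and bottom-right corners."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


@dataclass(frozen=True)
class CenterBox:
    """Axis-aligned box given by its center point, width, and height."""
    cx: float
    cy: float
    width: float
    height: float


def corner_to_center(box: CornerBox) -> CenterBox:
    # The center is the midpoint of the two corners; the size is their difference.
    return CenterBox(
        cx=(box.x_min + box.x_max) / 2,
        cy=(box.y_min + box.y_max) / 2,
        width=box.x_max - box.x_min,
        height=box.y_max - box.y_min,
    )


def center_to_corner(box: CenterBox) -> CornerBox:
    # Step half the width/height away from the center in each direction.
    return CornerBox(
        x_min=box.cx - box.width / 2,
        y_min=box.cy - box.height / 2,
        x_max=box.cx + box.width / 2,
        y_max=box.cy + box.height / 2,
    )
```

Note that a tilted bounding box cannot be described by either form alone, which is why the method later records all four corner coordinates of each box.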
[0075] In this embodiment, when using OCR to recognize images, each character is treated as an object, and the spatial position of each recognized character is represented by a bounding box.
[0076] In an optional embodiment, the step of using OCR to perform target detection on the recognition image to obtain the target detection result for each bounding box includes:
[0077] The image to be identified is then subjected to target detection using OCR.
[0078] The position coordinates of each target object are detected, and the corresponding bounding box is drawn based on the position coordinates of each target object;
[0079] The first location information and target text of each bounding box are obtained and determined as the target detection result of the corresponding bounding box.
[0080] In this embodiment, the position coordinates can be represented as the center point coordinates of each target object.
[0081] In this embodiment, each bounding box is assumed to be rectangular, and the first position information of each bounding box includes the coordinates of the top left corner, the bottom left corner, the top right corner, and the bottom right corner of each bounding box.
[0082] 103. Based on the target detection results of each bounding box, determine whether the recognized image has a tilt angle.
[0083] In this embodiment, the tilt angle indicates that the image being recognized is not positioned correctly and is tilted.
[0084] Specifically, the presence of a tilt angle in the recognized image is determined based on the coordinate information of the upper left and upper right corners of each bounding box, or the presence of a tilt angle in the recognized image is determined based on the coordinate information of the lower left and lower right corners of each bounding box.
[0085] In an optional embodiment, determining whether the recognized image has a tilt angle based on the coordinate information of the upper left corner and the upper right corner of each bounding box includes: calculating the tangent value corresponding to the upper left corner or the tangent value corresponding to the upper right corner based on the coordinate information of the upper left corner and the upper right corner of each bounding box; determining whether the tangent value meets the image tilt requirement; if the tangent value meets the image tilt requirement, determining that the recognized image has a tilt angle; if the tangent value does not meet the image tilt requirement, determining that the recognized image does not have a tilt angle.
[0086] In an optional embodiment, determining whether the recognized image has a tilt angle based on the coordinate information of the lower left corner and the lower right corner of each bounding box includes: calculating the tangent value corresponding to the lower left corner or the lower right corner based on the coordinate information of the lower left corner and the lower right corner of each bounding box; determining whether the tangent value meets the image tilt requirement; if the tangent value meets the image tilt requirement, determining that the recognized image has a tilt angle; if the tangent value does not meet the image tilt requirement, determining that the recognized image does not have a tilt angle.
[0087] In this embodiment, the threshold that the tangent value is compared against can be set to 0.2; the specific setting depends on the actual image.
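The tilt check in step 103 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: each bounding box is taken to be a tuple of four corner points in the order top-left, top-right, bottom-right, bottom-left; the tangent is computed from the top edge; the tangents of all boxes are averaged; and "meets the image tilt requirement" is interpreted as the absolute mean tangent exceeding the threshold (0.2 here). The names `edge_tangent` and `has_tilt` are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]
# Assumed corner order: top-left, top-right, bottom-right, bottom-left.
Quad = Tuple[Point, Point, Point, Point]


def edge_tangent(p_left: Point, p_right: Point) -> float:
    """Tangent of the angle between the edge p_left -> p_right and the x axis."""
    dx = p_right[0] - p_left[0]
    dy = p_right[1] - p_left[1]
    if dx == 0:
        # A vertical edge has an unbounded tangent.
        return float("inf")
    return dy / dx


def has_tilt(boxes: List[Quad], threshold: float = 0.2) -> bool:
    """Return True when the mean absolute top-edge tangent exceeds the threshold."""
    if not boxes:
        return False
    tangents = [abs(edge_tangent(box[0], box[1])) for box in boxes]
    mean_tangent = sum(tangents) / len(tangents)
    return mean_tangent > threshold
```

The same check could be run on the bottom edge (bottom-left to bottom-right corners), which corresponds to the alternative described in paragraph [0086].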
[0088] 104. If the identified image has a tilt angle, the identified image is flipped and corrected to obtain the target image.
[0089] In this embodiment, to improve the accuracy of image recognition, images uploaded by users at different tilt angles (as shown in Figure 2) are flipped and corrected to obtain the target image.
[0090] In an optional embodiment, the step of performing flip correction on the identified image to obtain the target image includes:
[0091] Calculate the tangent value of the length direction of each bounding box in the recognized image, and average the tangent values in the length direction to obtain the first tangent value in the length direction;
[0092] Calculate the tangent value in the width direction of each bounding box in the recognized image, and average the tangent values in the width direction to obtain a second tangent value in the width direction;
[0093] Identify the image and obtain the target bounding box corresponding to the title;
[0094] Using the center of the target bounding box as the coordinate point, the recognition image is flipped according to the first tangent value in the length direction;
[0095] The flipped recognition image is corrected based on the second tangent value in the width direction to obtain the target image.
[0096] In this embodiment, as shown in Figure 2, the recognition image is flipped using the center of the target bounding box corresponding to the title (e.g., "X Hospital's triage fee and billing fee") as the coordinate point. Generally, the target bounding box corresponding to the title is located at the top or bottom of the image being recognized. Of the bounding boxes at the top and at the bottom, the one with the larger width is selected as the target bounding box corresponding to the title.
[0097] Specifically, since each bounding box is rectangular, the tangent value of the length direction of each bounding box is calculated, that is, the tangent value of the tilt angle 1, and the average value of the tangent value in the length direction is calculated. The recognition image is then flipped based on the first tangent value in the length direction obtained from the calculation.
[0098] In this embodiment, theoretically, the tilt angles in the length direction and the width direction sum to 90 degrees. By calculating the second tangent value in the width direction of each bounding box, that is, the tangent value of tilt angle 2, the flipped recognition image is corrected based on the second tangent value in the width direction.
[0099] In this embodiment, the image to be recognized is flipped based on the first tangent value in the length direction, and the flipped image is corrected based on the second tangent value in the width direction. This avoids the problem of inaccurate image flipping caused by flipping only in the length or width direction, improves the accuracy of image flipping, and thus improves the accuracy of image recognition.
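The two-pass flip correction can be sketched on bounding-box coordinates alone. The sketch below makes several assumptions that the patent does not spell out: corners are ordered top-left, top-right, bottom-right, bottom-left; the length-direction tangent is taken from the top edge against the x axis; the width-direction tangent is taken from the left edge against the y axis (so that both passes estimate the same residual rotation); and only coordinates are rotated, not pixels. The formulas divide by the edge extent, so tilts of exactly 90 degrees are not handled. All function names are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
# Assumed corner order: top-left, top-right, bottom-right, bottom-left.
Quad = Tuple[Point, Point, Point, Point]


def rotate_point(p: Point, center: Point, angle: float) -> Point:
    """Rotate p about center by angle (radians)."""
    dx, dy = p[0] - center[0], p[1] - center[1]
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return (center[0] + dx * cos_a - dy * sin_a,
            center[1] + dx * sin_a + dy * cos_a)


def rotate_boxes(boxes: List[Quad], center: Point, angle: float) -> List[Quad]:
    return [tuple(rotate_point(p, center, angle) for p in box) for box in boxes]


def box_center(box: Quad) -> Point:
    return (sum(p[0] for p in box) / 4, sum(p[1] for p in box) / 4)


def pick_title_box(boxes: List[Quad]) -> Quad:
    """Of the topmost and bottommost boxes, return the wider one."""
    top = min(boxes, key=lambda b: box_center(b)[1])
    bottom = max(boxes, key=lambda b: box_center(b)[1])
    return max((top, bottom), key=lambda b: math.dist(b[0], b[1]))


def mean_length_tangent(boxes: List[Quad]) -> float:
    """Mean tangent of each top edge against the x axis (first tangent value)."""
    values = [(b[1][1] - b[0][1]) / (b[1][0] - b[0][0]) for b in boxes]
    return sum(values) / len(values)


def mean_width_tangent(boxes: List[Quad]) -> float:
    """Mean tangent of each left edge against the y axis (second tangent value)."""
    values = [-(b[3][0] - b[0][0]) / (b[3][1] - b[0][1]) for b in boxes]
    return sum(values) / len(values)


def flip_correct(boxes: List[Quad]) -> List[Quad]:
    center = box_center(pick_title_box(boxes))
    # Pass 1: undo the tilt estimated from the length direction.
    first_angle = math.atan(mean_length_tangent(boxes))
    flipped = rotate_boxes(boxes, center, -first_angle)
    # Pass 2: remove any residual tilt visible in the width direction.
    second_angle = math.atan(mean_width_tangent(flipped))
    return rotate_boxes(flipped, center, -second_angle)
```

For perfectly rectangular boxes the second pass is a no-op; it only contributes when detection noise makes the two edge directions disagree, which matches the stated motivation for using both directions.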
[0100] 105. Perform error correction processing on the target detection results of each bounding box to obtain the first text of the corresponding bounding box.
[0101] In this embodiment, the error correction process refers to correcting the target text in the target detection result of the bounding box, completing missing words, and correcting erroneous words. For example, if the target text in the target detection result is "Q Central Hospital", the first text of the corresponding bounding box obtained through error correction is "Q City Central Hospital".
[0102] In an optional embodiment, the step of performing error correction processing on the target detection results of each bounding box to obtain the first text of the corresponding bounding box includes:
[0103] The target text in the target detection result of each bounding box is segmented into words to obtain the character encoding of each character;
[0104] Based on the context of the target text, obtain the paragraph code and position code of each character;
[0105] The character encoding, the paragraph encoding, and the position encoding are superimposed to form a first character embedding vector;
[0106] The first character embedding vector is input into a preset error detection model to obtain the error probability of each character;
[0107] The first character embedding vector of each character is weighted and summed with the error probability of the corresponding character. The weighted sum is then input into a preset error correction model to obtain the first text of the corresponding bounding box.
[0108] In this embodiment, the target text is segmented based on a word segmentation algorithm. For example, the word segmentation algorithm can be an NLP word segmentation algorithm, a Chinese word segmentation algorithm, etc.
[0109] In this embodiment, the preset error detection model can be a soft-masked BERT model. The soft-masked BERT model is used to correct errors in the target text in the target detection results. The soft-masked BERT model includes two network models. The first network model is an error detection model, which is used to detect errors, predict the probability that the character at each position is a typo, and then use this probability to perform a soft mask on the character at that position. The soft mask at each position is then input into the error correction model. The second network model is an error correction model, which is used to correct spelling errors.
[0110] In this embodiment, the learning of the soft-masked BERT model is end-to-end. The training set consists of original sentences in a medical corpus and the correct sentences corresponding to those original sentences. The learning process involves optimizing two objectives, corresponding to error detection and error correction, respectively. Then, the two objectives are linearly combined to obtain the first text corresponding to the bounding box.
[0111] In this embodiment, the soft-masked BERT model is trained using text from the medical corpus of the medical system, which improves the robustness of the trained soft-masked BERT model and thus improves the accuracy of text error correction.
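The embedding and soft-masking steps of the error correction stage can be illustrated with plain Python lists. This sketch assumes the weighting follows the published Soft-Masked BERT formulation, in which each character's embedding is blended with a shared mask embedding according to its error probability; the patent itself only states that the embedding is "weighted and summed" with the error probability. The detection and correction networks are not modeled, and `mask_embedding` and all function names are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def build_embedding(char_vec: Sequence[float],
                    paragraph_vec: Sequence[float],
                    position_vec: Sequence[float]) -> Vector:
    """Superimpose character, paragraph, and position encodings element-wise."""
    return [c + s + p for c, s, p in zip(char_vec, paragraph_vec, position_vec)]


def soft_mask(embedding: Sequence[float],
              mask_embedding: Sequence[float],
              error_prob: float) -> Vector:
    """Blend an embedding with the mask embedding: p * e_mask + (1 - p) * e."""
    if not 0.0 <= error_prob <= 1.0:
        raise ValueError("error_prob must lie in [0, 1]")
    return [error_prob * m + (1.0 - error_prob) * e
            for e, m in zip(embedding, mask_embedding)]


def soft_mask_sequence(embeddings: Sequence[Sequence[float]],
                       mask_embedding: Sequence[float],
                       error_probs: Sequence[float]) -> List[Vector]:
    """Apply the soft mask to every character of a text."""
    return [soft_mask(e, mask_embedding, p)
            for e, p in zip(embeddings, error_probs)]
```

A character judged almost certainly correct (probability near 0) passes through nearly unchanged, while a likely typo (probability near 1) is replaced by the mask embedding, so the correction model must predict it from context.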
[0112] 106. Sort the multiple bounding boxes according to the first position information of each bounding box in the target image, and merge the multiple first texts of the sorted multiple bounding boxes to obtain the second text.
[0113] In this embodiment, by flipping the recognition image, the first position information of each bounding box in the recognition image changes. The bounding boxes in the recognition image are then reordered according to the first position information of each bounding box in the target image obtained after the flip. The second text is obtained by merging the multiple first texts of the reordered bounding boxes.
In an optional embodiment, the step of sorting the multiple bounding boxes according to the first position information of each bounding box in the target image and merging the multiple first texts of the sorted bounding boxes to obtain the second text includes:
[0114] Obtain the second position information of each bounding box in the recognition image;
[0115] Update the second position information of each bounding box to the first position information of each bounding box in the target image;
[0116] Sort the multiple bounding boxes according to the first position information of each bounding box in the target image;
[0117] Based on the first position information of each bounding box, calculate the distance between each bounding box and its adjacent bounding boxes in the target image one by one;
[0118] When the distance between each bounding box and its adjacent bounding box meets the preset distance requirement, the first text in each bounding box is merged with the first text in the corresponding adjacent bounding box to obtain the second text.
[0119] Furthermore, the method also includes:
[0120] When the distance between each bounding box and its adjacent bounding box does not meet the preset distance requirement, the first text in each bounding box is determined as the second text.
[0121] In this embodiment, a distance requirement can be preset. The preset distance requirement can be preset based on historical experience. By judging whether the preset distance requirement is met, the split text in the two bounding boxes that meet the preset distance requirement is merged and integrated according to the judgment result. The merged second text is a text with complete semantics, which avoids the problem of inaccurate recognition results returned later due to text segmentation and improves the accuracy of text recognition results.
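The sort-and-merge step can be sketched as below. The patent does not define the sort key, which box counts as "adjacent", or how the distance is measured, so this sketch assumes: boxes are axis-aligned after flip correction; reading order is top-to-bottom, then left-to-right; the adjacent box is the next one in that order; and the distance is measured from the middle of one box's right edge to the middle of the next box's left edge. `TextBox`, `gap`, `sort_and_merge`, and the `max_gap` value are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class TextBox:
    """Corrected first text plus its position in the target image."""
    text: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def gap(a: TextBox, b: TextBox) -> float:
    """Distance from the middle of a's right edge to the middle of b's left edge."""
    a_right = (a.x_max, (a.y_min + a.y_max) / 2)
    b_left = (b.x_min, (b.y_min + b.y_max) / 2)
    return math.dist(a_right, b_left)


def sort_and_merge(boxes: List[TextBox], max_gap: float = 15.0) -> List[str]:
    """Order boxes in reading order and join texts whose boxes are close enough."""
    ordered = sorted(boxes, key=lambda b: (b.y_min, b.x_min))
    merged: List[str] = []
    previous = None
    for box in ordered:
        if previous is not None and gap(previous, box) <= max_gap:
            # Close enough: treat as a continuation of the previous text.
            merged[-1] += box.text
        else:
            # Too far apart (or first box): the text stands on its own.
            merged.append(box.text)
        previous = box
    return merged
```

In practice the sort key would likely need a row tolerance, since boxes on the same line rarely share an identical top coordinate; that refinement is omitted here for brevity.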
[0122] 107. Preprocess the second text to obtain the third text.
[0123] In this embodiment, the preprocessing refers to extracting content from the merged second text, tagging it, etc., to obtain the third text.
[0124] In an optional embodiment, the preprocessing of the second text to obtain the third text includes:
[0125] Obtain the identification code of the image being identified;
[0126] The image type of the identified image is obtained based on the identification code;
[0127] If the image type is text, obtain the regular expression corresponding to the identification code; use the regular expression to perform anomaly detection on the second text to obtain the third text;
[0128] If the image type is image, the second text is detected by a DICOM object detector, and the attributes and attribute values of the second text are extracted; the attributes and attribute values are combined into attribute information according to a predetermined format to obtain the third text.
[0129] In this embodiment, different image types correspond to different preprocessing strategies. When the image type is text, the anomalies and their specific attributes are extracted from the second text, which is the third text. When the image type is image, the second text is detected using a DICOM object detector, and the attributes and attribute values of the second text are extracted. For example, when a user uploads a CT scan report, the DICOM object detector is used to detect the CT scan report, and certain nodules and their corresponding attribute information are extracted from the target detection results.
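The type-dependent preprocessing can be sketched as a simple dispatch. Everything concrete in this example is an assumption made for illustration: the identification codes (`LAB_REPORT`, `CT_REPORT`), the mapping from code to image type, the regular expression (which flags values marked `H` or `L`), and the output formats. The DICOM object detector is not implemented; it is represented by a caller-supplied function `attribute_extractor` that returns attributes and their values.

```python
import re
from typing import Callable, Dict, Optional

# Hypothetical identification codes and their image types.
IMAGE_TYPES: Dict[str, str] = {"LAB_REPORT": "text", "CT_REPORT": "image"}

# Hypothetical per-code pattern: an item name, a numeric value, and an H/L flag.
ANOMALY_PATTERNS: Dict[str, "re.Pattern[str]"] = {
    "LAB_REPORT": re.compile(
        r"(?P<item>[A-Za-z]+(?: [A-Za-z]+)*)\s+"
        r"(?P<value>\d+(?:\.\d+)?)\s*(?P<flag>[HL])\b"
    ),
}


def preprocess(second_text: str,
               identification_code: str,
               attribute_extractor: Optional[Callable[[str], Dict[str, str]]] = None
               ) -> str:
    """Turn the merged second text into the third text, depending on image type."""
    image_type = IMAGE_TYPES[identification_code]
    if image_type == "text":
        # Keep only the entries that the pattern marks as abnormal.
        pattern = ANOMALY_PATTERNS[identification_code]
        return "; ".join(
            f"{m['item']}={m['value']}({m['flag']})"
            for m in pattern.finditer(second_text)
        )
    if attribute_extractor is None:
        raise ValueError("an attribute extractor is required for image-type input")
    # Stand-in for the DICOM object detector: attributes and their values.
    attributes = attribute_extractor(second_text)
    return "; ".join(f"{name}: {value}" for name, value in attributes.items())
```

Keeping the detector behind a callable means the text branch can be exercised without any imaging dependency, and the predetermined output format lives in one place.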
[0130] 108. Extract the third text and return the recognition result.
[0131] In this embodiment, different users have different requirements for the returned recognition results, and the recognition results can be returned according to specific business needs.
[0132] In an optional embodiment, the extraction of the third text and the return of the recognition result includes:
[0133] Obtain user requirements and the type of user requirements from the image recognition request;
[0134] If the user's requirement type is a user extraction requirement, the third text is extracted based on the extraction conditions of the user extraction requirement, and the recognition result is returned.
[0135] If the user's requirement type is a user profile requirement, perform profile tag extraction on the third text and return the user's profile tags.
[0136] In this embodiment, the user requirements include user extraction requirements and user profile requirements. A user extraction requirement means that the user provides extraction conditions and the third text is extracted based on those conditions. A user profile requirement means identifying user groups, extracting profile tags from the third text based on those user groups, and returning the profile tags associated with each user.
[0137] In this embodiment, by extracting recognition results from the preprocessed third text according to different demand types, the accuracy of the returned recognition results and user satisfaction are improved.
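The branching on requirement type can be sketched as a small dispatcher. This is only an illustration under assumptions: the request layout, the field names (`requirement`, `type`, `conditions`, `tag_rules`), and the representation of the third text as a list of attribute records are all hypothetical.

```python
def extract_result(request, third_text):
    """Dispatch on the requirement type carried by the recognition request."""
    requirement = request["requirement"]
    if requirement["type"] == "extraction":
        # Keep only the records whose attribute is named in the conditions.
        wanted = set(requirement["conditions"])
        return [rec for rec in third_text if rec["attribute"] in wanted]
    if requirement["type"] == "profile":
        # Map records to profile tags using a caller-supplied rule table.
        rules = requirement["tag_rules"]
        tags = {rules[rec["attribute"]] for rec in third_text
                if rec["attribute"] in rules}
        return sorted(tags)
    raise ValueError(f"unknown requirement type: {requirement['type']}")

third_text = [
    {"attribute": "nodule size", "value": "6 mm"},
    {"attribute": "density", "value": "solid"},
]
extracted = extract_result(
    {"requirement": {"type": "extraction", "conditions": ["density"]}},
    third_text,
)
tags = extract_result(
    {"requirement": {"type": "profile",
                     "tag_rules": {"nodule size": "follow-up"}}},
    third_text,
)
```

Raising on an unknown type is a design choice of the sketch; the patent only describes the two requirement types.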
[0138] In this embodiment, target detection is performed on the recognition image using OCR to obtain the target detection result for each bounding box. Based on the target detection result of each bounding box, it is determined whether the recognition image has a tilt angle. If the recognition image has a tilt angle, it is flipped and corrected to obtain the target image. At the same time, error correction processing is performed on the target detection result of each bounding box, and the resulting texts are merged. The merged text is preprocessed and extracted to obtain the recognition result. Users can therefore upload recognition images at any angle, and the electronic device detects and flips them automatically. This solves the problem in the prior art that accurate recognition requires users to upload accurate PDF scans and upright images, thereby improving image recognition accuracy.
[0139] 109. If the identified image does not have a tilt angle, the identified image is not flipped for correction.
[0140] In summary, the AI-based image recognition method described in this embodiment uses OCR to detect targets in the image. Based on the target detection results of each bounding box, it flips and corrects tilted images to obtain the target image. This eliminates the need for users to upload accurate PDF scans and properly aligned images, improving the flexibility of uploaded images. By correcting the target detection results of each bounding box, the first text of the corresponding bounding box is obtained. Multiple bounding boxes are sorted according to their first position information in the target image, and the first texts of the sorted bounding boxes are merged to obtain the second text. This avoids inaccurate recognition results due to text breaks, improving the accuracy of text recognition. The second text is preprocessed, and the resulting third text is extracted and the recognition result is returned. During the extraction process, recognition results are extracted according to different requirement types, improving the accuracy of the returned recognition results and user satisfaction.
[0141] Example 2
[0142] Figure 3 is a structural diagram of the image recognition device based on artificial intelligence provided in Embodiment 2 of the present invention.
[0143] In some embodiments, the AI-based image recognition device 20 may include multiple functional modules composed of program code segments. The program code of each program segment in the AI-based image recognition device 20 may be stored in the memory of the electronic device and executed by the at least one processor to perform the artificial intelligence-based image recognition function (see the description of Figure 1 for details).
[0144] In this embodiment, the AI-based image recognition device 20 can be divided into multiple functional modules according to its functions. These functional modules may include: an acquisition module 201, a target detection module 202, a judgment module 203, a flip correction module 204, an error correction processing module 205, a text merging module 206, a preprocessing module 207, and an extraction module 208. The module referred to in this invention is a series of computer-readable instruction segments that can be executed by at least one processor and perform a fixed function, stored in memory. In this embodiment, the functions of each module will be detailed in subsequent embodiments.
[0145] The acquisition module 201 is used to acquire the image to be recognized in response to the received image recognition request.
[0146] The target detection module 202 is used to perform target detection on the recognition image using OCR to obtain the target detection result for each bounding box.
[0147] The judgment module 203 is used to determine whether the recognized image has a tilt angle based on the target detection results of each bounding box.
[0148] The flip correction module 204 is used to flip the recognition image to obtain the target image if the recognition image has a tilt angle.
[0149] The error correction processing module 205 is used to perform error correction processing on the target detection results of each bounding box to obtain the first text of the corresponding bounding box.
[0150] The text merging module 206 is used to sort multiple bounding boxes according to the first position information of each bounding box in the target image, and to merge the multiple first texts of the sorted multiple bounding boxes to obtain the second text.
[0151] The preprocessing module 207 is used to preprocess the second text to obtain the third text.
[0152] Extraction module 208 is used to extract the third text and return the recognition result.
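The data flow between modules 201 to 208 can be sketched as a single function that calls each step in order. The step names in the dictionary and the stub callables are hypothetical placeholders, not part of the patent; real implementations would wrap OCR and the error detection and correction models.

```python
def recognize(request, m):
    """Run the eight modules in the order given in the description.

    `m` maps hypothetical step names to callables; none of these names
    come from the patent.
    """
    image = m["acquire"](request)                        # module 201
    detections = m["detect"](image)                      # module 202
    if m["has_tilt"](detections):                        # module 203
        image = m["flip_correct"](image, detections)     # module 204
    first_texts = [m["correct"](d) for d in detections]  # module 205
    second_text = m["merge"](detections, first_texts)    # module 206
    third_text = m["preprocess"](second_text)            # module 207
    return m["extract"](third_text, request)             # module 208

# Stub callables standing in for real models, only to show the data flow.
stubs = {
    "acquire": lambda req: req["image"],
    "detect": lambda img: [{"text": "helo"}, {"text": "world"}],
    "has_tilt": lambda dets: False,
    "flip_correct": lambda img, dets: img,
    "correct": lambda det: det["text"].replace("helo", "hello"),
    "merge": lambda dets, texts: " ".join(texts),
    "preprocess": lambda text: text.upper(),
    "extract": lambda text, req: {"result": text},
}
result = recognize({"image": "raw-bytes"}, stubs)
```

Passing the steps in as callables keeps the sketch independent of any particular OCR or model library.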
[0153] In an optional embodiment, the target detection module 202 is used to: perform target detection on the recognition image using OCR; detect the position coordinates of each target object, draw a corresponding bounding box based on the position coordinates of each target object; and obtain the first position information of each bounding box and the target text to determine the target detection result of the corresponding bounding box.
[0154] In an optional embodiment, the flip correction module 204 is configured to: calculate the tangent value of the length direction of each bounding box of the recognition image, and average the tangent values in the length direction to obtain a first tangent value in the length direction; calculate the tangent value of the width direction of each bounding box of the recognition image, and average the tangent values in the width direction to obtain a second tangent value in the width direction; recognize the recognition image and obtain the target bounding box corresponding to the title; flip the recognition image based on the center of the target bounding box as the coordinate point and the first tangent value in the length direction; and correct the flipped recognition image based on the second tangent value in the width direction to obtain the target image.
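A minimal sketch of the tangent averaging and title-box selection, under stated assumptions: each bounding box is given as four corners ordered top-left, top-right, bottom-right, bottom-left; the "tangent of the length direction" is read as the slope of the top edge; and the top and bottom regions used for the title (as recited in the claims) are taken to be the outer 20% bands of the image. None of these conventions are fixed by the patent.

```python
import math

def mean_length_tangent(boxes):
    """Average slope (tangent) of the top edge over all boxes."""
    tangents = []
    for (x0, y0), (x1, y1), _, _ in boxes:
        if x1 != x0:  # skip vertical edges, whose tangent is undefined
            tangents.append((y1 - y0) / (x1 - x0))
    return sum(tangents) / len(tangents) if tangents else 0.0

def tilt_angle_degrees(boxes):
    """Convert the averaged tangent into a tilt angle in degrees."""
    return math.degrees(math.atan(mean_length_tangent(boxes)))

def pick_title_box(boxes, image_height, band=0.2):
    """Pick the widest box whose center lies in the top or bottom band."""
    def center_y(box):
        return sum(y for _, y in box) / 4.0

    def width(box):
        (x0, y0), (x1, y1) = box[0], box[1]
        return math.hypot(x1 - x0, y1 - y0)

    candidates = [
        b for b in boxes
        if center_y(b) <= band * image_height
        or center_y(b) >= (1 - band) * image_height
    ]
    return max(candidates, key=width) if candidates else None

boxes = [
    [(0, 0), (100, 10), (98, 30), (-2, 20)],
    [(0, 40), (200, 60), (198, 80), (-2, 60)],
]
angle = tilt_angle_degrees(boxes)       # both top edges have slope 0.1
title = pick_title_box(boxes, image_height=400)
```

In this reading, the image would then be rotated by `-angle` about the center of `title`; the subsequent correction using the width-direction tangent is analogous and is omitted here.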
[0155] In an optional embodiment, the error correction processing module 205 is configured to: perform word segmentation on the target text in the target detection result of each bounding box to obtain the character encoding of each character; obtain the paragraph encoding and position encoding of each character according to the context relationship of the target text; superimpose the character encoding, the paragraph encoding and the position encoding to form a first character embedding vector; input the first character embedding vector into a preset error detection model to obtain the error probability of each character; perform a weighted summation of the first character embedding vector of each character and the error probability of the corresponding character, and input the weighted summation result into a preset error correction model to obtain the first text of the corresponding bounding box.
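The two vector operations in this step can be sketched in plain Python. Superimposing the encodings is read here as an element-wise sum. The patent does not spell out the form of the weighted summation; the sketch assumes a soft-masking style blend in which each embedding is mixed with a mask embedding in proportion to its error probability. The mask embedding and the function names are assumptions.

```python
def superimpose(char_enc, para_enc, pos_enc):
    """Element-wise sum of the three encodings of one character."""
    return [c + s + p for c, s, p in zip(char_enc, para_enc, pos_enc)]

def soft_mask(embedding, error_prob, mask_embedding):
    """Blend an embedding with a mask embedding, weighted by error probability.

    error_prob = 0 keeps the embedding unchanged; error_prob = 1 replaces it
    with the mask embedding. This blending rule is an assumption.
    """
    return [
        error_prob * m + (1.0 - error_prob) * e
        for e, m in zip(embedding, mask_embedding)
    ]

embedding = superimpose([0.5, 1.0], [0.25, 0.0], [0.25, 1.0])
blended = soft_mask(embedding, 0.25, [0.0, 0.0])
```

The blended vectors would then be passed to the error correction model, so that characters judged likely to be wrong contribute less of their original embedding.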
[0156] In an optional embodiment, the text merging module 206 is configured to: obtain second position information of each bounding box in the recognition image; update the second position information of each bounding box to the first position information of each bounding box in the target image; sort the multiple bounding boxes according to the first position information of each bounding box in the target image; calculate the distance between each bounding box in the target image and its adjacent bounding boxes according to the first position information of each bounding box; and when the distance between each bounding box and its adjacent bounding boxes meets a preset distance requirement, merge the first text in each bounding box with the first text in the corresponding adjacent bounding box to obtain the second text.
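Sorting and distance-based merging can be sketched as follows, assuming axis-aligned boxes in the target image, a reading order of top-to-bottom then left-to-right, and a "preset distance requirement" expressed as a maximum horizontal gap within the same row. The `Box` type and the thresholds `max_gap` and `row_tol` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Box:
    x: float   # left edge in the target image
    y: float   # top edge in the target image
    w: float   # width
    h: float   # height
    text: str  # first text (after error correction)

def merge_boxes(boxes, max_gap=15.0, row_tol=10.0):
    """Sort boxes in reading order and join neighbours that sit close together."""
    # Bucket boxes into coarse rows, then order left to right within a row.
    ordered = sorted(boxes, key=lambda b: (round(b.y / row_tol), b.x))
    merged = []
    prev = None
    for box in ordered:
        same_row = prev is not None and abs(box.y - prev.y) <= row_tol
        close = same_row and box.x - (prev.x + prev.w) <= max_gap
        if close:
            merged[-1] += box.text   # continue the current text run
        else:
            merged.append(box.text)  # start a new text run
        prev = box
    return merged

boxes = [
    Box(120, 0, 80, 20, "pressure"),
    Box(0, 2, 110, 20, "Blood "),
    Box(0, 40, 60, 20, "Normal"),
]
second_text = "\n".join(merge_boxes(boxes))
```

The coarse row bucketing is a simplification: boxes whose tops fall on either side of a bucket boundary may be ordered incorrectly, so a production implementation would likely cluster rows more carefully.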
[0157] In an optional embodiment, the preprocessing module 207 is configured to: obtain the identification code of the identified image; obtain the image type of the identified image based on the identification code; if the image type is text, obtain the regular expression corresponding to the identification code; use the regular expression to perform anomaly detection on the second text to obtain the third text; if the image type is an image, use a DICOM object detector to detect the second text, and extract the attributes and attribute values of the second text; combine the attributes and attribute values in a predetermined format to form the attribute information to obtain the third text.
[0158] In an optional embodiment, the extraction module 208 is configured to: obtain user requirements and the type of user requirements from the image recognition request; if the type of user requirements is user extraction requirements, extract the third text based on the extraction conditions of the user extraction requirements and return the recognition result; if the type of user requirements is user profile requirements, extract profile tags from the third text and return the user's profile tags.
[0159] In this embodiment, target detection is performed on the recognition image using OCR to obtain the target detection result for each bounding box. Based on the target detection result of each bounding box, it is determined whether the recognition image has a tilt angle. If the recognition image has a tilt angle, it is flipped and corrected to obtain the target image. At the same time, error correction processing is performed on the target detection result of each bounding box, and the resulting texts are merged. The merged text is preprocessed and extracted to obtain the recognition result. Users can therefore upload recognition images at any angle, and the electronic device detects and flips them automatically. This solves the problem in the prior art that accurate recognition requires users to upload accurate PDF scans and upright images, thereby improving image recognition accuracy.
[0160] In summary, the AI-based image recognition device described in this embodiment performs target detection on the image using OCR. Based on the target detection results of each bounding box, it flips and corrects tilted images to obtain the target image. This eliminates the need for users to upload accurate PDF scans and properly aligned images, improving the flexibility of uploaded images. By correcting the target detection results of each bounding box, the device obtains the first text for that bounding box. Multiple bounding boxes are sorted according to their first position information in the target image, and the first texts of the sorted bounding boxes are merged to obtain the second text. This avoids inaccurate recognition results due to text breaks, improving the accuracy of text recognition. The second text is preprocessed, and the resulting third text is extracted before returning the recognition result. During the extraction process, recognition results are extracted according to different requirement types, improving the accuracy of the returned recognition results and user satisfaction.
[0161] Example 3
[0162] Figure 4 is a structural schematic diagram of the electronic device provided in Embodiment 3 of the present invention. In a preferred embodiment of the present invention, the electronic device 3 includes a memory 31, at least one processor 32, at least one communication bus 33, and a transceiver 34.
[0163] Those skilled in the art should understand that the structure of the electronic device shown in Figure 4 does not constitute a limitation of the embodiments of the present invention. It can be a bus structure or a star structure. The electronic device 3 may also include more or fewer hardware or software components than shown, or a different arrangement of components.
[0164] In some embodiments, the electronic device 3 is an electronic device capable of automatically performing numerical calculations and / or information processing according to pre-set or stored instructions. Its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital processors, and embedded devices. The electronic device 3 may also include client devices, including, but not limited to, any electronic product capable of human-computer interaction with a client via a keyboard, mouse, remote control, touchpad, or voice control device, such as personal computers, tablet computers, smartphones, and digital cameras.
[0165] It should be noted that the electronic device 3 is merely an example. Other existing or future electronic products that are suitable for this invention should also be included within the scope of protection of this invention and are incorporated herein by reference.
[0166] In some embodiments, the memory 31 is used to store program code and various data, such as an AI-based image recognition device 20 installed in the electronic device 3, and to achieve high-speed and automatic access to programs or data during the operation of the electronic device 3. The memory 31 includes read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), one-time programmable read-only memory (OTPROM), electrically erasable programmable read-only memory (EEPROM), compact disc read-only memory (CD-ROM) or other optical disc storage, disk storage, magnetic tape storage, or any other computer-readable medium capable of carrying or storing data.
[0167] In some embodiments, the at least one processor 32 may be composed of integrated circuits, such as a single packaged integrated circuit or multiple integrated circuits packaged with the same or different functions, including combinations of one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, and various control chips. The at least one processor 32 is the control unit of the electronic device 3, connecting various components of the entire electronic device 3 via various interfaces and lines. It executes programs or modules stored in the memory 31 and calls data stored in the memory 31 to perform various functions and process data of the electronic device 3.
[0168] In some embodiments, the at least one communication bus 33 is configured to enable communication between the memory 31 and the at least one processor 32, etc.
[0169] Although not shown, the electronic device 3 may also include a power supply (such as a battery) to power the various components. Optionally, the power supply may be logically connected to the at least one processor 32 via a power management device, thereby enabling functions such as charging, discharging, and power consumption management. The power supply may also include one or more DC or AC power supplies, recharging devices, power fault detection circuits, power converters or inverters, power status indicators, and other arbitrary components. The electronic device 3 may also include various sensors, Bluetooth modules, Wi-Fi modules, etc., which will not be described in detail here.
[0170] It should be understood that the embodiments described are for illustrative purposes only, and the scope of the patent application is not limited by this structure.
[0171] The integrated unit implemented as a software functional module described above can be stored in a computer-readable storage medium. This software functional module, stored in a storage medium, includes several instructions to cause a computer device (which may be a personal computer, electronic device, or network device, etc.) or processor to execute portions of the methods described in the various embodiments of the present invention.
[0172] In a further embodiment, with reference to Figure 3, the at least one processor 32 can execute the operating device of the electronic device 3 and various installed applications (such as the artificial intelligence-based image recognition device 20), program code, etc., for example, the various modules mentioned above.
[0173] The memory 31 stores program code, and the at least one processor 32 can call the program code stored in the memory 31 to execute related functions. For example, the modules described in Figure 3 are program code stored in the memory 31 and executed by the at least one processor 32, thereby realizing the functions of the modules to achieve the purpose of image recognition based on artificial intelligence.
[0174] For example, the program code can be divided into one or more modules / units, which are stored in the memory 31 and executed by the processor 32 to complete this application. The one or more modules / units can be a series of computer-readable instruction segments capable of performing specific functions, which describe the execution process of the program code in the electronic device 3. For example, the program code can be divided into an acquisition module 201, a target detection module 202, a judgment module 203, a flip correction module 204, an error correction processing module 205, a text merging module 206, a preprocessing module 207, and an extraction module 208.
[0175] In one embodiment of the present invention, the memory 31 stores a plurality of computer-readable instructions, which are executed by the at least one processor 32 to realize an artificial intelligence-based image recognition function.
[0176] Specifically, for the implementation of the above instructions by the at least one processor 32, refer to the descriptions of the relevant steps in the embodiments corresponding to Figure 1 and Figure 2, which are not repeated here.
[0177] In the several embodiments provided by this invention, it should be understood that the disclosed apparatus and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of modules is only a logical functional division, and other division methods may be used in actual implementation.
[0178] The modules described as separate components may or may not be physically separate. The components shown as modules may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs.
[0179] Furthermore, the functional modules in the various embodiments of the present invention can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or in the form of hardware plus software functional modules.
[0180] It will be apparent to those skilled in the art that the present invention is not limited to the details of the exemplary embodiments described above, and that the invention can be implemented in other specific forms without departing from the spirit or essential characteristics of the invention. Therefore, the embodiments should be considered illustrative and non-limiting in all respects, and the scope of the invention is defined by the appended claims rather than the foregoing description. Thus, all variations falling within the meaning and scope of equivalents of the claims are intended to be embraced within the present invention. No reference numerals in the claims should be construed as limiting the scope of the claims. Furthermore, it is clear that the word "comprising" does not exclude other elements, and the singular does not exclude the plural. Multiple elements or devices recited in the present invention may also be implemented by a single element or device in software or hardware. The terms "first," "second," etc., are used to denote names and do not indicate any particular order.
[0181] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.
Claims
1. An image recognition method based on artificial intelligence, characterized in that the method includes: in response to a received image recognition request, obtaining the recognition image; performing target detection on the recognition image using OCR to obtain the target detection result for each bounding box; determining, based on the target detection result of each bounding box, whether the recognition image has a tilt angle; if the recognition image has a tilt angle, flipping and correcting the recognition image to obtain the target image, including: calculating the tangent value of the length direction of each bounding box of the recognition image, and averaging the tangent values in the length direction to obtain a first tangent value in the length direction; calculating the tangent value of the width direction of each bounding box of the recognition image, and averaging the tangent values in the width direction to obtain a second tangent value in the width direction; identifying the bounding boxes in the top and bottom regions of the recognition image as first bounding boxes, and selecting the bounding box with the largest width from the first bounding boxes as the target bounding box corresponding to the title; using the center of the target bounding box as the coordinate point, flipping the recognition image according to the first tangent value in the length direction; and correcting the flipped recognition image based on the second tangent value in the width direction to obtain the target image; performing error correction processing on the target detection result of each bounding box to obtain the first text of the corresponding bounding box; sorting multiple bounding boxes according to the first position information of each bounding box in the target image, and merging the multiple first texts of the sorted bounding boxes to obtain the second text; preprocessing the second text to obtain the third text; and extracting the third text and returning the recognition result.
2. The image recognition method based on artificial intelligence as described in claim 1, characterized in that, The step of performing error correction processing on the target detection results of each bounding box to obtain the first text of the corresponding bounding box includes: The target text in the target detection result of each bounding box is segmented into words to obtain the character encoding of each character; Based on the context of the target text, obtain the paragraph code and position code of each character; The character encoding, the paragraph encoding, and the position encoding are superimposed to form a first character embedding vector; The first character embedding vector is input into a preset error detection model to obtain the error probability of each character; The first character embedding vector of each character is weighted and summed with the error probability of the corresponding character. The weighted sum is then input into a preset error correction model to obtain the first text of the corresponding bounding box.
3. The image recognition method based on artificial intelligence as described in claim 1, characterized in that, The process of sorting multiple bounding boxes according to the first position information of each bounding box in the target image, and merging the multiple first texts of the sorted bounding boxes to obtain the second text includes: Obtain the second position information of each bounding box in the recognition image; Update the second position information of each bounding box to the first position information of each bounding box in the target image; Sort the multiple bounding boxes according to the first position information of each bounding box in the target image; Based on the first position information of each bounding box, calculate the distance between each bounding box and its adjacent bounding boxes in the target image one by one; When the distance between each bounding box and its adjacent bounding box meets the preset distance requirement, the first text in each bounding box is merged with the first text in the corresponding adjacent bounding box to obtain the second text.
4. The image recognition method based on artificial intelligence as described in claim 1, characterized in that, The preprocessing of the second text to obtain the third text includes: Obtain the identification code of the image being identified; The image type of the identified image is obtained based on the identification code; If the image type is text, obtain the regular expression corresponding to the identification code; The second text is subjected to anomaly detection using the regular expression to obtain the third text; If the image type is image, the second text is detected using a DICOM object detector, and the attributes and attribute values of the second text are extracted; The attributes and attribute values are combined into attribute information according to a predetermined format to obtain the third text.
5. The image recognition method based on artificial intelligence as described in claim 1, characterized in that, The method of using OCR to perform target detection on the recognition image to obtain the target detection results for each bounding box includes: The image to be identified is then subjected to target detection using OCR. The position coordinates of each target object are detected, and the corresponding bounding box is drawn based on the position coordinates of each target object; The first location information and target text of each bounding box are obtained and determined as the target detection result of the corresponding bounding box.
6. The image recognition method based on artificial intelligence as described in claim 1, characterized in that, The extraction of the third text and the return of the recognition result include: Obtain user requirements and the type of user requirements from the image recognition request; If the user's requirement type is a user extraction requirement, the third text is extracted based on the extraction conditions of the user extraction requirement, and the recognition result is returned. If the user's requirement type is a user profile requirement, perform profile tag extraction on the third text and return the user's profile tags.
7. An image recognition device based on artificial intelligence, characterized in that the device includes: an acquisition module, used to acquire the recognition image in response to a received image recognition request; a target detection module, used to perform target detection on the recognition image using OCR to obtain the target detection result for each bounding box; a judgment module, used to determine whether the recognition image has a tilt angle based on the target detection result of each bounding box; a flip correction module, used to flip and correct the recognition image to obtain the target image if the recognition image has a tilt angle, including: calculating the tangent value of the length direction of each bounding box in the recognition image and averaging the tangent values in the length direction to obtain a first tangent value in the length direction; calculating the tangent value of the width direction of each bounding box in the recognition image and averaging the tangent values in the width direction to obtain a second tangent value in the width direction; identifying the bounding boxes in the top and bottom regions of the recognition image as first bounding boxes, and selecting the bounding box with the largest width from the first bounding boxes as the target bounding box corresponding to the title; flipping the recognition image based on the first tangent value in the length direction, using the center of the target bounding box as the coordinate point; and correcting the flipped recognition image based on the second tangent value in the width direction to obtain the target image; an error correction processing module, used to perform error correction processing on the target detection result of each bounding box to obtain the first text of the corresponding bounding box; a text merging module, used to sort multiple bounding boxes according to the first position information of each bounding box in the target image, and to merge the multiple first texts of the sorted bounding boxes to obtain the second text; a preprocessing module, used to preprocess the second text to obtain the third text; and an extraction module, used to extract the third text and return the recognition result.
8. An electronic device, characterized in that, The electronic device includes a processor and a memory, wherein the processor is used to execute a computer program stored in the memory to implement the artificial intelligence-based image recognition method as described in any one of claims 1 to 6.
9. A computer-readable storage medium storing a computer program thereon, characterized in that, When the computer program is executed by the processor, it implements the artificial intelligence-based image recognition method as described in any one of claims 1 to 6.