Image problem solving method and device, electronic equipment, storage medium and program product
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHINA UNITED NETWORK COMM GRP CO LTD
- Filing Date
- 2024-12-17
- Publication Date
- 2026-06-26
Figure CN122290133A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of computer technology, and in particular to an image problem-solving method, apparatus, electronic device, storage medium, and program product.
Background Technology
[0002] With the rapid development of the education sector and continuous technological advancements, problem-solving services have gradually transformed from an auxiliary tool into an indispensable core component of the teaching process. These services not only play a vital role in classroom teaching but are also widely used in student self-study and online education platforms. As problem-solving services become more widespread, the public's demands for their accuracy and precision are also constantly increasing.
[0003] Currently, users typically first take a picture of the question using a mobile phone or other device, then send the image to the service. The service uses Optical Character Recognition (OCR) technology to recognize the text information in the image and searches its question bank for a target question that matches the text information, along with the corresponding answer. However, in some scenarios, the matched questions may be inaccurate, resulting in the inability to obtain the correct answer.
Summary of the Invention
[0004] This application provides an image-based problem-solving method, apparatus, electronic device, storage medium, and program product to solve the technical problem that the matched questions are inaccurate during image-based problem-solving, thus making it impossible to obtain the correct answer.
[0005] In a first aspect, this application provides an image-based problem-solving method, including:
[0006] Get the image of the question;
[0007] Based on the image of the question, determine the coordinates of the vertices of the quadrilaterals in the question area of the image, and then determine the perspective transformation matrix corresponding to the coordinates of the quadrilateral vertices.
[0008] Based on the perspective transformation matrix, the question area is interpolated to obtain the target image;
[0009] Based on the target image, text recognition and formula recognition are performed on the question area to determine the target question contained in the target image and the coordinates of the target question;
[0010] Input the target question into the preset large language model, obtain the corresponding answer to the target question, and display the target question and the corresponding answer based on coordinates.
[0011] Optionally, based on the coordinates of the quadrilateral vertices, the perspective transformation matrix corresponding to the coordinates of the quadrilateral vertices is determined, including:
[0012] The coordinates of the quadrilateral vertices are subjected to rectangular correction to obtain the target height and target width;
[0013] Determine the coordinates of the rectangle's vertices based on the target's height and width;
[0014] Determine the perspective transformation matrix based on the coordinates of the quadrilateral vertices and the rectangle vertices.
[0015] Optionally, based on the problem image, determine the coordinates of the vertices of the quadrilaterals within the problem area in the problem image, including:
[0016] Based on the first object detection model, target recognition is performed on the question image to obtain the coordinates of the quadrilateral vertices in the question region.
[0017] Optionally, based on the target image, text recognition and formula recognition are performed on the question region to determine the target question contained in the target image and the coordinates corresponding to the target question, including:
[0018] Based on the second object detection model, text recognition and formula recognition are performed on the question area respectively to obtain the area label and rectangle coordinates corresponding to the text, or to obtain the area label and rectangle coordinates corresponding to the text and the area label and rectangle coordinates corresponding to the formula.
[0019] Based on the region label and rectangle coordinates, the region containing the text, or the region containing the text and the region containing the formula, is clipped to obtain the text region, or the text region and the formula region.
[0020] The text region and the formula region are scaled separately to obtain the target text image and the target formula image;
[0021] Based on the target text image and the target formula image, determine the target question and the coordinates corresponding to the target question.
[0022] Optionally, based on the target text image and the target formula image, the target question and its corresponding coordinates are determined, including:
[0023] Based on preset text recognizers and formula recognizers, the target text image and target formula image are recognized accordingly to obtain the text and formula;
[0024] Based on the coordinates of the rectangles, the text and formulas are sorted to obtain the target sequence, which is obtained by arranging the rectangle coordinates in order from top to bottom and from left to right.
[0025] Traverse the target sequence to determine the target question and its corresponding coordinates.
[0026] Optionally, traverse the target sequence to determine the target question and its corresponding coordinates, including:
[0027] If the starting position of the text or formula in the target sequence is the question number, and the difference between the x-coordinate of the top left corner of the starting position and the x-coordinate of the top left corner of the question area meets the preset threshold requirement, the question area is segmented, and the text and formula in the target sequence are classified into the target question.
[0028] Otherwise, until the next question number is found, classify the text and formulas in the target sequence into the target question.
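One possible reading of this sorting and traversal is sketched below in Python. The regular expression for question numbers, the dictionary layout of each recognized item, the name `group_questions`, and the interpretation of the threshold requirement as "the horizontal offset from the left edge of the question area is at most `max_offset` pixels" are all assumptions made for illustration, not details stated in this application.

```python
import re

# Assumed numbering style: digits followed by ".", "、", ")" or "）".
QUESTION_NUMBER = re.compile(r"^\s*\d+\s*[.、)）]")


def group_questions(items, area_left_x, max_offset=20):
    """Group recognized text/formula items into questions.

    items: list of dicts with "content" (str) and "box" (x1, y1, x2, y2).
    area_left_x: x-coordinate of the top-left corner of the question area.
    max_offset: hypothetical preset threshold, in pixels.
    """
    # Target sequence: top to bottom, then left to right.
    ordered = sorted(items, key=lambda it: (it["box"][1], it["box"][0]))
    questions = []
    for item in ordered:
        x1 = item["box"][0]
        starts_question = (
            QUESTION_NUMBER.match(item["content"]) is not None
            and abs(x1 - area_left_x) <= max_offset
        )
        # Assumption: items seen before any question number also open a group.
        if starts_question or not questions:
            questions.append({"parts": [], "box": list(item["box"])})
        current = questions[-1]
        current["parts"].append(item["content"])
        # Grow the question's bounding box to cover the new item.
        bx, ib = current["box"], item["box"]
        current["box"] = [min(bx[0], ib[0]), min(bx[1], ib[1]),
                          max(bx[2], ib[2]), max(bx[3], ib[3])]
    return questions


items = [
    {"content": "2. Compute the area", "box": (12, 100, 200, 120)},
    {"content": "1. Solve", "box": (10, 10, 100, 30)},
    {"content": "x^2 = 4", "box": (105, 12, 180, 30)},
    {"content": "2.5 m long", "box": (80, 50, 170, 70)},
]
questions = group_questions(items, area_left_x=10)
```

In this example the item "2.5 m long" matches the number pattern but is indented well beyond the threshold, so it stays with the first question instead of opening a new one.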
[0029] In a second aspect, this application provides an image-based problem-solving apparatus, comprising:
[0030] The acquisition module is used to acquire the image of the question.
[0031] The determination module is used to determine the coordinates of the quadrilateral vertices in the question area of the question image based on the question image, and to determine the perspective transformation matrix corresponding to the quadrilateral vertex coordinates based on the quadrilateral vertex coordinates;
[0032] The processing module is used to perform interpolation processing on the question area based on the perspective transformation matrix to obtain the target image;
[0033] The recognition module is used to perform text recognition and formula recognition on the question area based on the target image, so as to determine the target question contained in the target image and the coordinates of the target question;
[0034] The module is used to input the target question into a preset large language model, obtain the answer corresponding to the target question, and display the target question and the answer corresponding to the target question based on coordinates.
[0035] Optionally, the determination module is specifically used for:
[0036] The coordinates of the quadrilateral vertices are subjected to rectangular correction to obtain the target height and target width;
[0037] Determine the coordinates of the rectangle's vertices based on the target's height and width;
[0038] Determine the perspective transformation matrix based on the coordinates of the quadrilateral vertices and the rectangle vertices.
[0039] Optionally, the determination module is specifically used for:
[0040] Based on the first object detection model, target recognition is performed on the question image to obtain the coordinates of the quadrilateral vertices in the question region.
[0041] Optionally, the recognition module is specifically used for:
[0042] Based on the second object detection model, text recognition and formula recognition are performed on the question area respectively to obtain the area label and rectangle coordinates corresponding to the text, or to obtain the area label and rectangle coordinates corresponding to the text and the area label and rectangle coordinates corresponding to the formula.
[0043] Based on the region label and rectangle coordinates, the region containing the text, or the region containing the text and the region containing the formula, is clipped to obtain the text region, or the text region and the formula region.
[0044] The text region and the formula region are scaled separately to obtain the target text image and the target formula image;
[0045] Based on the target text image and the target formula image, determine the target question and the coordinates corresponding to the target question.
[0046] Optionally, the recognition module is also used for:
[0047] Based on preset text recognizers and formula recognizers, the target text image and target formula image are recognized accordingly to obtain the text and formula;
[0048] Based on the coordinates of the rectangles, the text and formulas are sorted to obtain the target sequence, which is obtained by arranging the rectangle coordinates in order from top to bottom and from left to right.
[0049] Traverse the target sequence to determine the target question and its corresponding coordinates.
[0050] Optionally, the recognition module is also used for:
[0051] If the starting position of the text or formula in the target sequence is the question number, and the difference between the x-coordinate of the top left corner of the starting position and the x-coordinate of the top left corner of the question area meets the preset threshold requirement, the question area is segmented, and the text and formula in the target sequence are classified into the target question.
[0052] Otherwise, until the next question number is found, classify the text and formulas in the target sequence into the target question.
[0053] In a third aspect, this application provides an electronic device, including: a processor, and a memory communicatively connected to the processor;
[0054] The memory stores computer-executable instructions;
[0055] The processor executes the computer-executable instructions stored in the memory to implement the method described in the first aspect above and the various possible implementations of the first aspect.
[0056] In a fourth aspect, this application provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the method described in the first aspect and the various possible implementations of the first aspect.
[0057] In a fifth aspect, this application provides a computer program product, including a computer program which, when executed by a processor, implements the method described in the first aspect and the various possible implementations of the first aspect.
[0058] The image-based problem-solving method, apparatus, electronic device, storage medium, and program product provided in this application acquire a problem image, determine the coordinates of the quadrilateral vertices of the problem area within the problem image, and then determine the corresponding perspective transformation matrix. Based on this, the problem area is interpolated according to the perspective transformation matrix to obtain a target image, thereby correcting perspective distortion in the image and making the problem area clearer. Furthermore, text recognition and formula recognition are performed on the problem area based on the target image, which helps improve the accuracy of text and formula recognition and ensures that the content of the target problem extracted from the image is more reliable. Finally, the recognized target problem is input into a preset large language model to obtain the corresponding answer. Through the natural language understanding and reasoning capabilities of the large language model, a more accurate answer is provided. Moreover, displaying the target problem and its corresponding answer based on coordinates allows users to intuitively see the correspondence between the problem and the answer, improving the user experience.
Attached Figure Description
[0059] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this application and, together with the description, serve to explain the principles of this application.
[0060] Figure 1 is a flowchart of the image-based problem-solving method provided in an embodiment of this application;
[0061] Figure 2 is a flowchart of the image-based problem-solving method provided in an embodiment of this application;
[0062] Figure 3 is a flowchart of the image-based problem-solving method provided in an embodiment of this application;
[0063] Figure 4 is a schematic diagram of the segmentation of the question area provided in an embodiment of this application;
[0064] Figure 5 is another schematic diagram of the segmentation of the question area provided in an embodiment of this application;
[0065] Figure 6 is a schematic diagram of the image-based problem-solving apparatus provided in an embodiment of this application;
[0066] Figure 7 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.
[0067] The accompanying drawings illustrate specific embodiments of this application, which will be described in more detail below. These drawings and descriptions are not intended to limit the scope of the concept in any way, but rather to illustrate the concepts of this application to those skilled in the art through reference to particular embodiments.
Detailed Implementation
[0068] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numbers in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this application. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this application as detailed in the appended claims.
[0069] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties. Furthermore, the collection, use and processing of the relevant data must comply with relevant laws, regulations and standards, and corresponding operation entry points are provided for users to choose to authorize or refuse.
[0070] In existing technologies, users typically use their mobile phones or other devices to take images containing the questions in natural environments. Therefore, these images may be affected by factors such as lighting and angle. The images are then uploaded to a problem-solving service platform, which processes and analyzes them. OCR technology is then used to analyze the text information in the images, converting the visual information into computer-processable text data. Based on this text information, the system searches a pre-built question bank—a database containing numerous questions and their answers. The search goal is to find a target question that matches the user's text information. Once a match is found, the system extracts the corresponding answer and returns it to the user via a mobile application or web interface. However, in some scenarios, such as when the image is distorted during photography, or when the searched question is not included in the question bank, this method of problem-solving can easily lead to incorrect results. For example, even with the "chicken and rabbit in the same cage" problem, slightly modifying the numerical values in the question or replacing "chicken" and "rabbit" with other animals can cause problems in the matching stage, resulting in inaccurate matching and ultimately, incorrect answers.
[0071] Based on this, this application proposes an image-based problem-solving method, apparatus, electronic device, storage medium, and program product. This solution, based on considerations of problem image processing, first analyzes the image to identify the problem region after acquiring it. This region is typically represented as a quadrilateral, and the coordinates of its four vertices can be determined. Then, the corresponding perspective transformation matrix is calculated based on these coordinates, and the perspective transformation matrix is used to interpolate the problem region, generating a perspective-corrected target image. The problem region is adjusted to a frontal view to correct perspective distortion, making the problem region appear as if it were photographed from the front. Text recognition and formula recognition are performed on the target image to determine the specific problem content (i.e., the target problem) and its position coordinates within the image. Subsequently, to prevent insufficient problem storage in the question bank, the identified target problem can be input into a pre-set large language model for processing to obtain the corresponding answer. This improves both problem recognition accuracy and answer accuracy, simplifies the user operation process, significantly enhances the user experience, and makes the image-based problem-solving process more convenient and reliable.
[0072] The technical solution of this application and how the technical solution of this application solves the above-mentioned technical problems are described in detail below with specific embodiments. These specific embodiments can be combined with each other, and the same or similar concepts or processes may not be described again in some embodiments. The embodiments of this application will now be described with reference to the accompanying drawings.
[0073] Figure 1 is a flowchart of the image-based problem-solving method provided in an embodiment of this application. As shown in Figure 1, the method may include:
[0074] S101. Obtain the image of the question.
[0075] The execution subject of this application embodiment is an electronic device, which can be a terminal device, such as a laptop, desktop computer, or tablet computer, or a server. In practical applications, whether the electronic device is a terminal device or a server can be determined according to the actual situation, and no specific limitation is imposed on it.
[0076] The question image refers to an image containing the question that the user captures or obtains through a device, such as a mobile phone or other camera.
[0077] In this embodiment, the user takes a picture and uploads an image containing the question to the problem-solving service platform as input for the problem-solving process.
[0078] S102. Based on the image of the question, determine the coordinates of the vertices of the quadrilaterals in the question area of the image, and determine the perspective transformation matrix corresponding to the coordinates of the quadrilateral vertices.
[0079] Here, the quadrilateral vertex coordinates refer to the coordinates of the four corner points that describe the question area. These coordinates define the shape and position of the question area.
[0080] A perspective transformation matrix is a mathematical matrix used to transform an image from one perspective viewpoint to another. In this embodiment, it can be used to map points in a region from the original coordinate system to the target coordinate system, thereby correcting perspective distortion in the image.
[0081] It should be noted that the question image may be distorted due to tilting or rotation. By recognizing the question image, the predicted coordinates of the quadrilateral vertices can be output. These points constitute the four corners of the question area in the question image. Usually, they will form a quadrilateral, but it may not be a rectangle because there may be perspective distortion (i.e., the edges are tilted due to different viewing angles).
[0082] In this embodiment, the object detection model, such as Faster-RCNN, YOLO, SSD, etc., can be used to identify and predict the question image. In addition to outputting the object category and bounding box, it is also necessary to regress the coordinates of the quadrilateral vertices of each question area to correct for perspective changes in the question area.
[0083] Furthermore, in order to transform the quadrilateral region of the original question image into a rectangular region, it is necessary to define the corrected rectangular vertex coordinates. The perspective transformation matrix is then calculated from the quadrilateral vertex coordinates and the rectangular vertex coordinates: the two sets of coordinates are used to construct a system of linear equations, and solving that system by the least squares method yields the perspective transformation matrix.
[0084] As an example, perspective transformation calculates the new position of each point using the following formula:
[0085] $$\begin{bmatrix} x_i' \\ y_i' \\ 1 \end{bmatrix} \sim M \begin{bmatrix} x_i \\ y_i \\ 1 \end{bmatrix}, \qquad M = \begin{bmatrix} h_1 & h_2 & h_3 \\ h_4 & h_5 & h_6 \\ h_7 & h_8 & h_9 \end{bmatrix}$$
[0086] Here, $(x_i, y_i)$ is any one of the quadrilateral vertex coordinates, $(x_i', y_i')$ is the corresponding rectangle vertex coordinate after the transformation, $\sim$ denotes equality up to a nonzero scale factor in homogeneous coordinates, and $M$ is the perspective transformation matrix, a 3×3 matrix with elements $h_1, h_2, \ldots, h_9$.
[0087] It should be noted that perspective transformation is a non-linear transformation, but the above relationship in homogeneous coordinates turns it into a linear problem in the matrix elements. That is, for each pair of corresponding points $(x_i, y_i)$ and $(x_i', y_i')$, two equations can be obtained:
[0088] For $x_i'$: $h_1 x_i + h_2 y_i + h_3 = x_i'(h_7 x_i + h_8 y_i + h_9)$
[0089] For $y_i'$: $h_4 x_i + h_5 y_i + h_6 = y_i'(h_7 x_i + h_8 y_i + h_9)$
[0090] Therefore, based on the four sets of corresponding vertices, eight equations can be obtained, which in turn form a system of linear equations. This system of linear equations can be solved by the least squares method to obtain the matrix elements, namely the perspective transformation matrix M. The solution process is a conventional technique and will not be described in detail here.
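A minimal sketch of this solve is given below, written in plain Python so that it runs without third-party packages. It assumes that h9 is normalized to 1 (the matrix is only defined up to scale), which turns the eight equations into a square 8×8 system; for four non-degenerate point pairs its exact solution coincides with the least-squares solution. The function names `solve_linear_system`, `perspective_matrix`, and `apply_perspective`, and the sample coordinates, are illustrative and do not come from this application.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations act on both sides at once.
    aug = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def perspective_matrix(quad, rect):
    """Return the 3x3 matrix M that maps each quad vertex to its rect vertex.

    Assumption: h9 is fixed to 1, valid whenever the true h9 is nonzero.
    """
    a, b = [], []
    for (x, y), (xp, yp) in zip(quad, rect):
        # h1*x + h2*y + h3 - xp*(h7*x + h8*y) = xp
        a.append([x, y, 1, 0, 0, 0, -xp * x, -xp * y])
        b.append(xp)
        # h4*x + h5*y + h6 - yp*(h7*x + h8*y) = yp
        a.append([0, 0, 0, x, y, 1, -yp * x, -yp * y])
        b.append(yp)
    h = solve_linear_system(a, b)
    return [h[0:3], h[3:6], h[6:8] + [1.0]]


def apply_perspective(m, point):
    """Map a point through M, dividing by the homogeneous coordinate."""
    x, y = point
    d = m[2][0] * x + m[2][1] * y + m[2][2]
    return ((m[0][0] * x + m[0][1] * y + m[0][2]) / d,
            (m[1][0] * x + m[1][1] * y + m[1][2]) / d)


quad = [(10, 20), (110, 30), (120, 90), (5, 80)]  # hypothetical detected vertices
rect = [(0, 0), (100, 0), (100, 60), (0, 60)]     # corrected rectangle vertices
M = perspective_matrix(quad, rect)
```

In practice a library least-squares or homography routine would typically replace the hand-written elimination; the explicit version is used here only to make the two equations per point pair visible.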
[0091] S103. Based on the perspective transformation matrix, interpolate the question area to obtain the target image.
[0092] Interpolation refers to filling in the pixel values of the target image during image transformation to obtain a smoother target image.
[0093] The target image can refer to an image that has undergone perspective correction and interpolation processing, used for further text and formula recognition.
[0094] In this embodiment, after obtaining the perspective transformation matrix, the question region in the original question image can be perspective transformed into a rectangular region in the target image. Since each pixel of the image is mapped to the target coordinate system, an interpolation algorithm can be used to calculate the value of each pixel in the target image. Optionally, the question region can be interpolated using methods such as nearest neighbor interpolation, bilinear interpolation, and bicubic interpolation; this embodiment does not impose any limitations on this. Thus, through interpolation, each pixel in the target image can be filled, ensuring that the transformed image maintains as high an image quality as possible.
[0095] Understandably, if the question area in the question image is distorted due to issues such as shooting angle, perspective transformation can "straighten" the area into a rectangle, thereby removing the distortion, restoring the true proportions of the question image, and facilitating subsequent recognition and analysis.
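The interpolation step can be illustrated with a small sketch that uses bilinear interpolation, one of the options named above. It assumes a grayscale image stored as a list of rows and a matrix `inv_m` that maps target pixel coordinates back to source coordinates (that is, the inverse of the matrix M that maps source to target). The names `bilinear_sample` and `warp_perspective` are hypothetical, and the inverse-mapping strategy is a common way to fill every target pixel rather than something this application specifies.

```python
def bilinear_sample(image, x, y):
    """Sample a grayscale image (list of rows) at a fractional position."""
    h, w = len(image), len(image[0])
    # Clamp so the 2x2 neighbourhood stays inside the image.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    # Blend horizontally on the two rows, then vertically between them.
    top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
    bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def warp_perspective(image, inv_m, width, height):
    """Fill a width x height target image by inverse mapping.

    inv_m is assumed to map target (u, v) back to source (x, y).
    """
    out = []
    for v in range(height):
        row = []
        for u in range(width):
            d = inv_m[2][0] * u + inv_m[2][1] * v + inv_m[2][2]
            sx = (inv_m[0][0] * u + inv_m[0][1] * v + inv_m[0][2]) / d
            sy = (inv_m[1][0] * u + inv_m[1][1] * v + inv_m[1][2]) / d
            row.append(bilinear_sample(image, sx, sy))
        out.append(row)
    return out


source = [[0, 10], [20, 30]]
# Hypothetical inverse matrix: the target is the source enlarged by 2x.
half_scale = [[0.5, 0, 0], [0, 0.5, 0], [0, 0, 1]]
target = warp_perspective(source, half_scale, 3, 3)
```

Iterating over target pixels and sampling the source guarantees that no target pixel is left unfilled, which is the purpose of interpolation described above.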
[0096] S104. Based on the target image, perform text recognition and formula recognition on the question area to determine the target question contained in the target image and the coordinates corresponding to the target question.
[0097] In this embodiment, deep learning models, such as Faster-RCNN, YOLO, and SSD, can be used to identify and predict the question region, outputting classification labels and bounding box coordinates. The "formula" label represents a formula, and the "text" label represents text. The bounding box coordinates are in the form (x1, y1) and (x2, y2), representing the top-left and bottom-right corner coordinates of the question region, respectively. Then, for targets labeled "text," a text recognizer is used for identification; for targets labeled "formula," a formula recognizer is used. The identification results of the text and formula in the question region are obtained, thereby determining the target question and its corresponding coordinates.
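The routing of detected regions to the two recognizers might look like the following sketch. The detection format (a dict with `label` and `box`), the helper names `crop` and `recognize_regions`, and the stub recognizers are assumptions for illustration only; real text and formula recognizers would replace the stubs, and the scaling of cropped regions is omitted.

```python
def crop(image, box):
    """Cut out the pixels inside box = (x1, y1, x2, y2) from a list-of-rows image."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]


def recognize_regions(image, detections, recognizers):
    """Send each detected region to the recognizer registered for its label.

    detections: list of dicts with "label" ("text" or "formula") and "box".
    recognizers: dict mapping a label to a callable that takes a cropped region.
    """
    results = []
    for det in detections:
        recognizer = recognizers[det["label"]]
        region = crop(image, det["box"])
        results.append({
            "label": det["label"],
            "box": det["box"],
            "content": recognizer(region),
        })
    return results


# Stub recognizers that only report the region size (placeholders).
recognizers = {
    "text": lambda region: f"text:{len(region[0])}x{len(region)}",
    "formula": lambda region: f"formula:{len(region[0])}x{len(region)}",
}
image = [[r * 10 + c for c in range(6)] for r in range(4)]
detections = [
    {"label": "text", "box": (0, 0, 3, 2)},
    {"label": "formula", "box": (2, 1, 6, 4)},
]
recognized = recognize_regions(image, detections, recognizers)
```

Keeping the box alongside the recognized content is what later allows each question to be associated with its coordinates.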
[0098] S105. Input the target question into the preset large language model, obtain the answer corresponding to the target question, and display the target question and the answer corresponding to the target question based on coordinates.
[0099] Here, a large language model is an artificial intelligence model that can understand and generate natural language text in order to answer questions and provide information, such as LLama, Qwen, or ChatGLM. This embodiment does not limit the type of large language model.
[0100] In some embodiments, prompt phrases, such as "Please solve the following math problem" or "Please provide detailed solution steps," may optionally be added to the input of the large language model to improve the user experience.
[0101] For example, in the case of a math problem, adding "Please explain the solution process in detail" can help the model generate the calculation process for each step. In the case of other types of problems, prompts can help the model generate more specific answers, rather than just brief results.
[0102] In this embodiment, the answer generated by the large language model can be a complete problem-solving process (such as the derivation process of a mathematical problem) or a direct and concise answer (such as the correct answer to a multiple-choice question). Finally, the system will combine the text content of each target question (including text, formulas, coordinates, etc.) with the solution results generated by the large language model and integrate them. The answers are usually listed in the order of the questions or displayed with detailed steps of the problem-solving process.
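A hedged sketch of how the prompt phrase, the model call, and the coordinates could be combined is shown below. The callable `ask_model` stands in for whatever large language model interface is used and is purely hypothetical, as are the function names and the dictionary layout.

```python
def build_prompt(question_text, instruction=None):
    """Prepend an optional prompt phrase to the recognized question text."""
    if instruction:
        return f"{instruction}\n{question_text}"
    return question_text


def solve_questions(questions, ask_model, instruction=None):
    """Ask the model about each question and keep the answer with its box.

    questions: list of dicts with "text" and "box" (x1, y1, x2, y2).
    ask_model: hypothetical callable taking a prompt string, returning a string.
    """
    results = []
    for q in questions:
        answer = ask_model(build_prompt(q["text"], instruction))
        results.append({"text": q["text"], "box": q["box"], "answer": answer})
    # List answers in reading order: top to bottom, then left to right.
    results.sort(key=lambda r: (r["box"][1], r["box"][0]))
    return results


def fake_model(prompt):
    """Placeholder model that echoes the last line of the prompt."""
    return "answer to: " + prompt.splitlines()[-1]


questions = [
    {"text": "2. x + 1 = 3", "box": (10, 100, 200, 120)},
    {"text": "1. 1 + 1 = ?", "box": (10, 10, 200, 30)},
]
results = solve_questions(
    questions, fake_model,
    instruction="Please explain the solution process in detail.",
)
```

Because each result carries its box, a display layer can place every answer next to the question it belongs to.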
[0103] Understandably, using a large language model to automatically answer questions solves the tasks of automating test paper grading and answering questions, preventing errors caused by insufficient questions in the question bank. At the same time, the system not only generates answers, but also combines the location coordinates of the questions to associate the answer to each question with its location, thereby providing more structured and easier-to-understand results.
[0104] The image-based problem-solving method provided in this application accurately extracts the area where the problem is located by identifying the coordinates of the quadrilateral vertices of the problem area and applying a perspective transformation matrix. This overcomes the influence of shooting angle deviation, making the extraction of the problem area more accurate. Perspective transformation and interpolation processing can optimize the image quality of the problem area, making subsequent text recognition and formula recognition clearer and more accurate, avoiding recognition errors caused by image distortion or blurring. Then, by performing text recognition and formula recognition on the target image, the problem content and the mathematical formulas contained therein can be automatically extracted, realizing automated problem-solving. Finally, by using a large language model to analyze the problem and generate the answer, the correct solution can be provided quickly, improving problem-solving efficiency. At the same time, combined with the identified coordinate information, the problem and answer can be accurately displayed in the image, ensuring that the visual display matches the content. Thus, efficient and accurate automated problem-solving is achieved, improving the user experience.
[0105] Figure 2 is a flowchart of the image-based problem-solving method provided in an embodiment of this application. As shown in Figure 2, this embodiment builds on the embodiment of Figure 1 and describes in detail the process in S102 of determining the perspective transformation matrix from the quadrilateral vertex coordinates, specifically including the following steps:
[0106] S201. Perform rectangular correction processing on the coordinates of the quadrilateral vertices to obtain the target height and target width.
[0107] Here, the target height and target width refer to the height and width of the rectangular area obtained after rectangular correction processing. These parameters define the size of the corrected rectangle.
[0108] In this embodiment, the rectangle correction process refers to performing perspective transformation correction on the predicted quadrilateral vertex coordinates (x1, y1), (x2, y2), (x3, y3), and (x4, y4) to determine the rectangle's width w (target width) and height h (target height).
[0109] Specifically,
[0110]
[0111] Understandably, calculating the average coordinates of the four vertices using the above formula ensures that the target height and width are processed on a uniform scale, which helps to maintain consistent recognition results across different devices and resolutions.
[0112] S202. Determine the coordinates of the rectangle vertices based on the target height and target width.
[0113] Among them, the vertex coordinates of the rectangle refer to the coordinates of the four corner points of the rectangular area, that is, the standardized coordinates obtained after correction processing, which can be used for perspective transformation.
[0114] In this embodiment, based on the target height and target width, the vertex coordinates of the rectangle can be obtained as (0, 0), (w, 0), (w, h), and (0, h). The target coordinate system in which the rectangle is located takes the zero point as the origin of the coordinate axis, and the irregular quadrilateral area is adjusted into a regular rectangle to facilitate subsequent image processing and analysis.
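Steps S201 and S202 can be sketched as follows. The exact correction formula is not reproduced in the text above, so the averaging rule used here (mean length of the two opposite edges) is only one plausible reading, and the vertex order top-left, top-right, bottom-right, bottom-left is likewise an assumption. The names `target_size` and `rectangle_vertices` are illustrative.

```python
import math


def target_size(quad):
    """Estimate the corrected rectangle size from four quadrilateral vertices.

    Assumption: quad is ordered top-left, top-right, bottom-right, bottom-left,
    and each dimension is the average of the two opposite edge lengths.
    """
    tl, tr, br, bl = quad
    width = (math.dist(tl, tr) + math.dist(bl, br)) / 2
    height = (math.dist(tl, bl) + math.dist(tr, br)) / 2
    return round(width), round(height)


def rectangle_vertices(width, height):
    """Vertices of the corrected rectangle, with the origin at its top-left corner."""
    return [(0, 0), (width, 0), (width, height), (0, height)]


quad = [(0, 0), (4, 3), (4, 13), (0, 10)]  # hypothetical predicted vertices
w, h = target_size(quad)
rect = rectangle_vertices(w, h)
```

The resulting `rect` lists its vertices in the same order as `quad`, so the two lists can be paired directly when the perspective transformation matrix is computed.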
[0115] S203. Determine the perspective transformation matrix based on the coordinates of the quadrilateral vertices and the rectangle vertices.
[0116] In this embodiment, the perspective transformation matrix can be calculated by the least squares method based on the correspondence between the four sets of vertices mentioned above, as explained in the previous embodiments, and will not be repeated here.
[0117] Understandably, by using rectangle correction to ensure that the question area is accurately adjusted into a regular rectangle, the impact of perspective distortion on the accuracy of subsequent question recognition can be reduced.
[0118] In some embodiments, the method can also be applied in augmented reality environments, where users can see the correction process of the question area in real time and interact with and solve the questions in the corrected view, thereby providing more comprehensive and accurate question identification and solutions.
[0119] Correspondingly, before step S201 above, it is necessary to first perform target recognition on the question image based on the first object detection model to obtain the coordinates of the quadrilateral vertices of the question region.
[0120] The first target detection model can be any machine learning or deep learning-based model used to identify and locate specific target regions in an image. In this embodiment, the model is used to identify the question region.
[0121] It should be noted that the first target detection model and the second target detection model in the subsequent embodiments are both models trained on training data. The first target detection model can accurately locate the question area in the image and output the coordinates of the quadrilateral vertices of the area.
[0122] Specifically, the first target detection model not only outputs the target category and bounding box, but also regresses the quadrilateral vertex coordinates of each target region. Its structure includes: a backbone network (such as VGG, ResNet, DarkNet) for extracting image features; a neck module (such as PAN, Bi-FPN) for enhancing and fusing image features; and a head prediction part for outputting the category, bounding box coordinates, and quadrilateral vertex coordinates.
[0123] The image-based problem-solving method provided in this application can eliminate image distortion caused by shooting angle or deformation by correcting the coordinates of quadrilateral vertices, converting areas that may have been tilted or distorted into standard rectangles, thus improving the accuracy of subsequent image processing and analysis. Furthermore, by calculating the corrected target height and width, it can provide an accurate reference scale for subsequent perspective transformations, ensuring the adaptation and consistency of image size. Then, based on the coordinates of the quadrilateral vertices and the corrected rectangle vertices, a perspective transformation matrix is generated, enabling the original tilted image to be accurately converted into a planar perspective, ensuring that the image content is not distorted, and making it easier to recognize text and formulas in subsequent steps.
[0124] Figure 3 is a flowchart of the image-based problem-solving method provided in an embodiment of this application. As shown in Figure 3, this embodiment, building on the embodiment of Figure 1, describes the process of S104 in detail, specifically including the following steps:
[0125] S301. Based on the second target detection model, perform text recognition and formula recognition on the question area respectively to obtain the area label and rectangle coordinates corresponding to the text, or obtain the area label and rectangle coordinates corresponding to the text and the area label and rectangle coordinates corresponding to the formula.
[0126] The second target detection model can be any machine learning or deep learning model used to identify and locate different types of regions (such as text and formulas) in an image. In this embodiment, the structure of the second target detection model is the same as that of the first target detection model, and will not be described again here.
[0127] In this embodiment, the question area contains text. Therefore, by performing text recognition on the question area, the corresponding area label and rectangle coordinates can be obtained. When the question contains formulas, the formulas and text are recognized separately to obtain the corresponding area labels and rectangle coordinates for the text and the formulas, respectively.
[0128] S302. Based on the region label and the rectangle coordinates, crop the region where the text is located, or the region where the text is located and the region where the formula is located, to obtain the text region, or the text region and the formula region.
[0129] In this embodiment, cropping refers to extracting a specified region from the image, that is, extracting the region where the text is located or the region where the formula is located respectively.
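A minimal sketch of the cropping in S302 is given below. It assumes that the image is stored as a list of pixel rows and that the rectangle coordinates have the layout (x1, y1, x2, y2), that is, top-left and bottom-right corners; the names crop and crop_by_label and the dictionary keys "label" and "box" are hypothetical and are used only for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def crop(image: Sequence[Sequence[int]], box: Box) -> List[List[int]]:
    """Return the pixels inside box; image is a list of rows of pixel values."""
    x1, y1, x2, y2 = (int(round(v)) for v in box)
    return [list(row[x1:x2]) for row in image[y1:y2]]


def crop_by_label(image: Sequence[Sequence[int]],
                  detections: Sequence[dict]) -> Dict[str, List[List[List[int]]]]:
    """Crop every detected region and group the crops by region label."""
    regions: Dict[str, List[List[List[int]]]] = {"text": [], "formula": []}
    for det in detections:
        regions[det["label"]].append(crop(image, det["box"]))
    return regions
```

When the question contains no formula, the "formula" list in this sketch simply stays empty, which corresponds to the case in which only the text region is obtained.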
[0130] Specifically, for a target with the region label "text" and a target with the region label "formula", the cropping height is h and the cropping width is w, where h is the target height and w is the target width in the above embodiment.
[0131] S303. Scale the text region and the formula region respectively to obtain the target text image and the target formula image.
[0132] In this embodiment, after the text region and the formula region, each of size h×w, are cropped, they are scaled proportionally to images of size H×W, where H is a preset fixed value, such as 32 or 64, and W is calculated from the original (cropped) size of the region. For proportional scaling, the calculation formula is: W = (H / h) × w.
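The size calculation in S303 can be sketched as follows, under the assumption that "scaled proportionally" means the aspect ratio of the cropped region is preserved, so that W = H × w / h. The function name target_size, the rounding to whole pixels, and the minimum width of one pixel are choices made for this sketch only.

```python
from typing import Tuple


def target_size(h: int, w: int, target_height: int = 32) -> Tuple[int, int]:
    """Return (H, W) for a crop of size h x w scaled to a fixed height H.

    Assumes aspect-ratio-preserving scaling: W = H * w / h.
    """
    if h <= 0 or w <= 0:
        raise ValueError("crop height and width must be positive")
    scale = target_height / h
    # Round to whole pixels and keep at least one column.
    return target_height, max(1, round(w * scale))
```

For example, a 64×256 crop scaled to H = 32 yields a 32×128 image under this assumption.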
[0133] S304. Based on the target text image and the target formula image, determine the target question and the coordinates corresponding to the target question.
[0134] In one possible implementation, S304 can be specifically implemented through the following steps:
[0135] First, based on the preset text recognizer and formula recognizer, the target text image and target formula image are recognized accordingly to obtain the text and formula; then, according to the rectangle coordinates, the text and formula are sorted to obtain the target sequence; then, the target sequence is traversed to determine the target question and the coordinates corresponding to the target question.
[0136] The target sequence is obtained by arranging the coordinates of the rectangles from top to bottom and from left to right. This target sequence reflects the natural reading order of the text and formulas in the image.
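One possible way to build such a target sequence is sketched below. The rectangle layout (x1, y1, x2, y2), the dictionary keys, and the line_tol parameter, which treats rectangles whose top edges are within a few pixels of each other as belonging to the same line, are assumptions of this sketch; the application itself only specifies the top-to-bottom, left-to-right order.

```python
from typing import List


def reading_order(items: List[dict], line_tol: float = 10.0) -> List[dict]:
    """Sort recognised items top to bottom, then left to right within a line.

    Each item carries a "box" of the assumed form (x1, y1, x2, y2).
    """
    by_top = sorted(items, key=lambda it: it["box"][1])
    lines: List[List[dict]] = []
    for it in by_top:
        # Same line if the top edge is close to the top edge of the line's first item.
        if lines and abs(it["box"][1] - lines[-1][0]["box"][1]) <= line_tol:
            lines[-1].append(it)
        else:
            lines.append([it])
    ordered: List[dict] = []
    for line in lines:
        ordered.extend(sorted(line, key=lambda it: it["box"][0]))
    return ordered
```

Grouping into lines before sorting by x avoids the case where a formula that sits one or two pixels higher than the text on its left would otherwise be placed before that text.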
[0137] A text recognizer is an algorithm or model used to extract and recognize text from an image. It is usually based on optical character recognition (OCR) technology, such as CRNN, ABINet, ParSeq, etc.
[0138] A formula recognizer is an algorithm or model used to identify and parse mathematical formulas or symbolic expressions from images.
[0139] It should be noted that the processing steps for formula recognition are essentially the same as those for text recognition. The difference lies in the output: the formulas output by the formula recognizer are in the form of LaTeX expressions, not ordinary text. LaTeX is a standard language for typesetting and expressing mathematical formulas. It can accurately describe complex mathematical symbols, formulas, equations, and structures, such as fractions, square roots, exponents, and summation symbols. Because ordinary text cannot intuitively express these symbols and their structures, outputting LaTeX expressions improves the accuracy of formula recognition.
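As an illustration that is not taken from the application, a formula recognizer of this kind might output LaTeX expressions such as the following for a formula containing a fraction, a square root, and an exponent, and for a formula containing a summation symbol:

```latex
x = \frac{-b \pm \sqrt{b^{2} - 4ac}}{2a}, \qquad \sum_{i=1}^{n} i = \frac{n(n+1)}{2}
```

A flat text rendering such as "x = -b +- sqrt(b2 - 4ac) / 2a" loses the scope of the fraction and the exponent, which the LaTeX form states explicitly.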
[0140] Understandably, by using a specialized text and formula recognizer, the content in the image can be extracted more accurately. Then, by sorting the coordinates of the rectangles, the text and formulas are arranged in a natural reading order, which is convenient for parsing and understanding. Finally, by traversing the target sequence, the coordinate position of each target question can be accurately determined, which is convenient for subsequent display and interaction.
[0141] In this embodiment, if the starting position of the text or formula in the target sequence is a question number, and the difference between the x-coordinate of the upper left corner of the starting position and the x-coordinate of the upper left corner of the question area meets a preset threshold requirement, the question area is segmented, and the text and formula in the target sequence are classified into the target question; otherwise, the text and formula in the target sequence are classified into the target question until the next question number is found.
[0142] The question number refers to the sequence number of the question, which is usually a number or a letter, such as the commonly used 1, 2, etc., and is used to identify the beginning position of the question.
[0143] The preset threshold requirement is a predefined value based on the size of the question area, used to determine whether the difference between two coordinates is within an acceptable range.
[0144] In this embodiment, the target sequence is traversed to identify whether the starting position of each text or formula is a question number, and the difference between the x-coordinate of the top-left corner of the question number and the x-coordinate of the top-left corner of the question area is determined. If this difference meets the preset threshold requirement, the position is taken as the start of a new question: the question area is segmented there, and the subsequent text and formulas in the target sequence are classified into the current target question. This continues until the next question number is found, which marks both the end of the current question and the beginning of the next one. The above steps are repeated for each newly found question number to ensure that all text and formulas are correctly classified.
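The traversal described above can be sketched as follows. This is only an illustration under several assumptions: the question number is detected with a simple regular expression for a leading integer followed by "." or ")", the threshold requirement is interpreted as "the x-coordinate difference does not exceed max_offset", and any content that appears before the first question number is kept as its own group. The names QUESTION_NO, split_questions, area_x, and max_offset are hypothetical.

```python
import re
from typing import List

# Assumed question-number pattern: a leading integer followed by "." or ")".
QUESTION_NO = re.compile(r"^\s*\d+\s*[.)]")


def split_questions(sequence: List[dict], area_x: float, max_offset: float) -> List[List[dict]]:
    """Group the items of the target sequence into questions.

    sequence: items in reading order, each with "label", "content" and a
              "box" of the assumed form (x1, y1, x2, y2).
    area_x: x-coordinate of the top-left corner of the question area.
    max_offset: assumed threshold on the x-coordinate difference.
    """
    questions: List[List[dict]] = []
    for item in sequence:
        starts_with_number = (item["label"] == "text"
                              and QUESTION_NO.match(item["content"]) is not None)
        near_left_edge = abs(item["box"][0] - area_x) <= max_offset
        if (starts_with_number and near_left_edge) or not questions:
            questions.append([item])    # a new target question starts here
        else:
            questions[-1].append(item)  # still part of the current question
    return questions
```

The x-coordinate check is what keeps a numbered sub-item or a numeral in the middle of a line from being mistaken for the start of a new question.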
[0145] In some embodiments, the question number and the start and end points of the target question can also be determined by information such as numbers and brackets in the text.
[0146] Furthermore, after identifying the target question, the coordinates of the target question can be determined by calculating the difference between the smallest coordinate (starting position) and the largest coordinate (ending position) in the target question using the coordinates of the text and formulas in the target question.
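A minimal sketch of this coordinate calculation is shown below, assuming that each text or formula rectangle is given as (x1, y1, x2, y2) and that the coordinates of the target question are the smallest rectangle enclosing all of them. The function name question_box and the sample values in the usage lines are hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def question_box(boxes: Sequence[Box]) -> List[float]:
    """Smallest rectangle enclosing all text and formula rectangles of one question."""
    if not boxes:
        raise ValueError("a target question needs at least one rectangle")
    return [
        min(b[0] for b in boxes),  # leftmost x (starting position)
        min(b[1] for b in boxes),  # topmost y
        max(b[2] for b in boxes),  # rightmost x (ending position)
        max(b[3] for b in boxes),  # bottommost y
    ]


# Hypothetical rectangles of one text region and one formula region.
merged = question_box([(26.0, 32.0, 300.0, 60.0), (40.0, 30.0, 532.0, 229.0)])
```

The result has the same four-value layout as the position coordinates used in the examples of this embodiment.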
[0147] As an example, Figure 4 is a schematic diagram of the segmentation of the question area provided in an embodiment of this application. As shown in Figure 4, in this embodiment, text is first detected and its content is recognized. When text and formulas are detected, they are input into the text recognizer and the formula recognizer, respectively, to obtain the text and the formulas. Each formula is parsed into LaTeX-format text, and the coordinate information corresponding to each field is returned. Then, according to the user's question segmentation requirements, the questions are segmented based on the question number to obtain several target questions, including question 1, question 2, question 3, question 4, and question 5. By inputting them into the large language model, the question answers and question location information are obtained. For example, the location coordinates of question 1 are [26.0, 30.0, 532.0, 229.0303192138672], and the location coordinates of question 2 are [25.0, 233.60340881347656, 611.9300537109375, 348.08355712890625].
[0148] As another example, Figure 5 is another schematic diagram of question area segmentation provided in an embodiment of this application. As shown in Figure 5, in this embodiment, the questions are first segmented according to the question number to obtain several target questions. For each target question, after the text is detected and recognized, the adjacent text areas and formula areas are concatenated and combined. Taking question 1 as an example, the detection result is obtained, and the text in the detection result is concatenated with the formulas to obtain the concatenated result. Then, the positions of the upper left and lower right corners of the concatenated result are calculated. The coordinates of the upper left corner are [26.0, 32.0], and the coordinates of the lower right corner are [532.0, 229.0303192138672]. The final position can then be obtained, that is, the position coordinates of question 1 are [26.0, 30.0, 532.0, 229.0303192138672].
[0149] Understandably, by identifying question numbers and judging thresholds, the content in the question area can be more accurately segmented and categorized. At the same time, by dynamically judging the position of the question number and segmenting the area, it can adapt to questions with different formats and layouts, thereby improving the accuracy of question recognition.
[0150] The image-based problem-solving method provided in this application ensures clear extraction and standardized processing of problem content through precise text and formula recognition, region cropping, scaling, and coordinate positioning, thus providing higher accuracy, efficiency, and stability for the subsequent problem-solving process.
[0151] Figure 6 is a schematic diagram of the image-based problem-solving device provided in an embodiment of this application. As shown in Figure 6, the image-based problem-solving device 60 provided in this embodiment includes:
[0152] The acquisition module 601 is used to acquire the question image.
[0153] The determination module 602 is used to determine the coordinates of the quadrilateral vertices in the question area of the question image based on the question image, and to determine the perspective transformation matrix corresponding to the quadrilateral vertex coordinates based on the quadrilateral vertex coordinates.
[0154] The processing module 603 is used to perform interpolation processing on the question area according to the perspective transformation matrix to obtain the target image;
[0155] The recognition module 604 is used to perform text recognition and formula recognition on the question area according to the target image, so as to determine the target question contained in the target image and the coordinates corresponding to the target question;
[0156] The module 605 is used to input the target question into a preset large language model, obtain the answer corresponding to the target question, and display the target question and the answer corresponding to the target question based on coordinates.
[0157] Optionally, the determination module 602 is specifically used for:
[0158] The coordinates of the quadrilateral vertices are subjected to rectangular correction to obtain the target height and target width;
[0159] Determine the coordinates of the rectangle's vertices based on the target's height and width;
[0160] Determine the perspective transformation matrix based on the coordinates of the quadrilateral vertices and the rectangle vertices.
[0161] Optionally, the determination module 602 is specifically used for:
[0162] Based on the first target detection model, target recognition is performed on the question image to obtain the coordinates of the quadrilateral vertices in the question region.
[0163] Optionally, the recognition module 604 is specifically used for:
[0164] Based on the second target detection model, text recognition and formula recognition are performed on the question area respectively to obtain the area label and rectangle coordinates corresponding to the text, or to obtain the area label and rectangle coordinates corresponding to the text and the area label and rectangle coordinates corresponding to the formula.
[0165] Based on the region label and rectangle coordinates, the region containing the text, or the region containing the text and the region containing the formula, is clipped to obtain the text region, or the text region and the formula region.
[0166] The text region and the formula region are scaled separately to obtain the target text image and the target formula image;
[0167] Based on the target text image and the target formula image, determine the target question and the coordinates corresponding to the target question.
[0168] Optionally, the recognition module 604 is also used for:
[0169] Based on preset text recognizers and formula recognizers, the target text image and target formula image are recognized accordingly to obtain the text and formula;
[0170] Based on the coordinates of the rectangles, the text and formulas are sorted to obtain the target sequence, which is obtained by arranging the rectangle coordinates in order from top to bottom and from left to right.
[0171] Traverse the target sequence to determine the target question and its corresponding coordinates.
[0172] Optionally, the recognition module 604 is also used for:
[0173] If the starting position of the text or formula in the target sequence is the question number, and the difference between the x-coordinate of the top left corner of the starting position and the x-coordinate of the top left corner of the question area meets the preset threshold requirement, the question area is segmented, and the text and formula in the target sequence are classified into the target question.
[0174] Otherwise, until the next question number is found, classify the text and formulas in the target sequence into the target question.
[0175] The image-based problem-solving device provided in this embodiment can execute the methods provided in the above-described method embodiments. Its implementation principle and technical effects are similar, and will not be described in detail here.
[0176] Figure 7 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application. As shown in Figure 7, the electronic device 70 provided in this embodiment includes at least one processor 701 and a memory 702. Optionally, the device 70 further includes a communication component 703. The processor 701, memory 702, and communication component 703 are connected via a bus 704.
[0177] In a specific implementation, the at least one processor 701 executes the computer-executable instructions stored in the memory 702, causing the at least one processor 701 to perform the above-described method.
[0178] The specific implementation process of processor 701 can be found in the above method embodiments, and its implementation principle and technical effect are similar. It will not be repeated here.
[0179] In the above embodiments, it should be understood that the processor can be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), etc. The general-purpose processor can be a microprocessor or any conventional processor. The steps of the method disclosed in this invention can be directly implemented by a hardware processor, or implemented by a combination of hardware and software modules within the processor.
[0180] The memory may include random access memory (RAM) and may also include non-volatile memory (NVM), such as at least one disk storage device.
[0181] The bus can be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, or an Extended Industry Standard Architecture (EISA) bus, etc. Buses can be categorized as address buses, data buses, control buses, etc. For ease of illustration, the buses shown in the accompanying drawings are not limited to a single bus or a single type of bus.
[0182] This application also provides a computer program product, including a computer program that, when executed by a processor, implements the above-described method.
[0183] This application also provides a computer-readable storage medium storing computer-executable instructions, which, when executed by a processor, implement the above-described method.
[0184] The aforementioned readable storage medium can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic storage, flash memory, magnetic disk, or optical disk. The readable storage medium can be any available medium accessible to a general-purpose or special-purpose computer.
[0185] An exemplary readable storage medium is coupled to a processor, enabling the processor to read information from and write information to the readable storage medium. Of course, the readable storage medium can also be a component of the processor. The processor and the readable storage medium can reside in an Application Specific Integrated Circuit (ASIC). Alternatively, the processor and the readable storage medium can exist as discrete components in the device.
[0186] The division of units is merely a logical functional division; in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or other forms.
[0187] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0188] In addition, the functional units in the various embodiments of the present invention can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit.
[0189] If a function is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this invention, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods of the various embodiments of this invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0190] Those skilled in the art will understand that all or part of the steps of the above-described method embodiments can be implemented by hardware related to program instructions. The aforementioned program can be stored in a computer-readable storage medium. When executed, the program performs the steps of the above-described method embodiments; and the aforementioned storage medium includes various media capable of storing program code, such as ROM, RAM, magnetic disks, or optical disks.
[0191] Finally, it should be noted that other embodiments of the invention will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention that follow the general principles of the invention and include common knowledge or customary techniques in the art not disclosed herein, and is not limited to the precise structures described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope. The scope of the invention is limited only by the appended claims.
Claims
1. An image-based problem-solving method, characterized in that it comprises: acquiring a question image; determining, based on the question image, the quadrilateral vertex coordinates of the question area in the question image, and determining, based on the quadrilateral vertex coordinates, the perspective transformation matrix corresponding to the quadrilateral vertex coordinates; performing interpolation processing on the question area based on the perspective transformation matrix to obtain a target image; performing text recognition and formula recognition on the question area based on the target image to determine the target question contained in the target image and the coordinates corresponding to the target question; and inputting the target question into a preset large language model to obtain the answer corresponding to the target question, and displaying the target question and the answer corresponding to the target question based on the coordinates.
2. The method according to claim 1, characterized in that determining, based on the quadrilateral vertex coordinates, the perspective transformation matrix corresponding to the quadrilateral vertex coordinates comprises: performing rectangular correction processing on the quadrilateral vertex coordinates to obtain a target height and a target width; determining rectangle vertex coordinates based on the target height and the target width; and determining the perspective transformation matrix based on the quadrilateral vertex coordinates and the rectangle vertex coordinates.
3. The method according to claim 1, characterized in that determining, based on the question image, the quadrilateral vertex coordinates of the question area in the question image comprises: performing target recognition on the question image based on a first target detection model to obtain the quadrilateral vertex coordinates of the question area.
4. The method according to any one of claims 1-3, characterized in that performing text recognition and formula recognition on the question area based on the target image to determine the target question contained in the target image and the coordinates corresponding to the target question comprises: performing, based on a second target detection model, text recognition and formula recognition on the question area respectively to obtain the area label and rectangle coordinates corresponding to the text, or to obtain the area label and rectangle coordinates corresponding to the text and the area label and rectangle coordinates corresponding to the formula; cropping, based on the area label and the rectangle coordinates, the region where the text is located, or the region where the text is located and the region where the formula is located, to obtain a text region, or a text region and a formula region; scaling the text region and the formula region respectively to obtain a target text image and a target formula image; and determining the target question and the coordinates corresponding to the target question based on the target text image and the target formula image.
5. The method according to claim 4, characterized in that determining the target question and the coordinates corresponding to the target question based on the target text image and the target formula image comprises: recognizing the target text image and the target formula image with a preset text recognizer and a preset formula recognizer, respectively, to obtain the text and the formula; sorting the text and the formula according to the rectangle coordinates to obtain a target sequence, the target sequence being obtained by arranging the rectangle coordinates from top to bottom and from left to right; and traversing the target sequence to determine the target question and the coordinates corresponding to the target question.
6. The method according to claim 5, characterized in that traversing the target sequence to determine the target question and the coordinates corresponding to the target question comprises: if the starting position of the text or formula in the target sequence is a question number, and the difference between the x-coordinate of the upper left corner corresponding to the starting position and the x-coordinate of the upper left corner of the question area meets a preset threshold requirement, segmenting the question area and classifying the text and formula in the target sequence into the target question; otherwise, classifying the text and formula in the target sequence into the target question until the next question number is found.
7. An image-based problem-solving device, characterized in that it comprises: an acquisition module, used to acquire a question image; a determination module, used to determine, based on the question image, the quadrilateral vertex coordinates of the question area in the question image, and to determine, based on the quadrilateral vertex coordinates, the perspective transformation matrix corresponding to the quadrilateral vertex coordinates; a processing module, used to perform interpolation processing on the question area according to the perspective transformation matrix to obtain a target image; a recognition module, used to perform text recognition and formula recognition on the question area according to the target image, so as to determine the target question contained in the target image and the coordinates corresponding to the target question; and a module used to input the target question into a preset large language model, obtain the answer corresponding to the target question, and display the target question and the answer corresponding to the target question based on the coordinates.
8. An electronic device, characterized in that it comprises: a processor, and a memory communicatively connected to the processor; the memory stores computer-executable instructions; and the processor executes the computer-executable instructions stored in the memory to implement the method according to any one of claims 1 to 6.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer-executable instructions which, when executed by a processor, implement the method according to any one of claims 1 to 6.
10. A computer program product, characterized in that it comprises a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 6.