Image processing method, apparatus and electronic device
By extracting and fusing features from geometric images, combined with semantic segmentation and multi-task detection, the problem of separating geometric features from non-geometric features is solved, achieving high-precision analysis of geometric images and enhancing the performance of intelligent education and automatic problem-solving systems.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- LENOVO (BEIJING) LTD
- Filing Date
- 2026-02-05
- Publication Date
- 2026-06-19
AI Technical Summary
Existing geometric graph analysis methods separate geometric features from non-geometric features, resulting in a failure to effectively integrate spatial layout and semantic information. This affects the accuracy of geometric element recognition and the completeness of geometric relationship prediction, and makes it difficult to distinguish geometric semantic structures at different levels, leading to misjudgment or omission of relationships.
By extracting geometric and non-geometric features from the image to be processed, fusing them together to determine the attribute information of the elements, and performing semantic segmentation, instance clustering, and multi-task detection based on the fused and non-geometric features, combined with the principle of spatial proximity, the association between geometric and non-geometric elements can be identified.
It enhances the model's ability to understand complex geometric relationships, improves the completeness and accuracy of image semantic information, and is suitable for various application scenarios such as intelligent education systems, automatic problem-solving systems, and geometric graph retrieval systems.
Figure CN122243869A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of image processing technology, and in particular to an image processing method, apparatus and electronic device.
Background Technology
[0002] Image processing has wide applications in fields such as educational assessment, engineering design, and intelligent analysis. As an important form of image, the automatic parsing of the structural information and semantic relationships of geometric graphs is crucial for realizing intelligent education systems and automated analysis. Geometric graphs can contain geometric elements such as points, lines, and circles, as well as non-geometric elements such as text and symbols. Accurate identification and understanding of these elements and their interrelationships are fundamental to improving image parsing performance.
[0003] In related technologies, geometric graph analysis methods employ image detection techniques to extract geometric primitives and use graph neural networks for relation inference. However, these methods separate the processing of geometric features from that of non-geometric features, resulting in a failure to effectively integrate spatial layout and semantic information, thus affecting the accuracy of geometric element recognition and the completeness of geometric relationship prediction. Furthermore, these technologies struggle to distinguish between different levels of geometric semantic structures, leading to problems such as misjudgment or omission of relationships.
Summary of the Invention
[0004] This application provides an image processing method, apparatus, and electronic device. The image processing method includes: performing geometric feature extraction and non-geometric feature extraction on the image to be processed to obtain geometric features and non-geometric features respectively; fusing the geometric features and non-geometric features to obtain fused features; determining, based on at least one of the fused features and the non-geometric features, first attribute information of geometric elements and second attribute information of non-geometric elements in the image to be processed, where the first attribute information and the second attribute information include at least the position and category of the element; and determining the semantic information of the image to be processed based on the first attribute information and the second attribute information, where the semantic information includes at least geometric graphic information and the association relationship between geometric elements and non-geometric elements.
[0005] In some embodiments, determining first attribute information of geometric elements and second attribute information of non-geometric elements in the image to be processed based on at least one of fused features and non-geometric features includes: performing semantic segmentation and instance clustering on the fused features respectively to obtain first attribute information of geometric elements, wherein the first attribute information includes at least the geometric category probability and the geometric instance to which each geometric pixel belongs in the geometric element, and the position of the geometric element; performing multi-task detection on the non-geometric features and fused features to obtain second attribute information of non-geometric elements, wherein the second attribute information includes at least the non-geometric category probability and the centrality confidence of each non-geometric pixel in the non-geometric element in the non-geometric object, and the position and text content of the non-geometric element; wherein the multi-task detection includes at least classification detection, position detection and pixel centrality detection.
[0006] In some embodiments, semantic segmentation and instance clustering are performed on the fused features to obtain the first attribute information of the geometric elements, including: performing semantic segmentation on the fused features based on a semantic segmentation network to obtain the geometric category probability of each geometric pixel in the geometric elements; the semantic segmentation network is trained at least using a weighted binary cross-entropy loss function; locating the geometric elements in the geometric features based on a regression network to obtain the position of the geometric elements; and performing instance clustering on the fused features based on an instance segmentation network to obtain the geometric instance to which each geometric pixel in the geometric elements belongs; the instance segmentation network is trained at least using a discriminative loss function based on a distance metric.
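As an illustration of the weighted binary cross-entropy loss mentioned above, the following is a minimal sketch in plain Python. The function name `weighted_bce`, the `pos_weight` parameter, and the sample values are assumptions made for illustration; the application does not specify how the weights are chosen.

```python
import math

def weighted_bce(probs, labels, pos_weight=1.0, eps=1e-7):
    """Mean weighted binary cross-entropy over a list of pixels.

    probs      -- predicted probability that each pixel is a geometric pixel
    labels     -- ground-truth labels (1 = geometric pixel, 0 = background)
    pos_weight -- hypothetical weight applied to the positive (geometric) term
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)

# Thin strokes occupy few pixels, so missed geometric pixels can be
# penalised more heavily by raising pos_weight.
probs = [0.9, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 0]
plain = weighted_bce(probs, labels)
weighted = weighted_bce(probs, labels, pos_weight=5.0)
```

With a larger `pos_weight`, the poorly predicted geometric pixel (probability 0.2) contributes more to the loss, which is one plausible reason to weight the loss when line pixels are far rarer than background pixels.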
[0007] In some embodiments, multi-task detection is performed on non-geometric features and fused features to obtain second attribute information of non-geometric elements, including: classifying and detecting the fused features based on a classification network to obtain the non-geometric class probability of each non-geometric pixel in the non-geometric element; the classification network is trained at least through a class imbalance loss function; detecting the position of the non-geometric elements in the non-geometric features based on a regression network to obtain the position of the non-geometric elements; performing pixel centrality detection on the position of each pixel in the non-geometric elements in the non-geometric features based on a centrality network to obtain the centrality confidence of each non-geometric pixel in its respective non-geometric object; and performing content recognition on the non-geometric elements in the non-geometric features based on a character recognition network to obtain the text content of the non-geometric elements.
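The application does not define how the centrality confidence is computed. One common choice in anchor-free detectors is a "centerness" score derived from a pixel's distances to the four sides of its box; the sketch below assumes that formulation, and the name `centerness` and the box layout `(x0, y0, x1, y1)` are illustrative.

```python
import math

def centerness(px, py, box):
    """Centerness of pixel (px, py) inside box = (x0, y0, x1, y1).

    Returns 1.0 at the box centre and decays towards 0.0 at the border.
    Pixels on or outside the border score 0.0.
    """
    x0, y0, x1, y1 = box
    left, right = px - x0, x1 - px
    top, bottom = py - y0, y1 - py
    if min(left, right, top, bottom) <= 0:
        return 0.0
    horizontal = min(left, right) / max(left, right)
    vertical = min(top, bottom) / max(top, bottom)
    return math.sqrt(horizontal * vertical)

text_box = (0, 0, 10, 10)
center_score = centerness(5, 5, text_box)
edge_score = centerness(1, 5, text_box)
```

A score of this kind can be used to down-weight predictions made from pixels near the edge of a text or symbol region, which tend to give less reliable boxes.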
[0008] In some embodiments, semantic information of the image to be processed is determined based on first attribute information and second attribute information, including: matching geometric elements and non-geometric elements based on spatial proximity principle, the position of geometric elements in the first attribute information, and the position of non-geometric elements in the second attribute information to obtain the matching relationship between geometric elements and non-geometric elements; determining the geometric relationship of geometric elements based on the position of geometric elements in the first attribute information and the geometric category probability and the geometric instance to which each pixel in the geometric element belongs; and determining geometric graphic information and association relationship based on the matching relationship and the geometric relationship.
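The spatial proximity matching described above can be sketched as a nearest-element search. In this hypothetical example, each text label is reduced to its centre point and each geometric element is a line segment; the names `point_segment_distance` and `match_labels`, the `max_dist` threshold, and the coordinates are assumptions, not details from the application.

```python
import math

def point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment with endpoints a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:  # degenerate segment: a single point
        return math.hypot(px - ax, py - ay)
    # Projection parameter of p onto the line, clamped to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def match_labels(labels, segments, max_dist=float("inf")):
    """Assign each text label to its nearest segment within max_dist."""
    matches = {}
    for text, center in labels.items():
        name, dist = min(
            ((n, point_segment_distance(center, a, b)) for n, (a, b) in segments.items()),
            key=lambda item: item[1],
        )
        if dist <= max_dist:
            matches[text] = name
    return matches

segments = {"AB": ((0, 0), (10, 0)), "BC": ((10, 0), (10, 8))}
labels = {"5": (5, -1), "8": (11, 4)}
matches = match_labels(labels, segments)
```

Here the label "5" sits just below AB and the label "8" sits just right of BC, so each is linked only to its nearest segment. A distance threshold such as `max_dist` is one way to leave a label unmatched when no element is close enough.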
[0009] In some embodiments, the geometric relationship of a geometric element is determined based on the position of the geometric element in the first attribute information, the geometric category probability of each pixel in the geometric element, and the geometric instance to which it belongs. This includes: determining pixels belonging to the same instance as the same geometric primitive based on the geometric instance to which they belong; the geometric primitive includes at least a point, an edge, and a circle; and detecting the spatial relationship between different geometric primitives based on the geometric category probability and the position of the geometric element to obtain the geometric relationship of the geometric element.
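The step of treating pixels with the same instance as one geometric primitive can be illustrated as a simple grouping. The sketch below assumes each pixel arrives as `(x, y, instance_id, class_probs)` and labels each primitive by the category with the highest mean probability; both the data layout and the averaging rule are assumptions for illustration.

```python
from collections import defaultdict

def group_into_primitives(pixels):
    """Group pixels by instance id and summarise each group as one primitive."""
    groups = defaultdict(list)
    for x, y, instance_id, class_probs in pixels:
        groups[instance_id].append((x, y, class_probs))

    primitives = {}
    for instance_id, members in groups.items():
        categories = members[0][2].keys()
        # Average the per-pixel category probabilities over the instance.
        mean_probs = {
            c: sum(probs[c] for _, _, probs in members) / len(members)
            for c in categories
        }
        xs = [x for x, _, _ in members]
        ys = [y for _, y, _ in members]
        primitives[instance_id] = {
            "category": max(mean_probs, key=mean_probs.get),
            "bbox": (min(xs), min(ys), max(xs), max(ys)),
            "pixel_count": len(members),
        }
    return primitives

pixels = [
    (0, 0, 1, {"point": 0.1, "edge": 0.8, "circle": 0.1}),
    (1, 0, 1, {"point": 0.2, "edge": 0.7, "circle": 0.1}),
    (2, 0, 1, {"point": 0.6, "edge": 0.3, "circle": 0.1}),
    (5, 5, 2, {"point": 0.9, "edge": 0.05, "circle": 0.05}),
]
primitives = group_into_primitives(pixels)
```

Averaging over the instance lets a single ambiguous pixel (the third one above, which leans towards "point") be outvoted by the rest of the primitive it belongs to.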
[0010] In some embodiments, determining geometric information and associations based on matching relationships and geometric relationships includes: fusing matching relationships and geometric relationships to obtain geometric information of the image to be processed; the geometric information includes at least geometric figures and corresponding text symbols; performing geometric constraint verification on the geometric information to obtain a verification result; and determining associations based on the geometric information in response to the verification result indicating that the verification has passed.
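The application does not list the constraints checked during geometric constraint verification. As a hedged example, the sketch below checks two well-known triangle constraints (the triangle inequality and the 180° angle sum) against values recovered from matched text labels. The function `verify_triangle` and the tolerance `tol_deg` are hypothetical.

```python
def verify_triangle(side_lengths=None, angles_deg=None, tol_deg=0.5):
    """Check parsed triangle annotations against basic geometric constraints.

    Returns (passed, problems), where problems lists each violated constraint.
    """
    problems = []
    if side_lengths is not None:
        a, b, c = sorted(side_lengths)
        if a <= 0 or a + b <= c:
            problems.append("triangle inequality violated")
    if angles_deg is not None:
        if abs(sum(angles_deg) - 180.0) > tol_deg:
            problems.append("interior angles do not sum to 180 degrees")
    return (not problems), problems

ok, ok_problems = verify_triangle([3, 4, 5], [90, 53.13, 36.87])
bad, bad_problems = verify_triangle([1, 2, 5], [90, 60, 60])
```

When verification fails, a pipeline of this kind could decline to emit the associations, for instance because a misread label such as "60" instead of "30" produced an impossible figure.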
[0011] In some embodiments, the image processing method further includes: performing multi-scale feature extraction on the geometric map to obtain a multi-scale feature map; correspondingly, performing geometric feature extraction and non-geometric feature extraction on the image to be processed to obtain geometric features of geometric elements and non-geometric features of non-geometric elements, including: performing geometric feature extraction and non-geometric feature extraction on the multi-scale feature map to obtain geometric features and non-geometric features.
[0012] This application provides an image processing apparatus, comprising: a feature extraction module for extracting geometric features and non-geometric features from an image to be processed, respectively, to obtain geometric features and non-geometric features; a fusion module for fusing the geometric features and non-geometric features to obtain fused features; a first determination module for determining, based on at least one of the fused features and non-geometric features, first attribute information of geometric elements and second attribute information of non-geometric elements in the image to be processed; the first attribute information and the second attribute information include at least the position and category of the elements; and a second determination module for determining, based on the first attribute information and the second attribute information, semantic information including at least geometric graphic information and the association relationship between geometric elements and non-geometric elements.
[0013] This application provides an electronic device, which includes: a memory for storing executable instructions; and a processor for executing the executable instructions stored in the memory to implement the image processing method provided in this application.
[0014] This application provides a computer-readable storage medium storing a computer program or computer-executable instructions for implementing the image processing method provided in this application when executed by a processor.
[0015] This application provides a computer program product, including a computer program or computer executable instructions. When the computer program or computer executable instructions are executed by a processor, they implement the image processing method provided in this application.
Attached Figure Description
[0016] Figure 1 is a schematic diagram of the structure of the electronic device provided in the embodiments of this application; Figure 2 is an optional flowchart of the image processing method provided in an embodiment of this application; Figure 3 is a model framework diagram of the planar geometry analysis method provided in the embodiments of this application; Figure 4 is a schematic diagram of the hierarchical structure of the geometry provided in the embodiments of this application.
[0017] It should be noted that the terms "first" and "second" mentioned above are only used to distinguish between different options and do not represent the degree of superiority or inferiority of the options or their priority in the implementation process.
Detailed Implementation
[0018] To make the objectives, technical solutions, and advantages of this application clearer, the application will be further described in detail below with reference to the accompanying drawings. The described embodiments should not be regarded as limitations on this application. All other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0019] It should be understood that the following description of the embodiments is intended to explain and illustrate the overall concept of the embodiments of this application, and should not be construed as limiting the embodiments of this application. In the specification and drawings, the same or similar reference numerals refer to the same or similar parts or components. For clarity, the drawings are not necessarily drawn to scale, and some well-known parts and structures may be omitted in the drawings.
[0020] In some embodiments, unless otherwise defined, the technical or scientific terms used in the embodiments of this application should have the meaning understood by a person skilled in the art to which the embodiments of this application pertain. The terms "first," "second," and similar terms used in the embodiments of this application do not indicate any order, quantity, or importance, but are merely used to distinguish different components. The word "a" or "an" does not exclude multiple components. The terms "comprising" or similar terms mean that the element or object preceding the word covers the elements or objects listed after the word and their equivalents, without excluding other elements or objects. The terms "connected" or similar terms are not limited to physical or mechanical connections, but can include electrical connections, whether direct or indirect. "Above," "below," "left," "right," "top," or "bottom," etc., are used only to indicate relative positional relationships, and these relative positional relationships may change accordingly when the absolute position of the described object changes. When an element such as a layer, film, region, or substrate is referred to as being "above" or "below" another element, the element may be "directly" located "above" or "below" the other element, or there may be intermediate elements present.
[0021] Planar geometric graphs have significant application value in fields such as educational assessment, engineering design, and intelligent analysis. The automatic analysis of geometric structures, geometric relationships, and text annotations within geometric graphs is a key technological component of intelligent education systems, automated problem-solving systems, and geometric graph retrieval systems. Current automatic geometric graph analysis primarily relies on image detection and graph neural network relationship reasoning frameworks, but in real-world complex scenarios, the following prominent problems remain. First, geometric and non-geometric features are separated: related technologies generally treat geometric primitive detection (such as points, lines, and circles) and non-geometric primitive recognition (such as text and symbols) as independent modules, ignoring their deep correlation in spatial layout and semantics. This leads to errors in distinguishing geometric elements and insufficient accuracy in predicting geometric relationships. Second, geometric relationship prediction is inaccurate: related technologies use a single graph neural network to predict global binary relationships across the entire geometric graph, failing to distinguish multi-level geometric semantic structures, so that predicted geometric relationships are both incomplete and redundant. As an example of redundancy, the length text "5" of line segment AB should not be linked to line segments other than AB. As an example of omission, a small number of texts, such as the angle "120°", are not linked to the corresponding arrow, so the geometric relationship between "120°" and the "∠ABC" that the arrow points to is lost.
[0022] Therefore, the relevant technologies have significant shortcomings in terms of parsing accuracy, semantic integrity, and adaptability to real-world scenarios, and there is an urgent need for an image parsing method with hierarchical structure modeling capabilities and cross-modal feature interaction mechanisms.
[0023] To address the problems existing in related technologies, embodiments of this application provide an image processing method, which involves extracting geometric features and non-geometric features from the image to be processed to obtain geometric features and non-geometric features respectively; fusing the geometric features and non-geometric features to obtain fused features; determining first attribute information of geometric elements and second attribute information of non-geometric elements in the image to be processed based on at least one of the fused features and non-geometric features; the first attribute information and the second attribute information include at least the position and category of the elements; and determining semantic information of the image to be processed based on the first attribute information and the second attribute information, wherein the semantic information includes at least geometric graphic information and the association relationship between geometric elements and non-geometric elements.
[0024] Thus, through cross-modal feature interaction mechanisms and reasoning, collaborative parsing of geometric and non-geometric elements is achieved. This not only enhances the model's ability to understand complex geometric relationships but also significantly improves the completeness and accuracy of image semantic information. It is applicable to various application scenarios such as intelligent education systems, automatic problem-solving systems, and geometric graph retrieval systems, enhancing the flexibility and scalability of the method.
[0025] In some embodiments, the image processing methods provided in this application can be executed by an electronic device, which may be a terminal, a server, or an edge computing device. That is, the image processing methods in the various embodiments of this application can be executed by a terminal or by a server. The server may be a physical server, a server cluster or a distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDNs), and big data and artificial intelligence platforms.
[0026] Figure 1 is a schematic diagram of the structure of the electronic device provided in the embodiments of this application. The electronic device 10 shown in Figure 1 includes at least one processor 110, a memory 150, at least one network interface 120, and a user interface 130. The components of the electronic device are coupled together via a bus system 140. It is understood that the bus system 140 is used to implement communication between these components. In addition to a data bus, the bus system 140 also includes a power bus, a control bus, and a status signal bus. However, for clarity, all of these buses are labeled as bus system 140 in Figure 1.
[0027] The processor 110 can be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor, a digital signal processor (DSP), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The general-purpose processor can be a microprocessor or any conventional processor, etc.
[0028] User interface 130 includes one or more output devices 131 that enable the presentation of media content, and one or more input devices 132.
[0029] Memory 150 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid-state memory, hard disk drives, optical disk drives, etc. Memory 150 may optionally include one or more storage devices physically located away from processor 110. Memory 150 may include volatile memory or non-volatile memory, or both. Non-volatile memory may be read-only memory (ROM), and volatile memory may be random access memory (RAM). The memory 150 described in this application embodiment is intended to include any suitable type of memory. In some embodiments, memory 150 is capable of storing data to support various operations, examples of which include AI agents, programs, modules, and data structures, or subsets or supersets thereof, as illustrated below.
[0030] Operating system 151 includes system programs for handling various basic system services and performing hardware-related tasks, such as the framework layer, core library layer, and driver layer, for implementing various basic business functions and handling hardware-based tasks. The network communication module 152 is used to reach other computing devices via one or more (wired or wireless) network interfaces 120; exemplary network interfaces 120 include Bluetooth, WiFi, and Universal Serial Bus (USB). The input processing module 153 is used to detect one or more inputs or interactions from one or more input devices 132.
[0031] In some embodiments, the apparatus provided in this application may be implemented in software. Figure 1 shows an image processing device 154 stored in memory 150. This image processing device 154 can be an image processing device in an electronic device, and can be software in the form of programs and plug-ins. It includes the following software modules: a feature extraction module 1541, a fusion module 1542, a first determination module 1543, and a second determination module 1544. These modules are logical units and can therefore be arbitrarily combined or further divided according to the functions they implement. The functions of each module will be described below.
[0032] In other embodiments, the apparatus provided in this application can also be implemented in hardware. As an example, the apparatus provided in this application can be a processor in the form of a hardware decoding processor, which is programmed to execute the image processing method provided in this application. For example, the processor in the form of a hardware decoding processor can be one or more application-specific integrated circuits (ASICs), DSPs, programmable logic devices (PLDs), complex programmable logic devices (CPLDs), field-programmable gate arrays (FPGAs), or other electronic components.
[0033] The technical solution of this application will now be described in detail with reference to the accompanying drawings.
[0034] Figure 2 is an optional flowchart of the image processing method provided in an embodiment of this application. As shown in Figure 2, the image processing method provided in this embodiment can be implemented through steps S201 to S204:
S201, perform geometric feature extraction and non-geometric feature extraction on the image to be processed to obtain geometric features and non-geometric features respectively.
[0035] In some embodiments, the image to be processed may refer to an image that includes at least geometric shapes, text symbols, and text annotations. Geometric and non-geometric features in the image to be processed can be extracted through feature extraction.
[0036] This embodiment can use a deep convolutional neural network (such as ResNet or another convolutional neural network) to perform multi-level feature extraction on the input image, obtaining geometric and non-geometric features. The purpose of geometric feature extraction is to obtain the position, shape, and other attribute information of the various geometric primitives in the image, while the purpose of non-geometric feature extraction is to identify the text content and its semantic meaning.
[0037] Here, geometric features can refer to numerical representations of geometric elements (such as points, lines, circles, angles, etc.) extracted from the image to be processed. These features can include spatial information such as position coordinates, shape contours, and orientation angles, and can be learned from image pixels through computer vision models (such as object detection networks).
[0038] Non-geometric features refer to information features in the image to be processed other than geometric shapes, and can include semantic content such as text, symbols, numbers, and annotations. These features can be extracted through optical character recognition (OCR) or text detection models, and include numerical representations of information such as character content, text position, and symbol type.
[0039] S202, fuse geometric features and non-geometric features to obtain fused features.
[0040] In this embodiment, fusion refers to the cross-modal interaction of geometric and non-geometric features through specific algorithms (such as feature concatenation, attention mechanisms, graph neural networks, etc.) to obtain a new feature representation, enabling the geometric and non-geometric features to enhance and complement each other. The fusion process not only includes concatenation along the feature dimension but also involves information exchange at the semantic level. The fused features retain spatial geometric information while incorporating semantic label information, forming a unified, context-sensitive feature vector.
[0041] Here, feature concatenation can be the direct connection of two feature vectors into a longer vector; attention mechanism can be the process of making geometric features focus on relevant non-geometric features, or the process of making non-geometric features focus on relevant geometric positions, thereby enhancing important information through weighted fusion; graph neural network fusion can be the process of treating geometric elements and text elements as nodes in a graph, and using graph convolution to transmit information and update features.
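To make the attention-based and concatenation-based options above concrete, the following is a minimal sketch of scaled dot-product attention in plain Python, where one geometric feature vector attends over several non-geometric feature vectors and the result is concatenated back. The names `attend` and `fuse` and the toy vectors are assumptions; a real model would use learned projections and a tensor library.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return context, weights

def fuse(geo_feat, text_feats):
    """Attend from a geometric feature to text features, then concatenate."""
    context, weights = attend(geo_feat, text_feats, text_feats)
    return geo_feat + context, weights  # list concatenation

geo_feat = [1.0, 0.0]
text_feats = [[1.0, 0.0], [0.0, 1.0]]
fused, weights = fuse(geo_feat, text_feats)
```

The fused vector keeps the original geometric feature and appends a weighted summary of the text features, with more weight on the text feature that is most similar to the geometric one.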
[0042] Here, during fusion, geometric features are converted into semantically friendly representations so that non-geometric elements can better understand the spatial location and geometric constraints of geometric features; non-geometric features (such as text) are also converted into geometrically friendly representations to enhance the correlation between geometric elements.
[0043] Fusion can be achieved by using geometric prior knowledge to guide the analysis of non-geometric elements, transforming geometric and non-geometric features separately through MLP, and injecting the transformed features into the other branch, thus realizing cross-modal joint modeling. This allows geometric and non-geometric branches to refer to each other's understanding results during the decision-making process, thereby improving the adaptability to complex geometric scenarios.
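The cross-branch injection described above can be sketched as follows, under the assumptions that each branch is transformed by a single linear layer with ReLU (standing in for an MLP) and that injection is an element-wise addition. The weights here are fixed identity matrices purely for illustration; in practice they would be learned.

```python
def linear_relu(x, weight, bias):
    """One linear layer followed by ReLU: max(0, W x + b)."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weight, bias)
    ]

def cross_inject(geo, non_geo, w_g2n, b_g2n, w_n2g, b_n2g):
    """Transform each branch and add the result to the other branch."""
    geo_msg = linear_relu(geo, w_g2n, b_g2n)              # geometry -> text branch
    non_geo_msg = linear_relu(non_geo, w_n2g, b_n2g)      # text -> geometry branch
    new_geo = [g + m for g, m in zip(geo, non_geo_msg)]
    new_non_geo = [n + m for n, m in zip(non_geo, geo_msg)]
    return new_geo, new_non_geo

identity = [[1.0, 0.0], [0.0, 1.0]]
zero_bias = [0.0, 0.0]
new_geo, new_non_geo = cross_inject(
    [1.0, 2.0], [3.0, -4.0], identity, zero_bias, identity, zero_bias
)
```

Because both messages are computed from the original features before either branch is updated, the exchange is symmetric and neither branch sees a partially updated counterpart.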
[0044] In some embodiments, the fused features can be passed through nonlinear transformation layers such as multilayer perceptrons (MLP) to enhance feature representation capabilities and obtain the final fused features.
[0045] S203, based on at least one of the fused features and non-geometric features, determine the first attribute information of geometric elements and the second attribute information of non-geometric elements in the image to be processed; the first attribute information and the second attribute information include at least the position and category of the element.
[0046] In the embodiments of this application, geometric elements can be basic geometric units in an image, including basic graphic elements such as points, lines, line segments, circles, arcs, and angles. Non-geometric elements can be non-graphical content in the image such as text, symbols, numbers, and labels, such as point labels (A, B, C), angle labels (∠ABC=90°), and length labels (AB=5cm).
[0047] In some embodiments, for geometric elements, since geometric elements have strong structural features, the fused features can provide sufficient information to support the attribute determination of geometric elements. Therefore, the fused features can be processed to obtain first attribute information. The first attribute information can refer to the specific attribute description of the geometric element, including at least position, category, and other attributes. Position can refer to the coordinate position of the geometric element in the image (such as the (x, y) coordinates of a point, the endpoint coordinates of a line); category can refer to the type of geometric element (such as point, line, circle, triangle, etc.); other attributes can be measurement information such as length, angle, and radius.
[0048] When determining the first attribute information, the precise coordinates of each geometric element can be predicted by a regression network based on the spatial information in the fused features; the element type at each position can be determined by a classification network based on the semantic information in the fused features; and a detection head structure similar to object detection can be used to perform position regression and category classification on each position on the fused feature map to obtain the first attribute information.
[0049] In some embodiments, geometric features and fusion features can be combined to determine the first attribute information in order to improve the accuracy of the first attribute information.
[0050] In some embodiments, for non-geometric elements, especially text elements, it is often necessary to combine semantic information from the original non-geometric features to achieve more accurate classification and matching. Therefore, the fused features and non-geometric features can be combined and processed to obtain second attribute information, thereby obtaining more comprehensive semantic support. The second attribute information can refer to the specific attribute description of the non-geometric element, which may include at least attribute information such as position, category, font size, and color. Position can be the location of text or symbols in the image (such as the coordinate range of a text box); content can be the specific content of the text (such as the character "A", the number "90", or the symbol "°").
[0051] In some embodiments, OCR results from non-geometric features can also be used for verification and correction to improve the accuracy of the second attribute information.
[0052] S204, based on the first attribute information and the second attribute information, determine the semantic information of the image to be processed; the semantic information includes at least geometric information and the relationship between geometric elements and non-geometric elements.
[0053] In some embodiments, semantic information can refer to the set of logical relationships between geometric and non-geometric elements in an image, including the composition of geometric figures, topological relationships between elements, geometric information, and the association between geometric and non-geometric elements. For example, a triangle may consist of three sides connected to each other, and text annotations may provide the length values of the sides; angle symbols or perpendicular symbols may be associated with specific geometric elements; a line segment may have its length annotated by text; and an angle symbol may indicate the angle between two sides.
[0054] In some embodiments, geometric information can be data such as the shape and comprehensive attributes (e.g., perimeter, area, etc.) of geometric elements. The relationship between geometric and non-geometric elements can be described by triples, for example, (AB, perpendicular, CD) indicates that line segment AB is perpendicular to line segment CD, which can completely describe the geometric structure and semantic relationships in the image.
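The triple representation described above can be sketched as follows. This is a minimal illustration only; the `Relation` type, its field names, and the `relations_of` helper are hypothetical and do not come from the application.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    """One (subject, predicate, object) triple describing an image relation."""
    subject: str
    predicate: str
    obj: str


# Example triples: one geometric relation and one label association (illustrative data).
relations = [
    Relation("AB", "perpendicular", "CD"),
    Relation("AB", "length", "5cm"),
]


def relations_of(element, triples):
    """Return every triple in which the element appears as subject or object."""
    return [t for t in triples if element in (t.subject, t.obj)]
```

Storing relations as flat triples keeps geometric relations and label associations in one uniform structure that later reasoning steps can query by element name.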
[0055] In this embodiment, the first and second attribute information can be input into a graph neural network for multi-level semantic reasoning. The micro-layer is responsible for identifying fine-grained relationships between elements (such as parallelism and perpendicularity), while the macro-layer is used to identify higher-order geometric structures (such as triangles and quadrilaterals). Finally, complete semantic information is output, which covers the relationships between all geometric elements.
[0056] This application's embodiments enhance the interaction between different modal features and improve the accuracy of subsequent attribute judgments by extracting geometric and non-geometric features separately and then fusing them. Based on at least one of the fused features and non-geometric features, the attribute information of geometric and non-geometric elements is determined separately, thereby achieving a comprehensive analysis of the structured information in the image. Combining the first and second attribute information, the semantic information of the image can be more accurately deduced. Through cross-modal feature interaction mechanisms and reasoning, collaborative analysis of geometric and non-geometric elements is achieved, which not only enhances the model's ability to understand complex geometric relationships but also improves the completeness and accuracy of image semantic information. It is applicable to various application scenarios such as intelligent education systems, automatic problem-solving systems, and geometric graph retrieval systems, enhancing the flexibility and scalability of the method.
[0057] In some embodiments, step S203 can be implemented by steps S2031 to S2032: S2031, perform semantic segmentation and instance clustering on the fused features to obtain the first attribute information of the geometric elements. The first attribute information includes at least the geometric category probability and the geometric instance to which each geometric pixel belongs in the geometric element, as well as the position of the geometric element.
[0058] Here, semantic segmentation is used to assign each pixel in an image to a specific semantic category in order to identify basic geometric elements (such as points, lines, circles, etc.) in a geometric graph and assign a geometric category probability to each pixel, thereby achieving pixel-level classification.
[0059] Geometric category probability represents the probability distribution of a pixel belonging to a certain geometric primitive (such as a point, line, or arc). For example, the probability that a pixel could be a line is 0.95, while the probability of it being a circle is 0.05. A geometric instance can refer to an actual geometric shape existing in the image, such as parallel lines or triangles. Location information represents the specific coordinate range of a geometric instance in the image, used for subsequent geometric relationship reasoning.
[0060] Instance clustering refers to the process of grouping pixels belonging to the same geometric object into one class after semantic segmentation, thus determining the geometric instance to which a pixel belongs. High-dimensional feature vectors for each pixel can be learned through embedded representations, and spatially adjacent and semantically consistent pixels are grouped into the same instance based on these high-dimensional feature vectors. Instance clustering can not only distinguish boundaries between different geometric objects but also solve the problem of differentiating slender or overlapping geometric objects.
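As a rough sketch of how per-pixel embedding vectors can be grouped into instances, the following uses a simple greedy threshold clustering. The greedy strategy, the function name `cluster_embeddings`, and the `radius` parameter are assumptions chosen for illustration; the application does not specify the clustering algorithm.

```python
import math


def cluster_embeddings(embeddings, radius=0.5):
    """Greedy clustering: join the nearest center within `radius`, else open a new instance.

    Returns one instance label per embedding, in input order.
    """
    centers, counts, labels = [], [], []
    for e in embeddings:
        best, best_d = None, radius
        for k, c in enumerate(centers):
            d = math.dist(e, c)
            if d <= best_d:
                best, best_d = k, d
        if best is None:
            # No center is close enough: this pixel starts a new instance.
            centers.append(list(e))
            counts.append(1)
            labels.append(len(centers) - 1)
        else:
            # Update the running mean of the chosen center.
            counts[best] += 1
            n = counts[best]
            centers[best] = [c + (x - c) / n for c, x in zip(centers[best], e)]
            labels.append(best)
    return labels
```

For example, embeddings that fall into two well-separated groups receive two distinct instance labels, even if the corresponding pixels share the same semantic category.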
[0061] The embodiments of this application can accurately identify geometric elements at the pixel level and classify the geometric elements into different instances, thereby constructing a complete geometric structure.
[0062] S2032, perform multi-task detection on non-geometric features and fused features to obtain the second attribute information of non-geometric elements. The second attribute information includes at least the non-geometric category probability of each non-geometric pixel in the non-geometric element and the centrality confidence in its respective non-geometric object, as well as the position and text content of the non-geometric element; wherein, the multi-task detection includes at least classification detection, position detection and pixel centrality detection.
[0063] In this embodiment, non-geometric objects refer to parts of an image that do not belong to geometric primitives, such as text, symbols, and arrows. Non-geometric objects contain important semantic information, such as length labels, angle values, and perpendicular symbols. To parse these elements effectively, this embodiment can employ a multi-task detection approach that performs classification, localization, and centrality prediction on the non-geometric features and fused features to obtain the second attribute information.
[0064] In some embodiments, classification detection is used to identify the type of non-geometric elements, such as text, arrows, perpendicular symbols, etc. Position detection is used to determine the specific location of each non-geometric element, which can be represented as a bounding box. Pixel centrality detection is used to evaluate the confidence level of the central region of the non-geometric element, thereby suppressing low-quality detection results and improving overall positioning accuracy. The higher the centrality confidence, the more likely the pixel is located at the center of the target; predictions from pixels far from the center can therefore be down-weighted, which helps to locate the target boundary accurately.
[0065] This application embodiment can also extract the text content of non-geometric elements, such as numbers, units, and symbols, to facilitate subsequent geometric relationship matching. For example, if a piece of text contains the value 5cm, it can be inferred that the text content may correspond to the length attribute of a line segment.
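The inference from recognized text to a candidate attribute can be illustrated with a small parser. This is a sketch under stated assumptions: the function `parse_label`, the supported unit list, and the returned dictionary layout are hypothetical and only cover simple labels such as "5cm" or "90°".

```python
import re

# Number followed by an optional unit; the unit list is an illustrative assumption.
_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(°|cm|mm|m)?\s*$")


def parse_label(text):
    """Guess which attribute a recognized text label may describe."""
    m = _PATTERN.match(text)
    if m is None:
        # Not a numeric label, e.g. a point name such as "A".
        return {"kind": "name", "value": text.strip()}
    value, unit = float(m.group(1)), m.group(2)
    if unit == "°":
        kind = "angle"
    elif unit:
        kind = "length"
    else:
        kind = "number"
    return {"kind": kind, "value": value, "unit": unit}
```

A label parsed as a length would then be a candidate for association with a nearby line segment, while a label parsed as an angle would be a candidate for a nearby vertex.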
[0066] This application's embodiments, through semantic segmentation and instance clustering of fused features, can more precisely identify the categories and instance affiliations of geometric elements, while simultaneously acquiring their spatial locations, thus improving the resolution accuracy of geometric elements. By classifying, locating, and performing centrality analysis on non-geometric elements through multi-task detection, the semantic information and spatial distribution characteristics of non-geometric elements can be effectively captured, better meeting the resolution needs of diverse elements in complex images.
[0067] In some embodiments, step S2031 can also be implemented through steps S1 to S3: S1, based on the semantic segmentation network, performs semantic segmentation on the fused features to obtain the geometric category probability of each geometric pixel in the geometric element; the semantic segmentation network is at least trained using a weighted binary cross-entropy loss function.
[0068] In this embodiment of the application, the semantic segmentation network can be a deep neural network structure used to classify each pixel in an image into a corresponding semantic category, and to identify which type of geometric primitive (such as point, line, circle) each pixel in the geometric graph belongs to.
[0069] The input to a semantic segmentation network can be a fused feature map, and the output can be a class probability map with the same size as the original image. Each pixel in the class probability map corresponds to a geometric class probability.
[0070] The weighted binary cross-entropy loss function can be the objective function for semantic segmentation networks, used to address class imbalance. In geometric graphs, the frequency of different geometric elements may vary significantly; for example, straight lines may appear much more often than circles. To prevent the semantic segmentation network from favoring high-frequency classes while ignoring low-frequency classes, embodiments of this application can dynamically adjust the weights of each class involved in the weighted binary cross-entropy loss function during the training process of the semantic segmentation network, enabling the semantic segmentation network to learn features from each class more evenly.
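To illustrate how a positive-pixel weight counteracts class imbalance, the following is a minimal sketch of a weighted binary cross-entropy. The helper `balance_weight`, which derives the weight from the negative-to-positive pixel ratio, is an assumed heuristic and is not taken from the application.

```python
import math


def balance_weight(labels):
    """Assumed heuristic: weight positives by the negative-to-positive pixel ratio."""
    positives = sum(labels)
    negatives = len(labels) - positives
    return negatives / positives if positives else 1.0


def weighted_bce(probs, labels, pos_weight):
    """Mean binary cross-entropy in which positive pixels are scaled by pos_weight."""
    eps = 1e-7  # clamp probabilities to avoid log(0)
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)
        total += pos_weight * y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(probs)
```

With a weight greater than 1, an error on a rare positive pixel (for example, a circle pixel) contributes more to the loss than an equally large error on a background pixel.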
[0071] In some embodiments, the weighted binary cross-entropy loss function is shown in formula (1):
$$L_{seg}^{*} = -\frac{1}{N}\sum_{i=1}^{N}\left[\alpha\, y_i^{*}\log p_i^{*} + \left(1-y_i^{*}\right)\log\left(1-p_i^{*}\right)\right] \tag{1}$$
where $*$ denotes the primitive category being extracted; $p_i^{*}$ and $y_i^{*}$ are the predicted probability and the ground-truth label of pixel $i$ for that category; $\alpha$ balances the weight ratio of positive and negative pixels, with an initial value set based on experience; and $N$ is the number of pixels in the segmented image. The semantic segmentation loss for geometric elements is the sum of the segmentation losses for points, lines, and circles; the total segmentation loss is represented as formula (2):
$$L_{seg} = L_{seg}^{point} + L_{seg}^{line} + L_{seg}^{circle} \tag{2}$$
where $L_{seg}^{point}$ is the point segmentation loss, $L_{seg}^{line}$ is the line segmentation loss, and $L_{seg}^{circle}$ is the circle segmentation loss.
S2, based on a regression network, locate the geometric elements in the geometric features to obtain the position of the geometric elements.
[0072] A regression network can be a neural network used to accurately locate identified geometric elements. The regression network regresses the bounding box of each geometric element and outputs the position coordinates of the bounding box (such as the coordinates of the top left and bottom right corners), thereby obtaining the spatial distribution information of each geometric element.
[0073] Obtaining high-precision location information through regression networks can ensure the accuracy of subsequent graph neural network inference processes.
[0074] S3, based on the instance segmentation network, performs instance clustering on the fused features to obtain the geometric instance to which each geometric pixel in the geometric element belongs; the instance segmentation network is at least trained by a discriminative loss function based on distance metric.
[0075] In this embodiment, the instance segmentation network can be a network structure capable of distinguishing different instances under the same semantic category, used to differentiate different geometric elements (such as two different lines) under the same semantic category. This can be achieved by generating an embedding vector for each pixel and clustering based on the distance between these vectors, thereby determining which geometric instance each pixel belongs to.
[0076] Instance segmentation networks can be trained using discriminative loss functions based on distance metrics, including intra-cluster variance loss and inter-cluster distance loss. Intra-cluster variance loss constrains pixel embedding vectors within the same geometric instance to be as close as possible, while inter-cluster distance loss ensures that embedding vectors from different instances maintain a certain distance. This training method enhances the discriminability between instances.
[0077] In some embodiments, the discriminative loss function based on a distance metric includes the intra-cluster variance loss $L_{var}$ and the inter-cluster distance loss $L_{dist}$, as shown in formulas (3) and (4):
$$L_{var} = \frac{1}{C}\sum_{c=1}^{C}\frac{1}{N_c}\sum_{i=1}^{N_c}\left[\lVert \mu_c - x_i^{c}\rVert - \delta_v\right]_+^2 \tag{3}$$
$$L_{dist} = \frac{1}{C(C-1)}\sum_{c_A=1}^{C}\ \sum_{\substack{c_B=1 \\ c_B\neq c_A}}^{C}\left[\delta_d - \lVert \mu_{c_A}-\mu_{c_B}\rVert\right]_+^2 \tag{4}$$
where $C$ is the total number of line and circle instances, i.e. $C = C_l + C_c$, in which $C_l$ and $C_c$ represent the number of line instances and the number of circle instances, respectively; $N_c$ is the number of pixels in the $c$-th geometric primitive instance; $x_i^{c}$ is the embedding vector of the $i$-th pixel in the $c$-th instance; $\mu_{c_A}$ and $\mu_{c_B}$ represent the cluster centers of the $c_A$-th instance and the $c_B$-th instance, that is, the mean of all pixel embeddings of the instance; $[x]_+ = \max(0, x)$; $\delta_v$ is the threshold for the embedding radius, with a default value of 0.5, which controls the maximum allowable distance between pixels within a cluster and the center; and $\delta_d$ is the minimum distance threshold between cluster centers, with a default value of 1.5, which ensures that the cluster centers of different instances are sufficiently far apart.
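The two loss terms can be sketched in plain Python as follows. This is an illustrative sketch, not the application's implementation: the hinge on the center distance is taken directly at the minimum distance threshold, which is an assumption about the exact form of formula (4), and the function name and input layout are hypothetical.

```python
import math


def discriminative_loss(instances, delta_v=0.5, delta_d=1.5):
    """Return (variance_loss, distance_loss) for a list of instances.

    `instances` is a list with one inner list of embedding vectors per instance.
    """
    # Cluster center of each instance: mean of its pixel embeddings.
    centers = [[sum(col) / len(pix) for col in zip(*pix)] for pix in instances]

    # Intra-cluster term: penalize pixels farther than delta_v from their center.
    var = 0.0
    for pix, mu in zip(instances, centers):
        var += sum(max(0.0, math.dist(x, mu) - delta_v) ** 2 for x in pix) / len(pix)
    var /= len(instances)

    # Inter-cluster term: penalize pairs of centers closer than delta_d.
    c = len(instances)
    dist = 0.0
    for a in range(c):
        for b in range(c):
            if a != b:
                dist += max(0.0, delta_d - math.dist(centers[a], centers[b])) ** 2
    if c > 1:
        dist /= c * (c - 1)
    return var, dist
```

In the usage below, both clusters are tight (zero variance loss), but their centers are only 1.0 apart, so the distance term is non-zero and would push them apart during training.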
[0078] This application employs a combination of semantic segmentation networks and instance segmentation networks, which can obtain the category probability of each pixel and perform instance-level clustering, thereby achieving pixel-level geometric element parsing. Simultaneously, by introducing a weighted binary cross-entropy loss function and a discriminative loss function, the data imbalance problem can be effectively alleviated and the discriminative ability improved. This not only enhances the recognition ability of geometric elements but also strengthens the reasoning ability regarding geometric relationships.
[0079] In some embodiments, step S2032 can be implemented by steps S11 to S14: S11, based on the classification network, classify and detect the fused features to obtain the non-geometric class probability of each non-geometric pixel in the non-geometric elements; the classification network is at least trained using the class imbalance loss function.
[0080] A classification network is a deep learning model used to classify and predict input features. A classification network can consist of multiple convolutional layers and fully connected layers, and can extract high-dimensional features and output class probability distributions.
[0081] In this embodiment, the classification network is used to classify each pixel in the fused feature map to obtain the non-geometric category probability of each non-geometric pixel, which is used to determine which non-geometric element (such as text, arrow, symbol, etc.) each non-geometric pixel belongs to.
[0082] The class imbalance loss function is a loss function designed to address the problem of imbalanced sample classes in a dataset. In the embodiments of this application, since some non-geometric elements (such as specific types of symbols or rare text) occur with low frequency, using a class imbalance loss function (such as Focal Loss) can improve the detection capability for small sample classes among non-geometric elements. By assigning different weights to easy and difficult samples, the model focuses on small sample classes, which improves the classification accuracy of non-geometric elements, particularly when the sample class distribution is uneven.
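[0083] In some embodiments, the class imbalance loss function is as shown in formula (5):
$$L_{cls} = -\frac{1}{N_{pos}}\sum_{i\in P}\sum_{c\in C}\left[\alpha\, y_{i,c}\left(1-p_{i,c}\right)^{\gamma}\log p_{i,c} + \left(1-\alpha\right)\left(1-y_{i,c}\right)p_{i,c}^{\gamma}\log\left(1-p_{i,c}\right)\right] \tag{5}$$
where $N_{pos}$ is the number of positive samples; $P$ is the set of positive samples; $C$ is the set of categories, such as point character and length; $p_{i,c}$ is the predicted class probability; $y_{i,c}$ is the real label; and $\alpha$ and $\gamma$ are the category weight coefficient and the focus parameter, respectively.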
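The effect of the focus parameter can be illustrated with the widely used binary form of Focal Loss. This is a sketch only; it is not necessarily the exact form of formula (5), and the default values `alpha=0.25` and `gamma=2.0` are common choices assumed here rather than values given in the application.

```python
import math


def focal_loss(probs, labels, alpha=0.25, gamma=2.0):
    """Binary focal loss summed over samples and normalized by the positive count."""
    eps = 1e-7  # clamp probabilities to avoid log(0)
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)
        if y == 1:
            # Well-classified positives (p near 1) are strongly down-weighted.
            total += -alpha * (1 - p) ** gamma * math.log(p)
        else:
            # Well-classified negatives (p near 0) are strongly down-weighted.
            total += -(1 - alpha) * p ** gamma * math.log(1 - p)
    n_pos = max(1, sum(labels))
    return total / n_pos
```

With `gamma=0` and `alpha=1` the positive term reduces to ordinary cross-entropy; increasing `gamma` shrinks the contribution of easy samples so that rare, hard classes dominate the gradient.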
[0084] S12, based on a regression network, performs position detection on non-geometric elements in non-geometric features to obtain the position of non-geometric elements.
[0085] Regression networks can predict the precise location of each non-geometric element based on information from non-geometric feature maps. The result of location detection can be a four-dimensional vector, representing the distance from the current pixel to the left, top, right, and bottom boundaries. The information contained in the location detection result can be used for subsequent bounding box generation and object instance segmentation, achieving precise localization of non-geometric elements.
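The four-dimensional location result can be turned into a bounding box as sketched below. The function name `decode_box` and the (x_min, y_min, x_max, y_max) output layout are assumptions made for illustration.

```python
def decode_box(x, y, distances):
    """Convert per-pixel (left, top, right, bottom) distances into box corners.

    Image coordinates are assumed: x grows to the right and y grows downward.
    """
    left, top, right, bottom = distances
    return (x - left, y - top, x + right, y + bottom)
```

For instance, a pixel at (10, 20) predicting distances (3, 4, 5, 6) yields the box (7, 16, 15, 26).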
[0086] By introducing a regression network into the target detection system to perform the position detection task, the spatial positioning accuracy for non-geometric elements can be significantly improved. Because boundary distances are regressed directly at each pixel, the hyperparameter sensitivity caused by anchor box settings in traditional methods can be avoided, and the adaptability to target objects with complex structures can also be improved.
[0087] S13, Based on the centrality network, pixel centrality detection is performed on the position of each pixel in the non-geometric elements of the non-geometric features to obtain the centrality confidence of each non-geometric pixel in its respective non-geometric object.
[0088] Centrality networks can be used to evaluate whether each pixel in an image is located in the center region of a non-geometric object. Each pixel outputs a value between 0 and 1, representing the probability that each pixel is the center of an object. The higher the confidence level, the more likely a pixel is to be part of the object's center.
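One common way to define a target value for such a centrality score, used in anchor-free detectors such as FCOS, is computed from the distances of a pixel to the four box boundaries. The application does not state which definition it uses, so the following is an assumed illustration.

```python
import math


def centerness(left, top, right, bottom):
    """FCOS-style centerness in [0, 1]; 1.0 means the pixel is exactly at the box center."""
    if min(left, top, right, bottom) <= 0:
        # Pixel lies on or outside the box boundary.
        return 0.0
    horizontal = min(left, right) / max(left, right)
    vertical = min(top, bottom) / max(top, bottom)
    return math.sqrt(horizontal * vertical)
```

A pixel at the exact center scores 1.0, and the score decays toward 0 as the pixel approaches any boundary, which is what allows off-center, low-quality predictions to be suppressed.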
[0089] S14, based on the character recognition network, performs content recognition on non-geometric elements in non-geometric features to obtain the text content of non-geometric elements.
[0090] Here, the character recognition network can be a deep learning model built on OCR technology that recognizes text in images and extracts the text content of non-geometric elements, such as numbers, letters, and symbols, from non-geometric feature maps.
[0091] Content recognition of non-geometric elements is crucial for geometric graph analysis because much geometric information is presented in text form, such as length annotations, angle values, and label names. Only by accurately recognizing the text content can geometric relationships and attributes be further deduced.
[0092] This application embodiment optimizes four sub-tasks—classification, regression, centrality, and character recognition—to accurately extract information about non-geometric elements from multiple dimensions, including category, location, centrality, and text content. This achieves efficient and accurate identification of non-geometric elements, providing a reliable data foundation for subsequent geometric relationship reasoning and attribute perception, thereby significantly improving the performance and robustness of the entire geometric graph parsing.
[0093] In some embodiments, step S204 can be implemented by steps S2041 to S2043: S2041, based on the principle of spatial proximity, the position of geometric elements in the first attribute information, and the position of non-geometric elements in the second attribute information, the geometric elements and non-geometric elements are matched to obtain the matching relationship between the geometric elements and non-geometric elements.
[0094] The principle of spatial proximity refers to the fact that, in a geometric graph, symbolic text or labels are usually located close to their corresponding geometric primitives (such as points, lines, and circles). Therefore, during parsing, the Euclidean distance between geometric and non-geometric elements can be calculated to initially screen potential candidate pairs and obtain the matching relationship between them. Proximity-based matching methods can effectively reduce the possibility of false matches and provide a foundation for subsequent semantic verification. For example, in a geometric graph, the angle label 120° is more likely to appear near the vertex of the angle it annotates than far away from it.
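The proximity screening described above can be sketched as a nearest-neighbor assignment with a distance cutoff. The function `match_by_proximity`, the use of a single center point per element, and the `max_distance` cutoff are simplifying assumptions for illustration.

```python
import math


def match_by_proximity(labels, primitives, max_distance):
    """Match each label to its nearest primitive if it is within max_distance.

    Both arguments map an element name to the (x, y) center of that element.
    """
    matches = {}
    for label_name, label_pos in labels.items():
        nearest = min(
            primitives,
            key=lambda name: math.dist(label_pos, primitives[name]),
            default=None,
        )
        if nearest is not None and math.dist(label_pos, primitives[nearest]) <= max_distance:
            matches[label_name] = nearest
    return matches
```

Labels that are not close to any primitive remain unmatched, so they can be handled by a later semantic verification step instead of being forced into a wrong association.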
[0095] By establishing matching relationships based on the principle of spatial proximity, a preliminary candidate set of geometric element-non-geometric element associations can be quickly established, which can improve matching efficiency, avoid blindly traversing all possible combinations, reduce redundant calculations, and improve the overall reasoning speed.
[0096] S2042, Based on the position of the geometric element in the first attribute information and the geometric category probability and the geometric instance to which each pixel in the geometric element belongs, determine the geometric relationship of the geometric element.
[0097] The geometric relationships between geometric elements can refer to the mathematical or topological connections between geometric elements in a geometric diagram, such as points on lines, line segments intersecting, line segments parallel, line segments perpendicular, collinear, tangent, etc.
[0098] This application's embodiments can infer the specific geometric relationships between geometric elements by combining the spatial location information of geometric elements, the geometric category probability of each pixel, and instance attribution information. For example, if two straight lines have the same slope, they can be determined to be parallel; if the angle between them is 90 degrees, they can be determined to be perpendicular.
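The parallel and perpendicular checks can be sketched with direction vectors, which avoids the undefined slope of vertical lines. The function name `classify_pair` and the angular tolerance `tol_deg` are assumptions; the application does not specify a tolerance.

```python
import math


def classify_pair(seg_a, seg_b, tol_deg=2.0):
    """Classify two segments, each given as ((x1, y1), (x2, y2)) with distinct endpoints."""
    def direction(seg):
        (x1, y1), (x2, y2) = seg
        return (x2 - x1, y2 - y1)

    ax, ay = direction(seg_a)
    bx, by = direction(seg_b)
    cos_t = (ax * bx + ay * by) / (math.hypot(ax, ay) * math.hypot(bx, by))
    # abs() folds the angle into [0, 90] degrees so orientation does not matter.
    angle = math.degrees(math.acos(max(-1.0, min(1.0, abs(cos_t)))))
    if angle <= tol_deg:
        return "parallel"
    if angle >= 90 - tol_deg:
        return "perpendicular"
    return "other"
```

A small tolerance is useful in practice because positions regressed from an image are rarely exact, so two segments drawn as parallel may differ by a fraction of a degree.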
[0099] S2043, based on matching and geometric relationships, determine geometric information and association relationships.
[0100] Here, based on the established matching relationships between geometric and non-geometric elements, as well as the geometric relationships between geometric elements, a complete geometric graph structure representation can be constructed, including geometric graphic information and related relationships.
[0101] Geometric information can include the connection methods between elements, topological structure, and attribute descriptions. For example, a triangle consists of three sides, and there is a specific angular relationship between these three sides, as well as the triangle's specific perimeter and area.
[0102] The relationships between geometric and non-geometric elements refer to the logical or spatial connections between different types of elements in an image. For example, a line segment might be labeled with its length, and an angle symbol might indicate the angle between two sides. These relationships form the semantic basis of an image, helping to understand its overall structure and function.
[0103] This application's embodiments match geometric elements with non-geometric elements using the principle of spatial proximity, establishing semantic connections between them. Furthermore, it utilizes the category probability and instance information of geometric elements to further determine geometric relationships, thereby constructing a complete geometric structure and ultimately forming the semantic information of the image. This significantly improves the accuracy of the association between symbols and geometric elements in the geometric diagram, thus mitigating the semantic loss problem caused by cross-modal feature fragmentation in traditional methods, and providing efficient and accurate image parsing capabilities.
[0104] In some embodiments, step S2042 can also be implemented via steps S21 to S22: S21, based on the geometric instance to which they belong, determine pixels belonging to the same instance as the same geometric primitive; the geometric primitive includes at least points, edges and circles.
[0105] A geometric instance can refer to the geometric instance to which a pixel belongs; for example, multiple pixels may belong to the same line. Geometric primitives are the basic units that make up complex geometric figures and can include basic graphic elements such as points, edges, and circles.
[0106] Here, pixels belonging to the same instance can be identified as the same geometric primitive, and the topological relationship of the entire geometric graph can be gradually constructed.
[0107] S22, based on the geometric category probability and the position of the geometric elements, detects the spatial relationship between different geometric primitives to obtain the geometric relationship of the geometric elements.
[0108] In this embodiment, the geometric category probability can characterize the probability value of each pixel belonging to a certain geometric primitive (such as a point, line, or circle), and is used to represent the degree of matching between the pixel and different geometric primitives. The positional information of the geometric elements can refer to the specific coordinates or bounding boxes of each geometric primitive in the image.
[0109] By combining the geometric category probabilities with the actual position information of pixels, this application embodiment can accurately determine the geometric relationships between geometric elements, that is, the relative relationships between different geometric primitives, such as perpendicular, parallel, and intersecting.
[0110] This application combines category probability and location information to detect spatial relationships between different primitives, thereby constructing the topological structure of geometric elements. Compared with traditional methods that rely on manually defined rules to define geometric relationships, this method has stronger adaptability and generalization ability, and improves the accuracy of geometric relationship reasoning.
[0111] In some embodiments, step S2043 can also be implemented via steps S31 to S33: S31, the matching relationship and geometric relationship are fused to obtain the geometric information of the image to be processed; the geometric information includes at least the geometric figure and the corresponding text symbol.
[0112] Matching relationships can refer to the logical or semantic correspondence between non-geometric elements (such as perpendicular symbols, parallel symbols, and angles) and geometric elements (such as points, lines, and circles). Geometric relationships can refer to the mathematical or topological connections between geometric primitives, such as points on lines, line segments intersecting, line segments parallel, line segments perpendicular, collinear, and tangent.
[0113] The fusion here refers to jointly reasoning with matching relationships and geometric relationships to extract more accurate and comprehensive geometric information. Geometric information refers to the structured data obtained after fusion, containing geometric figures and their associated textual symbols. For example, a triangle might contain three sides, three vertices, and the degree measure of each angle. Textual symbols can be non-geometric elements used to express geometric attributes, such as angle values (e.g., 60°), lengths (e.g., 5cm), and areas (e.g., 12m²). These textual symbols can be attached to specific geometric primitives and provide numerical information to those primitives.
[0114] There is a synergistic effect between matching relationships and geometric relationships. For example, after identifying an angle symbol, the system analyzes whether the geometric primitives (such as rays and line segments) adjacent to the angle symbol can form a valid angle structure, and verifies whether the vertex of the angle corresponding to the angle symbol is located in the correct geometric position, thereby obtaining geometric information.
[0115] S32 performs geometric constraint verification on the geometric information and obtains the verification results.
[0116] Geometric constraint verification can refer to checking the logical consistency of the parsed geometric information based on the axiomatic system of Euclidean geometry. For example, if two line segments are marked as perpendicular, their slopes must be negative reciprocals of each other (or one segment must be vertical and the other horizontal); if an angle is marked as a right angle, then this angle must be formed by two mutually perpendicular line segments.
[0117] The verification result can be a judgment on whether the geometric information conforms to geometric rules, such as verification passed or failed. If the verification passes, it indicates that the current data parsing result has a high degree of reliability; if the verification fails, it indicates that there may be errors in the parsing process, and further correction or optimization is required.
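A single constraint check of this kind can be sketched as follows: an angle marked as a right angle passes verification only if the two sides meeting at the vertex are perpendicular within a tolerance. The function name `verify_right_angle` and the cosine tolerance `tol` are assumptions for illustration.

```python
def verify_right_angle(vertex, p, q, tol=0.05):
    """Return True if the angle p-vertex-q is a right angle within tolerance.

    The check uses the cosine of the angle, which is 0 for perpendicular sides.
    """
    ux, uy = p[0] - vertex[0], p[1] - vertex[1]
    vx, vy = q[0] - vertex[0], q[1] - vertex[1]
    norm = (ux * ux + uy * uy) ** 0.5 * (vx * vx + vy * vy) ** 0.5
    if norm == 0:
        # A degenerate side (zero length) cannot form a valid angle.
        return False
    return abs(ux * vx + uy * vy) / norm <= tol
```

A failed check would indicate that either the perpendicular symbol was matched to the wrong vertex or the positions of the sides were parsed inaccurately, so the result can trigger the correction step mentioned above.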
[0118] The geometric constraint verification provided in this application helps to eliminate unreasonable geometric configurations and improve the robustness and accuracy of the analytical results.
[0119] S33, in response to the verification result indicating that the verification has passed, the association relationship is determined based on the geometric information.
[0120] Here, when the verification passes, the current geometric information conforms to geometric rules, and the association relationship between the elements is determined based on the current geometric information.
[0121] Relationships can refer to the connections or constraints between elements (such as points, lines, and surfaces) in a geometric figure. For example, a line segment may connect two points, an angle may be formed by two rays, and a triangle may be composed of three sides.
[0122] This application embodiment integrates matching relationships with geometric relationships to extract more accurate and complete geometric information. Furthermore, a geometric constraint verification mechanism ensures that the output results conform to geometric axioms, thereby improving the reliability of semantic information, effectively reducing erroneous associations, enhancing the accuracy of image semantic parsing, and significantly improving the precision and reliability of geometric parsing.
[0123] In some embodiments, the image processing method may further include step S41: S41, perform multi-scale feature extraction on the geometric graph to obtain a multi-scale feature map.
[0124] In the embodiments of this application, features can be extracted from geometric graphs at multiple scales using deep neural networks (such as convolutional neural networks), thereby capturing geometric structural information at different scales. For example, edges and corners can be detected in low-level features, while more complex shapes or semantic information can be identified in high-level features.
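The idea of representing one image at several scales can be illustrated with a plain average-pooling pyramid. This is only a stand-in for the learned feature maps of a convolutional network; the functions `downsample` and `build_pyramid` are hypothetical and contain no learned parameters.

```python
def downsample(image):
    """Halve the resolution by 2x2 average pooling (image: list of equal-length rows)."""
    h = len(image) // 2 * 2
    w = len(image[0]) // 2 * 2
    return [
        [
            (image[r][c] + image[r][c + 1] + image[r + 1][c] + image[r + 1][c + 1]) / 4
            for c in range(0, w, 2)
        ]
        for r in range(0, h, 2)
    ]


def build_pyramid(image, levels=3):
    """Return [full resolution, 1/2 resolution, 1/4 resolution, ...]."""
    pyramid = [image]
    for _ in range(levels - 1):
        if len(pyramid[-1]) < 2 or len(pyramid[-1][0]) < 2:
            break  # cannot pool a map smaller than 2x2
        pyramid.append(downsample(pyramid[-1]))
    return pyramid
```

Fine levels retain thin strokes and small labels, while coarse levels summarize larger shapes, which mirrors the low-level and high-level features described above.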
[0125] By using multi-scale feature extraction methods, we can comprehensively acquire local details and global structural information in geometric graphs, thereby improving the model's ability to perceive geometric elements at different scales and providing a high-quality multi-scale feature map foundation for subsequent extraction of geometric and non-geometric features.
[0126] Correspondingly, step S201 can be achieved through step S2011: S2011, geometric and non-geometric features are extracted from the multi-scale feature map respectively to obtain geometric and non-geometric features.
[0127] In some embodiments, by performing two-way branching processing on the multi-scale feature maps respectively, feature extraction of geometric and non-geometric elements can be achieved, which can improve the perception of multiple elements in the geometric map and thus enable end-to-end structured understanding of the geometric map.
[0128] The embodiments of this application enhance the model's perception capability of targets of different sizes through multi-scale feature extraction, making subsequent geometric and non-geometric feature extraction more accurate and enabling more comprehensive capture of key information in the image, which is suitable for analytical tasks of complex geometric figures.
[0129] The following will describe an exemplary application of the embodiments of this application in a real-world application scenario.
[0130] To address the problems existing in related technologies, this application provides a multi-level microscopic and macroscopic collaborative hierarchical graph neural network parsing method for achieving end-to-end structured understanding of geometric graphs, including a geometric and non-geometric feature interaction mechanism and a multi-level graph network high-level semantic regression mechanism.
[0131] The geometric and non-geometric feature interaction mechanism is based on the prior geometric correlation between text, symbols, and geometric shapes, and designs a feature interaction module. On the one hand, it promotes cross-modal feature fusion, thereby enhancing the recognition ability in the relationship prediction stage; on the other hand, in the end-to-end learning process, it improves the discriminative power of different element features through implicit geometric prior knowledge.
[0132] The high-level semantic regression mechanism of multi-layer graph networks employs an edge graph attention network to perform fine-grained relationship classification on each pair of geometric elements based on the feature vectors of geometric elements, predicting their geometric relationship attributes (such as perpendicular, parallel, collinear, tangent, etc.) and outputting edge labels with confidence scores. Based on the effective edges output by the micro-layer, candidate subgraphs (such as triangles composed of three edges) are clustered and input into a graph isomorphism network. High-level semantic regression is performed on each subgraph to predict its shape category and comprehensive attributes (such as perimeter, area, etc.), i.e., geometric information, thereby improving overall robustness and analytical accuracy.
[0133] This application provides a planar geometry graph analysis method based on feature fusion. Figure 3 is a model framework diagram of the planar geometry analysis method provided in the embodiments of this application. As shown in Figure 3, this method uses a graph neural network as its core, focusing on the parsing and understanding of symbolic text, geometric elements, their geometric relationships, and the attributes contained in geometric diagrams. Based on the inherent hierarchical structure of geometric figures, the geometric diagram is encoded into a graph structure. Joint reasoning is then performed among geometric elements, symbolic text, and geometric attribute structures to achieve attribute-aware fusion of geometric features, thereby better capturing geometric relationships and latent geometric attribute information. The specific implementation includes a feature extraction module 301, a geometric primitive branch module 302, a non-geometric primitive branch module 303, an inter-semantic interaction module 304, an optical character recognition module 305, and a joint reasoning module 306.
[0134] First, the non-geometric primitive branch module 303 performs symbolic text detection. To avoid the hyperparameter sensitivity problem caused by anchor box setting in traditional object detection methods, this application proposes an anchor-free pixel-by-pixel prediction mechanism, which achieves efficient localization of symbolic text by directly regressing the target bounding box. This anchor-free detection strategy eliminates the anchor box matching process, significantly reduces model complexity, and improves generalization ability, and can flexibly adapt to various types of symbols and text structures in geometric graphs.
[0135] In its specific implementation, the feature extraction module 301 of the detection framework uses a feature pyramid network as the core feature extraction backbone and utilizes the design of a multi-level feature layer sharing a detection head to achieve joint learning of multi-scale features, that is, features of all scales are given to the same set of detection heads.
[0136] The detection head consists of at least three parts: a classification branch, a regression branch, and a centrality branch. The classification branch (i.e., the classification network) predicts the probability distribution over non-geometric element categories for each pixel. It can output an H×W×C probability map, where C is the number of categories, such as text, arrows, and symbols. More specifically, the categories may be text (with, e.g., six subcategories: point name, line name, length, angle, subscript, and area), perpendicular symbols, parallel symbols, equal-length symbols, equal-angle symbols, line segment range markers, and arrow markers. The regression branch (i.e., the regression network) predicts the precise bounding box coordinates of the target. For example, it outputs an H×W×4 coordinate feature map in which each pixel predicts a 4-dimensional vector (l, t, r, b), whose four channels represent the distances from that pixel to the left, top, right, and bottom edges of the bounding box, respectively. The centrality branch (i.e., the centrality network) estimates the confidence that a pixel lies in the central region of the target, in order to suppress low-quality predicted boxes and improve localization accuracy. It can output an H×W×1 confidence map with values in [0, 1].
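As an illustrative sketch only, the way a per-pixel regression vector (l, t, r, b) maps to a bounding box, and the way a centrality value can down-weight off-center predictions, may be written in Python as follows. The function names are hypothetical, and the centrality formula shown is the one commonly used in anchor-free detectors; this application does not state which formula it uses, so it is an assumption here.

```python
import math

def decode_box(x, y, ltrb):
    """Convert a pixel location and its (l, t, r, b) distances into (x1, y1, x2, y2)."""
    l, t, r, b = ltrb
    return (x - l, y - t, x + r, y + b)

def centrality(ltrb):
    """Assumed centrality formula: 1.0 at the box center, falling toward 0 at the edges."""
    l, t, r, b = ltrb
    if min(l, t, r, b) <= 0:
        return 0.0  # the pixel lies on or outside the box
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

def box_score(class_prob, ltrb):
    """Down-weight the class probability of pixels far from the target center."""
    return class_prob * centrality(ltrb)
```

For example, `decode_box(10, 20, (4, 5, 6, 7))` places the box corners 4 and 5 pixels up-left and 6 and 7 pixels down-right of the pixel at (10, 20).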
[0137] The classification branch performs classification learning for seven common non-geometric elements in the geometric graph. Focal Loss (i.e., class imbalance loss function) can be used as the loss function, as shown in formula (6), to alleviate the problem of class imbalance in the samples and ensure the robustness and accuracy of the detection results.
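[0138] $$L_{cls} = -\frac{1}{N_{pos}} \sum_{i \in S_{pos}} \sum_{c \in \mathcal{C}} \alpha \, y_{i,c} \left(1 - p_{i,c}\right)^{\gamma} \log p_{i,c} \tag{6}$$
where $N_{pos}$ is the number of positive samples; $S_{pos}$ is the set of positive samples; $\mathcal{C}$ is the set of categories, such as point name and length; $p_{i,c}$ is the predicted class probability; $y_{i,c}$ is the ground-truth label; and $\alpha$ and $\gamma$ are the category weight coefficient and the focusing parameter, respectively.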
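The following Python sketch illustrates a focal loss of the kind referred to as formula (6). It assumes the standard focal-loss form with one-hot labels, summed over positive samples and normalized by their count. The function name and the default values alpha = 0.25 and gamma = 2.0 are assumptions taken from common practice, not values given in this application.

```python
import math

def focal_loss(probs, labels, num_pos, alpha=0.25, gamma=2.0, eps=1e-12):
    """Focal loss over positive samples.

    probs:  per-sample lists of predicted class probabilities
    labels: per-sample one-hot lists of ground-truth labels
    num_pos: number of positive samples used for normalization
    """
    total = 0.0
    for p_row, y_row in zip(probs, labels):
        for p, y in zip(p_row, y_row):
            if y == 1:
                # (1 - p)^gamma shrinks the loss of well-classified samples
                total += -alpha * (1.0 - p) ** gamma * math.log(max(p, eps))
    return total / max(num_pos, 1)
```

With gamma set to 0 and alpha set to 1 the expression reduces to ordinary cross-entropy, which makes the effect of the focusing parameter easy to check.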
[0139] Through this multi-branch joint detection mechanism, this application achieves cross-scale detection and accurate semantic recognition of symbols and text elements in geometric figures, providing a reliable feature foundation for subsequent geometric semantic analysis.
[0140] Secondly, geometric element extraction is performed. The geometric primitive branch module 302 can adopt a dual-branch collaborative architecture. The two branches have the same structure and, for simplicity, are drawn merged together in Figure 3, but their parameters are not shared. One branch outputs a semantic embedding (Semantic Emb), while the other branch has two additional convolutional layers and outputs a segmentation map and centrality. This enables fine-grained parsing of geometric primitives. The semantic segmentation branch (i.e., the semantic segmentation network) performs pixel-level category determination of geometric primitives, outputting an H×W×C probability map, where C is the number of geometric categories. It is combined with a pixel embedding branch (i.e., the instance segmentation network, also called the semantic embedding branch), which outputs an H×W×D embedding map and is used for instance clustering. The embedding is learned so that pixels belonging to the same instance lie close together in the embedding space while pixels of different instances lie far apart; in other words, pixels that belong to the same geometric element in the image are also compactly clustered in the embedding space. The two branches work together to complete pixel-level instance segmentation of multi-category geometric primitives, which effectively solves the problem of distinguishing elongated geometric elements.
[0141] In some embodiments, the Center-ness branch outputs an H×W×1 probability map, representing the center offset, i.e., the offset of each location from the true center.
[0142] The geometric primitive branch module 302 employs a fully convolutional neural network structure to learn pixel-to-pixel mappings. It progressively extracts deep geometric features rich in semantic information through multi-level downsampling operations, then recovers spatial detail information via a symmetrical upsampling process, ultimately outputting a probability distribution map with the same resolution as the original input. Each spatial location in this probability map (corresponding to a pixel in the original image) contains the prediction confidence of the corresponding geometric category, providing reliable prior information on the category for subsequent processing.
[0143] In this embodiment of the application, the multi-scale features obtained by the feature extraction module 301 can be fused from small to large: a 1×1 convolution is applied to the features of the L-th layer; the features of the (L+1)-th layer are then upsampled by bilinear interpolation and added to the result; finally, the summed features pass through a convolutional layer to obtain the final features of the L-th layer.
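A minimal pure-Python sketch of this fusion step is given below for illustration. Feature maps are nested lists indexed as [channel][row][column]; the function names are hypothetical, the bilinear resize uses corner-aligned sampling as an assumption, and the final convolution that follows the addition is omitted for brevity.

```python
def conv1x1(feat, weight):
    """1x1 convolution: mixes channels at each pixel. weight is [C_out][C_in]."""
    h, w = len(feat[0]), len(feat[0][0])
    return [[[sum(wk * feat[k][i][j] for k, wk in enumerate(w_row))
              for j in range(w)] for i in range(h)] for w_row in weight]

def upsample_bilinear(ch, out_h, out_w):
    """Bilinearly resize one 2-D channel (corner-aligned sampling, an assumption)."""
    in_h, in_w = len(ch), len(ch[0])
    out = []
    for i in range(out_h):
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y); y1 = min(y0 + 1, in_h - 1); fy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x); x1 = min(x0 + 1, in_w - 1); fx = x - x0
            top = ch[y0][x0] * (1 - fx) + ch[y0][x1] * fx
            bot = ch[y1][x0] * (1 - fx) + ch[y1][x1] * fx
            row.append(top * (1 - fy) + bot * fy)
        out.append(row)
    return out

def fuse_top_down(feat_l, feat_l_plus_1, lateral_weight):
    """Lateral 1x1 conv on layer L, then add the upsampled layer L+1.

    The channel count produced by lateral_weight must equal that of feat_l_plus_1.
    """
    lateral = conv1x1(feat_l, lateral_weight)
    h, w = len(lateral[0]), len(lateral[0][0])
    up = [upsample_bilinear(ch, h, w) for ch in feat_l_plus_1]
    return [[[lateral[c][i][j] + up[c][i][j] for j in range(w)]
             for i in range(h)] for c in range(len(lateral))]
```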
[0144] To address the class imbalance problem commonly encountered in geometric element segmentation tasks, this branch is optimized with a weighted binary cross-entropy loss function, as shown in formula (7). By dynamically adjusting the loss weights of pixels of different geometric primitive categories, the model bias caused by uneven data distribution can be effectively alleviated.
[0145] $$L_{*} = -\frac{1}{N} \sum_{i=1}^{N} \left[ w_{*}\, y_{i}^{*} \log p_{i}^{*} + \left(1 - y_{i}^{*}\right) \log\left(1 - p_{i}^{*}\right) \right] \tag{7}$$
where $*$ denotes the primitive category being extracted (point, line, or circle); $w_{*}$ balances the weights of positive and negative pixels, with its initial value set empirically for each category; $N$ is the number of pixels in the segmentation map; and $y_{i}^{*}$ and $p_{i}^{*}$ are the ground-truth label and the predicted probability of pixel $i$ for that category. The semantic segmentation loss for geometric elements is the sum of the segmentation losses for points, lines, and circles, so the total segmentation loss $L_{seg}$ is expressed as formula (8):
$$L_{seg} = L_{point} + L_{line} + L_{circle} \tag{8}$$
The semantic embedding (SEM) branch constructs a discriminative feature space through a metric learning mechanism. Its aim is to map each pixel to a point in the feature space such that pixels belonging to the same geometric primitive instance are close together and form compact clusters, while pixel clusters from different instances remain sufficiently separated. This part is optimized end to end with a distance-based discriminative loss function, comprising an intra-cluster variance loss $L_{var}$ and an inter-cluster distance loss $L_{dist}$, so as to better distinguish geometric primitive instances, as shown in formulas (9) and (10):
$$L_{var} = \frac{1}{N_{I}} \sum_{c=1}^{N_{I}} \frac{1}{N_{c}} \sum_{i=1}^{N_{c}} \left[ \left\| \mu_{c} - x_{i}^{c} \right\| - \delta_{v} \right]_{+}^{2} \tag{9}$$
$$L_{dist} = \frac{1}{N_{I}\left(N_{I}-1\right)} \sum_{c_{A}=1}^{N_{I}} \sum_{c_{B} \neq c_{A}} \left[ \delta_{d} - \left\| \mu_{c_{A}} - \mu_{c_{B}} \right\| \right]_{+}^{2} \tag{10}$$
where $[z]_{+} = \max(0, z)$; $N_{I}$ is the total number of line and circle instances, i.e., $N_{I} = N_{line} + N_{circle}$, where $N_{line}$ and $N_{circle}$ are the number of line instances and the number of circle instances, respectively; $N_{c}$ is the number of pixels in the $c$-th geometric primitive instance; $x_{i}^{c}$ is the embedding vector of the $i$-th pixel in the $c$-th instance; $\mu_{c_{A}}$ and $\mu_{c_{B}}$ are the cluster centers of the $c_{A}$-th and $c_{B}$-th instances, that is, the mean of all pixel embeddings of the instance; $\delta_{v}$ is the embedding radius threshold, with a default value of 0.5, which controls the maximum allowable distance between the pixels of a cluster and its center; and $\delta_{d}$ is the minimum distance threshold between cluster centers, with a default value of 1.5, which ensures that the cluster centers of different instances are sufficiently far apart.
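For illustration, the two terms of the distance-based discriminative loss can be sketched in Python as below. This is a sketch under stated assumptions rather than the implementation of this application: the function names are hypothetical, each instance is passed as a plain list of pixel embedding vectors, and the inter-cluster hinge is taken directly at the minimum center distance (some published variants hinge at twice that value). The default thresholds 0.5 and 1.5 are the ones stated in the text.

```python
import math

def _mean(vectors):
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def _dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def variance_loss(instances, delta_v=0.5):
    """Pull pixels toward their own cluster center once they are farther than delta_v."""
    total = 0.0
    for pixels in instances:
        mu = _mean(pixels)
        total += sum(max(0.0, _dist(mu, x) - delta_v) ** 2 for x in pixels) / len(pixels)
    return total / len(instances)

def distance_loss(instances, delta_d=1.5):
    """Push cluster centers apart while they are closer than delta_d."""
    centers = [_mean(p) for p in instances]
    n = len(centers)
    if n < 2:
        return 0.0  # a single instance has nothing to be separated from
    total = 0.0
    for a in range(n):
        for b in range(n):
            if a != b:
                total += max(0.0, delta_d - _dist(centers[a], centers[b])) ** 2
    return total / (n * (n - 1))
```

Both terms are zero when every cluster is tight and all centers are well separated, so gradient updates only act on pixels or instances that violate a threshold.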
[0146] The intra-cluster variance loss $L_{var}$ constrains the Euclidean distance between pixel embedding vectors within the same geometric primitive instance to be small. When the distance between a pixel vector $x_{i}^{c}$ and the mean vector $\mu_{c}$ of the corresponding geometric primitive instance is greater than the embedding radius $\delta_{v}$, the model is updated to bring the pixel vector closer to the cluster center, ensuring the consistency of features within the instance. The inter-cluster distance loss $L_{dist}$ controls the spacing between the cluster centers of different instances. If the distance between the cluster centers of two geometric primitive instances is less than the preset threshold $\delta_{d}$, the model is updated to move the two centers further apart, enhancing the distinguishability between instances. This dual-constraint optimization strategy effectively addresses the common problem of overlapping geometric instances in planar geometric graphs. By jointly optimizing both terms, it balances intra-cluster compactness and inter-cluster separation to achieve accurate pixel-to-instance clustering. Finally, the total loss of the geometric element extraction module is the sum of the semantic segmentation branch loss and the pixel embedding branch loss, expressed as formula (11):
$$L_{geo} = L_{seg} + L_{emb} \tag{11}$$
where $L_{seg}$ is the semantic segmentation branch loss and $L_{emb}$ is the pixel embedding branch loss, which comprises the intra-cluster variance loss and the inter-cluster distance loss.
Finally, there is the joint reasoning module 306, whose core function is to construct a graph network based on geometric elements, the relationships between elements, and latent attributes in order to achieve multi-dimensional joint reasoning. For example, it accurately matches symbolic text (such as ⊥, ∥) in the geometric graph to the corresponding geometric elements (such as lines and angles). Building on the initial extraction of geometric elements and symbolic text, this method achieves deep feature fusion and constraint optimization through the internal semantic interaction submodule.
[0147] The internal semantic interaction submodule performs joint modeling based on the principle of spatial proximity and geometric relation semantic constraints. It achieves spatial pairing of symbolic text and geometric primitives through a Euclidean distance function and introduces geometric logic constraint functions to verify that each match is reasonable. For example, perpendicular symbols are only applicable to pairs of orthogonal lines, and parallel symbols must satisfy slope consistency constraints. By jointly optimizing an objective function that minimizes the distance error and semantic conflict penalty terms, accurate matching of symbolic text and geometric primitives is achieved, significantly improving the robustness and accuracy of geometric relation parsing. The joint optimization can compute the final matching score by combining spatial distance and semantic confidence, where score = 1 - (distance penalty + conflict penalty): the smaller the distance and the better the semantic match, the higher the score.
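The scoring rule score = 1 - (distance penalty + conflict penalty) can be illustrated with the short Python sketch below. How each penalty is scaled is not specified in this application; the normalization by a maximum distance, the fixed conflict weight, and the names `match_score`, `max_dist`, and `conflict_weight` are all assumptions made so that the score stays within [0, 1].

```python
import math

def match_score(symbol_xy, primitive_xy, conflict, max_dist=100.0, conflict_weight=0.5):
    """Score a symbol-primitive pair: 1 - (distance penalty + conflict penalty).

    conflict is True when the pair violates a geometric constraint.
    The distance penalty is capped so that both penalties together never exceed 1.
    """
    d = math.dist(symbol_xy, primitive_xy)
    distance_penalty = min(d / max_dist, 1.0) * (1.0 - conflict_weight)
    conflict_penalty = conflict_weight if conflict else 0.0
    return 1.0 - (distance_penalty + conflict_penalty)
```

A symbol lying exactly on a compatible primitive scores 1.0, and the score drops as the symbol moves away or when the pair is semantically incompatible.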
[0148] In conventional geometric images, symbolic text can be attached to specific geometric primitives. Therefore, the joint reasoning module 306 first performs a proximity-driven "symbol-primitive" assignment based on a spatial distance metric. The spatial distance function dist(si, pj) between a symbol si and a geometric primitive pj is defined as the Euclidean distance. Each symbolic text is initially assigned to the geometric primitive that minimizes this distance, forming candidate association pairs (si, pj). Because relying solely on proximity may lead to mismatches, such as a symbol that is close to a primitive but semantically incompatible with it, a geometric relationship constraint function F(si, pj) (i.e., geometric constraint verification) can be introduced to verify the logical compatibility between the symbol and the primitive. The function F verifies that the actual angle or slope of the primitives meets the constraint conditions. For example, perpendicular symbols are only applicable to pairs of orthogonal lines, and parallel symbols must satisfy conditions such as slope consistency.
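One possible reading of the proximity-driven assignment followed by constraint verification is sketched below in Python. The data layout (each candidate is an anchor position together with a pair of line segments), the angular tolerance of 5 degrees, and the function names are hypothetical choices for illustration only.

```python
import math

def line_angle(line):
    (x1, y1), (x2, y2) = line
    return math.atan2(y2 - y1, x2 - x1)

def angle_between(line_a, line_b):
    """Smallest angle between two undirected lines, in degrees (0 to 90)."""
    diff = abs(math.degrees(line_angle(line_a) - line_angle(line_b))) % 180.0
    return min(diff, 180.0 - diff)

def satisfies_constraint(symbol_kind, line_a, line_b, tol_deg=5.0):
    """Constraint check F: perpendicular marks need about 90 degrees, parallel marks about 0."""
    ang = angle_between(line_a, line_b)
    if symbol_kind == "perpendicular":
        return abs(ang - 90.0) <= tol_deg
    if symbol_kind == "parallel":
        return ang <= tol_deg
    return True  # other symbol kinds are not constrained in this sketch

def assign(symbol_kind, symbol_xy, candidates):
    """Return the nearest candidate (anchor_xy, line_a, line_b) that passes the check."""
    ranked = sorted(candidates, key=lambda c: math.dist(symbol_xy, c[0]))
    for anchor, line_a, line_b in ranked:
        if satisfies_constraint(symbol_kind, line_a, line_b):
            return (anchor, line_a, line_b)
    return None  # no geometrically compatible primitive pair was found
```

In this sketch a perpendicular mark that sits closest to a pair of nearly parallel lines is skipped and attached to the next-nearest pair that is actually orthogonal.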
[0149] To further improve the model's ability to recognize high-order geometric elements (such as triangles and quadrilaterals), this application proposes a structured subgraph representation mechanism. Based on the axioms of Euclidean geometry, this mechanism models the hierarchical structure of complex geometric shapes as multi-node subgraphs. Figure 4 is a schematic diagram of the hierarchical structure of a geometric shape provided in the embodiments of this application. As shown in Figure 4, in the primitive layer 401, nodes represent geometric primitives, such as points, lines, and circles; in the intermediate layer 402, edges represent the relationships and attributes between primitives, such as two points forming a line or two lines forming an angle; and the attribute layer 403 embodies the high-level semantics of geometric elements, such as triangles and quadrilaterals. This method achieves joint encoding of geometric topological relationships and attribute information, enabling geometric prior knowledge to be injected effectively into the network learning process and providing structured support for the recognition of higher-order geometric shapes.
[0150] This application introduces a structured subgraph representation method, effectively utilizing the inherent hierarchical relationships of geometric elements. Based on the Euclidean geometry axiomatic system, at the microscopic level, all complex shapes and their binary relationships are constructed from basic elements such as points, lines, and circles; at the macroscopic level, geometric attributes such as the side length of a triangle and the radius of a circle are further derived from composite shapes. Therefore, the model decomposes geometric shapes into attribute-related subgraph structures for representation. This structure is constructed with geometric elements as nodes and the relationships and attributes between elements as edges. For example, a triangle corresponds to a three-node subgraph.
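As a simple illustration of collecting candidate three-node subgraphs, the Python sketch below enumerates all triples of nodes that are pairwise connected by retained edges. It assumes, purely for illustration, that nodes are named points and that an edge means the two points are joined by a detected segment; the function name `triangle_subgraphs` is hypothetical.

```python
from itertools import combinations

def triangle_subgraphs(edges):
    """Return sorted 3-node tuples whose nodes are pairwise connected.

    edges: iterable of (u, v) pairs for connections kept as valid.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    found = set()
    for a, b, c in combinations(sorted(adj), 3):
        if b in adj[a] and c in adj[a] and c in adj[b]:
            found.add((a, b, c))
    return sorted(found)
```

Each returned triple could then be passed to the subgraph-level network as one candidate shape.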
[0151] In some embodiments, the semantics of an edge can be recovered through prior knowledge; for example, a connection between a triangle node and an area node is known to be an attribute relationship. This application therefore simplifies the various semantic relationships of edges into connection relationships, reducing the multi-class edge classification problem to a binary classification problem.
[0152] During graph construction, the model can uniformly map geometric elements and symbolic text into a heterogeneous set of nodes, and use element relationships and attribute features as edge sets to form a complex multi-relation hypergraph. To reduce complexity, this application simplifies the heterogeneous graph into a homogeneous sparse graph, retaining only the key connections determined by geometric prior knowledge. The initial features of a node can be composed of visual position embeddings, analytical position features, and semantic features. The visual embeddings achieve shape-to-fixed-dimensional vector mapping through mask averaging or bilinear interpolation; the analytical position features are obtained based on the analytical representation of geometric parameters such as points, lines, and circles. The final node features are fused from multiple sources to support subsequent graph structure reasoning.
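The mask-averaging option for the visual embedding, and the combination of the three feature sources, can be sketched as follows. Concatenation is assumed as the fusion operation, since the text only states that the sources are fused, and the function names are hypothetical.

```python
def mask_average(feature_map, mask):
    """Average each channel of feature_map ([C][H][W]) over the pixels where mask ([H][W]) is 1."""
    count = sum(sum(row) for row in mask)
    if count == 0:
        return [0.0] * len(feature_map)  # empty mask: fall back to a zero vector
    return [
        sum(ch[i][j] for i, row in enumerate(mask) for j, m in enumerate(row) if m) / count
        for ch in feature_map
    ]

def node_feature(visual, position, semantic):
    """Fuse the three sources into one node vector (concatenation is an assumption)."""
    return list(visual) + list(position) + list(semantic)
```

Because the mask average always yields one value per channel, elements of any shape or size map to a vector of the same fixed dimension.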
[0153] In the graph reasoning stage, this application may employ a hierarchical graph neural network structure, including a micro-level edge graph attention network and a macro-level graph isomorphism network.
[0154] The micro-layer adaptively learns the spatial constraints and geometric relationships between geometric elements based on the edge graph attention mechanism. It predicts the constraint type between nodes through an edge classification task, thereby achieving fine-grained modeling of topological relationships; the corresponding edge relationship prediction loss is shown in formula (12). Meanwhile, in the text classification task, geometric context features are used for fine-grained text semantic recognition, which effectively improves the ability to distinguish visually similar texts; the corresponding text classification loss is shown in formula (13).
[0155] [Formulas (12) and (13)]
The macroscopic layer can perform subgraph-level feature regression through a graph isomorphism network, integrating local relational features using a message-passing mechanism to infer the overall attributes of geometric shapes. This network employs injective aggregation functions to ensure that topological differences are strictly distinguished, and uses a dual aggregation strategy of max pooling and average pooling to obtain a global graph representation for multi-task prediction of shape categories and attributes.
[0156] The joint reasoning module 306 includes geometric attribute perception (Geo Attribute Perception), which is implemented through a multi-layer graph neural network. The information propagation of the graph neural network is represented by formulas (14) and (15):
[Formulas (14) and (15)]
where MLP denotes a multilayer perceptron network, POOL denotes pooling, and CONCAT denotes feature concatenation. The feature of each node v and the feature vector of the entire graph are the quantities computed by these formulas.
[0157] Formula (14) defines how information propagates along the edges of the graph. After each layer, each node contains information about its neighbors. After multiple layers of such propagation, a node can perceive information from distant nodes in the graph (its receptive field increases).
[0158] Formula (15) transforms a set of node features (unordered and variable in number) into a fixed-length feature vector, which can be used as a fingerprint or embedding representation of the entire graph for subsequent graph-level tasks, such as graph classification and graph property prediction.
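The propagation and readout steps described above can be illustrated with the following pure-Python sketch. It assumes a sum-based neighbor aggregation of the kind used in graph isomorphism networks and a readout that concatenates max pooling with average pooling; the exact formulas of this application may differ. The function names, the `eps` parameter, and the ReLU stand-in for the learned MLP are all hypothetical.

```python
def gin_layer(features, neighbors, mlp, eps=0.0):
    """One propagation step: each node sums its own (scaled) feature with its neighbors'.

    features:  {node: vector}
    neighbors: {node: list of neighboring nodes}
    mlp:       callable applied to the aggregated vector
    """
    out = {}
    for v, h_v in features.items():
        agg = [(1.0 + eps) * x for x in h_v]
        for u in neighbors.get(v, []):
            agg = [a + b for a, b in zip(agg, features[u])]
        out[v] = mlp(agg)
    return out

def readout(features):
    """Pool an unordered set of node vectors into one fixed-length graph vector."""
    vecs = list(features.values())
    dim = len(vecs[0])
    max_pool = [max(v[d] for v in vecs) for d in range(dim)]
    mean_pool = [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]
    return max_pool + mean_pool  # concatenation of the two pooled vectors

def relu_mlp(vec):
    """Stand-in for a learned MLP: a single ReLU, used only to keep the sketch runnable."""
    return [max(0.0, x) for x in vec]
```

Stacking several calls to `gin_layer` lets information travel one hop further per layer, and `readout` returns a vector whose length depends only on the feature dimension, not on the number of nodes.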
[0159] In terms of loss design, this application employs a multi-task joint optimization strategy that simultaneously minimizes the edge relationship prediction loss, the text classification loss, and the subgraph-level attribute regression loss. This forms a joint optimization objective for edge feature perception and subgraph-level regression, as shown in formula (16), where the subgraph-level attribute regression loss is shown in formula (17). This strategy ensures consistency between local structural constraints and global attribute learning, thereby achieving a unified framework for geometric relationship recognition and attribute reasoning.
[0160] [Formulas (16) and (17)]
The embodiments of this application achieve joint analysis of geometry and text, significantly improving analysis capabilities in real-world image scenes; they introduce interaction between geometric and text features, with different geometric elements mutually constraining each other to enhance relationship recognition; and they employ a hierarchical graph network structure that effectively combines local relationship reasoning with global semantic modeling, thereby improving the analysis accuracy of planar geometric graphs.
Referring again to Figure 1, the image processing device 154 may include: a feature extraction module 1541, a fusion module 1542, a first determination module 1543, and a second determination module 1544. The feature extraction module 1541 is used to extract geometric features and non-geometric features, respectively, from the image to be processed. The fusion module 1542 is used to fuse the geometric features and the non-geometric features to obtain fused features. The first determination module 1543 is used to determine, based on at least one of the fused features and the non-geometric features, first attribute information of geometric elements and second attribute information of non-geometric elements in the image to be processed; the first attribute information and the second attribute information at least include the position and category of the elements. The second determination module 1544 is used to determine the semantic information of the image to be processed based on the first attribute information and the second attribute information; the semantic information at least includes geometric graphic information and the association relationship between geometric elements and non-geometric elements.
[0161] It should be noted that the description of the device embodiments in this application is similar to the description of the method embodiments described above, and has similar beneficial effects as the method embodiments; therefore, it will not be repeated. For technical details not disclosed in the device embodiments, please refer to the description of the method embodiments in this application for understanding.
[0162] It should be noted that, in the embodiments of this application, if the above-described image processing method is implemented as a software functional module and sold or used as an independent product, it can also be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the embodiments of this application, or the part that contributes to the related technology, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a terminal to execute all or part of the methods of the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, mobile hard drives, read-only memory (ROM), magnetic disks, or optical disks. Thus, the embodiments of this application are not limited to any specific hardware and software combination.
[0163] This application provides a storage medium storing executable instructions. When the executable instructions are executed by a processor, the processor will execute the image processing method provided in this application.
[0164] In some embodiments, the storage medium may be a computer-readable storage medium, such as a ferromagnetic random access memory (FRAM), a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), flash memory, magnetic surface memory, optical disc, or a compact disk-read-only memory (CD-ROM); or it may be a device that includes one or any combination of the above-mentioned memories.
[0165] In some embodiments, executable instructions may take the form of a program, software, software module, script, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and may be deployed in any form, including as a standalone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
[0166] As an example, executable instructions may, but do not necessarily, correspond to files in a file system. They may be stored as part of a file that holds other programs or data, for example, in one or more scripts within a Hyper Text Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple co-located files (e.g., files storing one or more modules, subroutines, or code sections). As an example, executable instructions may be deployed to execute on a single computing device, or on multiple computing devices located in one location, or on multiple computing devices distributed across multiple locations and interconnected via a communication network.
[0167] This application provides a computer program product, including a computer program or computer executable instructions, which, when executed by a processor, implement the image processing method described above.
[0168] It should be understood that the phrase "one embodiment" or "an embodiment" throughout the specification means that a specific feature, structure, or characteristic related to the embodiment is included in at least one embodiment of this application. Therefore, "in one embodiment" or "in an embodiment" appearing throughout the specification does not necessarily refer to the same embodiment. Furthermore, these specific features, structures, or characteristics can be combined in any suitable manner in one or more embodiments. It should be understood that in the various embodiments of this application, the sequence numbers of the above-described processes do not imply a sequential order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application. The sequence numbers of the above-described embodiments are merely descriptive and do not represent the superiority or inferiority of the embodiments.
[0169] It should be noted that, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover a non-exclusive inclusion, such that a process, method, or apparatus that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element. In the several embodiments provided in this application, it should be understood that the disclosed devices and methods can be implemented in other ways. The device embodiments described above are merely illustrative; for example, the division of units is only a logical functional division, and in actual implementation, there may be other division methods, such as: multiple units or components may be combined, or integrated into another system, or some features may be ignored or not performed.
[0170] The above are merely embodiments of this application and are not intended to limit the scope of protection of this application. Any modifications, equivalent substitutions, and improvements made within the spirit and scope of this application are included within the scope of protection of this application.
Claims
1. An image processing method, comprising: extracting geometric features and non-geometric features, respectively, from an image to be processed; fusing the geometric features and the non-geometric features to obtain fused features; determining, based on at least one of the fused features and the non-geometric features, first attribute information of geometric elements and second attribute information of non-geometric elements in the image to be processed, wherein the first attribute information and the second attribute information include at least the position and category of the element; and determining semantic information of the image to be processed based on the first attribute information and the second attribute information, wherein the semantic information includes at least geometric information and an association relationship between the geometric elements and the non-geometric elements.
2. The image processing method according to claim 1, wherein determining the first attribute information of geometric elements and the second attribute information of non-geometric elements in the image to be processed based on at least one of the fused features and the non-geometric features comprises: performing semantic segmentation and instance clustering, respectively, on the fused features to obtain the first attribute information of the geometric elements, wherein the first attribute information includes at least the geometric category probability of each geometric pixel in the geometric elements, the geometric instance to which each geometric pixel belongs, and the position of the geometric elements; and performing multi-task detection on the non-geometric features and the fused features to obtain the second attribute information of the non-geometric elements, wherein the second attribute information includes at least the non-geometric category probability of each non-geometric pixel in the non-geometric elements, the centrality confidence of each non-geometric pixel within its non-geometric object, and the position and text content of the non-geometric elements, and wherein the multi-task detection includes at least classification detection, position detection, and pixel centrality detection.
3. The image processing method according to claim 2, wherein performing semantic segmentation and instance clustering on the fused features to obtain the first attribute information of the geometric elements comprises: performing, based on a semantic segmentation network, semantic segmentation on the fused features to obtain the geometric category probability of each geometric pixel in the geometric elements, wherein the semantic segmentation network is trained at least using a weighted binary cross-entropy loss function; locating, based on a regression network, the geometric elements in the geometric features to obtain the position of the geometric elements; and performing, based on an instance segmentation network, instance clustering on the fused features to obtain the geometric instance to which each geometric pixel in the geometric elements belongs, wherein the instance segmentation network is trained at least using a discriminative loss function based on a distance metric.
4. The image processing method according to claim 2, wherein performing multi-task detection on the non-geometric features and the fused features to obtain the second attribute information of the non-geometric elements comprises: performing, based on a classification network, classification detection on the fused features to obtain the non-geometric category probability of each non-geometric pixel in the non-geometric element, wherein the classification network is trained using at least a class imbalance loss function; detecting, based on a regression network, the non-geometric elements in the non-geometric features to obtain the positions of the non-geometric elements; performing, based on a centrality network, pixel centrality detection on the position of each pixel in the non-geometric elements of the non-geometric features to obtain the centrality confidence of each non-geometric pixel within the non-geometric object to which it belongs; and performing, based on a character recognition network, content recognition on the non-geometric elements in the non-geometric features to obtain the text content of the non-geometric elements.
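The centrality confidence in claim 4 can be illustrated with a commonly used centerness target from anchor-free detectors. This is an assumption: the claim does not state the formula, and the function name `centerness` and the box format `(x0, y0, x1, y1)` are hypothetical. The sketch computes the target value a centrality network might be trained to predict.

```python
import math


def centerness(px, py, box):
    """Centerness of pixel (px, py) inside an axis-aligned box.

    box is (x0, y0, x1, y1). The score is 1.0 at the box center and
    falls to 0.0 on the border; pixels outside the box score 0.0.
    """
    x0, y0, x1, y1 = box
    left, right = px - x0, x1 - px
    top, bottom = py - y0, y1 - py
    if min(left, right, top, bottom) < 0:
        return 0.0  # pixel lies outside the box
    if max(left, right) == 0 or max(top, bottom) == 0:
        return 0.0  # degenerate box with zero width or height
    ratio_x = min(left, right) / max(left, right)
    ratio_y = min(top, bottom) / max(top, bottom)
    return math.sqrt(ratio_x * ratio_y)


if __name__ == "__main__":
    label_box = (0, 0, 10, 10)  # bounding box of a text label
    for point in [(5, 5), (2.5, 5), (0, 5)]:
        print(point, centerness(*point, label_box))
```

Pixels near the center of a text label score close to 1, so low-centrality pixels near the border can be down-weighted when the detections are filtered.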
5. The image processing method according to any one of claims 1 to 4, wherein determining the semantic information of the image to be processed based on the first attribute information and the second attribute information comprises: matching the geometric elements and the non-geometric elements, based on the principle of spatial proximity, the position of the geometric elements in the first attribute information, and the position of the non-geometric elements in the second attribute information, to obtain a matching relationship between the geometric elements and the non-geometric elements; determining the geometric relationship of the geometric elements based on the position of the geometric element in the first attribute information, the geometric category probability of each pixel in the geometric element, and the geometric instance to which each pixel belongs; and determining the geometric information and the association relationship based on the matching relationship and the geometric relationship.
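The spatial-proximity matching in claim 5 could be sketched as a nearest-neighbor assignment between label positions and geometric element positions. The function name `match_by_proximity`, the `max_dist` cutoff, the use of Euclidean distance, and the representation of each position as a single `(x, y)` point are all assumptions made for illustration.

```python
import math


def match_by_proximity(geo_positions, label_positions, max_dist=math.inf):
    """Assign each text label to its nearest geometric element.

    geo_positions   : dict mapping element id -> (x, y)
    label_positions : dict mapping label text -> (x, y)
    max_dist        : hypothetical cutoff; labels farther than this from
                      every element are left unmatched (None)
    """
    matches = {}
    for label, (lx, ly) in label_positions.items():
        best_id, best_d = None, math.inf
        for gid, (gx, gy) in geo_positions.items():
            d = math.hypot(gx - lx, gy - ly)
            if d <= max_dist and d < best_d:
                best_id, best_d = gid, d
        matches[label] = best_id
    return matches


if __name__ == "__main__":
    points = {"p0": (0, 0), "p1": (10, 0)}
    labels = {"A": (1, 1), "B": (9, -1), "C": (100, 100)}
    print(match_by_proximity(points, labels, max_dist=5))
```

This greedy form lets several labels map to the same element; a one-to-one matching would need an assignment step, which the claim neither requires nor excludes.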
6. The image processing method according to claim 5, wherein determining the geometric relationship of the geometric element based on the position of the geometric element in the first attribute information, the geometric category probability of each pixel in the geometric element, and the geometric instance to which each pixel belongs comprises: identifying, based on the geometric instance to which each pixel belongs, pixels belonging to the same instance as the same geometric primitive, wherein the geometric primitives include at least points, edges, and circles; and detecting, based on the geometric category probability and the position of the geometric element, the spatial relationship between different geometric primitives to obtain the geometric relationship of the geometric element.
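The first step of claim 6, grouping pixels that share an instance into one geometric primitive, might look like the sketch below. The input tuple layout, the function name `group_primitives`, and the choice to label each primitive by its mean per-pixel category probability are assumptions; the claim does not say how a primitive's category is derived from its pixels.

```python
from collections import defaultdict


def group_primitives(pixels):
    """Group pixels by instance id and summarize each group.

    pixels: iterable of (x, y, instance_id, class_probs), where
            class_probs is a dict such as {"point": 0.9, "edge": 0.1}.
    Returns a dict: instance_id -> {"category", "centroid", "pixels"}.
    """
    groups = defaultdict(list)
    for x, y, inst, probs in pixels:
        groups[inst].append((x, y, probs))

    primitives = {}
    for inst, members in groups.items():
        n = len(members)
        categories = sorted(set().union(*(p.keys() for _, _, p in members)))
        # Average each category's probability over the instance's pixels.
        mean = {c: sum(p.get(c, 0.0) for _, _, p in members) / n
                for c in categories}
        primitives[inst] = {
            "category": max(categories, key=lambda c: mean[c]),
            "centroid": (sum(x for x, _, _ in members) / n,
                         sum(y for _, y, _ in members) / n),
            "pixels": [(x, y) for x, y, _ in members],
        }
    return primitives


if __name__ == "__main__":
    sample = [
        (0, 0, 1, {"point": 0.9, "edge": 0.1}),
        (2, 0, 1, {"point": 0.7, "edge": 0.3}),
        (5, 5, 2, {"edge": 0.8, "circle": 0.2}),
    ]
    for inst, prim in group_primitives(sample).items():
        print(inst, prim["category"], prim["centroid"])
```

The resulting primitives, with a category and a position each, are the inputs the second step would compare when detecting spatial relationships such as a point lying on an edge.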
7. The image processing method according to claim 5, wherein determining the geometric information and the association relationship based on the matching relationship and the geometric relationship comprises: fusing the matching relationship and the geometric relationship to obtain the geometric information of the image to be processed, wherein the geometric information includes at least a geometric figure and its corresponding text symbol; verifying the geometric constraints of the geometric information to obtain a verification result; and, in response to the verification result indicating that the verification is successful, determining the association relationship based on the geometric information.
8. The image processing method according to any one of claims 1 to 4, further comprising: performing multi-scale feature extraction on the geometric graph to obtain a multi-scale feature map; correspondingly, extracting geometric features and non-geometric features from the image to be processed, to obtain the geometric features of geometric elements and the non-geometric features of non-geometric elements, comprises: extracting geometric features and non-geometric features from the multi-scale feature map to obtain the geometric features and the non-geometric features.
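To make the idea of a multi-scale feature map in claim 8 concrete, the sketch below builds a simple pyramid by repeated 2x2 average pooling. This is a stand-in chosen for illustration, not the extraction network of the application, which the claim does not specify; the names `avg_pool_2x` and `build_pyramid` are hypothetical.

```python
def avg_pool_2x(grid):
    """Halve a 2-D grid by averaging each non-overlapping 2x2 block."""
    h, w = len(grid) // 2, len(grid[0]) // 2
    return [
        [
            (grid[2 * i][2 * j] + grid[2 * i][2 * j + 1]
             + grid[2 * i + 1][2 * j] + grid[2 * i + 1][2 * j + 1]) / 4.0
            for j in range(w)
        ]
        for i in range(h)
    ]


def build_pyramid(grid, levels):
    """Return up to `levels` maps, from finest to coarsest resolution."""
    pyramid = [grid]
    for _ in range(levels - 1):
        last = pyramid[-1]
        if len(last) < 2 or len(last[0]) < 2:
            break  # cannot downsample further
        pyramid.append(avg_pool_2x(last))
    return pyramid


if __name__ == "__main__":
    image = [[4 * r + c for c in range(4)] for r in range(4)]
    for level in build_pyramid(image, levels=3):
        print(len(level), "x", len(level[0]), level)
```

Fine levels retain thin strokes and small text labels, while coarse levels summarize the overall layout, which is why both kinds of features are extracted from the multi-scale map.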
9. An image processing apparatus, comprising: a feature extraction module configured to extract geometric features and non-geometric features from the image to be processed, to obtain the geometric features of geometric elements and the non-geometric features of non-geometric elements; a fusion module configured to fuse the geometric features and the non-geometric features to obtain fused features; a first determining module configured to determine, based on at least one of the fused features and the non-geometric features, first attribute information of geometric elements and second attribute information of non-geometric elements in the image to be processed, wherein the first attribute information and the second attribute information include at least the position and category of the element; and a second determining module configured to determine the semantic information of the image to be processed based on the first attribute information and the second attribute information, wherein the semantic information includes at least geometric information and the association relationship between geometric elements and non-geometric elements.
10. An electronic device, comprising: a memory configured to store a computer program executable on a processor; and a processor configured to perform the following steps when executing the computer program: extracting geometric features and non-geometric features from the image to be processed, to obtain the geometric features of geometric elements and the non-geometric features of non-geometric elements; fusing the geometric features and the non-geometric features to obtain fused features; determining, based on at least one of the fused features and the non-geometric features, first attribute information of geometric elements and second attribute information of non-geometric elements in the image to be processed, wherein the first attribute information and the second attribute information include at least the position and category of the element; and determining, based on the first attribute information and the second attribute information, the semantic information of the image to be processed, wherein the semantic information includes at least geometric information and the association relationship between geometric elements and non-geometric elements.