A CAD drawing information extraction method based on a multi-agent system constructed with a multimodal large model

The method constructs a multi-agent system using a multimodal large model and extracts CAD drawing information with a spiral progressive combination network and a multi-head cross-attention mechanism. This addresses the problem of inaccurate cross-modal semantic alignment, matches geometric primitives with their text annotations, and improves the accuracy and consistency of CAD drawing information extraction.

CN121765486B · Active · Publication Date: 2026-06-16 · BEIJING NANCAL RUIYUAN DIGITAL TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
BEIJING NANCAL RUIYUAN DIGITAL TECH CO LTD
Filing Date
2026-03-03
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

In existing technologies, cross-modal semantic alignment is inaccurate when extracting information from CAD drawings, resulting in a high error rate in matching geometric primitives with text annotations.

Method used

A multimodal large model is used to construct multi-agent systems. A cross-modal semantic alignment model is used for format conversion and standardization. A spiral progressive combination network structure and a multi-head cross-attention mechanism are used for multimodal feature extraction. By combining hierarchical attention processing and minimum cut segmentation optimization, accurate alignment of geometric primitives and text annotations is achieved.

Benefits of technology

It improves the matching accuracy between geometric primitives and text annotations, solves the semantic gap problem existing in traditional technologies, and ensures the accuracy and consistency of CAD drawing information extraction.

✦ Generated by Eureka AI based on patent content.

Abstract

The application provides a CAD drawing information extraction method based on a multi-agent system constructed with a multimodal large model, and belongs to the technical field of multimodal large models. The CAD drawing is represented as a weighted graph structure. A cross-modal semantic alignment model, which combines a spiral progressive combination network with multi-head cross-attention, extracts and aligns the visual features of geometric figures and the language features of text annotations. Quadtree spatial partitioning is combined with dense attention and a sparse global token communication mechanism to perform hierarchical attention processing. A parsing agent and a verification agent perform initial information extraction and consistency checking. When the consistency confidence is lower than a preset threshold, the primitives are re-divided through minimum cut segmentation optimization and information extraction is performed again. Finally, the complete drawing information is output, thereby solving the technical problem of a high error rate in matching geometric primitives with text annotations caused by inaccurate cross-modal semantic alignment during CAD drawing information extraction.

Description

Technical Field

[0001] This invention belongs to the field of multimodal large model technology. Specifically, it relates to a method for extracting CAD drawing information based on a multi-agent system constructed with a multimodal large model.

Background Technology

[0002] In current engineering design and manufacturing fields, CAD drawing information extraction is a crucial step in digitizing design data. Traditional technologies employ rule-based template matching methods or unimodal deep learning models to identify geometric primitives and text annotations in drawings. However, these traditional technologies have significant limitations when processing complex engineering drawings. Rule-based methods rely on manually defined feature templates, which cannot adapt to diverse drawing formats and styles. Unimodal models only process visual or textual features, ignoring the semantic relationships between geometric primitives and text annotations. Textual information such as dimensions, material specifications, and tolerances in CAD drawings must be accurately matched with the corresponding geometric primitives to fully express the design intent. Because traditional technologies lack effective cross-modal semantic alignment mechanisms, a semantic gap arises between geometric primitives and text annotations in the feature representation space. In other words, existing technologies suffer from a high error rate in matching geometric primitives and text annotations due to inaccurate cross-modal semantic alignment during CAD drawing information extraction.

Summary of the Invention

[0003] In view of this, the present invention provides a method for extracting CAD drawing information based on a multimodal large model to construct a multi-agent system, which can solve the technical problem in the prior art that the inaccurate cross-modal semantic alignment during CAD drawing information extraction leads to a high error rate in matching geometric primitives with text annotations.

[0004] This invention is implemented as follows: It provides a method for extracting CAD drawing information based on a multimodal large model and multi-agent architecture. The input CAD drawing undergoes format conversion and standardization to construct a weighted graph structure representation. This representation is then input into a cross-modal semantic alignment model for multimodal feature extraction, which outputs aligned feature vectors. Hierarchical attention processing is applied to these aligned feature vectors to output long-range dependency features. The parsing agent performs initial information extraction and the verification agent performs a consistency check. When the consistency confidence falls below a preset threshold, minimum cut segmentation optimization is performed and information is re-extracted, yielding the complete drawing information extraction result. The cross-modal semantic alignment model employs a spiral progressive combination network structure combined with a multi-head cross-attention mechanism. The cross-layer connections of the spiral path enable the fusion of features at different scales. The multi-head cross-attention mechanism learns the bidirectional association mapping between visual and linguistic features from multiple representation subspaces. A contrastive learning framework optimizes the triplet loss of positive and negative sample pairs to establish accurate cross-modal semantic alignment.

[0005] The construction of weighted graph structure representation refers to extracting the spatial adjacency relationships between geometric primitives in the drawing as edges, and calculating the edge weights based on visual similarity and semantic association strength.

[0006] The spiral progressive combined network structure of the cross-modal semantic alignment model includes an initial feature encoding layer, a spiral feature expansion layer, and a global feature fusion layer. The spiral feature expansion layer contains four cascaded spiral expansion units. The output features of the i-th layer in each spiral expansion unit are simultaneously transmitted to the i+1 and i+2 layers.

[0007] The receptive field coverage of the spiral path is 6.25% of the original drawing area in the first spiral expansion unit, 25% in the second spiral expansion unit, 56.25% in the third spiral expansion unit, and 100% in the fourth spiral expansion unit.
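
As an arithmetic observation that the source does not state, the four coverage values equal the squares of k/4 for k = 1 to 4. This would be consistent with a square receptive field whose side length grows by one quarter of the drawing side per unit, although the patent text gives no such derivation:

```latex
\left(\tfrac{k}{4}\right)^{2}:\quad
\tfrac{1}{16}=6.25\%,\quad
\tfrac{4}{16}=25\%,\quad
\tfrac{9}{16}=56.25\%,\quad
\tfrac{16}{16}=100\%
\qquad (k=1,2,3,4)
```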

[0008] The multi-head cross-attention mechanism includes eight attention heads. Each attention head independently calculates the attention weights of visual features on linguistic features and the attention weights of linguistic features on visual features. Through cross-attention operations, a bidirectional association mapping between geometric primitives and text annotations is established.
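
The bidirectional cross-attention described above can be illustrated with a minimal sketch. It is an assumption-laden simplification, not the patented model: the learned query, key, and value projections are omitted, the feature vector is simply sliced into eight heads, and the function names (`cross_attend`, `bidirectional_cross_attention`) are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, keys_values):
    """Each query row attends over every row of the other modality."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, keys_values))
                    for c in range(d)])
    return out

def split_heads(rows, n_heads):
    """Slice each feature vector into n_heads equal chunks (assumed layout)."""
    d = len(rows[0]) // n_heads
    return [[r[h * d:(h + 1) * d] for r in rows] for h in range(n_heads)]

def merge_heads(heads):
    """Concatenate per-head outputs back into one vector per row."""
    return [sum((h[i] for h in heads), []) for i in range(len(heads[0]))]

def bidirectional_cross_attention(visual, text, n_heads=8):
    v_heads = split_heads(visual, n_heads)
    t_heads = split_heads(text, n_heads)
    # Direction 1: visual features query the text features.
    v2t = [cross_attend(v, t) for v, t in zip(v_heads, t_heads)]
    # Direction 2: text features query the visual features.
    t2v = [cross_attend(t, v) for v, t in zip(v_heads, t_heads)]
    return merge_heads(v2t), merge_heads(t2v)
```

Each head here sees only its own slice of the features; a trained model would instead apply separate learned projections per head before the dot product.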

[0009] The global feature fusion layer concatenates the output features of four spiral expansion units and eight attention heads, and outputs an aligned feature vector with a dimension of 512 after passing through two fully connected networks and residual connection operations.

[0010] When the cross-modal semantic alignment model is deployed, model quantization converts the weight parameters from 32-bit floating-point precision to 8-bit integer precision, using a symmetric quantization method to map the weight values. For the activation values, a dynamic quantization method determines the quantization parameters according to the distribution of the activation values of each batch of input data.

[0011] The operator fusion optimization identifies consecutive convolution, batch normalization, and activation function operations in the computation graph and merges them into a single fused operator. The fused operator completes all calculations in a single GPU kernel function call.

[0012] In the training of the cross-modal semantic alignment model, the triplet loss function is computed as follows. The Euclidean distance between the visual features and language features of the positive sample pair is calculated and recorded as the positive sample distance. The Euclidean distance between the visual features and language features of the negative sample pair is calculated and recorded as the negative sample distance. The normalized negative sample distance is subtracted from the normalized positive sample distance, the boundary (margin) parameter is added, and the larger of this result and 0 is taken as the loss value.

[0013] The quadtree space partitioning of the hierarchical attention processing takes the input drawing space region as the root node, determines whether the number of primitives in the root node exceeds the threshold of 256, and if it does, divides the root node space region into 4 sub-regions along the horizontal and vertical directions to obtain 4 sub-nodes. The partitioning is repeated until the number of primitives in all leaf nodes does not exceed the threshold of 256 or the partitioning depth reaches the maximum depth of 5.

[0014] The calculation of dense attention involves calculating the pairwise attention weights for all primitive nodes within each local window, performing a dot product operation on the query vector of the j-th primitive node and the key vector of the k-th primitive node, dividing by a scaling factor to obtain the attention score, normalizing it using the softmax function to obtain the attention weight, and then weighted summing it with the value vector of the k-th primitive node to obtain the output feature of the j-th primitive node.

[0015] In this mechanism, the sparse global token communication selects the three primitive nodes with the largest feature norm in each local window as global token nodes. All global token nodes perform cross-window attention calculations and broadcast the output features of the global token nodes back to all primitive nodes in their local window.

[0016] The parsing agent calls the fine-tuned multimodal large model and uses parsing prompts to extract information, while the verification agent calls the fine-tuned multimodal large model and uses verification prompts to perform consistency checks. The consistency confidence is calculated as the proportion of the number of primitives that pass the consistency check to the total number of primitives.

[0017] The minimum cut segmentation optimization adopts the Ford-Fulkerson algorithm. In the weighted graph structure representation, source nodes are connected to foreground seeds and sink nodes are connected to background seeds. Foreground seeds are primitive nodes with a consistency confidence higher than 0.95 in the initial recognition results, and background seeds are primitive nodes with a consistency confidence lower than 0.50. The minimum cut boundary is obtained by solving the maximum flow.

[0018] The Ford-Fulkerson algorithm involves initializing all edge flows to 0, using breadth-first search to find augmenting paths from the source to the sink, finding the minimum capacity on the augmenting path and recording it as the bottleneck flow, adding the bottleneck flow to the flow of all edges on the augmenting path and updating the flow of the reverse edges, and repeating the search for augmenting paths until no augmenting path from the source to the sink exists.

[0019] The complete drawing information extraction results include the type label, size parameter, and position coordinates of all geometric elements, the content, font size, and position coordinates of all text annotations, and the association table recording the geometric element number and semantic relationship type corresponding to each text annotation.

[0020] This invention constructs a cross-modal semantic alignment model that combines a spiral progressive network structure with multi-head cross-attention. This model projects the visual features of geometric figures and the linguistic features of text annotations onto a unified semantic representation space. The cross-layer connections of the spiral path allow local detail features at different scales to merge with global structural features during forward propagation. The multi-head cross-attention mechanism learns the bidirectional association mapping between visual and linguistic features from multiple representation subspaces. A contrastive learning framework optimizes the triplet loss for positive and negative sample pairs, minimizing the distance between the visual features of the same primitive and its associated text features in the semantic space, while maximizing the feature distance between different primitives. This invention addresses the semantic gap between geometric primitives and text annotations in the feature representation space, a problem inherent in traditional techniques. By establishing a precise semantic association between the visual and linguistic spaces through cross-modal projection and alignment, it eliminates matching errors caused by differences in cross-modal feature representations. In summary, this invention solves the technical problem mentioned in the background art: inaccurate cross-modal semantic alignment during CAD drawing information extraction leads to a high error rate in matching geometric primitives with text annotations.

Description of the Attached Figures

[0021] Figure 1 is a flowchart of the method of the present invention.

[0022] Figure 2 is a visualization of the distribution of aligned feature vectors in two-dimensional space.

[0023] Figure 3 is a schematic diagram of the dense attention calculation process for a local window.

[0024] Figure 4 is a comparison chart of consistency confidence before and after minimum cut segmentation optimization.

Detailed Implementation

[0025] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below.

[0026] Figure 1 shows a flowchart of the method provided by this invention for extracting CAD drawing information based on a multi-agent system constructed with a multimodal large model. The method includes the following steps:

[0027] S01. Perform format conversion and standardization on the input CAD drawings, take the geometric primitives in the drawings as nodes and extract the spatial adjacency relationship between nodes as edges, calculate the edge weights based on visual similarity and semantic association strength, and construct a weighted graph structure representation.

[0028] S02. Input the weighted graph structure representation into the cross-modal semantic alignment model for multimodal feature extraction. The cross-modal semantic alignment model simultaneously processes the visual features of geometric figures and the linguistic features of text annotations. The features are projected and aligned across modally through a contrastive learning framework, and the aligned feature vector is output.

[0029] S03. Perform hierarchical attention processing on the alignment feature vector. First, divide the drawing into a quadtree space to obtain multiple local windows. Calculate dense attention in each local window. Establish long-range dependencies across windows using a sparse global token communication mechanism and output long-range dependency features.

[0030] S04. Input the long-range dependency features into the parsing agent and the verification agent respectively. The parsing agent is responsible for extracting the initial information and outputting the initial recognition result. The verification agent performs a consistency check on the initial recognition result and calculates the consistency confidence.

[0031] S05. Determine whether the consistency confidence is lower than a preset threshold of 0.85. If the consistency confidence is lower than the preset threshold of 0.85, perform minimum cut segmentation optimization on the weighted graph structure representation. In the weighted graph structure representation, set source nodes to connect foreground seeds and set sink nodes to connect background seeds. Solve the maximum flow using the Ford-Fulkerson algorithm to obtain the minimum cut boundary. Re-divide the primitives according to the minimum cut boundary and re-input them into the parsing agent for information extraction to obtain the optimized recognition result. If the consistency confidence is not lower than the preset threshold of 0.85, use the initial recognition result as the final recognition result.

[0032] S06. Format the optimized recognition result or the final recognition result, extract the type labels, size parameters and position coordinates of all geometric elements, extract the content, font size and position coordinates of all text annotations, establish a relationship table between geometric elements and text annotations, and output the complete drawing information extraction result.
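
The branching in steps S04 and S05 can be summarized in a short control-flow sketch. The function `run_extraction` and its callable parameters (`parse`, `verify`, `resegment`) are hypothetical placeholders for the parsing agent, the verification agent, and the minimum cut re-division; whether the verification agent runs again after re-extraction is not stated in the source, so this sketch does not re-verify.

```python
def run_extraction(features, parse, verify, resegment, threshold=0.85):
    """Return (recognition_result, confidence, was_resegmented).

    parse:     stand-in for the parsing agent (features -> result)
    verify:    stand-in for the verification agent (result -> confidence in [0, 1])
    resegment: stand-in for minimum cut re-division (features -> features)
    """
    initial = parse(features)
    confidence = verify(initial)
    if confidence < threshold:
        # S05, low-confidence branch: re-divide primitives, then re-extract.
        optimized = parse(resegment(features))
        return optimized, confidence, True
    # S05, otherwise: the initial result becomes the final result.
    return initial, confidence, False
```

Passing the agents in as callables keeps the sketch independent of any particular model-serving interface.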

[0033] The cross-modal semantic alignment model employs a spiral progressive combined network structure and a multi-head cross-attention feature fusion computation framework. The spiral progressive combined network structure includes an initial feature encoding layer, a spiral feature expansion layer, and a global feature fusion layer. The initial feature encoding layer contains a visual encoding branch and a language encoding branch. The visual encoding branch performs convolution operations on geometric figures to extract local spatial features, while the language encoding branch performs word embedding operations on text annotations to extract discrete semantic features. The spiral feature expansion layer contains four cascaded spiral expansion units. Each spiral expansion unit uses a spiral path connection method, with the output features of the i-th layer simultaneously passed to the (i+1)-th and (i+2)-th layers. The receptive field coverage of the spiral path is 6.25% of the original drawing area in the first spiral expansion unit, expands to 25% in the second spiral expansion unit, expands to 56.25% in the third spiral expansion unit, and reaches 100% receptive field coverage in the fourth spiral expansion unit, achieving global feature capture. The multi-head cross-attention mechanism comprises eight attention heads, each independently calculating the attention weights of visual features on linguistic features and linguistic features on visual features. A bidirectional association mapping between geometric primitives and text annotations is established through cross-attention operations. The global feature fusion layer concatenates the output features of the four spiral expansion units and the eight attention heads, outputting a 512-dimensional aligned feature vector after passing through two fully connected layers and residual connection operations.

[0034] The cross-modal semantic alignment model employs model quantization and operator fusion optimization techniques during deployment. Model quantization converts the weight parameters of the cross-modal semantic alignment model from 32-bit floating-point precision to 8-bit integer precision, using symmetric quantization to map the 32-bit floating-point weight values to the 8-bit integer range. Simultaneously, dynamic quantization is used for activation values, dynamically determining quantization parameters based on the activation value distribution of each batch of input data. The quantized model's storage space is compressed to 25% of the original model, while the computation speed is increased to three times that of the original model, and the quantization accuracy loss is controlled within 0.8%. The operator fusion optimization technique identifies consecutive convolution operations, batch normalization operations, and activation function operations in the computation graph, merging these consecutive operations into a single fused operator. This fused operator completes all calculations within a single GPU kernel function call, reducing memory accesses and lowering data transmission latency.
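
A minimal sketch of the two quantization modes follows, under stated assumptions: symmetric quantization uses a single per-tensor scale with the range [-127, 127], and dynamic quantization derives a scale and zero point from each batch's minimum and maximum. The source does not specify these details, and the function names are hypothetical. The 25% storage figure matches the ratio of 8 bits to 32 bits per weight.

```python
def symmetric_quantize(weights):
    """Map float weights to int8 values in [-127, 127] with one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from symmetric int8 codes."""
    return [v * scale for v in q]

def dynamic_quantize(batch):
    """Quantize one batch of activations to [0, 255].

    The scale and zero point are recomputed from this batch's own
    min/max, so different batches get different parameters.
    """
    lo, hi = min(batch), max(batch)
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    zero_point = round(-lo / scale)
    q = [max(0, min(255, round(x / scale) + zero_point)) for x in batch]
    return q, scale, zero_point
```

With rounding to the nearest code, the reconstruction error of each weight is bounded by half the scale, which is one way to reason about the accuracy loss that quantization introduces.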

[0035] The steps for establishing the training dataset for the cross-modal semantic alignment model include collecting 10,000 CAD engineering drawings of different types, manually annotating the geometric primitives in each drawing to obtain primitive type labels, manually annotating each text annotation to obtain its associated geometric primitive number, constructing positive sample pairs that pair visual features of the same primitive in the dataset with their associated text features, constructing negative sample pairs that randomly pair visual features of different primitives in the dataset with text features, and performing data augmentation operations on each drawing, including rotation transformation, scaling transformation, and noise addition, to obtain an expanded training dataset of 50,000 drawings.

[0036] The training steps of the cross-modal semantic alignment model are as follows. The weight parameters of the visual encoding branch are initialized using a pre-trained visual encoder, and the weight parameters of the language encoding branch are initialized using a pre-trained language encoder. The batch size is set to 32 and the learning rate to 0.0001. The triplet loss function is then calculated for the positive and negative sample pairs. First, the Euclidean distance between the visual and linguistic features of the positive sample pair is calculated and recorded as the positive sample distance. Then, the Euclidean distance between the visual and linguistic features of the negative sample pair is calculated and recorded as the negative sample distance. The positive sample distance is divided by the unit length of the positive sample distance to obtain the normalized positive sample distance, and the negative sample distance is divided by the unit length of the negative sample distance to obtain the normalized negative sample distance. The normalized negative sample distance is subtracted from the normalized positive sample distance, the boundary parameter 0.5 is added, and the larger of this result and 0 is taken as the loss value. The weight parameters of the cross-modal semantic alignment model are updated through the backpropagation algorithm. Training runs for 100 epochs until the loss value converges.
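
The loss described above has the standard triplet (hinge) form max(d_pos - d_neg + margin, 0). The sketch below is a plain-Python illustration with the margin 0.5 from the text. The source's normalization step ("divided by the unit length") is not precise enough to reproduce, so this sketch assumes the caller passes already-normalized features and computes the distances directly.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(visual, text_pos, text_neg, margin=0.5):
    """Hinge-style triplet loss for one (anchor, positive, negative) triple.

    visual:   visual feature of a primitive (the anchor)
    text_pos: text feature of its associated annotation
    text_neg: text feature of an unrelated annotation
    """
    d_pos = euclidean(visual, text_pos)
    d_neg = euclidean(visual, text_neg)
    return max(d_pos - d_neg + margin, 0.0)
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, so well-separated triples stop contributing gradient.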

[0037] The spiral progressive network structure expands the receptive field layer by layer through multi-level spiral path connections. Shallow networks maintain a small receptive field to capture local details of geometric primitives, including line thickness, angle information, and curvature variations. Deeper networks expand the receptive field to capture global structural features of the drawing, including spatial layout relationships between primitives and the overall topology. The cross-layer connections of the spiral path allow features at different scales to merge during network forward propagation, avoiding the loss of shallow features in deep networks, a problem common in traditional convolutional neural networks. The multi-head cross-attention mechanism learns the association patterns between visual and linguistic features from different representational subspaces through eight independent attention heads. Each attention head focuses on different types of semantic relationships, including the correspondence between dimension annotations and contours, the association between material annotations and primitives, and the matching relationship between tolerance annotations and geometric features. A dynamic weight allocation mechanism adaptively adjusts the contribution weights of different attention heads based on the content of the input drawing. When processing drawings with dense dimension annotations, the weight of the geometric dimension association head is increased; when processing drawings with abundant material annotations, the weight of the material primitive association head is increased. The aforementioned feature fusion computing framework enables the model to accurately establish cross-modal semantic alignment when extracting CAD drawing information, mapping geometric primitives in visual space and text annotations in linguistic space to a unified semantic representation space. 
This improves the matching accuracy of dimension annotations and corresponding contours. At the same time, it handles large engineering drawings containing hundreds of interrelated primitives through a balanced mechanism of global feature capture and local detail preservation, avoiding the loss of semantic information caused by differences in feature representation spaces. Furthermore, the spiral progressive learning strategy reduces the difficulty of model training and accelerates the convergence process.

[0038] The quadtree spatial partitioning method is as follows: First, the spatial region of the input drawing is recorded as the root node. It is determined whether the number of primitives in the root node exceeds the threshold of 256. When it exceeds the threshold of 256, the spatial region of the root node is divided into 4 sub-regions along the horizontal and vertical directions to obtain 4 child nodes. The judgment and partitioning operation is repeated for each child node until the number of primitives in all leaf nodes does not exceed the threshold of 256 or the partitioning depth reaches the maximum depth of 5. The local window corresponds to all leaf nodes obtained by the quadtree spatial partitioning.
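
The partitioning rule can be sketched recursively. This is an illustrative version that assumes each primitive is represented by a single (x, y) point such as its centroid, that the root counts as depth 0, and that empty quadrants are kept as empty leaf windows; none of these details is specified in the source, and `quadtree_windows` is a hypothetical name.

```python
def quadtree_windows(points, region, max_prims=256, max_depth=5, depth=0):
    """Return leaf windows as a list of (region, points) pairs.

    region is (x0, y0, x1, y1). A node is split into four quadrants
    while it holds more than max_prims points and depth < max_depth.
    """
    if len(points) <= max_prims or depth >= max_depth:
        return [(region, points)]
    x0, y0, x1, y1 = region
    mx, my = (x0 + x1) / 2, (y0 + y1) / 2
    quads = [(x0, y0, mx, my), (mx, y0, x1, my),
             (x0, my, mx, y1), (mx, my, x1, y1)]
    buckets = [[] for _ in quads]
    for x, y in points:
        # Bit 0 selects the right half, bit 1 selects the upper half.
        idx = (1 if x >= mx else 0) + (2 if y >= my else 0)
        buckets[idx].append((x, y))
    leaves = []
    for quad, pts in zip(quads, buckets):
        leaves.extend(
            quadtree_windows(pts, quad, max_prims, max_depth, depth + 1))
    return leaves
```

The depth cap matters for degenerate inputs: many primitives stacked at one location can never be separated by splitting, so a leaf may still exceed the primitive threshold when the maximum depth is reached.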

[0039] The dense attention calculation method is as follows: within each local window, calculate the attention weights between all primitive nodes. Perform a dot product operation on the query vector of the j-th primitive node and the key vector of the k-th primitive node, and then divide by the scaling factor to obtain the attention score. Normalize the attention score through the softmax function to obtain the attention weight. Then, perform a weighted summation of the attention weight and the value vector of the k-th primitive node to obtain the output feature of the j-th primitive node.
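
The per-window computation is ordinary scaled dot-product attention. The sketch below assumes the scaling factor is the square root of the key dimension, which is the common convention but is not stated in the source; `dense_window_attention` is a hypothetical name, and the query, key, and value vectors are taken as given rather than produced by learned projections.

```python
import math

def dense_window_attention(queries, keys, values):
    """Attention among all primitive nodes of one local window.

    queries[j], keys[k], values[k] are the vectors of nodes j and k.
    Returns one output feature per query node.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Attention score of node j against every node k in the window.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors.
        outputs.append([sum(w * v[c] for w, v in zip(weights, values))
                        for c in range(len(values[0]))])
    return outputs
```

Because every node attends to every other node, the cost per window grows quadratically with the number of primitives, which is why the window size is bounded by the quadtree threshold.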

[0040] The sparse global token communication mechanism selects the three primitive nodes with the largest feature norm in each local window as global token nodes. All global token nodes perform cross-window attention calculations and broadcast the output features of the global token nodes back to all primitive nodes in their local window. Long-range dependencies are established by using the global token nodes as a bridge for information transmission between different local windows.
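
A rough sketch of the token selection and exchange follows. The top-3 selection by feature norm comes from the text; the broadcast step is an assumption, since the source does not say how the global token outputs are combined with the node features. Here the mean of a window's updated tokens is added to every node in that window. All function names are hypothetical.

```python
import math

def _attention(queries, keys, values):
    """Scaled dot-product attention over plain lists of vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v[c] for e, v in zip(exps, values))
                    for c in range(len(values[0]))])
    return out

def select_global_tokens(window, k=3):
    """Indices of the k nodes with the largest L2 feature norm."""
    norms = [math.sqrt(sum(x * x for x in f)) for f in window]
    ranked = sorted(range(len(window)), key=lambda i: norms[i], reverse=True)
    return ranked[:k]

def global_token_exchange(windows, k=3):
    """Cross-window attention among global tokens, then broadcast back."""
    picks = [select_global_tokens(w, k) for w in windows]
    tokens = [w[i] for w, idx in zip(windows, picks) for i in idx]
    if not tokens:
        return [list(w) for w in windows]
    updated = _attention(tokens, tokens, tokens)
    result, pos = [], 0
    for w, idx in zip(windows, picks):
        mine = updated[pos:pos + len(idx)]
        pos += len(idx)
        if not mine:  # empty window: nothing to update
            result.append([])
            continue
        # Assumed broadcast rule: add the mean updated token to each node.
        summary = [sum(t[c] for t in mine) / len(mine)
                   for c in range(len(mine[0]))]
        result.append([[x + s for x, s in zip(f, summary)] for f in w])
    return result
```

With at most three tokens per window, the cross-window attention operates on a far smaller set than the full primitive list, which is what makes the global step sparse.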

[0041] The long-range dependency feature is a feature representation that includes dense association information within a local window and global dependency information across windows after the hierarchical attention processing.

[0042] The parsing agent invokes the fine-tuned multimodal large model and uses parsing prompts to extract information. The parsing prompts include instructions to identify the type of all geometric primitives and extract the size parameters of each geometric primitive, identify the content of all text annotations and extract the position coordinates of each text annotation, and establish the association between geometric primitives and text annotations.

[0043] The verification agent invokes the fine-tuned multimodal large model and uses verification prompts to perform consistency checks. The verification prompts include instructions to check whether the size parameters of geometric primitives are consistent with their associated text annotations, whether the topological relationships between primitives conform to engineering drawing specifications, and to calculate the overall consistency confidence of the extraction results.

[0044] The consistency confidence level is calculated by the proportion of the number of elements that pass the consistency check to the total number of elements, and the consistency confidence level ranges from 0 to 1.

[0045] The preset threshold is 0.85. When the consistency confidence is lower than 0.85, it indicates that there are many inconsistent items in the initial identification result and minimum cut segmentation optimization is required.
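
The confidence ratio and the threshold test are simple enough to state directly. In this sketch, `check_results` is a hypothetical list of per-primitive pass/fail flags produced by the verification agent.

```python
def consistency_confidence(check_results):
    """Fraction of primitives that passed the consistency check (0 to 1)."""
    if not check_results:
        raise ValueError("no primitives to evaluate")
    passed = sum(1 for ok in check_results if ok)
    return passed / len(check_results)

def needs_min_cut(confidence, threshold=0.85):
    """True when the confidence is strictly below the preset threshold."""
    return confidence < threshold
```

For example, 17 passing primitives out of 20 give a confidence of exactly 0.85, which is not lower than the threshold, so the initial recognition result would be kept.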

[0046] The foreground seed is a primitive node in the initial identification result with a consistency confidence score higher than 0.95, and the background seed is a primitive node in the initial identification result with a consistency confidence score lower than 0.50.

[0047] The execution steps of the Ford-Fulkerson algorithm include initializing the flow of all edges to 0, using breadth-first search to find an augmenting path to the sink from the source, finding the minimum capacity value on the augmenting path and recording it as the bottleneck flow, increasing the flow of all edges on the augmenting path by the bottleneck flow and updating the flow of the reverse edges, and repeating the operation of finding augmenting paths and updating flow until there is no augmenting path from the source to the sink. At this point, the boundary between the set of nodes reachable from the source and the set of nodes that are not reachable is the minimum cut boundary.

[0048] The minimum cut boundary is the set of edges that separate the foreground primitives and the background primitives in the weighted graph structure representation. The total edge weight corresponding to the minimum cut boundary is the minimum among all possible segmentation schemes, which ensures the global optimality of the segmentation result.
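
The procedure above can be sketched as follows. Ford-Fulkerson with breadth-first search for augmenting paths is commonly known as the Edmonds-Karp variant. The graph encoding (a dictionary of directed edge capacities) and the name `min_cut` are assumptions of this sketch; an undirected weighted graph would be passed with both directions listed.

```python
from collections import defaultdict, deque

def min_cut(capacity, source, sink):
    """Max-flow / min-cut with BFS augmenting paths.

    capacity: {(u, v): cap} for directed edges.
    Returns (max_flow, cut_edges, nodes_reachable_from_source).
    """
    residual = defaultdict(float)
    adj = defaultdict(set)
    for (u, v), cap in capacity.items():
        residual[(u, v)] += cap
        adj[u].add(v)
        adj[v].add(u)  # reverse edge for the residual graph

    def bfs():
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent and residual[(u, v)] > 1e-12:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0.0
    while True:
        parent = bfs()
        if sink not in parent:
            break  # no augmenting path remains
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[e] for e in path)
        for u, v in path:
            residual[(u, v)] -= bottleneck
            residual[(v, u)] += bottleneck
        flow += bottleneck

    reachable = set(parent)  # source side of the cut
    cut = [(u, v) for (u, v) in capacity
           if u in reachable and v not in reachable]
    return flow, cut, reachable
```

To apply this to the seeded setup in the text, a caller would add an artificial source connected to the foreground seeds and an artificial sink connected to the background seeds, typically with capacities large enough that those edges are never cut; that construction detail is an assumption, not something the source specifies.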

[0049] The association table records the geometric primitive number corresponding to each text annotation, the numbers of all text annotations corresponding to each geometric primitive, and the semantic relationship type between the text annotation and the geometric primitive.

[0050] The semantic relationship types include dimension annotation relationships, material annotation relationships, tolerance annotation relationships, process annotation relationships, and location annotation relationships.

[0051] The complete drawing information extraction results include the type labels, size parameters, and position coordinates of all geometric elements, the content, font size, and position coordinates of all text annotations, as well as the association table.
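
One possible shape for the extraction result is shown below as a set of data classes. All class and field names are hypothetical; only the listed contents (type label, size parameters, position, text content, font size, association table, and the five relationship types) come from the text.

```python
from dataclasses import dataclass, field
from enum import Enum

class RelationType(Enum):
    DIMENSION = "dimension"
    MATERIAL = "material"
    TOLERANCE = "tolerance"
    PROCESS = "process"
    LOCATION = "location"

@dataclass
class GeometricElement:
    element_id: int
    type_label: str
    size_params: dict
    position: tuple

@dataclass
class TextAnnotation:
    annotation_id: int
    content: str
    font_size: float
    position: tuple

@dataclass
class Association:
    annotation_id: int
    element_id: int
    relation: RelationType

@dataclass
class DrawingInfo:
    elements: list = field(default_factory=list)
    annotations: list = field(default_factory=list)
    associations: list = field(default_factory=list)

    def annotations_for(self, element_id):
        """Annotation ids linked to one geometric element."""
        return [a.annotation_id for a in self.associations
                if a.element_id == element_id]

    def element_for(self, annotation_id):
        """Element id linked to one annotation, or None if unlinked."""
        for a in self.associations:
            if a.annotation_id == annotation_id:
                return a.element_id
        return None
```

Storing the associations once and deriving both lookup directions from them avoids keeping two tables in sync.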

[0052] Optionally, the present invention also provides a computer-based CAD drawing information extraction system based on a multi-agent system constructed with a multimodal large model. The computer is equipped with a storage medium that stores program instructions, and the program instructions execute the above-described method when run by the computer.

[0053] The specific implementation methods of the above steps are described in detail below.

[0054] The specific implementation of step S01 involves first recognizing and converting the input CAD drawing, uniformly converting DWG, DXF, or PDF format drawings into standard vector graphics representations. During the conversion, the coordinate and attribute information of all geometric elements is preserved. Then, the converted drawing undergoes standardization processing, including coordinate normalization and dimension normalization. Coordinate normalization maps the coordinate values of all geometric elements to the range of 0 to 1, while dimension normalization adjusts the aspect ratio of the drawing to a standard scale. Next, all geometric elements in the drawing are extracted as nodes of the graph structure. These geometric elements include lines, arcs, splines, annotation text, and symbolic graphics. The centroid coordinates and bounding box coordinates of each geometric element are calculated as node attributes. Then, the spatial adjacency relationships between nodes are calculated as edges of the graph structure. The criterion for determining spatial adjacency is that if the shortest distance between the bounding boxes of two geometric elements is less than a threshold of 20 mm, they are considered to be adjacent. The shortest distance is obtained by calculating the minimum distance from all vertices of the two bounding boxes to the other bounding box. For each edge, an edge weight is calculated, consisting of two parts: visual similarity and semantic association strength. Visual similarity is calculated by comparing the shape and texture features of two geometric primitives, while semantic association strength is calculated by determining whether two geometric primitives belong to the same part or the same assembly relationship. Visual similarity and semantic association strength are each normalized to the range of 0 to 1, and then weighted and summed using weight coefficients of 0.6 and 0.4 to obtain the final edge weight.
This weighted graph structure representation fully describes all geometric primitives and their interrelationships in the CAD drawing, providing structured input data for subsequent multimodal feature extraction.
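
The adjacency test and edge weighting of step S01 can be sketched as follows. This is a simplified illustration: the function names and the `(xmin, ymin, xmax, ymax)` box layout are hypothetical, the boxes are assumed to be axis-aligned and expressed in millimetres, and the standard gap formula for axis-aligned boxes is used in place of the vertex-by-vertex distance procedure described above. The visual similarity and semantic association strength are taken as precomputed inputs in the range 0 to 1.

```python
import math


def bbox_distance(a, b):
    """Shortest distance between two axis-aligned boxes (xmin, ymin, xmax, ymax)."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)  # horizontal gap, 0 if overlapping
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)  # vertical gap, 0 if overlapping
    return math.hypot(dx, dy)


def edge_weight(visual_sim, semantic_strength, alpha=0.6, beta=0.4):
    """Weighted sum of the two normalized terms (coefficients from the text)."""
    return alpha * visual_sim + beta * semantic_strength


def build_edges(boxes, visual_sim, semantic_strength, threshold=20.0):
    """Return {(i, j): weight} for every pair of boxes closer than the threshold."""
    edges = {}
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if bbox_distance(boxes[i], boxes[j]) < threshold:
                edges[(i, j)] = edge_weight(visual_sim[i][j], semantic_strength[i][j])
    return edges


boxes = [(0, 0, 10, 10), (15, 0, 25, 10), (100, 100, 110, 110)]
vis = [[1.0, 0.5, 0.1], [0.5, 1.0, 0.2], [0.1, 0.2, 1.0]]
sem = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
edges = build_edges(boxes, vis, sem)
```

Only the first two boxes are within 20 mm of each other, so the resulting graph has a single edge.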

[0055] The specific implementation of step S02 involves inputting the weighted graph structure representation constructed in step S01 into a cross-modal semantic alignment model for processing. This model comprises two parallel processing paths: a visual encoding branch and a language encoding branch. The visual encoding branch receives the visual information of geometric primitives from the weighted graph structure representation and extracts the local spatial features of each geometric primitive using a convolutional neural network. This convolutional neural network contains five convolutional layers with kernel sizes of 7×7, 5×5, 3×3, 3×3, and 1×1, and channel counts of 64, 128, 256, 512, and 512, respectively. Each convolutional operation is followed by batch normalization and a rectified linear unit (ReLU) activation function. The language encoding branch receives the text annotation information from the weighted graph structure representation, converts each text annotation into a 300-dimensional word vector using a word embedding layer, and then extracts the contextual semantic features of the text using a bidirectional long short-term memory network. This bidirectional long short-term memory network contains two hidden layers, each with 256 neurons. The output features of the visual encoding branch and the language encoding branch are respectively input into a spiral feature expansion layer. This spiral feature expansion layer, through a spiral path connection, allows the output features of layer i to be simultaneously transmitted to layers i+1 and i+2. This spiral path connection enables the fusion of shallow local detail features and deep global structural features at different levels. Then, the visual and language features are input into a multi-head cross-attention mechanism for cross-modal alignment. This multi-head cross-attention mechanism contains eight attention heads.
Each attention head independently calculates the attention weight matrix for visual features versus language features and the attention weight matrix for language features versus visual features. The attention weight matrix is calculated using a scaled dot product attention mechanism. The attention weights are obtained by dividing the dot product of the query vector and the key vector by a scaling factor and then normalizing using a softmax function. Finally, the attention weights are weighted and summed with the value vectors to obtain the attention output. The output features of the eight attention heads are concatenated and then input into the global feature fusion layer. This layer comprises two fully connected networks: the first fully connected network has 1024 neurons, and the second fully connected network has 512 neurons. Residual connections are then performed after the fully connected networks to add the input and output features, ultimately outputting an aligned feature vector with a dimension of 512. This aligned feature vector aligns the visual features of geometric primitives with the linguistic features of text annotations in a unified semantic space, providing a cross-modal fusion feature representation for subsequent information extraction.
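
The scaled dot product cross-attention described for step S02 can be illustrated with the following sketch. It is deliberately simplified and is not the model itself: the learned query, key, and value projections are omitted, each head simply attends over an equal slice of the feature vector, and the function names are hypothetical. Calling it once with visual features as queries and once with language features as queries gives the two attention directions mentioned above.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Queries come from one modality; keys and values come from the other."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Dot product of the query with every key, divided by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append([sum(w * v[c] for w, v in zip(weights, values))
                        for c in range(len(values[0]))])
    return outputs


def multi_head_cross_attention(x_query, x_context, num_heads=8):
    """Split features into num_heads slices, attend per head, then concatenate."""
    d_head = len(x_query[0]) // num_heads
    heads = []
    for h in range(num_heads):
        cols = slice(h * d_head, (h + 1) * d_head)
        q = [row[cols] for row in x_query]
        kv = [row[cols] for row in x_context]
        heads.append(cross_attention(q, kv, kv))
    # Concatenate the per-head outputs for each query row.
    return [sum((head[i] for head in heads), []) for i in range(len(x_query))]


visual = [[0.1 * (i + j) for j in range(16)] for i in range(2)]
text = [[1.0] * 16]
aligned = multi_head_cross_attention(visual, text, num_heads=8)
```

With a single context row every attention weight is 1, so each output row reproduces that row; this makes the example easy to check by hand.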

[0056] The specific implementation of step S03 involves performing hierarchical attention processing on the aligned feature vector output in step S02 to establish long-range dependencies. First, the CAD drawing is partitioned into a quadtree space. This partitioning starts with the complete spatial region of the drawing as the root node. It checks if the number of geometric primitives within the root node exceeds a threshold of 256. If the number exceeds 256, the spatial region of the root node is divided into four equal sub-regions along the horizontal and vertical midlines, each becoming a child node. The same judgment and partitioning operation is recursively performed on each child node. The recursion terminates when the number of geometric primitives within the child node does not exceed the threshold of 256 or the partitioning depth reaches the maximum depth of 5. The quadtree space partitioning divides the drawing into multiple local windows, each corresponding to a leaf node of the quadtree. The number of local windows is adaptively determined based on the distribution density of geometric primitives in the drawing. Then, dense attention is calculated within each local window. The calculation method involves calculating attention weights for every pairwise interaction between all geometric primitive nodes within the local window. Specifically, the alignment feature vector of the j-th primitive node is extracted as the query vector, and the alignment feature vector of the k-th primitive node is extracted as the key and value vectors. The query vector and key vector are multiplied to obtain an attention score. This attention score is then divided by a scaling factor of 8 and normalized using a softmax function to obtain the attention weight. The attention weight and value vector are then weighted and summed to obtain the dense attention output feature of the j-th primitive node. 
This dense attention effectively captures the spatial and semantic relationships between primitives within the local window. Next, a cross-window sparse global token communication mechanism is established. In each local window, the three primitive nodes with the largest L2 norm of their feature vectors are selected as global token nodes. These global token nodes represent key information within their respective local windows. Cross-window attention calculation is performed between the global token nodes of all local windows. The cross-window attention calculation method is the same as the dense attention calculation method, but attention weights are only calculated between global token nodes, resulting in a significantly lower computational complexity than global dense attention. After computation, the output features of each global token node are broadcast back to all primitive nodes within its local window. A feature addition operation is then used to fuse the output features of the global token nodes with the dense attention output features of other nodes within the local window. This sparse global token communication mechanism uses a small number of global token nodes as information bridges between different local windows, establishing long-range dependencies across regions while maintaining computational efficiency. The final output is a long-range dependency feature containing both local dense association information and global sparse dependency information.
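
The recursive quadtree partitioning of step S03 can be sketched as follows. The sketch assumes that each primitive is represented by a single point, for example its centroid, and that the root node has depth 0; both are assumptions, as is the function name `quadtree_windows`. Empty sub-regions are dropped rather than returned as windows.

```python
def quadtree_windows(points, region, max_count=256, max_depth=5, depth=0):
    """Split region (xmin, ymin, xmax, ymax) until each leaf is small enough.

    Returns a list of local windows, each a list of (x, y) points.
    """
    if len(points) <= max_count or depth >= max_depth:
        return [points]
    xmin, ymin, xmax, ymax = region
    xmid, ymid = (xmin + xmax) / 2.0, (ymin + ymax) / 2.0
    # Four equal sub-regions along the horizontal and vertical midlines.
    children = {
        (False, False): (xmin, ymin, xmid, ymid),
        (True, False): (xmid, ymin, xmax, ymid),
        (False, True): (xmin, ymid, xmid, ymax),
        (True, True): (xmid, ymid, xmax, ymax),
    }
    buckets = {key: [] for key in children}
    for p in points:
        buckets[(p[0] >= xmid, p[1] >= ymid)].append(p)
    windows = []
    for key, sub_region in children.items():
        if buckets[key]:
            windows.extend(quadtree_windows(buckets[key], sub_region,
                                            max_count, max_depth, depth + 1))
    return windows


# 1000 points on a regular grid inside the unit square.
grid = [((i + 0.5) / 40, (j + 0.5) / 25) for i in range(40) for j in range(25)]
windows = quadtree_windows(grid, (0.0, 0.0, 1.0, 1.0))
```

Because splitting depends on the local primitive count, dense areas of a drawing end up with more and smaller windows than sparse areas.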

[0057] The specific implementation of step S04 involves inputting the long-range dependency features output in step S03 into the parsing agent and the verification agent for collaborative processing. The parsing agent invokes the fine-tuned multimodal large model, which is based on the LLaVA architecture and undergoes supervised fine-tuning using data from 10,000 annotated CAD drawings. The fine-tuning process employs the cross-entropy loss function to optimize model parameters. The parsing agent inputs the long-range dependency features along with parsing prompts into the fine-tuned multimodal large model. The parsing prompts are structured text instructions, including identifying the type labels of all geometric primitives in the drawings, extracting the dimensional parameters of each geometric primitive (length, width, radius, and angle), extracting the position coordinates of each geometric primitive, identifying the content of all text annotations in the drawings, extracting the position coordinates and font size of each text annotation, and establishing the association between geometric primitives and text annotations. The fine-tuned multimodal large model parses the long-range dependency features according to the instructions of the parsing prompts and outputs an initial recognition result. The initial recognition result is in a structured data format, containing a list of geometric primitives, a list of text annotations, and a list of associations. The verification agent also invokes the fine-tuned multimodal large model, but uses different verification prompts.
These prompts are structured text instructions that include checking whether the size parameters of each geometric primitive are consistent with their associated text annotation values, checking whether the topological relationships between primitives conform to engineering drawing standards (including parallel, perpendicular, and coaxial relationships), and calculating the overall consistency confidence score. The verification agent inputs long-range dependency features, initial recognition results, and verification prompts into the fine-tuned multimodal large model. The model checks the initial recognition results item by item, calculating a local consistency score for each geometric primitive. The calculation of the local consistency score considers three factors: the degree of matching between size parameters and text annotations, the correctness of the topological relationships between primitives, and the rationality of primitive attributes. These three factors are normalized to a range of 0 to 1 and then weighted and summed with weight coefficients of 0.5, 0.3, and 0.2 to obtain the local consistency score. Then, the number of geometric primitives with a local consistency score greater than a threshold of 0.7 is counted, and this number is divided by the total number of primitives to obtain the consistency confidence score. The collaborative mechanism between the parsing agent and the verification agent forms a closed-loop review link through two stages: initial extraction and cross-validation, which significantly improves the accuracy and reliability of information extraction.
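
The scoring arithmetic performed by the verification agent can be sketched as follows. The three factor values are assumed to be already normalized to the range 0 to 1 (how the model produces them is outside this sketch), and the function names are hypothetical. The weights 0.5, 0.3, and 0.2 and the pass threshold 0.7 are the values given above.

```python
def local_consistency(match, topology, plausibility, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of the three normalized factors for one primitive."""
    return weights[0] * match + weights[1] * topology + weights[2] * plausibility


def consistency_confidence(local_scores, pass_threshold=0.7):
    """Fraction of primitives whose local score exceeds the pass threshold."""
    if not local_scores:
        return 0.0
    passed = sum(1 for score in local_scores if score > pass_threshold)
    return passed / len(local_scores)


factors = [(1.0, 1.0, 1.0), (0.9, 0.8, 0.5), (0.4, 0.9, 0.6), (0.8, 0.8, 0.7)]
scores = [local_consistency(*f) for f in factors]
confidence = consistency_confidence(scores)
```

Here three of the four primitives score above 0.7, so the confidence is 0.75, which is below a threshold of 0.85 and would therefore trigger the optimization branch.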

[0058] The specific implementation of step S05 involves determining whether minimum cut segmentation optimization is needed based on the consistency confidence score calculated in step S04. First, the consistency confidence score is compared with a preset threshold of 0.85. When the consistency confidence score is less than 0.85, it indicates that there are many inconsistencies in the initial recognition result, requiring minimum cut segmentation optimization of the weighted graph representation to improve recognition quality. The first step of minimum cut segmentation optimization is to set a source node and a sink node in the weighted graph representation constructed in step S01. The source node is a virtual node connecting all foreground seed nodes. The foreground seed nodes are primitive nodes in the initial recognition result whose local consistency score is higher than the threshold of 0.95. The edge weight from the source node to the foreground seed node is set to infinity, indicating that the foreground seed node must belong to the foreground region. The sink node is a virtual node connecting all background seed nodes. The background seed nodes are primitive nodes in the initial recognition result whose local consistency score is lower than the threshold of 0.50. The edge weight from the background seed node to the sink node is set to infinity, indicating that the background seed node must belong to the background region. Then, the Ford-Fulkerson algorithm is executed to solve for the maximum flow from the source to the sink. The Ford-Fulkerson algorithm works as follows: First, the flow of all edges is initialized to 0. Then, starting from the source, a breadth-first search algorithm is used to find an augmenting path to the sink. This augmenting path is a path from the source to the sink where the remaining capacity of all edges is greater than 0. The minimum remaining capacity of all edges on the found augmenting path is recorded as the bottleneck flow.
The flow of all edges on the augmenting path is increased by the bottleneck flow, while the bottleneck flow is decreased on the reverse edges to maintain flow conservation. This process of finding augmenting paths and updating flow is repeated until no augmenting path from the source to the sink exists, at which point the maximum flow state is reached. According to the maximum flow minimum cut theorem, the minimum cut corresponding to the maximum flow is the set of edges that separates the set of nodes reachable from the source from the set of nodes unreachable from the source. The boundary of the minimum cut has the minimum total edge weight among all possible partitioning schemes, ensuring the global optimality of the partitioning result. Based on the minimum cut boundary, the primitive nodes are re-divided into foreground and background primitives. The foreground primitives are re-inputted into the parsing agent for refined information extraction, while the background primitives are simplified to obtain an optimized recognition result. When the consistency confidence score is greater than or equal to 0.85, it indicates that the initial recognition result is of good quality, and the initial recognition result is directly output as the final recognition result in step S06. The minimum cut segmentation optimization method automatically identifies and corrects erroneous regions in the initial recognition through global optimization, avoiding the cumulative propagation of local decision errors.
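
The seed selection and terminal wiring that precede the maximum flow computation in step S05 can be sketched as follows. The thresholds 0.85, 0.95, and 0.50 are the values given above. Everything else is an assumption made for illustration: the function names, the `"S"` and `"T"` labels for the virtual source and sink, and the use of the raw edge weight as the capacity in both directions of each adjacency edge.

```python
def select_seeds(local_scores, fg_threshold=0.95, bg_threshold=0.50):
    """Return (foreground, background) seed indices from local consistency scores."""
    foreground = [i for i, s in enumerate(local_scores) if s > fg_threshold]
    background = [i for i, s in enumerate(local_scores) if s < bg_threshold]
    return foreground, background


def add_terminals(edges, foreground, background, source="S", sink="T"):
    """Build a capacity map with a virtual source and sink attached to the seeds."""
    capacity = {}
    for (i, j), weight in edges.items():
        capacity.setdefault(i, {})[j] = weight  # assumed symmetric capacity
        capacity.setdefault(j, {})[i] = weight
    for i in foreground:
        capacity.setdefault(source, {})[i] = float("inf")  # must stay foreground
    for i in background:
        capacity.setdefault(i, {})[sink] = float("inf")  # must stay background
    return capacity


def needs_refinement(confidence, threshold=0.85):
    """Minimum cut optimization runs only when confidence is below the threshold."""
    return confidence < threshold


scores = [0.97, 0.80, 0.30]
fg, bg = select_seeds(scores)
graph = add_terminals({(0, 1): 0.7, (1, 2): 0.2}, fg, bg)
```

The resulting capacity map can be passed to any maximum flow routine; nodes that remain reachable from the source after the flow saturates form the foreground side of the cut.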

[0059] The specific implementation of step S06 involves formatting the optimized or final recognition result output in step S05 to generate a complete, user-readable drawing information extraction result. First, the list of geometric elements in the recognition result is traversed, and for each geometric element, its type label, dimension parameters, and position coordinates are extracted. The type label includes categories such as line, circle, rectangle, polygon, spline curve, annotation line, and symbol. The dimension parameters, depending on the geometric element type, include attributes such as length, width, radius, start angle, and end angle. The position coordinates are the centroid coordinates or key point coordinates of the geometric element. Then, the list of text annotations in the recognition result is traversed, and for each text annotation, its content, font size, and position coordinates are extracted. The content is the string representation of the text annotation, the font size is the font size in points, and the position coordinates are the coordinates of the lower left corner of the text annotation. Next, a relationship table is constructed between geometric primitives and text annotations. This table is a two-dimensional table structure, where rows correspond to geometric primitives, columns correspond to text annotations, and cells record the semantic relationship type between the geometric primitives and text annotations. These semantic relationship types include dimensional annotation, material annotation, tolerance annotation, process annotation, and positional annotation. For each pairing of geometric primitives and text annotations, the semantic relationship type is determined by analyzing the content of the text annotation and the attributes of the geometric primitive. 
The determination rules are as follows: if the text annotation content contains numerical values and unit symbols, it is determined to be a dimensional annotation relationship; if the text annotation content contains material names or material numbers, it is determined to be a material annotation relationship; if the text annotation content contains tolerance symbols, it is determined to be a tolerance annotation relationship; if the text annotation content contains process terms, it is determined to be a process annotation relationship; and if the text annotation content contains positional descriptive words, it is determined to be a positional annotation relationship. Finally, all extracted geometric primitive information, text annotation information, and relationship tables are integrated into a unified data structure, and the complete drawing information extraction results are output in JSON or XML format. The complete drawing information extraction results fully retain all the key information in the CAD drawings, providing structured data support for downstream applications such as engineering analysis, manufacturing planning, and quality inspection.
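
The rule-based determination of the semantic relationship type and the JSON output of step S06 can be sketched as follows. The source names the five rule categories but not their vocabularies, so the regular expressions below are hypothetical placeholders. Applying the rules in the order they are listed is also an assumption; under that order a string such as "10 mm ±0.05" would be classified as a dimensional annotation.

```python
import json
import re

# Hypothetical patterns; a real system would need domain-specific vocabularies.
RULES = [
    ("dimension", re.compile(r"\d+(?:\.\d+)?\s*(?:mm|cm|in|°)(?![A-Za-z])")),
    ("material", re.compile(r"steel|aluminum|copper|Q235", re.IGNORECASE)),
    ("tolerance", re.compile(r"±")),
    ("process", re.compile(r"weld|anneal|polish|chamfer", re.IGNORECASE)),
    ("position", re.compile(r"left|right|top|bottom|center", re.IGNORECASE)),
]


def relationship_type(text):
    """Return the first matching relationship type, or None if no rule fires."""
    for name, pattern in RULES:
        if pattern.search(text):
            return name
    return None


def build_result(primitives, annotations, links):
    """Merge primitives, annotations, and typed associations into one JSON string."""
    associations = [
        {"primitive": p, "annotation": a,
         "type": relationship_type(annotations[a]["content"])}
        for p, a in links
    ]
    return json.dumps({"primitives": primitives, "annotations": annotations,
                       "associations": associations}, ensure_ascii=False, indent=2)


primitives = {"p1": {"type": "circle", "radius": 12.5, "position": [0.4, 0.6]}}
annotations = {"t1": {"content": "25 mm", "font_size": 10, "position": [0.42, 0.55]}}
result = build_result(primitives, annotations, [("p1", "t1")])
```

Keeping the rules in an ordered list makes the precedence explicit and easy to change if a different tie-breaking policy is preferred.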

[0060] It should be noted that one of the key technical ideas of this invention is to construct a cross-modal semantic alignment model using a spiral progressive network structure and a multi-head cross-attention feature fusion computing framework. The receptive field is expanded layer by layer through spiral path connections. A smaller receptive field is maintained in shallow networks to capture local details of geometric primitives, while the receptive field is expanded in deep networks to capture global structural features of the drawing. The cross-layer connections of the spiral path allow features at different scales to fuse during the forward propagation of the network, avoiding the problem of shallow features being lost in deep networks, as is common in traditional convolutional neural networks. The multi-head cross-attention mechanism learns the association patterns between visual and linguistic features from different representational subspaces through multiple independent attention heads. A dynamic weight allocation mechanism adaptively adjusts the contribution weights of different attention heads according to the content of the input drawing, achieving accurate alignment of geometric primitives and text annotations in a unified semantic space, significantly reducing the loss of semantic information caused by differences in feature representation spaces.

[0061] The second key technical approach involves introducing a segmentation optimization mechanism based on the minimum cut algorithm. CAD drawings are modeled as weighted graph structures, and the minimum cut boundary is obtained by solving the maximum flow using the Ford-Fulkerson algorithm, achieving global optimization of the initial recognition results. The minimum cut algorithm guarantees the global optimality of the segmentation boundary and can automatically handle drawing layouts with complex topologies. Compared to traditional threshold segmentation and region growing methods, the minimum cut algorithm avoids the cumulative propagation of local decision errors through global optimization, exhibiting stronger robustness when processing engineering drawings containing multi-connected components and nested structures, and effectively improving the recognition accuracy of low-confidence regions.

[0062] The third key technical approach involves constructing a multi-agent collaborative review mechanism based on a parsing agent and a verification agent. The parsing agent is responsible for initial information extraction, while the verification agent performs consistency checks on the extraction results and calculates confidence levels. A closed-loop review chain is formed through message passing and iterative optimization between the two agents. This multi-agent collaborative mechanism overcomes the limitations of traditional single-path processing flows. By introducing dynamic cross-validation and self-correction capabilities, it detects and corrects errors in real time during information extraction, preventing errors from propagating from upstream to downstream stages. Compared to traditional methods that rely on static rule bases, the multi-agent collaborative mechanism has stronger autonomous reasoning capabilities and adaptability, and can handle new processes and non-standard annotation methods not covered by the rule base.

[0063] The synergistic effect of the three key technical approaches mentioned above provides an end-to-end optimization solution for CAD drawing information extraction. The spiral progressive combination network structure solves the semantic gap problem of multimodal feature alignment, providing high-quality cross-modal fusion features for subsequent information extraction. The minimum cut segmentation optimization mechanism corrects erroneous regions in the initial recognition at the global level, and the multi-agent collaborative review mechanism establishes a dynamic quality control system at the process level. The three work together to form a complete technical chain from feature extraction through global optimization to quality verification. Compared with traditional linear processing flows and static rule matching methods, this invention significantly improves the accuracy and robustness in processing complex engineering drawings through multi-level optimization and verification mechanisms. At the same time, model quantization and operator fusion technology ensure the computational efficiency of real-time inference, providing a feasible technical path for the automated processing of large-scale CAD drawings in industrial applications.

[0064] It should be noted that this invention also solves the following technical problem: high computational complexity leads to low information extraction efficiency when processing large engineering drawings containing a large number of interrelated primitives. This invention recursively decomposes the drawing space into multiple local windows using a quadtree spatial partitioning method. Each local window contains no more than 256 primitives. Dense attention is computed only within each local window, so its cost is quadratic in the window size (at most 256 primitives) rather than in the total number of primitives, and the total cost grows approximately linearly with the number of windows. A sparse global token communication mechanism selects the three primitive nodes with the largest feature norm in each local window as global token nodes. Long-range dependencies are established only through cross-window attention calculations between global token nodes, avoiding the high cost of global attention calculations for all primitives. Simultaneously, model quantization technology converts the weight parameters from 32-bit floating-point precision to 8-bit integer precision, compressing storage space to 25% of the original model while increasing computational speed by three times. Operator fusion optimization technology merges consecutive convolution operations, batch normalization operations, and activation function operations into a single fused operator, reducing memory access frequency. Therefore, while maintaining semantic alignment accuracy, the invention significantly reduces computational complexity and improves information extraction efficiency.
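
The conversion of 32-bit floating-point weights to 8-bit integers can be illustrated with the following sketch of symmetric quantization (zero-point offset of 0). The rule used here for the scaling factor, the largest absolute weight divided by 127, is a common convention and is an assumption of this sketch rather than a formula taken from this section; the function names are likewise hypothetical. Python's built-in `round` rounds exact halves to the nearest even integer.

```python
def quantize_symmetric(weights, num_bits=8):
    """Map floats to signed integers in [-128, 127] with a zero-point of 0."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    qmin = -qmax - 1                # -128 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # assumed scaling rule
    quantized = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integer representation."""
    return [q * scale for q in quantized]


weights = [-1.0, 0.0, 0.25, 1.0]
q, scale = quantize_symmetric(weights)
restored = dequantize(q, scale)
```

Each 8-bit value occupies one byte instead of four, which corresponds to the 25% storage figure stated above; the reconstruction error per weight is at most half the scaling factor.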

[0065] Specifically, the principle of this invention is as follows: This invention can solve the technical problem of inaccurate cross-modal semantic alignment because it adopts a spiral progressive network structure to achieve layer-by-layer expansion of the receptive field and multi-scale feature fusion. In the shallow network, a small receptive field is maintained to capture local details of geometric primitives, such as line thickness and angle information. In the deep network, the receptive field is expanded to capture the global structural features of the drawing, such as the spatial layout relationships between primitives. The cross-layer connection of the spiral path avoids the loss of shallow features in the deep network. The multi-head cross-attention mechanism learns the association patterns between visual and linguistic features from different representation subspaces through eight independent attention heads. Each attention head focuses on different types of semantic relationships, including the correspondence between dimension labels and contours, and the association between material labels and primitives. The dynamic weight allocation mechanism adaptively adjusts the contribution weights of different attention heads according to the content of the input drawing. The contrastive learning framework optimizes the feature space distribution through a triplet loss function, minimizing the Euclidean distance between visual and textual features of the same primitive and maximizing the feature distance between different primitives. This establishes a precise bidirectional association mapping between the visual and linguistic spaces, fundamentally eliminating cross-modal feature representation differences.
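
The triplet loss mentioned above can be sketched as follows. The anchor is taken to be the visual feature of a primitive, the positive its matching text feature, and the negative the feature of a different primitive. The margin value of 0.2 and the function names are hypothetical, since the source does not specify them.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the positive is closer to the anchor than the negative by the margin."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


well_separated = triplet_loss([0.0, 0.0], [0.0, 0.0], [3.0, 4.0])
badly_aligned = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 0.0])
```

Minimizing this quantity pulls matching visual and text features together and pushes non-matching features apart, which is the behaviour described for the contrastive learning framework.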

[0066] Specific Embodiment 1 of the present invention is provided below. The specific implementation methods of steps S02, S04 and S06 in Embodiment 1 are the same as those described above and are not repeated here. The specific implementation methods of the other steps are described in detail below.

[0067] The specific implementation of step S01 involves converting and standardizing the format of the input CAD drawing, using the geometric primitives in the drawing as nodes, extracting the spatial adjacency relationships between nodes as edges, and assigning edge weights. The calculation formula is expressed as follows:

[0068] $w_{ij} = \alpha \cdot \frac{V_{ij}}{V_0} + \beta \cdot \frac{S_{ij}}{S_0}$;

[0069] In the formula, $w_{ij}$ is the edge weight between the $i$-th primitive node and the $j$-th primitive node, dimensionless; $i$ and $j$ are primitive node numbers, each ranging from 1 to the total number of primitives in the drawing; $V_{ij}$ is the visual similarity between the $i$-th primitive node and the $j$-th primitive node, dimensionless, ranging from 0 to 1; $V_0$ is the reference value for visual similarity normalization, dimensionless, with a value of 1; $S_{ij}$ is the semantic association strength between the $i$-th primitive node and the $j$-th primitive node, dimensionless, ranging from 0 to 1; $S_0$ is the reference value for semantic association strength normalization, dimensionless, with a value of 1; $\alpha$ is the visual similarity weight coefficient, dimensionless, with a default value of 0.6; $\beta$ is the semantic association strength weight coefficient, dimensionless, with a default value of 0.4. The visual similarity $V_{ij}$ is obtained by calculating the cosine similarity between the shape feature vectors of the two geometric primitives, and the semantic association strength $S_{ij}$ is obtained by calculating the reciprocal of the Euclidean distance between the word embedding vectors of the text annotations corresponding to the two primitives.

[0070] In a specific implementation of step S03, hierarchical attention processing includes dense attention computation within local windows and sparse global token communication across windows. The dense attention computation calculates the output feature $\mathbf{o}_j$ of the $j$-th primitive node within each local window. The calculation formula is expressed as follows:

[0071] $\mathbf{o}_j = \sum_{k=1}^{N} \frac{\exp\left(\mathbf{q}_j \cdot \mathbf{k}_k / \sqrt{d_k}\right)}{\sum_{m=1}^{N} \exp\left(\mathbf{q}_j \cdot \mathbf{k}_m / \sqrt{d_k}\right)} \mathbf{v}_k$;

[0072] In the formula, $\mathbf{o}_j$ is the output feature vector of the $j$-th primitive node, with a dimension of 512; $j$ is the primitive node number within the current local window, ranging from 1 to $N$; $N$ is the total number of primitive nodes within the current local window; $\mathbf{q}_j$ is the query vector of the $j$-th primitive node, with a dimension of 64; $\mathbf{k}_k$ is the key vector of the $k$-th primitive node, with a dimension of 64; $k$ is the primitive node number within the current local window, ranging from 1 to $N$; $\mathbf{v}_k$ is the value vector of the $k$-th primitive node, with a dimension of 512; $m$ is the summation index, ranging from 1 to $N$; $d_k$ is the dimension of the key vector, used for the scaling factor calculation, with a default value of 64; $\sqrt{d_k}$ is the scaling factor; $\exp$ is the natural exponential function. All leaf nodes obtained from the quadtree spatial partitioning correspond to local windows, and the number of primitives in each local window does not exceed 256. The sparse global token nodes are selected based on the feature norm. The calculation formula is expressed as follows:

[0073] $\lVert \mathbf{o}_j \rVert_2 = \sqrt{\sum_{l=1}^{D} o_{j,l}^{2}}$;

[0074] In the formula, $\lVert \mathbf{o}_j \rVert_2$ is the L2 norm of the feature vector of the $j$-th primitive node, dimensionless; $\mathbf{o}_j$ is the output feature vector of the $j$-th primitive node after dense attention computation; $o_{j,l}$ is the $l$-th component of the feature vector $\mathbf{o}_j$, dimensionless; $l$ is the dimension index of the feature vector, ranging from 1 to $D$; $D$ is the total dimension of the feature vector, with a default value of 512. Within each local window, the three primitive nodes with the largest $\lVert \mathbf{o}_j \rVert_2$ are selected as global token nodes.
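
The norm-based selection of global token nodes can be sketched as follows. The function names are hypothetical, and ties are resolved in favour of the node that appears first in the window, which is an assumption since the source does not address ties.

```python
import math


def l2_norm(vector):
    """Square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in vector))


def select_global_tokens(features, k=3):
    """Indices of the k feature vectors with the largest L2 norm in one window."""
    order = sorted(range(len(features)),
                   key=lambda j: l2_norm(features[j]), reverse=True)
    return order[:k]


window = [[1.0, 0.0], [3.0, 4.0], [0.0, 2.0], [6.0, 8.0], [0.5, 0.5]]
tokens = select_global_tokens(window)
```

The norms in this toy window are 1, 5, 2, 10, and about 0.71, so the nodes at indices 3, 1, and 2 are selected.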

[0075] In the specific implementation of step S05, the consistency confidence level $C$ is calculated as follows:

[0076] $C = \frac{N_{\text{pass}}}{N_{\text{total}}}$;

[0077] In the formula, $C$ is the consistency confidence level, dimensionless, ranging from 0 to 1; $N_{\text{pass}}$ is the number of primitives that passed the consistency check; $N_{\text{total}}$ is the total number of primitives. When $C$ is lower than the preset threshold, minimum cut optimization is performed, and the edge capacity used in the minimum cut algorithm is calculated as follows:

[0078] ;

[0079] In the formula, the quantities are as follows: the capacity of the edge between the i-th primitive node and the j-th primitive node, dimensionless; the edge weight, dimensionless; the consistency confidence feature value of the i-th primitive node, dimensionless, ranging from 0 to 1; the average consistency confidence feature value of the foreground seed nodes, dimensionless, ranging from 0 to 1; the average consistency confidence feature value of the background seed nodes, dimensionless, ranging from 0 to 1; and a small constant that prevents division by zero, dimensionless, usually taken as 1. The flow initialization formula for the Ford-Fulkerson algorithm is expressed as follows:

[0080] $$f_{ij}^{(0)} = 0$$;

[0081] In the formula, $f_{ij}^{(0)}$ is the initial flow from the $i$-th primitive node to the $j$-th primitive node, dimensionless. The bottleneck flow $\Delta$ of an augmenting path is calculated as follows:

[0082] $$\Delta = \min_{(i,j) \in P} \left( c_{ij} - f_{ij}^{(k)} \right)$$;

[0083] In the formula, $\Delta$ is the bottleneck flow, dimensionless; $P$ is an augmenting path from the source to the sink; $(i,j)$ is an edge on path $P$, representing the connection from the $i$-th primitive node to the $j$-th primitive node; $c_{ij}$ is the capacity of the edge, dimensionless; $f_{ij}^{(k)}$ is the flow of the edge at the $k$-th iteration, dimensionless; $k$ is the iteration index, ranging from 0 to the total number of iterations at which the algorithm converges; $\min$ is the minimum value function. The flow update formula is expressed as follows:

[0084] $$f_{ij}^{(k+1)} = f_{ij}^{(k)} + \Delta, \quad (i,j) \in P$$;

[0085] In the formula, $f_{ij}^{(k+1)}$ is the flow of the edge at the $(k+1)$-th iteration, dimensionless. The Ford-Fulkerson algorithm repeatedly finds augmenting paths and updates the flow, and finally yields the minimum cut boundary.
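The three steps described above (zero initial flow, bottleneck along an augmenting path, flow update) can be sketched as a small Ford-Fulkerson routine. This is a minimal sketch, not the patented implementation: the depth-first path search, the dictionary-based graph layout, the function name, and the toy capacities are all assumptions made for illustration. The minimum cut is read off at the end as the set of edges leaving the nodes still reachable from the source in the residual graph.

```python
from collections import defaultdict


def max_flow_min_cut(capacity, source, sink):
    """Ford-Fulkerson with depth-first augmenting paths.

    capacity: dict {(u, v): c_uv} of non-negative edge capacities.
    Returns (max_flow_value, sorted list of minimum-cut edges).
    """
    residual = defaultdict(float)
    neighbours = defaultdict(set)
    for (u, v), c in capacity.items():
        residual[(u, v)] += c      # flow starts at 0, so residual = capacity
        neighbours[u].add(v)
        neighbours[v].add(u)       # reverse direction allows flow to be undone

    def find_path():
        """Return one source->sink path in the residual graph, or None."""
        stack, parent = [source], {source: None}
        while stack:
            u = stack.pop()
            if u == sink:
                path = []
                while parent[u] is not None:
                    path.append((parent[u], u))
                    u = parent[u]
                return path[::-1]
            for v in neighbours[u]:
                if v not in parent and residual[(u, v)] > 1e-12:
                    parent[v] = u
                    stack.append(v)
        return None

    total = 0.0
    while True:
        path = find_path()
        if path is None:
            break
        # Bottleneck: smallest remaining (capacity - flow) along the path.
        bottleneck = min(residual[edge] for edge in path)
        for u, v in path:
            residual[(u, v)] -= bottleneck   # push flow forward
            residual[(v, u)] += bottleneck   # record it as undoable
        total += bottleneck

    # Source side of the cut: nodes still reachable in the residual graph.
    reachable, stack = {source}, [source]
    while stack:
        u = stack.pop()
        for v in neighbours[u]:
            if v not in reachable and residual[(u, v)] > 1e-12:
                reachable.add(v)
                stack.append(v)
    cut = sorted((u, v) for (u, v) in capacity
                 if u in reachable and v not in reachable)
    return total, cut


# Toy graph: "s" plays the source (foreground side), "t" the sink.
capacity = {("s", "a"): 4, ("s", "b"): 3, ("a", "t"): 2,
            ("b", "t"): 5, ("a", "b"): 1}
flow, cut = max_flow_min_cut(capacity, "s", "t")
```

In this toy graph the cut separates `{s, a}` from `{b, t}`; in the setting of the description, the source side would correspond to the primitives kept with the foreground seeds.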

[0086] The receptive field coverage $R_k$ of the spiral feature extension layer in the cross-modal semantic alignment model is calculated as follows:

[0087] $$R_k = \frac{A_k}{A_{\text{total}}} \times 100\%$$;

[0088] In the formula, $R_k$ is the receptive field coverage percentage of the $k$-th spiral extension unit; $k$ is the index of the spiral extension unit, ranging from 1 to 4; $A_k$ is the area of the drawing covered by the receptive field of the $k$-th spiral extension unit; $A_{\text{total}}$ is the total area of the original drawing, expressed in the same unit of area as $A_k$. The coverage is 6.25% for the first spiral extension unit, 25% for the second, 56.25% for the third, and 100% for the fourth.
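As a quick arithmetic check, the coverage values given in claim 3 (6.25%, 25%, 56.25%, 100%) are what the area ratio yields if the receptive field of the k-th unit spans k/4 of the drawing along each axis. That geometric reading is an observation about the stated numbers, not something the patent text asserts, and the A0 sheet size used below is chosen only for illustration.

```python
def coverage_percent(covered_area, total_area):
    """Receptive field coverage as a percentage of the drawing area."""
    return 100.0 * covered_area / total_area


# A0 sheet (841 mm x 1189 mm); any sheet size gives the same percentages.
WIDTH_MM, HEIGHT_MM = 841.0, 1189.0
total_area = WIDTH_MM * HEIGHT_MM

coverages = []
for k in range(1, 5):
    side_fraction = k / 4   # assumed: field spans k/4 of each axis
    covered = (WIDTH_MM * side_fraction) * (HEIGHT_MM * side_fraction)
    coverages.append(round(coverage_percent(covered, total_area), 2))
```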

[0089] The quantization mapping of the weight parameters in the model quantization technique is expressed as follows:

[0090] $$w_q = \operatorname{round}\left(\frac{w}{s}\right) + z$$;

[0091] In the formula, $w_q$ is the quantized 8-bit integer weight value, dimensionless, ranging from -128 to 127; $w$ is the original 32-bit floating-point weight value, dimensionless; $z$ is the zero-point offset, dimensionless, which is 0 for symmetric quantization; $s$ is the quantization scaling factor, dimensionless; $\operatorname{round}$ is the rounding function. The quantization scaling factor $s$ is calculated as follows:

[0092] $$s = \frac{\max |w|}{127}$$;

[0093] In the formula, $s$ is the quantization scaling factor, dimensionless; $\max |w|$ is the maximum absolute value of the original floating-point weights, dimensionless; 127 is the maximum positive value of the 8-bit integer representation. After quantization, the model's storage space is compressed to 25% of that of the original model, the computation speed is increased to 3 times that of the original model, and the quantization accuracy loss is kept within 0.8%.
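The symmetric 8-bit mapping described above (scale equal to the largest absolute weight divided by 127, zero point 0, result limited to the range -128 to 127) can be illustrated with plain Python lists. This is a sketch under assumptions: the function names and sample weights are hypothetical, the clamping step is added defensively, and a real deployment would apply this per tensor or per channel inside an inference framework.

```python
def quantize_symmetric_int8(weights):
    """Map float weights to int8 values with a zero point of 0.

    Returns (quantized_values, scale). Note that Python's round()
    rounds exact halves to the nearest even integer.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    zero_point = 0  # symmetric quantization
    quantized = [
        max(-128, min(127, round(w / scale) + zero_point))
        for w in weights
    ]
    return quantized, scale


def dequantize(quantized, scale, zero_point=0):
    """Approximate reconstruction of the original float weights."""
    return [(q - zero_point) * scale for q in quantized]


weights = [-1.27, -0.5, 0.0, 0.64, 1.27]
q, scale = quantize_symmetric_int8(weights)
restored = dequantize(q, scale)
```

Each restored weight differs from the original by at most half a quantization step, which is the source of the small accuracy loss the description refers to. Storing one byte instead of four per weight also matches the stated compression to 25% of the original size.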

[0094] The triplet loss function $L_{\text{tri}}$ used in training the cross-modal semantic alignment model is calculated as follows:

[0095] $$L_{\text{tri}} = \max\left(0,\ \frac{d_{\text{pos}}}{\lVert d_{\text{pos}} \rVert_2} - \frac{d_{\text{neg}}}{\lVert d_{\text{neg}} \rVert_2} + m\right)$$;

[0096] In the formula, $L_{\text{tri}}$ is the triplet loss value, dimensionless; $d_{\text{pos}}$ is the Euclidean distance in the feature space between the visual and linguistic features of a positive sample pair, dimensionless; $d_{\text{neg}}$ is the Euclidean distance in the feature space between the visual and linguistic features of a negative sample pair, dimensionless; $\lVert d_{\text{pos}} \rVert_2$ is the L2 norm of the positive sample distance, used as a unit-length normalization factor, dimensionless; $\lVert d_{\text{neg}} \rVert_2$ is the L2 norm of the negative sample distance, used as a unit-length normalization factor, dimensionless; $m$ is the boundary parameter, dimensionless, with a value of 0.5; $\max$ is the maximum value function. The model weight parameters are updated using the backpropagation algorithm, and training runs for 100 epochs until the loss value converges.
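To show how a margin of 0.5 shapes the loss, here is a minimal sketch of a triplet margin loss. It deliberately swaps in a common variant: the feature vectors are unit-normalized before the Euclidean distances are taken, whereas the description applies the normalization to the distances themselves. Treating the visual feature as the anchor and the matching and non-matching text features as positive and negative is likewise an assumption of this sketch.

```python
import math


def normalize(vec):
    """Scale a non-zero vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Standard triplet margin loss on unit-normalized features."""
    a, p, n = normalize(anchor), normalize(positive), normalize(negative)
    return max(0.0, euclidean(a, p) - euclidean(a, n) + margin)


# Well-separated triplet: matching text is close, non-matching is far.
easy = triplet_loss([1.0, 0.0], [2.0, 0.0], [0.0, 3.0])
# Badly ordered triplet: matching text is far, non-matching is close.
hard = triplet_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

The loss is zero once the negative pair is at least the margin farther apart than the positive pair, so only triplets that violate the margin contribute gradient during training.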

[0097] To aid understanding and implementation of the present invention, a specific application scenario is described below as Example 2:

[0098] To verify the practical application effect of the present invention, technicians set up a test environment and evaluated the technical performance of the present invention by processing a dataset of engineering drawings accumulated in a mechanical manufacturing field. The test dataset contains 150 CAD engineering drawings of varying complexity, covering four types: shaft parts drawings, box-type parts drawings, gear parts drawings, and assembly drawings. The drawing formats include DWG and DXF, the drawing sizes range from A3 to A0, the number of geometric primitives in a single drawing ranges from 80 to 650, and the number of text annotations ranges from 40 to 380.

[0099] Technicians first preprocessed the drawings in the test dataset, converting all DWG and DXF drawings into vector graphics representations. Geometric primitives (lines, arcs, and spline curves) and text annotations were extracted from each drawing. Statistical results showed that the 150 drawings contained a total of 48,560 geometric primitives and 23,720 text annotations. A weighted graph structure representation was constructed for each drawing by calculating the spatial adjacency relationships between nodes; with the adjacency threshold set to 20 mm, a total of 126,840 edges were generated. Edge weights were calculated as a weighted combination of visual similarity and semantic association strength, with weight coefficients of 0.6 and 0.4, respectively.
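The graph construction in this paragraph can be sketched as follows. The sketch assumes that each node is reduced to a single reference point in millimetres and that adjacency means the distance between two reference points is at most 20 mm; the patent does not say how distance between primitives is measured. The two scoring callables, `visual_sim` and `semantic_assoc`, are hypothetical placeholders returning values between 0 and 1.

```python
import math
from itertools import combinations


def build_weighted_graph(nodes, visual_sim, semantic_assoc,
                         adjacency_mm=20.0, alpha=0.6, beta=0.4):
    """Connect nodes closer than adjacency_mm and weight each edge.

    nodes: dict mapping node id -> (x, y) reference point in mm.
    visual_sim, semantic_assoc: callables (id_a, id_b) -> float in [0, 1]
    (hypothetical scoring functions for this sketch).
    """
    edges = {}
    for (id_a, pos_a), (id_b, pos_b) in combinations(nodes.items(), 2):
        if math.dist(pos_a, pos_b) <= adjacency_mm:
            edges[(id_a, id_b)] = (alpha * visual_sim(id_a, id_b)
                                   + beta * semantic_assoc(id_a, id_b))
    return edges


nodes = {
    "line_1": (0.0, 0.0),
    "dim_text_1": (12.0, 5.0),    # 13 mm from line_1
    "arc_1": (100.0, 100.0),      # far from both
}
edges = build_weighted_graph(
    nodes,
    visual_sim=lambda a, b: 0.5,       # constant stand-in scores
    semantic_assoc=lambda a, b: 1.0,
)
```

Only the line and its nearby dimension text are connected here, with weight 0.6 × 0.5 + 0.4 × 1.0 = 0.7.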

[0100] Technicians input the constructed weighted graph structure representation into the cross-modal semantic alignment model for processing. This model employs a spiral progressive network structure and a multi-head cross-attention feature fusion computation framework. The visual encoding branch contains 5 convolutional layers, the language encoding branch contains 2 bidirectional long short-term memory layers, and the multi-head cross-attention mechanism includes 8 attention heads. The model outputs an aligned feature vector with a dimension of 512. As shown in Figure 2, the distribution of the aligned feature vectors in two-dimensional space exhibits obvious clustering characteristics, with different types of geometric primitives forming clear boundaries in the feature space. Technicians then performed hierarchical attention processing on the aligned feature vectors, using quadtree spatial partitioning to divide each drawing into multiple local windows. The partitioning threshold was set to 256 primitives, and the maximum partitioning depth was set to 5 layers. Statistical results show that the 150 drawings were divided into an average of 12.6 local windows each, with a minimum of 4 windows and a maximum of 28 windows.
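The quadtree partitioning rule (split while a region holds more than 256 primitives, stop at depth 5) can be sketched recursively. This is an illustrative sketch, not the patented code: primitives are reduced to hypothetical (x, y) reference points, points lying exactly on a split line are assigned to the upper or right child, and empty children are kept as windows.

```python
def quadtree_windows(points, bounds, threshold=256, max_depth=5, depth=0):
    """Recursively split a region into local windows.

    points: list of (x, y) reference points of primitives.
    bounds: (xmin, ymin, xmax, ymax) of the current region.
    Returns a list of leaf windows, each a list of points.
    """
    if len(points) <= threshold or depth >= max_depth:
        return [points]
    xmin, ymin, xmax, ymax = bounds
    xmid, ymid = (xmin + xmax) / 2, (ymin + ymax) / 2
    # Key = (is right half, is upper half) -> bounds of that child.
    children = {
        (False, False): (xmin, ymin, xmid, ymid),
        (True, False): (xmid, ymin, xmax, ymid),
        (False, True): (xmin, ymid, xmid, ymax),
        (True, True): (xmid, ymid, xmax, ymax),
    }
    buckets = {key: [] for key in children}
    for x, y in points:
        buckets[(x >= xmid, y >= ymid)].append((x, y))
    leaves = []
    for key, child_bounds in children.items():
        leaves.extend(quadtree_windows(buckets[key], child_bounds,
                                       threshold, max_depth, depth + 1))
    return leaves


# Toy run with a small threshold so that one split is visible.
grid = [(x, y) for x in range(4) for y in range(4)]   # 16 points
windows = quadtree_windows(grid, (0, 0, 4, 4), threshold=4)
single = quadtree_windows(grid, (0, 0, 4, 4))          # default threshold 256
```

The depth limit bounds the number of windows at 4^5 = 1024 per drawing, at the cost of allowing more than 256 primitives in a window of a very dense region.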

[0101] As shown in Figure 3, dense attention is computed within each local window, with the scaling factor set to 8, and a sparse global token communication mechanism is used across windows. The three primitive nodes with the largest feature norms in each local window are selected as global token nodes. Statistical results show that 5,670 global token nodes were generated across the 150 drawings, accounting for 11.7% of the total number of primitives. Technicians input the long-range dependency features into the parsing agent and the verification agent, respectively. The parsing agent uses parsing prompts to extract initial information, while the verification agent uses verification prompts to perform consistency checks. The verification agent calculates a local consistency score for each geometric primitive, and the local consistency scores of all geometric primitives are summarized in Table 1.

[0102] Table 1. Statistical table of local consistency scores

[0103]

[0104] Technicians determined whether minimum cut segmentation optimization was needed based on the consistency confidence computed by the verification agent, with the preset threshold set to 0.85. Statistical results showed that 32 of the 150 drawings had a consistency confidence below 0.85 and therefore required minimum cut segmentation optimization. For these 32 drawings, source and sink nodes were set in the weighted graph structure representation. Primitives with a local consistency score above 0.95 were selected as foreground seed nodes, and primitives with a local consistency score below 0.50 were selected as background seed nodes. Statistical results showed that the 32 drawings contained an average of 186 foreground seed nodes and 15 background seed nodes. Technicians executed the Ford-Fulkerson algorithm to solve for the maximum flow. The algorithm required an average of 23 iterations, and the resulting minimum cut boundary contained an average of 42 edges. After the primitives were re-partitioned according to the minimum cut boundary, refined information extraction was performed on the foreground primitives. As shown in Figure 4, the consistency confidence of the optimized recognition results increased from an average of 0.78 before optimization to an average of 0.91 after optimization.
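The decision logic in this paragraph (threshold 0.85 on the drawing-level confidence, seeds at local scores above 0.95 and below 0.50) can be sketched as below. The function names, the dictionary layout, and the sample scores are hypothetical, and the sketch takes the per-element pass/fail flags as given, because the text does not state how a single element is judged to have passed.

```python
def consistency_confidence(passed_flags):
    """Fraction of graphic elements that passed the consistency check."""
    return sum(passed_flags) / len(passed_flags)


def needs_min_cut(confidence, threshold=0.85):
    """Minimum cut optimization is triggered below the preset threshold."""
    return confidence < threshold


def select_seeds(local_scores, fg_min=0.95, bg_max=0.50):
    """Split primitives into foreground and background seed lists.

    local_scores: dict mapping primitive id -> local consistency score.
    Primitives scoring between bg_max and fg_min are left unlabelled.
    """
    foreground = [p for p, s in local_scores.items() if s > fg_min]
    background = [p for p, s in local_scores.items() if s < bg_max]
    return foreground, background


local_scores = {"p1": 0.98, "p2": 0.97, "p3": 0.90, "p4": 0.40, "p5": 0.99}
passed = [True, True, True, False, True]   # hypothetical check results

confidence = consistency_confidence(passed)
trigger = needs_min_cut(confidence)
foreground, background = select_seeds(local_scores)
```

In this toy drawing four of five elements pass, giving a confidence of 0.8, so the minimum cut step would run with three foreground seeds and one background seed.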

[0105] Technicians formatted all the recognition results, extracted the type label, size parameters and position coordinates of each geometric element, extracted the content, font size and position coordinates of each text annotation, and constructed a table of association between geometric elements and text annotations. The statistical results show that a total of 23,150 associations were established for 150 drawings, and the distribution of association types is shown in Table 2.

[0106] Table 2. Statistical Table of Distribution of Association Types

[0107]

[0108] Technicians performed model quantization and operator fusion optimization on the cross-modal semantic alignment model, converting the 32-bit floating-point weights to an 8-bit integer format. The quantized model's storage space was compressed from 1.2GB to 0.3GB, and the average processing time per drawing decreased from 8.6s before quantization to 2.8s after quantization. Technicians also analyzed the processing performance of different drawing types: the average processing time for shaft-type parts was 1.9s, for box-type parts 2.5s, for gear-type parts 3.1s, and for assembly drawings 4.2s. Processing time was positively correlated with drawing complexity.

[0109] Technicians analyzed the collaborative effect of the parsing agent and the verification agent. Statistical results showed that, after review by the verification agent, the accuracy rate for matching dimension annotations with geometric primitives was 96.8%, the accuracy rate for identifying topological relationships between primitives was 94.5%, and the accuracy rate for extracting text annotations was 98.2%. For 32 drawings requiring minimum cut segmentation optimization, the optimized matching accuracy increased to 98.1%, and the topological relationship identification accuracy increased to 97.3%, indicating that minimum cut segmentation optimization effectively improved the recognition quality of low-confidence regions. Technicians further analyzed the processing effect on drawings of different complexities. The average consistency confidence score for simple drawings containing 80 to 200 primitives was 0.93, the average consistency confidence score for medium-complexity drawings containing 200 to 400 primitives was 0.88, and the average consistency confidence score for high-complexity drawings containing 400 to 650 primitives was 0.82, indicating that the present invention has good adaptability to drawings of different complexities.

[0110] Technicians compared the performance of the method described in this invention with that of traditional methods when processing the same test dataset. Traditional methods, employing a single-path processing flow and a static rule base for information extraction, exhibited significant limitations when handling drawings containing non-standard annotations and complex topological relationships. This invention achieves effective fusion of multi-scale features through a spiral-progressive network structure, capturing local detail features in shallow networks and global structural features in deep networks. This avoids the problem of shallow features being lost in deep networks, as seen in traditional convolutional neural networks, significantly improving the cross-modal semantic alignment accuracy between geometric primitives and text annotations. The multi-head cross-attention mechanism adaptively adjusts the contribution weights of different attention heads through dynamic weight allocation, flexibly handling different types of semantic relationships based on the content characteristics of the input drawing. Compared to traditional fixed-rule matching methods, it has stronger generalization ability and adaptability. The minimum cut segmentation optimization mechanism automatically identifies and corrects erroneous regions in the initial identification through global optimization, ensuring the global optimality of the segmentation boundary and avoiding the cumulative propagation problem of local decision errors in traditional threshold segmentation and region growing methods. The multi-agent collaborative review mechanism, consisting of a parsing agent and a verification agent, achieves dynamic cross-validation and self-correction through a closed-loop quality control process. 
Compared to traditional single-path processing, it exhibits stronger robustness and reliability, demonstrating autonomous reasoning and decision-making capabilities when handling new processes and non-standard annotations not covered by the rule base. Model quantization and operator fusion optimization techniques significantly reduce model storage space and computation time, enabling the method of this invention to maintain high accuracy while possessing real-time processing capabilities. This provides a feasible technical path for the automated processing of large-scale CAD drawings in industrial applications.

[0111] It should be noted that the variables involved in this invention are explained in detail in Table 3.

[0112] Table 3. Variable Explanation Table

[0113]

[0114] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any changes or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in the present invention should be included within the scope of protection of the present invention.

Claims

1. A method for extracting CAD drawing information based on a multimodal large model to construct a multi-agent system, characterized in that it comprises: The input CAD drawings are format-converted and standardized to construct a weighted graph structure representation; The weighted graph structure representation is input into the cross-modal semantic alignment model for multimodal feature extraction. The cross-modal semantic alignment model simultaneously processes the visual features of geometric figures and the linguistic features of text annotations. The features are projected and aligned cross-modally through a contrastive learning framework, and the aligned feature vector is output. The aligned feature vector is subjected to hierarchical attention processing. First, the drawing is divided into multiple local windows by quadtree spatial partitioning. Dense attention is calculated in each local window. A sparse global token communication mechanism is used to establish long-range dependencies across windows and output long-range dependency features. The long-range dependency features are input into the parsing agent and the verification agent, respectively. The parsing agent is responsible for initial information extraction and outputs the initial recognition result. The verification agent performs a consistency check on the initial recognition result and calculates the consistency confidence. When the consistency confidence is lower than a preset threshold, the weighted graph structure representation is optimized by minimum cut segmentation. In the weighted graph structure representation, source nodes are set to connect foreground seeds and sink nodes are set to connect background seeds. The minimum cut boundary is obtained by solving the maximum flow using the Ford-Fulkerson algorithm. The primitives are re-divided according to the minimum cut boundary and re-input into the parsing agent for information extraction to obtain the optimized recognition result.
When the consistency confidence is not lower than the preset threshold, the initial recognition result is used as the final recognition result. The optimized recognition result or the final recognition result is formatted, the type label, size parameter and position coordinate of all geometric elements are extracted, the content, font size and position coordinate of all text annotations are extracted, a relationship table between geometric elements and text annotations is established, and the complete drawing information extraction result is output. The cross-modal semantic alignment model employs a spiral progressive combined network structure and a multi-head cross-attention feature fusion computation framework. The spiral progressive combined network structure includes an initial feature encoding layer, a spiral feature expansion layer, and a global feature fusion layer. The initial feature encoding layer contains a visual encoding branch and a language encoding branch. The visual encoding branch performs convolution operations on geometric figures to extract local spatial features, while the language encoding branch performs word embedding operations on text annotations to extract discrete semantic features. The multi-head cross-attention mechanism includes eight attention heads, each independently calculating the attention weights of visual features on language features and the attention weights of language features on visual features. Through cross-attention operations, a bidirectional association mapping between geometric primitives and text annotations is established. The spiral feature expansion layer contains four cascaded spiral expansion units. The output features of the i-th layer in each spiral expansion unit are simultaneously transmitted to the i+1 and i+2 layers.

2. The method according to claim 1, characterized in that the construction of the weighted graph structure representation refers to extracting the spatial adjacency relationships between geometric primitives in the drawing as edges, and calculating the edge weights based on visual similarity and semantic association strength.

3. The method according to claim 2, characterized in that the receptive field coverage of the spiral path is 6.25% of the original drawing area in the first spiral extension unit, 25% in the second spiral extension unit, 56.25% in the third spiral extension unit, and reaches 100% in the fourth spiral extension unit.

4. The method according to claim 3, characterized in that the global feature fusion layer concatenates the output features of the 4 spiral expansion units and the 8 attention heads, and outputs an aligned feature vector with a dimension of 512 after passing through two fully connected network layers and residual connection operations.

5. The method according to claim 4, characterized in that, when the cross-modal semantic alignment model is deployed, model quantization technology is used to convert the weight parameters from 32-bit floating-point precision to 8-bit integer precision. Symmetric quantization is used to map the weight values, and dynamic quantization is used for the activation values, with the quantization parameters determined dynamically based on the distribution of activation values in each batch of input data.

6. The method according to claim 5, characterized in that the cross-modal semantic alignment model employs operator fusion optimization technology during deployment. This technology identifies consecutive convolution operations, batch normalization operations, and activation function operations in the computation graph, and merges these consecutive operations into a single fusion operator. The fusion operator completes all computations within a single GPU kernel function call.

7. The method according to claim 6, characterized in that the training of the cross-modal semantic alignment model uses a triplet loss function. The Euclidean distance between the visual and linguistic features of a positive sample pair is calculated and recorded as the positive sample distance. The Euclidean distance between the visual and linguistic features of a negative sample pair is calculated and recorded as the negative sample distance. The normalized negative sample distance is subtracted from the normalized positive sample distance and the boundary parameter is added; the larger of this result and zero is used as the loss value.

8. The method according to claim 7, characterized in that the quadtree spatial partitioning for hierarchical attention processing takes the input drawing space region as the root node, and determines whether the number of primitives in the root node exceeds the threshold of 256. If it does, the root node space region is divided into 4 sub-regions along the horizontal and vertical directions to obtain 4 child nodes. The partitioning is repeated until the number of primitives in every leaf node does not exceed the threshold of 256 or the partitioning depth reaches the maximum depth of 5.