Method and system for automatic parsing of system block diagrams and task question answering based on multi-source fusion

By employing a multi-source fusion-based system block diagram automatic parsing and task question answering method, combined with target detection, image preprocessing, and visual large model fine-tuning, the problem of accurate identification of components and connections and restoration of topological relationships in complex engineering images is solved, improving the accuracy and reliability of the parsing results and meeting the engineering requirements of chip design.

CN121835929B · Active · Publication Date: 2026-06-16 · NANJING UNIV OF POSTS & TELECOMM

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: NANJING UNIV OF POSTS & TELECOMM
Filing Date: 2026-03-13
Publication Date: 2026-06-16


Abstract

The application belongs to the technical field of electronic design automation and discloses a method and system for automatic parsing of system block diagrams and task question answering based on multi-source fusion. The method preprocesses a system block diagram, uses a target detection model to recognize components in the preprocessed image, extracts line segments with a multi-algorithm fusion strategy, and constructs a physical topology. It determines the signal flow direction with a four-level strategy that includes loop closure and local consistency, creates a circuit knowledge question-answer dataset, and fine-tunes Qwen2.5-VL-3B with a secondary fine-tuning method that proceeds from overall reasoning to local reinforcement. The question-answer text generated by the large model is then parsed, and the question-answer result undergoes structured checking and path correction based on the connectivity of the physical topology. Through closed-loop fusion of visual hard constraints and semantic soft reasoning, the application addresses text interference and line segment breakage in block diagrams as well as hallucinated connections produced by the large model, and significantly improves the accuracy of complex circuit diagram parsing and the explainability of the results.

Description

Technical Field

[0001] This invention belongs to the field of Electronic Design Automation (EDA) technology, specifically relating to a method and system for automatic parsing of system block diagrams and task question answering based on multi-source fusion.

Background Technology

[0002] In modern EDA workflows, system block diagrams serve as the core carrier describing chip module structure, signal flow, timing control, and logical connections. Their accuracy and completeness directly determine the efficiency and reliability of design intent understanding, design document review, functional verification, and simulation modeling. As chip system integration and complexity continue to increase, system block diagrams generally exhibit characteristics such as diverse component types, dense and intersecting connections, rich text annotations, and complex hierarchical structures. Furthermore, in real-world engineering scenarios, system block diagrams come from various sources, including hand-drawn sketches, scanned copies, drawings exported from different tools, and non-standardized drafting. The quality of these drawings varies significantly, exhibiting issues such as blurred lines, broken or adhered lines, text and graphics obscuring each other, and small, non-standard arrow shapes, further complicating automatic parsing.

[0003] Traditional system block diagram analysis schemes often rely on single image processing algorithms or object detection models. These methods lack robustness when dealing with complex connections, weak arrow features, text occlusion, and low-quality images, making it difficult to accurately identify module ports and reconstruct the true topological connections between modules. This leads to problems such as misidentification of connections, omission of ports, and incorrect signal flow identification. In large-scale, multi-level, multi-branched, and intertwined complex system block diagrams, existing methods are more significantly affected by image noise, port offsets, and sparse or blurred arrows, commonly resulting in lost connections, incomplete topological reconstruction, and logical errors.

[0004] Currently, some studies attempt to directly use multimodal large models to understand and answer questions about system block diagrams. However, because these approaches lack structured parsing constraints, topological relationship verification mechanisms, and engineering verification processes for chip design, the models are prone to generating topological relationship "hallucinations" that are inconsistent with the actual drawings. This results in insufficient reliability of the question-answering results, which cannot meet the stringent requirements of chip design scenarios for parsing accuracy, result stability, and engineering credibility. Consequently, it is difficult to support automatic parsing and intelligent question answering for complex system block diagrams.

Summary of the Invention

[0005] To address the aforementioned technical problems, this invention provides a method and system for automatic parsing of system block diagrams and task question answering based on multi-source fusion. This solves the problem that existing single methods struggle to accurately identify components and connections, restore the true topology, and provide highly reliable task answers in complex and diverse engineering image environments. Simultaneously, it reduces the overhead of manual annotation and correction, and improves the verifiability and engineering applicability of the parsing results.

[0006] The present invention discloses a method for automatic parsing of system block diagrams and task question answering based on multi-source fusion, comprising the following steps:

[0007] Step 1: Obtain the original image of the system block diagram, preprocess the original image to obtain a denoised and enhanced structured image;

[0008] Step 2: Use a preset target detection model to identify components in the structured image to obtain the set of component categories and the corresponding component bounding box coordinates in the system block diagram;

[0009] Step 3: Perform line segment detection on the structured image based on a multi-algorithm fusion strategy to extract the set of line segments in the image;

[0010] Step 4: Based on the components, component input ports, output ports, and the detected set of line segments, construct a three-level relationship between components, ports, and line segments to generate the topological connection relationship of the system block diagram;

[0011] Step 5: Construct a self-annotated question-answering pair dataset, and use a secondary fine-tuning method from overall reasoning to local reinforcement to fine-tune the Qwen2.5-VL-3B visual large model to obtain a task question-answering large model adapted to the system block diagram scenario;

[0012] Step 6: Input the constructed topology and system block diagram image into the task question-and-answer model to generate system block diagram analysis results and corresponding question-and-answer answers.

[0013] Furthermore, step 1 specifically includes:

[0014] Optical Character Recognition (OCR) technology is used to extract the coordinates of all text regions in the image, and a binary mask with the same size as the original image is generated. The pixel values of the region covered by the binary mask are set to the background color to eliminate the interference of text characters with line extraction.

[0015] Calculate the noise variance of a local region of the image. If the noise variance is lower than a preset threshold, Gaussian filtering is used for denoising; otherwise, median filtering is used for denoising.

[0016] A contrast-limited adaptive histogram equalization algorithm is used to enhance the edge contrast of lines in the image, and adaptive binarization is performed to obtain a structured image.
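The text masking and filter selection described for step 1 can be sketched as follows. This is a minimal illustration in plain Python rather than the patented implementation: the function names, the `(x0, y0, x1, y1)` box format, and the use of raw pixel variance of a patch as a stand-in for the noise variance estimate are all assumptions made for this example.

```python
from statistics import pvariance


def mask_text_regions(image, text_boxes, background=255):
    """Return a copy of a grayscale image (list of rows) in which every
    OCR text box is filled with the background value.

    text_boxes holds (x0, y0, x1, y1) pixel boxes, end-exclusive (assumed format).
    """
    out = [row[:] for row in image]
    height, width = len(out), len(out[0])
    for x0, y0, x1, y1 in text_boxes:
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                out[y][x] = background
    return out


def choose_denoiser(patch, variance_threshold):
    """Select the filter named in the text: Gaussian when the local
    variance is below the threshold, median otherwise.

    Pixel variance of the patch is used as a simple proxy for noise variance.
    """
    pixels = [p for row in patch for p in row]
    return "gaussian" if pvariance(pixels) < variance_threshold else "median"
```

In a real pipeline the returned label would select between a Gaussian blur and a median blur from an image library; the threshold itself is left as a parameter because the text only calls it "preset".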

[0017] Furthermore, step 2 specifically involves:

[0018] Step 2-1: Construct a hybrid-driven data stream. The data stream is based on the system block diagram with real annotations, anchors real image features, and combines synthetic data generated in batches by scripts to expand sample diversity and cover edge scenes, thereby improving the model's generalization ability.

[0019] Step 2-2: The hybrid-driven data stream is input into the target detection model, and the component category set and the bounding box coordinates of each component are obtained through model inference.

[0020] Furthermore, step 3 specifically involves:

[0021] Step 3-1: Use probabilistic Hough transform to extract global long straight lines in the image, set an adaptive threshold of 20% of the minimum component size, and capture the main links;

[0022] Step 3-2: For the local small line segments missed by the Hough transform, the Line Segment Detector (LSD) algorithm is used to supplement the detection and form a full set of candidate line segments;

[0023] Step 3-3: The Density-based Spatial Clustering of Applications with Noise (DBSCAN) algorithm is used to cluster and merge line segments in the candidate set of line segments based on an angle threshold of ±2° and a distance tolerance threshold of 5 pixels, thereby eliminating line segment overlap and breakage issues and restoring complete connections.

[0024] Step 3-4: Extend both ends of each line segment obtained in Step 3-3 by 2.5% of that segment's length, and remove interference lines inside the component frame as well as interference lines approximately parallel to the component frame, thereby obtaining a complete, noise-free set of line segments.
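The merging in Step 3-3 can be illustrated with a short sketch. It uses the ±2° angle threshold and the 5-pixel distance tolerance from the text, but replaces DBSCAN with plain connected-component grouping (union-find), which is a deliberate simplification. The segment format `((x1, y1), (x2, y2))` and all function names are assumptions made for this example.

```python
import math


def angle_gap(a, b):
    """Smallest angle in degrees between two undirected segments."""
    def angle(seg):
        (x1, y1), (x2, y2) = seg
        return math.degrees(math.atan2(y2 - y1, x2 - x1)) % 180.0
    d = abs(angle(a) - angle(b))
    return min(d, 180.0 - d)


def point_to_segment(p, seg):
    """Euclidean distance from point p to the closest point on seg."""
    (x1, y1), (x2, y2) = seg
    dx, dy = x2 - x1, y2 - y1
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return math.hypot(p[0] - x1, p[1] - y1)
    t = ((p[0] - x1) * dx + (p[1] - y1) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(p[0] - (x1 + t * dx), p[1] - (y1 + t * dy))


def segment_gap(a, b):
    """Smallest endpoint-to-segment distance between two segments."""
    return min(point_to_segment(a[0], b), point_to_segment(a[1], b),
               point_to_segment(b[0], a), point_to_segment(b[1], a))


def merge_segments(segments, max_angle=2.0, max_gap=5.0):
    """Group near-collinear, nearby segments and replace each group
    with one segment spanning its extreme endpoints."""
    parent = list(range(len(segments)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            if (angle_gap(segments[i], segments[j]) <= max_angle
                    and segment_gap(segments[i], segments[j]) <= max_gap):
                parent[find(i)] = find(j)

    groups = {}
    for i, seg in enumerate(segments):
        groups.setdefault(find(i), []).append(seg)

    merged = []
    for group in groups.values():
        # Project all endpoints onto the direction of the longest member.
        (x1, y1), (x2, y2) = max(group, key=lambda s: math.dist(s[0], s[1]))
        ux, uy = x2 - x1, y2 - y1
        points = sorted((p for seg in group for p in seg),
                        key=lambda p: p[0] * ux + p[1] * uy)
        merged.append((points[0], points[-1]))
    return merged
```

For example, two horizontal fragments separated by a 3-pixel break are joined into one segment, while a vertical segment elsewhere in the image is left untouched.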

[0025] Furthermore, step 4 specifically involves:

[0026] On the bounding box of each component, the endpoints of the line segments are searched according to the principle of intersection between the bounding box and the line segments, and then the positions of the input and output ports of each component are inferred. Based on the correspondence between the ports and the line segments, an undirected physical topology is constructed.

[0027] A four-level directional reasoning strategy is used to determine the signal flow direction of each connection path, specifically:

[0028] The first level prioritizes searching for and closing feedback loops that can form strongly connected components, and determines the signal direction of such loops;

[0029] The second level calculates the distance between the midpoint of each connection path and the geometric center of the detected arrow. If the distance is less than a preset threshold, the connection path is assigned a signal flow direction consistent with the arrow.

[0030] The third level involves statistically analyzing the in-degree and out-degree characteristics of each component node. Combined with the arrow information obtained from the previous level, components with fewer inputs and more outputs are defined as source-type components, and components with more inputs and fewer outputs are defined as destination-type components. The signal flow direction is then determined from source-type components to destination-type components.

[0031] At the fourth level, for connection paths whose direction remains uncertain after the above three levels of reasoning, a default signal flow direction is assigned based on the heuristic principle of left to right and top to bottom, ensuring that all connection paths in the topology have a clear direction.
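To make the fallback order concrete, the sketch below covers only the second level (arrow proximity) and the fourth level (left-to-right, top-to-bottom default) of the strategy above; loop closure and degree analysis are omitted. The path and arrow representations, the function name, and the 15-pixel default threshold are assumptions for illustration, since the text only speaks of a "preset threshold".

```python
import math


def orient_path(path, arrows, max_arrow_distance=15.0):
    """Return ((source_point, target_point), rule) for one connection path.

    path   : (a, b), the two end points of the path in image coordinates
    arrows : list of (center, direction) with direction as a (dx, dy) vector
    """
    a, b = path
    mid = ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)
    dx, dy = b[0] - a[0], b[1] - a[1]

    # Level 2: an arrow whose geometric center lies close to the path midpoint
    # decides the direction.
    for center, direction in arrows:
        if math.dist(mid, center) < max_arrow_distance:
            forward = dx * direction[0] + dy * direction[1] >= 0
            return ((a, b) if forward else (b, a)), "arrow"

    # Level 4: default heuristic. Mostly horizontal paths run left to right,
    # mostly vertical paths run top to bottom (image y grows downward).
    forward = dx >= 0 if abs(dx) >= abs(dy) else dy >= 0
    return ((a, b) if forward else (b, a)), "default"
```

Returning the name of the rule that fired alongside the direction makes it easy to log which paths were resolved by evidence and which only by the default heuristic.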

[0032] Furthermore, in step 5, the Qwen2.5-VL-3B visual large model is used as the base model. A low-rank adapter fine-tuning technique is selected, and a secondary fine-tuning method from overall reasoning to local reinforcement is used to construct a task-oriented question-answering large model, including:

[0033] The question-answer pair dataset is divided into a first question-answer pair dataset and a second question-answer pair dataset.

[0034] The first fine-tuning uses the first question-answer pair dataset, with the circuit system block diagram as the input sample, questions about the circuit system as the instruction input, and the complete inference chain of component analysis - loop analysis - system function analysis - question semantic parsing - question answer generation as the output, to achieve the model's adaptation to the global structure of the system block diagram.

[0035] The second fine-tuning uses the second question-answer pair dataset, with the circuit system block diagram and the corresponding component, loop, and system function analyses as input samples, and questions about the circuit system as instruction input. The inference chain from question semantic parsing to question answer generation is used as output, strengthening the model's ability to analyze local details and answer questions.
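The difference between the two fine-tuning passes lies in what counts as input and what counts as output. The sketch below shows one possible record layout for each pass; the field names (`image`, `context`, `instruction`, `output`) and the chain keys are hypothetical and are not taken from the patent.

```python
# Order of the reasoning chain as described in the text (key names are assumed).
CHAIN_KEYS = [
    "component_analysis",
    "loop_analysis",
    "system_function_analysis",
    "question_parsing",
    "answer",
]


def _join(chain, keys):
    return "\n".join(f"{key}: {chain[key]}" for key in keys)


def stage1_record(image_path, question, chain):
    """First pass: image + question in, the complete five-step chain out."""
    return {
        "image": image_path,
        "instruction": question,
        "output": _join(chain, CHAIN_KEYS),
    }


def stage2_record(image_path, question, chain):
    """Second pass: the three analyses move to the input side, and only
    question parsing and the answer remain as the training target."""
    return {
        "image": image_path,
        "context": _join(chain, CHAIN_KEYS[:3]),
        "instruction": question,
        "output": _join(chain, CHAIN_KEYS[3:]),
    }
```

With this layout the second pass spends its training signal on question parsing and answer generation only, which matches the "local reinforcement" role the text assigns to it.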

[0036] Furthermore, step 6 specifically includes:

[0037] Based on the fine-tuned task question-answering model, structured prompt engineering oriented towards circuit system analysis is adopted to guide the model to combine the original system block diagram image with the topological connection relationship obtained in step 4 and perform two-stage question-answering reasoning.

[0038] In the first stage, a hierarchical controlled generation strategy is adopted, setting a maximum generation length for each reasoning sub-stage. The preset maximum generation length, i.e., the token limit, is lower than the generation scale required for fully expanding the reasoning of the sub-stage. It is only used to complete the generation of core information, guiding the model to sequentially complete component function analysis, loop flow analysis, system function analysis, question analysis, and question answer derivation within the controlled information scale, generating a complete and traceable system reasoning chain; including:

[0039] 1) Component Function Analysis Phase: Within the maximum generation length (128 Tokens), generate core information related to the component's input-output relationship and functional category;

[0040] 2) Loop Flow Analysis Stage: Within the maximum generation length (256 Tokens), signal flow and loop structure analysis results are generated based on the topological connection relationship between components;

[0041] 3) System Function Analysis Phase: Within the maximum generation length (128 Tokens), the overall system function is summarized in structured form, and a brief explanation of the collaborative relationships between modules and the system-level mechanism of action is generated.

[0042] 4) Problem Analysis and Problem Answer Derivation Stage: Within the maximum generation length (256 Tokens), the aforementioned reasoning results are integrated and analyzed to generate the derivation logic corresponding to the problem.

[0043] The low token limits set in the first stage may not suffice to generate a fully expanded result. They are chosen to cover the core information needed for chain-of-thought reasoning, which is enough for the large model to carry out continuous logical expansion and causal deduction internally, while significantly reducing the large time cost that a chain of thought otherwise introduces. Because the limited output may leave the question unanswered, a second stage is adopted: the reasoning chain formed in the first stage is fed back to the model as context, guiding it to converge on a result, extract the key information, and output the final answer. In this stage the maximum generation length is limited to 64 tokens and only the final answer may be output. Finally, the system block diagram analysis results and the corresponding answers are output as structured JavaScript Object Notation (JSON) so that the output is standardized and parsable.
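The two-stage procedure with its per-stage token limits can be sketched as a small driver loop. The budgets (128, 256, 128, 256, then 64) come from the text; everything else is an assumption for illustration, in particular the `generate(prompt, max_new_tokens)` callable, which stands in for whatever inference interface wraps the fine-tuned model, and the prompt wording.

```python
import json

# Stage names are illustrative; the token limits are those stated in the text.
STAGE_BUDGETS = [
    ("component_function_analysis", 128),
    ("loop_flow_analysis", 256),
    ("system_function_analysis", 128),
    ("question_analysis_and_derivation", 256),
]
FINAL_ANSWER_BUDGET = 64


def two_stage_answer(generate, topology_json, question):
    """Run the budgeted reasoning chain, then force a short final answer.

    generate is a hypothetical callable: generate(prompt, max_new_tokens) -> str.
    """
    chain = []
    for stage, budget in STAGE_BUDGETS:
        history = "\n".join(f"[{s['stage']}] {s['text']}" for s in chain)
        prompt = (
            f"Topology:\n{topology_json}\nQuestion: {question}\n"
            f"Previous reasoning:\n{history}\nNow perform: {stage}"
        )
        chain.append({"stage": stage, "text": generate(prompt, budget)})

    # Second stage: feed the whole chain back and allow only the answer.
    history = "\n".join(f"[{s['stage']}] {s['text']}" for s in chain)
    final_prompt = (
        f"Reasoning chain:\n{history}\nQuestion: {question}\n"
        "Output only the final answer."
    )
    answer = generate(final_prompt, FINAL_ANSWER_BUDGET)
    return json.dumps({"reasoning_chain": chain, "answer": answer})
```

Passing the model call in as a parameter keeps the control flow testable with a stub and independent of any particular inference library.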

[0044] The present invention also provides a system for automatic parsing of system block diagrams and task question answering based on multi-source fusion, characterized in that, for implementing the above method, it includes: a target detection unit, an image preprocessing unit, a line segment detection unit, a topology construction unit, a model fine-tuning unit, a question answering reasoning unit, and a result exporting unit;

[0045] The object detection unit is used to perform object detection on the input image and output bounding boxes of components and arrows;

[0046] The image preprocessing unit performs grayscale conversion, denoising, Contrast Limited Adaptive Histogram Equalization (CLAHE), binarization, and morphological optimization.

[0047] The line segment detection unit is used to progressively run Hough and LSD detection to obtain a candidate set of line segments, and use DBSCAN clustering, bidirectional extension, and elimination of invalid line segments to obtain the final optimized set of line segments;

[0048] The topology building unit, including the port automatic inference module, line segment path scoring module, bifurcation point detection module and loop identification module, is used to generate port sets, endpoint and port matching, path search and direction reasoning, and output the final topology structure;

[0049] The model fine-tuning unit is used to fine-tune the Qwen2.5-VL-3B visual large model using a self-constructed question-answering pair dataset and a secondary fine-tuning method from global reasoning to local reinforcement, so as to obtain a task question-answering large model that adapts to the system block diagram scenario.

[0050] The question-answering reasoning unit is used to accept the topological structure and image features, call the large model to perform two-stage question answering, and return structured answers. It includes a prompt management module, a chain-of-thought generation module, and a structured extraction module.

[0051] The Results Export unit is used to export the parsing results and question-and-answer answers as JSON, visualizations, and log files.

[0052] The present invention also provides an electronic device, including a processor, a memory, and a program executable on the processor, wherein the program, when executed, implements the above-described method.

[0053] The present invention also provides a computer-readable storage medium having a program stored thereon, wherein the program, when executed, implements the above-described method.

[0054] The beneficial effects of this invention are as follows:

[0055] 1) This invention adopts a multi-source architecture that integrates target detection, traditional image algorithms, and large visual models, achieving a high detection rate for components and line segments in noisy and diverse engineering images and thereby significantly improving the accuracy of topology reconstruction.

[0056] 2) Introducing a three-level modeling strategy of components, ports, and line segments, and a four-level directional reasoning strategy, effectively reduces misconnections and omissions caused by line breaks, text interference, or port offsets, and improves the engineering usability of the parsing results.

[0057] 3) A secondary fine-tuning method from overall reasoning to local reinforcement was designed, and a circuit knowledge question-answering dataset was independently constructed to improve the ability of large models to answer knowledge in the circuit domain;

[0058] 4) By using a phased chain of thought, the risk of errors caused by large-model hallucination is significantly reduced, making the task question-answering output both linguistically expressive and structurally verifiable;

[0059] 5) The system supports pipelined parallelization and lightweight model deployment, ensuring analytical accuracy while meeting engineering requirements for computing resources and latency;

[0060] 6) Output structured JSON and credibility score to facilitate subsequent automated verification, manual review and secondary use, and improve the efficiency of automated design review, knowledge extraction and test preparation in EDA scenarios.

Attached Figure Description

[0061] Figure 1 is a flowchart of the method described in this invention;

[0062] Figure 2 is a schematic diagram of image preprocessing and line segment detection in an embodiment of the present invention;

[0063] Figure 3 is a schematic diagram illustrating the extraction of topology and connection directions in an embodiment of the present invention;

[0064] Figure 4 is a schematic diagram representing the final connection relationship in an embodiment of the present invention.

Detailed Implementation

[0065] To make the content of this invention easier to understand, the invention will be further described in detail below with reference to specific embodiments and accompanying drawings.

[0066] As shown in Figure 1, the present invention provides a method for automatic parsing of system block diagrams and task question answering based on multi-source fusion, comprising the following steps:

[0067] Step 1: Obtain the original image of the system block diagram, preprocess the original image to obtain a denoised and enhanced structured image;

[0068] Step 2: Use a preset target detection model to identify components in the structured image to obtain the set of component categories and the corresponding component bounding box coordinates in the system block diagram;

[0069] Step 3: Perform line segment detection on the structured image based on a multi-algorithm fusion strategy to extract the set of line segments in the image;

[0070] Step 4: Based on the components, component input ports, output ports, and the detected set of line segments, construct a three-level relationship between components, ports, and line segments to generate the topological connection relationship of the system block diagram;

[0071] Step 5: Construct a self-annotated question-answering pair dataset, and use a secondary fine-tuning method from overall reasoning to local reinforcement to fine-tune the Qwen2.5-VL-3B visual large model to obtain a task question-answering large model adapted to the system block diagram scenario;

[0072] Step 6: Input the constructed topology and system block diagram image into the task question-and-answer model to generate system block diagram analysis results and corresponding question-and-answer answers.

[0073] In this embodiment, step 1 involves reading the input system block diagram image file through the image preprocessing module and performing grayscale conversion, noise filtering, local contrast enhancement, adaptive binarization, and multi-scale morphological opening and closing operations on the image to enhance the continuity of lines and reduce interference from text and symbols on structure recognition. This step aims to generate an image with a clean structure and prominent lines, providing unified high-quality input features for component detection and line segment detection.

[0074] In step 2, the object detection model runs inference on the preprocessed image to identify the targets in it, including rectangular components, circular components, triangular components, trapezoidal components, pentagonal components, bounding boxes, and arrow and text targets; the category label and bounding box coordinates of each target are output.

[0075] To avoid the influence of text on line detection, this embodiment uses text regions as masks after detection, allowing the line segment detection module to obtain a more continuous and complete logical connection graphic structure. Non-Maximum Suppression (NMS) and category-independent NMS are used to further remove duplicate bounding boxes, and feature enhancement is performed in advance for the small target category of arrows to improve recall.

[0076] In this embodiment, the target detection model may employ a YOLO11n network, including a cross-stage component, a backbone network, and a path aggregation network. The input data stream is connected to the backbone network via the cross-stage component, where the gradient flow is split and processed separately. Deep semantic features are then extracted through feature concatenation. Subsequently, the path aggregation network performs top-down semantic transfer and bottom-up localization enhancement, achieving multi-scale feature fusion so that the model can capture both large components and minute arrow features in complex data. During the training phase, the Complete Intersection over Union (CIoU) loss function is used to measure the overlap area, center distance, and aspect ratio consistency between the predicted and ground truth bounding boxes, constraining the predicted boxes to regress to the ground truth labeled values.
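The three terms of the loss mentioned above (overlap area, center distance, aspect ratio consistency) can be written out directly. The following is a plain-Python sketch of the standard CIoU formulation for a single pair of axis-aligned boxes in `(x1, y1, x2, y2)` form with positive width and height. It is meant to show what each term measures; it is not the training code of the embodiment, and the function names are chosen for this example.

```python
import math


def ciou(box_a, box_b):
    """Complete IoU of two boxes given as (x1, y1, x2, y2), width and height > 0."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Overlap term: plain intersection over union.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union

    # Center distance term, normalized by the diagonal of the enclosing box.
    enc_w = max(ax2, bx2) - min(ax1, bx1)
    enc_h = max(ay2, by2) - min(ay1, by1)
    diag_sq = enc_w ** 2 + enc_h ** 2
    center_sq = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
        + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2

    # Aspect ratio consistency term.
    v = (4 / math.pi ** 2) * (
        math.atan((bx2 - bx1) / (by2 - by1)) - math.atan((ax2 - ax1) / (ay2 - ay1))
    ) ** 2
    alpha = v / (1 - iou + v) if v > 0 else 0.0

    return iou - center_sq / diag_sq - alpha * v


def ciou_loss(predicted, target):
    """Loss is zero for a perfect match and grows as the boxes diverge."""
    return 1.0 - ciou(predicted, target)
```

Unlike plain IoU, the loss keeps growing for disjoint boxes as their centers move apart, which gives a useful gradient even when the prediction does not yet overlap the label.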

[0077] The YOLO11n base model was trained using the above process, and the model's coefficient of determination reached 0.976, indicating that the predicted values are close to the true values. During deployment, overlapping detection boxes were filtered using an Intersection over Union (IoU) threshold of 0.7 to obtain the classification results of component categories and their corresponding region coordinates.
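Filtering overlapping detections with an IoU threshold of 0.7 corresponds to greedy non-maximum suppression. The sketch below is a minimal category-independent version in plain Python; the detection tuple layout `(box, score, label)` and the function names are assumptions for illustration.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.7):
    """Greedy, category-independent NMS.

    detections: list of (box, score, label). A detection is kept only if it
    does not overlap an already kept, higher-scoring one by more than the threshold.
    """
    kept = []
    for det in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(det[0], other[0]) <= iou_threshold for other in kept):
            kept.append(det)
    return kept
```

For instance, two rectangle detections covering 90% of the same area collapse to the higher-scoring one, while a distant arrow detection is unaffected.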

[0078] In step 3, the focus is on the effectiveness of line segment detection to achieve optimal connection relationship reasoning. Therefore, two algorithms, probabilistic Hough transform and LSD, are used to obtain the candidate set of line segments. Hough transform is more robust to long straight lines, while LSD performs better when handling thin lines and short segments. To fully utilize the advantages of both algorithms, the line segments obtained by the two algorithms are merged. Since some identified line segments have angular deviations, merging requires special consideration. Therefore, a merging algorithm based on angular similarity and endpoint distance clustering is used to fuse duplicate line segments.

[0079] To address potential text interference, bounding box interference, component border interference, and line segment interference within component areas that may appear in the system block diagram, this embodiment further implements the following optimizations:

[0080] 1) Line segment extension and breakpoint completion: Extend the line segment to both ends to cross the tiny gaps near the component boundary, and fix the incorrect line segment recognition caused by incomplete removal of component borders;

[0081] 2) Removal of internal line segments: Since lines inside the component border are not considered, lines located inside the component border will interfere with the overall result. Line segments located inside the component border need to be removed to prevent incorrect connections.

[0082] 3) False connection filtering: Remove redundant line segments that are parallel to the component boundary but do not constitute a real port connection. At the same time, perform DBSCAN clustering and merging. Multiple short lines that are almost collinear can be regarded as a single line and need to be clustered and merged.
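Two of the optimizations above, segment extension and removal of segments inside a component, are simple geometric operations. The sketch below uses the 2.5% extension ratio given earlier for Step 3-4. The segment and box formats and the function names are assumptions, and treating "inside" as "both endpoints within the bounding box" is a simplification chosen for this example.

```python
def extend_segment(segment, ratio=0.025):
    """Lengthen a segment at both ends by ratio * its own length."""
    (x1, y1), (x2, y2) = segment
    dx, dy = (x2 - x1) * ratio, (y2 - y1) * ratio
    return ((x1 - dx, y1 - dy), (x2 + dx, y2 + dy))


def inside_box(point, box):
    """True if point lies within the (x1, y1, x2, y2) box, borders included."""
    x, y = point
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2


def drop_internal_segments(segments, component_boxes):
    """Discard segments whose two endpoints both fall inside one component."""
    return [
        seg for seg in segments
        if not any(inside_box(seg[0], box) and inside_box(seg[1], box)
                   for box in component_boxes)
    ]
```

A segment that starts inside a component and ends outside it is kept, since it may be a real connection leaving a port.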

[0083] As shown in Figure 2, the green areas represent the complete line segments identified after processing such as line segment extension and breakpoint completion. The figure also shows the regions from which the components, arrows, and text detected in step 2 have been removed, providing a clean and complete base map for subsequent connection relationship reasoning.

[0084] Step 4: Construct the system block diagram topology based on the three-level relationship between components, ports, and line segments, including:

[0085] 1) Component port inference: In this embodiment, the component's geometric shape and the projection direction of the line segment are combined to automatically infer the component port position on each component boundary; if the line segment endpoint is close to the component boundary and satisfies the direction consistency, it is considered a valid port.

[0086] 2) Endpoint and port matching: Match the endpoints of the line segment with the inferred component ports. In this embodiment, a multi-factor scoring mechanism is used, which considers distance, direction consistency, path cleanliness, and whether the line segment crosses other components, to select the most reliable candidate connection and filter out erroneous connections caused by port offset.

[0087] 3) Path search and branching point identification: Path search is performed on the constructed port connectivity graph, and complex paths composed of multiple polyline segments are merged. Branching points in the system block diagram are identified by checking whether a node's degree is greater than or equal to 3; such a node splits the data flow towards different components. As shown in Figure 3, the visual image on the left is mapped to the logical topology on the right, where the specific physical components 1, 2, 3, 4, and 8 are abstracted as topology nodes and the physical connections are parsed as directed edges between nodes, thereby extracting the real topology.

[0088] 4) Direction determination and loop identification: A four-level direction reasoning strategy is employed. First, the closed-loop priority principle identifies topologically closed paths before anything else; when multiple components can form a closed loop, the signal flow direction into and out of the components in the loop represents the overall signal flow direction. Second, arrow-weighted reasoning determines each arrow's direction by analyzing the pixel gradient changes and the centroid offset direction in the arrow region: the side with the larger gradient change is the arrow's tail, and the side towards which the centroid is offset is the arrow's direction. The distance between each connecting path segment and the arrow's geometric center is then calculated, and if this distance is less than a preset threshold, the signal flow direction of that path is set to be consistent with the arrow's direction. Third, local connectivity consistency infers the signal direction by combining in-degree, out-degree, and component functional attributes. Fourth, in the absence of an arrow, a geometric direction strategy performs backtracking inference based on the geometric projection direction. As shown in Figure 4, a topology diagram containing component nodes, port nodes, connection edges, and signal directions is generated on the basis of Figure 2; it is the part drawn in red.
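Items 2) and 3) of the list above can be illustrated with a small sketch. It matches a line endpoint to the nearest component border using distance only, which is just one factor of the multi-factor score described in the text, and it finds branching points as nodes of degree 3 or more. The 6-pixel tolerance, the data layouts, and the function names are assumptions for this example.

```python
from collections import defaultdict


def distance_to_border(point, box):
    """Distance from a point to the border of an (x1, y1, x2, y2) box."""
    x, y = point
    x1, y1, x2, y2 = box
    dx = max(x1 - x, 0, x - x2)
    dy = max(y1 - y, 0, y - y2)
    if dx == 0 and dy == 0:
        # Point is inside the box: distance to the nearest edge.
        return min(x - x1, x2 - x, y - y1, y2 - y)
    return (dx * dx + dy * dy) ** 0.5


def match_endpoint(point, boxes, tolerance=6.0):
    """Return the name of the component whose border is closest to the
    endpoint, or None if no border lies within the tolerance.

    boxes maps a component name to its bounding box.
    """
    best = min(boxes, key=lambda name: distance_to_border(point, boxes[name]),
               default=None)
    if best is None or distance_to_border(point, boxes[best]) > tolerance:
        return None
    return best


def branch_points(edges):
    """Nodes of an undirected edge list whose degree is at least 3."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return sorted(node for node, d in degree.items() if d >= 3)
```

A fuller scoring function would add direction consistency, path cleanliness, and a penalty for crossing other components, as the text describes; those factors are left out here to keep the example short.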

[0089] Step 5: Construct a dataset of circuit knowledge question-and-answer pairs and use it to fine-tune the large model.

[0090] This embodiment uses a self-constructed dataset of 40,000 question-answer pairs and fine-tunes the Qwen2.5-VL-3B visual model through a two-stage fine-tuning method from overall inference to local enhancement. The construction of the dataset calls the application programming interface (API) of the inference model qwen-vl-max.

[0091] During the first fine-tuning, a dataset of 10,000 self-created question-answer pairs was used, with 1,000 circuit system block diagrams as input. The qwen-vl-max API was called, and through prompt engineering, 10 questions were obtained for each diagram together with the corresponding complete reasoning chain covering component function analysis, loop flow analysis, system function recognition, question semantic parsing, and question answer generation. By organizing the output of qwen-vl-max, a dataset of the overall reasoning process was constructed. After filtering and cleaning, it was used to fine-tune the Qwen2.5-VL-3B visual large model.

[0092] During the second fine-tuning, a self-created dataset of 30,000 question-answer pairs was used, along with the dataset built during the first fine-tuning. 1,000 circuit system block diagrams and the corresponding component functions, loop flow, and system functions were used as input. The qwen-vl-max API was called, and through prompt engineering, 30 questions were obtained for each diagram together with the corresponding reasoning chains from question semantic parsing to question answer generation. By organizing the output of qwen-vl-max, a specialized reinforcement dataset for circuit knowledge question answering was constructed. After screening and cleaning, it was used to further fine-tune the Qwen2.5-VL-3B visual model.

[0093] Because the Qwen2.5-VL-3B model has relatively few parameters and a limited ability to understand circuit system block diagrams, it is necessary to ensure that it can analyze circuit diagrams (specifically component analysis, loop analysis, and system function analysis) before answering questions. Since this part is not the core task, it is assigned a small share of the data: 10 questions per diagram. The subsequent question semantic parsing and answer generation are the core tasks; therefore, a larger dataset is used for them to ensure correct output from question parsing through answer generation.
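The difference between the training samples of the two fine-tuning passes can be sketched as follows. This is an illustration under stated assumptions rather than the data format of this embodiment: the field names, the `build_record` and `dataset_size` helpers, and the dictionary `teacher_output` (standing in for cleaned qwen-vl-max output) are hypothetical, and no API call is made.

```python
# Segments of the reasoning chain, in the order described in the text.
ANALYSIS_KEYS = ["component_analysis", "loop_analysis", "system_function"]
ANSWER_KEYS = ["question_parsing", "answer"]


def build_record(stage, image_path, question, teacher_output):
    """Build one training sample for fine-tuning pass 1 or 2.

    teacher_output is a hypothetical dict holding the cleaned text of each
    reasoning-chain segment.
    """
    inputs = {"image": image_path, "question": question}
    if stage == 1:
        # Pass 1: the model must produce the complete chain.
        target_keys = ANALYSIS_KEYS + ANSWER_KEYS
    else:
        # Pass 2: the analysis is given as input; only the question parsing
        # and the answer are produced.
        inputs["analysis"] = {k: teacher_output[k] for k in ANALYSIS_KEYS}
        target_keys = ANSWER_KEYS
    target = "\n".join(teacher_output[k] for k in target_keys)
    return {"stage": stage, "input": inputs, "target": target}


def dataset_size(n_diagrams, questions_per_diagram):
    """Number of question-answer pairs obtained from a set of diagrams."""
    return n_diagrams * questions_per_diagram
```

With the figures given above, 1,000 diagrams at 10 questions each and 1,000 diagrams at 30 questions each yield the 40,000 pairs in total.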

[0094] In step 6, a phased question-and-answer strategy is adopted:

[0095] The first stage generates a system-level interpretable reasoning chain, which is the logical foundation of the entire task question-answering process. This stage does not directly output the final answer; instead, it constrains the analysis path of the large model through an explicit reasoning process, reduces the risk of hallucination, and improves the reliability of circuit understanding.

[0096] Specifically, the system first converts the "component-port-line segment" topology structure obtained in step 4 into a structured JSON description, which includes at least: the type and functional attributes of each component; the directed connection relationship between component ports; network-level signal flow paths and identified feedback loop information.
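A structured description of this kind might look like the output of the following sketch. The key names (`components`, `connections`, `signal_paths`, `feedback_loops`), the `topology_to_json` function, and the example components are assumptions made for illustration; the embodiment specifies only which categories of information the JSON contains.

```python
import json


def topology_to_json(components, connections, signal_paths, feedback_loops):
    """Serialize a directed component-port topology into a JSON string."""
    doc = {
        # Type and functional attributes of each component.
        "components": [
            {"id": c["id"], "type": c["type"], "function": c["function"]}
            for c in components
        ],
        # Directed connections between component ports.
        "connections": [
            {"from": {"component": sc, "port": sp}, "to": {"component": dc, "port": dp}}
            for (sc, sp), (dc, dp) in connections
        ],
        # Network-level signal flow paths and identified feedback loops.
        "signal_paths": signal_paths,
        "feedback_loops": feedback_loops,
    }
    return json.dumps(doc, indent=2)


example = topology_to_json(
    components=[
        {"id": "U1", "type": "amplifier", "function": "gain stage"},
        {"id": "U2", "type": "filter", "function": "low-pass filtering"},
    ],
    connections=[(("U1", "out0"), ("U2", "in0")), (("U2", "out0"), ("U1", "in1"))],
    signal_paths=[["U1", "U2"]],
    feedback_loops=[["U1", "U2", "U1"]],
)
```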

[0097] Subsequently, the system constructs phased prompts covering loop analysis, system analysis, and user question analysis. These prompts, along with the original system block diagram image, are then input into the Qwen2.5-VL-3B vision-language model (VLM) that has been fine-tuned with a low-rank adapter. The image provides intuitive structural layout information, while the topology JSON provides physical constraint information verified by geometric and connectivity algorithms. The two form a multi-source fusion input.

[0098] Guided by the prompts, the large model first performs semantic-level parsing of the entire system, focusing on:

[0099] 1) analyzing the signal transmission paths between different components based on the topology;

[0100] 2) identifying the cascade relationships, parallel relationships, and potential feedback mechanisms of the components;

[0101] 3) combining circuit domain knowledge to make an overall judgment on the system functions and generate a system-level description;

[0102] 4) in response to the user's question, recalling the relevant theoretical knowledge and making step-by-step inferences based on the above system analysis results.

[0103] In this stage, the model explicitly outputs the complete intermediate reasoning process in natural language, namely the system reasoning chain. This reasoning chain describes in detail the causal logical relationship between circuit structure, signal flow and problem solving, but does not require the output of a standardized answer format, thereby ensuring the sufficiency and interpretability of the reasoning process.
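The text part of a first-stage prompt could be assembled as in the sketch below. The wording of the prompt, the `STAGE_ONE_STEPS` list, and the `build_stage_one_prompt` function are hypothetical; the actual prompt templates are not disclosed here, and the block diagram image would be passed to the model separately.

```python
# Hypothetical wording of the four focus points of the first stage.
STAGE_ONE_STEPS = [
    "Analyze the signal transmission paths between components based on the topology.",
    "Identify cascade relationships, parallel relationships, and potential feedback mechanisms.",
    "Using circuit domain knowledge, judge the overall system function and describe it.",
    "Recall the relevant theory and answer the user question step by step from the analysis above.",
]


def build_stage_one_prompt(topology_json, question):
    """Assemble the text part of a first-stage prompt; the image is passed separately."""
    lines = [
        "You are analyzing a circuit system block diagram.",
        "Verified topology (JSON):",
        topology_json,
        "Reason through the following steps in order:",
    ]
    lines += [f"{i}. {step}" for i, step in enumerate(STAGE_ONE_STEPS, start=1)]
    lines += [
        f"User question: {question}",
        "Write out the complete reasoning chain in natural language. "
        "No fixed answer format is required at this stage.",
    ]
    return "\n".join(lines)
```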

[0104] The second stage, based on the reasoning chain generated in the first stage, outputs structured results that can be directly parsed and verified by the system, and is the final output stage of the task question answering.

[0105] In this stage, the system takes the reasoning chain generated in the first stage, the image topology JSON, and the user question as input. A strictly constrained prompt template guides the fine-tuned Qwen2.5-VL-3B model to extract and format the output results. Depending on the question type, the system applies differentiated constraints to the output format, including:

[0106] 1) Multiple-choice questions: only the identifiers of the correct options may be output;

[0107] 2) Fill-in-the-blank questions: the single correct answer is selected from the preset candidate answers and output in full.
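The two output constraints above lend themselves to a simple programmatic check. The following sketch is one possible validator, not the mechanism of this embodiment; the `validate_answer` function, the question type labels, and the comma or whitespace separator for option identifiers are assumptions.

```python
import re


def validate_answer(question_type, raw, allowed):
    """Check a second-stage model output against the format constraints.

    allowed is the list of option identifiers (multiple choice) or the list of
    preset candidate answers (fill in the blank). Raises ValueError on violation.
    """
    text = raw.strip()
    if question_type == "multiple_choice":
        # Only option identifiers may appear, separated by commas or spaces.
        ids = [token for token in re.split(r"[,\s]+", text) if token]
        if not ids or any(token not in allowed for token in ids):
            raise ValueError(f"not a list of option identifiers: {raw!r}")
        return ids
    if question_type == "fill_in_blank":
        # The output must be exactly one of the preset candidate answers.
        if text not in allowed:
            raise ValueError(f"not a preset candidate answer: {raw!r}")
        return text
    raise ValueError(f"unknown question type: {question_type!r}")
```

For example, the output `" A, C "` passes for options A to D, while a sentence such as "The answer is B" is rejected because it contains text other than option identifiers.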

[0108] Through the above two-stage reasoning and output mechanism, this invention, while ensuring the reasoning capability of large models, realizes a task question-and-answer process from explanation to conclusion, which not only improves the reliability of circuit problem answers, but also meets the engineering requirements of structured interfaces and automated processing.

[0109] To verify and illustrate the technical effects of the method of this invention, this embodiment selects the traditional block diagram analysis process based on a single visual detection method as the comparison object and compares the test results through quantitative experiments.

[0110] Traditional methods generally include three stages: component detection, line segment detection, and heuristic connection recovery. The specific steps are as follows:

[0111] First, the input system block diagram image is directly fed into a single object detection model, such as using only the YOLO model to identify components, connection endpoints, and arrows. This method relies on the detector to classify all elements in the image, and its output component boxes, arrow boxes, and scattered line segment points serve as the basis for subsequent connection reconstruction. However, because this type of detection model has limited robustness to dense line segments, tiny arrows, and text interference, its recognition results are easily affected by image noise, compression artifacts, and differences in drawing style, resulting in incomplete connection information.

[0112] Subsequently, heuristic rules are used to match connections between the detected endpoints. Heuristic algorithms typically infer line segments based on a single metric such as the geometric distance or relative position of the endpoints. For example, if two endpoints are close in the horizontal or vertical direction, it is assumed that a connection may exist between them and they are automatically connected. Such methods can achieve acceptable connection recovery rates in simple structures, but in complex graphs containing numerous polylines, cross-component connections, and arrow direction variations, heuristic algorithms often struggle to accurately determine the true path and direction of the connection, easily leading to erroneous connections.
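The single-metric heuristic described above, and the way it over-connects, can be reproduced with a short sketch. The `heuristic_connect` function and both tolerance values are hypothetical and serve only to illustrate the traditional baseline, not any specific prior implementation.

```python
def heuristic_connect(endpoints, tol=3.0, max_gap=40.0):
    """Pair endpoints that are nearly aligned horizontally or vertically and close.

    endpoints is a list of (x, y) points; the result is a list of index pairs.
    Both tolerances are assumed values in pixels.
    """
    pairs = []
    for i, (x1, y1) in enumerate(endpoints):
        for j in range(i + 1, len(endpoints)):
            x2, y2 = endpoints[j]
            horizontal = abs(y1 - y2) <= tol and abs(x1 - x2) <= max_gap
            vertical = abs(x1 - x2) <= tol and abs(y1 - y2) <= max_gap
            if horizontal or vertical:
                pairs.append((i, j))
    return pairs


# Endpoints 0-1 belong to a horizontal wire and 2-3 to a vertical wire
# that starts right next to it.
crowded = [(0, 0), (30, 0), (30, 1), (30, 35)]
```

On the `crowded` example the rule links almost every endpoint to every other, including endpoint 0 to endpoint 2, which lie on different wires. This is the kind of erroneous connection that appears in dense diagrams.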

[0113] After the component connection information is completed, traditional methods directly input the resulting connection graph into a multimodal large model for question answering. However, because the input topology is incomplete and there is no structured verification mechanism, the VLM may produce hallucinated connections when processing logical reasoning tasks, i.e., it gives incorrect answers based on statistical associations rather than the real image structure, which affects the reliability of question answering.

[0114] To verify the difference in effectiveness between the method of the present invention and traditional methods, this embodiment compares the following two types of methods:

[0115] 1) Methods based solely on image recognition models, such as those that combine YOLO recognition with OCR-based text removal, can identify the number of components relatively effectively and infer connection relationships from algorithms and heuristic rules. However, since image recognition models essentially only perform visual target localization and classification, they cannot understand the functional structure of circuit systems, signal semantics, or system-level constraints. Therefore, they cannot perform semantic understanding and reasoning on connection directionality, functional hierarchy, and overall system topology, making it difficult to support automatic modeling and semantic reasoning tasks for EDA applications.

[0116] 2) Methods that use a VLM directly for question answering suffer from serious hallucinated connections and recognition errors. Current large models make serious errors in identifying component functions and inferring connection relationships. As a result, subsequent semantic understanding and question-answering reasoning are based on incorrect structural representations, which further amplifies the initial error, increases the difficulty of answering related questions, and greatly reduces accuracy and stability.

[0117] The method of this invention was experimentally compared using 20 publicly available test cases from the 2025 China Postgraduate Innovation Competition - EDA Elite Challenge, with the results shown in Table 1. The balanced F-score (F1 value), which accounts for both precision and recall, was selected as the experimental metric. The experimental results show that the traditional image recognition model YOLO, which lacks semantic understanding and question-answering capabilities, has an F1 score of 0 in the question-answering task, while the method using only a VLM for direct question answering has an F1 score of 0.38. In topology recovery and direction determination, the F1 scores of both the YOLO-only method and the VLM-only method are lower than those of the method of this invention (VLM+YOLO). Therefore, the method of this invention significantly outperforms the traditional visual detection and heuristic rule-based parsing process in three aspects of complex system block diagram understanding: topology recovery, direction determination, and structure-based question answering, which demonstrates its advantage in the field of automated engineering diagram understanding.
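For reference, the F1 value is the harmonic mean of precision and recall. The sketch below computes it over sets of directed edges, which is one plausible way to score topology recovery and direction determination; how the items are matched in the experiments of this embodiment is not stated, so the edge-set formulation and the `f1_score` function are assumptions.

```python
def f1_score(predicted, reference):
    """F1 value of a predicted item set against a reference item set."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    if true_positives == 0:
        # Covers empty predictions as well as predictions with no correct item.
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Directed edges as (source, destination) pairs; an edge with the wrong
# direction does not count as correct.
predicted_edges = [("A", "B"), ("B", "C"), ("D", "C")]
reference_edges = [("A", "B"), ("B", "C"), ("C", "D")]
```

Because edges are compared as ordered pairs, a connection that is found but oriented the wrong way lowers both precision and recall.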

[0118] Table 1. Performance comparison of different parsing strategies in system architecture modeling and question answering tasks.

[0119]

[0120] This invention also provides an automatic system block diagram parsing and task question answering system based on multi-source fusion, comprising:

[0121] The image preprocessing unit is used to perform grayscale conversion, CLAHE, morphological optimization, etc.;

[0122] The target detection unit is used to detect components, arrows, and text areas;

[0123] The line segment detection unit is used to perform Hough line detection and LSD line segment fusion;

[0124] The topology building unit is used to perform port inference, connection restoration, and direction inference;

[0125] The model fine-tuning unit is used to fine-tune the large visual model using a self-constructed question-answering pair dataset;

[0126] The question-and-answer reasoning unit is used to generate reasoning chains and structured answers using a large model;

[0127] The results export unit is used to export results such as JSON and visualizations.

[0128] The electronic device of the present invention includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When loaded into the processor, the computer program executes the system block diagram automatic parsing and task question answering method of the present invention, including steps such as image preprocessing, component and arrow detection and recognition, line segment multi-algorithm fusion extraction, topology construction, and topology-based task question answering reasoning.

[0129] The computer-readable storage medium of the present invention stores a computer program, which, when executed by a processor, implements all the steps of the system block diagram automatic parsing and task question answering method of the present invention, and is able to automatically parse the content structure of the system block diagram and generate corresponding functional analysis results or question answers.

[0130] The computer-readable storage medium may include, but is not limited to: RAM, ROM, EEPROM, CD-ROM or other optical disc storage media, disk storage devices or other magnetic storage devices, flash memory, or any other medium capable of storing instructions or data structures in the form of program code and which can be read by a computer.

[0131] The processor is used to execute computer programs stored in the memory. By running the program, it implements the various steps described in the above embodiments, including component detection, arrow recognition, line segment fusion, port matching, topology reconstruction, and large model question-answering reasoning, thereby achieving automatic parsing and semantic understanding of the system block diagram.

[0132] The above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention. All equivalent changes made based on the description and drawings of the present invention fall within the protection scope of the present invention.

Claims

1. A method for automatic parsing of system block diagrams and task question answering based on multi-source fusion, characterized in that the method includes the following steps:
Step 1: Obtain the original image of the system block diagram, and preprocess the original image to obtain a denoised and enhanced structured image; the system block diagram is a circuit system block diagram;
Step 2: Use a preset target detection model to identify components in the structured image to obtain the set of component categories and the corresponding component bounding box coordinates in the system block diagram;
Step 3: Perform line segment detection on the structured image based on a multi-algorithm fusion strategy to extract the set of line segments in the image;
Step 4: Based on the components, component input ports, output ports, and the detected set of line segments, construct a three-level relationship between components, ports, and line segments to generate the topological connection relationship of the system block diagram;
Step 5: Construct a self-annotated question-answering pair dataset, and use a secondary fine-tuning method from overall reasoning to local reinforcement to fine-tune the Qwen2.5-VL-3B visual large model to obtain a task question-answering large model adapted to the system block diagram scenario;
Step 6: Input the constructed topology and the system block diagram image into the task question-answering large model to generate system block diagram analysis results and corresponding question answers;
Step 4 specifically involves: On the bounding box of each component, the endpoints of the line segments are searched according to the principle of intersection between the bounding box and the line segments, and then the positions of the input and output ports of each component are inferred. Based on the correspondence between the ports and the line segments, an undirected physical topology is constructed. A four-level directional reasoning strategy is used to determine the signal flow direction of each connection path, specifically:
The first level prioritizes searching for and closing feedback loops that can form strongly connected components, and determines the signal direction of such loops;
The second level calculates the distance between the midpoint of each connection path and the geometric center of the detected arrow; if the distance is less than a preset threshold, the connection path is assigned a signal flow direction consistent with the arrow;
The third level statistically analyzes the in-degree and out-degree characteristics of each component node; combined with the arrow information obtained from the previous level, components with fewer inputs and more outputs are defined as source-type components, and components with more inputs and fewer outputs are defined as destination-type components, and the signal flow direction is then determined from source-type components to destination-type components;
At the fourth level, for connection paths whose direction remains uncertain after the above three levels of reasoning, a default signal flow direction is assigned based on the heuristic principle of left to right and top to bottom, ensuring that all connection paths in the topology have a clear direction.

2. The method for automatic parsing of system block diagrams and task question answering based on multi-source fusion according to claim 1, characterized in that, Step 1 is as follows: Optical character recognition technology is used to extract the coordinates of all text regions in the image, and a binary mask with the same size as the original image is generated. The pixel values of the area covered by the binary mask are set to the background color to eliminate the interference of text characters on the line extraction. The noise variance of a local region of the image is calculated. If the noise variance is lower than a preset threshold, Gaussian filtering is used for denoising; otherwise, median filtering is used for denoising. A contrast-limited adaptive histogram equalization algorithm is used to enhance the edge contrast of lines in the image, and adaptive binarization is performed to obtain a structured image.

3. The method for automatic parsing of system block diagrams and task question answering based on multi-source fusion according to claim 1, characterized in that, Step 2 is as follows: Step 2-1: Construct a hybrid-driven data stream. The data stream is based on a real-labeled system block diagram, anchored to real image features, and combined with synthetic data generated in batches by scripts to expand sample diversity and cover edge scenes. Step 2-2: The hybrid-driven data stream is input into the target detection model to obtain the component category set and the bounding box coordinates of each component.

4. The method for automatic parsing of system block diagrams and task question answering based on multi-source fusion according to claim 1, characterized in that, Step 3 specifically involves: Step 3-1: Use the probabilistic Hough transform to extract global long straight lines in the image, set an adaptive threshold of 20% of the minimum component size, and capture the main links; Step 3-2: For the local small line segments missed by the Hough transform, use the line segment detector (LSD) algorithm to supplement the detection and form a full set of candidate line segments; Step 3-3: Use the density-based spatial clustering of applications with noise (DBSCAN) algorithm to cluster and merge the line segments in the candidate set, so as to eliminate line segment overlap and breakage and restore complete connections; Step 3-4: Extend each end of each line segment obtained in Step 3-3 by 2.5% of the corresponding line segment length, and remove interference lines inside the component frame and interference lines that are approximately parallel to the component frame, thereby obtaining a complete and noise-free set of line segments.

5. The method for automatic parsing of system block diagrams and task question answering based on multi-source fusion according to claim 1, characterized in that, In step 5, the Qwen2.5-VL-3B visual large model is used as the base model. A low-rank adapter fine-tuning technique is selected, and a secondary fine-tuning method from global inference to local reinforcement is used to construct the task-oriented question-answering large model, including: The question-answer pair dataset is divided into a first question-answer pair dataset and a second question-answer pair dataset. The first fine-tuning uses the first question-answer pair dataset, with the circuit system block diagram as the input sample, questions about the circuit system as the instruction input, and the complete inference chain of component analysis - loop analysis - system function analysis - question semantic parsing - question answer generation as the output. The second fine-tuning uses the second question-answer pair dataset, with the circuit system block diagram and corresponding components, loops, and system function analysis as input samples, questions about the circuit system as instruction input, and the inference chain generated from question semantic parsing to question answer generation as output.

6. The method for automatic parsing of system block diagrams and task question answering based on multi-source fusion according to claim 1, characterized in that, Step 6 specifically involves: Based on the fine-tuned task question-answering large model, structured prompt engineering oriented toward circuit system analysis is adopted to guide the model to combine the original system block diagram image and the topological connection relationship obtained in step 4 to perform two-stage question-answering reasoning. In the first stage, a hierarchical controlled generation strategy is adopted, setting a maximum generation length for each reasoning sub-stage. The preset maximum generation length is lower than the generation scale required to fully expand the reasoning of the sub-stage and suffices only to generate the core information. This guides the model to complete component function analysis, loop flow analysis, system function analysis, question analysis, and answer derivation in sequence within the controlled information scale, generating a system reasoning chain with a complete structure and a traceable process. In the second stage, the reasoning chain formed in the first stage is used as context input to the model to guide the model to converge on results and extract key information. Finally, the system block diagram analysis results and corresponding question answers are output in structured JavaScript Object Notation (JSON).

7. A system for automatic parsing of system block diagrams and task question answering based on multi-source fusion, characterized in that the system is used to implement the method of any one of claims 1-6 and includes: a target detection unit, an image preprocessing unit, a line segment detection unit, a topology building unit, a model fine-tuning unit, a question-answering reasoning unit, and a results export unit;
The target detection unit is used to perform target detection on the input image and output bounding boxes of components and arrows;
The image preprocessing unit is used to perform grayscale conversion, denoising, contrast-limited adaptive histogram equalization, binarization, and morphological optimization;
The line segment detection unit is used to progressively run Hough and LSD detection to obtain a candidate set of line segments, and to use DBSCAN clustering, bidirectional extension, and elimination of invalid line segments to obtain the final optimized set of line segments;
The topology building unit, including a port automatic inference module, a line segment path scoring module, a bifurcation point detection module, and a loop identification module, is used to perform port set generation, endpoint and port matching, path search, and direction reasoning, and to output the final topology structure;
The model fine-tuning unit is used to fine-tune the Qwen2.5-VL-3B visual large model using a self-constructed question-answering pair dataset and a secondary fine-tuning method from global reasoning to local reinforcement, so as to obtain a task question-answering large model adapted to the system block diagram scenario;
The question-answering reasoning unit, including a prompt management module, a chain-of-thought generation module, and a structured extraction module, is used to accept the topological structure and image features, call the large model to perform two-stage question answering, and return structured answers;
The results export unit is used to export the parsing results and question answers as JSON, visualizations, and log files.

8. An electronic device comprising a processor, a memory, and a program executable on the processor, said program performing any one of claims 1–6.

9. A computer-readable storage medium having a program stored thereon, said program, when executed, implementing the method of any one of claims 1–6.