Apparatus quality detection system and method fusing natural language instructions with visual analysis

By integrating natural language commands and visual analysis into an equipment quality inspection system, and utilizing the multi-agent collaborative mechanism of ESLPA and VDDA, the system solves the problems of low efficiency and insufficient flexibility in existing technologies, and achieves efficient, accurate, and flexible automated inspection, adapting to complex backgrounds and new types of defects.

CN122243906A · Pending · Publication Date: 2026-06-19 · SI CHUAN KE RUI RUAN JIAN YOU XIAN ZE REN GONG SI

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SI CHUAN KE RUI RUAN JIAN YOU XIAN ZE REN GONG SI
Filing Date
2026-03-13
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing equipment quality inspection methods are inefficient, highly subjective, and lack natural language interaction capabilities. Traditional machine vision systems are not flexible enough to cope with complex backgrounds and new defects, and lack collaborative mechanisms among multiple agents, resulting in high inspection costs and limited flexibility.

Method used

The equipment quality inspection system, which integrates natural language commands and visual analysis, uses an equipment system language processing agent (ESLPA) to parse user commands and generate task packages, a visual defect detection agent (VDDA) to perform real-time localization and hierarchical analysis, and a collaborative interface to generate natural language reports and digital archives, thus achieving multi-agent collaboration.

Benefits of technology

It achieves efficient, accurate, and flexible automated quality inspection, has strong adaptability to unknown parts and new defects, and features intuitive and easy-to-use natural language interaction and dynamic task execution capabilities of multiple intelligent modules.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122243906A_ABST
Patent Text Reader

Abstract

This invention discloses an equipment quality inspection system and method integrating natural language commands and visual analysis, comprising: an equipment system language processing agent (ESLPA) for parsing key semantic elements in user commands, converting the parsing results into machine-readable JSON / Protobuf format task packages, and processing clarification requests from a visual defect detection agent, dynamically initiating user follow-up questions or intent corrections; a visual defect detection agent (VDDA) for achieving text-driven real-time part localization and performing hierarchical defect analysis; and a collaborative interface and output system for transmitting task packages and generating natural language reports and digital quality archives.

Description

Technical Field

[0001] This invention relates to the field of equipment quality inspection, and in particular to an equipment quality inspection system and method that integrates natural language commands and visual analysis.

Background Technology

[0002] In the fields of equipment manufacturing, operation and maintenance, and support, ensuring the quality of key components is the cornerstone of ensuring equipment reliability and safety. However, current mainstream inspection methods face severe challenges. While manual visual inspection is widely used, it suffers from low efficiency, strong subjectivity, and poor consistency. Especially when dealing with the massive number of parts in large and complex equipment, it is time-consuming and results are easily influenced by personnel experience and condition, leading to high rates of missed and false detections. Furthermore, defects are difficult to quantify and record accurately, resulting in high maintenance costs. Automated machine vision systems (AOI) can improve efficiency, but their inherent rigid programming model leads to a severe lack of flexibility. The system can only identify defects in preset locations and types. If the equipment model changes, the appearance of parts changes, or new defects need to be detected, it is necessary to rely on professionals to redevelop algorithms and debug parameters, resulting in slow response and high costs. In addition, traditional algorithms lack robustness in complex backgrounds, lighting changes, or scenarios requiring precise semantic understanding (such as "a specific bolt on the inside of the left front wheel"), and lack natural language interaction capabilities, making it impossible for users to intuitively and flexibly specify their inspection needs.

[0003] Despite significant advancements in artificial intelligence technologies, particularly deep learning and large language models (LLM), in their respective fields—object detection and segmentation enhancing the ability to identify specific defects, and LLM demonstrating powerful natural language understanding—integrating these technologies for equipment quality inspection still faces key bottlenecks. Current solutions often use computer vision (CV) models or LLM in isolation, resulting in a disconnect between "language" and "vision" capabilities: pure CV solutions cannot understand users' flexible and varied colloquial instructions, while pure text-based LLM cannot handle visual information; there is a lack of efficient and unambiguous collaborative mechanisms between the two. More importantly, dedicated CV models are typically trained on closed datasets, lacking generalization ability when faced with unseen part descriptions or novel defects (i.e., open-vocabulary problems), requiring extensive retraining with specific scenario-labeled data, leading to substantial costs. Equipment inspection tasks themselves involve a complex chain of "understanding intent -> locating the target -> identifying / quantifying defects -> assessing quality -> generating a report," making it difficult for a single model to efficiently and robustly complete the entire process. Existing integration solutions mostly employ fixed pipelines, lacking dynamic task decomposition, collaboration, and feedback capabilities between agents based on semantic understanding, thus limiting flexibility and scalability. Meanwhile, seamlessly integrating domain expertise (such as specific part acceptance criteria and failure modes) into end-to-end processes and effectively combining it with visual perception and language understanding remains a challenge.

[0004] Therefore, the field of equipment quality inspection urgently needs to break through the limitations of manual labor and the rigid constraints of traditional machine vision, and urgently needs an innovative solution that can deeply integrate language intelligence and visual intelligence.

Summary of the Invention

[0005] The purpose of this invention is to overcome the shortcomings of the prior art and provide an equipment quality inspection system and method that integrates natural language commands and visual analysis. Through a multi-agent collaborative mechanism, it integrates the powerful cognitive capabilities of a large model, which can significantly improve the inspection efficiency and accuracy for different types of equipment.

[0006] The objective of this invention is achieved through the following technical solution: an equipment quality inspection system integrating natural language commands and visual analysis, comprising:

[0007] The equipment system language processing agent ESLPA is used to parse key semantic elements in user commands, convert the parsing results into machine-readable JSON / Protobuf format task packages, process clarification requests from the visual defect detection agent, and dynamically initiate user follow-up questions or intent corrections.

[0008] The Visual Defect Detection Agent (VDDA) is used to achieve real-time text-driven part localization and perform hierarchical defect analysis.

[0009] The collaborative interface and output system is used to transmit task packages and generate natural language reports and digital quality archives.

[0010] An equipment quality inspection method integrating natural language commands and visual analysis includes the following steps:

[0011] The user inputs equipment images, natural language commands, and quantitative standards, and ESLPA generates a task instruction package containing structured fields;

[0012] VDDA initiates the open environment detection process based on this instruction package: First, it uses multimodal large model open vocabulary visual localization technology to map the semantic description of the part to the image coordinate space in real time, and accurately lock the target area;

[0013] Initiating hierarchical defect analysis: Utilizing the generalization features of the basic visual model to perceive abnormal regions, then driving a lightweight dedicated model to perform defect quantification calculations. The detection process dynamically adapts to the defect types and standards specified in the task instructions.

[0014] When encountering ambiguous positioning or insufficient detection confidence, VDDA automatically generates a clarification request with visual evidence. ESLPA then uses LLM contextual reasoning to re-parse the intent or guide user interaction, forming an intelligent closed loop of "parse-execute-verify".

[0015] ESLPA integrates the structured results from VDDA output to generate a dual-track output that combines natural language reports and digital quality archives.

[0016] The beneficial effects of this invention are: it has the intuitive ease of use of natural language interaction, strong adaptability to unknown parts and new defects (open world capability), and dynamic task execution capability of multi-intelligent module collaboration, which can realize efficient, accurate, flexible and user-friendly automated quality inspection.

Attached Figure Description

[0017] Figure 1 is a schematic diagram illustrating the principle of the present invention.

Detailed Implementation

[0018] The technical solution of the present invention will be further described in detail below with reference to the accompanying drawings, but the scope of protection of the present invention is not limited to the following description.

[0019] As shown in Figure 1, the equipment quality inspection system integrating natural language commands and visual analysis includes:

[0020] 1-1) Equipment System Language Processing Agent (ESLPA)

[0021] Instruction parsing module: Based on an LLM (such as GPT-4, LLaMA-3) natural language understanding engine, it extracts key semantic elements such as target parts, defect types, and quality thresholds from user instructions.

[0022] Structured generation module: Converts the parsed results into machine-readable JSON / Protobuf format task packages (including target_component location description, defect_types list, and quality_standards threshold).

[0023] Interactive control module: handles VDDA clarification requests and dynamically initiates user follow-up questions or intent corrections (such as "Please confirm whether the target part is area A in the diagram").
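As a rough illustration of the task package described in [0022], the sketch below serializes the three named fields (target_component, defect_types, quality_standards) to JSON. The class names, the `QualityStandard` sub-fields, and the `task_id` value are assumptions made for this example; the source does not specify a schema.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class QualityStandard:
    # Hypothetical sub-fields; the source only names a "threshold".
    metric: str        # e.g. "depth"
    comparator: str    # ">" means "flag when the value exceeds the limit"
    limit: float
    unit: str


@dataclass
class TaskPackage:
    task_id: str
    target_component: str                        # location description of the part
    defect_types: list = field(default_factory=list)
    quality_standards: list = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses held in lists.
        return json.dumps(asdict(self), ensure_ascii=False, sort_keys=True)


package = TaskPackage(
    task_id="demo-001",
    target_component="hydraulic cylinder piston rod surface",
    defect_types=["longitudinal scratch"],
    quality_standards=[QualityStandard("depth", ">", 0.1, "mm")],
)
print(package.to_json())
```

A Protobuf message with the same fields would serve the same purpose; JSON is used here only because it needs no extra tooling.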

[0024] 1-2) Visual Defect Detection Agent (VDDA)

[0025] Open vocabulary localization module: integrates MLLM (such as Grounding DINO+Segment Anything Model) to realize text-driven real-time localization of parts (e.g., mapping "hydraulic cylinder piston rod surface" to pixel-level ROI).

[0026] Layered Defect Analysis Module:

[0027] General anomaly perception layer: Employs a visual base model (such as DINOv2) to extract multi-scale features of the ROI region and identify potential anomalies;

[0028] Dedicated Defect Quantization Layer: Calls pre-built lightweight models (such as scratch detection U-Net, corrosion classification ResNet) to accurately measure defects (such as depth reconstruction, area ratio calculation).

[0029] Confidence assessment module: Outputs the confidence score of defect detection and triggers the cross-agent clarification process in low-confidence scenarios.

[0030] 1-3) Collaborative Interface and Output System

[0031] Structured instruction channel: Transmits task packets based on gRPC / message queue, with fields including task_id, target_roi_coord, defect_metrics, etc.

[0032] Natural Language Report: ESLPA calls LLM to generate user-readable conclusions (e.g., "3 cracks were detected, with a maximum length of 8.2mm (63% exceeding the limit)").

[0033] Digital quality archives: Automatically generate JSON logs containing defect coordinates, quantitative data, and standard compliance.

[0034] 2) Technical Details

[0035] 2-1) Language-Vision Collaborative Technology

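[0036] The part description text $T$ parsed by ESLPA is encoded into a semantic vector $\mathbf{v}_T$ using the LLM. VDDA uses the CLIP model to encode each image region $R_i$ as a visual vector $\mathbf{v}_{R_i}$. Location matching is calculated using cosine similarity:

[0037] $$\mathrm{sim}(T, R_i) = \frac{\mathbf{v}_T \cdot \mathbf{v}_{R_i}}{\lVert \mathbf{v}_T \rVert \, \lVert \mathbf{v}_{R_i} \rVert}$$

[0038] Regions whose similarity is greater than a threshold $\tau$ are selected as target ROIs to ensure the accuracy of open vocabulary localization.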
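The matching step in [0036] to [0038] can be sketched in a few lines: compute the cosine similarity between one text vector and each region vector, then keep the regions above a threshold. The toy three-dimensional vectors and the function names are assumptions for illustration; real embeddings would come from the text and image encoders.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0   # treat a zero vector as matching nothing
    return dot / (norm_u * norm_v)


def select_rois(text_vec, region_vecs, threshold):
    """Return (region_id, similarity) pairs above the threshold, best first."""
    scored = [(rid, cosine_similarity(text_vec, vec)) for rid, vec in region_vecs.items()]
    kept = [(rid, s) for rid, s in scored if s > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


text_vec = [1.0, 0.0, 1.0]
region_vecs = {
    "region_a": [2.0, 0.0, 2.0],   # same direction as the text vector
    "region_b": [0.0, 1.0, 0.0],   # orthogonal to the text vector
    "region_c": [1.0, 1.0, 1.0],   # partially aligned
}
matches = select_rois(text_vec, region_vecs, threshold=0.75)
print(matches)
```

Because cosine similarity ignores vector length, `region_a` scores the same as an exact copy of the text vector would.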

[0039] 2-2) Dynamic Defect Detection

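[0040] Let the set of defect types in the task instruction package be $D = \{d_1, d_2, \ldots, d_n\}$. VDDA loads the pre-registered model library and performs parallel detection:

[0041] $$Y = \{\, M_{d_i}(I_{\mathrm{ROI}}) \mid d_i \in D \,\}$$ where $M_{d_i}$ is the pre-trained model for defect type $d_i$, $I_{\mathrm{ROI}}$ is the target region image, and $Y$ is the collection of all defect detection results.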

[0042] Model output example:

[0043] Scratch length:

[0044] Corrosion area:
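The dynamic dispatch in section 2-2 can be pictured as a registry lookup followed by parallel calls, one per requested defect type. This is a minimal sketch under stated assumptions: the two detector functions are toy stand-ins (not the patent's models), the ROI is a small label map, and a thread pool is just one possible way to run the models in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-ins for pre-registered defect models.
def detect_scratch(roi):
    return {"scratch_pixels": sum(row.count(1) for row in roi)}


def detect_corrosion(roi):
    total = sum(len(row) for row in roi)
    corroded = sum(row.count(2) for row in roi)
    return {"corrosion_area_ratio": corroded / total}


MODEL_REGISTRY = {"scratch": detect_scratch, "corrosion": detect_corrosion}


def run_detection(defect_types, roi):
    """Run the registered model for each requested defect type on the same ROI."""
    unknown = [d for d in defect_types if d not in MODEL_REGISTRY]
    if unknown:
        raise KeyError(f"no registered model for: {unknown}")
    with ThreadPoolExecutor() as pool:
        futures = {d: pool.submit(MODEL_REGISTRY[d], roi) for d in defect_types}
        return {d: f.result() for d, f in futures.items()}


roi = [[0, 1, 1, 0], [2, 2, 0, 0]]   # toy label map: 1 = scratch, 2 = corrosion
results = run_detection(["scratch", "corrosion"], roi)
print(results)
```

Adding a new defect type then only requires registering one more callable, which is the extensibility the text attributes to the model library.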

[0045] 2-3) Open Environment Adaptation Technology

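[0046] The objective function of the adapter fine-tuning used in few-shot defect transfer:

[0047] $$\min_{\phi} \sum_{(x, y) \in D_{\mathrm{new}}} \mathcal{L}_{\mathrm{CE}}\big(h(A_{\phi}(f_{\theta}(x))),\, y\big)$$

[0048] Probabilistic model for zero-shot, text-guided part localization:

[0049] $$P(\mathrm{ROI} \mid T, I) = \sigma\big(E_T(T)^{\top} W \, E_I(I)\big)$$

[0050] where $D_{\mathrm{new}}$ is the small dataset containing new defect samples, $f_{\theta}$ is the feature extractor of the pre-trained visual base model with frozen parameters $\theta$, $A_{\phi}$ is the Adapter module with trainable parameters $\phi$, $h$ is the classification or segmentation head, $\mathcal{L}_{\mathrm{CE}}$ is the cross-entropy loss, $\sigma$ is the softmax function, $T$ is the part description text from the task package, $I$ is the input equipment image, $E_T$ and $E_I$ are the text encoder and image encoder, and $W$ is a learnable weight matrix.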
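One way to read the zero-shot localization model in section 2-3 is as a bilinear score between a text embedding and each candidate region embedding, normalized with softmax. The sketch below assumes exactly that reading; the choice to take the softmax over candidate regions, the toy two-dimensional embeddings, and the identity weight matrix are all illustrative assumptions, not details from the source.

```python
import math


def bilinear_score(text_vec, weight, image_vec):
    """Compute text_vec^T * weight * image_vec for one region."""
    projected = [sum(w * x for w, x in zip(row, image_vec)) for row in weight]
    return sum(t * p for t, p in zip(text_vec, projected))


def softmax(scores):
    peak = max(scores)                            # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


text_vec = [1.0, 0.0]
weight = [[1.0, 0.0], [0.0, 1.0]]                 # identity matrix as a placeholder for W
region_vecs = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

probs = softmax([bilinear_score(text_vec, weight, r) for r in region_vecs])
best_region = max(range(len(probs)), key=probs.__getitem__)
print(best_region, probs)
```

In training, only the weight matrix and adapter parameters would be updated, as the claims describe; this example covers inference only.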

[0051] By aligning two vector spaces, human natural language commands are accurately mapped to the visual perception space, solving the problem of the separation between "language description" and "visual entity" in traditional systems. Through language-visual collaboration, accurate understanding of human intentions is achieved; dynamic routing builds a scalable detection capability library; and open transfer technology breaks through the scene boundaries of traditional vision systems, forming a three-in-one industrial AI detection paradigm.

[0052] A method for equipment quality inspection that integrates natural language commands and visual analysis achieves dynamic conversion from natural language commands to visual inspections through closed-loop collaboration between an equipment system language processing agent (ESLPA) and a visual defect detection agent (VDDA). The method includes the following steps:

[0053] After the user inputs an image of the equipment and natural language instructions (such as "check whether the depth of the longitudinal scratch on the surface of the hydraulic cylinder piston rod exceeds 0.1mm"), ESLPA relies on a large language model (LLM) to deeply analyze the semantics of the instructions: accurately extract the positioning features of the target part (such as "the surface of the hydraulic cylinder piston rod"), the defect type (such as "longitudinal scratch") and the quantification standard (such as "depth exceeds 0.1mm"), and generate a task instruction package containing structured fields (semantic description of the target part, list of defect types, quality threshold), completely eliminating the risk of natural language ambiguity being transmitted to the visual layer.

[0054] VDDA initiates the open environment detection process based on this instruction package: First, it employs open-vocabulary visual localization technology using Multimodal Large Model (MLLM) (such as Grounding DINO combined with SAM segmentation) to map the semantic description of the part to the image coordinate space in real time, accurately locating the target area. Then, it initiates hierarchical defect analysis—using the generalized features of the Visual Foundation Model (VFM) to perceive abnormal areas, and then drives a lightweight dedicated model (such as a pixel-level segmentation network for scratches) to perform defect quantification calculations (such as 3D reconstruction of scratch depth). The detection process dynamically adapts to the defect type and standards in the task instructions. When encountering ambiguity in localization or insufficient detection confidence, VDDA automatically generates a clarification request containing visual evidence. ESLPA then re-parses the intent or guides user interaction through LLM contextual reasoning, forming an intelligent closed loop of "parse-execute-verify".
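The "parse-execute-verify" loop in [0054] can be sketched as a bounded retry: run detection, check the confidence, and ask for clarification when it is too low. Everything concrete here is hypothetical: the 0.8 threshold, the toy detector that rewards targets containing "left" or "right", and the scripted list of user answers all stand in for the real agents.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.8   # assumed value for illustration


@dataclass
class Detection:
    target: str
    confidence: float


def vdda_detect(target):
    """Hypothetical detector: a target without a side qualifier is ambiguous."""
    confidence = 0.95 if "left" in target or "right" in target else 0.5
    return Detection(target, confidence)


def eslpa_clarify(target, answers):
    """Hypothetical clarification: refine the target with the next user answer."""
    return f"{answers.pop(0)} {target}"


def inspect(target, answers, max_rounds=3):
    history = []
    for _ in range(max_rounds):
        detection = vdda_detect(target)                      # execute
        history.append(detection)
        if detection.confidence >= CONFIDENCE_THRESHOLD:     # verify
            return detection, history
        if not answers:
            break                                            # nothing left to ask
        target = eslpa_clarify(target, answers)              # re-parse
    return None, history


result, history = inspect("front wheel bolt", ["left"])
print(result)
```

Bounding the number of rounds keeps the loop from stalling when the user cannot resolve the ambiguity.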

[0055] Ultimately, ESLPA integrates the structured results from VDDA (including defect heatmaps, quantitative data, and threshold comparison status) to generate a dual-track output that combines natural language reports (e.g., "Three longitudinal scratches detected, maximum depth 0.15mm (50% exceeding standard), polishing recommended") with digital quality archives. This solution overcomes barriers to language-visual collaboration, open environment adaptation, dynamic task response, and industrial knowledge integration. It can be deployed in a cloud-edge-device architecture, providing core intelligent support for equipment lifecycle quality management.
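The dual-track output in [0055] pairs a readable sentence with a machine-readable record. The sketch below reproduces the arithmetic of the example in the text (a 0.15 mm depth against a 0.1 mm limit is 50% over) and is otherwise an assumption: the function names, the report wording, and the archive fields are illustrative, and a plain template stands in for the LLM-generated report.

```python
import json


def exceedance_percent(measured, limit):
    """Relative amount by which a measurement exceeds its limit, in percent."""
    return (measured - limit) / limit * 100.0


def build_outputs(defect_type, depths_mm, limit_mm):
    worst = max(depths_mm)
    compliant = worst <= limit_mm
    report = f"{len(depths_mm)} {defect_type} defects detected, maximum depth {worst} mm"
    if not compliant:
        report += f" ({exceedance_percent(worst, limit_mm):.0f}% over the {limit_mm} mm limit)"
    archive = json.dumps(
        {
            "defect_type": defect_type,
            "depths_mm": depths_mm,
            "limit_mm": limit_mm,
            "compliant": compliant,
        },
        sort_keys=True,
    )
    return report, archive


report, archive = build_outputs("longitudinal scratch", [0.08, 0.12, 0.15], 0.1)
print(report)
print(archive)
```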

[0056] In the embodiments of this application, the specific process can be divided into the following steps:

[0057] Step 1: User command input and preprocessing

[0058] The preprocessing section utilizes ESLPA's instruction parsing module. After receiving equipment images (RGB three-channel, resolution ≥1920×1080) and natural language instructions, the system first performs industrial-grade preprocessing. Image processing employs the Multi-Scale Retinex with Color Restoration (MSRCR) algorithm for illumination normalization, extracting and reconstructing illumination components using three Gaussian kernels (σ=15, 80, 250), which improves image usability in environments with strong shadows or oil contamination. In a foundry workshop test, this increased the proportion of analyzable images from 68% to 95%. Simultaneously, the instruction text is processed by a bidirectional LSTM cleaning module, which removes irrelevant stop words and standardizes dialectal terms (e.g., mapping a regional word for a scratch to the standard term "scratch"). Entity annotation is then performed using the equipment knowledge graph, laying the foundation for subsequent parsing.

[0059] Input: equipment image $I$ and natural language instruction $S$.

[0060] Image preprocessing: ImageNet normalization parameters are used to eliminate illumination differences.

[0061] Instruction cleaning:

[0062] $$S_{\mathrm{clean}} = \{\, w \in S \mid w \notin V_{\mathrm{stop}} \,\}$$

[0063] where the stop word list $V_{\mathrm{stop}}$ includes 200+ industry-irrelevant words ("please", "of", etc.).
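The instruction cleaning in Step 1 can be illustrated with a rule-based stand-in: drop stop words, then map non-standard terms to standard ones. Note that the source uses a bidirectional LSTM module for this; the dictionary lookup below is a deliberately simpler substitute, and both the stop-word subset and the term mappings are hypothetical.

```python
import re

STOP_WORDS = {"please", "of", "the", "a", "an"}              # tiny illustrative subset
TERM_MAP = {"scrape": "scratch", "rusting": "corrosion"}     # hypothetical mappings


def clean_instruction(text):
    """Lowercase, tokenize, drop stop words, and standardize known terms."""
    tokens = re.findall(r"[a-z0-9.]+", text.lower())
    kept = [TERM_MAP.get(tok, tok) for tok in tokens if tok not in STOP_WORDS]
    return " ".join(kept)


cleaned = clean_instruction("Please check the depth of the scrape on the piston rod")
print(cleaned)
```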

[0064] Step 2: Semantic parsing and structured task generation

[0065] The system utilizes ESLPA's instruction parsing and structured generation modules. The cleaned instruction is fed to a dual-path parsing engine: the main path uses a domain-fine-tuned BERT-Large model (768-dimensional hidden layers), whose understanding of industrial terminology is improved through adversarial training on 50,000 sets of maintenance manual data; the auxiliary path activates over 120 regular expression rules that match the ISO standard terminology library, and takes over automatically when the LLM output confidence falls below 0.8. The parser accurately extracts the target part's physical and spatial attributes (e.g., "bearing housing upper surface"), defect types, and measurement indicators (e.g., "crack length"), and integrates user thresholds and built-in standard libraries to output a machine-executable Protobuf task package.

[0066] LLM encoding:

[0067]

[0068] Model configuration: BERT-Large fine-tuning, 12 layers, 768-dimensional hidden layers.
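The dual-path idea in Step 2 (use the model's parse when it is confident, otherwise fall back to rules) can be sketched as follows. The single regular expression here is a hypothetical example rule written for the instruction format quoted in the testing section; the 0.8 switch-over value is the one stated in the text, and the model path is represented only by a result and a confidence passed in by the caller.

```python
import re

CONFIDENCE_FLOOR = 0.8   # switch-over value stated in the text

# One hypothetical rule; a real rule set would contain many such patterns.
RULE = re.compile(
    r"inspect\s+(?P<part>.+?)\s+(?P<defect>cracks?|scratch(?:es)?|corrosion)\s*"
    r"<\s*(?:threshold\s*)?(?P<limit>\d+(?:\.\d+)?)\s*(?P<unit>mm)",
    re.IGNORECASE,
)


def rule_based_parse(text):
    match = RULE.search(text)
    if match is None:
        return None
    return {
        "target_component": match.group("part"),
        "defect_type": match.group("defect"),
        "limit": float(match.group("limit")),
        "unit": match.group("unit"),
    }


def parse(text, model_result, model_confidence):
    """Prefer the model's parse; fall back to rules when its confidence is low."""
    if model_confidence >= CONFIDENCE_FLOOR:
        return model_result
    return rule_based_parse(text)


parsed = parse(
    "inspect bearing seat cracks < threshold 1.5mm",
    model_result=None,
    model_confidence=0.42,
)
print(parsed)
```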

[0069] Step 3: Visual localization and ROI extraction

[0070] The open vocabulary localization module of VDDA is used. Open vocabulary localization is initiated based on the task package: the Grounding DINO model (Swin-L backbone) calculates the cosine similarity between the text description and the image region, filtering candidate boxes that meet the threshold sim > 0.75; the SAM segmentation model receives spatial constraint parameters (such as "vertical direction") and generates pixel-level ROI masks, returning the top 2 candidate regions in symmetrical part scenes. The localization process is accelerated using TensorRT.

[0071] Primary localization uses Grounding DINO to generate candidate bounding boxes:

[0072]

[0073] The SAM model is used to accept text prompts, refine boundaries, and perform fine segmentation.

[0074]

[0075] If the task includes a spatial field, only areas that meet the orientation constraints will be retained:

[0076]
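The candidate selection in Step 3 combines three filters: a similarity threshold, an optional orientation constraint, and a top-k cut. The sketch below is one possible reading. The 0.75 threshold and the top-2 limit come from the text; interpreting "vertical direction" as a box that is taller than it is wide is an assumption, as are the class and function names.

```python
from dataclasses import dataclass

SIM_THRESHOLD = 0.75   # value stated in the text


@dataclass
class Candidate:
    box: tuple          # (x_min, y_min, x_max, y_max) in pixels
    similarity: float


def is_vertical(box):
    # Assumed meaning of the "vertical" constraint: taller than wide.
    x_min, y_min, x_max, y_max = box
    return (y_max - y_min) > (x_max - x_min)


def select_candidates(candidates, spatial=None, top_k=2):
    kept = [c for c in candidates if c.similarity > SIM_THRESHOLD]
    if spatial == "vertical":
        kept = [c for c in kept if is_vertical(c.box)]
    kept.sort(key=lambda c: c.similarity, reverse=True)
    return kept[:top_k]


candidates = [
    Candidate((0, 0, 10, 100), 0.90),    # vertical
    Candidate((0, 0, 100, 10), 0.95),    # horizontal, removed by the constraint
    Candidate((20, 0, 30, 80), 0.80),    # vertical
    Candidate((40, 0, 50, 90), 0.78),    # vertical, cut by top_k
    Candidate((60, 0, 70, 90), 0.60),    # below the similarity threshold
]
selected = select_candidates(candidates, spatial="vertical")
print(selected)
```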

[0077] Step 4: Perform layered defect detection

[0078] After acquiring the target ROI region, the system initiates a cascaded defect detection process. First, a 1024-dimensional feature vector is extracted using the visual baseline model DINOv2 (ViT-L / 14 architecture), and anomaly scores are calculated. When the Sigmoid function output value exceeds the industrial safety threshold of 0.85, a dedicated defect quantification module is triggered. This module employs a multi-expert model collaborative architecture, dynamically invoking optimized algorithms for different defect types.

[0079] Crack detection uses a U-Net++ segmentation network to generate a high-precision binary mask, which is then processed by skeletonization and a pixel chain tracing algorithm to calculate the physical length. The core technical formula is as follows: This ensures that the measurement error is strictly controlled within 0.04mm.
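A common way to turn a traced skeleton into a physical length, offered here as a sketch and not as the patent's formula, is to sum the Euclidean distances between consecutive skeleton pixels and multiply by the pixel scale. The pixel chain and the `mm_per_pixel` calibration value below are made up for illustration.

```python
import math


def chain_length_mm(pixel_chain, mm_per_pixel):
    """Sum distances between consecutive skeleton pixels, scaled to millimetres."""
    length_px = sum(math.dist(p, q) for p, q in zip(pixel_chain, pixel_chain[1:]))
    return length_px * mm_per_pixel


# Ordered (row, col) pixels along a traced crack skeleton.
chain = [(0, 0), (0, 1), (1, 2), (2, 3), (2, 4)]
length = chain_length_mm(chain, mm_per_pixel=0.05)
print(round(length, 4))
```

Diagonal steps contribute a factor of the square root of two, so this estimate is less biased than simply counting skeleton pixels.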

[0080] The corrosion analysis combines the ResNet-18 classification model with the GrabCut segmentation algorithm to calculate the area ratio of the corroded region.
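The area ratio mentioned in [0080] can be computed directly once a corrosion mask and an ROI mask are available. The sketch below assumes both are binary grids of equal shape and counts only pixels inside the ROI; how the masks are produced (classification plus GrabCut in the text) is outside this example.

```python
def corrosion_area_ratio(corrosion_mask, roi_mask):
    """Fraction of ROI pixels that are marked as corroded."""
    roi_pixels = 0
    corroded_pixels = 0
    for corrosion_row, roi_row in zip(corrosion_mask, roi_mask):
        for corroded, inside in zip(corrosion_row, roi_row):
            if inside:
                roi_pixels += 1
                if corroded:
                    corroded_pixels += 1
    if roi_pixels == 0:
        raise ValueError("ROI mask is empty")
    return corroded_pixels / roi_pixels


roi_mask = [[1, 1, 1, 0], [1, 1, 1, 0]]
corrosion_mask = [[1, 1, 0, 1], [0, 1, 0, 0]]   # the top-right pixel lies outside the ROI
ratio = corrosion_area_ratio(corrosion_mask, roi_mask)
print(ratio)
```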

[0081] A hyperspectral material compensation mechanism is introduced, and dynamic calibration is performed based on the reflectivity of metal surfaces: the compensation coefficient γ=1.35 for cast iron, γ=1.02 for aluminum alloy, and γ=0.95 for titanium alloy, effectively overcoming the misjudgment problem under strong reflective conditions.

[0082] 3D deformation quantization relies on the PointNet++ point cloud processing architecture. It aligns the design model with the actual point cloud through the iterative closest point (ICP) registration algorithm and calculates the maximum curvature deviation of key surfaces.

[0083] Step 5: Testing Steps

[0084] Step 5-1: Core Module Unit Testing

[0085] 2,000 instructions covering dialects, abbreviations, and multilingual expressions (e.g., "inspect bearing seat cracks < threshold 1.5mm") were injected, and the accuracy of structured task package generation was measured. Results showed that 94.3% of the instructions were correctly parsed (confidence interval CI = 92.1%–95.8%), significantly better than the 67.5% of traditional regular expression matching schemes. Failure case analysis indicated that 7% of the errors stemmed from unregistered technical terms (e.g., "pockmarks"), which has been addressed through dynamic expansion using a knowledge graph.

[0086] In a gearbox image where 30% was obscured by oil, the open vocabulary localization module still locked onto the target with an accuracy of 82.7% mAP@0.5, a 38 percentage point improvement over the classic Faster R-CNN. Under vibration conditions (5 g acceleration), the coordinate drift was controlled within ±3 pixels, meeting the ISO 9283 robot vision standard.

[0087] Step 5-2: Defect Quantization Accuracy Calibration

[0088] Crack measurement and testing

[0089] 200 repeated tests were performed on ASTM E290 standard test blocks containing pre-existing cracks (crack length 2.0-10.0 mm). The linear regression equation relating the length value output by the system to the coordinate measuring machine value is:

[0090]

[0091] The maximum absolute error is 0.038 mm, and the material compensation mechanism reduces the measurement deviation of cast iron parts to 1 / 3 of that of the uncompensated system.

[0092] Rust area verification: compared with the true value, the relative error of the system's area ratio detection result is only 1.8% (under the condition of 50% oil coverage).
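The crack-length calibration in Step 5-2 compares system readings with coordinate measuring machine readings. A minimal sketch of that comparison is an ordinary least-squares line fit plus the maximum absolute error. The readings below are made up for illustration and are not data from the patent; the function names are likewise assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def max_abs_error(reference, measured):
    return max(abs(m - r) for r, m in zip(reference, measured))


# Made-up readings for illustration only.
cmm_mm = [2.0, 4.0, 6.0, 8.0, 10.0]
system_mm = [2.03125, 4.03125, 6.03125, 8.03125, 10.03125]

slope, intercept = fit_line(cmm_mm, system_mm)
worst = max_abs_error(cmm_mm, system_mm)
print(slope, intercept, worst)
```

A slope near 1 with a small intercept indicates a constant offset that calibration can remove, while a slope away from 1 would point to a scale error.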

[0093] The foregoing description illustrates and describes a preferred embodiment of the present invention. However, as previously stated, it should be understood that the present invention is not limited to the forms disclosed herein and should not be construed as excluding other embodiments. It can be used in various other combinations, modifications, and environments, and can be altered within the scope of the inventive concept described herein through the foregoing teachings or techniques or knowledge in related fields. Any modifications and variations made by those skilled in the art that do not depart from the spirit and scope of the present invention should be within the protection scope of the appended claims.

Claims

1. An equipment quality inspection system integrating natural language commands and visual analysis, characterized in that it comprises: an equipment system language processing agent (ESLPA), used to parse key semantic elements in user commands, convert the parsing results into machine-readable JSON / Protobuf format task packages, process clarification requests from the visual defect detection agent, and dynamically initiate user follow-up questions or intent corrections; a visual defect detection agent (VDDA), used to achieve real-time text-driven part localization and perform hierarchical defect analysis; and a collaborative interface and output system, used to transmit task packages and generate natural language reports and digital quality archives.

2. The equipment quality inspection system integrating natural language commands and visual analysis according to claim 1, characterized in that the equipment system language processing agent includes: an instruction parsing module, which, based on an LLM natural language understanding engine, extracts key semantic elements from user instructions, including target parts, defect types, and quality thresholds; a structured generation module, which converts the parsed results into machine-readable JSON / Protobuf format task packages containing a target_component location description, a defect_types list, and a quality_standards threshold; and an interactive control module, which handles clarification requests from the visual defect detection agent and dynamically initiates user follow-up questions or intent corrections.

3. The equipment quality inspection system integrating natural language commands and visual analysis according to claim 1, characterized in that the visual defect detection agent includes: an open vocabulary localization module, which integrates an MLLM to achieve text-driven real-time part localization; a layered defect analysis module, which includes a general anomaly perception layer that uses a visual base model to extract multi-scale features of the ROI region and identify potential anomalies, and a dedicated defect quantification layer that calls a pre-built lightweight model to perform accurate defect measurement; and a confidence assessment module, which outputs the confidence score of defect detection and triggers the cross-agent clarification process in low-confidence scenarios.

4. The equipment quality inspection system integrating natural language commands and visual analysis according to claim 1, characterized in that the collaborative interface and output system includes: a structured instruction channel, which transmits task packets based on gRPC / message queue, the fields of the task packet including task_id, target_roi_coord, and defect_metrics; a natural language report generation module, which generates user-readable conclusions by having ESLPA call the LLM; and a digital quality archive generation module, which automatically generates JSON logs containing defect coordinates, quantitative data, and standard compliance.

5. The equipment quality inspection system integrating natural language commands and visual analysis according to claim 1, characterized in that: in the equipment system language processing agent ESLPA, the part name, specifications, material, inspection items, and tolerances / thresholds are extracted from the input natural language instructions / work order information and concatenated according to a preset template to form a part description text $T$; the parsed part description text $T$ is then encoded into a semantic vector $\mathbf{v}_T$ using the LLM; the VDDA includes a candidate region generation submodule, a visual feature encoding submodule, a text feature encoding submodule, and a cross-modal alignment / fusion submodule, wherein the visual / text feature encoding can be implemented by CLIP, BLIP-2, or other vision-language pre-trained models; VDDA integrates a vision-language model to encode each image region $R_i$ as a visual vector $\mathbf{v}_{R_i}$ and perform semantic alignment; the image region is a candidate Region of Interest (ROI) for the part or defect to be detected, generated from the input image by the object detection / instance segmentation network, and the ROI can be a bounding box, a pixel-level mask, or a combination thereof; this module accepts the part description text parsed by ESLPA and encodes it into the semantic vector $\mathbf{v}_T$, and simultaneously encodes the image region into the visual vector $\mathbf{v}_{R_i}$; location matching is calculated using cosine similarity: $\mathrm{sim}(T, R_i) = \frac{\mathbf{v}_T \cdot \mathbf{v}_{R_i}}{\lVert \mathbf{v}_T \rVert \, \lVert \mathbf{v}_{R_i} \rVert}$; regions whose similarity is greater than a threshold are selected as target ROIs to ensure the accuracy of open vocabulary localization.

6. The equipment quality inspection system integrating natural language commands and visual analysis according to claim 1, characterized in that: in the dedicated defect quantification layer, let the set of defect types in the task instruction package be $D = \{d_1, d_2, \ldots, d_n\}$; VDDA loads the pre-registered model library and performs parallel detection: $Y = \{\, M_{d_i}(I_{\mathrm{ROI}}) \mid d_i \in D \,\}$; where $M_{d_i}$ is the pre-trained model for defect type $d_i$, $I_{\mathrm{ROI}}$ is the target region image, and $Y$ is the collection of all defect detection results; the pre-trained model is a defect recognition model obtained by supervised / self-supervised pre-training based on historical quality inspection image datasets and / or public defect datasets; the datasets contain normal samples and multiple types of defect samples with their category labels, and the model can be retrained or fine-tuned for target equipment scenarios.

7. The equipment quality inspection system integrating natural language commands and visual analysis according to claim 1, characterized in that: the adapter fine-tuning technique used by the visual defect detection agent in few-shot defect transfer has the following objective function: $\min_{\phi} \sum_{(x, y) \in D_{\mathrm{new}}} \mathcal{L}_{\mathrm{CE}}\big(h(A_{\phi}(f_{\theta}(x))),\, y\big)$; where $D_{\mathrm{new}}$ is the small dataset containing new defect samples; $f_{\theta}$ is the feature extractor of the pre-trained visual base model, whose parameters $\theta$ are frozen; $A_{\phi}$ is the Adapter module to be trained, with parameters $\phi$; $h$ is the classification or segmentation head; $\mathcal{L}_{\mathrm{CE}}$ is the cross-entropy loss function; by minimizing this loss function, only the adapter parameters $\phi$ are optimized, which enables the model to quickly adapt to the detection of new defects; the small dataset is obtained by manually reviewing and labeling newly added defect samples, and the labels include defect category labels and optional defect location labels; the samples are input into the model containing the Adapter to obtain the output $f_{\phi}(x)$, and the loss function, including classification cross-entropy loss and / or localization regression loss and segmentation loss, is calculated against the label $y$ in order to train the Adapter parameters; the probabilistic model for zero-shot, text-guided part localization is: $P(\mathrm{ROI} \mid T, I) = \sigma\big(E_T(T)^{\top} W \, E_I(I)\big)$; where $\sigma$ is the softmax function; $T$ is the part description text from the task package; $I$ is the input equipment image; $E_T$ and $E_I$ are a text encoder and an image encoder, respectively; $W$ is a learnable weight matrix; during the training phase, the learnable weight matrix is obtained by minimizing the loss function and updating it using backpropagation and gradient descent, with the backbone network parameters fixed during training, so that only the adapter parameters and the weight matrix are updated.

8. An equipment quality inspection method integrating natural language commands and visual analysis, based on the system described in any one of claims 1 to 7, characterized in that it comprises: the user inputs equipment images, natural language commands, and quantitative standards, and ESLPA generates a task instruction package containing structured fields; VDDA initiates the open environment detection process based on this instruction package, first using multimodal large model open vocabulary visual localization technology to map the semantic description of the part to the image coordinate space in real time and accurately lock the target area; hierarchical defect analysis is initiated, in which abnormal regions are perceived using the generalized features of the basic visual model and a lightweight dedicated model is then driven to perform defect quantification calculations, the detection process dynamically adapting to the defect types and standards in the task instructions; when encountering ambiguous positioning or insufficient detection confidence, VDDA automatically generates a clarification request with visual evidence, and ESLPA then uses LLM contextual reasoning to re-parse the intent or guide user interaction, forming an intelligent closed loop of "parse-execute-verify"; ESLPA integrates the structured results output by VDDA to generate a dual-track output that combines natural language reports and digital quality archives.