An AI vision-based robot fault detection system

The invention combines an improved GroundingDINO model with the Dempster-Shafer evidence theory algorithm in an AI vision-based robot fault detection system, addressing the accuracy and stability problems of robot fault detection in complex environments and achieving high-precision, interference-resistant fault identification.

CN122243983A · Pending · Publication Date: 2026-06-19 · SHANGHAI BAISEN NETWORK TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: SHANGHAI BAISEN NETWORK TECH CO LTD
Filing Date: 2026-03-25
Publication Date: 2026-06-19

Application Information

Patent Timeline

25 Mar 2026: Application
19 Jun 2026: Publication (CN122243983A)

IPC: G06T7/00; B25J19/00; G06V10/80; G06V10/77; G06V10/764; G06V10/72; G06V10/75; G06V10/82; G06N3/0455

AI Technical Summary

Technical Problem

Existing robot fault detection methods are easily affected by lighting changes, occlusion, and background noise in complex industrial environments. They struggle to align multi-view data accurately and represent it in a unified form, and they cannot continuously model structural motion trajectories and deformation evolution. As a result, fault detection results are unstable, with high false detection and false negative rates.

Method Used

An AI vision-based robot fault detection system is adopted. The system acquires multi-view image sequences through a vision acquisition module, performs anomaly imprinting signature processing using the signature addressing layer in the improved GroundingDINO model, builds a structural semantic graph and a continuous time series, and fuses multi-source evidence using the Dempster-Shafer evidence theory algorithm, so that the robot's operating state can be adjusted according to the fault detection results.

Benefits of Technology

It improves the accuracy and anti-interference ability of robot fault detection, enhances the ability to identify progressive faults and complex fault modes, reduces false detection and false negative rates, and improves the stability of fault detection results.

✦ Generated by Eureka AI based on patent content.

Abstract

This invention discloses an AI vision-based robot fault detection system, comprising: a vision acquisition module for acquiring multi-view image sequences during robot operation; a feature generation module for generating structural semantic description sequences; a signature detection module for feeding these sequences into an improved GroundingDINO model, performing anomaly imprinting signature processing through a signature addressing layer, and outputting target detection results; a graph construction module for constructing a structural semantic graph; an evolution modeling module for generating structural state evolution features; an evidence fusion module for generating fault detection results; and a control execution module for generating control commands to adjust the robot's operating state. The invention improves the accuracy and stability of robot fault detection, enhances anti-interference capability in complex industrial environments, and enables early warning and efficient control of faults.

Description

Technical Field

[0001] This invention relates to the field of computer vision and fault detection technology, and in particular to a robot fault detection system based on AI vision.

Background Technology

[0002] With the continuous improvement of industrial automation and the ongoing development of intelligent manufacturing, industrial robots are increasingly used in assembly, handling, and precision machining, and real-time monitoring and fault detection for robot operation have received widespread attention. Existing robot fault detection methods mainly rely on vibration signal analysis, current signal monitoring, or detection based on a single source of visual information. In practice, these methods generally suffer from the following problems. First, the acquired images are easily affected by lighting changes, occlusion, and background noise in complex industrial environments; the boundaries of structural regions become blurred and local details are lost, which reduces the positioning accuracy of key structural components and makes anomalies difficult to extract accurately. Second, multi-view image data exhibit temporal asynchrony and spatial calibration errors; existing methods struggle to align multi-view data accurately and represent it in a unified form, which leads to inconsistent structural information and deviations in time-series analysis. Third, for the nonlinear, multi-state coupled changes that occur during robot operation, traditional methods based on single-frame images or simple feature matching cannot continuously model structural motion trajectories and deformation evolution, so they identify progressive faults and complex fault modes poorly. Fourth, for multi-source information fusion, existing methods mostly use simple weighting or rule-based judgment, which cannot effectively integrate structural features, topological relationships, and state evolution information, resulting in unstable fault detection results and high false detection and false negative rates.

[0003] Therefore, how to provide a robot fault detection system based on AI vision is a problem that urgently needs to be solved by those skilled in the art.

Summary of the Invention

[0004] One objective of this invention is to propose a robot fault detection system based on AI vision. The system uses a vision acquisition module to acquire multi-view image sequences and a feature generation module to construct a structural semantic description sequence, and it combines an improved GroundingDINO model with a signature addressing layer to perform anomaly imprinting signature processing and obtain target detection results. A graph construction module then generates a structural semantic graph, an evolution modeling module constructs a continuous time series and extracts structural state evolution features, an evidence fusion module fuses multi-source evidence using the Dempster-Shafer evidence theory algorithm, and finally a control execution module generates control commands to adjust the robot's operating state. The system offers high detection accuracy, strong anti-interference capability, and stable fault identification.

[0005] According to an embodiment of the present invention, a robot fault detection system based on AI vision includes:
The vision acquisition module, used to acquire multi-view image sequences during robot operation and generate a standard visual sequence;
The feature generation module, used to extract structural region features, local appearance change features, and motion change features from the standard visual sequence, map them into a candidate set of structural components, a candidate set of anomaly representations, and a candidate set of operating states, respectively, and combine and encode them to generate a structural semantic description sequence;
The signature detection module, used to input the standard visual sequence and the structural semantic description sequence into the improved GroundingDINO model, perform anomaly imprinting signature processing through the signature addressing layer in the improved GroundingDINO model, generate signature matching results, filter to obtain cross-modal query vectors, and output target detection results;
The graph construction module, used to extract the spatial location, scale information, and semantic labels of the robot's key structural regions from the target detection results and construct a structural semantic graph;
The evolution modeling module, used to stack the structural semantic graphs generated frame by frame to construct a continuous time series; construct the structural motion trajectory based on changes in the spatial position of nodes in the continuous time series; calculate the structural deformation energy based on relative displacement changes between adjacent nodes; and construct the structural state stability domain based on the evolution trend of the structural motion trajectory, thereby generating structural state evolution features;
The evidence fusion module, used to construct a multi-source evidence set from the signature matching results, the topological deviation in the structural semantic graph, and the structural state evolution features, and to fuse the multi-source evidence with the Dempster-Shafer evidence theory algorithm to generate fault detection results;
The control execution module, used to output the fault detection results to the execution end of the detection system and, if the fault risk level in the fault detection results reaches the preset risk threshold, to generate a control command that adjusts the robot's operating state.

[0006] Optionally, the vision acquisition module is specifically configured to:
Set up two acquisition viewpoints within the robot's operating area to acquire images of two of the following areas: the robot's base area, link area, joint area, transmission area, and end effector area;
Continuously acquire the image data corresponding to each acquisition viewpoint at a uniform sampling frequency, and align the image data from different acquisition viewpoints at the frame level based on timestamps to obtain a multi-view synchronized image sequence;
Apply noise suppression, lens distortion correction, and brightness normalization, in that order, to each frame in the multi-view synchronized image sequence to obtain a preprocessed image sequence;
Calibrate the spatial correspondence of the preprocessed image sequence based on the intrinsic and extrinsic parameters of each acquisition viewpoint, establishing the structural region correspondence between different acquisition viewpoints;
Combine the calibrated preprocessed image sequences in chronological order to generate the standard visual sequence.
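
The frame-level timestamp alignment described above can be illustrated with a minimal sketch. The patent text does not specify the matching rule, so nearest-timestamp pairing is an assumption here, and the function name `align_frames` and the `tolerance` parameter are hypothetical.

```python
from bisect import bisect_left


def align_frames(ts_a, ts_b, tolerance):
    """Pair each frame of view A with the nearest-in-time frame of view B.

    ts_a, ts_b: sorted capture timestamps (for example, in milliseconds).
    tolerance:  hypothetical maximum gap allowed for a valid pair.
    Returns a list of (index_in_a, index_in_b) pairs.
    """
    pairs = []
    for i, t in enumerate(ts_a):
        j = bisect_left(ts_b, t)
        # Only the neighbours on either side of the insertion point can be nearest.
        candidates = [k for k in (j - 1, j) if 0 <= k < len(ts_b)]
        if not candidates:
            continue
        best = min(candidates, key=lambda k: abs(ts_b[k] - t))
        if abs(ts_b[best] - t) <= tolerance:
            pairs.append((i, best))
    return pairs


pairs = align_frames([0, 100, 200, 300], [10, 110, 190, 450], tolerance=20)
```

Frames with no partner inside the tolerance are dropped in this sketch; a real system might interpolate or hold the last frame instead.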

[0007] Optionally, the feature generation module is specifically configured to:
Read image frames in chronological order from the standard visual sequence, and perform region segmentation on each image frame to obtain structural region images corresponding to the robot base region, link region, joint region, transmission region, and end effector region;
Extract structural region features based on the contour boundaries, region area, aspect ratio, edge distribution, and texture distribution in the structural region images, and generate a candidate set of structural components according to the correspondence between the structural region features and preset structural categories;
Extract local appearance change features based on grayscale changes, texture changes, edge breakage changes, local brightness changes, and regional shape changes in corresponding regions of adjacent image frames in the standard visual sequence, and generate a candidate set of anomaly representations according to the correspondence between the local appearance change features and preset anomaly representation categories;
Extract motion change features based on the position changes, displacement direction changes, velocity changes, and trajectory continuity changes of corresponding regions in adjacent image frames in the standard visual sequence, and generate a candidate set of operating states according to the correspondence between the motion change features and preset operating state categories;
Associate the three candidate sets within the same image frame to form correspondences among structural component candidates, anomaly representation candidates, and operating state candidates;
Combine and encode the structural component candidates, anomaly representation candidates, and operating state candidates in their arranged order to generate the structural semantic description sequence.

[0008] Optionally, combining and encoding the structural component candidates, anomaly representation candidates, and operating state candidates in their arranged order to generate the structural semantic description sequence specifically includes:
Arranging the structural component candidates, anomaly representation candidates, and operating state candidates associated within the same image frame in sequence to obtain candidate arrangement groups;
Assembling each candidate arrangement group with the structural component candidate first, the anomaly representation candidate in the middle, and the operating state candidate last, to form a single-frame semantic unit;
Writing a structure category identifier for the structural component candidate, an anomaly category identifier for the anomaly representation candidate, and a state category identifier for the operating state candidate in the single-frame semantic unit;
Connecting the structure category identifier, anomaly category identifier, and state category identifier in the single-frame semantic unit according to a preset separation rule to generate a single-frame description fragment;
Arranging the single-frame description fragments in the chronological order of the image frames in the standard visual sequence to form a description fragment sequence;
Combining adjacent single-frame description fragments in the description fragment sequence in order to generate the structural semantic description sequence.
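
As a rough illustration of the encoding steps above, the sketch below joins the three identifiers of each frame in the stated order and sorts the fragments by frame time. The `|` separator stands in for the unspecified "preset separation rule", and the identifier strings are made-up examples, not values from the patent.

```python
SEPARATOR = "|"  # assumption: stands in for the patent's preset separation rule


def encode_fragment(structure_id, anomaly_id, state_id):
    """Join the identifiers: structure first, anomaly in the middle, state last."""
    return SEPARATOR.join((structure_id, anomaly_id, state_id))


def encode_sequence(frames):
    """frames: iterable of (timestamp, structure_id, anomaly_id, state_id).

    Returns the single-frame description fragments in chronological order.
    """
    ordered = sorted(frames, key=lambda f: f[0])
    return [encode_fragment(s, a, st) for _, s, a, st in ordered]


sequence = encode_sequence([
    (2, "JOINT", "EDGE_BREAK", "SLOW"),
    (1, "BASE", "NONE", "NORMAL"),
])
```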

[0009] Optionally, the improved GroundingDINO model specifically includes an image encoding layer, a text encoding layer, a feature enhancement layer, a signature addressing layer, and a cross-modal decoding layer:
The image encoding layer performs multi-scale feature extraction on the image frames in the standard visual sequence to generate a visual feature sequence;
The text encoding layer vectorizes the description fragments in the structural semantic description sequence to generate a semantic feature sequence;
The feature enhancement layer receives the visual feature sequence and the semantic feature sequence and performs cross-modal feature interaction to generate cross-modal fusion features;
The signature addressing layer splits the structure category identifiers, anomaly category identifiers, and state category identifiers in the structural semantic description sequence, in order, to generate structural component atomic sequences, anomaly representation atomic sequences, and operating state atomic sequences; performs neighborhood traversal on the local regions of the cross-modal fusion features that correspond to structural regions, extracts edge distribution, texture change, brightness change, and shape offset information, and generates anomaly imprinting sequences according to discrete coding rules; combines the three atomic sequences in a predetermined order to form an atomic combination sequence; matches the atomic combination sequence item by item against the anomaly imprinting sequences to generate a matching result sequence; and selects the corresponding cross-modal feature positions based on the matching result sequence to generate cross-modal query vectors;
The cross-modal decoding layer decodes based on the cross-modal query vectors and the cross-modal fusion features and outputs the target detection results, which include the structural region location, structural region scale, and semantic labels.
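
The item-by-item matching in the signature addressing layer can be pictured with a toy sketch. The patent does not define the discrete coding rules or the matching criterion, so everything below is an assumption: imprint codes are represented as plain string tuples keyed by feature position, matching is exact equality, and a position is selected only when all three items match. The names `match_item_by_item` and `select_query_positions` are hypothetical.

```python
def match_item_by_item(atoms, imprint):
    """Compare each atom with the imprint code at the same index."""
    return tuple(a == c for a, c in zip(atoms, imprint))


def select_query_positions(fragment, imprints, sep="|"):
    """fragment: one description fragment, e.g. "JOINT|EDGE_BREAK|SLOW".
    imprints: dict mapping a feature position to its (structure, anomaly, state) codes.

    Returns (matching results per position, positions where every item matched).
    """
    atoms = fragment.split(sep)  # atomic combination sequence for this fragment
    results = {pos: match_item_by_item(atoms, codes) for pos, codes in imprints.items()}
    selected = [pos for pos, flags in results.items() if all(flags)]
    return results, selected


results, selected = select_query_positions(
    "JOINT|EDGE_BREAK|SLOW",
    {
        (0, 0): ("BASE", "NONE", "NORMAL"),
        (3, 4): ("JOINT", "EDGE_BREAK", "SLOW"),
        (3, 5): ("JOINT", "NONE", "SLOW"),
    },
)
```

In the model described above, the selected positions would index into the cross-modal fusion features to form query vectors; this sketch stops at position selection.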

[0010] Optionally, the graph construction module is specifically configured to:
Obtain the structural region locations, structural region scales, and semantic labels from the target detection results, and arrange them in the chronological order of the image frames;
Determine, from the structural region locations and semantic labels in the same image frame, the structural component nodes corresponding to the robot base region, link region, joint region, transmission region, and end effector region;
Write the node identifier, semantic label, center coordinates, region width, and region height for each structural component node to form a node attribute set;
Determine the structural connection relationships and spatial topological relationships based on the spatial distance, relative orientation, and regional adjacency between different structural component nodes in the same image frame;
Establish connecting edges between pairs of structural component nodes that satisfy a structural connection relationship, and topological edges between pairs that satisfy a spatial topological relationship;
Write the edge type, start node identifier, end node identifier, inter-node distance, and relative orientation for each edge to form an edge attribute set;
Combine the node attribute set and the edge attribute set of the same image frame to generate a single-frame structural semantic graph, and arrange the single-frame structural semantic graphs in the chronological order of the image frames in the standard visual sequence to form the structural semantic graphs generated frame by frame.
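
A single-frame structural semantic graph of this kind might be assembled as in the sketch below. The rule that separates connecting edges from topological edges is not given in the text, so a simple distance threshold (`connect_dist`, hypothetical) is used purely for illustration, and the dictionary field names are likewise assumptions.

```python
import math


def build_frame_graph(detections, connect_dist):
    """detections: list of dicts with keys id, label, cx, cy, w, h.

    Returns {"nodes": {id: attributes}, "edges": [edge attributes]}.
    """
    nodes = {d["id"]: dict(d) for d in detections}
    edges = []
    ids = sorted(nodes)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            dx = nodes[b]["cx"] - nodes[a]["cx"]
            dy = nodes[b]["cy"] - nodes[a]["cy"]
            dist = math.hypot(dx, dy)
            edges.append({
                # Assumed rule: nearby components are "connected", others only "topological".
                "type": "connection" if dist <= connect_dist else "topology",
                "start": a,
                "end": b,
                "distance": dist,
                "orientation": math.degrees(math.atan2(dy, dx)),
            })
    return {"nodes": nodes, "edges": edges}


graph = build_frame_graph(
    [
        {"id": "base", "label": "base", "cx": 0, "cy": 0, "w": 10, "h": 10},
        {"id": "link", "label": "link", "cx": 3, "cy": 4, "w": 4, "h": 12},
        {"id": "joint", "label": "joint", "cx": 30, "cy": 40, "w": 5, "h": 5},
    ],
    connect_dist=10,
)
```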

[0011] Optionally, stacking the structural semantic graphs generated frame by frame to construct a continuous time series specifically includes:
Obtaining the structural semantic graphs generated frame by frame in the chronological order of the image frames, and extracting the node attribute set and the edge attribute set from the structural semantic graph of each frame;
Mapping the node attribute sets of adjacent image frames to each other according to node identifiers and semantic labels to form node association results;
Mapping the edge attribute sets of adjacent image frames to each other according to the start node identifier, end node identifier, and edge type to form edge association results;
Writing the node attribute set and edge attribute set of each frame into its position on the time axis, in the chronological order of the image frames, to form a time-ordered arrangement result;
Stacking the node association results and edge association results in the time-ordered arrangement result in sequence to form the node time series set and the edge time series set of the continuous time series;
Combining the node time series set and the edge time series set synchronously to generate the continuous time series.
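
One way to picture the stacking step is to key nodes by (identifier, label) and edges by (start, end, type), then append each frame's attributes in time order. This is a sketch under assumed data shapes (the graph dictionaries and field names are hypothetical), not the patent's data structure.

```python
from collections import defaultdict


def stack_graphs(frame_graphs):
    """frame_graphs: list of (timestamp, graph), where graph has "nodes" and "edges".

    Returns (node_series, edge_series): per-key lists of time-stamped attributes.
    """
    node_series = defaultdict(list)
    edge_series = defaultdict(list)
    # Sort on the timestamp only, so the graph dictionaries are never compared.
    for t, graph in sorted(frame_graphs, key=lambda item: item[0]):
        for node_id, attrs in graph["nodes"].items():
            node_series[(node_id, attrs["label"])].append((t, attrs["cx"], attrs["cy"]))
        for e in graph["edges"]:
            key = (e["start"], e["end"], e["type"])
            edge_series[key].append((t, e["distance"], e["orientation"]))
    return dict(node_series), dict(edge_series)


g0 = {"nodes": {"n1": {"label": "joint", "cx": 0.0, "cy": 2.0}}, "edges": []}
g1 = {
    "nodes": {"n1": {"label": "joint", "cx": 1.0, "cy": 2.0}},
    "edges": [{"start": "n1", "end": "n2", "type": "connection",
               "distance": 5.0, "orientation": 0.0}],
}
node_series, edge_series = stack_graphs([(1, g1), (0, g0)])
```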

[0012] Optionally, constructing the structural motion trajectory based on changes in the spatial position of nodes in the continuous time series, calculating the structural deformation energy based on relative displacement changes between adjacent nodes, and constructing the structural state stability domain based on the evolution trend of the structural motion trajectory to generate structural state evolution features specifically includes:
Reading the center coordinate sequence corresponding to the same node identifier from the node time series set of the continuous time series, and connecting the coordinate points in chronological order to generate the structural motion trajectory;
Calculating the displacement increment, displacement direction change, and trajectory deflection from the difference in center coordinates between adjacent moments in the structural motion trajectory to form a trajectory change sequence;
Reading the inter-node distance and relative orientation of adjacent nodes at adjacent moments from the edge time series set of the continuous time series, and calculating the change in inter-node distance and the change in relative orientation;
Calculating the relative displacement change between adjacent nodes from the changes in inter-node distance and relative orientation to form a deformation change sequence;
Accumulating the relative displacement changes in the deformation change sequence according to preset energy calculation rules to generate a structural deformation energy sequence;
Determining, from the trajectory change sequence and the structural deformation energy sequence, the stable state interval and the deviation state interval for each time position in the continuous time series to form the structural state stability domain;
Combining the structural motion trajectory, trajectory change sequence, structural deformation energy sequence, and structural state stability domain to generate the structural state evolution features.
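
The trajectory, energy, and stability computations can be sketched as follows. The "preset energy calculation rules" are not given, so a cumulative sum of squared distance changes is assumed, and the stability domain is reduced to a single energy threshold (`limit`, hypothetical); the patent also uses the trajectory change sequence for this decision, which the sketch omits.

```python
import math


def trajectory_changes(points):
    """points: (x, y) centre coordinates of one node in time order.

    Returns per-step (displacement, heading_deg, deflection_deg).
    """
    steps, prev_heading = [], None
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx, dy = x1 - x0, y1 - y0
        heading = math.degrees(math.atan2(dy, dx))
        if prev_heading is None:
            deflection = 0.0
        else:
            diff = abs(heading - prev_heading) % 360.0
            deflection = min(diff, 360.0 - diff)  # wrap to the range [0, 180]
        steps.append((math.hypot(dx, dy), heading, deflection))
        prev_heading = heading
    return steps


def deformation_energy(distances):
    """Assumed rule: running sum of squared changes in inter-node distance."""
    energy, total = [], 0.0
    for d0, d1 in zip(distances, distances[1:]):
        total += (d1 - d0) ** 2
        energy.append(total)
    return energy


def stability_domain(energy, limit):
    """Label each time position as stable or deviating (threshold is an assumption)."""
    return ["stable" if e <= limit else "deviation" for e in energy]


steps = trajectory_changes([(0, 0), (1, 0), (1, 1)])
energy = deformation_energy([10, 10, 11, 13])
domain = stability_domain(energy, limit=2.0)
```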

[0013] Optionally, the evidence fusion module is specifically configured to:
Obtain the signature matching results, the topological deviation in the structural semantic graph, and the structural state evolution features, and align them by the time position corresponding to the same image frame;
Extract, from the matching result sequence in the signature matching results, the anomaly representation matching values and operating state matching values corresponding to each structural component to form a first evidence sequence;
Calculate, from the node attribute sets and edge attribute sets in the structural semantic graph, the deviation values of the structural connection relationships and spatial topological relationships between adjacent image frames to form a second evidence sequence;
Extract, from the trajectory change sequence, structural deformation energy sequence, and structural state stability domain in the structural state evolution features, the trajectory offset value, deformation energy value, and stability domain deviation value at the corresponding time position to form a third evidence sequence;
Normalize the first, second, and third evidence sequences into a unified evidence representation format to generate the multi-source evidence set;
Construct basic probability assignments based on the fault category, fault location, and fault risk level corresponding to the different evidence items in the multi-source evidence set;
Combine the basic probability assignments pairwise according to the Dempster-Shafer evidence theory algorithm, and fuse the combined results sequentially to generate the fused probability assignment;
Determine the fault detection results by ranking the probability values of the fault category, fault location, and fault risk level in the fused probability assignment.
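
The pairwise combination step corresponds to Dempster's rule of combination, which multiplies the masses of intersecting hypotheses and renormalizes by the non-conflicting mass. The sketch below is a generic implementation of that rule; the two-element frame and the mass values are made-up numbers for illustration, not data from the patent.

```python
from functools import reduce
from itertools import product


def combine(m1, m2):
    """Dempster's rule for two mass functions keyed by frozenset hypotheses."""
    fused, conflict = {}, 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            fused[inter] = fused.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb  # mass assigned to contradictory hypotheses
    if 1.0 - conflict <= 1e-12:
        raise ValueError("total conflict: evidence cannot be combined")
    return {h: w / (1.0 - conflict) for h, w in fused.items()}


def fuse(masses):
    """Fuse a list of mass functions sequentially, two at a time."""
    return reduce(combine, masses)


FAULT, NORMAL = frozenset({"fault"}), frozenset({"normal"})
EITHER = FAULT | NORMAL  # full frame: expresses ignorance

signature_evidence = {FAULT: 0.6, EITHER: 0.4}
topology_evidence = {FAULT: 0.5, NORMAL: 0.3, EITHER: 0.2}
fused = fuse([signature_evidence, topology_evidence])
decision = max(fused, key=fused.get)
```

Here the conflicting mass is 0.6 × 0.3 = 0.18, so the remaining products are divided by 0.82 before ranking.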

[0014] Optionally, the control execution module is specifically configured to:
Generate a risk assessment result sequence by comparing the fault risk level with the preset risk threshold;
Write control trigger identifiers at the time positions in the risk assessment result sequence where the fault risk level reaches the preset risk threshold, forming a trigger position sequence;
Generate a control command sequence based on the fault category and fault location corresponding to the trigger position sequence, where the control command sequence includes deceleration commands, stop commands, and path adjustment commands;
Assign the command types in the control command sequence according to the fault category and fault location to obtain an execution command sequence;
Output the execution command sequence to the execution end of the detection system in chronological order;
Record the executed command sequence, based on the command reception status returned by the execution end and the robot's updated operating state, to form the control output results.
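
The threshold check and command assignment might look like the sketch below. The mapping from fault category to command type is not specified in the text, so `COMMAND_BY_CATEGORY`, the category names, and the fallback to a stop command are all hypothetical.

```python
# Hypothetical mapping from fault category to one of the three command types.
COMMAND_BY_CATEGORY = {
    "joint_wear": "decelerate",
    "transmission_break": "stop",
    "end_effector_offset": "adjust_path",
}


def control_commands(results, threshold):
    """results: list of (timestamp, fault_category, fault_location, risk_level).

    Returns time-ordered (timestamp, command, fault_location) tuples for every
    result whose risk level reaches the threshold.
    """
    commands = []
    for t, category, location, risk in sorted(results):
        if risk >= threshold:
            # Assumption: unknown categories fall back to the safest command.
            command = COMMAND_BY_CATEGORY.get(category, "stop")
            commands.append((t, command, location))
    return commands


commands = control_commands(
    [
        (2, "transmission_break", "transmission", 3),
        (1, "joint_wear", "joint", 1),
        (0, "joint_wear", "joint", 2),
        (3, "unknown", "base", 2),
    ],
    threshold=2,
)
```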

[0015] The beneficial effects of this invention are as follows. A signature addressing layer is introduced into the improved GroundingDINO model. Through anomaly imprinting signature processing, the structure category identifiers, anomaly category identifiers, and state category identifiers are matched one by one with the cross-modal fusion features to generate cross-modal query vectors and output target detection results. This improves the accuracy of structural component recognition and the ability to characterize and localize anomalies in complex scenarios, and reduces the impact of occlusion and noise. The graph construction module generates a structural semantic graph, and the evolution modeling module constructs a continuous time series. By combining the structural motion trajectory, structural deformation energy, and structural state stability domain, the system generates structural state evolution features, which enables continuous modeling and dynamic evolution analysis of the robot's operating state and improves the identification of progressive faults and complex fault modes. The evidence fusion module uses the Dempster-Shafer evidence theory algorithm to fuse multi-source evidence, namely the signature matching results, topological deviation, and structural state evolution features, to generate stable fault detection results. Together with the control execution module, which generates control commands to adjust the robot's operating state, this improves the reliability of fault detection results and reduces false detection and false negative rates.

Attached Figure Description

[0016] The accompanying drawings are provided to further illustrate the invention and form part of the specification. They are used together with the embodiments of the invention to explain the invention and do not limit it. In the drawings:
Figure 1 is a schematic diagram of the robot fault detection system based on AI vision proposed in this invention;
Figure 2 is a schematic diagram of the improved GroundingDINO model proposed in this invention;
Figure 3 is a data flow diagram of the robot fault detection system based on AI vision proposed in this invention.

Detailed Implementation

[0017] The present invention will now be described in further detail with reference to the accompanying drawings. These drawings are simplified schematic diagrams, illustrating only the basic structure of the invention, and therefore only show the components relevant to the invention.

[0018] Referring to Figures 1-3, a robot fault detection system based on AI vision comprises:
The vision acquisition module, used to acquire multi-view image sequences during robot operation and generate a standard visual sequence;
The feature generation module, used to extract structural region features, local appearance change features, and motion change features from the standard visual sequence, map them into a candidate set of structural components, a candidate set of anomaly representations, and a candidate set of operating states, respectively, and combine and encode them to generate a structural semantic description sequence;
The signature detection module, used to input the standard visual sequence and the structural semantic description sequence into the improved GroundingDINO model, perform anomaly imprinting signature processing through the signature addressing layer in the improved GroundingDINO model, generate signature matching results, filter to obtain cross-modal query vectors, and output target detection results;
The graph construction module, used to extract the spatial location, scale information, and semantic labels of the robot's key structural regions from the target detection results and construct a structural semantic graph;
The evolution modeling module, used to stack the structural semantic graphs generated frame by frame to construct a continuous time series; construct the structural motion trajectory based on changes in the spatial position of nodes in the continuous time series; calculate the structural deformation energy based on relative displacement changes between adjacent nodes; and construct the structural state stability domain based on the evolution trend of the structural motion trajectory, thereby generating structural state evolution features;
The evidence fusion module, used to construct a multi-source evidence set from the signature matching results, the topological deviation in the structural semantic graph, and the structural state evolution features, and to fuse the multi-source evidence with the Dempster-Shafer evidence theory algorithm to generate fault detection results;
The control execution module, used to output the fault detection results to the execution end of the detection system and, if the fault risk level in the fault detection results reaches the preset risk threshold, to generate control commands that adjust the robot's operating state.

[0019] In this embodiment, the vision acquisition module is specifically configured to:
Set up two acquisition viewpoints within the robot's operating area to acquire images of two of the following areas: the robot base area, link area, joint area, transmission area, and end effector area;
Continuously acquire the image data corresponding to each acquisition viewpoint at a uniform sampling frequency, and align the image data from different acquisition viewpoints at the frame level based on timestamps to obtain a multi-view synchronized image sequence;
Apply noise suppression, lens distortion correction, and brightness normalization, in that order, to each frame in the multi-view synchronized image sequence to obtain a preprocessed image sequence;
Calibrate the spatial correspondence of the preprocessed image sequence based on the intrinsic and extrinsic parameters of each acquisition viewpoint, establishing the structural region correspondence between different acquisition viewpoints;
Combine the calibrated preprocessed image sequences in chronological order to generate the standard visual sequence.

[0020] In this embodiment, the feature generation module is specifically configured to:
Read image frames in chronological order from the standard visual sequence, and perform region segmentation on each image frame to obtain structural region images corresponding to the robot base region, link region, joint region, transmission region, and end effector region;
Extract structural region features based on the contour boundaries, region area, aspect ratio, edge distribution, and texture distribution in the structural region images, and generate a candidate set of structural components according to the correspondence between the structural region features and preset structural categories;
Extract local appearance change features based on grayscale changes, texture changes, edge breakage changes, local brightness changes, and regional shape changes in corresponding regions of adjacent image frames in the standard visual sequence, and generate a candidate set of anomaly representations according to the correspondence between the local appearance change features and preset anomaly representation categories;
Extract motion change features based on the position changes, displacement direction changes, velocity changes, and trajectory continuity changes of corresponding regions in adjacent image frames in the standard visual sequence, and generate a candidate set of operating states according to the correspondence between the motion change features and preset operating state categories;
Associate the three candidate sets within the same image frame to form correspondences among structural component candidates, anomaly representation candidates, and operating state candidates;
Combine and encode the candidates in their arranged order to generate the structural semantic description sequence.

[0021] In this embodiment, the structural component candidates, anomaly representation candidates, and operating state candidates are combined and encoded in the order of arrangement to generate a structural semantic description sequence, specifically: The structural component candidates, anomaly representation candidates, and operating state candidates that are associated within the same image frame are sequentially arranged to obtain candidate arrangement groups. The candidates in each group are assembled in the following order: structural component candidates first, anomaly representation candidates in the middle, and operating state candidates last, to form a single-frame semantic unit. In the single-frame semantic unit, a structure category identifier is written for each structural component candidate, an anomaly category identifier is written for each anomaly representation candidate, and a state category identifier is written for each operating state candidate. The structure category identifier, anomaly category identifier, and state category identifier in a single-frame semantic unit are connected according to a preset separation rule to generate a single-frame description fragment; The single-frame description fragments are arranged according to the temporal order of the image frames in the standard visual sequence to form a description fragment sequence; adjacent single-frame description fragments in the description fragment sequence are combined sequentially to generate a structural semantic description sequence.
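
The encoding order described above (structure first, anomaly in the middle, state last, joined by a preset separation rule) can be illustrated with a short sketch. The separators `|` and `;` and the identifier strings such as `joint` or `edge_break` are hypothetical; the source does not specify the separation rule or the identifier vocabulary.

```python
def encode_fragment(structure_id, anomaly_id, state_id, sep="|"):
    """Join the three category identifiers of one frame in the fixed order."""
    return sep.join((structure_id, anomaly_id, state_id))


def build_description_sequence(frames, sep="|", frame_sep=";"):
    """Build a structural semantic description sequence.

    frames: iterable of (timestamp, structure_id, anomaly_id, state_id).
    Fragments are ordered by timestamp and then concatenated.
    """
    ordered = sorted(frames, key=lambda frame: frame[0])
    fragments = [encode_fragment(s, a, st, sep) for _, s, a, st in ordered]
    return frame_sep.join(fragments)


# Hypothetical per-frame candidates, deliberately given out of time order.
frames = [
    (2, "joint", "edge_break", "jitter"),
    (1, "base", "none", "steady"),
]
sequence = build_description_sequence(frames)
```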

[0022] In this embodiment, the improved GroundingDINO model specifically includes an image encoding layer, a text encoding layer, a feature enhancement layer, a signature lookup and addressing layer, and a cross-modal decoding layer; The image coding layer receives image frames from a standard visual sequence, reads multi-view synchronized images in chronological order of the image frames, divides each image frame into image blocks of fixed size, writes the center coordinates, width, height and pixel value distribution of each image block to form initial visual features, and then extracts edge response, texture response and region contour response from the initial visual features along three scales of low resolution, medium resolution and high resolution to form a visual feature sequence arranged by scale. The text encoding layer receives description fragments from the structural semantic description sequence, reads the structural category identifier, anomaly category identifier, and state category identifier in sequence according to the time position of the description fragment in the standard visual sequence, converts the structural category identifier, anomaly category identifier, and state category identifier into corresponding identifier vectors, and concatenates them in the order of structural category identifier first, anomaly category identifier in the middle, and state category identifier last to form a semantic feature sequence. The feature enhancement layer aligns the visual feature sequence and the semantic feature sequence according to their temporal positions. Within each temporal position, it maps the image patch features in the visual feature sequence to the label vectors in the semantic feature sequence one by one. For each corresponding unit, it writes the center coordinates, scale position, structure category label, anomaly category label, and state category label of the image patch. 
It also accumulates the correlation responses between the image patch features and the label vectors to obtain cross-modal fusion features. Each correlation response is obtained by multiplying the image patch feature and the label vector dimension by dimension and summing the products, that is, as their inner product. The signature lookup and addressing layer sequentially splits the structure category identifier, anomaly category identifier, and state category identifier in the description fragment according to their arrangement order. A single structure category identifier is recorded as a structural component atom, a single anomaly category identifier is recorded as an anomaly representation atom, and a single state category identifier is recorded as an operating state atom, forming, in sequence, a structural component atom sequence, an anomaly representation atom sequence, and an operating state atom sequence. The local region corresponding to the structural region image is read from the cross-modal fusion features. The local region is the image block at the center position of the structural region image within the standard visual sequence. This image block is expanded upward, downward, leftward, and rightward by a fixed number of adjacent image blocks to form a neighborhood set. The edge distribution, texture change, brightness change, and shape offset information of the image blocks in the neighborhood set are read one by one, from left to right and from top to bottom, and the value range of each quantity is divided into discrete intervals. A discrete tag is written for each interval, and the discrete tags corresponding to edge distribution, texture change, brightness change, and shape offset are arranged in the neighborhood traversal order to generate an anomaly imprint sequence. 
The structural component atom sequence, the anomaly representation atom sequence, and the operating state atom sequence are connected end to end, with structural component atoms first, anomaly representation atoms in the middle, and operating state atoms last, to form an atomic combination sequence. The atomic combination sequence is then compared item by item with the anomaly imprint sequence, and a matching value is written for each combination item. The matching value is the sum of the structure matching value, the anomaly matching value, and the state matching value, and is therefore an integer from 0 to 3. The structure matching value is 1 when the structure category identifier matches the local region location category, and 0 otherwise. The anomaly matching value is 1 when the anomaly category identifier matches the discrete tag correspondence, and 0 otherwise. The state matching value is 1 when the state category identifier matches the continuous change direction of the local region, and 0 otherwise. The matching values corresponding to all combination items are arranged in order to form a matching result sequence. The positions of combination items whose matching values are greater than a preset matching threshold are read from the matching result sequence, the local region features at the corresponding positions are extracted from the cross-modal fusion features, and these features are arranged in temporal and spatial order to generate a cross-modal query vector; The cross-modal decoding layer receives the cross-modal query vector and the cross-modal fusion features. It maps each vector unit in the cross-modal query vector to a local region feature in the cross-modal fusion features. For each mapping result, it writes the structural region location, structural region scale, and semantic label. 
The structural region location is determined by the local region center coordinates, region width, and region height. The structural region scale is determined by the scale location. The semantic label is determined by the structural category identifier, anomaly category identifier, and state category identifier corresponding to the combination with the largest matching value. The results are arranged in chronological order to form the target detection results.
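
The correlation response and the matching-value rule in this paragraph reduce to simple arithmetic, sketched below. This is an illustrative sketch rather than the model's actual layers: the class `CombinationItem`, the function names, and the threshold value used in the example are assumptions, and the three boolean flags stand in for comparisons whose details the description leaves to the embodiment.

```python
from dataclasses import dataclass


def correlation_response(patch_feature, label_vector):
    """Sum of dimension-wise products of a patch feature and a label vector."""
    if len(patch_feature) != len(label_vector):
        raise ValueError("feature and label vector must have the same dimension")
    return sum(p * q for p, q in zip(patch_feature, label_vector))


@dataclass
class CombinationItem:
    structure_match: bool  # structure identifier agrees with the region location category
    anomaly_match: bool    # anomaly identifier agrees with the discrete tag
    state_match: bool      # state identifier agrees with the continuous change direction


def matching_value(item):
    """Structure + anomaly + state matching values, each 0 or 1."""
    return int(item.structure_match) + int(item.anomaly_match) + int(item.state_match)


def select_query_positions(items, threshold):
    """Return the matching result sequence and the positions above the threshold."""
    values = [matching_value(item) for item in items]
    positions = [i for i, v in enumerate(values) if v > threshold]
    return values, positions


items = [
    CombinationItem(True, True, True),
    CombinationItem(True, False, False),
    CombinationItem(True, True, False),
]
values, positions = select_query_positions(items, threshold=1)
```

With a threshold of 1, only items that agree on at least two of the three criteria contribute local region features to the cross-modal query vector.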

[0023] In this embodiment, both the improved GroundingDINO model and the original GroundingDINO model utilize a dual-encoding structure of image encoding and text encoding layers to jointly express visual and semantic features. Both models perform cross-modal feature interaction through a feature enhancement layer to generate fused features, and output target detection results based on the query vector in the cross-modal decoding layer. The improved GroundingDINO model introduces a signature lookup and addressing layer after the feature enhancement layer to sequentially split the structural category identifier, anomaly category identifier, and state category identifier in the structural semantic description sequence, constructing structural component atom sequences, anomaly representation atom sequences, and operating state atom sequences, thus refining the semantic expression into discrete semantic units that can be matched item by item. In the cross-modal fusion features, the improved GroundingDINO model performs neighborhood traversal according to the structural region location, extracting edge distribution, texture changes, brightness changes, and shape offsets in the local region item by item, and generating anomaly imprint sequences through discrete interval division, transforming visual information into discrete units with spatial order. The improved GroundingDINO model then performs item-by-item matching between atomic combination sequences and anomaly imprint sequences. A matching result sequence is formed by accumulating structure matching values, anomaly matching values, and state matching values. High-matching positions are selected from this sequence to generate cross-modal query vectors, transforming the query vector generation process from semantic guidance alone to a dual constraint of semantics and visual imprinting. 
The improved GroundingDINO model couples atomic-level semantic decomposition with anomaly imprint sequences, transforming the cross-modal alignment process from overall feature correlation to fine-grained matching of local semantic units and local visual imprints, reducing semantic interference in complex backgrounds. The improved GroundingDINO model establishes a consistent matching relationship between the structural semantic description sequence and the visual sequence, enabling the detection results to have a unified expression capability for structural component localization, anomaly characterization recognition, and operational state determination. Through these improvements, the target detection results achieve higher accuracy in locating structural regions under complex conditions, enhanced consistency in anomaly characterization recognition, and improved stability in response to changes in operational state.

[0024] In this embodiment, the graph construction module specifically comprises: Obtain the location, scale, and semantic labels of structural regions from the target detection results, and arrange them according to the time order of the image frames; Based on the location and semantic labels of structural regions in the same image frame, determine the structural component nodes corresponding to the robot base region, link region, joint region, transmission region, and end effector region; write node identifier, semantic label, center coordinates, region width, and region height for each structural component node to form a node attribute set; Based on the spatial distance, relative orientation, and regional adjacency between nodes of different structural components in the same image frame, structural connection relationships and spatial topological relationships are determined; connecting edges are established for two structural component nodes that satisfy structural connection relationships, and topological edges are established for two structural component nodes that satisfy spatial topological relationships; for each edge, edge type, start node identifier, end node identifier, distance between nodes, and relative orientation are written to form an edge attribute set; The set of node attributes and the set of edge attributes in the same image frame are combined to generate a single-frame structural semantic graph; the single-frame structural semantic graphs are arranged according to the time order of image frames in the standard visual sequence to form a frame-by-frame generated structural semantic graph.
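
A single-frame structural semantic graph of the kind described above can be sketched with plain dictionaries. The sketch is an assumption-laden illustration: the function name `build_frame_graph`, the dictionary layout, and the use of a single distance threshold (`max_distance`) as the edge criterion are hypothetical, since the description does not give the exact rule for deciding which nodes are connected.

```python
import math
from itertools import combinations


def build_frame_graph(nodes, max_distance):
    """Build one frame's node attribute set and edge attribute set.

    nodes: dict mapping node identifier -> dict with keys
           'label', 'cx', 'cy', 'w', 'h' (semantic label, center, size).
    An edge is created when two node centers are within max_distance
    (an assumed criterion). Each edge stores its type, start node, end node,
    distance between nodes, and relative orientation in degrees.
    """
    edges = []
    for start, end in combinations(sorted(nodes), 2):
        dx = nodes[end]["cx"] - nodes[start]["cx"]
        dy = nodes[end]["cy"] - nodes[start]["cy"]
        distance = math.hypot(dx, dy)
        if distance <= max_distance:
            edges.append({
                "type": "connection",
                "start": start,
                "end": end,
                "distance": distance,
                "orientation": math.degrees(math.atan2(dy, dx)),
            })
    return {"nodes": nodes, "edges": edges}


# Hypothetical detections for one image frame (pixel coordinates).
frame_nodes = {
    "base": {"label": "robot base", "cx": 0, "cy": 0, "w": 40, "h": 30},
    "joint": {"label": "joint", "cx": 3, "cy": 4, "w": 10, "h": 10},
    "effector": {"label": "end effector", "cx": 100, "cy": 100, "w": 12, "h": 12},
}
graph = build_frame_graph(frame_nodes, max_distance=10)
```

Stacking such graphs by frame order gives the frame-by-frame structural semantic graph from which node and edge time series can later be read.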

[0025] In this embodiment, the structural semantic graphs generated frame by frame are stacked to construct a continuous time series, specifically: Obtain the structural semantic graph generated frame by frame in chronological order of image frames, and extract the set of node attributes and the set of edge attributes from the structural semantic graph of each frame; Node attribute sets in adjacent image frames are mapped to each other according to node identifiers and semantic tags to form node association results; edge attribute sets in adjacent image frames are mapped to each other according to start node identifiers, end node identifiers and edge types to form edge association results. According to the chronological order of image frames, the corresponding node attribute sets and edge attribute sets between frames are sequentially written into the time axis positions to form a time-series arrangement result; the node association results and edge association results in the time-series arrangement result are stacked sequentially to form the node time-series set and edge time-series set in the continuous time series; the node time-series set and edge time-series set are synchronously combined to generate a continuous time series.

[0026] In this embodiment, the structural motion trajectory is constructed based on the changes in the spatial positions of nodes in a continuous time series, the structural deformation energy is calculated based on the relative displacement changes between adjacent nodes, and a structural state stability domain is constructed based on the evolution trend of the structural motion trajectory, generating structural state evolution characteristics, specifically: Read the center coordinate sequence corresponding to the same node identifier from the node time sequence set. Each coordinate point in the center coordinate sequence includes horizontal coordinate value and vertical coordinate value. Connect the center coordinate points corresponding to adjacent time positions in chronological order. The connection steps include reading the center coordinate point of the previous time position, reading the center coordinate point of the next time position, establishing a trajectory connection segment between the center coordinate point of the previous time position and the center coordinate point of the next time position, and then arranging all trajectory connection segments in chronological order to form the structural motion trajectory. Read the trajectory connection segments corresponding to adjacent time positions from the structural motion trajectory. For the center coordinate points of each pair of adjacent time positions, calculate the difference in horizontal coordinates and the difference in vertical coordinates. Add the square of the difference in horizontal coordinates and the square of the difference in vertical coordinates and take the square root to obtain the displacement increment. Subtract the trajectory direction angle of the previous time position from the trajectory direction angle of the later time position to obtain the change in displacement direction. 
Take the absolute value of the angle between two adjacent trajectory connection segments as the trajectory deflection. Then, write the displacement increment, the change in displacement direction, and the trajectory deflection into the corresponding time positions in chronological order to form a trajectory change sequence. Read the distance between adjacent nodes and their relative orientation between adjacent time positions from the edge time sequence set. The distance between nodes is determined by the Euclidean distance between the center coordinates of the starting node and the center coordinates of the ending node. The relative orientation is determined by the direction angle from the center coordinates of the starting node to the center coordinates of the ending node. Subtract the distance between nodes at two adjacent time positions to obtain the change in distance between nodes. Subtract the relative orientation between two adjacent time positions to obtain the change in relative orientation. Then write the change in distance between nodes and the change in relative orientation to the corresponding time position in chronological order. The relative displacement change between adjacent nodes is calculated based on the change in distance between nodes and the change in relative orientation. The relative displacement change is determined by the product of the absolute value of the change in distance between nodes and the absolute value of the change in relative orientation. The relative displacement changes corresponding to each time position are then arranged in chronological order to form a deformation change sequence. The relative displacement changes in the deformation change sequence are accumulated according to the preset energy calculation rules. 
The preset energy calculation rules include squaring the relative displacement change at each time position, adding the squared results corresponding to all time positions in the continuous time sequence in turn, and then writing the accumulated result into the corresponding time position to form a structural deformation energy sequence. Each energy value in the structural deformation energy sequence reflects the structural change intensity at the corresponding time position. Based on the trajectory change sequence and the structural deformation energy sequence, the stable state interval and the deviation state interval corresponding to each time position are determined. This includes reading the displacement increment, displacement direction change, trajectory deflection, and structural deformation energy value corresponding to the same time position, comparing the displacement increment with a preset displacement threshold, comparing the displacement direction change with a preset direction threshold, comparing the trajectory deflection with a preset deflection threshold, and comparing the structural deformation energy value with a preset energy threshold. When the displacement increment is not greater than the preset displacement threshold, the displacement direction change is not greater than the preset direction threshold, the trajectory deflection is not greater than the preset deflection threshold, and the structural deformation energy value is not greater than the preset energy threshold, the corresponding time position is written into the stable state interval. When any comparison result exceeds the corresponding threshold, the corresponding time position is written into the deviation state interval. The structural state stability domain is formed by all stable state intervals and all deviation state intervals. 
The preset displacement threshold, preset direction threshold, preset deflection threshold, and preset energy threshold are obtained statistically based on historical operating data corresponding to the standard visual sequence. A stable operating interval is selected from the continuous time series of the historical operating data. The center coordinate sequence corresponding to the node time-series set within the stable operating interval is read, and the displacement increments are calculated in chronological order to obtain the displacement increment set. All values in the displacement increment set are sorted, and the value at the 90th percentile position after sorting is taken as the preset displacement threshold, which ranges from 0.5 pixels to 3 pixels. Based on the direction angle sequence corresponding to the structural motion trajectory within the stable operating interval, the direction angles of adjacent time positions are subtracted to obtain the set of displacement direction changes. All values in the set of displacement direction changes are sorted, and the value at the 90th percentile position after sorting is taken as the preset direction threshold, which ranges from 3 degrees to 15 degrees. All values in the trajectory deflection sequence within the stable operating interval are sorted, and the value at the 85th percentile position after sorting is taken as the preset deflection threshold, which ranges from 5 degrees to 25 degrees. All values in the structural deformation energy sequence within the stable operating interval are sorted, and the value at the 95th percentile position after sorting is taken as the preset energy threshold, which ranges from 10 to 200. The structural motion trajectory, trajectory change sequence, structural deformation energy sequence, and structural state stability domain are combined. 
This includes matching the trajectory connection segments in the structural motion trajectory, the displacement increment, displacement direction change, and trajectory deflection in the trajectory change sequence, the energy value in the structural deformation energy sequence, and the interval marker in the structural state stability domain one by one according to the time position. The corresponding results are then arranged in chronological order to form the structural state evolution characteristics. Each time position in the structural state evolution characteristics corresponds to a set of trajectory information, a set of change information, an energy value, and a stability domain marker.
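
The trajectory change sequence, the cumulative deformation energy, the percentile-based thresholds, and the stable/deviation decision can be sketched as below. Several details are assumptions made for the example and are not stated in the description: the nearest-rank percentile rule, wrapping angle differences into [-180, 180) degrees, a direction change of 0 at the first segment, and comparing the magnitude of the direction change with its threshold. All function names are hypothetical.

```python
import math


def nearest_rank_percentile(values, p):
    """Value at the p-th percentile position (integer p) of the sorted values."""
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # integer ceil(p * n / 100)
    return ordered[max(rank, 1) - 1]


def trajectory_changes(points):
    """Displacement increment, direction change, and deflection per segment."""
    segments = [
        (math.hypot(x1 - x0, y1 - y0), math.degrees(math.atan2(y1 - y0, x1 - x0)))
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    ]
    changes = []
    for k, (increment, angle) in enumerate(segments):
        raw = angle - segments[k - 1][1] if k > 0 else 0.0
        turn = (raw + 180.0) % 360.0 - 180.0  # assumed wrap into [-180, 180)
        changes.append({"increment": increment,
                        "direction_change": turn,
                        "deflection": abs(turn)})
    return changes


def deformation_energy(distances, orientations):
    """Running sum of squared relative displacement changes for one edge.

    Relative displacement change = |change in distance| * |change in orientation|.
    """
    energy, total = [], 0.0
    for k in range(1, len(distances)):
        rel = abs(distances[k] - distances[k - 1]) * abs(orientations[k] - orientations[k - 1])
        total += rel ** 2
        energy.append(total)
    return energy


def is_stable(change, energy_value, thresholds):
    """True when no quantity exceeds its preset threshold."""
    return (change["increment"] <= thresholds["displacement"]
            and abs(change["direction_change"]) <= thresholds["direction"]
            and change["deflection"] <= thresholds["deflection"]
            and energy_value <= thresholds["energy"])


changes = trajectory_changes([(0, 0), (3, 4), (3, 9)])
energy = deformation_energy([10, 12, 12], [0, 3, 5])
```

Because the energy is accumulated over the whole sequence, it never decreases; a deployment would presumably reset or window it, which the description does not specify.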

[0027] In this embodiment, the evidence fusion module specifically comprises: The signature matching results, the topological deviation in the structural semantic graph, and the structural state evolution features are obtained, and a one-to-one correspondence is established according to the time position of the image frame. Data with the same time position are written into the same evidence unit to form a sequence of evidence units arranged in chronological order. The matching value corresponding to each structural component is read from the matching result sequence in the signature matching results. The anomaly representation matching value and the operating state matching value at the same time position are extracted and arranged in the order of the structural components to form the first evidence sequence. The anomaly representation matching value is taken from the matching value corresponding to the anomaly category identifier, and the operating state matching value is taken from the matching value corresponding to the state category identifier. The node attribute set and edge attribute set are read from the structural semantic graph. The center coordinates, region width, region height, and semantic label corresponding to the same node identifier are compared between adjacent image frames. The distance between nodes, relative orientation, and edge type corresponding to the same edge identifier are compared between adjacent image frames. The change values of the structural connection relationships and the change values of the spatial topological relationships are written in chronological order to form the second evidence sequence. The topological deviation is determined by the change in distance between nodes and the change in relative orientation. The change in distance between nodes is the absolute value of the difference in distance between nodes corresponding to the same edge in adjacent image frames. 
The change in relative orientation is the absolute value of the difference in relative orientation corresponding to the same edge in adjacent image frames. The topological deviation is the sum of the change in distance between nodes and the change in relative orientation. The trajectory change sequence, structural deformation energy sequence, and structural state stability domain are read from the structural state evolution characteristics. At each time position, the trajectory offset value, deformation energy value, and stability domain deviation value are extracted. The trajectory offset value is the combination of the displacement increment, displacement direction change, and trajectory deflection arranged in a preset order. The deformation energy value is the value of the structural deformation energy sequence at the corresponding time position. The stability domain deviation value is 0 in the stable state interval and 1 in the deviation state interval. The trajectory offset value, deformation energy value, and stability domain deviation value are arranged in chronological order to form the third evidence sequence. The first evidence sequence, the second evidence sequence, and the third evidence sequence are converted into a unified evidence representation form. The conversion steps include reading all evidence values within the same time position, determining the maximum and minimum values among all evidence values, subtracting the minimum value from the current evidence value, dividing the result of the subtraction by the difference between the maximum and minimum values to obtain the normalized evidence value, writing the normalized evidence value as 1 when the maximum and minimum values are equal, and arranging all normalized evidence values in chronological order to form a multi-source evidence set. 
Based on the structural component location, anomaly representation matching status, operating state matching status, topological deviation status, and stability domain deviation status corresponding to each evidence item in the multi-source evidence set, the basic probability allocation results corresponding to the fault category, fault location, and fault risk level are written respectively. In the basic probability allocation results, the probability value of a single evidence item for a certain fault category is the ratio of the normalized evidence value of the single evidence item to the sum of all normalized evidence values at the same time position. The probability value of a single evidence item for a certain fault location is the ratio of the normalized evidence value corresponding to the structural component location to the sum of all normalized evidence values at the same time position. The probability value of a single evidence item for a certain fault risk level is the ratio of the normalized evidence value corresponding to the risk level to the sum of all normalized evidence values for the risk level at the same time position. The basic probability allocation results are combined pairwise. This includes reading the probability values of the two evidence items to be combined for the same fault category, multiplying the probability value of the first evidence item by the probability value of the second evidence item to obtain the joint product value of that fault category, accumulating the joint product values of all fault categories to obtain the total joint value, then reading the product values of the two evidence items under the condition of inconsistent fault categories and accumulating them to obtain the conflict value, and dividing the joint product value of each fault category by the difference between 1 and the conflict value to obtain the combined fault category probability value. 
Fault location and fault risk level are combined in the same way to form the combined probability allocation result. The probability allocation results after combination are further fused sequentially according to time order. The sequential fusion steps include reading the probability allocation results after the previous round of combination and the basic probability allocation results corresponding to the next evidence item, repeating the steps of joint product value, conflict value and normalized division, until all evidence items in the same time position are fused to form a fused probability allocation result. Based on the probability values corresponding to fault category, fault location, and fault risk level in the fusion probability allocation results, the fault category probability values are arranged in descending order, and the fault category corresponding to the highest probability value is taken as the category output result. The fault location probability values are arranged in descending order, and the fault location corresponding to the highest probability value is taken as the location output result. The fault risk level probability values are arranged in descending order, and the fault risk level corresponding to the highest probability value is taken as the risk output result. The category output result, location output result, and risk output result are written into the fault detection result in the same time position order to form a fault detection result sequence arranged in the time order of image frames.
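
The normalization, basic probability allocation, and pairwise combination steps follow Dempster's rule restricted to single-category hypotheses, and can be sketched as follows. The function names and the fault category labels are hypothetical, and the sketch assumes every evidence item assigns probability only to individual categories (no mass on sets of categories), which is how the paragraph describes it.

```python
from functools import reduce


def min_max_normalize(values):
    """Min-max normalization; all values become 1 when max equals min."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def basic_probability(normalized_by_category):
    """Each category's share of the total normalized evidence."""
    total = sum(normalized_by_category.values())
    return {c: v / total for c, v in normalized_by_category.items()}


def combine(m1, m2):
    """Combine two allocations: joint product divided by (1 - conflict)."""
    categories = set(m1) | set(m2)
    joint = {c: m1.get(c, 0.0) * m2.get(c, 0.0) for c in categories}
    conflict = sum(p1 * p2
                   for c1, p1 in m1.items()
                   for c2, p2 in m2.items()
                   if c1 != c2)
    if conflict >= 1.0:
        raise ValueError("evidence items are in total conflict")
    return {c: v / (1.0 - conflict) for c, v in joint.items()}


def fuse(allocations):
    """Sequentially fuse all evidence items at one time position."""
    return reduce(combine, allocations)


# Hypothetical allocations from two evidence items over two fault categories.
m_a = {"joint_misalignment": 0.6, "local_wear": 0.4}
m_b = {"joint_misalignment": 0.7, "local_wear": 0.3}
fused = fuse([m_a, m_b])
category_output = max(fused, key=fused.get)
```

Here the conflict value is 0.6 × 0.3 + 0.4 × 0.7 = 0.46, so the joint products 0.42 and 0.12 are each divided by 0.54, and the fused probabilities again sum to 1.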

[0028] In this embodiment, the control execution module specifically comprises: The preset risk threshold is obtained by statistically analyzing historical operating data corresponding to the structural state evolution characteristics. Stable operating intervals and fault occurrence intervals are selected from the historical operating data. The structural deformation energy value, displacement increment, and topological deviation at the corresponding time positions are read and arranged in chronological order to form stable data sets and fault data sets, respectively. The values in the stable and fault data sets are weighted and summed with weights of 0.5, 0.3, and 0.2, respectively, to obtain stable risk value sequences and fault risk value sequences. The stable risk value sequences are sorted from smallest to largest, and the values corresponding to the 95th percentile position are taken as the upper limit of stability. The fault risk value sequences are sorted from smallest to largest, and the values corresponding to the 5th percentile position are taken as the lower limit of fault. The upper limit of stability and the lower limit of fault are added together and divided by 2 to obtain the preset risk threshold. Based on the comparison results between the fault risk level and the preset risk threshold, a risk determination result sequence is generated; control trigger identifiers are written into the time positions in the risk determination result sequence that meet the fault risk level reaching the preset risk threshold, forming a trigger position sequence; Based on the fault category and fault location corresponding to the trigger position sequence, a control command sequence is generated, which includes deceleration command, stop command, and path adjustment command. The command types in the control command sequence are assigned according to the fault category and fault location to obtain the execution command sequence. 
The execution command sequence is output to the execution end of the detection system in chronological order. Based on the command reception status and robot running status update results returned by the execution end, the execution command sequence is recorded to form the control output result.
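
The derivation of the preset risk threshold and the trigger positions can be sketched as below. This is an illustrative sketch under stated assumptions: the nearest-rank percentile rule is assumed, the three inputs are assumed to be on comparable scales before weighting (the description does not say how they are scaled), and all function names and sample values are hypothetical. The weights 0.5, 0.3, and 0.2 are taken from the paragraph above in the order deformation energy, displacement increment, topological deviation.

```python
WEIGHTS = (0.5, 0.3, 0.2)  # deformation energy, displacement increment, topological deviation


def risk_value(energy, increment, topological_deviation, weights=WEIGHTS):
    """Weighted sum of the three quantities at one time position."""
    w_e, w_d, w_t = weights
    return w_e * energy + w_d * increment + w_t * topological_deviation


def nearest_rank_percentile(values, p):
    """Value at the p-th percentile position (integer p) after ascending sort."""
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # integer ceil(p * n / 100)
    return ordered[max(rank, 1) - 1]


def midpoint_threshold(stability_upper, fault_lower):
    """Average of the upper limit of stability and the lower limit of fault."""
    return (stability_upper + fault_lower) / 2


def preset_risk_threshold(stable_samples, fault_samples):
    """Threshold from (energy, increment, deviation) tuples of both data sets."""
    stable = [risk_value(*sample) for sample in stable_samples]
    fault = [risk_value(*sample) for sample in fault_samples]
    return midpoint_threshold(nearest_rank_percentile(stable, 95),
                              nearest_rank_percentile(fault, 5))


def trigger_positions(risk_sequence, threshold):
    """Time positions whose risk reaches the preset risk threshold."""
    return [t for t, r in enumerate(risk_sequence) if r >= threshold]


# Hypothetical, already-scaled samples.
threshold = preset_risk_threshold(
    stable_samples=[(0.2, 0.2, 0.2), (0.4, 0.4, 0.4)],
    fault_samples=[(0.8, 0.8, 0.8), (1.0, 1.0, 1.0)],
)
triggers = trigger_positions([0.5, 0.75, 0.9], 0.75)
```

Placing the threshold midway between the two limits leaves an equal margin to the stable and fault populations, which is a simple choice when the two risk distributions do not overlap.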

[0029] Example 1: To verify the feasibility of this invention in practice, it was applied to a fault detection scenario for a 6-axis robot in a new energy vehicle parts assembly workshop. Two acquisition perspectives were set up in the workshop, covering the robot base area, joint area, transmission area, and end effector area, respectively. The sampling frequency was set to 25 frames / second, and production images were continuously acquired for 8 hours. The detection targets included four types of faults: joint misalignment, loose transmission parts, abnormal trajectory of the end effector area, and local wear. Image data of the robot under normal operating conditions were also acquired as baseline data. Traditional systems in this type of scenario are easily affected by oil stain reflection, tooling obstruction, and mechanical vibration, resulting in problems such as unstable structural area positioning, difficulty in accurately attributing abnormal characteristics, and lag in the identification of progressive faults. This invention generates a standard visual sequence through a visual acquisition module, forms a structural semantic description sequence through a feature generation module, and then outputs the target detection results by combining the signature addressing layer in the improved GroundingDINO model. Subsequently, a structural semantic graph, continuous time series, and structural state evolution features are constructed. Finally, the fault detection results are obtained through an evidence fusion module, and the control execution module outputs deceleration commands, path adjustment commands, or stop commands to the execution end.

[0030] In this embodiment, a preset risk threshold is first determined using stable operation data for 7 consecutive days. The upper limit of the risk value in the stable interval is statistically determined to be 0.68, and the lower limit of the risk value in the fault interval is 0.82. The average of these two values yields the preset risk threshold of 0.75. Subsequently, online detection is performed on 30 consecutive shifts, recording a total of 4280 valid detection samples, including 3520 normal samples and 760 fault samples. The detection system can output the risk level earlier in scenarios involving joint misalignment and trajectory abnormalities, providing an early warning approximately 8 to 19 seconds before a significant shutdown occurs due to a fault. This allows the execution end to decelerate or correct the path in advance, reducing cycle time loss caused by hard shutdowns.

[0031] Table 1 Comparison of Robot Fault Detection Performance

[0032] As shown in Table 1, the system of this invention significantly outperforms manual inspection and traditional systems in terms of structural region positioning accuracy and fault identification accuracy. The structural region positioning accuracy reaches 95.8%, an improvement of 7.2 percentage points compared to traditional systems, and the fault identification accuracy reaches 94.7%, an improvement of 8.4 percentage points compared to traditional systems. Simultaneously, the false detection rate decreases to 2.4%, and the false negative rate decreases to 3.1%, indicating that the joint matching of structural category identifiers, anomaly category identifiers, and status category identifiers by the signature-addressing layer can effectively suppress background interference. In continuous occlusion scenarios, the structural region positioning accuracy remains at 93.9%, demonstrating that the structural semantic graph and continuous time series have good compensation capabilities for locally missing information. Although the single-frame processing latency increases by 7 milliseconds compared to traditional systems, the average early warning lead time increases to 12.8 seconds, indicating that this invention, while meeting the real-time detection needs of the workshop, is more suitable for the early identification of progressive faults.

[0033] Table 2 Comparison of Robot Fault Handling Effects

[0034] As shown in Table 2, the system of this invention significantly reduced unplanned downtime and fault escalation during the handling phase. Over the 30 shifts, the traditional system triggered 27 shutdowns with 181 minutes of unplanned downtime, while the system of this invention triggered only 12 shutdowns with 83 minutes of unplanned downtime, a decrease of 54.1%. At the same time, the system of this invention issued significantly more deceleration and path adjustment commands, indicating that the control execution module can output more appropriate control commands in a timely manner before and after the fault risk level reaches the preset risk threshold, preventing rapid fault deterioration. The fault escalation rate decreased from 20.4% to 7.1%, and the shift output loss decreased from 5.2% to 2.1%, demonstrating that the evidence fusion module, by combining topological deviation and structural state evolution characteristics, not only improved the reliability of the fault detection results but also enhanced the effectiveness of robot operating state adjustment.

[0035] The above description is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitutions or modifications made by those skilled in the art within the scope of the technology disclosed in the present invention, based on the technical solution and inventive concept of the present invention, should be covered within the scope of protection of the present invention.

Claims

1. A robot fault detection system based on AI vision, characterized in that it comprises: a visual acquisition module, used to acquire multi-view image sequences during robot operation and generate a standard visual sequence; a feature generation module, used to extract structural region features, local appearance change features, and motion change features from the standard visual sequence, map them into a candidate set of structural components, a candidate set of anomaly representations, and a candidate set of operating states, respectively, and combine and encode them to generate a structural semantic description sequence; a signature detection module, used to input the standard visual sequence and the structural semantic description sequence into the improved GroundingDINO model, perform abnormal imprint signature processing through the signature addressing layer in the improved GroundingDINO model, generate signature matching results, filter to obtain cross-modal query vectors, and output target detection results; a graph construction module, used to extract the spatial location, scale information, and semantic labels of key structural regions of the robot from the target detection results and construct a structural semantic graph; an evolutionary modeling module, used to stack the structural semantic graphs generated frame by frame to construct a continuous time series, construct the structural motion trajectory based on the changes in the spatial positions of nodes in the continuous time series, calculate the structural deformation energy based on the relative displacement changes between adjacent nodes, and construct the structural state stability domain based on the evolution trend of the structural motion trajectory, to generate structural state evolution characteristics; an evidence fusion module, used to construct a multi-source evidence set based on the signature matching results, the topological deviation in the structural semantic graph, and the structural state evolution characteristics, and to perform multi-source evidence fusion using the Dempster-Shafer evidence theory algorithm to generate fault detection results; and a control execution module, used to output the fault detection results to the execution end of the detection system and, if the fault risk level in the fault detection results reaches the preset risk threshold, generate a control command to adjust the robot's operating state.

2. The robot fault detection system based on AI vision according to claim 1, characterized in that the visual acquisition module is specifically configured as follows: two acquisition perspectives are set up within the robot's operating area to acquire images of the robot's base area, link area, joint area, transmission area, and end effector area; image data corresponding to each acquisition perspective is continuously acquired at a uniform sampling frequency, and the image data from different acquisition perspectives are aligned at the frame level based on timestamps to obtain a multi-view synchronized image sequence; each frame in the multi-view synchronized image sequence is sequentially subjected to noise suppression, lens distortion correction, and brightness normalization to obtain a preprocessed image sequence; based on the intrinsic and extrinsic parameters corresponding to each acquisition perspective in the multi-view synchronized image sequence, the spatial correspondence of the preprocessed image sequence is calibrated, and the structural region correspondence between different acquisition perspectives is established; and the preprocessed image sequences that have completed spatial correspondence calibration are combined in chronological order to generate the standard visual sequence.
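The timestamp-based frame-level alignment in claim 2 can be illustrated with a nearest-timestamp pairing. This is a sketch under assumptions: the claim does not specify the matching rule, so the function name `align_by_timestamp` and the `tolerance` parameter are hypothetical.

```python
from bisect import bisect_left

def align_by_timestamp(ts_a, ts_b, tolerance):
    """Pair each frame of view A with the nearest-in-time frame of view B.

    ts_a, ts_b: sorted lists of frame timestamps in seconds.
    Returns (index_a, index_b) pairs; frames of A with no partner
    within `tolerance` seconds are dropped.
    """
    pairs = []
    for i, t in enumerate(ts_a):
        j = bisect_left(ts_b, t)
        # Only the neighbours on either side of the insertion point can be nearest.
        candidates = [k for k in (j - 1, j) if 0 <= k < len(ts_b)]
        if not candidates:
            continue
        best = min(candidates, key=lambda k: abs(ts_b[k] - t))
        if abs(ts_b[best] - t) <= tolerance:
            pairs.append((i, best))
    return pairs
```

At the 25 frames/second used in Example 1 the frame period is 0.04 s, so a tolerance of about half a period (0.02 s) would be one plausible choice. This sketch does not prevent one frame of view B from being paired with two frames of view A.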

3. The robot fault detection system based on AI vision according to claim 1, characterized in that, The feature generation module is specifically as follows: Read image frames arranged in chronological order from a standard visual sequence, and perform region segmentation on each image frame to obtain structural region images corresponding to the robot base region, link region, joint region, transmission region and end effector region; Structural region features are extracted based on the contour boundaries, region area, aspect ratio, edge distribution, and texture distribution in the structural region image, and a candidate set of structural components is generated according to the correspondence between the structural region features and the preset structural categories. Local appearance change features are extracted based on grayscale changes, texture changes, edge breakage changes, local brightness changes, and regional shape changes in the corresponding regions of adjacent image frames in the standard visual sequence. An anomaly representation candidate set is then generated according to the correspondence between the local appearance change features and the preset anomaly representation categories. Motion change features are extracted based on the position changes, displacement direction changes, velocity changes, and trajectory continuity changes of corresponding regions in adjacent image frames in the standard visual sequence, and a candidate set of operating states is generated according to the correspondence between motion change features and preset operating state categories. The candidate sets of structural components, anomaly representations, and operating states are associated within the same frame to form a correspondence between the candidates for structural components, anomaly representations, and operating states. 
The structural semantic description sequence is generated by combining and encoding the candidates for structural components, anomaly representations, and operating states in the order they are arranged.

4. The robot fault detection system based on AI vision according to claim 3, characterized in that generating the structural semantic description sequence by combining and encoding the structural component candidates, anomaly representation candidates, and operating state candidates in the order of their arrangement specifically comprises: the structural component candidates, anomaly representation candidates, and operating state candidates that are associated within the same image frame are arranged sequentially to obtain candidate arrangement groups; the candidates in each candidate arrangement group are assembled with the structural component candidates first, the anomaly representation candidates in the middle, and the operating state candidates last, to form a single-frame semantic unit; in the single-frame semantic unit, a structural category identifier is written for the structural component candidates, an anomaly category identifier for the anomaly representation candidates, and a state category identifier for the operating state candidates; the structural category identifier, anomaly category identifier, and state category identifier in the single-frame semantic unit are connected according to a preset separation rule to generate a single-frame description fragment; the single-frame description fragments are arranged according to the chronological order of the image frames in the standard visual sequence to form a description fragment sequence; and adjacent single-frame description fragments in the description fragment sequence are combined sequentially to generate the structural semantic description sequence.
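The encoding order in claim 4 can be sketched as below. The separators `|` and ` ; ` are assumptions standing in for the unspecified preset separation rule, and the class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class FrameCandidates:
    structure: str  # structural category identifier, e.g. "joint"
    anomaly: str    # anomaly category identifier, e.g. "misalignment"
    state: str      # state category identifier, e.g. "running"

def encode_fragment(c, sep="|"):
    # Structural component first, anomaly representation in the middle,
    # operating state last.
    return sep.join((c.structure, c.anomaly, c.state))

def encode_sequence(frames, frame_sep=" ; "):
    # `frames` is assumed to be in image-frame time order already.
    return frame_sep.join(encode_fragment(c) for c in frames)
```

For two frames this yields a string such as `joint|misalignment|running ; transmission|loose|running`, which a text encoder could then consume.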

5. The robot fault detection system based on AI vision according to claim 1, characterized in that the improved GroundingDINO model specifically includes an image encoding layer, a text encoding layer, a feature enhancement layer, a signature addressing layer, and a cross-modal decoding layer; the image encoding layer performs multi-scale feature extraction on the image frames in the standard visual sequence to generate a visual feature sequence; the text encoding layer vectorizes the description fragments in the structural semantic description sequence to generate a semantic feature sequence; the feature enhancement layer receives the visual feature sequence and the semantic feature sequence and performs cross-modal feature interaction to generate cross-modal fusion features; the signature addressing layer sequentially splits the structural category identifier, anomaly category identifier, and state category identifier in the structural semantic description sequence to generate a structural component atomic sequence, an anomaly representation atomic sequence, and an operating state atomic sequence; performs neighborhood traversal on the local regions corresponding to the structural regions in the cross-modal fusion features to extract edge distribution, texture change, brightness change, and shape offset information, and generates an anomaly imprint sequence according to discrete coding rules; combines the structural component atomic sequence, anomaly representation atomic sequence, and operating state atomic sequence in a predetermined order to form an atomic combination sequence; matches the atomic combination sequence item by item with the anomaly imprint sequence to generate a matching result sequence; and, based on the matching result sequence, filters the corresponding cross-modal feature positions to generate a cross-modal query vector; the cross-modal decoding layer performs decoding based on the cross-modal query vector and the cross-modal fusion features, and outputs the target detection results, which include the structural region location, structural region scale, and semantic labels.

6. The robot fault detection system based on AI vision according to claim 1, characterized in that, The graph construction module is specifically as follows: Obtain the location, scale, and semantic labels of structural regions from the target detection results, and arrange them according to the time order of the image frames; Based on the location and semantic labels of structural regions in the same image frame, determine the structural component nodes corresponding to the robot base region, link region, joint region, transmission region and end effector region; For each structural component node, write the node identifier, semantic tag, center coordinates, region width, and region height to form a node attribute set; Based on the spatial distance, relative orientation, and regional adjacency between nodes of different structural components in the same image frame, determine the structural connection relationship and spatial topological relationship; Establish connecting edges between two structural component nodes that satisfy the structural connection relationship, and establish topological edges between two structural component nodes that satisfy the spatial topological relationship; Write the edge type, start node identifier, end node identifier, distance between nodes, and relative orientation for each edge to form an edge attribute set; The set of node attributes and the set of edge attributes in the same image frame are combined to generate a single-frame structural semantic graph; the single-frame structural semantic graphs are arranged according to the time order of image frames in the standard visual sequence to form a frame-by-frame generated structural semantic graph.
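A single-frame structural semantic graph as described in claim 6 might be assembled as follows. This is only a sketch: the claim does not define the two relationship tests, so structural connections are assumed to come from a known list of node pairs and the spatial topological relationship is assumed to be a simple distance cutoff (`topo_radius`). All names are hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class Node:
    node_id: str
    label: str      # semantic label
    cx: float       # center coordinates
    cy: float
    width: float    # region width and height
    height: float

def build_frame_graph(nodes, connected_pairs, topo_radius):
    by_id = {n.node_id: n for n in nodes}

    def edge(kind, a, b):
        dx, dy = b.cx - a.cx, b.cy - a.cy
        return {"type": kind, "start": a.node_id, "end": b.node_id,
                "distance": math.hypot(dx, dy),
                "orientation_deg": math.degrees(math.atan2(dy, dx))}

    # Connecting edges: pairs assumed known from the robot's structure.
    edges = [edge("connection", by_id[s], by_id[e]) for s, e in connected_pairs]
    # Topological edges: any two nodes closer than the assumed cutoff.
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            if math.hypot(b.cx - a.cx, b.cy - a.cy) <= topo_radius:
                edges.append(edge("topology", a, b))
    return {"nodes": nodes, "edges": edges}
```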

7. The robot fault detection system based on AI vision according to claim 1, characterized in that stacking the structural semantic graphs generated frame by frame to construct a continuous time series specifically comprises: the structural semantic graphs generated frame by frame are obtained in the chronological order of the image frames, and the node attribute set and the edge attribute set are extracted from the structural semantic graph of each frame; the node attribute sets in adjacent image frames are mapped between frames according to node identifiers and semantic labels to form node association results; the edge attribute sets in adjacent image frames are mapped between frames according to the start node identifier, end node identifier, and edge type to form edge association results; according to the time order of the image frames, the node attribute set and edge attribute set of each frame are written into the corresponding time axis positions in turn to form a time series arrangement result; the node association results and edge association results in the time series arrangement result are stacked sequentially to form the node time series set and the edge time series set; and the node time series set and the edge time series set are synchronously combined to generate the continuous time series.
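The stacking step in claim 7 amounts to grouping per-frame attributes by a stable key. A minimal sketch, assuming nodes are keyed by (node identifier, semantic label) and edges by (start node, end node, edge type) as the claim describes; the function names and the dictionary layout are hypothetical.

```python
from collections import defaultdict

def stack_by_key(frames):
    """frames: list (in time order) of dicts mapping a key to its attributes.
    Returns key -> list of (frame_index, attributes)."""
    series = defaultdict(list)
    for t, frame in enumerate(frames):
        for key, attrs in frame.items():
            series[key].append((t, attrs))
    return dict(series)

def build_time_series(node_frames, edge_frames):
    # Node and edge series share the same frame index, which keeps them
    # synchronized on one time axis.
    return {"nodes": stack_by_key(node_frames),
            "edges": stack_by_key(edge_frames)}
```

A node or edge missing from a frame (for example under occlusion) simply has no entry at that frame index.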

8. The robot fault detection system based on AI vision according to claim 1, characterized in that constructing the structural motion trajectory based on the changes in the spatial positions of nodes in the continuous time series, calculating the structural deformation energy based on the relative displacement changes between adjacent nodes, and constructing the structural state stability domain based on the evolution trend of the structural motion trajectory to generate the structural state evolution characteristics specifically comprises: the center coordinate sequence corresponding to the same node identifier is read from the node time series set in the continuous time series, and the coordinate points in the center coordinate sequence are connected in chronological order to generate the structural motion trajectory; based on the difference in center coordinates between adjacent moments in the structural motion trajectory, the displacement increment, displacement direction change, and trajectory deflection are calculated to form a trajectory change sequence; the distance and relative orientation between adjacent nodes at adjacent moments are read from the edge time series set in the continuous time series, and the changes in inter-node distance and relative orientation are calculated; based on these changes, the relative displacement change between adjacent nodes is calculated to form a deformation change sequence; the relative displacement changes in the deformation change sequence are accumulated according to a preset energy calculation rule to generate the structural deformation energy sequence; based on the trajectory change sequence and the structural deformation energy sequence, the stable state interval and the deviation state interval corresponding to each time position in the continuous time series are determined to form the structural state stability domain; and the structural motion trajectory, trajectory change sequence, structural deformation energy sequence, and structural state stability domain are combined to generate the structural state evolution characteristics.
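The trajectory and deformation energy quantities in claim 8 can be sketched with NumPy. The preset energy calculation rule is not disclosed, so a cumulative sum of squared inter-node distance changes is assumed here purely for illustration; the function names are hypothetical and the stability domain step is not covered.

```python
import numpy as np

def trajectory_changes(centers):
    """centers: (T, 2) center coordinates of one node in time order.
    Returns displacement increments (T-1,) and direction changes (T-2,) in radians."""
    centers = np.asarray(centers, dtype=float)
    steps = np.diff(centers, axis=0)                 # per-step displacement vectors
    increments = np.linalg.norm(steps, axis=1)       # displacement increment
    headings = np.arctan2(steps[:, 1], steps[:, 0])  # displacement direction
    turn = np.diff(headings)
    turn = (turn + np.pi) % (2 * np.pi) - np.pi      # wrap into [-pi, pi)
    return increments, turn

def deformation_energy(distances):
    """distances: (T,) distance between two adjacent nodes over time.
    Assumed rule: running sum of squared distance changes."""
    change = np.diff(np.asarray(distances, dtype=float))
    return np.cumsum(change ** 2)
```

A node that moves one unit right and then one unit up gives increments of 1 and 1 and a single direction change of π/2.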

9. The robot fault detection system based on AI vision according to claim 1, characterized in that the evidence fusion module is specifically configured as follows: the signature matching results, the topological deviation in the structural semantic graph, and the structural state evolution characteristics are obtained and aligned according to the time position corresponding to the same image frame; based on the matching result sequence in the signature matching results, the anomaly representation matching value and the operating state matching value corresponding to each structural component are extracted to form a first evidence sequence; based on the node attribute set and the edge attribute set in the structural semantic graph, the deviation values of the structural connection relationships and spatial topological relationships between adjacent image frames are calculated to form a second evidence sequence; based on the trajectory change sequence, structural deformation energy sequence, and structural state stability domain in the structural state evolution characteristics, the trajectory offset value, deformation energy value, and stability domain deviation value at the corresponding time position are extracted to form a third evidence sequence; the first, second, and third evidence sequences are normalized according to a unified evidence representation format to generate the multi-source evidence set; based on the fault category, fault location, and fault risk level corresponding to the different evidence items in the multi-source evidence set, basic probability assignments are constructed; the basic probability assignments are combined pairwise according to the Dempster-Shafer evidence theory algorithm, and the combined results are fused sequentially to generate the fused probability assignment; and the fault detection results are determined by ranking the probability values of the fault categories, fault locations, and fault risk levels in the fused probability assignment.
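The pairwise combination step in claim 9 corresponds to Dempster's rule of combination. The sketch below shows the standard rule on mass functions represented as dictionaries from `frozenset` hypotheses to masses; how the patent derives the masses from its three evidence sequences is not specified, so the example masses in the usage note are made up.

```python
from itertools import product

def combine(m1, m2):
    """Dempster's rule for two mass functions (dict: frozenset -> mass)."""
    fused = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            fused[inter] = fused.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb  # mass assigned to contradictory hypotheses
    if conflict >= 1.0:
        raise ValueError("total conflict: evidence cannot be combined")
    # Renormalize by the non-conflicting mass.
    return {k: v / (1.0 - conflict) for k, v in fused.items()}

def fuse(masses):
    """Combine a list of mass functions sequentially."""
    result = masses[0]
    for m in masses[1:]:
        result = combine(result, m)
    return result
```

For instance, with hypotheses `F` (fault) and `N` (normal), two sources that assign 0.6 and 0.7 to `{F}` and the remainder to `{F, N}` fuse to a mass of 0.88 on `{F}` and 0.12 on `{F, N}`.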

10. The robot fault detection system based on AI vision according to claim 1, characterized in that the control execution module is specifically configured as follows: based on the results of comparing the fault risk level with the preset risk threshold, a risk determination result sequence is generated; control trigger identifiers are written into the time positions in the risk determination result sequence at which the fault risk level reaches the preset risk threshold, forming a trigger position sequence; based on the fault category and fault location corresponding to the trigger position sequence, a control command sequence is generated, which includes deceleration commands, stop commands, and path adjustment commands; the command types in the control command sequence are assigned according to the fault category and fault location to obtain the execution command sequence; the execution command sequence is output to the execution end of the detection system in chronological order; and, based on the command reception status and the robot running status update results returned by the execution end, the execution command sequence is recorded to form the control output result.
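The trigger and command assignment logic of claim 10 can be sketched as a threshold test followed by a lookup. The mapping from (fault category, fault location) to command type is not disclosed, so `COMMAND_TABLE` below is a hypothetical example, as are the function names and the default command.

```python
# Hypothetical assignment table; the actual mapping is not given in the source.
COMMAND_TABLE = {
    ("joint misalignment", "joint"): "decelerate",
    ("abnormal trajectory", "end effector"): "adjust_path",
    ("loose transmission parts", "transmission"): "stop",
}

def trigger_positions(risk_levels, threshold):
    # A time position triggers when its risk level reaches the threshold.
    return [t for t, r in enumerate(risk_levels) if r >= threshold]

def command_sequence(risk_levels, categories, locations, threshold,
                     default="decelerate"):
    """Return (time_position, command) pairs in chronological order."""
    return [(t, COMMAND_TABLE.get((categories[t], locations[t]), default))
            for t in trigger_positions(risk_levels, threshold)]
```

With the threshold of 0.75 from Example 1, risk levels of 0.5, 0.75, and 0.9 would trigger at the second and third time positions only.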