A large model association fault positioning method and system for heterogeneous training scenarios
By combining periodic polling and packet capture with a structural causal model, the method accurately locates faults in heterogeneous clusters, achieving low-latency, high-performance fault monitoring and diagnosis and improving the system's usability and reliability.
CN122220133A · Pending · Publication Date: 2026-06-16 · INESA (GRP) CO LTD
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Applications (China)
- Current Assignee / Owner: INESA (GRP) CO LTD
- Filing Date: 2026-03-11
- Publication Date: 2026-06-16
Abstract
The application relates to a method and system for locating large-model connection faults in heterogeneous training scenarios. The method comprises the following steps: periodically polling a cluster management platform to detect and identify new training tasks and their metadata while, in parallel, capturing the tasks' connection data packets and extracting key connection parameters; associating them to obtain an original data stream; processing the original data stream in real time and extracting features to generate a structured feature sequence; comparing the structured feature sequence with an expected norm to identify and locate fault nodes; classifying the fault nodes by their propagation characteristics and analyzing each class differently to obtain an analysis strategy decision for each connection fault; constructing a fault chain based on a specific node using a hierarchical aggregation strategy; and performing causal analysis on the fault chain with a structural causal model, according to the analysis strategy decisions, to obtain the root cause of each connection fault. Compared with the prior art, the application makes fault locating non-invasive, improves its accuracy, and reduces operating overhead.
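The last steps of the abstract (comparing features with an expected norm, then tracing a fault chain to its root) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the feature names, the threshold values, the node names, and the dependency graph are hypothetical, and the simple graph walk is only a stand-in for the structural causal model, which the abstract does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnFeatures:
    """Hypothetical per-node connection features for one time window."""
    node: str
    handshake_ms: float      # connection setup latency
    retransmit_rate: float   # fraction of retransmitted packets


# Assumed "expected norm": upper bounds per feature (illustrative values only).
EXPECTED_NORM = {"handshake_ms": 50.0, "retransmit_rate": 0.02}


def find_fault_nodes(features):
    """Return the set of nodes whose features exceed the expected norm."""
    faulty = set()
    for f in features:
        if (f.handshake_ms > EXPECTED_NORM["handshake_ms"]
                or f.retransmit_rate > EXPECTED_NORM["retransmit_rate"]):
            faulty.add(f.node)
    return faulty


def root_causes(faulty, parents):
    """Return faulty nodes that have no faulty upstream parent.

    `parents` maps each node to the nodes it depends on. This walk is a
    simplified stand-in for causal inference over a fault chain.
    """
    return sorted(
        n for n in faulty
        if not any(p in faulty for p in parents.get(n, ()))
    )


# Hypothetical cluster: three GPU nodes, two of them behind the same switch.
features = [
    ConnFeatures("switch-1", 120.0, 0.10),
    ConnFeatures("gpu-node-a", 95.0, 0.06),
    ConnFeatures("gpu-node-b", 80.0, 0.05),
    ConnFeatures("gpu-node-c", 12.0, 0.001),
]
parents = {
    "gpu-node-a": ["switch-1"],
    "gpu-node-b": ["switch-1"],
    "gpu-node-c": ["switch-2"],
}

faulty = find_fault_nodes(features)
print(root_causes(faulty, parents))  # ['switch-1']
```

In this toy case both affected GPU nodes sit downstream of the same faulty switch, so only the switch is reported as the root. A real structural causal model would estimate cause and effect from the observed data rather than rely on a fixed dependency map.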