An intelligent diagnostic method and device for abnormal subway door noises based on voiceprint recognition

By combining phase regularization modeling and an improved Jamba network, the problems of high false alarm rate and insufficient assessment of anomaly degree in subway door anomaly identification are solved, achieving high-precision abnormal noise identification and component localization, with strong noise resistance and quantifiable diagnostic results.

CN122290631A · Pending · Publication Date: 2026-06-26 · CHANGZHOU INST OF LIGHT IND TECH +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
CHANGZHOU INST OF LIGHT IND TECH
Filing Date
2026-03-24
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing technologies lack phase regularization mechanisms and cross-stage coupling structures in subway door anomaly detection, resulting in high false alarm and false negative rates, making it difficult to achieve stable detection in complex noise environments, and lacking quantitative assessment of the degree of anomaly.

Method used

The method combines phase regularization modeling, physical consistency residual construction, and a conditional-routing expert hybrid mechanism. Phase-labeled audio and state data are modeled in a structured way and passed through an improved Jamba network for stage embedding and coupled updates. The resulting embeddings are aligned and compared with a normal baseline embedding trajectory to generate an abnormal potential energy sequence, from which the presence, type, and severity level of the abnormal noise are output.

Benefits of technology

It achieves high-precision identification and component-level positioning of abnormal noises from subway doors, has strong noise resistance and can quantify the severity of anomalies, thus improving the interpretability and long-term operational adaptability of the system.



Abstract

This invention discloses an intelligent diagnostic method and device for abnormal subway car door noises based on voiceprint recognition, comprising the following steps: preprocessing the acquired data to generate phase-labeled audio sequences and phase-labeled state sequences; extracting voiceprint feature sequences and state feature sequences; constructing a phase coding sequence and performing length normalization and dynamic time warping to generate phase-normalized voiceprint sequences and phase-normalized state sequences; inputting these sequences into an improved Jamba network to perform phase state updates and stage coupling, then calculating dynamic consistency residuals and fusing them to obtain physically consistent embedding sequences; generating expert fusion representation sequences through a coupled gated expert hybrid layer; calculating abnormal potential energy sequences and triggering an abnormal branch to output the abnormal noise existence result, the abnormal noise type, the component location result, and the severity level; and updating or freezing the normal baseline embedding trajectory according to the severity level. This invention achieves accurate identification and stable graded diagnosis of abnormal subway car door noises.

Description

Technical Field

[0001] This invention relates to the field of track equipment monitoring technology, and in particular to an intelligent diagnostic method and device for abnormal noises from subway car doors based on voiceprint recognition.

Background Technology

[0002] As the intensity of urban rail transit operations continues to increase, subway doors, as high-frequency opening and closing components, are constantly exposed to vibration, impact, and friction environments, making them prone to abnormal phenomena such as roller wear, guide rail dry friction, and locking mechanism jamming. Existing door status monitoring technologies mainly rely on manual inspection, motor current threshold judgment, or single acoustic acquisition methods for anomaly identification. Diagnostic methods based on current signals focus on changes in drive load and are difficult to distinguish acoustic anomalies caused by different components. Analysis methods based on acoustic signals mostly use fixed spectral features or conventional deep learning models for overall classification, without performing structured segmentation modeling for operation stages such as opening, closing, and locking, resulting in prominent feature aliasing problems between different phases.

[0003] Existing technologies for processing vehicle door operation audio and status signals generally lack phase warping mechanisms and cross-stage coupling structures, failing to establish dynamic alignment relationships between stage embedded trajectories and normal baseline embedded trajectories, making it difficult to achieve stable recognition in complex noise environments. Existing model structures often use uniform parameters to process entire time-series data, failing to incorporate physical consistency information to construct dynamic consistency residual constraints, and lacking conditional routing mechanisms for expert-level allocation of different phases and anomaly types, resulting in high false alarm and false negative rates. For anomaly severity assessment, most methods rely solely on a single amplitude threshold, lacking joint analysis of anomaly duration and energy peak values, thus failing to generate graded diagnostic results.

[0004] Therefore, how to provide a method and device for intelligent diagnosis of abnormal noises from subway doors based on voiceprint recognition is a problem that urgently needs to be solved by those skilled in the art.

Summary of the Invention

[0005] One objective of this invention is to propose an intelligent diagnostic method and device for subway car door noise based on voiceprint recognition. This invention integrates phase regularization modeling, physical consistency residual construction and conditional routing expert hybrid mechanism to achieve structured modeling and component-level positioning diagnosis of the car door operation cycle. It has the advantages of high recognition accuracy, strong noise resistance and quantifiable severity assessment.

[0006] A method for intelligent diagnosis of abnormal noises from subway car doors based on voiceprint recognition, according to an embodiment of the present invention, includes the following steps:

[0007] The system collects audio data of subway door opening and closing operation and door status data, preprocesses them to generate phase-labeled audio sequences and phase-labeled status sequences; extracts voiceprint features based on the phase-labeled audio sequences and aggregates them by phase to generate voiceprint feature sequences; performs denoising, normalization and window statistics based on the phase-labeled status sequences to generate status feature sequences.

[0008] Based on the phase segmentation results, each frame is mapped to a phase number, and a phase coding sequence is constructed. The voiceprint feature sequence and the state feature sequence are then subjected to length normalization and dynamic time warping according to the phase coding sequence, generating a phase-normalized voiceprint sequence and a phase-normalized state sequence. The phase-normalized voiceprint sequence, the phase-normalized state sequence, and the phase coding sequence are input into an improved Jamba network, and phase state updates are performed in the segmented Mamba layer according to the phase coding sequence, yielding a stage embedding sequence. The stage embedding sequence is then input into a cross-stage Transformer layer to perform stage coupling updates. Based on the phase-normalized voiceprint sequence and the phase-normalized state sequence, a dynamic consistency residual sequence is calculated, and the dynamic consistency residual sequence and the stage coupling results are fused to obtain a physically consistent embedding sequence. The physically consistent embedding sequence, the phase coding sequence, the phase-normalized state sequence, and the dynamic consistency residual sequence are input into the coupled gated expert hybrid layer to perform conditional routing, expert activation, and weighted fusion, yielding an expert fusion representation sequence. Based on the expert fusion representation sequence, an abnormal potential energy sequence is calculated by alignment with the normal baseline embedding trajectory. When the abnormal potential energy sequence exceeds the threshold, an abnormal branch is triggered to calculate the abnormal enhanced representation sequence. The abnormal noise existence result, abnormal noise type, component location result, and severity level are output to generate the final subway door abnormal noise diagnosis result. The normal baseline embedding trajectory is updated or frozen according to the severity level.

[0009] Optionally, the preprocessing steps for generating phase-labeled audio sequences and phase-labeled state sequences include: performing timestamp correction and unified sampling frequency resampling on the acquired audio data and door state data to construct a synchronized original audio sequence and a synchronized original state sequence; performing first-order difference operations and sliding window smoothing on the synchronized original state sequence to calculate the state change rate sequence; determining the start time of the door opening stage, the uniform speed stage interval, and the lock-up stage start time based on the zero-crossing points, peak points, and stable intervals in the state change rate sequence to generate a phase boundary time index set; performing time slicing on the synchronized original audio sequence according to the phase boundary time index set to obtain door opening audio segments, uniform speed audio segments, and lock-up audio segments; assigning a corresponding phase number to each audio segment to construct a phase-labeled audio sequence; and performing time slicing on the synchronized original state sequence according to the phase boundary time index set to obtain door opening state segments, uniform speed state segments, and lock-up state segments; assigning a corresponding phase number to each state segment to construct a phase-labeled state sequence.
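The phase segmentation in the preceding paragraph can be illustrated with a minimal sketch. It assumes a single displacement-like state channel, and the function names, thresholds (`move_eps`, `steady_tol`), and phase coding (0 = idle, 1 = door opening, 2 = uniform speed, 3 = lock-up) are hypothetical choices made for illustration only; the patent text does not specify them.

```python
import numpy as np

def state_change_rate(state, window=5):
    """First-order difference followed by moving-average smoothing."""
    rate = np.diff(state, prepend=state[0])
    kernel = np.ones(window) / window
    return np.convolve(rate, kernel, mode="same")

def phase_labels(rate, move_eps=0.1, steady_tol=0.2):
    """Assign a phase number to every sample.

    Assumed rule (not from the source): motion starts when |rate| exceeds
    move_eps * peak, the uniform-speed interval is where |rate| stays within
    steady_tol of the peak, and lock-up covers the remaining deceleration.
    """
    labels = np.zeros(len(rate), dtype=int)
    mag = np.abs(rate)
    peak = mag.max()
    if peak == 0:
        return labels                      # no motion at all: everything idle
    moving = np.flatnonzero(mag > move_eps * peak)
    steady = np.flatnonzero(mag >= (1.0 - steady_tol) * peak)
    start, end = moving[0], moving[-1]
    s0, s1 = steady[0], steady[-1]
    labels[start:s0] = 1                   # door opening (acceleration)
    labels[s0:s1 + 1] = 2                  # uniform speed
    labels[s1 + 1:end + 1] = 3             # lock-up (deceleration)
    return labels
```

The same label array can then be used to slice the synchronized audio sequence, so that audio segments and state segments share one phase boundary time index set.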

[0010] Optionally, the step of generating the phase-normalized voiceprint sequence and the phase-normalized state sequence includes:

[0011] Based on the phase boundary time index set, the voiceprint feature sequence and the state feature sequence are synchronously sliced to obtain the voiceprint segment sequence and the state segment sequence grouped by phase number;

[0012] For each phase number, the number of frames of the corresponding voiceprint segment is counted and the target length is determined. The target length is the median of the number of frames of the phase number within the preset history window. Interpolation resampling and truncation processing are performed on the voiceprint segment to make the length of the voiceprint segment equal to the target length, thus generating a voiceprint segment with regular length.

[0013] For the same phase number, count the number of sampling points of the corresponding state segment and determine the target length. Perform interpolation resampling and truncation processing on the state segment to make the length of the state segment equal to the target length, and generate a length-regular state segment.

[0014] Within the same phase number, using length-normalized voiceprint segments as reference sequences and length-normalized state segments as alignment sequences, a frame-level distance matrix is constructed and dynamic time warping is performed to obtain the alignment path index sequence;

[0015] Based on the alignment path index sequence, the length-normalized voiceprint segments and length-normalized state segments are synchronously rearranged and resampled to generate the phase-normalized voiceprint sequence and the phase-normalized state sequence.
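The length normalization step described above can be sketched as follows. This is a minimal illustration, assuming segments are stored as `(frames, features)` arrays and that linear interpolation is an acceptable resampling method; the helper names `target_length` and `normalize_length` are hypothetical, and the separate truncation step mentioned in the text is subsumed here by resampling.

```python
import numpy as np

def target_length(history_lengths):
    """Median frame count of one phase number over a preset history window."""
    return int(np.median(history_lengths))

def normalize_length(segment, target):
    """Resample a (frames, features) segment to exactly `target` frames
    by linear interpolation along the time axis."""
    segment = np.asarray(segment, dtype=float)
    n = segment.shape[0]
    if n == target:
        return segment.copy()
    src = np.linspace(0.0, 1.0, n)        # original frame positions
    dst = np.linspace(0.0, 1.0, target)   # desired frame positions
    columns = [np.interp(dst, src, segment[:, k])
               for k in range(segment.shape[1])]
    return np.stack(columns, axis=1)
```

Because the voiceprint segment and the state segment of the same phase number are resampled to a shared target length, they can be compared frame by frame in the subsequent alignment step.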

[0016] Optionally, the improved Jamba network is composed of a segmented Mamba layer, a cross-stage Transformer layer, a coupled gated expert hybrid layer, and an anomaly branch layer stacked in hierarchical order. The segmented Mamba layer segments the phase-normalized voiceprint sequence and the phase-normalized state sequence according to the phase coding sequence. In the sub-layer corresponding to each phase number, selective state updates are performed and a stage embedding sequence is generated. The cross-stage Transformer layer performs self-attention calculation on the boundary frames corresponding to different phase numbers in the stage embedding sequence to complete the stage coupling update. The coupled gated expert hybrid layer calculates the routing weight based on the phase coding sequence, the phase-normalized state sequence, and the dynamic consistency residual sequence, and performs weighted fusion on the output of the expert sub-layer to generate an expert fusion representation sequence. When the anomaly potential sequence exceeds a preset threshold, the anomaly branch layer performs enhancement calculation on the expert fusion representation sequence and outputs an anomaly enhancement representation sequence.

[0017] Optionally, the step of allocating the phase-coded sequence to the segmented Mamba layer for phase state updates includes:

[0018] Based on the phase coding sequence, the phase-normalized voiceprint sequence and the phase-normalized state sequence are indexed and grouped, and the feature frames corresponding to the same phase number are divided into the same stage subsequence.

[0019] A separate set of state space parameters is preset for each phase number. The set of state space parameters includes the state transition matrix, the input mapping matrix, and the output mapping matrix.

[0020] Each stage subsequence is input into the state space parameter set corresponding to the phase number in chronological order, and a selective state update operation is performed. The selective state update operation includes calculating the update threshold based on the current input features and the hidden state of the previous time step, performing a gated linear transformation on the hidden state and generating the hidden state of the current time step.

[0021] After the state update of each stage subsequence is completed, the hidden states at each time step are extracted and recombined in the original time order to generate the stage embedding sequence.
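As a rough illustration of the per-phase selective state update described above, the sketch below keeps one parameter set per phase number and runs a gated linear recurrence over each stage subsequence. The gate weights `Wg`, the sigmoid gate form, the initialization, and the function names are assumptions made for this example; the source only states that an update gate is computed from the current input and the previous hidden state. It is not the Mamba implementation itself.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_phase_params(num_phases, d_in, d_hidden, seed=0):
    """One independent state space parameter set per phase number."""
    rng = np.random.default_rng(seed)
    params = {}
    for p in range(num_phases):
        params[p] = {
            "A": 0.9 * np.eye(d_hidden),                    # state transition matrix
            "B": rng.normal(0, 0.1, (d_hidden, d_in)),      # input mapping matrix
            "C": rng.normal(0, 0.1, (d_hidden, d_hidden)),  # output mapping matrix
            "Wg": rng.normal(0, 0.1, (d_hidden, d_in + d_hidden)),  # gate weights (assumed)
        }
    return params

def segmented_state_update(x, phase_codes, params):
    """x: (T, d_in) features; phase_codes: (T,) phase number per frame.
    Returns a (T, d_hidden) stage embedding sequence in original time order."""
    d_hidden = next(iter(params.values()))["A"].shape[0]
    out = np.zeros((x.shape[0], d_hidden))
    for p in np.unique(phase_codes):
        prm = params[int(p)]
        h = np.zeros(d_hidden)                       # hidden state restarts per phase
        for t in np.flatnonzero(phase_codes == p):   # chronological within the phase
            gate = sigmoid(prm["Wg"] @ np.concatenate([x[t], h]))
            h = gate * (prm["A"] @ h) + (1.0 - gate) * (prm["B"] @ x[t])
            out[t] = prm["C"] @ h                    # written back at the original index
    return out
```

Writing each output back at its original time index reproduces the recombination in original time order, and frames of one phase never influence the hidden state of another phase.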

[0022] Optionally, the step of generating the physically consistent embedded sequence includes:

[0023] Extract the embedding vectors corresponding to the last frame of the opening stage, the last frame of the closing stage, and the start frame of the locking stage from the stage embedding sequence, construct the stage boundary index set, and combine the embedding vectors corresponding to the stage boundary index set into the stage boundary embedding set.

[0024] Taking the stage boundary embedding set as input, a linear mapping is performed on the stage boundary embedding set within the cross-stage Transformer layer to obtain the query vector set, key vector set, and value vector set. The similarity matrix between the query vector set and the key vector set is calculated. The similarity matrix is normalized by row to obtain the attention weight matrix. The attention weight matrix is used to perform a weighted summation on the value vector set and then linearly mapped to obtain the stage coupled representation sequence.

[0025] The difference in voiceprint energy between adjacent frames is calculated using the time index to obtain the voiceprint energy change rate sequence. The difference in state change between adjacent sampling points is calculated using the time index to obtain the state change rate sequence.

[0026] The voiceprint energy change rate sequence and the state change rate sequence are multiplied pointwise using the same time index to obtain the dynamic consistency residual sequence. The dynamic consistency residual sequence is then standardized to obtain the standardized dynamic consistency residual sequence.

[0027] A linear mapping is performed on the standardized dynamic consistency residual sequence to obtain a residual embedding sequence. The residual embedding sequence is aligned with the stage coupling representation sequence according to the time index. For each time index, the residual embedding vector and the stage coupling representation vector are vector summed and linearly mapped to generate a physically consistent embedding sequence.
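Two of the computations above lend themselves to a short sketch: the self-attention over stage boundary embeddings and the dynamic consistency residual. The code assumes scaled dot-product similarity with a row-wise softmax as the row normalization, and z-score standardization of the residual; both are illustrative choices, and the weight matrices `Wq`, `Wk`, `Wv`, `Wo` are hypothetical placeholders for the linear mappings. The final fusion of residual embeddings with the coupled representation is omitted.

```python
import numpy as np

def boundary_self_attention(boundary, Wq, Wk, Wv, Wo):
    """boundary: (B, d) stage boundary embedding set.
    Returns the stage coupled representation (B, d_out) and the attention weights."""
    q, k, v = boundary @ Wq, boundary @ Wk, boundary @ Wv
    scores = q @ k.T / np.sqrt(k.shape[1])        # similarity matrix
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True) # row normalization (softmax, assumed)
    return (weights @ v) @ Wo, weights

def dynamic_consistency_residual(energy, state, eps=1e-8):
    """Pointwise product of the voiceprint energy change rate and the state
    change rate on a shared time index, followed by standardization."""
    d_energy = np.diff(energy, prepend=energy[0])
    d_state = np.diff(state, prepend=state[0])
    residual = d_energy * d_state
    return (residual - residual.mean()) / (residual.std() + eps)
```

Intuitively, the residual is large where the acoustic energy and the door state change sharply at the same moment, which is why it can serve as a physically motivated cue for the later routing stage.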

[0028] Optionally, the step of performing conditional routing, expert activation, and weighted fusion in the coupled gated expert hybrid layer includes:

[0029] The physically consistent embedding sequence, the phase coding sequence, the phase-normalized state sequence, and the dynamic consistency residual sequence are concatenated according to time index to form a routing decision feature vector;

[0030] At each time index, the phase determination score, the state anomaly determination score, and the dynamic consistency determination score are calculated respectively. The phase determination score, the state anomaly determination score, and the dynamic consistency determination score are linearly combined to obtain the comprehensive routing score.

[0031] Based on the comprehensive routing score, an expert activation mask vector is constructed, and the expert set is grouped and filtered. At each time index, only experts who meet the phase consistency condition and the dynamic consistency condition are allowed to enter the candidate set.

[0032] Within the candidate set, the top K experts are selected and activated according to their comprehensive routing scores, while the remaining experts remain frozen at the time index and do not participate in the forward calculation.

[0033] The physical consistency embedding sequence is input into the activated expert sub-layers respectively, and each expert sub-layer performs independent mapping calculation on the input features to generate expert output vectors;

[0034] Based on the comprehensive routing score at the corresponding time index, normalized weighted fusion is performed on the output vectors of each expert to generate an expert fusion representation sequence.
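The routing steps above (score combination, masking, top-K activation, normalized weighted fusion) can be sketched as follows. This is a simplified illustration: experts are plain callables that preserve the input dimension, the combination coefficients are arbitrary, weights are normalized with a softmax over the selected scores, and a frame with an empty candidate set is passed through unchanged. All of these are assumptions, since the source does not fix them.

```python
import numpy as np

def comprehensive_score(phase_s, state_s, dyn_s, coeffs=(1.0, 1.0, 1.0)):
    """Linear combination of the three (T, E) determination scores."""
    a, b, c = coeffs
    return a * phase_s + b * state_s + c * dyn_s

def route_and_fuse(x, scores, mask, experts, k=2):
    """x: (T, d) physically consistent embeddings; scores: (T, E) comprehensive
    routing scores; mask: (T, E) boolean expert activation mask;
    experts: list of E callables mapping a (d,) vector to a (d,) vector."""
    out = np.zeros_like(x, dtype=float)
    for t in range(x.shape[0]):
        cand = np.flatnonzero(mask[t])               # candidate set after filtering
        if cand.size == 0:
            out[t] = x[t]                            # assumed fallback: pass-through
            continue
        order = np.argsort(scores[t, cand])[::-1][:k]
        top = cand[order]                            # top-K activated experts
        w = np.exp(scores[t, top] - scores[t, top].max())
        w /= w.sum()                                 # normalized fusion weights
        # Experts outside `top` are never called for this time index.
        out[t] = sum(wi * experts[e](x[t]) for wi, e in zip(w, top))
    return out
```

Only the experts in `top` are evaluated at a given time index, which corresponds to the remaining experts staying frozen and not participating in the forward calculation.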

[0035] Optionally, the step of generating the final diagnosis result for subway door noise includes:

[0036] Extract the fusion embedding vector from the expert fusion representation sequence by time index, group the fusion embedding vector by phase according to the phase encoding sequence, and construct the current embedding trajectory by phase number;

[0037] Call the normal baseline embedding trajectory that matches the current phase encoding sequence, perform synchronous slicing on the normal baseline embedding trajectory according to the phase number, perform time alignment between the current embedding trajectory and the normal baseline embedding trajectory, and generate an aligned embedding trajectory pair;

[0038] For the aligned embedded trajectory pairs, the embedding bias is calculated at each time index. The abnormal potential energy sequence is generated by arranging them according to the time index. The abnormal potential energy sequence is then aggregated by sliding window to obtain the windowed abnormal potential energy sequence.

[0039] The abnormal potential energy sequence of the window is compared with the preset potential energy threshold. When there are consecutive windows exceeding the threshold in the abnormal potential energy sequence of the window, an abnormal trigger flag is generated. Based on the abnormal trigger flag, the fusion embedding vector corresponding to the time index range is extracted from the expert fusion representation sequence as the abnormal segment input.

[0040] When the anomaly trigger flag is true, the anomaly fragment is input into the anomaly branch to perform augmentation calculation and generate an anomaly augmentation representation sequence. When the anomaly trigger flag is false, the expert fusion representation sequence is used as the anomaly augmentation representation sequence.

[0041] The abnormal enhanced representation sequence is input into the diagnostic output layer, which outputs the results of the presence of abnormal noise, the type of abnormal noise, and the location of the component. The severity level is determined based on the peak amplitude of the abnormal potential energy sequence and the length of the continuous over-threshold window, and the final diagnosis result of the abnormal noise of the subway door is generated.
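The abnormal potential energy calculation and severity grading described above can be sketched as below. The sketch assumes the embedding deviation is a Euclidean norm, that windows are aggregated by their mean, and that severity is graded by fixed multiples of the threshold and of the minimum run length; the grading rule and all parameter values are hypothetical, as the source only states that peak amplitude and the length of the continuous over-threshold window are used jointly.

```python
import numpy as np

def anomaly_potential(current, baseline):
    """Per-frame deviation between aligned current and baseline embeddings."""
    return np.linalg.norm(current - baseline, axis=1)

def windowed(potential, window=4, hop=2):
    """Sliding-window aggregation (mean) of the abnormal potential energy."""
    starts = range(0, len(potential) - window + 1, hop)
    return np.array([potential[s:s + window].mean() for s in starts])

def longest_run(flags):
    """Length of the longest run of consecutive True values."""
    best = run = 0
    for f in flags:
        run = run + 1 if f else 0
        best = max(best, run)
    return best

def diagnose(potential, threshold, min_run=2, window=4, hop=2):
    win = windowed(potential, window, hop)
    run = longest_run(win > threshold)
    if run < min_run:
        return {"triggered": False, "severity": "none"}
    peak = float(potential.max())
    # Hypothetical grading rule combining peak amplitude and duration.
    if peak > 3 * threshold or run >= 3 * min_run:
        severity = "severe"
    elif peak > 2 * threshold or run >= 2 * min_run:
        severity = "moderate"
    else:
        severity = "mild"
    return {"triggered": True, "severity": severity, "peak": peak, "run": run}
```

Requiring several consecutive over-threshold windows, rather than a single exceedance, is what keeps isolated noise spikes from raising the anomaly trigger flag.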

[0042] Optionally, the step of updating and freezing the normal baseline embedding trajectory according to the severity level is as follows: An update strategy is determined based on the severity level; baseline updating is performed when the severity level is mild, and freezing is performed when the severity level is moderate or severe. During baseline updating, the current embedding trajectory is aligned frame-by-frame with the normal baseline embedding trajectory according to phase number. At the same time index, the current embedding vector and the baseline embedding vector are weighted and averaged according to a preset update coefficient to generate an updated baseline embedding vector, which replaces the original baseline embedding vector. During freezing, the normal baseline embedding trajectory remains unchanged, and the deviation statistics between the current embedding trajectory and the normal baseline embedding trajectory are recorded for trend monitoring. The updated or frozen normal baseline embedding trajectory serves as the reference trajectory for calculating the abnormal potential energy sequence in the next running cycle.
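The update-or-freeze policy in the preceding paragraph amounts to a weighted average that is applied only for mild cases. A minimal sketch follows, assuming the trajectories are already aligned frame by frame as `(T, d)` arrays; the function name, the default update coefficient `alpha`, and the use of a mean Euclidean deviation as the logged statistic are illustrative assumptions, and any severity other than "mild" is treated as a freeze.

```python
import numpy as np

def update_baseline(baseline, current, severity, alpha=0.1, log=None):
    """Return the reference trajectory for the next operating cycle.

    mild            -> weighted average of current and baseline embeddings
    any other level -> baseline kept unchanged; deviation logged for trend monitoring
    """
    if severity == "mild":
        return (1.0 - alpha) * baseline + alpha * current
    if log is not None:
        deviation = np.linalg.norm(current - baseline, axis=1).mean()
        log.append(float(deviation))
    return baseline
```

Freezing on moderate or severe anomalies prevents a degrading component from being gradually absorbed into the normal reference.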

[0043] According to an embodiment of the present invention, a smart diagnostic device for abnormal noises from subway doors based on voiceprint recognition includes the following modules:

[0044] The audio and status acquisition module is used to collect audio data of subway door opening and closing operation and door status data, and generate phase-labeled audio sequences and phase-labeled status sequences.

[0045] The feature extraction module is used to generate a voiceprint feature sequence based on the phase-annotated audio sequence and a state feature sequence based on the phase-annotated state sequence.

[0046] The phase warping module is used to construct the phase encoding sequence and perform length warping and dynamic time warping on the voiceprint feature sequence and the state feature sequence to generate a phase-warped voiceprint sequence and a phase-warped state sequence.

[0047] An improved Jamba computation module is used to perform phase state updates and phase coupling updates in segmented Mamba layers and cross-phase Transformer layers according to phase-encoded sequences;

[0048] The physical consistency calculation module is used to calculate the dynamic consistency residual sequence and generate the physical consistency embedded sequence.

[0049] The coupled gated expert hybrid module is used to perform conditional routing, expert activation, and weighted fusion to generate expert fusion representation sequences;

[0050] The anomaly detection module is used to calculate the abnormal potential energy sequence, trigger abnormal branches, and output the results of the existence of abnormal noise, the type of abnormal noise, the component location results, and the severity level.

[0051] The baseline management module is used to update or freeze the normal baseline embedding trajectory according to the severity level.

[0052] The beneficial effects of this invention are:

[0053] (1) By constructing phase-labeled audio sequences, phase-labeled state sequences, phase-normalized voiceprint sequences, and phase-normalized state sequences, the structured segmentation modeling of the door opening stage, door closing stage, and locking stage is realized, avoiding the overlap of features in different operating stages and improving the stability of abnormal noise detection and the accuracy of stage recognition.

[0054] (2) By introducing a segmented Mamba layer, a cross-stage Transformer layer, and a coupled gated expert hybrid mechanism based on phase coding sequence, phase regularization state sequence and dynamic consistency residual sequence into the improved Jamba network, the collaborative modeling of acoustic features and state features is realized, thereby improving the accuracy of abnormal noise type identification and component positioning accuracy.

[0055] (3) By constructing an abnormal potential energy sequence, abnormal branch enhancement calculation and severity level determination mechanism, and combining the normal baseline embedded trajectory update freezing strategy, the quantitative analysis of abnormal amplitude and duration is realized, and a graded diagnostic result is formed, thereby improving the interpretability and long-term operational adaptability of the system.

Attached Figure Description

[0056] The accompanying drawings are provided to further illustrate the invention and form part of the specification. They are used in conjunction with embodiments of the invention to explain the invention and do not constitute a limitation thereof. In the drawings:

[0057] Figure 1 is an overall flowchart of the method for intelligent diagnosis of abnormal noises from subway doors based on voiceprint recognition proposed in this invention;

[0058] Figure 2 is a schematic diagram illustrating the generation process of the phase-labeled audio sequence and the phase-labeled state sequence proposed in this invention;

[0059] Figure 3 is a schematic diagram of the dynamic time warping processing structure for the phase-normalized voiceprint sequence and the phase-normalized state sequence proposed in this invention.

Detailed Implementation

[0060] The present invention will now be described in further detail with reference to the accompanying drawings. These drawings are simplified schematic diagrams, illustrating only the basic structure of the invention, and therefore only show the components relevant to the invention.

[0061] Referring to Figures 1 to 3, a method for intelligent diagnosis of abnormal noises from subway car doors based on voiceprint recognition includes the following steps:

[0062] The system collects audio data of subway door opening and closing operation and door status data, preprocesses them to generate phase-labeled audio sequences and phase-labeled status sequences; extracts voiceprint features based on the phase-labeled audio sequences and aggregates them by phase to generate voiceprint feature sequences; performs denoising, normalization and window statistics based on the phase-labeled status sequences to generate status feature sequences.

[0063] Specifically, the phase-labeled audio sequence is divided into frames with a fixed frame length and frame shift. Hamming window weighting and fast Fourier transform are performed on each frame to calculate the amplitude spectrum and perform logarithmic compression to obtain the logarithmic spectrum matrix.

[0064] A Mel filter bank transform is applied to the logarithmic spectrum matrix to calculate the Mel band energy and perform a discrete cosine transform to obtain the cepstral coefficient matrix. At the same time, the spectral centroid, spectral bandwidth and spectral kurtosis are calculated to form the basic feature matrix of the voiceprint.

[0065] The basic feature matrix of voiceprint is grouped according to phase number. Within each phase, a sliding window statistical operation is performed to calculate the mean, variance and maximum value, and generate a voiceprint feature sequence arranged in chronological order.

[0066] The phase-labeled state sequence is denoised and normalized, and segmented statistically analyzed according to a fixed window length. The mean value of motor current, the rate of change of motor current, the rate of change of displacement, and the door opening and closing time are calculated to form a basic state feature matrix.

[0067] The state feature matrix is grouped by phase number and arranged in chronological order to generate a state feature sequence.
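Part of the voiceprint feature extraction described in the paragraphs above can be sketched as follows: framing, Hamming windowing, FFT magnitude, logarithmic compression, spectral centroid, and per-window statistics. The frame length, hop size, and function names are assumptions for illustration. The Mel filter bank, discrete cosine transform, bandwidth, and kurtosis steps are omitted, and non-overlapping windows are used in place of a general sliding window to keep the example short.

```python
import numpy as np

def log_spectrum(audio, frame_len=256, hop=128):
    """Split audio into fixed-length frames, apply a Hamming window and FFT.
    Returns (log-magnitude spectrum, magnitude spectrum), each (frames, bins)."""
    n_frames = 1 + (len(audio) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([audio[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(magnitude + 1e-10), magnitude

def spectral_centroid(magnitude, sample_rate):
    """Magnitude-weighted mean frequency of every frame, in Hz."""
    n_fft = 2 * (magnitude.shape[1] - 1)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    return (magnitude * freqs).sum(axis=1) / (magnitude.sum(axis=1) + 1e-10)

def window_stats(features, window=4):
    """Mean, variance and maximum over consecutive non-overlapping windows
    of a (frames, features) matrix belonging to one phase."""
    n = features.shape[0] // window
    blocks = features[:n * window].reshape(n, window, -1)
    return np.concatenate([blocks.mean(axis=1), blocks.var(axis=1),
                           blocks.max(axis=1)], axis=1)
```

Applying `window_stats` separately to the frames of each phase number yields the per-phase aggregated voiceprint feature sequence in chronological order.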

[0068] Based on the phase segmentation results, each frame is mapped to a phase number, and a phase coding sequence is constructed. The voiceprint feature sequence and the state feature sequence are then subjected to length normalization and dynamic time warping according to the phase coding sequence, generating a phase-normalized voiceprint sequence and a phase-normalized state sequence. The phase-normalized voiceprint sequence, the phase-normalized state sequence, and the phase coding sequence are input into an improved Jamba network, and phase state updates are performed in the segmented Mamba layer according to the phase coding sequence, yielding a stage embedding sequence. The stage embedding sequence is then input into a cross-stage Transformer layer to perform stage coupling updates. Based on the phase-normalized voiceprint sequence and the phase-normalized state sequence, a dynamic consistency residual sequence is calculated, and the dynamic consistency residual sequence and the stage coupling results are fused to obtain a physically consistent embedding sequence. The physically consistent embedding sequence, the phase coding sequence, the phase-normalized state sequence, and the dynamic consistency residual sequence are input into the coupled gated expert hybrid layer to perform conditional routing, expert activation, and weighted fusion, yielding an expert fusion representation sequence. Based on the expert fusion representation sequence, an abnormal potential energy sequence is calculated by alignment with the normal baseline embedding trajectory. When the abnormal potential energy sequence exceeds the threshold, an abnormal branch is triggered to calculate the abnormal enhanced representation sequence. The abnormal noise existence result, abnormal noise type, component location result, and severity level are output to generate the final subway door abnormal noise diagnosis result. The normal baseline embedding trajectory is updated or frozen according to the severity level.

[0069] In this embodiment, the preprocessing steps for generating phase-labeled audio sequences and phase-labeled state sequences include: performing timestamp correction and unified sampling frequency resampling on the acquired audio data and door state data to construct a synchronized original audio sequence and a synchronized original state sequence; performing first-order difference operations and sliding window smoothing on the synchronized original state sequence to calculate the state change rate sequence; determining the start time of the door opening stage, the uniform speed stage interval, and the lock-up stage start time based on the zero-crossing points, peak points, and stable intervals in the state change rate sequence to generate a phase boundary time index set; performing time slicing on the synchronized original audio sequence according to the phase boundary time index set to obtain door opening audio segments, uniform speed audio segments, and lock-up audio segments; assigning a corresponding phase number to each audio segment to construct a phase-labeled audio sequence; and performing time slicing on the synchronized original state sequence according to the phase boundary time index set to obtain door opening state segments, uniform speed state segments, and lock-up state segments; assigning a corresponding phase number to each state segment to construct a phase-labeled state sequence.

[0070] In this embodiment, the steps of generating the phase-normalized voiceprint sequence and the phase-normalized state sequence include:

[0071] Based on the phase boundary time index set, the voiceprint feature sequence and the state feature sequence are synchronously sliced to obtain the voiceprint segment sequence and the state segment sequence grouped by phase number;

[0072] For each phase number, the number of frames of the corresponding voiceprint segment is counted and the target length is determined. The target length is the median of the number of frames of the phase number within the preset history window. Interpolation resampling and truncation processing are performed on the voiceprint segment to make the length of the voiceprint segment equal to the target length, thus generating a voiceprint segment with regular length.

[0073] For the same phase number, count the number of sampling points of the corresponding state segment and determine the target length. Perform interpolation resampling and truncation processing on the state segment to make the length of the state segment equal to the target length, and generate a length-regular state segment.

[0074] Within the same phase number, using the length-normalized voiceprint segment as the reference sequence and the length-normalized state segment as the alignment sequence, a frame-level distance matrix is constructed and dynamic time warping is performed to obtain the alignment path index sequence;

[0075] Based on the alignment path index sequence, the length-normalized voiceprint segments and length-normalized state segments are synchronously rearranged and resampled to generate the phase-normalized voiceprint sequence and the phase-normalized state sequence.
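
As a non-limiting illustration, the length normalization of paragraphs [0072] and [0073] may be sketched as follows. The function names are hypothetical, and linear interpolation is an assumed choice of resampling; the truncation mentioned in the embodiment is not needed in this sketch because interpolation alone yields the target length.

```python
import numpy as np

def target_length(history_frame_counts):
    """Median frame count of one phase number within the history window."""
    return int(np.median(history_frame_counts))

def resample_to_length(segment, length):
    """Linearly interpolate a (frames, features) segment onto `length` frames."""
    segment = np.asarray(segment, dtype=float)
    src = np.linspace(0.0, 1.0, num=segment.shape[0])
    dst = np.linspace(0.0, 1.0, num=length)
    columns = [np.interp(dst, src, segment[:, k]) for k in range(segment.shape[1])]
    return np.stack(columns, axis=1)
```

The same routine can be applied to voiceprint segments and state segments so that both reach the target length of their phase number.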

[0076] In this embodiment, the steps for performing dynamic time warping include:

[0077] Using the length-normalized voiceprint segment as the reference sequence and the length-normalized state segment as the alignment sequence, let the length of the reference sequence be $N$ and the length of the alignment sequence be $M$. Construct a frame-level distance matrix $D$ of size $N \times M$, where the matrix element $d(i,j)$ denotes the Euclidean distance between the $i$-th frame of the reference sequence and the $j$-th frame of the alignment sequence;

[0078] Construct a cumulative cost matrix $C$ of the same size, initialize $C(1,1) = d(1,1)$, and calculate the first row and the first column by single-step recursive accumulation;

[0079] The elements within the matrix are calculated using the following recursive formula:

[0080] $C(i,j) = d(i,j) + \min\{C(i-1,j),\ C(i,j-1),\ C(i-1,j-1)\}$;

[0081] where $i = 2, \dots, N$ and $j = 2, \dots, M$;

[0082] Starting from $C(N,M)$, backtrack along the path of minimum cost to obtain the alignment path index sequence $P = \{(i_k, j_k)\}$;

[0083] Based on the alignment path index sequence, the length-normalized voiceprint segments and length-normalized state segments are indexed, rearranged, and synchronously resampled to generate phase-normalized voiceprint sequences and phase-normalized state sequences.
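
As a non-limiting illustration, the dynamic time warping steps above may be sketched as follows. The function name and the tie-breaking order during backtracking are assumptions; indices are zero-based, and both inputs are assumed to be arrays of shape (frames, features) with the same feature dimension.

```python
import numpy as np

def dtw_path(ref, ali):
    """Return the alignment path and total cost between two frame sequences."""
    ref = np.asarray(ref, dtype=float)
    ali = np.asarray(ali, dtype=float)
    n, m = len(ref), len(ali)
    # Frame-level Euclidean distance matrix.
    d = np.linalg.norm(ref[:, None, :] - ali[None, :, :], axis=2)
    # Cumulative cost matrix: first row and column by single-step accumulation.
    c = np.empty((n, m))
    c[0, 0] = d[0, 0]
    c[1:, 0] = d[0, 0] + np.cumsum(d[1:, 0])
    c[0, 1:] = d[0, 0] + np.cumsum(d[0, 1:])
    for i in range(1, n):
        for j in range(1, m):
            c[i, j] = d[i, j] + min(c[i - 1, j], c[i, j - 1], c[i - 1, j - 1])
    # Backtrack from the last cell along the minimum-cost predecessors.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while i > 0 or j > 0:
        if i == 0:
            j -= 1
        elif j == 0:
            i -= 1
        else:
            step = int(np.argmin([c[i - 1, j - 1], c[i - 1, j], c[i, j - 1]]))
            if step == 0:
                i, j = i - 1, j - 1
            elif step == 1:
                i -= 1
            else:
                j -= 1
        path.append((i, j))
    return path[::-1], float(c[-1, -1])
```

The returned path pairs each reference frame index with an alignment frame index, which is what the rearrangement and synchronous resampling step consumes.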

[0084] In this embodiment, the improved Jamba network is composed of a segmented Mamba layer, a cross-stage Transformer layer, a coupled gated expert hybrid layer, and an anomaly branch layer stacked in hierarchical order. The segmented Mamba layer segments the phase-normalized voiceprint sequence and the phase-normalized state sequence according to the phase coding sequence. In the sub-layer corresponding to each phase number, selective state updates are performed and a stage embedding sequence is generated. The cross-stage Transformer layer performs self-attention calculation on the boundary frames corresponding to different phase numbers in the stage embedding sequence to complete the stage coupling update. The coupled gated expert hybrid layer calculates the routing weight based on the phase coding sequence, the phase-normalized state sequence, and the dynamic consistency residual sequence, and performs weighted fusion of the outputs of the expert sub-layers to generate an expert fusion representation sequence. When the anomaly potential sequence exceeds a preset threshold, the anomaly branch layer performs enhancement calculation on the expert fusion representation sequence and outputs an anomaly enhancement representation sequence.

[0085] In this embodiment, the step of performing phase state updates in the segmented Mamba layers according to the phase coding sequence includes:

[0086] Based on the phase coding sequence, the phase-normalized voiceprint sequence and the phase-normalized state sequence are indexed and grouped, and the feature frames corresponding to the same phase number are divided into the same stage subsequence.

[0087] A separate set of state space parameters is preset for each phase number. The set of state space parameters includes the state transition matrix, the input mapping matrix, and the output mapping matrix.

[0088] Each stage subsequence is input into the state space parameter set corresponding to the phase number in chronological order, and a selective state update operation is performed. The selective state update operation includes calculating the update threshold based on the current input features and the hidden state of the previous time step, performing a gated linear transformation on the hidden state and generating the hidden state of the current time step.

[0089] The specific steps for performing selective state update operations include:

[0090] Take the stage subsequence corresponding to a given phase number and the hidden state at the previous time step. A phase selection weight is constructed from the current input features and the phase encoding vector at the current time step;

[0091] Construct the phase-conditional state transition matrix from the base state transition matrix and the phase modulation matrix;

[0092] Perform the state update based on the phase-conditional state transition matrix and the input mapping matrix, and iterate over all time steps of the stage subsequence to obtain the stage embedding sequence.

[0093] After the state update of each stage subsequence is completed, the hidden states at each time step are extracted and recombined in the original time order to generate the stage embedding sequence.

[0094] In this embodiment, the step of generating a physically consistent embedded sequence includes:

[0095] Extract the embedding vectors corresponding to the last frame of the opening stage, the last frame of the closing stage, and the start frame of the locking stage from the stage embedding sequence, construct the stage boundary index set, and combine the embedding vectors corresponding to the stage boundary index set into the stage boundary embedding set.

[0096] Taking the stage boundary embedding set as input, a linear mapping is performed on the stage boundary embedding set within the cross-stage Transformer layer to obtain the query vector set, key vector set, and value vector set. The similarity matrix between the query vector set and the key vector set is calculated. The similarity matrix is normalized by row to obtain the attention weight matrix. The attention weight matrix is used to perform a weighted summation on the value vector set and then linearly mapped to obtain the stage coupled representation sequence.
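
As a non-limiting illustration, the attention calculation over the stage boundary embedding set may be sketched as follows. The function name and the weight matrices `w_q`, `w_k`, `w_v`, `w_o` are hypothetical; scaling by the square root of the key dimension and using a softmax for the row normalization are assumptions, since the embodiment only states that the similarity matrix is normalized by row.

```python
import numpy as np

def boundary_self_attention(boundary_emb, w_q, w_k, w_v, w_o):
    """Self-attention over stage boundary embeddings of shape (B, d)."""
    q = boundary_emb @ w_q                       # query vector set
    k = boundary_emb @ w_k                       # key vector set
    v = boundary_emb @ w_v                       # value vector set
    sim = (q @ k.T) / np.sqrt(k.shape[1])        # similarity matrix (assumed scaling)
    sim = sim - sim.max(axis=1, keepdims=True)   # numerical stability
    attn = np.exp(sim)
    attn /= attn.sum(axis=1, keepdims=True)      # row-wise normalization (assumed softmax)
    return (attn @ v) @ w_o, attn                # weighted sum, then linear mapping
```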

[0097] The difference in voiceprint energy between adjacent frames is calculated using the time index to obtain the voiceprint energy change rate sequence. The difference in state change between adjacent sampling points is calculated using the time index to obtain the state change rate sequence.

[0098] Among them, the voiceprint energy difference is the inter-frame energy change obtained by first-order difference of the logarithmic spectral energy sum of each time frame in the phase-normalized voiceprint sequence according to the time order, and the state change difference is the inter-frame state change obtained by first-order difference of the motor current characteristics and displacement characteristics in the phase-normalized state sequence according to the time order and weighted summation.

[0099] The voiceprint energy change rate sequence and the state change rate sequence are multiplied pointwise using the same time index to obtain the dynamic consistency residual sequence. The dynamic consistency residual sequence is then standardized to obtain the standardized dynamic consistency residual sequence.

[0100] A linear mapping is performed on the standardized dynamic consistency residual sequence to obtain a residual embedding sequence. The residual embedding sequence is aligned with the stage coupling representation sequence according to the time index. For each time index, the residual embedding vector and the stage coupling representation vector are vector summed and linearly mapped to generate a physically consistent embedding sequence.
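
As a non-limiting illustration, the dynamic consistency residual of paragraphs [0097] to [0099] may be sketched as follows. The function name, the equal weights for current and displacement, and the use of z-score standardization are assumptions; the embodiment specifies only a weighted summation and a standardization step.

```python
import numpy as np

def dynamic_consistency_residual(log_spec, current, displacement,
                                 w_current=0.5, w_disp=0.5, eps=1e-8):
    """Return (raw residual, standardized residual) for one cycle.

    log_spec: (frames, bins) logarithmic spectrum of the voiceprint sequence.
    current, displacement: (frames,) state features on the same time index.
    """
    energy = np.asarray(log_spec, dtype=float).sum(axis=1)   # log-spectral energy sum
    energy_rate = np.diff(energy)                            # voiceprint energy change rate
    state_rate = (w_current * np.diff(np.asarray(current, dtype=float))
                  + w_disp * np.diff(np.asarray(displacement, dtype=float)))
    residual = energy_rate * state_rate                      # pointwise product
    standardized = (residual - residual.mean()) / (residual.std() + eps)
    return residual, standardized
```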

[0101] In this embodiment, the steps of performing conditional routing, expert activation, and weighted fusion in the coupled gated expert hybrid layer include:

[0102] The physically consistent embedding sequence, phase encoding sequence, phase-normalized state sequence, and dynamic consistency residual sequence are concatenated by time index to form a routing decision feature vector;

[0103] At each time index, the phase determination score, the state anomaly determination score, and the dynamic consistency determination score are calculated respectively. The phase determination score, the state anomaly determination score, and the dynamic consistency determination score are linearly combined to obtain the comprehensive routing score.

[0104] Specifically, at each time index $t$ and for each expert number $k$, the phase determination score $s^{\mathrm{ph}}_{t,k}$, the state anomaly determination score $s^{\mathrm{st}}_{t,k}$, and the dynamic consistency determination score $s^{\mathrm{dyn}}_{t,k}$ are calculated and linearly combined to obtain the comprehensive routing score $R_{t,k}$, as follows:

[0105] Extract the phase encoding vector $p_t$ from the phase encoding sequence and take its inner product with the expert phase weight vector $w^{\mathrm{ph}}_k$ to obtain the phase determination score $s^{\mathrm{ph}}_{t,k} = (w^{\mathrm{ph}}_k)^{\top} p_t$;

[0106] Extract the state feature vector $x_t$ from the phase-normalized state sequence and compare it with the pre-established normal state baseline vector $\bar{x}$ to calculate the deviation vector $\delta_t = x_t - \bar{x}$;

[0107] Take the inner product of the deviation vector with the expert state weight vector $w^{\mathrm{st}}_k$ to obtain the state anomaly determination score $s^{\mathrm{st}}_{t,k} = (w^{\mathrm{st}}_k)^{\top} \delta_t$. Extract the residual scalar $r_t$ from the dynamic consistency residual sequence, take its absolute value, and multiply it by the expert residual weight scalar $w^{\mathrm{res}}_k$ to obtain the dynamic consistency determination score $s^{\mathrm{dyn}}_{t,k} = w^{\mathrm{res}}_k \lvert r_t \rvert$;

[0108] The three scores are linearly combined with fixed weights to obtain the comprehensive routing score $R_{t,k} = \alpha\, s^{\mathrm{ph}}_{t,k} + \beta\, s^{\mathrm{st}}_{t,k} + \gamma\, s^{\mathrm{dyn}}_{t,k}$, where $\alpha$, $\beta$, and $\gamma$ are preset constants or learnable parameters. The comprehensive routing score serves as the basis for expert activation and ranking at time index $t$.

[0109] Based on the comprehensive routing score, an expert activation mask vector is constructed, and the expert set is grouped and filtered. At each time index, only experts who meet the phase consistency condition and the dynamic consistency condition are allowed to enter the candidate set.

[0110] Specifically, at each time index, an expert activation mask vector is first constructed on the basis of the comprehensive routing score; the elements of the mask vector correspond one-to-one with the experts in the expert set;

[0111] For the phase consistency condition, take the phase encoding vector corresponding to the current time index and calculate the corresponding phase number. A set of allowed phases is pre-established for each expert number and matched against the current phase number: when the current phase number belongs to the allowed phase set of the expert, the phase consistency flag is set to 1; otherwise it is set to 0;

[0112] For the dynamic consistency condition, take the dynamic consistency residual scalar corresponding to the current time index and compare it with the preset dynamic consistency threshold: when the threshold condition is satisfied, the dynamic consistency flag is set to 1; otherwise it is set to 0;

[0113] The mask element of an expert is set to 1 only when both its phase consistency flag and its dynamic consistency flag are 1, and to 0 otherwise, which forms the expert activation mask vector;

[0114] Only the experts whose mask element equals 1 are retained from the expert set, and their expert numbers constitute the candidate set; the remaining experts are masked at the current time index and do not participate in the calculation.

[0115] Within the candidate set, the top K experts are selected and activated according to their comprehensive routing scores, while the remaining experts remain frozen at the time index and do not participate in the forward calculation.

[0116] The physical consistency embedding sequence is input into the activated expert sub-layers respectively, and each expert sub-layer performs independent mapping calculation on the input features to generate expert output vectors;

[0117] Based on the comprehensive routing score at the corresponding time index, normalized weighted fusion is performed on the output vectors of each expert to generate an expert fusion representation sequence.
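
As a non-limiting illustration, the conditional routing, expert activation, and weighted fusion at a single time index may be sketched as follows. All names are hypothetical. Three points are assumptions not fixed by the embodiment: the dynamic consistency condition is taken to be "absolute residual not greater than the threshold `tau`", the normalized weights are obtained with a softmax over the routing scores, and the input is passed through unchanged when no expert is eligible. The phase encoding is assumed to be one-hot.

```python
import numpy as np

def route_and_fuse(x, phase_vec, state, baseline, residual, experts,
                   w_phase, w_state, w_res, allowed_phases, tau,
                   alpha=1.0, beta=1.0, gamma=1.0, top_k=2):
    """Route one time index through a masked top-k expert mixture."""
    s_phase = w_phase @ phase_vec                 # phase determination score, shape (K,)
    s_state = w_state @ (state - baseline)        # state anomaly determination score
    s_dyn = w_res * abs(residual)                 # dynamic consistency determination score
    score = alpha * s_phase + beta * s_state + gamma * s_dyn

    phase_id = int(np.argmax(phase_vec))          # assumes one-hot phase encoding
    phase_ok = np.array([phase_id in allowed for allowed in allowed_phases])
    dyn_ok = abs(residual) <= tau                 # ASSUMED direction of the comparison
    mask = phase_ok & dyn_ok                      # expert activation mask vector

    candidates = np.flatnonzero(mask)
    if candidates.size == 0:                      # assumed fallback: no expert eligible
        return x, candidates
    order = np.argsort(score[candidates])[::-1][:top_k]
    active = candidates[order]                    # top-K experts by routing score

    w = np.exp(score[active] - score[active].max())
    w /= w.sum()                                  # assumed softmax normalization
    fused = sum(wi * experts[k](x) for wi, k in zip(w, active))
    return fused, active
```

Each entry of `experts` is any callable that maps the physically consistent embedding vector to an output vector of the same shape.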

[0118] In this embodiment, the step of generating the final diagnosis result of subway door noise includes:

[0119] Extract the fusion embedding vector from the expert fusion representation sequence by time index, group the fusion embedding vector by phase according to the phase encoding sequence, and construct the current embedding trajectory by phase number;

[0120] Call the normal baseline embedding trajectory that matches the current phase encoding sequence, perform synchronous slicing on the normal baseline embedding trajectory according to the phase number, perform time alignment between the current embedding trajectory and the normal baseline embedding trajectory, and generate an aligned embedding trajectory pair;

[0121] For the aligned embedded trajectory pairs, the embedding bias is calculated at each time index. The abnormal potential energy sequence is generated by arranging them according to the time index. The abnormal potential energy sequence is then aggregated by sliding window to obtain the windowed abnormal potential energy sequence.

[0122] The steps for calculating the embedding bias are as follows: In the aligned embedding trajectory pairs, take the fused embedding vector of the current embedding trajectory at a certain time index one by one, and subtract it element by element from the baseline embedding vector of the normal baseline embedding trajectory at the same time index to obtain the vector difference; perform a summation of squares on the vector difference and take the square root operation to obtain the scalar bias at the time index, and arrange all the scalar biases in time order to form an anomalous potential energy sequence.

[0123] The steps for performing sliding window aggregation are as follows: set a fixed window length and sliding step size, move the window in time order on the abnormal potential energy sequence with the sliding step size, calculate the arithmetic mean and maximum value of several consecutive scalar deviations in each window, and then perform linear weighted summation of the arithmetic mean and maximum value according to preset weights to obtain the corresponding window abnormal potential energy value, and arrange all window abnormal potential energy values in time order to form a window abnormal potential energy sequence.

[0124] The abnormal potential energy sequence of the window is compared with the preset potential energy threshold. When there are consecutive windows exceeding the threshold in the abnormal potential energy sequence of the window, an abnormal trigger flag is generated. Based on the abnormal trigger flag, the fusion embedding vector corresponding to the time index range is extracted from the expert fusion representation sequence as the abnormal segment input.
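
As a non-limiting illustration, the embedding deviation, sliding window aggregation, and trigger decision of paragraphs [0122] to [0124] may be sketched as follows. The function names, the equal mean and maximum weights, and the required number of consecutive windows (`min_consecutive`) are assumptions for illustration.

```python
import numpy as np

def anomaly_potential(current, baseline):
    """Euclidean deviation per time index between aligned trajectories."""
    diff = np.asarray(current, dtype=float) - np.asarray(baseline, dtype=float)
    return np.sqrt((diff ** 2).sum(axis=1))

def windowed_potential(potential, window, step, w_mean=0.5, w_max=0.5):
    """Weighted sum of the mean and maximum inside each sliding window."""
    potential = np.asarray(potential, dtype=float)
    starts = range(0, len(potential) - window + 1, step)
    return np.array([w_mean * potential[s:s + window].mean()
                     + w_max * potential[s:s + window].max() for s in starts])

def anomaly_trigger(windowed, threshold, min_consecutive=2):
    """True when enough consecutive windows exceed the threshold."""
    run = 0
    for value in windowed:
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False
```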

[0125] When the anomaly trigger flag is true, the anomaly fragment is input into the anomaly branch to perform augmentation calculation and generate an anomaly augmentation representation sequence. When the anomaly trigger flag is false, the expert fusion representation sequence is used as the anomaly augmentation representation sequence.

[0126] The steps for performing enhanced computation to generate anomaly enhanced representation sequences are as follows: when the anomaly trigger flag is true, the corresponding fusion embedding vector is extracted from the expert fusion representation sequence according to the anomaly time index range to form an anomaly segment embedding sequence. The anomaly segment embedding sequence is scaled up according to the corresponding anomaly potential value to enhance the feature response at the high deviation time index. A fixed-width local temporal convolution operation is performed on the scaled-up anomaly segment embedding sequence to strengthen the temporal correlation between consecutive anomaly frames. Then, the convolution result is nonlinearly mapped and normalized to generate anomaly enhanced embedding vectors. The anomaly enhanced embedding vectors replace the corresponding fusion embedding vectors in the expert fusion representation sequence according to the original time index to form anomaly enhanced representation sequences.
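
As a non-limiting illustration, the enhancement calculation above may be sketched as follows. The function names are hypothetical, and several choices are assumptions: the scaling factor is taken as one plus the abnormal potential value, a single convolution kernel is shared across all embedding dimensions, the nonlinear mapping is a hyperbolic tangent, and the normalization is an L2 normalization per time index.

```python
import numpy as np

def enhance_segment(segment, potential, kernel):
    """Scale, temporally convolve, activate, and normalize an anomaly segment."""
    segment = np.asarray(segment, dtype=float)
    scaled = segment * (1.0 + np.asarray(potential, dtype=float))[:, None]
    conv = np.stack([np.convolve(scaled[:, k], kernel, mode="same")
                     for k in range(scaled.shape[1])], axis=1)
    activated = np.tanh(conv)
    norm = np.linalg.norm(activated, axis=1, keepdims=True)
    return activated / np.maximum(norm, 1e-8)

def replace_segment(sequence, start, enhanced):
    """Write enhanced vectors back at their original time indices."""
    out = np.array(sequence, dtype=float, copy=True)
    out[start:start + len(enhanced)] = enhanced
    return out
```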

[0127] The abnormal enhanced representation sequence is input into the diagnostic output layer, which outputs the results of the presence of abnormal noise, the type of abnormal noise, and the location of the component. The severity level is determined based on the peak amplitude of the abnormal potential energy sequence and the length of the continuous over-threshold window, and the final diagnosis result of the abnormal noise of the subway door is generated.

[0128] In this embodiment, the abnormal noise existence result is a binary determination result of whether an abnormal acoustic event occurs in the current door operation cycle. The result is obtained by performing global average pooling on the abnormal enhancement representation sequence in the time dimension to obtain a periodic feature vector. The periodic feature vector is input into the existence determination layer to calculate the existence score. When the existence score is greater than the preset existence threshold, it is determined that there is an abnormal noise; otherwise, it is determined that there is no abnormal noise.

[0129] The abnormal noise type result is the abnormal category label corresponding to the current door operation cycle. The result is calculated by inputting the abnormal enhanced representation sequence into the type classification layer to calculate the probability distribution of each predefined abnormal noise type, and the category with the highest probability value is selected as the abnormal noise type result.

[0130] The component location result is the component number that caused the anomaly in the current door operation cycle. The result is calculated by inputting the anomaly enhancement representation sequence into the component location layer to calculate the location score of each component, and the component number with the highest score is selected as the component location result.

[0131] The severity level is a classification of the current degree of abnormality. It is determined by extracting the maximum peak amplitude and the length of the continuous over-threshold window from the abnormal potential energy sequence, comparing the maximum peak amplitude and the length of the continuous over-threshold window with the preset amplitude interval threshold and length interval threshold, respectively, and determining the mild, moderate or severe level based on the corresponding interval combination.
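
As a non-limiting illustration, the severity grading above may be sketched as follows. The function names and the interval bounds `amp_bounds` and `len_bounds` are hypothetical values, and taking the higher of the two interval ranks is an assumed way of combining the amplitude and length intervals; the embodiment does not specify the combination table.

```python
def longest_run_above(values, threshold):
    """Length of the longest run of consecutive values above the threshold."""
    best = run = 0
    for value in values:
        run = run + 1 if value > threshold else 0
        best = max(best, run)
    return best

def severity_level(potential, threshold, amp_bounds=(1.0, 2.0), len_bounds=(3, 6)):
    """Grade the anomaly as mild, moderate, or severe."""
    peak = float(max(potential))
    run = longest_run_above(potential, threshold)
    amp_rank = sum(peak >= bound for bound in amp_bounds)   # 0, 1, or 2
    len_rank = sum(run >= bound for bound in len_bounds)    # 0, 1, or 2
    return ("mild", "moderate", "severe")[max(amp_rank, len_rank)]
```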

[0132] In this embodiment, the step of updating and freezing the normal baseline embedding trajectory according to the severity level is as follows: An update strategy is determined based on the severity level; baseline updating is performed when the severity level is mild, and freezing is performed when the severity level is moderate or severe. During baseline updating, the current embedding trajectory is aligned frame-by-frame with the normal baseline embedding trajectory according to phase number. At the same time index, the current embedding vector and the baseline embedding vector are weighted and averaged according to a preset update coefficient to generate an updated baseline embedding vector, which replaces the original baseline embedding vector. During freezing, the normal baseline embedding trajectory remains unchanged, and the deviation statistics between the current embedding trajectory and the normal baseline embedding trajectory are recorded for trend monitoring. The updated or frozen normal baseline embedding trajectory serves as the reference trajectory for calculating the abnormal potential energy sequence in the next running cycle.
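
As a non-limiting illustration, the baseline update and freezing strategy may be sketched as follows. The function name, the update coefficient `rho`, and the particular deviation statistics recorded during freezing are assumptions for illustration; both trajectories are assumed to be already aligned frame by frame.

```python
import numpy as np

def update_or_freeze_baseline(baseline, current, severity, rho=0.05):
    """Return (next baseline, deviation statistics or None)."""
    baseline = np.asarray(baseline, dtype=float)
    current = np.asarray(current, dtype=float)
    if severity == "mild":
        # Weighted average of the baseline and the current embedding vectors.
        return (1.0 - rho) * baseline + rho * current, None
    # Moderate or severe: keep the baseline and record deviation statistics.
    deviation = np.linalg.norm(current - baseline, axis=1)
    stats = {"mean": float(deviation.mean()), "max": float(deviation.max())}
    return baseline, stats
```

The returned trajectory serves as the reference for the abnormal potential energy calculation in the next operating cycle.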

[0133] According to an embodiment of the present invention, a smart diagnostic device for abnormal noises from subway doors based on voiceprint recognition includes the following modules:

[0134] The audio and status acquisition module is used to collect audio data of subway door opening and closing operation and door status data, and generate phase-labeled audio sequences and phase-labeled status sequences.

[0135] The feature extraction module is used to generate a voiceprint feature sequence based on the phase-annotated audio sequence and a state feature sequence based on the phase-annotated state sequence.

[0136] The phase normalization module is used to construct the phase encoding sequence and perform length normalization and dynamic time warping on the voiceprint feature sequence and the state feature sequence to generate a phase-normalized voiceprint sequence and a phase-normalized state sequence.

[0137] An improved Jamba computation module is used to perform phase state updates and phase coupling updates in segmented Mamba layers and cross-phase Transformer layers according to phase-encoded sequences;

[0138] The physical consistency calculation module is used to calculate the dynamic consistency residual sequence and generate the physical consistency embedded sequence.

[0139] The coupled gated expert hybrid module is used to perform conditional routing, expert activation, and weighted fusion to generate expert fusion representation sequences;

[0140] The anomaly detection module is used to calculate the abnormal potential energy sequence, trigger abnormal branches, and output the results of the existence of abnormal noise, the type of abnormal noise, the component location results, and the severity level.

[0141] The baseline management module is used to update or freeze the normal baseline embedding trajectory based on the severity level.

[0142] Example 1:

[0143] To verify the feasibility of this invention in practice, it was applied to a subway car door operation status monitoring scenario in an urban rail transit depot. In this scenario, trains operate at high frequency daily, with each side door opening and closing more than a thousand times per day. After long-term operation, problems such as dry friction of the guide rails, uneven wear of the rollers, fluctuations in motor load, and jamming of the locking mechanism have emerged. Traditional methods of manual auscultation and current threshold determination are insufficient to distinguish the source of abnormal noise, especially in situations with high background noise levels inside the car, leading to significant false alarms and missed alarms. In actual operation, minor abnormal noises go undetected, and severe jamming is not warned in advance, resulting in delayed maintenance response and increased risk of downtime.

[0144] In this scenario, an audio acquisition device for door operation and motor current and displacement sensors are deployed to continuously collect operational data during the complete opening, closing, and locking phases. The acquired audio and state data first undergo time alignment and cycle positioning to generate phase-labeled audio and state sequences. Voiceprint features are extracted from the phase-labeled audio sequence and aggregated by phase to generate a voiceprint feature sequence. Denoising, normalization, and window statistics are performed on the phase-labeled state sequence to generate a state feature sequence. Subsequently, a phase encoding sequence is constructed, and length normalization and dynamic time warping are performed on the voiceprint and state feature sequences to obtain a phase-normalized voiceprint sequence and a phase-normalized state sequence.

[0145] The phase-normalized voiceprint sequence, phase-normalized state sequence, and phase encoding sequence are input into the improved Jamba network. The segmented Mamba layers perform independent state updates under different phase numbers to obtain the stage embedding sequence, and the cross-stage Transformer layer performs coupled modeling of the stage boundaries. The dynamic consistency residual sequence is calculated from the phase-normalized voiceprint sequence and the phase-normalized state sequence and fused with the stage coupling results to generate a physically consistent embedding sequence. The physically consistent embedding sequence, phase encoding sequence, phase-normalized state sequence, and dynamic consistency residual sequence are input into the coupled gated expert hybrid layer to perform conditional routing and expert activation, resulting in an expert fusion representation sequence. An abnormal potential energy sequence is calculated based on the expert fusion representation sequence after alignment with the normal baseline embedding trajectory. When the abnormal potential energy continuously exceeds the threshold, the anomaly branch enhancement calculation is triggered, and the abnormal noise existence result, abnormal noise type, component location result, and severity level are output. Simultaneously, the normal baseline embedding trajectory is updated or frozen according to the severity level.

[0146] Operational data were collected for 30 consecutive days, yielding 36,000 sets of complete door operation cycle data, of which 3,200 sets were manually confirmed to contain abnormal noises of varying degrees. The method of this invention was compared with the traditional current threshold method and a conventional convolutional neural network acoustic classification method. The method of this invention achieved an accuracy of 97.6% in identifying the presence of abnormal noises, an improvement of about 18 percentage points over the current threshold method and about 9 percentage points over the conventional acoustic model. For abnormal noise type classification, the average classification accuracy over the six predefined fault categories reached 95.8%, and the accuracy in distinguishing roller wear from guide rail friction, two faults with similar acoustic features, reached 93%. Comparing the component location results with the disassembly and inspection results, the location accuracy reached 94.3%. The consistency rate between the severity level determination and the empirical grading by maintenance personnel reached 92%.

[0147] With a 10% increase in noise interference intensity, the recognition accuracy of the method of this invention decreases by less than 2 percentage points, while that of the conventional acoustic model decreases by more than 7 percentage points. The abnormal potential energy sequence is positively correlated with actual maintenance costs; when the peak abnormal potential energy exceeds the preset high range, the corresponding probability of maintenance and replacement reaches 85%. Through the baseline update and freezing strategy, the normal baseline embedding trajectory slowly adapts to the aging trend of the equipment under minor anomalies. After 15 days of continuous operation, the baseline drift is controlled within 3%, while the baseline drift of the comparative method without the freezing mechanism exceeds 10%.

[0148] The above results demonstrate that this invention achieves stage-based structured modeling, collaborative analysis of acoustic and state information, and expert-level conditional routing diagnosis in complex operating environments. It effectively solves the problems of stage aliasing, high false alarm rate, and unquantifiable severity in existing technologies, exhibiting high stability, high recognition accuracy, and strong interpretability. Specific details are shown in Table 1 below.

[0149] Table 1: Comparison and Statistical Results of Diagnostic Performance for Abnormal Noises from Subway Doors

[0150]

| Indicator | Current threshold method | Conventional acoustic model | Method of the present invention |
|---|---|---|---|
| Total number of samples (groups) | 36000 | 36000 | 36000 |
| Number of abnormal noise samples (groups) | 3200 | 3200 | 3200 |
| Abnormal noise presence detection accuracy (%) | 79.4 | 88.5 | 97.6 |
| Abnormal noise presence false negative rate (%) | 14.2 | 7.8 | 2.1 |
| Abnormal noise presence false alarm rate (%) | 6.4 | 3.7 | 1.3 |
| Average classification accuracy of abnormal noise types (%) | 71.8 | 86.4 | 95.8 |
| Roller wear identification accuracy (%) | 74.2 | 88.1 | 94.7 |
| Guide rail friction identification accuracy (%) | 69.5 | 84.3 | 93.0 |
| Locking mechanism jamming identification accuracy (%) | 75.8 | 87.6 | 96.1 |
| Component positioning accuracy (%) | 68.9 | 82.5 | 94.3 |
| Severity level consistency rate (%) | 65.4 | 83.7 | 92.0 |
| Decrease in accuracy after noise enhancement (%) | 9.3 | 7.1 | 1.8 |
| Baseline drift magnitude over 15 days (%) | 10.6 | 8.4 | 2.9 |
| Average processing time per cycle (milliseconds) | 42 | 58 | 64 |
| Single-cycle computing resource utilization rate (%) | 23 | 37 | 41 |

[0151] In terms of sample size, all three methods were validated based on 36,000 sets of complete operating cycle data, including 3,200 manually confirmed abnormal noise samples, ensuring a consistent sample base and providing a basis for horizontal comparison. Regarding the accuracy of abnormal noise presence identification, the current threshold method achieved 79.4%, the conventional acoustic model achieved 88.5%, and the method of this invention reached 97.6%, with the false negative rate reduced to 2.1% and the false positive rate reduced to 1.3%. This demonstrates that the structured method based on phase regularization and physical consistency modeling improves the stability and reliability of anomaly identification.

[0152] In terms of abnormal noise classification capability, the average classification accuracy of the method of this invention reaches 95.8%, which is 24 percentage points higher than the current threshold method and about 9 percentage points higher than the conventional acoustic model. For the three typical fault types (roller wear, guide rail friction, and locking mechanism jamming), the identification accuracy is 93% or higher in every case, demonstrating the advantages of the segmented Mamba layer and the coupled gated expert hybrid mechanism in distinguishing similar acoustic features. The component positioning accuracy reaches 94.3%, significantly better than the comparison methods, indicating that the expert-level conditional routing mechanism effectively achieves refined component-level diagnosis.

[0153] In severity level assessment, the consistency rate between the method of this invention and manual judgment reached 92.0%, far exceeding the 65.4% of the current threshold method, indicating that the joint judgment mechanism of the abnormal potential energy sequence and the continuous over-threshold window reflects the intensity and duration of the anomaly. Under enhanced noise interference conditions, the accuracy of the method of this invention decreased by only 1.8, while that of both comparison methods decreased by more than 7 (9.3 for the current threshold method and 7.1 for the conventional acoustic model), demonstrating the stabilizing effect of the physical consistency residual constraints on noise resistance.

[0154] In terms of long-term operational stability, the baseline drift was controlled at 2.9%, significantly lower than the comparative methods, indicating that the baseline update freezing strategy effectively suppressed model drift. Regarding computational performance, the single-cycle processing time was 64 milliseconds, with a resource utilization rate of 41%, meeting the requirements for real-time diagnosis. Overall results show that this invention achieves significant technical improvements in recognition accuracy, localization capability, severity assessment, and long-term operational stability.

[0155] The above description is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitutions or modifications made by those skilled in the art within the scope of the technology disclosed in the present invention, based on the technical solution and inventive concept of the present invention, should be covered within the scope of protection of the present invention.

Claims

1. A method for intelligent diagnosis of abnormal noises from subway car doors based on voiceprint recognition, characterized in that it includes the following steps: collecting audio data of the subway door opening and closing operation and door status data, and preprocessing them to generate a phase-labeled audio sequence and a phase-labeled state sequence; extracting voiceprint features based on the phase-labeled audio sequence and aggregating them by phase to generate a voiceprint feature sequence; performing denoising, normalization, and window statistics on the phase-labeled state sequence to generate a state feature sequence; mapping each frame to a phase number based on the phase segmentation results to construct a phase coding sequence; performing length normalization and dynamic time warping on the voiceprint feature sequence and the state feature sequence according to the phase coding sequence to generate a phase-normalized voiceprint sequence and a phase-normalized state sequence; inputting the phase-normalized voiceprint sequence, the phase-normalized state sequence, and the phase coding sequence into an improved Jamba network, and performing phase state updates in the segmented Mamba layers according to the phase coding sequence to obtain a stage embedding sequence; inputting the stage embedding sequence into a cross-stage Transformer layer to perform stage coupling updates; calculating a dynamic consistency residual sequence based on the phase-normalized voiceprint sequence and the phase-normalized state sequence, and fusing the dynamic consistency residual sequence with the stage coupling results to obtain a physically consistent embedding sequence; inputting the physically consistent embedding sequence, the phase coding sequence, the phase-normalized state sequence, and the dynamic consistency residual sequence into the coupled gated expert hybrid layer to perform conditional routing, expert activation, and weighted fusion to obtain an expert fusion representation sequence; calculating an abnormal potential energy sequence based on the expert fusion representation sequence and aligning it with the normal baseline embedding trajectory; when the abnormal potential energy sequence exceeds the threshold, triggering an abnormal branch to calculate the abnormal enhanced representation sequence; outputting the abnormal noise existence result, the abnormal noise type, the component location, and the severity level to generate the final subway door abnormal noise diagnosis result; and updating or freezing the normal baseline embedding trajectory according to the severity level.

2. The intelligent diagnostic method for subway door noise based on voiceprint recognition according to claim 1, characterized in that, The preprocessing steps for generating phase-labeled audio sequences and phase-labeled state sequences include: performing timestamp correction and unified sampling frequency resampling on the acquired audio data and door state data to construct a synchronous original audio sequence and a synchronous original state sequence; performing first-order difference operations and sliding window smoothing on the synchronous original state sequence to calculate the state change rate sequence; determining the start time of the door opening stage, the uniform speed stage interval, and the lock-up stage based on the zero-crossing points, peak points, and stable intervals in the state change rate sequence to generate a phase boundary time index set; performing time slicing on the synchronous original audio sequence according to the phase boundary time index set to obtain door opening audio segments, uniform speed audio segments, and lock-up audio segments, assigning a corresponding phase number to each audio segment, and constructing a phase-labeled audio sequence; and performing time slicing on the synchronous original state sequence according to the phase boundary time index set to obtain door opening state segments, uniform speed state segments, and lock-up state segments, assigning a corresponding phase number to each state segment, and constructing a phase-labeled state sequence.
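The state change rate and phase boundary steps of claim 2 can be illustrated with a minimal sketch. It assumes the door state is a scalar position signal sampled uniformly and that one opening stroke has a single positive peak rate. The function names, the `eps` and `tol` tolerances, and the boundary rule itself are hypothetical choices for illustration; the claim does not fix them.

```python
from statistics import fmean


def change_rate(state, window=3):
    """First-order difference, then moving-average smoothing over `window` samples."""
    diff = [b - a for a, b in zip(state, state[1:])]
    half = window // 2
    return [
        fmean(diff[max(0, i - half): i + half + 1])
        for i in range(len(diff))
    ]


def phase_boundaries(rate, eps=0.05, tol=0.1):
    """Locate phase boundaries in a state change rate sequence.

    Hypothetical rule (not taken from the patent text):
    - opening starts at the first sample whose rate exceeds `eps`;
    - the uniform-speed interval spans the samples within `tol` of the peak;
    - lock-up starts right after the uniform-speed interval.
    """
    peak = max(rate)
    open_start = next(i for i, r in enumerate(rate) if r > eps)
    stable = [i for i, r in enumerate(rate) if r >= (1 - tol) * peak]
    return {
        "open_start": open_start,
        "uniform": (stable[0], stable[-1]),
        "lock_start": stable[-1] + 1,
    }


def label_phases(n, bounds):
    """Phase number per sample: 0 = opening, 1 = uniform speed, 2 = lock-up."""
    u0, u1 = bounds["uniform"]
    return [0 if i < u0 else 1 if i <= u1 else 2 for i in range(n)]


position = [0, 0, 1, 3, 5, 7, 9, 10, 10.5, 10.5]
rate = change_rate(position, window=1)
bounds = phase_boundaries(rate)
labels = label_phases(len(rate), bounds)
```

The same boundary indices would then be used to slice both the audio sequence and the state sequence, so that each segment carries the same phase number.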

3. The subway door abnormal sound intelligent diagnosis method based on voiceprint recognition according to claim 2, characterized in that the steps for generating the phase-normalized voiceprint sequence and the phase-normalized state sequence include: based on the phase boundary time index set, synchronously slicing the voiceprint feature sequence and the state feature sequence to obtain a voiceprint segment sequence and a state segment sequence grouped by phase number; for each phase number, counting the number of frames of the corresponding voiceprint segment and determining the target length, the target length being the median of the number of frames of that phase number within a preset history window; performing interpolation resampling and truncation on the voiceprint segment so that its length equals the target length, thereby generating a length-normalized voiceprint segment; for the same phase number, counting the number of sampling points of the corresponding state segment and determining the target length, and performing interpolation resampling and truncation on the state segment so that its length equals the target length, thereby generating a length-normalized state segment; within the same phase number, using the length-normalized voiceprint segment as the reference sequence and the length-normalized state segment as the alignment sequence, constructing a frame-level distance matrix and performing dynamic time warping to obtain an alignment path index sequence; and, based on the alignment path index sequence, synchronously rearranging and resampling the length-normalized voiceprint segments and length-normalized state segments to generate the phase-normalized voiceprint sequence and the phase-normalized state sequence.
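The length normalization and alignment of claim 3 can be sketched as follows. The sketch assumes scalar-valued segments and an absolute-difference frame distance; real voiceprint features would be vectors with a suitable distance. The helper names `target_length`, `resample`, and `dtw_path` are illustrative, and the alignment is the textbook dynamic time warping recursion rather than a confirmed detail of the invention.

```python
from statistics import median


def target_length(history_frame_counts):
    """Target length = median frame count of one phase over a history window."""
    return int(median(history_frame_counts))


def resample(seg, target_len):
    """Linear-interpolation resampling of a 1-D segment to `target_len` points."""
    if target_len == 1 or len(seg) == 1:
        return [seg[0]] * target_len
    out = []
    for k in range(target_len):
        pos = k * (len(seg) - 1) / (target_len - 1)
        i = int(pos)
        frac = pos - i
        nxt = seg[min(i + 1, len(seg) - 1)]
        out.append(seg[i] * (1 - frac) + nxt * frac)
    return out


def dtw_path(ref, query):
    """Dynamic time warping; returns the alignment path as (ref_idx, query_idx) pairs."""
    n, m = len(ref), len(query)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(ref[i - 1] - query[j - 1])  # frame-level distance
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end of both sequences to recover the path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    return path[::-1]


length = target_length([48, 50, 53])
voice = resample([0.0, 2.0, 4.0], 5)
state = resample([0.0, 1.0, 4.0], 5)
path = dtw_path(voice, state)
```

Both segments are first brought to the same target length, and the resulting path indices are then used to rearrange the two segments together.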

4. The subway door abnormal sound intelligent diagnosis method based on voiceprint recognition according to claim 3, characterized in that, The improved Jamba network consists of a segmented Mamba layer, a cross-stage Transformer layer, a coupled gated expert hybrid layer, and an anomaly branch layer stacked in hierarchical order. The segmented Mamba layer segments the phase-normalized voiceprint sequence and the phase-normalized state sequence according to the phase coding sequence. It performs selective state updates and generates a stage embedding sequence within the sub-layer corresponding to each phase number. The cross-stage Transformer layer performs self-attention calculation on the boundary frames corresponding to different phase numbers in the stage embedding sequence to complete the stage coupling update. The coupled gated expert hybrid layer calculates the routing weights based on the phase coding sequence, the phase-normalized state sequence, and the dynamic consistency residual sequence, and performs weighted fusion of the outputs of the expert sub-layers to generate an expert fusion representation sequence. When the anomaly potential sequence exceeds a preset threshold, the anomaly branch layer performs enhancement calculation on the expert fusion representation sequence and outputs an anomaly enhancement representation sequence.

5. The subway door abnormal sound intelligent diagnosis method based on voiceprint recognition according to claim 4, characterized in that, The step of allocating the phase-encoded sequence to the segmented Mamba layer for execution phase state update includes: Based on the phase coding sequence, the phase-normalized voiceprint sequence and the phase-normalized state sequence are indexed and grouped, and the feature frames corresponding to the same phase number are divided into the same stage subsequence. A separate set of state space parameters is preset for each phase number. The set of state space parameters includes the state transition matrix, the input mapping matrix, and the output mapping matrix. Each stage subsequence is input into the state space parameter set corresponding to the phase number in chronological order, and a selective state update operation is performed. The selective state update operation includes calculating the update threshold based on the current input features and the hidden state of the previous time step, performing a gated linear transformation on the hidden state and generating the hidden state of the current time step. After the state update of each stage subsequence is completed, the hidden states at each time step are extracted and recombined in the original time order to generate the stage embedding sequence.
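The per-phase selective state update of claim 5 can be illustrated with a scalar toy model. This is a sketch under strong assumptions: the state transition, input mapping, and output mapping matrices are replaced by scalars `a`, `b`, and `c`, the gate is a sigmoid of a weighted sum of the current input and the previous hidden state, and each phase keeps its own hidden state starting from zero. The class `PhaseParams`, the gate weights, and the update formula are hypothetical and are not the actual Mamba recurrence.

```python
import math
from dataclasses import dataclass


@dataclass
class PhaseParams:
    a: float          # state transition (scalar stand-in for the matrix)
    b: float          # input mapping
    c: float          # output mapping
    w_x: float = 1.0  # gate weight on the current input (assumed)
    w_h: float = 1.0  # gate weight on the previous hidden state (assumed)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def segmented_update(features, phase_codes, params):
    """Run a gated linear state update with a separate parameter set per phase.

    Frames are visited in original time order, so the outputs are already
    recombined in that order; each phase only touches its own hidden state.
    """
    hidden = {p: 0.0 for p in params}
    outputs = []
    for x, p in zip(features, phase_codes):
        prm = params[p]
        h_prev = hidden[p]
        gate = sigmoid(prm.w_x * x + prm.w_h * h_prev)
        h = (1.0 - gate) * prm.a * h_prev + gate * prm.b * x
        hidden[p] = h
        outputs.append(prm.c * h)
    return outputs


params = {
    0: PhaseParams(a=0.9, b=1.0, c=2.0),
    1: PhaseParams(a=0.5, b=0.8, c=1.0),
    2: PhaseParams(a=0.7, b=1.2, c=1.5),
}
stage_embedding = segmented_update(
    [0.2, 0.4, 0.9, 1.0, 0.3], [0, 0, 1, 1, 2], params
)
```

Keeping one parameter set per phase number means that a frame from the lock-up stage never updates the state learned for the opening stage.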

6. The subway door abnormal sound intelligent diagnosis method based on voiceprint recognition according to claim 5, characterized in that the step of generating the physically consistent embedding sequence includes: extracting, from the stage embedding sequence, the embedding vectors corresponding to the last frame of the opening stage, the last frame of the closing stage, and the start frame of the locking stage, constructing the stage boundary index set, and combining the embedding vectors corresponding to the stage boundary index set into the stage boundary embedding set; taking the stage boundary embedding set as input, performing a linear mapping on it within the cross-stage Transformer layer to obtain the query vector set, the key vector set, and the value vector set; calculating the similarity matrix between the query vector set and the key vector set, and normalizing the similarity matrix by row to obtain the attention weight matrix; using the attention weight matrix to perform a weighted summation over the value vector set, followed by a linear mapping, to obtain the stage coupling representation sequence; calculating the difference in voiceprint energy between adjacent frames by time index to obtain the voiceprint energy change rate sequence; calculating the difference in state between adjacent sampling points by time index to obtain the state change rate sequence; multiplying the voiceprint energy change rate sequence and the state change rate sequence pointwise at the same time index to obtain the dynamic consistency residual sequence; standardizing the dynamic consistency residual sequence to obtain the standardized dynamic consistency residual sequence; performing a linear mapping on the standardized dynamic consistency residual sequence to obtain a residual embedding sequence; aligning the residual embedding sequence with the stage coupling representation sequence according to the time index; and, for each time index, summing the residual embedding vector and the stage coupling representation vector and applying a linear mapping to generate the physically consistent embedding sequence.
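The dynamic consistency residual described in claim 6 reduces to a short calculation. The sketch below assumes scalar voiceprint energy and scalar door state on a shared time index, and it uses z-score standardization with the population standard deviation; the claim only says "standardized", so that choice and the function names are assumptions.

```python
from statistics import fmean, pstdev


def diff(seq):
    """Difference between adjacent samples."""
    return [b - a for a, b in zip(seq, seq[1:])]


def consistency_residual(energy, state):
    """Pointwise product of the energy change rate and the state change rate.

    Returns (raw, standardized). A constant raw sequence standardizes to zeros.
    """
    raw = [de * ds for de, ds in zip(diff(energy), diff(state))]
    mu, sigma = fmean(raw), pstdev(raw)
    if sigma == 0:
        return raw, [0.0] * len(raw)
    return raw, [(r - mu) / sigma for r in raw]


energy = [1.0, 2.0, 4.0, 4.0]
state = [0.0, 1.0, 1.0, 3.0]
raw, standardized = consistency_residual(energy, state)
```

A large product at a time index means the sound energy and the door motion changed together at that instant, which is the coupling that the residual embedding then carries into the fused representation.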

7. The subway door abnormal sound intelligent diagnosis method based on voiceprint recognition according to claim 6, characterized in that, The steps of performing conditional routing, expert activation, and weighted fusion in the coupled gated expert hybrid layer include: The physical consistency embedding sequence, phase encoding sequence, phase regularization state sequence and dynamic consistency residual sequence are concatenated according to time index to form a routing decision feature vector; At each time index, the phase determination score, the state anomaly determination score, and the dynamic consistency determination score are calculated respectively. The phase determination score, the state anomaly determination score, and the dynamic consistency determination score are linearly combined to obtain the comprehensive routing score. Based on the comprehensive routing score, an expert activation mask vector is constructed, and the expert set is grouped and filtered. At each time index, only experts who meet the phase consistency condition and the dynamic consistency condition are allowed to enter the candidate set. Within the candidate set, the top K experts are selected and activated according to their comprehensive routing scores, while the remaining experts remain frozen at the time index and do not participate in the forward calculation. The physical consistency embedding sequence is input into the activated expert sub-layers respectively, and each expert sub-layer performs independent mapping calculation on the input features to generate expert output vectors; Based on the comprehensive routing score at the corresponding time index, normalized weighted fusion is performed on the output vectors of each expert to generate an expert fusion representation sequence.
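The conditional routing of claim 7 can be sketched for a single time index. The sketch assumes scalar inputs, experts given as plain callables, equal default coefficients for the linear score combination, and a softmax over the selected experts as the "normalized weighted fusion"; none of these specifics come from the patent text, and all function names are illustrative.

```python
import math


def composite_score(phase_s, state_s, dyn_s, coeffs=(1.0, 1.0, 1.0)):
    """Linear combination of the three determination scores."""
    return coeffs[0] * phase_s + coeffs[1] * state_s + coeffs[2] * dyn_s


def route(scores, allowed, k):
    """Pick the top-k experts among those that pass the activation mask.

    `allowed[i]` is True when expert i meets the phase and dynamic
    consistency conditions. Returns {expert_index: normalized_weight}.
    """
    candidates = [i for i, ok in enumerate(allowed) if ok]
    top = sorted(candidates, key=lambda i: scores[i], reverse=True)[:k]
    z = sum(math.exp(scores[i]) for i in top)
    return {i: math.exp(scores[i]) / z for i in top}


def fuse(x, experts, weights):
    """Weighted sum of the outputs of the activated experts only."""
    return sum(w * experts[i](x) for i, w in weights.items())


experts = [lambda x: x, lambda x: 2 * x, lambda x: 100 * x, lambda x: -x]
weights = route([2.0, 1.0, 3.0, 0.5], [True, True, False, True], k=2)
fused = fuse(1.0, experts, weights)
```

Expert 2 has the highest score in this example but is masked out, so it never runs; experts outside the top K are likewise skipped in the forward calculation.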

8. The subway door abnormal sound intelligent diagnosis method based on voiceprint recognition according to claim 7, characterized in that, The steps for generating the final diagnosis result of subway door noise abnormality include: Extract the fusion embedding vector from the expert fusion representation sequence by time index, group the fusion embedding vector by phase according to the phase encoding sequence, and construct the current embedding trajectory by phase number; Call the normal baseline embedding trajectory that matches the current phase coding sequence, perform synchronous slicing on the normal baseline embedding trajectory according to the phase number, perform time alignment between the current embedding trajectory and the normal baseline embedding trajectory, and generate an aligned embedding trajectory pair; For the aligned embedded trajectory pairs, the embedding bias is calculated at each time index. The abnormal potential energy sequence is generated by arranging them according to the time index. The abnormal potential energy sequence is then aggregated by sliding window to obtain the windowed abnormal potential energy sequence. The window abnormal potential energy sequence is compared with the preset potential energy threshold. When there are consecutive windows exceeding the threshold in the window abnormal potential energy sequence, an abnormal trigger flag is generated. Based on the abnormal trigger flag, the fusion embedding vector corresponding to the time index range is extracted from the expert fusion representation sequence as the abnormal segment input. When the anomaly trigger flag is true, the anomaly fragment is input into the anomaly branch to perform enhanced computation and generate an anomaly enhanced representation sequence. When the anomaly trigger flag is false, the expert fusion representation sequence is used as the anomaly enhanced representation sequence. 
The abnormal enhanced representation sequence is input into the diagnostic output layer, which outputs the results of the presence of abnormal noise, the type of abnormal noise, and the location of the component. The severity level is determined based on the peak amplitude of the abnormal potential energy sequence and the length of the continuous over-threshold window, and the final diagnosis result of the abnormal noise of the subway door is generated.
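The trigger logic of claim 8 can be sketched as below. It assumes the two embedding trajectories are already time-aligned, takes the embedding deviation to be the Euclidean distance, aggregates windows by their mean, and requires at least `min_run` consecutive windows above the threshold. The metric, the aggregation, the default values, and the function names are all assumptions made for illustration.

```python
import math


def potential_energy(current, baseline):
    """Embedding deviation per time index (Euclidean distance, assumed metric)."""
    return [math.dist(c, b) for c, b in zip(current, baseline)]


def windowed(seq, window):
    """Sliding-window mean with stride 1."""
    return [sum(seq[i:i + window]) / window for i in range(len(seq) - window + 1)]


def longest_run_over(seq, threshold):
    """Length of the longest run of consecutive values above the threshold."""
    best = run = 0
    for v in seq:
        run = run + 1 if v > threshold else 0
        best = max(best, run)
    return best


def check_anomaly(current, baseline, window=2, threshold=3.0, min_run=2):
    energy = potential_energy(current, baseline)
    run = longest_run_over(windowed(energy, window), threshold)
    return {"triggered": run >= min_run, "peak": max(energy), "run": run}


baseline = [[0.0, 0.0]] * 5
current = [[0.0, 0.0], [3.0, 4.0], [3.0, 4.0], [3.0, 4.0], [0.0, 0.0]]
result = check_anomaly(current, baseline)
```

The returned peak and run length are the two quantities from which the claim derives the severity level; the mapping from these values to a level is left open here.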

9. The intelligent diagnostic method for subway door noise based on voiceprint recognition according to claim 8, characterized in that, The steps for updating and freezing the normal baseline embedding trajectory according to severity level are as follows: determine the update strategy according to the severity level, perform baseline update when the severity level is mild, and perform freezing when the severity level is moderate or severe; when performing baseline update, align the current embedding trajectory with the normal baseline embedding trajectory frame by frame according to the phase number, and at the same time index, perform a weighted average of the current embedding vector and the baseline embedding vector according to the preset update coefficient to generate the updated baseline embedding vector, and replace the original baseline embedding vector; During the freeze process, the normal baseline embedding trajectory remains unchanged, and the deviation statistics between the current embedding trajectory and the normal baseline embedding trajectory are recorded for trend monitoring. The updated or frozen normal baseline embedding trajectory serves as the reference trajectory for calculating the abnormal potential energy sequence in the next running cycle.
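The update-or-freeze rule of claim 9 can be sketched as a small function. It assumes the trajectories are already aligned frame by frame, that severity arrives as one of the strings "mild", "moderate", or "severe", and that the deviation statistic recorded during a freeze is the mean Euclidean distance. The update coefficient `alpha`, the string labels, and the log structure are hypothetical.

```python
import math
from statistics import fmean


def update_baseline(baseline, current, severity, alpha=0.1, deviation_log=None):
    """Return the baseline trajectory to use in the next running cycle.

    mild            -> weighted average of baseline and current embeddings
    moderate/severe -> baseline kept unchanged, deviation logged for trends
    """
    if severity == "mild":
        return [
            [(1.0 - alpha) * b + alpha * c for b, c in zip(bv, cv)]
            for bv, cv in zip(baseline, current)
        ]
    if severity in ("moderate", "severe"):
        if deviation_log is not None:
            deviation_log.append(
                fmean(math.dist(bv, cv) for bv, cv in zip(baseline, current))
            )
        return baseline
    raise ValueError(f"unknown severity level: {severity!r}")


log = []
updated = update_baseline([[0.0, 0.0]], [[1.0, 2.0]], "mild", alpha=0.5)
frozen = update_baseline([[0.0, 0.0]], [[3.0, 4.0]], "severe", deviation_log=log)
```

Freezing on moderate or severe results keeps a developing fault from being absorbed into the normal reference, which is the drift-suppression effect the description attributes to this strategy.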

10. A smart diagnostic device for abnormal noises from subway doors based on voiceprint recognition, which executes the smart diagnostic method for abnormal noises from subway doors based on voiceprint recognition as described in any one of claims 1 to 9, characterized in that it includes the following modules: an audio and status acquisition module, used to collect audio data of the subway door opening and closing operation and door status data, and to generate phase-labeled audio sequences and phase-labeled state sequences; a feature extraction module, used to generate a voiceprint feature sequence based on the phase-labeled audio sequence and a state feature sequence based on the phase-labeled state sequence; a phase normalization module, used to construct the phase coding sequence and to perform length normalization and dynamic time warping on the voiceprint feature sequence and the state feature sequence to generate a phase-normalized voiceprint sequence and a phase-normalized state sequence; an improved Jamba computation module, used to perform phase state updates and stage coupling updates in the segmented Mamba layers and the cross-stage Transformer layer according to the phase coding sequence; a physical consistency calculation module, used to calculate the dynamic consistency residual sequence and generate the physically consistent embedding sequence; a coupled gated expert hybrid module, used to perform conditional routing, expert activation, and weighted fusion to generate the expert fusion representation sequence; an anomaly detection module, used to calculate the abnormal potential energy sequence, trigger the abnormal branch, and output the abnormal noise existence result, the abnormal noise type, the component location, and the severity level; and a baseline management module, used to update or freeze the normal baseline embedding trajectory according to the severity level.