Methods and apparatus for cleaning multi-source data from gas turbine testing

By employing a data cleaning method based on multi-scale decomposition and feature extraction, combined with variational autoencoders, generative adversarial networks, and long short-term memory networks, the problems of low data processing efficiency, incomplete noise removal, and inaccurate timestamp alignment in gas turbine experiments were solved, achieving high-quality data cleaning and analysis support.

CN121935489B · Active · Publication Date: 2026-06-30 · CHINA UNITED GAS TURBINE TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: CHINA UNITED GAS TURBINE TECH CO LTD
Filing Date: 2026-03-31
Publication Date: 2026-06-30

Application Information

IPC: G06F18/10; G06F18/15; G06F18/214; G06F18/2433; G06F18/213; G06F18/22; G06N3/0464; G06N3/0455; G06N3/048; G06N3/0442; G06N3/0475; G06N3/045; G06N3/094; G06N5/01; G06N20/20

AI Technical Summary

Technical Problem

Existing technologies in gas turbine testing suffer from low data processing efficiency, poor noise and hidden outlier removal, inaccurate data completion, and insufficient timestamp alignment accuracy, resulting in unstable data quality and failing to meet the requirements for real-time and high-quality analysis of multi-source data.

Method used

By employing techniques such as multi-scale decomposition, feature extraction, a completion model that integrates variational autoencoders and generative adversarial networks, and a long short-term memory network model, multi-source heterogeneous data is standardized, noise is removed, missing values are filled in, and timestamps are aligned, forming a closed-loop quality control mechanism.

Benefits of technology

High-quality and highly available data cleaning was achieved, ensuring the integrity and accuracy of gas turbine test data and providing reliable data support for performance verification and intelligent analysis.

✦ Generated by Eureka AI based on patent content.

Abstract

This disclosure relates to a method and apparatus for cleaning multi-source data from gas turbine testing. The method includes: acquiring raw test data collected from multiple heterogeneous subsystems during gas turbine testing; standardizing the raw test data; performing multi-scale decomposition on the standardized data to obtain low-frequency approximation coefficients and high-frequency detail coefficients; extracting features from the low-frequency approximation coefficients and high-frequency detail coefficients to obtain multi-dimensional feature vectors; identifying and removing noisy data from the standardized data based on the multi-dimensional feature vectors; using a completion model fused with a variational autoencoder and a generative adversarial network to complete the standardized data with missing values; and performing initial temporal matching of the standardized data through a sliding window, correcting and aligning timestamp deviations in the standardized data frames within the sliding window. This scheme transforms multi-source heterogeneous raw test data from gas turbines into high-quality, highly usable standardized data.

Description

Technical Field

[0001] This disclosure relates to the field of gas turbine technology, and in particular to a method and apparatus for cleaning multi-source data in gas turbine testing.

Background Technology

[0002] In related technologies, heavy-duty gas turbines, as core equipment for energy conversion, involve extreme coupling of multiple disciplines such as aerodynamics, combustion, heat transfer, and control during whole-machine testing. This requires multi-dimensional data acquisition from over 3,000 measuring points across 22 core parameters, including temperature, pressure, vibration, clearance, stress, and strain, under complex operating conditions of high temperature, high pressure, and high speed. This data serves as the core basis for forward design verification and performance optimization of gas turbines. Currently, this field generally adopts an architecture of "multi-system independent storage + local visualization," mainly composed of heterogeneous subsystems such as testing systems, unit control systems, vibration monitoring systems, and combustion dynamic diagnostic systems. Each system is equipped with a dedicated server and storage unit, supporting only data queries within its own system. Cross-system data integration requires manual file export and timestamp alignment. Furthermore, for test data of large rotating components, traditional wired transmission is prone to signal interruption due to friction between the rotor and stator. Multiple types of sensors correspond to multiple communication protocols, and the heterogeneity of these protocols leads to poor data transmission compatibility, making it difficult to guarantee data integrity under transient conditions.

Summary of the Invention

[0003] To overcome the problems existing in related technologies, this disclosure provides a method and apparatus for cleaning multi-source data from gas turbine tests.

[0004] According to a first aspect of the present disclosure, a method for cleaning multi-source data from gas turbine testing is provided, comprising:

[0005] The raw test data collected from multiple heterogeneous subsystems during the gas turbine test are obtained, and the raw test data are standardized to obtain standardized data.

[0006] The standardized data is decomposed into multiple scales to obtain the low-frequency approximation coefficients corresponding to the effective signal and the high-frequency detail coefficients corresponding to the noise signal.

[0007] Feature extraction is performed on low-frequency approximation coefficients and high-frequency detail coefficients to obtain multi-dimensional feature vectors. Noise data in the standardized data is identified and removed based on the multi-dimensional feature vectors.

[0008] The standardized data is filled with missing values using a completion model that combines variational autoencoders and generative adversarial networks.

[0009] The standardized data is initially matched temporally using a sliding window, and the timestamp deviation of the standardized data frames within the sliding window is corrected and aligned based on a long short-term memory network model.

[0010] The standardized data that has undergone noise removal, missing value completion, and timestamp alignment is subjected to quality verification. If the quality verification passes, the cleaning of the raw test data is considered complete.

[0011] According to a second aspect of the present disclosure, a multi-source data cleaning apparatus for gas turbine testing is provided, comprising:

[0012] The standardization unit is used to acquire raw test data collected from multiple heterogeneous subsystems during gas turbine testing, and to standardize the raw test data to obtain standardized data.

[0013] The decomposition unit is used to perform multi-scale decomposition on the standardized data to obtain the low-frequency approximation coefficients corresponding to the effective signal and the high-frequency detail coefficients corresponding to the noise signal.

[0014] The denoising unit is used to extract features from the low-frequency approximation coefficients and high-frequency detail coefficients to obtain a multi-dimensional feature vector, and to identify and remove noise data in the standardized data based on the multi-dimensional feature vector.

[0015] The completion unit is used to perform missing value completion processing on the standardized data using a completion model that integrates a variational autoencoder and a generative adversarial network.

[0016] The alignment unit is used to perform initial temporal matching of the standardized data through a sliding window, and to correct and align the timestamp deviation of the standardized data frames within the sliding window based on a long short-term memory network model.

[0017] The verification unit is used to perform quality verification on the standardized data that has completed noise removal, missing value completion, and timestamp alignment. If the quality verification passes, it is determined that the cleaning of the raw test data is complete.

[0018] According to a third aspect of the present disclosure, an electronic device includes: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method as described in any one of the first aspects.

[0019] According to a fourth aspect of the present disclosure, a computer-readable storage medium is provided having a computer program stored thereon, which, when executed by a processor, implements the method as described in any one of the first aspects.

[0020] According to a fifth aspect of the present disclosure, a computer program product is provided, including a computer program that, when executed by a processor, implements the method as described in any one of the first aspects.

[0021] The technical solutions provided by the embodiments of this disclosure can include the following beneficial effects: By acquiring and standardizing the raw test data collected from multiple heterogeneous subsystems in the gas turbine test, the format basis of multi-source data is unified. The standardized data is decomposed into low-frequency approximation coefficients corresponding to the effective signal and high-frequency detail coefficients corresponding to the noise signal, thus achieving coarse separation between the effective signal and the noise signal. Multi-dimensional feature vectors are obtained by feature extraction of the low-frequency approximation coefficients and high-frequency detail coefficients, and noise data in the standardized data is identified and removed based on the feature vectors. By using a completion model that integrates variational autoencoders and generative adversarial networks to complete the missing values of the standardized data, high-fidelity completed data that conforms to the operating mechanism of the gas turbine is generated. The standardized data is initially matched in time sequence through a sliding window, and the time stamp deviation of the standardized data frames in the sliding window is corrected and aligned based on a long short-term memory network model, thus solving the problem of time asynchrony of multi-source heterogeneous data. By performing quality verification on the standardized data that has completed noise removal, missing value completion, and time stamp alignment, and determining that data cleaning is completed after the verification is passed, a closed-loop quality control mechanism is formed. This method transforms raw test data from multi-source heterogeneous gas turbines into high-quality, highly available standardized data, providing reliable data support for the performance verification and intelligent analysis of gas turbines.

[0022] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and are not intended to limit this disclosure.

Attached Figure Description

[0023] The accompanying drawings, which are incorporated in and form a part of this specification, illustrate embodiments consistent with this disclosure and, together with the description, serve to explain the principles of this disclosure.

[0024] Figure 1 is a flowchart illustrating a multi-source data cleaning method for gas turbine testing according to an exemplary embodiment.

[0025] Figure 2 is a schematic diagram of a multi-source data cleaning process for gas turbine testing proposed in an embodiment of this application.

[0026] Figure 3 is a block diagram illustrating a multi-source data cleaning apparatus for gas turbine testing, according to an exemplary embodiment.

[0027] Figure 4 is a block diagram illustrating an apparatus for a multi-source data cleaning method for gas turbine testing, according to an exemplary embodiment.

[0028] Figure Labels

[0029] 301 - Standardization unit; 302 - Decomposition unit; 303 - Noise reduction unit; 304 - Completion unit; 305 - Alignment unit; 306 - Verification unit; 400 - Device; 402 - Processing component; 404 - Memory; 406 - Power component; 408 - Multimedia component; 410 - Audio component; 412 - I/O interface; 414 - Sensor component; 416 - Communication component; 420 - Processor.

Detailed Implementation

[0030] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numerals in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this disclosure. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this disclosure as detailed in the appended claims.

[0031] The terminology used in this disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the present disclosure. The singular forms “a” and “the” as used in this disclosure and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise.

[0032] It should be understood that although the terms first, second, third, etc., may be used to describe various information in embodiments of this disclosure, such information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, first information may also be referred to as second information without departing from the scope of embodiments of this disclosure, and similarly, second information may also be referred to as first information. Depending on the context, the words “if” and “suppose” as used herein may be interpreted as “when”, “upon”, or “in response to a determination”.

[0033] Furthermore, various forms of processes shown in the embodiments of this disclosure can be used to reorder, add, or delete steps. For example, the steps described in this disclosure can be executed in parallel, sequentially, or in different orders, as long as the desired result of the technical solution disclosed in this disclosure can be achieved, and no limitation is imposed herein.

[0034] It should be noted that the collection, storage, use, processing, transmission, provision, and disclosure of user personal information involved in the technical solution disclosed herein all comply with the provisions of relevant laws and regulations and do not violate public order and good morals.

[0035] In related technologies, the whole-unit testing of heavy-duty gas turbines involves multiple heterogeneous systems such as testing systems, unit control systems, and vibration monitoring systems. This generates TB-level multi-source heterogeneous data encompassing 22 categories of parameters, including environmental conditions, thermodynamic performance, mechanical health, and combustion status, with over 3,000 measurement points. The data covers the entire operating process from start-up, warm-up, load operation, full-load operation to shutdown. This data forms the core foundation for forward design verification of gas turbines and AI model training (such as power efficiency prediction and fault early warning), and its quality directly determines the model's accuracy and application effectiveness.

[0036] Currently, existing technical solutions in the field of gas turbine test data cleaning mainly rely on traditional data processing methods. These technologies have been widely applied in gas turbine testing and industrial data processing. Their core components and workflows are as follows:

[0037] The existing technical solution mainly consists of a data preprocessing module, a noise filtering module, a missing value imputation module, and a data alignment module. The data preprocessing module is responsible for format conversion and preliminary screening of the raw data; the noise filtering module uses traditional algorithms such as fixed threshold filtering and moving average to remove obvious outliers and random noise from the data; the missing value imputation module uses nearest neighbor interpolation, linear interpolation, or polynomial interpolation methods to complete missing data caused by sensor failure; and the data alignment module corrects the temporal deviation of data from multiple systems through a simple timestamp matching algorithm.

[0038] The specific workflow is as follows: First, the TB-level raw data is standardized to convert heterogeneous data from different systems into a unified format; then, a fixed threshold method is used to identify and remove outliers that exceed the preset range, and the moving average method is used to smooth data noise; for missing data, linear interpolation is performed to complete it based on data from adjacent times or similar operating conditions; finally, the time sequence of data from multiple systems is manually or automatically adjusted by comparing timestamps to complete data alignment.
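For orientation, the conventional workflow described above can be sketched in a few lines of Python. This is only an illustrative baseline under stated assumptions: the function names, the threshold range, the window size, and the sample readings are all hypothetical and do not come from the patent.

```python
def fixed_threshold_filter(values, low, high):
    """Mark readings outside the preset range [low, high] as missing (None)."""
    return [v if v is not None and low <= v <= high else None for v in values]

def linear_interpolate(values):
    """Fill interior gaps by linear interpolation between the nearest valid neighbours.

    Leading or trailing gaps are left as None because they have only one neighbour.
    """
    out = list(values)
    known = [i for i, v in enumerate(out) if v is not None]
    for a, b in zip(known, known[1:]):
        for i in range(a + 1, b):
            frac = (i - a) / (b - a)
            out[i] = out[a] + frac * (out[b] - out[a])
    return out

def moving_average(values, window=3):
    """Centered moving average; the window shrinks at the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

# Hypothetical readings: one out-of-range spike (500.0) and one dropped sample (None).
raw = [10.0, 10.2, 500.0, 10.4, None, 10.8]
filtered = fixed_threshold_filter(raw, low=0.0, high=100.0)
filled = linear_interpolate(filtered)
smoothed = moving_average(filled, window=3)
```

Every parameter here (range, window, interpolation scheme) is fixed by hand and uses only temporal neighbours, which is the limitation the following paragraphs criticize.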

[0039] The existing technical solutions have the following defects and problems:

[0040] The data processing efficiency is extremely low and cannot be adapted to large-scale data scenarios: Existing technologies use serial processing logic, so cleaning the massive data from a gas turbine test takes several days or even weeks. The large number of measurement points (more than 3,000) lengthens processing further, so the real-time requirements of the gas turbine test data "acquisition-cleaning-application" chain cannot be met.

[0041] Poor removal of complex noise and latent outliers: Gas turbine data noise is affected by high-temperature vibration, electromagnetic interference, etc., exhibiting non-stationary and strongly coupled characteristics, and there are transient sudden outliers caused by combustion instability. Traditional fixed threshold and moving average methods can only remove simple random noise and obvious extreme values, and are insufficient in identifying complex noise and latent outliers that are strongly correlated with operating conditions, resulting in a still low signal-to-noise ratio after data cleaning.

[0042] Inaccurate data completion that ignores physical laws: In gas turbine testing, sensor failures and transmission interruptions can easily lead to data loss, with the duration of the loss ranging from a few seconds to several minutes. Furthermore, there are strict physical coupling relationships between the data (such as the thermodynamic relationship between temperature and pressure). Traditional interpolation methods only rely on temporal continuity for data completion, without considering physical laws. This results in significant deviations between the completed data and the actual operating conditions, making it unsuitable for multiphysics coupling analysis.

[0043] Insufficient timestamp alignment accuracy and poor timing coordination among multiple systems: The sampling frequencies of the multiple heterogeneous systems of the gas turbine vary greatly (1Hz-100Hz) and there is a transmission delay. Existing technologies use simple timestamp comparison or linear interpolation alignment, which cannot meet the timing coordination analysis requirements of multi-source data under transient conditions (such as start-up and speed-up), resulting in the failure of cross-system data association.

[0044] Poor automation and adaptability, unstable data quality: Existing solutions require manual setting of thresholds and filtering of interpolation windows, and lack the ability to adaptively adjust to the data characteristics of different steady-state and transient operating conditions of gas turbines; in addition, no customized strategies are designed for gas turbine-specific data types such as combustion dynamic pressure and rotor vibration displacement, resulting in large fluctuations in data quality and low integrity after cleaning, making it difficult to support large-scale AI model training.

[0045] To address the aforementioned issues, this disclosure provides a method and apparatus for cleaning multi-source data from gas turbine testing. By acquiring and standardizing raw test data from multiple heterogeneous subsystems during gas turbine testing, the format of the multi-source data is unified. Multi-scale decomposition of the standardized data yields low-frequency approximation coefficients corresponding to effective signals and high-frequency detail coefficients corresponding to noise signals, achieving coarse separation between effective and noise signals. Feature extraction of the low-frequency approximation coefficients and high-frequency detail coefficients yields multi-dimensional feature vectors, which are then used to identify and remove noise data from the standardized data. A completion model fused with variational autoencoders and generative adversarial networks is used to complete missing values in the standardized data, generating high-fidelity completed data that conforms to the operating mechanism of the gas turbine. A sliding window is used for initial temporal matching of the standardized data, and a long short-term memory network model is used to correct and align timestamp deviations in the standardized data frames within the sliding window, resolving the time synchronization problem of multi-source heterogeneous data. A closed-loop quality control mechanism is formed by performing quality verification on the standardized data after noise removal, missing value completion, and timestamp alignment, and confirming data cleaning completion upon successful verification. This method transforms raw test data from multi-source heterogeneous gas turbines into high-quality, highly available standardized data, providing reliable data support for the performance verification and intelligent analysis of gas turbines.

[0046] Figure 1 is a flowchart illustrating a multi-source data cleaning method for gas turbine testing according to an exemplary embodiment. It should be noted that the method of this embodiment is applied in a gas turbine test multi-source data cleaning apparatus. As shown in Figure 1, the method may include the following steps:

[0047] Step 101: Obtain the raw test data collected from multiple heterogeneous subsystems during the gas turbine test, and standardize the raw test data to obtain standardized data.

[0048] In this embodiment of the disclosure, a heterogeneous subsystem can refer to multiple dedicated systems with different technical systems, communication protocols, data formats, and time bases that are independently deployed to complete different monitoring and control tasks in gas turbine testing.

[0049] Specifically, these subsystems include, but are not limited to: the TCS (Turbine Control System) responsible for operational logic control and recording of operating parameters; the VMS (Vibration Monitoring System) focused on monitoring mechanical health; the Combustion Dynamics Monitoring System for dynamic diagnosis of combustion state parameters; and testing systems for collecting environmental and thermodynamic performance parameters. Each subsystem is provided by a different vendor, employs proprietary hardware architectures and software platforms, supports incompatible communication protocols, uses different data storage formats, and relies on its own independent system clock for timestamps, resulting in physically isolated data silos. This disclosure addresses the multi-source data generated by these heterogeneous subsystems by using standardized processing, protocol parsing, and time alignment techniques to achieve cross-system data fusion and unified cleaning.

[0050] In one embodiment, raw test data collected by multiple heterogeneous subsystems during gas turbine testing can be acquired. These heterogeneous subsystems include the unit control system, vibration monitoring system, and combustion dynamic diagnostic system, each employing different communication protocols, data formats, and time bases. To address the protocol heterogeneity issue of multi-source data, a multi-source data adaptation unit can import terabyte-level multi-source heterogeneous data from gas turbine testing in batches and in parallel, covering data types such as environmental conditions, thermodynamic performance, mechanical health, and combustion status from over 3,000 measurement points, compatible with protocols such as OPC UA (OPC Unified Architecture) and Modbus TCP (Modbus Transmission Control Protocol). Subsequently, a format standardization unit follows the GB/T 30559-2014 standard to unify data formats, naming conventions, and units, eliminating data barriers between heterogeneous systems. Finally, a preliminary screening unit uses the ±3σ statistical threshold method to quickly eliminate extreme outliers and invalid data, initially improving data purity and laying the foundation for subsequent deep cleaning.
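The unit unification and ±3σ preliminary screening described in this embodiment might look roughly like the following Python sketch. It is not the patented implementation: the unit table, the function names, and the sample pressures are hypothetical, and the GB/T 30559-2014 formatting rules are not modelled.

```python
import statistics

# Hypothetical unit table: factors that convert a pressure reading to pascals.
PRESSURE_TO_PA = {"Pa": 1.0, "kPa": 1_000.0, "MPa": 1_000_000.0}

def to_pascal(value, unit):
    """Convert one pressure reading to the common unit (Pa)."""
    return value * PRESSURE_TO_PA[unit]

def three_sigma_screen(values, k=3.0):
    """Split readings into (kept, rejected) using the mean +/- k*sigma rule."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    kept, rejected = [], []
    for v in values:
        if std == 0 or abs(v - mean) <= k * std:
            kept.append(v)
        else:
            rejected.append(v)
    return kept, rejected

# Assumed data: 30 plausible compressor outlet pressures in kPa plus one spurious value.
readings_kpa = [1500.0 + 0.5 * (i % 5) for i in range(30)] + [9000.0]
readings_pa = [to_pascal(v, "kPa") for v in readings_kpa]
kept, rejected = three_sigma_screen(readings_pa)
```

Because the mean and standard deviation are computed from data that still contains the outliers, this rule only catches extreme values, which fits its role here as a coarse first pass ahead of the later cleaning steps.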

[0051] In some embodiments of this disclosure, the raw test data includes environmental operating condition data, thermodynamic performance data, mechanical health data, and combustion state data of the gas turbine;

[0052] The environmental operating condition data includes atmospheric pressure, atmospheric temperature, and ambient relative humidity; the thermodynamic performance data includes compressor inlet flow rate, compressor outlet pressure, turbine inlet gas temperature, turbine exhaust temperature, fuel supply flow rate, and compressor inlet guide vane opening; the mechanical health data includes rotor radial vibration amplitude, rotor axial displacement, support bearing temperature, real-time shaft speed, and lubricating oil supply pressure; and the combustion status data includes combustion chamber flame tube wall temperature, combustion pulsation pressure, flue gas oxygen content, and nitrogen oxide emission concentration.

[0053] Step 102: Perform multi-scale decomposition on the standardized data to obtain the low-frequency approximation coefficients corresponding to the effective signal and the high-frequency detail coefficients corresponding to the noise signal.

[0054] In this embodiment of the disclosure, the standardized data is decomposed into multiple scales to achieve preliminary separation of effective signals from noise signals.

[0055] In some embodiments of this disclosure, step 102 may specifically include the following steps:

[0056] The db4 wavelet basis is used to decompose the standardized data into one layer of low-frequency approximation coefficients corresponding to the effective signal and five layers of high-frequency detail coefficients corresponding to the noise signal.

[0057] In this embodiment, the db4 wavelet basis is used to perform a 5-level multi-scale decomposition on the standardized data, decomposing the original signal into one level of low-frequency approximation coefficients and five levels of high-frequency detail coefficients. The low-frequency approximation coefficients correspond to the main components of the effective signal, such as the steady-state variation trend of the compressor outlet pressure and the linkage variation law of the gas temperature before the turbine with the load, reflecting key characteristics of the gas turbine operating mechanism. The high-frequency detail coefficients correspond to various noise components, including high-frequency spikes of electromagnetic interference from the frequency converter of the electrical system and random white noise in the rotor vibration signal.

[0058] In one embodiment, the combination of the db4 wavelet basis and 5-level decomposition can be tailored to the characteristics of heavy-duty gas turbine test data, namely a sampling frequency of 1Hz-1000Hz and a signal amplitude fluctuation range of ±0.01%FS. This can effectively avoid the problem of effective signal attenuation or incomplete noise separation caused by improper parameter selection in conventional decomposition schemes, and lay the foundation for subsequent accurate noise identification and removal based on feature vectors.
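To make step 102 concrete, here is a minimal pure-Python sketch of a 5-level db4 decomposition. It assumes periodic boundary extension and a signal length divisible by 2^5, neither of which the patent specifies; the function names and the test signal are hypothetical. The filter taps are the standard Daubechies-4 (8-tap) scaling coefficients. In practice a wavelet library would normally be used instead of hand-written loops.

```python
import math

# Standard db4 scaling (low-pass) filter taps.
DB4_LO = [
    0.23037781330885523, 0.7148465705525415, 0.6308807679295904,
    -0.02798376941698385, -0.18703481171888114, 0.030841381835986965,
    0.032883011666982945, -0.010597401784997278,
]
# Quadrature-mirror high-pass filter: g[n] = (-1)^n * h[L-1-n].
DB4_HI = [(-1) ** n * DB4_LO[len(DB4_LO) - 1 - n] for n in range(len(DB4_LO))]

def dwt_step(x):
    """One analysis level: filter with periodic wrap-around, then downsample by 2."""
    n = len(x)
    approx, detail = [], []
    for k in range(n // 2):
        approx.append(sum(h * x[(2 * k + i) % n] for i, h in enumerate(DB4_LO)))
        detail.append(sum(g * x[(2 * k + i) % n] for i, g in enumerate(DB4_HI)))
    return approx, detail

def wavedec_db4(x, levels=5):
    """Return (approximation, [detail_1, ..., detail_levels]), finest detail first."""
    if len(x) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    approx, details = list(x), []
    for _ in range(levels):
        approx, d = dwt_step(approx)
        details.append(d)
    return approx, details

# Assumed test signal: a slow trend plus one narrow spike standing in for interference.
signal = [math.sin(2 * math.pi * i / 128) for i in range(256)]
signal[100] += 0.5
approx, details = wavedec_db4(signal, levels=5)
```

With this boundary handling the transform is orthogonal, so the total energy of the coefficients equals that of the input. The slow trend concentrates in `approx` and the coarser detail levels, while the spike shows up mainly in the finest ones.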

[0059] Step 103: Extract features from the low-frequency approximation coefficients and high-frequency detail coefficients to obtain multi-dimensional feature vectors. Based on the multi-dimensional feature vectors, identify and remove noisy data from the standardized data.

[0060] In this embodiment, the low-frequency approximation coefficients obtained in step 102 are concatenated with the five high-frequency detail coefficients to generate a multi-dimensional feature vector that integrates the main features of the effective signal and the detail features of noise. This multi-dimensional feature vector comprehensively represents the time-frequency domain characteristics of the original signal. The multi-dimensional feature vector is then input into a pre-defined convolutional neural network for deep feature mining. This network is configured with two convolutional layers with 3×3 convolutional kernels and ReLU (Rectified Linear Unit) activation functions, as well as one max-pooling layer. It can automatically learn and extract the discriminative features between noise and effective signals in higher dimensions, ultimately outputting a 64-dimensional target feature vector. The target feature vector is then input into a pre-trained Isolation Forest algorithm. This algorithm is trained and optimized based on gas turbine test sample data covering the entire process of startup, steady state, variable operating conditions, and shutdown. The outlier threshold is set to 0.85, enabling it to accurately identify and isolate random noise, electromagnetic interference noise, and latent outliers that are difficult to detect using traditional methods.

[0061] It should be noted that through the above cascaded processing, the identified noise data is finally removed from the standardized data, while the detailed features of the effective signal are fully preserved, achieving high-precision adaptive noise removal.

[0062] In some embodiments of this disclosure, step 103 may specifically include the following steps:

[0063] The five high-frequency detail coefficients obtained from the decomposition are concatenated with the one low-frequency approximation coefficient to generate a multi-dimensional feature vector.

[0064] The multidimensional feature vector is input into a pre-defined convolutional neural network for feature mining, and the target feature vector output by the convolutional neural network is obtained.

[0065] The target feature vector is input into a pre-trained Isolation Forest algorithm to identify and remove noisy data based on a preset outlier threshold.

[0066] In one embodiment, the five high-frequency detail coefficients obtained by wavelet decomposition in step 102 and the one low-frequency approximation coefficient can be concatenated to generate a feature vector that integrates multi-dimensional features in the time and frequency domain. This vector retains the main features of the effective signal and also contains detailed information of various noise components, providing a comprehensive data foundation for subsequent deep learning feature mining.
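The concatenation described above can be illustrated with a minimal sketch. It is not the patent's implementation: to stay within the Python standard library it substitutes a Haar wavelet for the db4 basis named in the disclosure, and the function names (`haar_step`, `decompose`, `build_feature_vector`) and the 64-sample test signal are hypothetical. It only shows how one approximation band and five detail bands combine into a single feature vector.

```python
import math

def haar_step(signal):
    """One Haar analysis step: returns (approximation, detail), each half the input length."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal) - 1, 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal) - 1, 2)]
    return approx, detail

def decompose(signal, levels=5):
    """Multi-level decomposition: one approximation band plus `levels` detail bands."""
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    details = []
    approx = list(signal)
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)  # details[0] is the finest (highest-frequency) band
    return approx, details

def build_feature_vector(approx, details):
    """Concatenate the approximation band with the detail bands, coarsest band first."""
    feature = list(approx)
    for band in reversed(details):
        feature.extend(band)
    return feature

# Hypothetical 64-sample test signal (Haar stands in for db4 here).
signal = [math.sin(2 * math.pi * k / 32) for k in range(64)]
approx, details = decompose(signal, levels=5)
feature = build_feature_vector(approx, details)
```

Because this Haar step is orthonormal, the feature vector has the same length and energy as the input window, so no information is discarded before the convolutional network sees it.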

[0067] In this embodiment of the disclosure, the generated multidimensional feature vector is input into a preset convolutional neural network for deep feature mining. The network automatically learns the discriminative features of noise and effective signals in high-level semantics through the cascaded structure of convolutional layers and pooling layers, and outputs the target feature vector after dimensionality reduction, thereby realizing the mapping and compression from the original time-frequency features to features with higher discriminative power.

[0068] In this embodiment of the disclosure, the target feature vector output by the convolutional neural network is input into a pre-trained isolation forest algorithm. This algorithm is trained on sample data covering the entire operating history of the gas turbine. By calculating the path length of each sample in the isolation trees and comparing the result with a preset outlier determination threshold, it automatically identifies and isolates abnormal data points that belong to noise, while fully retaining the normal data corresponding to the valid signal, thus completing the removal of noise data from the standardized data.
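The path-length comparison can be sketched as follows. The disclosure does not state which scoring convention is used, so this sketch assumes the standard isolation forest anomaly score, s = 2^(-E[h(x)] / c(n)), where E[h(x)] is the mean path length of a sample over the trees and c(n) is the average path length of an unsuccessful binary search tree lookup among n samples. The function names are hypothetical, and applying the 0.85 threshold to this particular score is an assumption.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant, used to approximate harmonic numbers

def average_path_length(n):
    """c(n): expected path length of an unsuccessful BST search among n samples."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    """Standard isolation forest score in (0, 1]; shorter paths give scores closer to 1."""
    return 2.0 ** (-mean_path_length / average_path_length(n))

def is_noise(mean_path_length, n, threshold=0.85):
    """Flag a sample as noise when its score reaches the outlier threshold (assumed convention)."""
    return anomaly_score(mean_path_length, n) >= threshold
```

Under this convention a sample whose mean path length equals c(n) scores exactly 0.5, and only samples isolated after very few splits reach a score of 0.85.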

[0069] In some embodiments of this disclosure, the outlier determination threshold is obtained by training and optimizing the isolation forest algorithm based on gas turbine test sample data covering the entire process of startup, steady state, variable operating conditions, and shutdown.

[0070] In this embodiment, to adapt to the complex operating conditions of heavy-duty gas turbines in actual operation, test sample data covering the entire process, including start-up acceleration, steady-state operation, load variations, and shutdown coasting, can be collected in advance. Each sample group is labeled with the true labels of normal data and various types of noise data (including random noise, electromagnetic interference noise, and latent outliers). Using these sample data covering all operating conditions as a training set, supervised parameter optimization of the Isolation Forest algorithm is performed. Through grid search and cross-validation, different outlier judgment thresholds are traversed, and the noise recognition accuracy and false detection rate are evaluated. Finally, it is determined that when the threshold is set to 0.85, the algorithm has the best comprehensive recognition performance under various operating conditions, which can ensure a high detection rate of real noise data while avoiding the rejection of valid signals as noise.
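A minimal sketch of the threshold traversal follows. It assumes anomaly scores and true noise labels are already available for the labeled samples, omits the cross-validation folds mentioned above, and uses the difference between detection rate and false detection rate as the selection criterion, which is an assumption because the disclosure does not state how the two metrics are combined. The scores and labels are synthetic values chosen only to make the example run.

```python
def evaluate(scores, labels, threshold):
    """Return (detection_rate, false_detection_rate); labels are True for noise samples."""
    tp = sum(1 for s, y in zip(scores, labels) if y and s >= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if not y and s >= threshold)
    n_noise = sum(1 for y in labels if y)
    n_valid = len(labels) - n_noise
    detection = tp / n_noise if n_noise else 0.0
    false_rate = fp / n_valid if n_valid else 0.0
    return detection, false_rate

def select_threshold(scores, labels, candidates):
    """Traverse candidate thresholds and keep the one with the best combined metric."""
    best_threshold, best_objective = None, None
    for t in candidates:
        detection, false_rate = evaluate(scores, labels, t)
        objective = detection - false_rate  # assumed criterion (Youden's J statistic)
        if best_objective is None or objective > best_objective:
            best_threshold, best_objective = t, objective
    return best_threshold

# Synthetic, illustrative data only: not measurements from any gas turbine test.
scores = [0.95, 0.90, 0.88, 0.80, 0.60, 0.55, 0.50, 0.45]
labels = [True, True, True, False, False, False, False, False]
candidates = [0.50, 0.65, 0.75, 0.85, 0.95]
best = select_threshold(scores, labels, candidates)
```

In practice the traversal would be repeated per cross-validation fold and per operating condition, and the metrics averaged before a threshold is chosen.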

[0071] It should be noted that the optimized threshold takes into account the noise distribution patterns of the measured gas turbine data under different operating conditions. This enables the isolation forest algorithm to adapt to transient fluctuations during startup, stable characteristics during steady-state operation, and dynamic changes during variable operating conditions, thereby improving the accuracy of noise removal.

[0072] Step 104: Use a completion model that combines variational autoencoder and generative adversarial network to complete the missing values in the standardized data.

[0073] In this embodiment of the disclosure, in order to solve the problem of random missing values and continuous missing values (including long-sequence missing data with a continuous missing duration of more than 10 seconds) caused by sensor failure, signal transmission interruption or power failure of acquisition equipment during gas turbine testing, a completion model that deeply integrates variational autoencoder and generative adversarial network is used to perform high-fidelity missing value completion processing on standardized data.

[0074] In some embodiments of this disclosure, step 104 may specifically include the following steps:

[0075] The K-nearest neighbor algorithm is used, with the real-time load, atmospheric temperature, atmospheric pressure and shaft speed of the gas turbine as operating condition matching features. Complete data segments with operating condition similarity higher than a preset threshold with the standardized data are selected from the historical complete test database to construct a similar operating condition reference set.

[0076] A completion model that deeply integrates a variational autoencoder and a generative adversarial network is constructed, and the completion model is pre-trained using the similar operating condition reference set.

[0077] The missing data to be completed is input into the pre-trained completion model. The variational autoencoder is used to learn the physical correlation between the data and generate candidate completion data. The generative adversarial network is used to judge and optimize the candidate completion data and output the target completion data that conforms to the operating mechanism of the gas turbine.

[0078] In this embodiment, the K-nearest neighbor algorithm is employed, using the real-time load, atmospheric temperature, atmospheric pressure, and shaft speed of the gas turbine as core operating condition matching features. Complete data segments whose similarity to the data to be completed exceeds a preset threshold are selected from a historical complete test database covering all operating conditions. From these segments, a similar operating condition reference set is constructed, covering core physical laws such as the coupling relationship between compressor pressure ratio and flow rate, the heat transfer law between turbine inlet temperature and exhaust temperature, and the correlation characteristics between vibration amplitude and speed. This provides a reference benchmark consistent with the gas turbine's operating mechanism for the subsequent completion. A completion model that deeply integrates a variational autoencoder and a generative adversarial network is then built. The variational autoencoder can be configured with a 512-dimensional encoder and a 256-dimensional decoder to learn the physical correlation laws between multi-dimensional data of the heavy-duty gas turbine, such as the positive correlation between fuel flow rate and combustion chamber flame tube wall temperature, and the linkage between compressor inlet guide vane angle and outlet pressure. The generative adversarial network is configured with a 128-layer generator and a 64-layer discriminator to generate high-fidelity completion data consistent with the gas turbine's operating mechanism.
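The operating condition matching step can be sketched as below. The disclosure does not define the similarity measure, so this sketch assumes Euclidean distance over range-normalized features, mapped to a similarity of 1 / (1 + distance). The field names, the normalization ranges, `k`, and the 0.9 similarity threshold are all hypothetical values chosen for illustration.

```python
import math

# The four matching features named in the text; the key names are hypothetical.
FEATURES = ("load", "ambient_temp", "ambient_pressure", "shaft_speed")

def normalize(record, ranges):
    """Scale each matching feature to [0, 1] so that no single unit dominates the distance."""
    return [(record[name] - ranges[name][0]) / (ranges[name][1] - ranges[name][0])
            for name in FEATURES]

def similarity(a, b, ranges):
    """Map Euclidean distance in normalized feature space to a similarity in (0, 1]."""
    d = math.dist(normalize(a, ranges), normalize(b, ranges))
    return 1.0 / (1.0 + d)

def build_reference_set(query, history, ranges, k=3, min_similarity=0.9):
    """Keep the k nearest historical segments whose similarity meets the preset threshold."""
    scored = [(similarity(query, seg, ranges), seg) for seg in history]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [seg for sim, seg in scored[:k] if sim >= min_similarity]

# Illustrative values only (assumed units: MW, degrees C, kPa, rpm).
ranges = {"load": (0, 500), "ambient_temp": (-20, 50),
          "ambient_pressure": (90, 105), "shaft_speed": (0, 3600)}
history = [
    {"id": "A", "load": 300, "ambient_temp": 15, "ambient_pressure": 101, "shaft_speed": 3000},
    {"id": "B", "load": 305, "ambient_temp": 16, "ambient_pressure": 101, "shaft_speed": 3000},
    {"id": "C", "load": 100, "ambient_temp": 30, "ambient_pressure": 99, "shaft_speed": 1500},
]
query = {"load": 302, "ambient_temp": 15, "ambient_pressure": 101, "shaft_speed": 3000}
reference = build_reference_set(query, history, ranges)
```

Segments A and B are retained because they lie close to the query in all four features, while segment C, recorded at a much lower load and speed, falls below the threshold and is excluded.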

[0079] In one embodiment, after pre-training the completion model using a constructed reference set of similar operating conditions, the missing data to be completed is input into the model. The variational autoencoder generates candidate completion data, and the generative adversarial network then discriminates and iteratively optimizes the candidate data. Finally, the target completion data with a physical consistency error of ≤5% with the original data is output, ensuring that the completed data can truly reflect the actual operating state of the gas turbine and provide a complete and reliable data foundation for subsequent timestamp alignment and quality verification.

[0080] In some embodiments of this disclosure, the similar operating condition reference set includes the coupling relationship between compressor pressure ratio and flow rate in a gas turbine, heat transfer data of turbine inlet temperature and exhaust temperature, and correlation characteristic data of vibration amplitude and rotational speed.

[0081] In this embodiment of the disclosure, the similar operating condition reference set specifically includes multi-dimensional correlation data characterizing the core physical operating laws of the gas turbine.

[0082] Specifically, the similar operating condition reference set includes data on the coupling relationship between compressor pressure ratio and flow rate, reflecting the inherent physical constraints between the compressor's compression characteristics and flow capacity under different operating conditions; it includes data on the heat transfer law of turbine inlet temperature and exhaust temperature, reflecting the energy conversion and heat transfer characteristics of the gas during turbine expansion and work; it also includes data on the correlation between vibration amplitude and rotational speed, characterizing the dynamic response law of the rotor system at different speeds.

[0083] It should be noted that these data all originate from a complete historical test database covering all operating conditions, and have been filtered using the K-nearest neighbor algorithm based on operating condition matching features such as real-time load, atmospheric temperature, atmospheric pressure, and shaft speed. This ensures that the data segments in the reference set and the data to be completed were recorded under similar operating conditions. Constructing a similar operating condition reference set that incorporates the aforementioned core physical laws provides the completion model, which fuses a variational autoencoder and a generative adversarial network, with a reference benchmark consistent with the actual operating mechanism of the gas turbine. This allows the completion model to learn and follow the physical constraints of key components such as the compressor, turbine, and rotor, thereby generating high-fidelity completion data that is both statistically reasonable and physically interpretable, and avoiding completion results that violate the basic operating laws of the gas turbine.

[0084] Step 105: Perform initial temporal matching on the standardized data using a sliding window, and correct and align the timestamp deviation of the standardized data frames within the sliding window based on the Long Short-Term Memory network model.

[0085] In this embodiment of the disclosure, the timestamps of multi-source data are asynchronous because the heterogeneous subsystems differ in sampling frequency (ranging from 1Hz to 1000Hz) and in signal transmission link delay. To solve this problem, a two-step method is adopted that combines initial matching by a sliding window with fine alignment by a long short-term memory network. This method performs time-series alignment on the standardized data, achieves a timestamp alignment accuracy of ≤1ms between the test data of the heterogeneous systems, and provides a unified time-series benchmark for subsequent correlation analysis and fusion of the multi-source data.

[0086] In some embodiments of this disclosure, step 105 may specifically include the following steps:

[0087] A fixed-width sliding window is used to perform initial temporal matching of the standardized data to be aligned, and to establish the temporal correspondence between the standardized data of different heterogeneous subsystems.

[0088] The data in the sliding window is input into the trained time series prediction model so that the time series prediction model can predict the numerical characteristics of the low-frequency system at the corresponding subdivided time nodes based on the data change trend of the high-frequency system, and correct the fixed deviation and random jitter deviation of the timestamps of each heterogeneous system.

[0089] The trained time-series prediction model is obtained by training a long short-term memory network with a historical synchronous calibration test dataset as the training set, so that it learns the sampling frequency differences and signal transmission link delay patterns of the heterogeneous acquisition systems.

[0090] In this embodiment of the disclosure, a sliding window with a fixed width (e.g., 100ms) can be preset as a unified time reference. For the standardized data stream of each heterogeneous subsystem, the number of continuous data points that should be included within the window width (i.e., the product of the sampling frequency and the window width) is calculated based on its inherent sampling frequency. The corresponding number of data points are extracted from the data stream of each heterogeneous subsystem, thereby establishing a coarse-grained time correspondence between the data of different subsystems within the same time window and completing the initial timing matching.

[0091] In this embodiment, the data within the sliding window is input into a pre-trained long short-term memory network time-series prediction model. This model uses a historical synchronous calibration test dataset as its training set and has fully learned the sampling frequency differences and signal transmission link delay patterns of various heterogeneous acquisition systems. Based on the data change trends of high-frequency systems (such as a 1000Hz vibration monitoring system) within the window, the model predicts the numerical characteristics of low-frequency systems (such as a 10Hz unit control system) at the corresponding subdivided time nodes, while automatically correcting the fixed timestamp deviation and random jitter deviation caused by transmission link differences.

[0092] Furthermore, the multi-source heterogeneous data set formed after initial timing matching via the sliding window is input into the pre-trained long short-term memory (LSTM) network time-series prediction model to achieve precise timestamp alignment. Specifically, for standardized data from a high-frequency acquisition system (such as a vibration monitoring system sampling at 1000Hz) and a low-frequency acquisition system (such as a unit control system sampling at 10Hz) within the same time window, the LSTM model first learns the data change trends and fluctuation patterns of the high-frequency system within the window, capturing its dynamic characteristics on a millisecond-level time scale. Based on the continuous change trends revealed by the high-frequency system, the model then predicts the numerical characteristics of the low-frequency system at subdivided time nodes between two adjacent sampling points, thereby filling the information gaps of the low-frequency system in the time dimension. Simultaneously, based on the transmission link delay patterns learned from the historical synchronous calibration test dataset, the model automatically identifies and corrects fixed timestamp deviations (such as constant offsets caused by unsynchronized system clocks) and random jitter deviations (such as instantaneous delay fluctuations caused by network congestion) that arise from differences in the signal transmission paths of the heterogeneous subsystems and from network delay fluctuations. Through this mechanism of predicting low-frequency nodes from high-frequency trends while correcting both kinds of deviation, data generated at the same physical moment but transmitted through different acquisition systems are given a unified and accurate timestamp, achieving time alignment of ≤1ms between the test data of the heterogeneous systems and providing a high-precision time reference for subsequent multi-source data fusion analysis and working condition reconstruction.

[0093] Furthermore, a historical synchronous calibration test dataset covering the entire process of startup, steady state, variable operating conditions, and shutdown can be pre-constructed using high-precision synchronous acquisition equipment. This dataset includes load command data from the unit control system, rotor vibration data from the vibration monitoring system, and combustion pulsation data from the combustion analysis system. All data has been calibrated with high-precision timestamps, ensuring a reliable temporal correspondence between data from different systems, which can serve as the baseline ground truth for model training. During the model training phase, the multi-source heterogeneous data from this synchronous calibration dataset is used as input, and a Long Short-Term Memory (LSTM) network is used as the core training model. By designing a network structure with three hidden layers and 256 neurons in each hidden layer, the model can fully learn the complex temporal mapping relationships between various heterogeneous systems. During training, the model focuses on learning two core principles: first, the data density differences caused by different sampling frequencies (e.g., 1000Hz, 500Hz, 10Hz) of various heterogeneous acquisition systems, i.e., how to establish the inherent correspondence between high-frequency data points and low-frequency data points in the time dimension; second, the statistical laws of fixed and random jitter deviations in timestamps caused by factors such as signal transmission paths and network delays in various systems.
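The two kinds of timestamp deviation that the model learns can be made concrete with a simple statistical baseline. This is not the LSTM described above: it assumes that calibrated reference timestamps and recorded timestamps are available for the same events, estimates the fixed deviation as the mean difference and the random jitter as the standard deviation of the differences, and then removes the fixed part. The function names and sample values are hypothetical.

```python
import statistics

def estimate_timestamp_deviation(reference_ms, recorded_ms):
    """Return (fixed_deviation, jitter) in ms from paired calibration timestamps."""
    diffs = [rec - ref for ref, rec in zip(reference_ms, recorded_ms)]
    fixed = statistics.fmean(diffs)    # constant offset, e.g. unsynchronized clocks
    jitter = statistics.pstdev(diffs)  # spread around the offset, e.g. network delay
    return fixed, jitter

def correct_timestamps(recorded_ms, fixed):
    """Remove the fixed deviation; the residual jitter is what a learned model must handle."""
    return [rec - fixed for rec in recorded_ms]

# Illustrative calibration pairs for one subsystem (not measured data).
reference_ms = [0.0, 100.0, 200.0, 300.0]
recorded_ms = [12.5, 112.0, 213.0, 312.5]

fixed, jitter = estimate_timestamp_deviation(reference_ms, recorded_ms)
corrected = correct_timestamps(recorded_ms, fixed)
residual = max(abs(c - r) for c, r in zip(corrected, reference_ms))
```

In this toy case the fixed deviation is 12.5 ms and the residual after correction stays within the ≤1ms target. Real jitter is not constant over time, which is the motivation for learning it from calibration data rather than subtracting a single offset.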

[0094] Through iterative training with a large number of synchronously calibrated samples, the model gradually converges and forms the ability to predict the values of low-frequency system subdivision time nodes based on the changing trends of high-frequency systems, while automatically correcting various timestamp deviations. Finally, a time series prediction model that can be used for alignment of actual experimental data is obtained.

[0095] In some embodiments of this disclosure, step 105, which uses a fixed-width sliding window to perform initial temporal matching of the standardized data to be aligned and establishes the temporal correspondence between the standardized data belonging to different heterogeneous subsystems, may specifically include the following steps:

[0096] For each heterogeneous subsystem, the total number of continuous data points that should be included within the preset time window width is calculated based on the sampling frequency of the heterogeneous subsystem; the number of continuous data points is the product of the sampling frequency and the preset time window width.

[0097] For each time window, a number of continuous data points equal to the total number calculated for each heterogeneous subsystem is extracted from the standardized data of that subsystem. Based on the data point sets of the different heterogeneous subsystems extracted within the same time window, a time correspondence is established.

[0098] In this embodiment of the disclosure, for each heterogeneous subsystem, the number of continuous data points that should be included in the time window is calculated based on its inherent sampling frequency and the preset uniform time window width.

[0099] Specifically, a fixed time window width is preset as a unified benchmark for multi-source data timing matching, for example, 100ms. For a vibration monitoring system with a sampling frequency of 1000Hz, it should contain 100 consecutive data points within the 100ms window (1000Hz × 0.1s = 100 points); for a combustion dynamic analysis system with a sampling frequency of 500Hz, it should contain 50 consecutive data points within the same window (500Hz × 0.1s = 50 points); and for a unit control system with a sampling frequency of 10Hz, it should contain 1 consecutive data point within the window (10Hz × 0.1s = 1 point). Through this calculation, a mapping relationship of the number of data points for each heterogeneous subsystem under a unified time benchmark is established, providing a quantitative basis for subsequent data extraction and matching.
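The point-count calculation above can be written as a short sketch. The function name and the dictionary keys are hypothetical, and integer milliseconds are used to avoid floating-point rounding in the product of sampling frequency and window width.

```python
def points_per_window(sampling_hz, window_ms):
    """Number of consecutive samples in one window: sampling frequency x window width."""
    count, remainder = divmod(sampling_hz * window_ms, 1000)
    if remainder:
        raise ValueError("window does not contain a whole number of samples")
    return count

# Sampling frequencies taken from the example in the text.
SUBSYSTEM_RATES_HZ = {
    "vibration_monitoring": 1000,
    "combustion_dynamics": 500,
    "unit_control": 10,
}

counts = {name: points_per_window(hz, 100) for name, hz in SUBSYSTEM_RATES_HZ.items()}
# counts == {"vibration_monitoring": 100, "combustion_dynamics": 50, "unit_control": 1}
```

The explicit remainder check reflects a design choice of this sketch: a window width that does not hold a whole number of samples for some subsystem (for example, 50ms at 10Hz) is rejected rather than silently rounded.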

[0100] In some embodiments of this disclosure, for each continuously sliding time window, the total number of continuous data points corresponding to that subsystem are extracted from the standardized data streams of each heterogeneous subsystem. For example, within the first 100ms window, 100 continuous data points are extracted from the vibration monitoring system data stream, 50 continuous data points are extracted from the combustion dynamic analysis system data stream, and 1 data point is extracted from the unit control system data stream. These data points from different systems are then aggregated to form a multi-source data point set corresponding to that time window. By traversing all time windows and repeating the above extraction and aggregation operations, a coarse-grained temporal correspondence between standardized data across heterogeneous subsystems is finally established, using time windows as units. This clarifies which data points from different systems correspond to each other in the time dimension within each unified time interval, providing a structured input data foundation for subsequent fine-grained timestamp deviation correction and alignment based on a long short-term memory network model.
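The extraction and aggregation over successive windows can be sketched as a generator. It assumes non-overlapping windows (a stride equal to the window width), which the text does not specify, and it assumes the streams start at the same nominal instant. The name `iter_windows` and the integer sample values are placeholders.

```python
def iter_windows(streams, counts):
    """Yield one dict per time window, mapping each subsystem name to its slice of points."""
    n_windows = min(len(points) // counts[name] for name, points in streams.items())
    for w in range(n_windows):
        yield {
            name: points[w * counts[name]:(w + 1) * counts[name]]
            for name, points in streams.items()
        }

# Points per 100ms window for each subsystem, as in the example in the text.
counts = {"vibration_monitoring": 100, "combustion_dynamics": 50, "unit_control": 1}

# Placeholder streams covering 200ms; integers stand in for standardized samples.
streams = {
    "vibration_monitoring": list(range(200)),
    "combustion_dynamics": list(range(100)),
    "unit_control": list(range(2)),
}

windows = list(iter_windows(streams, counts))
```

Each yielded dictionary is the multi-source data point set for one window, which is the structured input that the subsequent fine alignment step consumes.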

[0101] Step 106: Perform quality verification on the standardized data that has completed noise removal, missing value completion, and timestamp alignment. If the quality verification passes, the cleaning of the original experimental data is considered complete.

[0102] In this embodiment, a comprehensive quality check is performed on the standardized data that has undergone noise removal, missing value completion, and timestamp alignment. The usability and reliability of the cleaned data are evaluated by calculating a data integrity index, a time-series consistency index, and a deviation index relative to the theoretical values of preset working conditions. If any of these indices fails to meet its preset pass threshold, the check is deemed to have failed. Depending on the type of failure, the process selectively returns to at least one of the noise removal, missing value completion, or timestamp alignment steps, dynamically adjusts the corresponding processing parameters, and re-executes the step until all check indices meet the preset conditions. Finally, the cleaning of the original experimental data is determined to be complete, and high-quality, high-confidence standardized data assets are output.

[0103] In some embodiments of this disclosure, step 106 may specifically include the following steps:

[0104] For standardized data that has undergone noise removal, missing value completion, and timestamp alignment, calculate the data integrity index, the time sequence consistency index, and the deviation index from the theoretical value of the preset working condition.

[0105] If any of the data integrity index, time series consistency index, or deviation index values does not meet its preset conditions, the quality verification is determined to fail, and the process returns to the step of feature extraction of low-frequency approximation coefficients and high-frequency detail coefficients until the quality verification passes.

[0106] In this embodiment of the disclosure, three types of core quality indicators can be calculated for standardized data that has undergone noise removal, missing value imputation, and timestamp alignment:

[0107] Data integrity metrics are used to assess whether there are any new missing or gapped data on the timeline due to improper processing, ensuring continuous coverage of the data sequence;

[0108] The timing consistency index is used to verify whether the timing relationship between the data of each heterogeneous system after the timestamp alignment is completed is consistent with the actual order of the physical process, such as whether the phase relationship between the vibration peak and the speed change is reasonable.

[0109] The deviation index value from the preset operating condition theoretical value is used to compare the key parameters after cleaning (such as compressor outlet pressure, turbine inlet temperature, etc.) with the theoretical values under the operating condition obtained based on the gas turbine thermodynamic model or historical benchmark data, and to quantify whether the deviation is within the allowable range.
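The three indicators can be sketched as simple functions. The disclosure does not give formulas, so every definition here is an assumption: integrity as the fraction of present samples, timing consistency as the fraction of events whose aligned timestamps agree across systems within a tolerance, and deviation as the mean relative difference from theoretical values. The function names and sample inputs are hypothetical.

```python
def integrity_index(values):
    """Fraction of samples that are present (None marks a missing sample)."""
    return sum(v is not None for v in values) / len(values)

def timing_consistency_index(timestamps_by_system, tolerance_ms=1.0):
    """Fraction of events whose timestamps agree across all systems within the tolerance."""
    events = list(zip(*timestamps_by_system.values()))
    consistent = sum(max(e) - min(e) <= tolerance_ms for e in events)
    return consistent / len(events)

def deviation_index(measured, theoretical):
    """Mean relative deviation of cleaned parameters from their theoretical values."""
    return sum(abs(m - t) / abs(t) for m, t in zip(measured, theoretical)) / len(measured)

# Illustrative inputs only.
integrity = integrity_index([1.0, None, 2.0, 3.0])
consistency = timing_consistency_index({
    "vibration_monitoring": [0.0, 100.0, 200.0],
    "unit_control": [0.4, 100.2, 202.0],
})
deviation = deviation_index([101.0, 198.0], [100.0, 200.0])
```

A check of physical ordering, such as the phase relationship between a vibration peak and a speed change mentioned above, would need domain-specific logic that this sketch does not attempt.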

[0110] In one embodiment, if any of the aforementioned data integrity index, time series consistency index, and deviation index fails to meet its preset pass threshold, the quality verification is deemed unsuccessful. In this case, a closed-loop return mechanism is triggered, automatically returning to step 103, where feature extraction of low-frequency approximation coefficients and high-frequency detail coefficients is performed. Based on the type of verification failure, the relevant processing parameters are dynamically adjusted (e.g., adjusting the outlier determination threshold of the isolation forest or modifying the network parameters for feature extraction), and noise removal and the subsequent steps are re-executed until all quality verification indices meet the preset conditions. This closed-loop iterative verification mechanism ensures that the final output standardized data meets high quality and high confidence requirements, providing reliable data assets for the forward design verification and intelligent analysis applications of gas turbines.
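The closed-loop return mechanism can be sketched as a retry loop. The callables `clean`, `verify`, and `adjust` are hypothetical stand-ins for the cleaning steps, the quality check, and the parameter adjustment. The `max_rounds` limit, the 0.05 threshold step, and the 5% deviation limit are assumptions added so that the sketch terminates; the text itself only says the process repeats until verification passes.

```python
def clean(data, params):
    """Stand-in for noise removal: keep samples whose anomaly score is below the threshold."""
    return [value for value, score in data if score < params["threshold"]]

def verify(cleaned, theoretical=100.0, max_deviation=0.05):
    """Stand-in for quality verification: return the list of failed indicator names."""
    failures = []
    if not cleaned:
        failures.append("integrity")
    elif max(abs(v - theoretical) / theoretical for v in cleaned) > max_deviation:
        failures.append("deviation")
    return failures

def adjust(params, failures):
    """Tighten the outlier threshold when residual deviation suggests noise slipped through."""
    if "deviation" in failures:
        return {**params, "threshold": round(params["threshold"] - 0.05, 2)}
    return params

def run_closed_loop(data, clean, verify, adjust, params, max_rounds=5):
    """Re-run cleaning with adjusted parameters until verification passes."""
    for round_no in range(1, max_rounds + 1):
        cleaned = clean(data, params)
        failures = verify(cleaned)
        if not failures:
            return cleaned, params, round_no
        params = adjust(params, failures)
    raise RuntimeError("quality verification did not pass within max_rounds")

# Illustrative (value, anomaly score) pairs; 150.0 is an outlier that first slips through.
data = [(100.0, 0.30), (101.0, 0.40), (150.0, 0.82), (99.0, 0.20)]
cleaned, final_params, rounds = run_closed_loop(
    data, clean, verify, adjust, {"threshold": 0.85})
```

In this toy run the first pass keeps the outlier and fails the deviation check, the threshold is lowered to 0.8, and the second pass removes the outlier and passes.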

[0111] In some embodiments, as shown in Figure 2, the input raw experimental data is sequentially processed using wavelet transform, a CNN (Convolutional Neural Network), and an isolation forest to identify and accurately remove noisy data. A similar operating condition reference set is constructed using the K-nearest neighbor algorithm, and a completion model combining a VAE (Variational Autoencoder) and a GAN (Generative Adversarial Network) is used for high-fidelity completion of missing values. A sliding window is used for initial temporal matching of the data, and timestamp deviation correction and accurate alignment are performed based on an LSTM (Long Short-Term Memory) network. Finally, the data undergoes quality verification. If the verification passes, the cleaned high-quality data is output and stored; if the verification fails, the data is returned for reprocessing, forming a complete closed-loop cleaning process.

[0112] According to the multi-source data cleaning method for gas turbine testing proposed in this disclosure, the original test data collected from multiple heterogeneous subsystems in the gas turbine test are acquired and standardized, unifying the format basis of the multi-source data. The standardized data is decomposed into low-frequency approximation coefficients corresponding to the effective signal and high-frequency detail coefficients corresponding to the noise signal, achieving coarse separation between the effective and noise signals. Multi-dimensional feature vectors are extracted from the low-frequency approximation coefficients and high-frequency detail coefficients, and noise data in the standardized data is identified and removed based on these feature vectors. A completion model combining variational autoencoders and generative adversarial networks is used to complete the standardized data with missing values, generating high-fidelity completed data that conforms to the gas turbine operating mechanism. A sliding window is used for initial temporal matching of the standardized data, and a long short-term memory network model is used to correct and align the timestamp deviation of the standardized data frames within the sliding window, solving the problem of time asynchrony in multi-source heterogeneous data. A closed-loop quality control mechanism is formed by performing quality verification on the standardized data after noise removal, missing value completion, and timestamp alignment, and determining that data cleaning is complete upon successful verification. This method transforms raw test data from multi-source heterogeneous gas turbines into high-quality, highly available standardized data, providing reliable data support for the performance verification and intelligent analysis of gas turbines.

[0113] Figure 3 is a block diagram of a multi-source data cleaning apparatus for gas turbine testing, according to an exemplary embodiment. Referring to Figure 3, the device includes a standardization unit 301, a decomposition unit 302, a noise reduction unit 303, a completion unit 304, an alignment unit 305, and a verification unit 306.

[0114] The standardization unit 301 is used to acquire raw test data collected from multiple heterogeneous subsystems during gas turbine testing, and to standardize the raw test data to obtain standardized data.

[0115] Decomposition unit 302 is used to perform multi-scale decomposition on standardized data to obtain low-frequency approximation coefficients corresponding to the effective signal and high-frequency detail coefficients corresponding to the noise signal.

[0116] The noise reduction unit 303 is used to extract features from low-frequency approximation coefficients and high-frequency detail coefficients to obtain multi-dimensional feature vectors, and to identify and remove noisy data in the standardized data based on the multi-dimensional feature vectors.

[0117] The completion unit 304 is used to complete the missing values of standardized data using a completion model that integrates variational autoencoder and generative adversarial network.

[0118] Alignment unit 305 is used to perform initial temporal matching of standardized data through a sliding window, and to correct and align the timestamp deviation of standardized data frames within the sliding window based on a long short-term memory network model.

[0119] The verification unit 306 is used to perform quality verification on the standardized data that has completed noise removal, missing value completion and timestamp alignment. If the quality verification passes, it is determined that the cleaning of the original experimental data is complete.

[0120] In some embodiments of this disclosure, the decomposition unit 302 may specifically be used to decompose the standardized data, using the db4 wavelet basis, into one layer of low-frequency approximation coefficients corresponding to the effective signal and five layers of high-frequency detail coefficients corresponding to the noise signal.

[0121] In some embodiments of this disclosure, the noise reduction unit 303 may specifically be used for:

[0122] The five high-frequency detail coefficients obtained from the decomposition are concatenated with the one low-frequency approximation coefficient to generate a multi-dimensional feature vector.

[0123] The multidimensional feature vector is input into a pre-defined convolutional neural network for feature mining, and the target feature vector output by the convolutional neural network is obtained.

[0124] The target feature vector is input into a pre-trained isolation forest algorithm to identify and remove noisy data based on a preset outlier determination threshold.

[0125] In some embodiments of this disclosure, the outlier determination threshold is obtained by training and optimizing the isolation forest algorithm based on gas turbine test sample data covering the entire process of startup, steady state, variable operating conditions, and shutdown.

[0126] In some embodiments of this disclosure, the completion unit 304 may specifically be used for:

[0127] The K-nearest neighbor algorithm is used, with the real-time load, atmospheric temperature, atmospheric pressure and shaft speed of the gas turbine as operating condition matching features. Complete data segments with operating condition similarity higher than a preset threshold with the standardized data are selected from the historical complete test database to construct a similar operating condition reference set.

[0128] A completion model that deeply integrates a variational autoencoder and a generative adversarial network is constructed, and the completion model is pre-trained using the similar operating condition reference set.

[0129] The missing data to be completed is input into the pre-trained completion model. The variational autoencoder is used to learn the physical correlation between the data and generate candidate completion data. The generative adversarial network is used to judge and optimize the candidate completion data and output the target completion data that conforms to the operating mechanism of the gas turbine.

[0130] In some embodiments of this disclosure, the similar operating condition reference set includes the coupling relationship between compressor pressure ratio and flow rate in a gas turbine, heat transfer data of turbine inlet temperature and exhaust temperature, and correlation characteristic data of vibration amplitude and rotational speed.
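
The operating condition matching step that builds the reference set can be sketched as below; the completion model itself (variational autoencoder plus generative adversarial network) is not shown. The field names, the min-max scaling, the similarity mapping 1 / (1 + distance), and the values of `k` and `min_similarity` are all assumptions made for illustration.

```python
import math

# Hypothetical field names for the four operating condition matching features.
FEATURES = ("load_mw", "ambient_temp_c", "ambient_pressure_kpa", "shaft_speed_rpm")


def scale(value, lo, hi):
    """Min-max scale a value using the range observed in the history."""
    return 0.0 if hi == lo else (value - lo) / (hi - lo)


def similarity(a, b, ranges):
    """Map the scaled Euclidean distance to a similarity in (0, 1]."""
    dist = math.sqrt(sum(
        (scale(a[f], *ranges[f]) - scale(b[f], *ranges[f])) ** 2 for f in FEATURES))
    return 1.0 / (1.0 + dist)


def build_reference_set(query, history, k=3, min_similarity=0.8):
    """Keep the k nearest historical segments whose similarity exceeds the threshold."""
    ranges = {f: (min(h[f] for h in history), max(h[f] for h in history))
              for f in FEATURES}
    scored = sorted(((similarity(query, h, ranges), h) for h in history),
                    key=lambda pair: pair[0], reverse=True)
    return [h for s, h in scored[:k] if s > min_similarity]


# Hypothetical complete segments from a historical test database.
history = [
    {"id": "A", "load_mw": 250, "ambient_temp_c": 15, "ambient_pressure_kpa": 101.3, "shaft_speed_rpm": 3000},
    {"id": "B", "load_mw": 248, "ambient_temp_c": 16, "ambient_pressure_kpa": 101.0, "shaft_speed_rpm": 3000},
    {"id": "C", "load_mw": 120, "ambient_temp_c": 30, "ambient_pressure_kpa": 99.0, "shaft_speed_rpm": 3000},
    {"id": "D", "load_mw": 0, "ambient_temp_c": 5, "ambient_pressure_kpa": 102.0, "shaft_speed_rpm": 600},
]
query = {"load_mw": 249, "ambient_temp_c": 15.5, "ambient_pressure_kpa": 101.2, "shaft_speed_rpm": 3000}
reference_set = build_reference_set(query, history)
```

In this example only segments A and B are close enough to the query condition to enter the reference set; segment C is among the three nearest neighbors but falls below the similarity threshold.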

[0131] In some embodiments of this disclosure, the alignment unit 305 may specifically be used for:

[0132] A fixed-width sliding window is used to perform initial temporal matching of the standardized data to be aligned, and to establish the temporal correspondence between the standardized data of different heterogeneous subsystems.

[0133] The data in the sliding window is input into the trained time series prediction model so that the time series prediction model can predict the numerical characteristics of the low-frequency system at the corresponding subdivided time nodes based on the data change trend of the high-frequency system, and correct the fixed deviation and random jitter deviation of the timestamps of each heterogeneous system.

[0134] The trained time series prediction model is obtained by training a long short-term memory network, using historical synchronous calibration test datasets as the training set, to learn the sampling frequency differences and signal transmission link delay patterns of the various heterogeneous acquisition systems.
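
To make the two timestamp error components concrete, the sketch below estimates a fixed deviation and a random jitter from a calibration set using plain statistics. This is a deliberately simple stand-in and not the long short-term memory network described above. The function names and the calibration values are hypothetical, and only the fixed component is removed.

```python
import statistics


def estimate_offset_and_jitter(reference_ts, observed_ts):
    """Fixed deviation = mean timestamp error; jitter = spread around that mean."""
    deviations = [o - r for r, o in zip(reference_ts, observed_ts)]
    fixed = statistics.mean(deviations)
    jitter = statistics.pstdev(deviations)
    return fixed, jitter


def correct_timestamps(observed_ts, fixed_offset):
    """Remove the fixed deviation only; the random jitter is left in place."""
    return [t - fixed_offset for t in observed_ts]


# Hypothetical synchronous calibration data: true event times in seconds and
# the times reported by one acquisition system (25 ms link delay plus jitter).
reference = [0.0, 0.1, 0.2, 0.3, 0.4]
noise = [0.001, -0.001, 0.002, -0.002, 0.0]
observed = [r + 0.025 + e for r, e in zip(reference, noise)]

fixed, jitter = estimate_offset_and_jitter(reference, observed)
corrected = correct_timestamps(observed, fixed)
```

After the fixed offset is subtracted, the remaining error on each timestamp is the jitter term alone, which is the part a learned model would have to predict from the data trend.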

[0135] In some embodiments of this disclosure, the verification unit 306 may specifically be used for:

[0136] For standardized data that has undergone noise removal, missing value completion, and timestamp alignment, calculate the data integrity index, the time sequence consistency index, and the deviation index from the theoretical value of the preset working condition.

[0137] If any of the data integrity index, time series consistency index, or deviation index values does not meet its preset conditions, the quality verification is determined to fail, and the process returns to the step of feature extraction of low-frequency approximation coefficients and high-frequency detail coefficients until the quality verification passes.
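
The three-index check can be sketched as follows. The disclosure does not define the index formulas or the preset conditions, so the definitions below (share of present samples, share of on-period intervals, mean relative deviation) and the `LIMITS` values are assumptions chosen only to show the pass/fail logic.

```python
import math


def is_present(v):
    """A sample counts as present if it is neither None nor NaN."""
    return v is not None and not math.isnan(v)


def integrity_index(values):
    """Share of samples that are present."""
    return sum(1 for v in values if is_present(v)) / len(values)


def timing_consistency_index(timestamps, period, tolerance):
    """Share of consecutive intervals within tolerance of the nominal period."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return sum(1 for g in gaps if abs(g - period) <= tolerance) / len(gaps)


def deviation_index(values, theoretical):
    """Mean relative deviation of present samples from the theoretical value."""
    present = [v for v in values if is_present(v)]
    return sum(abs(v - theoretical) for v in present) / (len(present) * abs(theoretical))


def quality_check(values, timestamps, period, tolerance, theoretical, limits):
    """Return (passed, indices); any single failing index fails the check."""
    indices = {
        "integrity": integrity_index(values),
        "timing": timing_consistency_index(timestamps, period, tolerance),
        "deviation": deviation_index(values, theoretical),
    }
    passed = (indices["integrity"] >= limits["integrity"]
              and indices["timing"] >= limits["timing"]
              and indices["deviation"] <= limits["deviation"])
    return passed, indices


# Hypothetical preset conditions and a hypothetical exhaust temperature channel.
LIMITS = {"integrity": 0.95, "timing": 0.95, "deviation": 0.01}
bad_ok, bad_idx = quality_check(
    [500.0, 502.0, None, 498.0, 501.0], [0.0, 0.1, 0.2, 0.3, 0.45],
    period=0.1, tolerance=0.01, theoretical=500.0, limits=LIMITS)
good_ok, good_idx = quality_check(
    [500.0, 502.0, 499.0, 498.0, 501.0], [0.0, 0.1, 0.2, 0.3, 0.4],
    period=0.1, tolerance=0.01, theoretical=500.0, limits=LIMITS)
# On failure, the pipeline would return to the feature extraction step and repeat.
```

The first data set fails because one sample is missing and one interval is off-period; the second passes all three conditions.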

[0138] In some embodiments of this disclosure, the alignment unit 305 may specifically be used for:

[0139] For each heterogeneous subsystem, the total number of continuous data points that should be included within the preset time window width is calculated based on the sampling frequency of the heterogeneous subsystem; the number of continuous data points is the product of the sampling frequency and the preset time window width.

[0140] For each time window, a number of consecutive data points equal to the total number calculated for each heterogeneous subsystem is extracted from the standardized data of that heterogeneous subsystem. Based on the data point sets of the different heterogeneous subsystems extracted within the same time window, a temporal correspondence is established.
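
The window bookkeeping described above, where the number of points per window is the product of the sampling frequency and the window width, can be sketched as follows. The subsystem names, sampling frequencies, and the assumption that all streams start at the same instant are illustrative only.

```python
def points_per_window(sampling_hz, window_s):
    """Number of consecutive points per window = sampling frequency x window width."""
    return round(sampling_hz * window_s)


def window_slices(streams, sampling_hz, window_s):
    """Cut each subsystem's stream into per-window point sets.

    streams: dict of subsystem name -> list of samples (assumed to start together).
    sampling_hz: dict of subsystem name -> sampling frequency in Hz.
    Returns one dict per window mapping subsystem name -> that window's points.
    """
    counts = {name: points_per_window(sampling_hz[name], window_s) for name in streams}
    n_windows = min(len(streams[name]) // counts[name] for name in streams)
    return [
        {name: streams[name][w * counts[name]:(w + 1) * counts[name]] for name in streams}
        for w in range(n_windows)
    ]


# Hypothetical subsystems: vibration sampled at 1000 Hz, temperature at 10 Hz.
streams = {"vibration": list(range(2000)), "temperature": list(range(20))}
rates = {"vibration": 1000, "temperature": 10}
windows = window_slices(streams, rates, window_s=0.5)
```

Here each 0.5 s window pairs 500 vibration points with 5 temperature points, which gives the coarse correspondence that the prediction model then refines.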

[0141] In some embodiments of this disclosure, the raw test data includes environmental operating condition data, thermodynamic performance data, mechanical health data, and combustion state data of the gas turbine.

[0142] In some embodiments of this disclosure, environmental operating condition data include atmospheric pressure, atmospheric temperature, and ambient relative humidity; thermodynamic performance data include compressor inlet flow rate, compressor outlet pressure, turbine inlet gas temperature, turbine exhaust temperature, fuel supply flow rate, and compressor inlet guide vane opening; mechanical health data include rotor radial vibration amplitude, rotor axial displacement, support bearing temperature, real-time shaft speed, and lubricating oil supply pressure; and combustion state data include combustion chamber flame tube wall temperature, combustion pulsation pressure, flue gas oxygen content, and nitrogen oxide emission concentration.

[0143] Regarding the apparatus in the above embodiments, the specific manner in which each module performs its operation has been described in detail in the embodiments related to the method, and will not be elaborated upon here.

[0144] The gas turbine test multi-source data cleaning device proposed in this disclosure acquires and standardizes the raw test data collected from multiple heterogeneous subsystems in the gas turbine test, unifying the format basis of the multi-source data. The standardized data is decomposed into low-frequency approximation coefficients corresponding to the effective signal and high-frequency detail coefficients corresponding to the noise signal, achieving coarse separation between the effective and noise signals. Multi-dimensional feature vectors are extracted from the low-frequency approximation coefficients and high-frequency detail coefficients, and noise data in the standardized data is identified and removed based on these feature vectors. A missing value completion model combining a variational autoencoder and a generative adversarial network is used to complete the standardized data, generating high-fidelity completed data that conforms to the gas turbine operating mechanism. A sliding window is used for initial temporal matching of the standardized data, and a long short-term memory network model is used to correct and align the timestamp deviation of the standardized data frames within the sliding window, solving the problem of time asynchrony in multi-source heterogeneous data. A closed-loop quality control mechanism is formed by performing quality verification on the standardized data after noise removal, missing value completion, and timestamp alignment, and determining that data cleaning is complete upon successful verification. In this way, the multi-source heterogeneous raw test data of the gas turbine is transformed into high-quality, highly available standardized data, providing reliable data support for the performance verification and intelligent analysis of gas turbines.

[0145] Figure 4 is a block diagram illustrating an apparatus for a multi-source data cleaning method for gas turbine testing, according to an exemplary embodiment. For example, apparatus 400 may be an electronic device, such as a mobile phone, computer, digital broadcasting terminal, messaging device, tablet device, personal digital assistant, etc.

[0146] Referring to Figure 4, the device 400 may include one or more of the following components: a processing component 402, a memory 404, a power component 406, a multimedia component 408, an audio component 410, an input / output (I / O) interface 412, a sensor component 414, and a communication component 416.

[0147] Processing component 402 typically controls the overall operation of device 400, such as operations associated with display, telephone calls, data communication, camera operation, and recording. Processing component 402 may include one or more processors 420 to execute instructions to perform all or part of the steps of the methods described above. Furthermore, processing component 402 may include one or more modules to facilitate interaction between processing component 402 and other components. For example, processing component 402 may include a multimedia module to facilitate interaction between multimedia component 408 and processing component 402.

[0148] Memory 404 is configured to store various types of data to support the operation of device 400. Examples of such data include instructions for any application or method operating on device 400, contact data, phonebook data, messages, pictures, videos, etc. Memory 404 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic storage, flash memory, magnetic disk, or optical disk.

[0149] The power supply component 406 provides power to the various components of the device 400. The power supply component 406 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power to the device 400.

[0150] Multimedia component 408 includes a screen that provides an output interface between the device 400 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touchscreen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundaries of the touch or swipe action but also the duration and pressure associated with the touch or swipe operation. In some embodiments, multimedia component 408 includes a front-facing camera and / or a rear-facing camera. When the device 400 is in an operating mode, such as a shooting mode or a video mode, the front-facing camera and / or the rear-facing camera may receive external multimedia data. Each front-facing camera and rear-facing camera may be a fixed optical lens system or have focal length and optical zoom capabilities.

[0151] Audio component 410 is configured to output and / or input audio signals. For example, audio component 410 includes a microphone (MIC) configured to receive external audio signals when device 400 is in an operating mode, such as call mode, recording mode, and voice recognition mode. The received audio signals may be further stored in memory 404 or transmitted via communication component 416. In some embodiments, audio component 410 also includes a speaker for outputting audio signals.

[0152] I / O interface 412 provides an interface between processing component 402 and peripheral interface modules, such as keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to, home buttons, volume buttons, power buttons, and lock buttons.

[0153] Sensor assembly 414 includes one or more sensors for providing status assessments of various aspects of device 400. For example, sensor assembly 414 may detect the on / off state of device 400, the relative positioning of components such as the display and keypad of device 400, changes in the position of device 400 or a component of device 400, the presence or absence of user contact with device 400, the orientation or acceleration / deceleration of device 400, and temperature changes of device 400. Sensor assembly 414 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. Sensor assembly 414 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, sensor assembly 414 may also include an accelerometer, a gyroscope, a magnetometer, a pressure sensor, or a temperature sensor.

[0154] Communication component 416 is configured to facilitate wired or wireless communication between device 400 and other devices. Device 400 can access wireless networks based on communication standards, such as WiFi, 2G, or 3G, or combinations thereof. In one exemplary embodiment, communication component 416 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, communication component 416 also includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

[0155] In an exemplary embodiment, the apparatus 400 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components to perform the methods described above.

[0156] In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, such as a memory 404 including instructions, which can be executed by a processor 420 of the device 400 to perform the above-described method. For example, the non-transitory computer-readable storage medium may be a ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, and optical data storage device, etc.

[0157] In an exemplary embodiment, a computer program product is also provided, including a computer program that implements the above-described method when executed by a processor 420 of the device 400.

[0158] Other embodiments of this disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of this disclosure that follow the general principles of this disclosure and include common knowledge or customary techniques in the art not disclosed herein. The specification and examples are to be considered exemplary only, and the true scope and spirit of this disclosure are indicated by the following claims.

[0159] It should be understood that this disclosure is not limited to the precise structures described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope. The scope of this disclosure is limited only by the appended claims.

Claims

1. A method for cleaning multi-source data from gas turbine testing, characterized in that it comprises: The raw test data collected from multiple heterogeneous subsystems during the gas turbine test are obtained, and the raw test data are standardized to obtain standardized data. The standardized data is subjected to multi-scale decomposition to obtain the low-frequency approximation coefficients corresponding to the effective signal and the high-frequency detail coefficients corresponding to the noise signal. Feature extraction is performed on the low-frequency approximation coefficients and high-frequency detail coefficients to obtain multi-dimensional feature vectors. Noise data in the standardized data is identified and removed based on the multi-dimensional feature vectors. Missing values in the standardized data are completed using a completion model that combines a variational autoencoder and a generative adversarial network. The standardized data is initially matched temporally using a sliding window, and the standardized data frames within the sliding window are corrected for timestamp deviations and aligned based on a long short-term memory network model. This includes: using a fixed-width sliding window to perform initial temporal matching on the standardized data to be aligned, and establishing the temporal correspondence between standardized data belonging to different heterogeneous subsystems. The data in the sliding window is input into the trained time series prediction model so that the time series prediction model can predict the numerical characteristics of the low-frequency system at the corresponding subdivided time nodes based on the data change trend of the high-frequency system, and correct the fixed deviation and random jitter deviation of the timestamps of each heterogeneous system. The trained time series prediction model is obtained by training a long short-term memory network, using a historical synchronous calibration test dataset as the training set, to learn the sampling frequency differences and signal transmission link delay patterns of the various heterogeneous acquisition systems. The standardized data that has undergone noise removal, missing value completion, and timestamp alignment is subjected to quality verification. If the quality verification passes, the cleaning of the raw test data is determined to be complete.

2. The method for cleaning multi-source data from gas turbine testing according to claim 1, characterized in that the step of performing multi-scale decomposition on the standardized data to obtain low-frequency approximation coefficients corresponding to the effective signal and high-frequency detail coefficients corresponding to the noise signal includes: The standardized data is decomposed, using the db4 wavelet basis, into one layer of low-frequency approximation coefficients corresponding to the effective signal and five layers of high-frequency detail coefficients corresponding to the noise signal.

3. The method for cleaning multi-source data from gas turbine testing according to claim 2, characterized in that the step of extracting features from low-frequency approximation coefficients and high-frequency detail coefficients to obtain multi-dimensional feature vectors, and identifying and removing noise data from the standardized data based on these multi-dimensional feature vectors, includes: The five high-frequency detail coefficients obtained from the decomposition are concatenated with the one low-frequency approximation coefficient to generate a multi-dimensional feature vector. The multi-dimensional feature vector is input into a preset convolutional neural network for feature mining to obtain the target feature vector output by the convolutional neural network. The target feature vector is input into a pre-trained isolation forest algorithm, which identifies and removes noise data based on a preset outlier determination threshold.

4. The method for cleaning multi-source data from gas turbine testing according to claim 3, characterized in that the outlier determination threshold is obtained by training and optimizing the isolation forest algorithm based on gas turbine test sample data covering the entire process of startup, steady state, variable operating conditions, and shutdown.

5. The method for cleaning multi-source data from gas turbine testing according to claim 1, characterized in that the missing value completion process for the standardized data using the completion model combining a variational autoencoder and a generative adversarial network includes: The K-nearest neighbor algorithm is used, with the real-time load, atmospheric temperature, atmospheric pressure and shaft speed of the gas turbine as operating condition matching features. Complete data segments whose operating condition similarity to the standardized data is higher than a preset threshold are selected from the historical complete test database to construct a similar operating condition reference set. A completion model that deeply integrates a variational autoencoder and a generative adversarial network is constructed, and the completion model is pre-trained using the similar operating condition reference set. The missing data to be completed is input into the pre-trained completion model. The variational autoencoder is used to learn the physical correlation between the data and generate candidate completion data. The generative adversarial network is used to judge and optimize the candidate completion data and output the target completion data that conforms to the operating mechanism of the gas turbine.

6. The method for cleaning multi-source data from gas turbine testing according to claim 5, characterized in that, The similar operating condition reference set includes the coupling relationship between compressor pressure ratio and flow rate in a gas turbine, heat transfer data of turbine inlet temperature and exhaust temperature, and correlation characteristic data of vibration amplitude and rotational speed.

7. The method for cleaning multi-source data from gas turbine testing according to claim 1, characterized in that, The quality verification of standardized data that has undergone noise removal, missing value completion, and timestamp alignment includes: For standardized data that has undergone noise removal, missing value completion, and timestamp alignment, calculate the data integrity index, the time sequence consistency index, and the deviation index from the theoretical value of the preset working condition. If any of the data integrity index, time series consistency index, and deviation index values does not meet its preset conditions, the quality verification is determined to fail, and the process returns to the step of feature extraction of low-frequency approximation coefficients and high-frequency detail coefficients until the quality verification passes.

8. The method for cleaning multi-source data from gas turbine testing according to claim 1, characterized in that the process of using a fixed-width sliding window to perform initial temporal matching of the standardized data to be aligned, and establishing the temporal correspondence between standardized data belonging to different heterogeneous subsystems, includes: For each heterogeneous subsystem, the total number of consecutive data points that should be included within a preset time window width is calculated based on the sampling frequency of the heterogeneous subsystem; the number of consecutive data points is the product of the sampling frequency and the preset time window width. For each time window, a number of consecutive data points equal to the total number calculated for each heterogeneous subsystem is extracted from the standardized data of that heterogeneous subsystem. Based on the sets of data points from different heterogeneous subsystems extracted within the same time window, the temporal correspondence is established.

9. The method for cleaning multi-source data from gas turbine testing according to claim 1, characterized in that the raw test data includes environmental operating condition data, thermodynamic performance data, mechanical health data, and combustion state data of the gas turbine.

10. The method for cleaning multi-source data from gas turbine testing according to claim 9, characterized in that, The environmental operating condition data includes atmospheric pressure, atmospheric temperature, and ambient relative humidity; the thermodynamic performance data includes compressor inlet flow rate, compressor outlet pressure, turbine inlet gas temperature, turbine exhaust temperature, fuel supply flow rate, and compressor inlet guide vane opening; the mechanical health data includes rotor radial vibration amplitude, rotor axial displacement, support bearing temperature, real-time shaft speed, and lubricating oil supply pressure; and the combustion state data includes combustion chamber flame tube wall temperature, combustion pulsation pressure, flue gas oxygen content, and nitrogen oxide emission concentration.

11. A multi-source data cleaning device for gas turbine testing, characterized in that it comprises: The standardization unit is used to acquire raw test data collected from multiple heterogeneous subsystems during gas turbine testing, and to standardize the raw test data to obtain standardized data. The decomposition unit is used to perform multi-scale decomposition on the standardized data to obtain the low-frequency approximation coefficients corresponding to the effective signal and the high-frequency detail coefficients corresponding to the noise signal. The denoising unit is used to extract features from the low-frequency approximation coefficients and high-frequency detail coefficients to obtain a multi-dimensional feature vector, and to identify and remove noise data in the standardized data based on the multi-dimensional feature vector. The completion unit is used to perform missing value completion processing on the standardized data using a completion model that integrates a variational autoencoder and a generative adversarial network. The alignment unit is used to perform initial temporal matching of the standardized data through a sliding window, and to correct and align the timestamp deviation of the standardized data frames within the sliding window based on a long short-term memory network model. This includes: using a fixed-width sliding window to perform initial temporal matching of the standardized data to be aligned, and establishing the temporal correspondence between the standardized data belonging to different heterogeneous subsystems. The data in the sliding window is input into the trained time series prediction model so that the time series prediction model can predict the numerical characteristics of the low-frequency system at the corresponding subdivided time nodes based on the data change trend of the high-frequency system, and correct the fixed deviation and random jitter deviation of the timestamps of each heterogeneous system. The trained time series prediction model is obtained by training a long short-term memory network, using a historical synchronous calibration test dataset as the training set, to learn the sampling frequency differences and signal transmission link delay patterns of the various heterogeneous acquisition systems. The verification unit is used to perform quality verification on the standardized data that has completed noise removal, missing value completion, and timestamp alignment. If the quality verification passes, it is determined that the cleaning of the raw test data is complete.

12. An electronic device, characterized in that it comprises: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method as described in any one of claims 1 to 10.

13. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the method as described in any one of claims 1 to 10.

14. A computer program product, comprising a computer program, characterized in that the computer program, when executed by a processor, implements the method as described in any one of claims 1 to 10.