A single-cell trajectory inference method based on adaptive feature selection
By using an adaptive feature selection method, combined with dual-dimensional gene importance assessment and dynamic weight fusion, the number of features is intelligently determined, solving the problems of single feature selection criteria and fixed weights in existing technologies. This achieves high-precision single-cell trajectory inference, reduces model complexity, and improves performance.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- LUDONG UNIVERSITY
- Filing Date
- 2026-04-13
- Publication Date
- 2026-06-19
AI Technical Summary
Existing single-cell trajectory inference methods suffer from problems in feature selection, such as single standards, fixed weights, rigid number of features, and lack of performance feedback loops. This leads to feature selection results that are biased towards one aspect of information, lose key features, and fail to fully characterize the biological importance of genes.
An adaptive feature selection approach is adopted, which constructs a performance-driven closed-loop optimization architecture through two-dimensional gene importance assessment, dynamic weight fusion, intelligent inflection point detection, and adaptive reconstruction of model dimensions. This architecture automatically determines the optimal number of features and improves model performance.
It achieves high-precision, interpretable, and adaptive single-cell trajectory inference, and can automatically select the optimal feature subset according to the characteristics of different datasets, reducing model complexity and improving trajectory inference accuracy.
Smart Images

Figure CN122020104B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of bioinformatics, specifically relating to a single-cell trajectory inference method based on adaptive feature selection. Background Technology
[0002] Single-cell trajectory inference is a core analytical task for revealing the dynamic changes in cell differentiation, lineage commitment, and disease progression. Feature selection is a crucial preliminary step in single-cell trajectory inference, directly impacting model performance and biological interpretability. Existing single-cell trajectory inference methods, such as Monocle, Dyno, VITAE, and scVAE, suffer from the following technical shortcomings in feature selection: ① Single feature selection criteria: They use highly variable genes as input features, ignoring genes with relatively stable expression but crucial for cell differentiation trajectories; ② Fixed feature weights: The weights of highly variable genes and trajectory-related genes remain fixed throughout the analysis, ignoring the characteristics of different datasets and failing to dynamically optimize during model training; ③ Fixed number of features: They ignore the complexity and information distribution of different datasets, making it impossible to determine the number of features based on the inherent structure and characteristics of the data; ④ Lack of performance-driven closed-loop optimization: Existing feature selection is usually performed independently as a preprocessing step, neglecting the feedback relationship between feature selection results and subsequent trajectory inference performance; ⑤ Incomplete assessment of feature importance: Using a single indicator to assess gene importance makes it difficult to comprehensively characterize the biological importance of genes, leading to feature selection results biased towards one aspect of information and the loss of other key features. Therefore, there is an urgent need in this field for a single-cell trajectory inference method that can adaptively fuse multi-dimensional gene importance information, dynamically adjust feature weights, intelligently determine the number of features, and possess performance-driven closed-loop optimization capabilities. Summary of the Invention
[0003] This invention proposes a single-cell trajectory inference method based on adaptive feature selection, aiming to solve the problems of single feature selection criteria, fixed weights, rigid number of features, and lack of performance feedback loop in existing technologies. The specific technical solution includes the following five steps:
[0004] Step 1: Data Preprocessing and Two-Dimensional Gene Importance Assessment: Input single-cell RNA sequencing data, and obtain an initial gene expression matrix through standardization, logarithmic transformation, and screening for highly variable genes; adopt a two-dimensional gene assessment strategy to calculate the scores of highly variable genes based on expression variability and the trajectory importance scores related to differentiation trajectories, and output two types of score vectors.
[0005] Step 2, Dynamic Weight Fusion and Nonlinear Enhancement: Using the score vector output from Step 1 as input, the fusion weights are adaptively adjusted through a performance feedback closed loop. A nonlinear enhancement term is introduced to highlight key factors, and the output is a comprehensive score vector.
[0006] Step 3, Adaptive Feature Quantity Determination: Using the score vector output from Step 2 as input, an intelligent inflection point detection algorithm is employed. Through descending sorting, smoothing filtering, inflection point detection, and cumulative variance explanation rate calculation, the optimal number of features is automatically determined, and the output is a feature selection mask.
[0007] Step 4, Model Dimension Adaptive Reconstruction and Training: Using the feature selection mask output in Step 3 as input, extract a feature subset based on the mask to construct new input data. Save the original variational autoencoder parameters, reconstruct the variational autoencoder model based on the new dimension, complete training on the new feature subset, and output a model adapted to the new dimension.
[0008] Step 5, Performance-Driven Iterative Optimization and Trajectory Inference: Using the model output from Step 4 as input, perform multiple rounds of iterative optimization. In each round, perform posterior estimation, backbone network construction, and trajectory inference. Adjust the fusion weights from Step 2 based on performance trends, record the corresponding feature selection results, and finally output the trajectory inference result under the optimal feature set.
[0009] A single-cell trajectory inference method based on adaptive feature selection, the implementation process of step 1 is as follows:
[0010] The system receives single-cell RNA sequencing data files in h5 or h5ad format. Data preprocessing includes: total count normalization, logarithmic transformation, identification of highly variable genes, and data scaling using Scanpy. Highly variable gene scores are calculated primarily using the ranking values generated by Scanpy, mapping the rankings to [the appropriate values] using a ranking transformation formula. Within a given interval, higher rankings receive higher scores; if ranking information is unavailable, normalized dispersion scores are used; if dispersion information is also unavailable, the coefficient of variation is calculated based on the gene expression mean and variance, and all scores are normalized to a given value. The interval yields a highly variable gene score vector. The calculation of trajectory importance scores depends on whether there is pseudo-time information or a true label. If pseudo-time information is present, the absolute value of the correlation between gene expression and pseudo-time is calculated using the Spearman rank correlation coefficient. If a true cell type label is present, one-way ANOVA is used. The value is used as the inter-group difference score; if no trajectory information is available, the standard deviation of gene expression is used as a proxies, and all scores are normalized to 0. Interval .
[0011] A single-cell trajectory inference method based on adaptive feature selection, the implementation process of step 2 is as follows:
[0012] The core of the dynamic weight fusion mechanism lies in adaptively adjusting the fusion weights of the two classes of scores based on model performance feedback. The system maintains two dynamic weight parameters. and The values represent the contribution of highly variable gene scores and trajectory importance scores, respectively, with their sum always equal to 1. Both are initially set to 0.5 to ensure equal importance for both types of scores in the initial state. The fusion calculation employs a strategy combining linear weighting and nonlinear enhancement. First, the linear weighting result is calculated: the highly variable gene score vector and the trajectory importance score vector are multiplied by their respective weights and then summed to obtain the preliminary fusion score. Based on this, a nonlinear enhancement term is introduced, squaring the preliminary fusion score, multiplying it by a fixed coefficient, and then summing the results to form the final comprehensive fusion score. Dynamic weight adjustment relies on a performance feedback loop. After each iteration, the system calculates a comprehensive score based on multi-dimensional performance indicators. By comparing the comprehensive score of the current iteration with that of the previous iteration, the performance change rate is calculated. A tiered adjustment strategy is used to update the fusion weights: when performance significantly improves, the weights are enhanced in the advantageous direction based on the current weights; when performance slightly improves, random perturbations are added for fine-tuning; when performance is stable or slightly decreases, random exploration is conducted within a small range; when performance significantly decreases, the weights revert to the historical best weights. Through this performance-driven closed-loop optimization mechanism, the fusion weights can adaptively adapt to the characteristics of different datasets, dynamically optimize during training, and ultimately output a fusion score vector that best reflects the comprehensive importance of genes.
[0013] A single-cell trajectory inference method based on adaptive feature selection, the implementation process of step 3 is as follows:
[0014] The adaptive feature quantity determination uses the fusion score vector output from step 2 as input. The system first sorts the fusion scores in descending order to obtain an ordered sequence, placing genes with higher scores first. The sorted sequence is then smoothed to remove noise interference. Initial smoothing is performed using a sliding window averaging method, with the window size dynamically adjusted based on the total number of genes to ensure the smoothing effect is adapted to the data scale. A second smoothing is then performed using a Savitzky-Golay filter, which effectively removes high-frequency noise while preserving the overall trend of the curve. After smoothing, the system locates the position with the largest curvature change through inflection point detection: the first and second differences of the smoothed sequence are calculated, and the minimum point of the second difference corresponds to the inflection point where the score decreases the fastest, i.e., the candidate inflection point. The system also locates the position where the information contribution rate reaches a threshold using the cumulative variance explained rate: the proportion of the cumulative sum of the sorted sequences to the total sum is calculated, and the position where the cumulative proportion first reaches a preset threshold is located, i.e., the information contribution inflection point. The system integrates the inflection point detection results and the information contribution inflection point, combined with preset minimum and maximum feature number ranges, to automatically determine the optimal feature quantity. After determining the number of features, the system generates a feature selection mask, setting the corresponding number of gene positions with the highest fusion scores to 1, and setting the remaining positions to 0. This mask will be used for model dimensionality reconstruction and data subset extraction in subsequent steps.
[0015] A single-cell trajectory inference method based on adaptive feature selection, the implementation process of step 4 is as follows:
[0016] The adaptive reconstruction of the model dimensions takes the original variational autoencoder model configuration and feature selection mask as input. The system first extracts a new subset of features from the original gene expression matrix based on the feature selection mask, constructing a new input data matrix with reduced dimensions. The core of reconstructing the variational autoencoder model lies in dynamically adjusting the dimensions of the input and output layers to adapt to the new number of features while maintaining the overall model architecture. The specific reconstruction process is as follows: The pre-trained weights of the original model are saved; a new encoder is created, adjusting the input layer dimensions to the new number of features, while the dimensions of subsequent hidden layers remain consistent with the original model; a new decoder is created, adjusting the output layer dimensions to the new number of features, while the hidden layer dimensions correspond inversely to the original model; the latent space layer is reinitialized, keeping the number of discrete states and the dimensions of latent variables unchanged. The weight transfer strategy is handled according to the characteristics of different layers: for hidden and latent variable layers whose dimensions have not changed, the pre-trained weights of the original model are directly loaded; for input and output layers whose dimensions have changed, random initialization is used, and partial mapping from the corresponding positions of the original weights can be attempted to retain learned information. After reconstruction, the training and testing datasets are rebuilt using the new feature matrices. The model undergoes two phases: pre-training and formal training. An adaptive learning rate optimizer is employed, and an early stopping strategy is introduced to prevent overfitting. The final output is a model adapted to the new feature dimensions.
[0017] A single-cell trajectory inference method based on adaptive feature selection, the implementation process of step 5 is as follows:
[0018] The performance-driven iterative optimization starts with the trained model output in step 4 and proceeds through multiple iterations. The complete process for each iteration is as follows: Posterior estimation is performed on the entire dataset using the current model; the probability distribution of cell states and the posterior distribution of cell positions are calculated using Monte Carlo sampling; a backbone network is constructed based on the posterior distribution using a specified method; low-confidence edges are filtered out using a threshold, and the maximum spanning tree is extracted as needed to ensure an acyclic structure; a directed acyclic graph is constructed based on the backbone network and the specified root node, and cell pseudo-time is calculated. Multi-dimensional performance metrics are calculated. The fusion weights are adjusted according to the weight update strategy described in step 2 based on the changing trend of the comprehensive performance score. Based on the updated weights, steps 2 to 4 are re-executed for a new round of feature selection and model reconstruction training. The performance metrics, the number of selected features, and the fusion weights are recorded in the historical queue; if the current comprehensive performance is better than the historical best, the best performance record and the corresponding feature mask are updated. After the iteration ends, the final model trained based on the best feature set is output, and a complete trajectory inference result is generated, including the backbone network, cell pseudo-time, cell state allocation, and uncertainty estimation. The system also outputs a complete iteration history, including performance metrics, weight changes, and feature quantity changes for each round, for subsequent analysis and visualization.
[0019] This invention constructs a two-dimensional gene importance assessment system by automatically integrating gene expression variability and cell differentiation trajectory information, and introduces a dynamic weight adjustment mechanism to adaptively optimize the fusion weights of the two types of scores through a performance feedback closed-loop. Simultaneously, an intelligent inflection point detection algorithm is used to automatically determine the optimal number of features, and a model dimension adaptive reconstruction technique is employed to address the technical bottleneck of input dimension variations, ultimately forming a performance-driven closed-loop optimization architecture. This invention achieves high-precision, interpretable, and adaptive single-cell trajectory inference, automatically selecting the optimal feature subset based on the characteristics of different datasets, significantly reducing model complexity while maintaining or improving trajectory inference accuracy. Attached Figure Description
[0020] Figure 1 This is a diagram illustrating the overall architecture of a single-cell trajectory inference method based on adaptive feature selection.
[0021] Figure 2 This is a flowchart of the two-dimensional gene importance assessment and dynamic weight fusion module.
[0022] Figure 3 Flowchart for determining the number of adaptive features.
[0023] Figure 4 Flowchart for the model dimension adaptive reconstruction module.
[0024] Figure 5 Flowchart for the performance-driven iterative optimization and trajectory inference module. Detailed Implementation
[0025] The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments, but the scope of protection of the present invention is not limited thereto.
[0026] Implementation of a two-dimensional gene importance assessment. For example... Figure 1 and Figure 2 As shown, the two-dimensional gene importance assessment comprehensively evaluates gene importance through two parallel assessment branches. First, single-cell RNA sequencing data is read from h5 or h5ad format files. Scanpy is used to standardize and logarithmically transform the total count and screen for highly variable genes. By calculating the expression mean and variance of each gene, the 2000 genes with the highest degree of variation are selected as the initial feature pool. This number has been validated through extensive experiments to achieve the best balance between information coverage and computational complexity. Then, data scaling is performed, and the expression values of each gene are centered and standardized to eliminate the influence of differences in gene expression levels. For sample quality control, an upper limit of 10,000 cells is set, which is biologically sufficient to cover the major cell types of most tissues. A lower limit of 100 cells is set; samples below this threshold are considered invalid and directly filtered. At least 30-50 cells are required for each cell state to make reliable inferences; this number is rounded down to 100 as a safety threshold. The highly variable gene score assessment branch employs a priority strategy for calculation: First, it utilizes the rank information of highly variable genes, strictly ranging from 1 to 2000, with rank 1 representing the gene with the highest degree of variation. The score conversion uses a linear mapping formula. When no ranking exists, check for the presence of dispersion information, which reflects the degree of gene expression variation relative to the mean. If present, map the dispersion values to the mean using a min-max normalization method. Interval: ,in This is a very small constant added to prevent division by zero errors; when dispersion information is also unavailable, the mean and variance of each gene are calculated directly from the expression matrix, and expression variability is measured by the coefficient of variation, which is defined as the ratio of the standard deviation to the mean. After calculating the coefficient of variation, the same minimum-maximum normalization is performed. The trajectory score evaluation branch prioritizes the utilization of trajectory information from most direct to most indirect: first, it detects the presence of pseudo-temporal information. After initial model pre-training, it extracts continuous position estimates of cells along differentiation trajectories from the model's latent space. If pseudo-temporal information exists, a rank-based nonparametric correlation analysis method is used to calculate the correlation between the expression level of each gene and pseudo-time. This method first converts gene expression values and pseudo-time values into ranks (descending order), and then calculates the correlation coefficient between the two sets of ranks: ,in Represents cells Zhonggen i The amount of expression, For cells pseudo-time value, and Genes The mean of the expression rank and pseudo-time rank. The range of values for this coefficient is... Positive values indicate that gene expression increases with pseudotime, while negative values indicate that it decreases. The absolute value of the correlation coefficient is taken as the raw score of the gene's trajectory importance. If the pseudo-time value is absent, the presence of a true cell type label is checked. This label is typically stored as a categorical variable in the cell annotation information. If a true label exists and the number of label categories is greater than one, an analysis of variance-based method is used to assess gene expression differences among different cell types. The core idea of this method is to decompose the total variation in gene expression into between-group and within-group differences, and calculate the sum of squares between groups: Sum of squares within groups: Calculated based on between-group sum of squares and within-group sum of squares Statistic: ,in Number of cell types For the first The number of cells in the class as well as These represent the mean expression level for each cell type and the overall mean expression level across all cells, respectively. The total number of cells. For the degrees of freedom between groups, The degree of freedom within the group. A higher value indicates a more significant difference in gene expression across different cell types. If no prior information is available, the standard deviation of gene expression is used as a proxies for trajectory importance. The standard deviation is calculated using an unbiased estimation formula: This index reflects the overall degree of gene variation in a cell population and can serve as a reasonable substitute when no prior information is available. All calculated raw scores must be normalized to ensure that the final output trajectory score vector... With highly variable gene score vector They are of the same dimension.
[0027] Dynamic weight fusion implementation. For example... Figure 2 As shown, dynamic weight fusion achieves adaptive fusion of two types of scores, and its key lies in the dynamic adjustment mechanism of weights and the introduction of nonlinear enhancement. The fusion formula is: .in, and For dynamic weights, satisfying The specific parameter configuration for the weight update mechanism is as follows: The system maintains a performance history queue of length 5, using a first-in, first-out (FIFO) strategy to store the comprehensive scores of the most recent 5 rounds, which are used to calculate the moving average performance to smooth short-term fluctuations. After each iteration, the performance change rate is calculated. The weight update strategy is as follows: If (Significantly enhanced), strengthening the advantageous directions based on the current weight, that is, if If the direction of the highly variable gene is considered dominant, its score weight is increased by 0.05; otherwise, if the trajectory direction is considered dominant, its score weight is decreased by 0.05 (equivalent to increasing the trajectory weight). The updated weights should be limited to the range of 0.2 to 0.8 to prevent excessive weight bias towards a single direction, which could lead to an imbalance in feature selection. (Slightly improved) Fine-tuning is achieved by adding random perturbation. A Gaussian noise with a mean of 0 and a standard deviation of 0.05 is superimposed on the highly variable gene score weights, and the results are truncated to the range of 0.3 to 0.7. This mechanism introduces moderate randomness while maintaining the dominant direction, avoiding premature convergence to a local optimum; if (Stable or slightly decreasing), a small-scale random exploration is conducted, using the current highly variable gene score weights as the center, and uniform sampling is performed within intervals of 0.1 to the left and right, while ensuring that the sampling intervals fall entirely within the range of 0.3 to 0.7. This mechanism attempts to find potentially better weight combinations through limited random exploration; if... (Significant decline), revert to the historical best weights. The system maintains the historical best weights and their corresponding comprehensive performance scores. When a significant performance deterioration is detected, it immediately reverts to the previously best-performing weight configuration to prevent the model from continuing to deteriorate due to incorrect weight adjustments.
[0028] Adaptive feature quantity determination implementation. For example... Figure 3As shown, the number of features is automatically determined using an intelligent inflection point detection algorithm. The system first sorts the fusion scores in descending order to obtain an ordered sequence s_sorted. Smoothing processing employs a two-level strategy: the first level is a sliding window averaging, with the window size... Based on the total number of genes Dynamically determined, the calculation formula is as follows The first stage allows the window to adaptively adjust to the data size; the second stage uses a Savitzky-Golay filter for secondary smoothing, with a fixed polynomial order of 3 and a window length not exceeding [a certain value]. The largest odd number is used to remove high-frequency noise while preserving the overall trend of the curve. After smoothing, the sequence s_smooth is obtained. Next, inflection point detection is performed: its first-order difference is calculated. and second-order difference The second-order difference minimum point corresponds to the inflection point with the largest curvature change. To improve efficiency, the search range is limited to the first 500 genes. Candidate inflection point calculation... Then, the cumulative variance explained rate is calculated: the proportion of the cumulative sum of the sorted sequences to the total sum is calculated, and the position where the cumulative proportion first reaches 0.9 is located, denoted as n_var90, indicating that the first n_var90 genes contribute 90% of the total fusion score. Finally, the system comprehensively considers inflection point detection, information contribution, and baseline requirements to determine the final number of features: The lower limit of 100 ensures statistical reliability, while the upper limit of 2000 controls computational complexity. Ensure at least 10% of genes are selected. Finally, generate the feature selection mask: confirm. Then, initialize the all-zero mask. Get the highest fusion score The indexes of each gene are used, and the values at these positions are set to 1. The output mask is used for subsequent model reconstruction.
[0029] Model dimension adaptive reconstruction implementation. For example... Figure 4 As shown, adaptive model reconstruction enables dynamic model reconstruction based on feature selection results. Specific implementation details are as follows: The system first saves the complete configuration parameters of the original variational autoencoder model, including the original feature dimensions, hidden layer dimension list, latent variable dimensions, data types, whether there are covariate flags, and the number of discrete states. Feature subset extraction is performed using a new feature matrix. Implementation, in which This is the original expression matrix. Select a mask for the features generated in step 3. The new feature dimension is calculated as follows: Create a new encoder with the input layer dimension set to... The hidden layer dimension list remains consistent with the original model. Weights are initialized using a He normal distribution. If pre-trained weights exist and the hidden layer dimensions remain unchanged, the original hidden layer weights are loaded directly. A new decoder is created, with the output layer dimension set to... The hidden layer dimension list corresponds inversely to that of the original model. Weight initialization also uses a He normal distribution, while the output layer uses random initialization due to the change in dimensions. The latent space layer is reconstructed, maintaining the number of discrete states. The dimensions of the latent variables remain unchanged. Position matrix. Reinitialize to random values, distribution parameters Reset to a uniform distribution. For hidden layers and latent variable layers with unchanged dimensions, directly load the original weights. For input and output layers, use a partial mapping strategy: if the new dimension is less than or equal to the original dimension, truncate the first half of the original weights. Column (input layer) or front Rows (output layer); if the new dimension is greater than the original dimension, retain the original weights. Use the new feature matrix. Training and testing sets were constructed with a 9:1 ratio, and stratified sampling was used to ensure a consistent proportion of cells across all categories. The training batch size was set to 256, and the testing batch size to 512. The Adam optimizer was used in both the pre-training and formal training phases, with a fixed learning rate. After training, the output model adapted to the new feature dimensions is completed. The model weights are saved as a checkpoint file, which includes the training epochs, validation loss, and optimizer state, facilitating subsequent recovery and continued training.
[0030] Performance-driven iterative optimization and trajectory inference implementation. For example... Figure 5 As shown, performance-driven iterative optimization and trajectory inference construct a complete performance feedback loop, deeply integrating feature selection and trajectory inference. Detailed parameter configurations for iterative optimization are as follows: the maximum number of iterations is set to 3 by default, a value determined through cross-validation to achieve a balance between performance and computational cost; the number of Monte Carlo samplings per iteration... The value is set to 50 to ensure the stability of the posterior estimate. The batch size for the posterior estimate is dynamically adjusted based on the number of cells; it is set to 32 when the number of cells is greater than 15,000, and 128 otherwise. The backbone network is constructed using modified_map, with edge weight thresholds... The threshold is set to 0.0001, a low threshold that ensures all potential connections are preserved, and subsequent extraction using the maximum spanning tree ensures an acyclic structure. Specific parameters for the multi-dimensional performance evaluation and weight update mechanism are detailed in the dynamic weight fusion implementation section, including performance change rate calculation, threshold settings for the hierarchical adjustment strategy, step size parameters, and weight range limitations. After iteration, the system reconstructs the final model based on the historical best feature set and outputs complete results: the backbone network G_final is stored in GraphML format, the cell pseudotime pseudotime_final is added to adata.obs['pseudotime'] column by column, the cell state assignment labels_final is stored in adata.obs['vitae_final_clustering'], and the uncertainty estimate is stored in adata.obs['projection_uncertainty']. Simultaneously, the system generates a complete iteration history CSV file for easy subsequent analysis and visualization.
[0031] To verify the effectiveness of this invention, comparative experiments were conducted on five publicly available single-cell datasets, including linear, bifurcation, tree, cycle_1, and trifurcating_1. Trajectory inference was performed using both the original VITAE model and the adaptive feature selection model of this invention, and the prediction performance of the two models was compared. Experimental results show that this invention significantly improves the topology reconstruction capability compared to the baseline method, with the graph editing distance similarity (GED) score increasing from 0.8466 to 0.8777, an improvement of 3.67%. This demonstrates the effectiveness and superiority of this invention.
[0032] The above description, in conjunction with specific preferred embodiments, provides a further detailed explanation of the present invention. It should not be construed that the specific implementation of the present invention is limited to these descriptions. For those skilled in the art, various simple deductions or substitutions can be made without departing from the concept of the present invention, and all such modifications and substitutions should be considered within the scope of protection of the present invention.
Claims
1. A single-cell trajectory inference method based on adaptive feature selection, characterized in that, Includes the following steps: Step 1: Data Preprocessing and Two-Dimensional Gene Importance Assessment: Input single-cell RNA sequencing data, which are standardized, logarithmically normalized, and highly variable genes are screened to obtain an initial gene expression matrix. A two-dimensional gene assessment strategy is adopted to calculate the scores of highly variable genes based on expression variability and the trajectory importance scores related to differentiation trajectories, and output two types of score vectors. The calculation of highly variable gene scores prioritizes the use of highly variable gene ranking values. The ranking is mapped to the [0,1] interval through a ranking transformation formula, and the higher the ranking, the higher the score. If the ranking information is unavailable, the normalized dispersion score is used. If dispersion information is also unavailable, the coefficient of variation is calculated based on the mean and variance of gene expression, and all scores are normalized to the [0,1] interval to obtain a highly variable gene score vector. The trajectory importance score is calculated according to priority: if pseudo-time information is available, the absolute value of the correlation between gene expression and pseudo-time is calculated using the Spearman rank correlation coefficient; if real cell type labels are available, the F-value is calculated using one-way ANOVA as the inter-group difference score; if there is no trajectory information, the standard deviation of gene expression is used as a substitute score, and all scores are normalized to the [0,1] interval to obtain a trajectory importance score vector. Step 2, Dynamic Weight Fusion and Nonlinear Enhancement: Using the score vector output from Step 1 as input; The fusion weights are adaptively adjusted through a performance feedback closed loop, and a nonlinear enhancement term is introduced to highlight key genes. The output is a comprehensive score vector. Step 3, Adaptive Feature Quantity Determination: Using the score vector output from Step 2 as input, an intelligent inflection point detection algorithm is adopted. Through descending sorting, smoothing filtering, inflection point detection, and cumulative variance explanation rate calculation, the optimal number of features is automatically determined, and the output is a feature selection mask. The intelligent inflection point detection algorithm with adaptive feature quantity determination specifically includes: sorting the fusion scores in descending order to obtain an ordered sequence; processing the sequence using a two-stage smoothing strategy: first, performing initial smoothing through sliding window averaging, and then performing secondary smoothing using a Savitzky-Golay filter; calculating the first and second differences of the smoothed sequence, and taking the position corresponding to the minimum point of the second difference as a candidate inflection point; calculating the proportion of the cumulative sum of the sorted sequence to the total sum, and locating the position where the cumulative proportion first reaches a preset threshold as the information contribution inflection point; combining the candidate inflection points and the information contribution inflection points, and combining the preset minimum and maximum feature number ranges, automatically determining the optimal feature quantity, and generating a feature selection mask accordingly; Step 4, Model Dimension Adaptive Reconstruction and Training: Using the feature selection mask output in Step 3 as input; extracting feature subsets based on the mask to construct new input data; saving the original variational autoencoder parameters, reconstructing the variational autoencoder model based on the new dimension, completing training on the new feature subset, and outputting a model adapted to the new dimension; Step 5, Performance-driven iterative optimization and trajectory inference: Using the model output from Step 4 as input, perform multiple rounds of iterative optimization; in each round, perform posterior estimation, backbone network construction and trajectory inference, adjust the fusion weights of Step 2 according to the performance change trend, record the corresponding feature selection results, and finally output the trajectory inference results under the optimal feature set.
2. The single-cell trajectory inference method based on adaptive feature selection according to claim 1, characterized in that, In step 2, the dynamic weight fusion mechanism maintains two dynamic weight parameters, which correspond to the contribution of highly variable gene scores and trajectory importance scores, respectively, and the sum of the two is always 1. The fusion calculation adopts a strategy that combines linear weighting and nonlinear enhancement. The linear weighting result is the sum of the two scores multiplied by their corresponding weights. On this basis, a nonlinear enhancement term is introduced, which squares the initial fusion score, multiplies it by a fixed coefficient, and then adds them together to form the final comprehensive fusion score. The dynamic adjustment of weights relies on a performance feedback loop: after each iteration, the system calculates a comprehensive score based on multi-dimensional performance indicators and calculates the performance change rate by comparing the comprehensive score of the current round with that of the previous round. A tiered adjustment strategy is adopted to update the fusion weights, including: when performance is significantly improved, the weights are enhanced in the direction of advantage based on the current weights; when performance is slightly improved, random perturbations are added for fine-tuning; when performance is stable or slightly decreased, random exploration is carried out within a small range; when performance is significantly decreased, the weights are reverted to the historical best weights.
3. The single-cell trajectory inference method based on adaptive feature selection according to claim 1, characterized in that, In step 4, the specific process of adaptive reconstruction of model dimensions includes: extracting a new subset of features from the original gene expression matrix based on the feature selection mask, and constructing a new input data matrix after dimensional reduction; creating a new encoder, adjusting the input layer dimension to the new number of features, while keeping the hidden layer dimension consistent with the original model; creating a new decoder, adjusting the output layer dimension to the new number of features, while keeping the hidden layer dimension inversely corresponding to the original model; reinitializing the latent space layer, keeping the number of discrete states and the dimension of latent variables unchanged; for the hidden layer and latent variable layer whose dimensions have not changed, directly loading the pre-trained weights of the original model; for the input layer and output layer whose dimensions have changed, using a random initialization or partial mapping strategy.
4. The single-cell trajectory inference method based on adaptive feature selection according to claim 1, characterized in that, In step 5, the performance-driven iterative optimization uses Monte Carlo sampling for posterior estimation, the backbone network is constructed using a specified method and low-confidence edges are filtered by threshold, a directed acyclic graph is constructed based on the backbone network and root nodes and cell pseudo-time is calculated; the performance index, number of selected features, and fusion weights of each iteration are recorded in the historical queue, and the best performance record and corresponding feature mask are updated when the overall performance is better than the historical best; the final output trajectory inference results include backbone network, cell pseudo-time, cell state allocation and uncertainty estimation; the system also outputs a complete iterative history record.