Big data-based human cell viability data analysis system
By using multimodal fusion and linear regression models in big data analytics systems, the data integration challenges in traditional cell viability analysis have been solved, enabling efficient utilization of multidimensional information and accurate prediction of efficacy scores, thereby improving the accuracy and efficiency of drug development.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Patents (China)
- Current Assignee / Owner: SHENQI LIFE TECH (FUJIAN) CO LTD
- Filing Date: 2026-03-05
- Publication Date: 2026-06-30
AI Technical Summary
Traditional cell viability analysis relies on a single detection indicator, making it difficult to integrate multi-dimensional information. The correlation analysis between transcriptome data and cell viability data lacks systematicity, resulting in poor data comparability and insufficient accuracy of efficacy prediction models.
A big data-based human cell viability data analysis system was adopted, including modules for data acquisition, preprocessing and standardization, multimodal fusion, generation of influencing features, and prediction and interpretation. Through steady-state transformation, unified grid and noise reduction and completion operations, combined with low-rank coupling decomposition model and linear regression model, the fusion of multimodal data and efficacy scoring were realized.
It improves data quality and feature representation capabilities, enhances the accuracy of efficacy characterization, and enables efficient prediction and threshold early warning for efficacy evaluation and mechanism analysis. It has good scalability and engineering feasibility.
Figure CN121789830B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of human cell data analysis technology, and specifically relates to a human cell viability data analysis system based on big data.
Background Technology
[0002] In drug development and cell biology research, human cell viability assessment is a core tool for evaluating the efficacy and toxicity of compounds, and its accuracy directly affects experimental conclusions and clinical translation efficiency. Traditional cell viability analysis often relies on a single detection indicator, such as the absorbance value of the MTT assay, which can only reflect local characteristics of cell metabolic state and is significantly affected by batch differences and heterogeneity of detection methods, making it difficult to integrate multi-dimensional information such as gene expression.
[0003] In existing technologies, the correlation analysis between transcriptome data and cell viability data is mostly conducted independently, lacking a systematic coupling mechanism. On the one hand, the original experimental data suffers from problems such as inconsistent dimensions, noise interference, and missing values, directly leading to poor data comparability. On the other hand, the high-dimensionality and heterogeneity of multimodal data make it difficult to discover shared biological patterns, resulting in insufficient accuracy of efficacy prediction models and failing to provide effective guidance for optimizing experimental protocols.
Summary of the Invention
[0004] This invention provides a human cell viability data analysis system based on big data, solving the technical problems in related technologies.
[0005] This invention provides a human cell viability data analysis system based on big data, comprising:
[0006] The data acquisition module is used to collect raw experimental data and transcriptome data of human cell samples to be evaluated. The raw experimental data includes: viability monitoring indicators, dosage information and time information.
[0007] The preprocessing and normalization module is used to preprocess the raw experimental data and output a three-dimensional vitality tensor. The preprocessing operations include: steady-state transformation, grid unification, and noise reduction and completion.
[0008] The multimodal fusion module is used to couple and decompose the three-dimensional vitality tensor and transcriptome data to extract the shared sample coefficient matrix;
[0009] The influence feature generation module is used to calculate the dose influence feature, time influence feature, and vitality influence feature based on the shared sample coefficient matrix and the three-dimensional vitality tensor.
[0010] The prediction and interpretation module is used to build a linear regression model based on the vitality impact characteristics and preset efficacy labels, output the efficacy score of the corresponding sample to be evaluated, and give a thresholded early warning result based on the efficacy score.
[0011] Furthermore, the viability monitoring indicators include: MTT absorbance value, Resazurin fluorescence intensity value, ATP luminescence intensity value, and flow cytometry percentage of viable cells;
[0012] The steady-state transformation of the preprocessing operation includes:
[0013] For each type of vitality monitoring index, the baseline vitality measurement value and the corresponding treatment vitality measurement value are extracted, and a Sigmoid curve is constructed to obtain the mapping slope parameter and the center parameter.
[0014] Based on the mapping slope parameter and the center parameter, a monotonic sigmoid mapping is performed on all treated vitality measurements, limiting the mapping results to the interval of -1 to 1 and stabilizing the variance, and outputting a steady-state vitality signal.
[0015] Furthermore, the dosage information includes: compound name and mass concentration;
[0016] The unified grid for preprocessing operations includes:
[0017] Logarithmically transform the mass concentration of the compound and combine it with time information to generate a concentration time series;
[0018] Structured grid vitality data is generated by bilinear interpolation of steady-state vitality signals based on concentration time series.
[0019] Furthermore, the noise reduction and completion in the preprocessing operation includes:
[0020] S201, Perform Anscombe transformation on structured grid vitality data;
[0021] S202, Wavelet transform and soft thresholding are performed on the transformed structured grid vitality data to obtain denoised structured grid vitality data;
[0022] S203, perform inverse Anscombe transform on the denoised structured grid vitality data and generate a mask for missing locations to obtain noise-suppressed grid vitality data;
[0023] S204 uses the CP decomposition method based on the missing location mask to complete the missing values and outputs a complete three-dimensional vitality tensor.
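Steps S201 to S203 can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the implementation of the invention: the text does not name a wavelet family, decomposition level, or threshold, so a one-level Haar transform along the time axis and a threshold of 0.05 are assumed; the grid values are shifted by +1 before the Anscombe transform because that transform needs inputs of at least -3/8, and the text does not say how negative steady-state values are handled; missing nodes are assumed to be stored as NaN and are filled with the grid mean only while filtering. The mask is computed up front rather than after the inverse transform, and all function names are hypothetical.

```python
import numpy as np

def anscombe(x):
    # Variance-stabilizing Anscombe transform; requires x >= -3/8.
    return 2.0 * np.sqrt(np.asarray(x, dtype=float) + 3.0 / 8.0)

def inverse_anscombe(y):
    # Simple algebraic inverse of the Anscombe transform.
    return (np.asarray(y, dtype=float) / 2.0) ** 2 - 3.0 / 8.0

def soft_threshold(d, t):
    # Shrink coefficients toward zero by t; values within [-t, t] become 0.
    return np.sign(d) * np.maximum(np.abs(d) - t, 0.0)

def haar_denoise_last_axis(x, t):
    # One-level Haar transform along the last axis, soft-threshold the
    # detail coefficients, then invert the transform.
    if x.shape[-1] % 2 != 0:
        raise ValueError("last axis must have even length for this sketch")
    even, odd = x[..., 0::2], x[..., 1::2]
    approx = (even + odd) / np.sqrt(2.0)
    detail = soft_threshold((even - odd) / np.sqrt(2.0), t)
    out = np.empty_like(x)
    out[..., 0::2] = (approx + detail) / np.sqrt(2.0)
    out[..., 1::2] = (approx - detail) / np.sqrt(2.0)
    return out

def denoise_grid(grid, t=0.05, shift=1.0):
    # grid: samples x dose nodes x time nodes, NaN marks a missing node.
    observed = ~np.isnan(grid)                       # True where a value exists
    filled = np.where(observed, grid, np.nanmean(grid))
    z = anscombe(filled + shift)                     # S201 (with assumed shift)
    z = haar_denoise_last_axis(z, t)                 # S202
    out = inverse_anscombe(z) - shift                # S203
    out[~observed] = np.nan                          # keep gaps for completion
    return out, observed.astype(int)                 # mask: 1 observed, 0 missing

rng = np.random.default_rng(0)
grid = rng.uniform(-0.9, 0.9, size=(2, 3, 4))
grid[0, 1, 2] = np.nan
denoised, mask = denoise_grid(grid, t=0.05)
```

The returned mask follows the convention described for S203 (1 for an observed node, 0 for a missing one), so it can be passed directly to a completion step.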
[0024] Furthermore, the three-dimensional vitality tensor and transcriptome data are coupled and decomposed to extract the shared sample coefficient matrix, including:
[0025] S301. A transcriptome expression matrix is generated based on transcriptome data. A low-rank coupled decomposition model is constructed using the three-dimensional vitality tensor and the transcriptome expression matrix as inputs. The number of potential factors is set as the adjustable parameter of the model. The initial shared sample coefficient matrix, dose coefficient matrix, time coefficient matrix and gene coefficient matrix are variables to be determined. An objective function to minimize the reconstruction error is constructed.
[0026] S302, the initial shared sample coefficient matrix is extracted through a low-rank coupling decomposition model, as well as the dose coefficient matrix and time coefficient matrix corresponding to the dose dimension and time dimension, respectively, and the gene coefficient matrix is extracted from the transcriptome expression matrix;
[0027] S303 uses an alternating least squares algorithm to iteratively update the initial shared sample coefficient matrix, dose coefficient matrix, time coefficient matrix, and gene coefficient matrix until the reconstruction error converges;
[0028] S304 uses the minimization of the reconstruction error of the training set and the prediction error of the validation set as the criterion to adaptively determine the number of potential factors, and terminates the iteration early when the error decreases below the first preset threshold, thus obtaining a converged shared sample coefficient matrix.
[0029] Furthermore, the objective function for minimizing the reconstruction error consists of the sum of two reconstruction errors. The first part is the reconstruction error of the three-dimensional vitality tensor, which is obtained by taking the outer product of the column vectors of the sample coefficient matrix, dose coefficient matrix and time coefficient matrix on the same latent factor and summing them, and then subtracting them element by element from the three-dimensional vitality tensor and summing the squared differences.
[0030] The second part is the transcriptome expression matrix reconstruction error, which is obtained by multiplying the shared sample coefficient matrix with the gene coefficient matrix, subtracting each element from the transcriptome expression matrix, and summing the squared differences.
[0031] Furthermore, based on the shared sample coefficient matrix and the three-dimensional vitality tensor, the dose-effect characteristics, time-effect characteristics, and vitality-effect characteristics are calculated, including:
[0032] S401, calculate the dose influence slope on the dose coefficient matrix in the dose dimension to obtain the dose influence characteristics;
[0033] S402, calculate the slope of the time influence on the time coefficient matrix in the time dimension to obtain the time influence characteristics;
[0034] S403, multiply the dose-effect characteristics and time-effect characteristics to obtain the activity-effect characteristics.
[0035] Furthermore, the preset efficacy labels include: efficacy achieved and efficacy insufficient, represented by 1 and 0 respectively.
[0036] Furthermore, the acquisition of efficacy scores specifically includes:
[0037] S501, construct a linear regression model with vitality impact characteristics as independent variables and efficacy labels as dependent variables;
[0038] S502, the weight vector and bias term of the linear regression model are obtained by minimizing the weighted squared error and introducing the L2 regularization term;
[0039] S503 uses a dual criterion of minimizing the reconstruction error of the training set and minimizing the prediction error of the validation set to adaptively determine the regularization coefficient;
[0040] S504 uses a trained linear regression model to output efficacy scores for the samples to be evaluated.
[0041] The beneficial effects of this invention are as follows: By employing steady-state transformation, unified grid, and noise reduction and completion operations, this invention achieves structural consistency and numerical robustness of different types of viability assay data, thereby improving data quality. It utilizes a coupled decomposition model to fuse viability tensors and transcriptomic expression information, extracting shared sample coefficients, correcting detection method biases, and enhancing feature expression capabilities. Furthermore, it generates dose-influence features, time-influence features, and their combined viability influence fingerprint, making efficacy characterization more accurate. Finally, by introducing a regularized linear regression model, it achieves efficient prediction and threshold warning of efficacy scores, demonstrating good scalability and engineering feasibility. The overall solution is highly systematic, with a clear logical path, effectively enhancing the application value of human cell viability data in drug efficacy evaluation and mechanism analysis.
Attached Figure Description
[0042] Figure 1 is a schematic diagram of the modules of the human cell viability data analysis system based on big data of the present invention.
Detailed Implementation
[0043] The subject matter described herein will now be discussed with reference to exemplary embodiments. It should be understood that these embodiments are discussed only to enable those skilled in the art to better understand and implement the subject matter described herein, and changes may be made to the function and arrangement of the elements discussed without departing from the scope of this specification. Various processes or components may be omitted, substituted, or added as needed in the examples. Furthermore, features described in some examples may be combined in other examples.
[0044] It should be noted that, unless otherwise defined, the technical or scientific terms used in one or more embodiments of the present invention should have the ordinary meaning understood by one of ordinary skill in the art to which this invention pertains. The terms "first," "second," and similar terms used in one or more embodiments of the present invention do not indicate any order, quantity, or importance, but are merely used to distinguish different components. Terms such as "comprising" or "including" mean that the element or object preceding the word encompasses the elements or objects listed after the word and their equivalents, without excluding other elements or objects. Terms such as "connected" or "linked" are not limited to physical or mechanical connections, but can include electrical connections, whether direct or indirect. Terms such as "upper," "lower," "left," and "right" are used only to indicate relative positional relationships; when the absolute position of the described object changes, the relative positional relationship may also change accordingly.
[0045] As shown in Figure 1, the human cell viability data analysis system based on big data includes:
[0046] The data acquisition module 101 is used to collect raw experimental data and transcriptome data of human cell samples to be evaluated. The raw experimental data includes: viability monitoring indicators, dosage information and time information.
[0047] The preprocessing and normalization module 102 is used to preprocess the raw experimental data and output a three-dimensional vitality tensor. The preprocessing operations include: steady-state transformation, grid unification, and noise reduction and completion.
[0048] The multimodal fusion module 103 is used to couple and decompose the three-dimensional vitality tensor and transcriptome data to extract the shared sample coefficient matrix;
[0049] The influence feature generation module 104 is used to calculate the dose influence feature, time influence feature, and vitality influence feature based on the shared sample coefficient matrix and the three-dimensional vitality tensor.
[0050] The prediction and interpretation module 105 is used to establish a linear regression model based on the vitality impact characteristics and preset efficacy labels, output the efficacy score of the corresponding sample, and give a thresholded early warning result based on the efficacy score.
[0051] In one embodiment of the present invention, the viability monitoring indicators include: MTT absorbance, Resazurin fluorescence intensity, ATP luminescence intensity, and flow cytometry viable cell percentage. The MTT absorbance is obtained by measuring the absorbance of formazan, the product formed when succinate dehydrogenase in the mitochondria of viable cells reduces MTT; the Resazurin fluorescence intensity is detected based on viable cells reducing oxidized Resazurin to the fluorescent product Resorufin; the ATP luminescence intensity is determined by measuring intracellular ATP content by bioluminescence to reflect metabolic activity; and the flow cytometry viable cell percentage is the proportion of viable cells, determined from differences in fluorescent dye labeling that reflect cell membrane integrity.
[0052] The steady-state transformation of the preprocessing operation includes:
[0053] For each type of viability monitoring index, baseline viability measurements and the corresponding treatment viability measurements are extracted. Baseline viability measurements are the viability results of cells in the untreated control group, while treatment viability measurements are the viability results of cells treated with different compound doses. Based on these data, a Sigmoid curve is constructed to obtain the mapping slope parameter and the center parameter. The function form is as follows: [formula not reproduced in the source text], where the left-hand side represents the mapped output under the Sigmoid curve, k represents the mapping slope parameter, x represents the treatment viability measurement, and the remaining two symbols (both rendered as "D" in the source text) represent the center parameter and the first bias.
[0054] Based on the mapping slope parameter and the center parameter, a monotonic Sigmoid mapping is applied to all treatment viability measurements, which limits the mapped values to the interval from -1 to 1, stabilizes the variance, and outputs a steady-state vitality signal. The calculation formula for the monotonic Sigmoid mapping is: [formula not reproduced in the source text], where the left-hand side denotes the steady-state vitality signal and the argument denotes the t-th treatment viability measurement.
[0055] Through the three steps of baseline and treatment value extraction, Sigmoid fitting, and monotonic mapping, the raw experimental data are unified into a stable and comparable steady-state vitality signal without losing the original rank order of the measurements. This lays the data foundation for subsequent dose-time gridding and multimodal coupled decomposition.
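The steady-state transformation can be illustrated with a short sketch. Because the formulas themselves are not available in this text, the form used here is an assumption: a sigmoid rescaled to the interval (-1, 1), y = 2 / (1 + exp(-k (x - c))) - 1, which equals tanh(k (x - c) / 2). The way k and c are chosen (center at the midpoint between the baseline mean and the lowest treated value, slope set so that the baseline mean maps to 0.9) is a hypothetical heuristic standing in for the unspecified curve construction, the bias term mentioned in the text is omitted, and the function names are illustrative.

```python
import numpy as np

def fit_sigmoid_params(baseline, treated, target=0.9):
    # Hypothetical heuristic, not the fitting procedure of the invention:
    # put the center halfway between the baseline mean and the lowest
    # treated value, and pick the slope so the baseline mean maps to `target`.
    baseline = np.asarray(baseline, dtype=float)
    treated = np.asarray(treated, dtype=float)
    top = baseline.mean()
    bottom = treated.min()
    center = 0.5 * (top + bottom)
    half_range = max(abs(top - center), 1e-12)
    slope = 2.0 * np.arctanh(target) / half_range
    return slope, center

def steady_state_map(x, slope, center):
    # Rescaled sigmoid: 2 / (1 + exp(-k (x - c))) - 1 == tanh(k (x - c) / 2).
    # The output is monotonic in x and always lies strictly inside (-1, 1).
    x = np.asarray(x, dtype=float)
    return np.tanh(0.5 * slope * (x - center))

baseline = [1.02, 0.98, 1.00]            # untreated control readings
treated = [0.95, 0.80, 0.55, 0.30, 0.12]  # readings at increasing dose
k, c = fit_sigmoid_params(baseline, treated)
signal = steady_state_map(treated, k, c)
```

Because the mapping is strictly increasing, the rank order of the treated readings is preserved in `signal`, which is the property the paragraph above relies on.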
[0056] In one embodiment of the present invention, the dosage information includes: compound name and mass concentration;
[0057] The unified grid for preprocessing operations includes:
[0058] The mass concentration of the compound is logarithmically transformed and combined with time information to generate a concentration time series. The time information includes fixed sampling points such as 0 hours, 6 hours, 24 hours and 48 hours after treatment. The logarithmic transformation maps the concentration values to an evenly spaced logarithmic concentration space, and each logarithmic concentration value is paired with each time point to generate a concentration time series containing all samples to be evaluated.
[0059] Bilinear interpolation is performed on the steady-state viability signal based on the concentration time series to generate structured grid viability data. Specifically, since the concentration gradient or time sampling points of different experiments may differ, the steady-state viability signal needs to be aligned to the same grid node through interpolation to obtain structured grid viability data. This data is represented in the form of a three-dimensional matrix with dimensions of number of samples × number of concentration nodes × number of time nodes, ensuring the spatial consistency of viability data under different samples and different treatment conditions.
[0060] This embodiment linearizes the nonlinear concentration response through logarithmic transformation and achieves grid alignment of different experimental data by combining bilinear interpolation. This eliminates the data heterogeneity caused by differences in concentration gradient and time sampling, provides a standardized three-dimensional data structure for subsequent noise reduction, completion, and multimodal fusion, and improves the reliability of cross-sample analysis.
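A minimal sketch of the grid unification step follows, assuming base-10 logarithms, concentrations and time points sorted in increasing order, and a hypothetical helper name. On a rectilinear grid, bilinear interpolation can be carried out as linear interpolation along one axis followed by linear interpolation along the other, which is what the two `np.interp` passes do. Note that `np.interp` clamps to the edge values outside the measured range rather than extrapolating.

```python
import numpy as np

def to_common_grid(conc, hours, signal, grid_log_conc, grid_hours):
    # signal: (n_conc, n_time) steady-state values measured at the given
    # mass concentrations (> 0) and time points, both sorted ascending.
    log_conc = np.log10(np.asarray(conc, dtype=float))
    hours = np.asarray(hours, dtype=float)
    signal = np.asarray(signal, dtype=float)
    # Pass 1: interpolate along time for every measured concentration.
    along_time = np.stack([np.interp(grid_hours, hours, row) for row in signal])
    # Pass 2: interpolate along log-concentration for every grid time.
    return np.stack(
        [np.interp(grid_log_conc, log_conc, col) for col in along_time.T],
        axis=1,
    )

conc = [0.1, 1.0, 10.0, 100.0]
hours = [0.0, 6.0, 24.0, 48.0]
log_conc = np.log10(conc)
# Synthetic signal that is linear in log-dose and time, for easy checking.
signal = 0.5 * log_conc[:, None] - 0.01 * np.array(hours)[None, :]

grid_log_conc = np.linspace(-1.0, 2.0, 7)
grid_hours = np.array([0.0, 12.0, 24.0, 36.0, 48.0])
aligned = to_common_grid(conc, hours, signal, grid_log_conc, grid_hours)

# Stacking aligned experiments gives samples x concentration nodes x time nodes.
tensor = np.stack([aligned, aligned])
```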
[0061] In one embodiment of the present invention, the noise reduction and completion of the preprocessing operation includes:
[0062] S201 performs an Anscombe transform on the structured grid vitality data, converting the data to an approximate Gaussian distribution space;
[0063] S202, wavelet transform and soft thresholding are performed on the transformed structured grid vitality data to obtain denoised structured grid vitality data, which effectively suppresses high-frequency noise;
[0064] S203. Perform an inverse Anscombe transform on the denoised structured grid vitality data to restore the original data distribution and generate a missing location mask to obtain noise-suppressed grid vitality data. The missing location mask is 1 to indicate the presence of an observation and 0 to indicate the absence of an observation, which is used to indicate the missing location in the subsequent completion process.
[0065] S204 employs the CP decomposition method based on missing location masks to complete missing values and output a complete 3D vitality tensor. Specifically, the CP decomposition method decomposes the noise-suppressed grid vitality data into the sum of the outer products of three low-rank factor matrices (corresponding to the sample, dose, and time dimensions, respectively), constructing a low-rank representation model of the data. The outer product operation combines the factor vectors of the three dimensions into 3D tensor fragments, and the superposition of multiple fragments reconstructs the original data structure. During the iterative solution process, the algorithm only uses observations marked as 1 in the mask for model training. By minimizing the reconstruction error, it continuously optimizes the three matrices, ensuring that the reconstructed tensor is consistent with the original observations at known locations. Simultaneously, it predictively fills in missing locations marked as 0 in the mask. After multiple iterations until the reconstruction error converges, the final 3D vitality tensor covers all samples, doses, and time nodes. It retains the effective information of the original observation data and reasonably completes missing values through low-rank structure constraints, thus outputting a missing-free and structurally complete 3D vitality tensor, providing a continuous and complete data foundation for subsequent multimodal fusion and other steps.
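The mask-based completion in S204 can be sketched as follows. This is an illustrative variant rather than the exact procedure of the invention: instead of restricting each least-squares update to the observed entries, it uses the common impute-then-refit scheme, in which missing entries are replaced by the current reconstruction before each ordinary CP alternating-least-squares sweep. Observed entries are never overwritten. The rank, iteration count, initialization, and function names are assumptions.

```python
import numpy as np

def cp_reconstruct(A, B, C):
    # Sum over latent factors of the outer products a_r o b_r o c_r.
    return np.einsum('ir,jr,kr->ijk', A, B, C)

def cp_complete(X, mask, rank, n_iter=300, seed=0):
    # X: samples x dose x time tensor; mask: 1 = observed, 0 = missing.
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A = rng.uniform(0.5, 1.5, (I, rank))
    B = rng.uniform(0.5, 1.5, (J, rank))
    C = rng.uniform(0.5, 1.5, (K, rank))
    X0 = np.where(mask == 1, X, 0.0)   # neutralize NaN at missing nodes
    for _ in range(n_iter):
        # Impute missing entries with the current low-rank reconstruction.
        T = np.where(mask == 1, X0, cp_reconstruct(A, B, C))
        # Standard CP-ALS updates, one factor matrix at a time.
        A = np.einsum('ijk,jr,kr->ir', T, B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = np.einsum('ijk,ir,kr->jr', T, A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = np.einsum('ijk,ir,jr->kr', T, A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    completed = np.where(mask == 1, X0, cp_reconstruct(A, B, C))
    return completed, (A, B, C)

# Synthetic check: a rank-1 tensor with roughly 15% of entries hidden.
rng = np.random.default_rng(1)
a = rng.uniform(0.5, 1.5, 6)
b = rng.uniform(0.5, 1.5, 5)
c = rng.uniform(0.5, 1.5, 4)
truth = np.einsum('i,j,k->ijk', a, b, c)
mask = (rng.random(truth.shape) > 0.15).astype(int)
observed = np.where(mask == 1, truth, np.nan)
completed, factors = cp_complete(observed, mask, rank=1)
```

The demonstration uses rank 1 so that the result is easy to verify; in practice the rank would be chosen on held-out entries.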
[0066] In one embodiment of the present invention, the three-dimensional vitality tensor and transcriptome data are coupled and decomposed to extract a shared sample coefficient matrix, including:
[0067] S301. A transcriptome expression matrix is generated based on transcriptome data. A low-rank coupled decomposition model is constructed using the three-dimensional vitality tensor and the transcriptome expression matrix as inputs. The number of potential factors is set as the adjustable parameter of the model. The initial shared sample coefficient matrix, dose coefficient matrix, time coefficient matrix and gene coefficient matrix are variables to be determined. An objective function to minimize the reconstruction error is constructed.
[0068] Transcriptome data refers to the quantitative information on gene expression in human cell samples to be evaluated, obtained through high-throughput sequencing technology. After removing batch effects, a transcriptome expression matrix with the dimension of sample number × gene number is formed. Each row of the transcriptome expression matrix corresponds to a human cell sample to be evaluated, and each column corresponds to the expression level of a gene.
[0069] The objective function of the low-rank coupled decomposition model is:
[0070] $$\min_{A,B,C,S} J = \left\| G - \sum_{r=1}^{R} a_r \circ b_r \circ c_r \right\|_F^2 + \lambda \left\| X - A S^{T} \right\|_F^2$$
[0071] where $J$ represents the objective function value; $\min$ indicates the operation of finding the minimum value; $G$ represents the three-dimensional vitality tensor; $X$ represents the transcriptome expression matrix; $A$ represents the shared sample coefficient matrix; $B$ represents the dose coefficient matrix (the weight structure of the different dose concentrations across the latent factor dimensions, with each row corresponding to a concentration level); $C$ represents the time coefficient matrix (the response contribution at each time point across the latent factor dimensions); $S$ represents the gene coefficient matrix (the weight of each gene across the latent factor dimensions); $R$ represents the number of latent factors (the number of latent variable dimensions used for fitting and modeling, a hyperparameter of the low-rank coupled decomposition model); $r$ represents the latent factor index; $a_r$, $b_r$ and $c_r$ represent the r-th column vectors of the shared sample coefficient matrix, the dose coefficient matrix and the time coefficient matrix, respectively; $\circ$ represents the outer product of vectors, where the outer product of three vectors is a three-dimensional tensor of rank 1; the first term represents the tensor reconstruction error; $\| \cdot \|_F^2$ represents the sum of the squares of all elements, that is, the square of the Frobenius norm; $\lambda$ represents the fusion weight coefficient; and $T$ represents the transpose operation;
[0072] S302, the initial shared sample coefficient matrix is extracted through a low-rank coupling decomposition model, as well as the dose coefficient matrix and time coefficient matrix corresponding to the dose dimension and time dimension, respectively, and the gene coefficient matrix is extracted from the transcriptome expression matrix;
[0073] S303 employs an alternating least squares algorithm to iteratively update the initial shared sample coefficient matrix, dose coefficient matrix, time coefficient matrix, and gene coefficient matrix until the reconstruction error converges. Specifically, in each step three of the matrices are held fixed and the fourth is updated by minimizing the objective function. Each update has a closed-form solution, obtained by setting the partial derivative of the objective function with respect to that matrix to zero. This loop is repeated until the difference between the objective function values of two consecutive iterations is less than a preset difference threshold, at which point the reconstruction error is deemed to have converged.
[0074] S304, based on minimizing the reconstruction error of the training set and the prediction error of the validation set, adaptively determines the number of latent factors and terminates the iteration early when the error decrease is less than a first preset threshold, thus obtaining a converged shared sample coefficient matrix. Specifically, different numbers of latent factors are tested, the reconstruction error of the training set and the prediction error of the validation set are calculated for each candidate number, and the number that minimizes both is selected as the optimal number of latent factors. During the iteration process, if the error decreases by less than 0.1% for five consecutive iterations, the iteration is terminated early to avoid overfitting. Finally, a converged shared sample coefficient matrix is output, where each row corresponds to a sample, each column corresponds to a latent pattern, and the element value represents the comprehensive strength of the sample under that pattern.
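The early-termination rule in S304 can be written as a small helper. The sketch assumes that "error decrease" means the relative decrease of the objective value between consecutive iterations; the text does not say whether the 0.1% figure is relative or absolute, and the function name is hypothetical.

```python
def should_stop(errors, rel_tol=1e-3, patience=5):
    # errors: objective values recorded once per iteration, oldest first.
    # Stop only if each of the last `patience` iterations improved the
    # error by less than rel_tol (0.1% by default) relative to the
    # preceding value.
    if len(errors) < patience + 1:
        return False
    recent = errors[-(patience + 1):]
    for prev, cur in zip(recent[:-1], recent[1:]):
        decrease = (prev - cur) / max(abs(prev), 1e-12)
        if decrease >= rel_tol:
            return False
    return True

still_improving = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
plateau = [1.0, 0.5] + [0.5 - 1e-5 * i for i in range(1, 6)]
stop_early = should_stop(plateau)
keep_going = should_stop(still_improving)
```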
[0075] This embodiment solves the problem of strong heterogeneity and weak correlation of multimodal data by organically linking the three-dimensional vitality tensor and transcriptome expression matrix through a low-rank coupling decomposition model and a shared sample coefficient matrix. The alternating least squares algorithm ensures the stability and efficiency of the model solution, and the adaptive latent factor number strategy balances the model complexity and fitting accuracy.
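The coupled decomposition of S301 to S303 can be sketched with NumPy as below. It minimizes the sum of the tensor reconstruction error and the weighted transcriptome matrix reconstruction error by alternating least squares, with each factor matrix updated in closed form while the others are held fixed. The rank, the fusion weight `lam`, the random initialization, the stopping tolerance, and the synthetic data are assumptions for illustration, and the function names are hypothetical.

```python
import numpy as np

def coupled_objective(G, X, A, B, C, S, lam):
    # Tensor reconstruction error plus weighted matrix reconstruction error.
    tensor_err = G - np.einsum('ir,jr,kr->ijk', A, B, C)
    matrix_err = X - A @ S.T
    return float(np.sum(tensor_err ** 2) + lam * np.sum(matrix_err ** 2))

def coupled_als(G, X, rank, lam=1.0, n_iter=200, tol=1e-9, seed=0):
    # G: samples x dose x time tensor; X: samples x genes matrix.
    rng = np.random.default_rng(seed)
    I, J, K = G.shape
    A, B, C = (rng.standard_normal((n, rank)) for n in (I, J, K))
    S = rng.standard_normal((X.shape[1], rank))
    history = []
    for _ in range(n_iter):
        # A is shared, so its normal equations combine both data sets.
        lhs = (B.T @ B) * (C.T @ C) + lam * (S.T @ S)
        rhs = np.einsum('ijk,jr,kr->ir', G, B, C) + lam * (X @ S)
        A = rhs @ np.linalg.pinv(lhs)
        # B and C only appear in the tensor term.
        B = np.einsum('ijk,ir,kr->jr', G, A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = np.einsum('ijk,ir,jr->kr', G, A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
        # S only appears in the matrix term.
        S = X.T @ A @ np.linalg.pinv(A.T @ A)
        history.append(coupled_objective(G, X, A, B, C, S, lam))
        if len(history) > 1 and abs(history[-2] - history[-1]) < tol:
            break
    return A, B, C, S, history

# Synthetic data sharing the same sample factors in both modalities.
rng = np.random.default_rng(0)
A0, B0, C0, S0 = (rng.standard_normal((n, 2)) for n in (8, 5, 4, 12))
G = np.einsum('ir,jr,kr->ijk', A0, B0, C0) + 0.01 * rng.standard_normal((8, 5, 4))
X = A0 @ S0.T + 0.01 * rng.standard_normal((8, 12))
A, B, C, S, history = coupled_als(G, X, rank=2, lam=0.5)
```

Because every block update is an exact minimizer for that block, the recorded objective values never increase from one sweep to the next, which gives a simple sanity check for an implementation.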
[0076] In one embodiment of the present invention, dose-effect characteristics, time-effect characteristics, and vitality-effect characteristics are calculated based on the shared sample coefficient matrix and the three-dimensional vitality tensor, including:
[0077] S401, calculate the dose influence slope on the dose coefficient matrix along the dose dimension to obtain the dose influence characteristics. Specifically, for each column of the dose coefficient matrix, a linear model is fitted by linear regression against the dose-concentration axis (logarithmic scale). The fitted slope is the dose influence slope, which represents the dose-effect strength of the r-th latent factor along the dose dimension. Collecting the slopes of all latent factors yields the dose influence characteristics.
[0078] S402, calculate the time influence slope on the time coefficient matrix along the time dimension to obtain the time influence characteristics. Similarly, for each column of the time coefficient matrix, which represents the response of the r-th latent factor at each time point, a regression is fitted against the time axis, and the fitted slopes form the time influence characteristics.
[0079] S403, the dose-effect characteristics and time-effect characteristics are multiplied element-wise according to the potential factor dimension to obtain the vitality-effect characteristics.
[0080] This embodiment models the decomposition factors in a trend-based manner along both the dosage and time dimensions, thereby extracting biologically interpretable feature indicators from the perspective of potential factors. Furthermore, through cross-dimensional feature fusion operations, it significantly enhances the model's ability to capture the response of drug action trends and rhythmic effects, thereby improving the accuracy and discriminativeness of downstream efficacy prediction.
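Steps S401 to S403 reduce to two per-column slope fits and an element-wise product. The sketch below uses `np.polyfit` for the ordinary least-squares slopes. The last line is an assumption: the text does not spell out how the per-factor fingerprint is turned into per-sample regression inputs, so scaling each sample's row of the shared sample coefficient matrix by the fingerprint is shown as one possible reading. All names and the toy matrices are illustrative.

```python
import numpy as np

def factor_slopes(axis_values, factor_matrix):
    # Least-squares slope of every column of factor_matrix against
    # axis_values; polyfit returns coefficients with the slope first.
    return np.polyfit(axis_values, factor_matrix, deg=1)[0]

log_dose = np.array([-1.0, 0.0, 1.0, 2.0])   # log10 concentration nodes
hours = np.array([0.0, 6.0, 24.0, 48.0])     # time nodes

# Toy factor matrices with two latent factors and known slopes.
B = np.column_stack([2.0 * log_dose + 1.0, -0.5 * log_dose])
C = np.column_stack([0.1 * hours, 0.2 * hours + 3.0])

dose_features = factor_slopes(log_dose, B)     # S401
time_features = factor_slopes(hours, C)        # S402
fingerprint = dose_features * time_features    # S403, one value per factor

# Assumed per-sample features: weight the fingerprint by each sample's
# loading on the corresponding latent factor.
A = np.array([[1.0, 0.0], [0.5, 2.0], [0.0, 1.0]])
sample_features = A * fingerprint
```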
[0081] In one embodiment of the present invention, obtaining the efficacy score specifically includes:
[0082] S501, construct a linear regression model with vitality impact characteristics as independent variables and efficacy labels as dependent variables; where efficacy labels are marked as binary values according to experimental standards, 1 indicates that the efficacy standard is met, and 0 indicates that the efficacy is insufficient;
[0083] S502, by minimizing the weighted squared error and introducing the L2 regularization term, the weight vector and bias term of the linear regression model are obtained; by taking the partial derivative of the objective function composed of the weighted squared error and the L2 regularization term and setting it to zero, the closed-form solution of the weight vector and bias term can be obtained, ensuring that the linear regression model can fit the data while suppressing the risk of overfitting.
[0084] S503 uses a dual criterion of minimizing the reconstruction error of the training set and minimizing the prediction error of the validation set to adaptively determine the regularization coefficient. The regularization coefficient is used to control the magnitude of the weight vector. This step balances the linear regression model's ability to fit the training set and its ability to generalize to unknown data.
[0085] S504 uses a trained linear regression model to output efficacy scores for the samples to be evaluated.
[0086] The linear regression model in this embodiment intuitively quantifies the contribution of each vitality-influencing feature to the therapeutic effect through weight vectors. L2 regularization effectively solves the overfitting problem caused by high-dimensional features. The regularization coefficient determined by the dual criteria ensures that the model has both fitting accuracy and generalization ability.
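A compact sketch of S501 to S504 is given below. The closed-form solution follows from setting the gradient of the weighted squared error plus the L2 penalty to zero. Several details are assumptions, since the text leaves them open: the sample weights are uniform, the bias term is not penalized, the candidate regularization coefficients form a small fixed grid, and the "dual criterion" is implemented as the sum of the training and validation mean squared errors. Function names and the synthetic data are hypothetical.

```python
import numpy as np

def fit_weighted_ridge(F, y, sample_weight, lam):
    # Minimizes sum_i w_i (y_i - f_i . w - b)^2 + lam * ||w||^2 in closed form.
    n, d = F.shape
    Z = np.hstack([F, np.ones((n, 1))])   # last column carries the bias
    W = np.diag(sample_weight)
    penalty = lam * np.eye(d + 1)
    penalty[-1, -1] = 0.0                 # assumption: bias is not penalized
    theta = np.linalg.solve(Z.T @ W @ Z + penalty, Z.T @ W @ y)
    return theta[:-1], theta[-1]

def predict(F, w, b):
    return F @ w + b

def select_lambda(F_tr, y_tr, F_va, y_va, lams):
    # Assumed dual criterion: training MSE plus validation MSE.
    best = None
    for lam in lams:
        w, b = fit_weighted_ridge(F_tr, y_tr, np.ones(len(y_tr)), lam)
        score = (np.mean((predict(F_tr, w, b) - y_tr) ** 2)
                 + np.mean((predict(F_va, w, b) - y_va) ** 2))
        if best is None or score < best[0]:
            best = (score, lam, w, b)
    return best[1], best[2], best[3]

# Synthetic features and binary efficacy labels (1 = achieved, 0 = insufficient).
rng = np.random.default_rng(0)
F = rng.standard_normal((60, 3))
labels = (F @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(60) > 0).astype(float)

lams = [0.01, 0.1, 1.0, 10.0]
best_lam, w, b = select_lambda(F[:40], labels[:40], F[40:], labels[40:], lams)
scores = predict(F, w, b)   # efficacy scores for all samples
```

Since the labels are 0 and 1, the scores are continuous values that are not confined to that range, which is why a separate threshold is applied to them afterwards.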
[0087] After obtaining the efficacy score of the sample to be evaluated, it is compared with the preset warning threshold. If the efficacy score of the sample is greater than or equal to the threshold, it is judged as "efficacy target achieved warning", indicating that the compound or treatment regimen corresponding to the sample may achieve the expected efficacy under the current conditions. If the efficacy score of the sample is less than the threshold, it is judged as "efficacy insufficient warning", indicating that the treatment regimen corresponding to the sample may not have achieved the expected effect and the dosage or time parameters need to be further adjusted.
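The thresholded warning rule is a single comparison. In the sketch below the default threshold of 0.5 is an assumed placeholder, since the text only refers to a preset warning threshold, and the function name is hypothetical.

```python
def efficacy_warning(score, threshold=0.5):
    # Scores at or above the threshold count as having reached the target.
    if score >= threshold:
        return "efficacy target achieved warning"
    return "efficacy insufficient warning"

results = [efficacy_warning(s) for s in (0.82, 0.50, 0.31)]
```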
[0088] It should be noted that the interval and threshold values are set for ease of comparison. The threshold value depends on the amount of sample data and on the base values that those skilled in the art set for each group of sample data, provided that the proportional relationship between the parameters and the quantized values is not affected. Furthermore, the above formulas are all computed on dimensionless values; they were obtained by software simulation on a large amount of collected data so as to approximate real conditions as closely as possible, and the preset parameters in the formulas are set by those skilled in the art according to the actual situation.
[0089] The embodiments of the present invention have been described above, but the present invention is not limited to the specific embodiments described. These embodiments are merely illustrative and not restrictive. Those skilled in the art can devise many other forms under the guidance of the present embodiments, all of which fall within the protection scope of the present invention.
Claims
1. A human cell viability data analysis system based on big data, characterized in that it comprises:
a data acquisition module, used to collect raw experimental data and transcriptome data of human cell samples to be evaluated, the raw experimental data including viability monitoring indicators, dosage information and time information;
a preprocessing and normalization module, used to preprocess the raw experimental data and output a three-dimensional vitality tensor, the preprocessing operations including steady-state transformation, grid unification, and noise reduction and completion;
a multimodal fusion module, used to perform coupled decomposition of the three-dimensional vitality tensor and the transcriptome data to extract the shared sample coefficient matrix, including:
S301, generating a transcriptome expression matrix from the transcriptome data; constructing a low-rank coupled decomposition model with the three-dimensional vitality tensor and the transcriptome expression matrix as inputs, the number of latent factors being the adjustable parameter of the model, and the initial shared sample coefficient matrix, dose coefficient matrix, time coefficient matrix and gene coefficient matrix being the variables to be determined; and constructing an objective function that minimizes the reconstruction error, the objective function being the sum of two reconstruction errors: the first is the reconstruction error of the three-dimensional vitality tensor, obtained by taking, for each latent factor, the outer product of the corresponding column vectors of the shared sample coefficient matrix, dose coefficient matrix and time coefficient matrix, summing these outer products over all latent factors, subtracting the result element by element from the three-dimensional vitality tensor, and summing the squared differences; the second is the reconstruction error of the transcriptome expression matrix, obtained by multiplying the shared sample coefficient matrix by the gene coefficient matrix, subtracting the product element by element from the transcriptome expression matrix, and summing the squared differences;
S302, extracting, through the low-rank coupled decomposition model, the initial shared sample coefficient matrix together with the dose coefficient matrix and time coefficient matrix corresponding to the dose dimension and time dimension respectively, and extracting the gene coefficient matrix from the transcriptome expression matrix;
S303, iteratively updating the initial shared sample coefficient matrix, dose coefficient matrix, time coefficient matrix and gene coefficient matrix using an alternating least squares algorithm until the reconstruction error converges;
S304, adaptively determining the number of latent factors based on minimizing the reconstruction error of the training set and the prediction error of the validation set, and terminating the iteration early when the decrease in error falls below a first preset threshold, thereby obtaining a converged shared sample coefficient matrix;
an influence feature generation module, used to calculate the dose influence feature, time influence feature and vitality influence feature based on the shared sample coefficient matrix and the three-dimensional vitality tensor; and
a prediction and interpretation module, used to build a linear regression model based on the vitality influence features and preset efficacy labels, output the efficacy score of the corresponding sample to be evaluated, and give a thresholded early-warning result based on the efficacy score.
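The coupled decomposition of steps S301 to S303, with the early termination of S304, can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions rather than the claimed implementation: the names `coupled_loss` and `coupled_als`, the random initialization, and the values of `max_iter` and `tol` are hypothetical, the two error terms are weighted equally, and the validation-based choice of the number of latent factors is not shown.

```python
import numpy as np

def coupled_loss(T, G, A, B, C, D):
    """Tensor reconstruction error plus matrix reconstruction error."""
    T_hat = np.einsum("ir,jr,kr->ijk", A, B, C)   # sum of rank-1 outer products
    return np.sum((T - T_hat) ** 2) + np.sum((G - A @ D.T) ** 2)

def coupled_als(T, G, rank, max_iter=200, tol=1e-8, seed=0):
    """Alternating least squares for the coupled tensor/matrix model.

    T: (samples, doses, times) vitality tensor.
    G: (samples, genes) transcriptome expression matrix.
    """
    rng = np.random.default_rng(seed)
    I, J, K = T.shape
    A = rng.normal(size=(I, rank))            # shared sample coefficients
    B = rng.normal(size=(J, rank))            # dose coefficients
    C = rng.normal(size=(K, rank))            # time coefficients
    D = rng.normal(size=(G.shape[1], rank))   # gene coefficients
    history = [coupled_loss(T, G, A, B, C, D)]
    for _ in range(max_iter):
        # A is shared, so its update uses both the tensor and the matrix.
        gram = (B.T @ B) * (C.T @ C) + D.T @ D
        A = (np.einsum("ijk,jr,kr->ir", T, B, C) + G @ D) @ np.linalg.pinv(gram)
        B = np.einsum("ijk,ir,kr->jr", T, A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = np.einsum("ijk,ir,jr->kr", T, A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
        D = G.T @ A @ np.linalg.pinv(A.T @ A)
        history.append(coupled_loss(T, G, A, B, C, D))
        if history[-2] - history[-1] < tol:   # early stop on a small decrease
            break
    return A, B, C, D, history
```

Each update solves its block's least-squares subproblem exactly while the other factors are held fixed, so the recorded error does not increase from one iteration to the next; that is what makes the "decrease below a threshold" stopping rule meaningful.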
2. The human cell viability data analysis system based on big data according to claim 1, characterized in that the viability monitoring indicators include: MTT absorbance, resazurin fluorescence intensity, ATP luminescence intensity, and the percentage of viable cells measured by flow cytometry;
the steady-state transformation of the preprocessing operations includes: for each type of viability monitoring indicator, extracting the baseline vitality measurement and the corresponding treated vitality measurements, and fitting a sigmoid curve to obtain a mapping slope parameter and a center parameter; and, based on the mapping slope parameter and the center parameter, applying a monotonic sigmoid mapping to all treated vitality measurements, which limits the mapped results to the interval from -1 to 1 and stabilizes the variance, and outputting a steady-state vitality signal.
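A monotonic sigmoid mapping into the interval from -1 to 1 can be sketched as below. The sketch assumes a rescaled logistic function, which is algebraically a hyperbolic tangent; the function name `steady_state_map` and the example slope and center values are hypothetical, and the curve-fitting step that would produce those two parameters is not shown.

```python
import numpy as np

def steady_state_map(x, slope, center):
    """Rescaled logistic: 2 / (1 + exp(-slope * (x - center))) - 1.

    Written as tanh for numerical stability; the output lies in (-1, 1)
    and increases monotonically with x when slope > 0.
    """
    x = np.asarray(x, dtype=float)
    return np.tanh(0.5 * slope * (x - center))

# Hypothetical parameters; in the claim they come from the fitted sigmoid curve.
treated = np.array([0.2, 0.6, 1.0, 1.4, 1.8])
signal = steady_state_map(treated, slope=4.0, center=1.0)
```

A measurement equal to the center parameter maps to 0, and values far from the center saturate toward -1 or 1, which is what compresses the variance of extreme readings.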
3. The human cell viability data analysis system based on big data according to claim 2, characterized in that the dosage information includes: compound name and mass concentration;
the grid unification of the preprocessing operations includes: logarithmically transforming the mass concentration of the compound and combining it with the time information to generate a concentration time series; and generating structured grid vitality data by bilinear interpolation of the steady-state vitality signal based on the concentration time series.
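The grid unification can be sketched as a log transform followed by bilinear interpolation onto a regular grid. The sketch assumes a base-10 logarithm, measurements laid out on a rectilinear dose-by-time layout with both axes sorted in increasing order, and evenly spaced target grids; the function name `to_unified_grid` and the grid sizes are hypothetical.

```python
import numpy as np

def to_unified_grid(conc, times, V, n_dose=8, n_time=6):
    """Resample V[dose, time] onto an evenly spaced (log-dose, time) grid.

    Linear interpolation along the dose axis followed by linear interpolation
    along the time axis is bilinear interpolation on a rectilinear layout.
    """
    log_c = np.log10(np.asarray(conc, dtype=float))   # assumed base-10 log
    times = np.asarray(times, dtype=float)
    dose_q = np.linspace(log_c.min(), log_c.max(), n_dose)
    time_q = np.linspace(times.min(), times.max(), n_time)
    # Step 1: interpolate along dose for each measured time point.
    tmp = np.stack(
        [np.interp(dose_q, log_c, V[:, k]) for k in range(V.shape[1])], axis=1
    )
    # Step 2: interpolate along time for each target dose.
    out = np.stack(
        [np.interp(time_q, times, tmp[j]) for j in range(n_dose)], axis=0
    )
    return dose_q, time_q, out
```

Working in log concentration spaces dilution series evenly, so a serial dilution such as 0.1, 1, 10, 100 becomes four equidistant points before interpolation.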
4. The human cell viability data analysis system based on big data according to claim 3, characterized in that the noise reduction and completion of the preprocessing operations includes:
S201, performing an Anscombe transform on the structured grid vitality data;
S202, performing a wavelet transform and soft thresholding on the transformed structured grid vitality data to obtain denoised structured grid vitality data;
S203, performing an inverse Anscombe transform on the denoised structured grid vitality data and generating a mask of missing locations to obtain noise-suppressed grid vitality data;
S204, completing the missing values using a CP decomposition method based on the missing-location mask, and outputting a complete three-dimensional vitality tensor.
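Steps S201 to S203 can be sketched for a single row of grid data. The sketch makes several assumptions that the claim does not state: a single-level Haar wavelet stands in for the unspecified wavelet, the simple algebraic inverse of the Anscombe transform is used, the input is non-negative with an even number of points, and missing entries (NaN) are temporarily filled with the mean of the observed values. The function names and the threshold value are hypothetical, and the CP-based completion of S204 is not shown.

```python
import numpy as np

def anscombe(x):
    """Variance-stabilizing transform 2 * sqrt(x + 3/8)."""
    return 2.0 * np.sqrt(x + 3.0 / 8.0)

def inverse_anscombe(y):
    """Algebraic inverse of the Anscombe transform."""
    return (y / 2.0) ** 2 - 3.0 / 8.0

def haar_soft_denoise(y, thr):
    """One-level Haar transform, soft-threshold the details, invert."""
    approx = (y[0::2] + y[1::2]) / np.sqrt(2.0)
    detail = (y[0::2] - y[1::2]) / np.sqrt(2.0)
    detail = np.sign(detail) * np.maximum(np.abs(detail) - thr, 0.0)
    out = np.empty_like(y)
    out[0::2] = (approx + detail) / np.sqrt(2.0)
    out[1::2] = (approx - detail) / np.sqrt(2.0)
    return out

def denoise_row(x, thr=0.5):
    """Return the denoised row and a mask that is True where data was observed."""
    x = np.asarray(x, dtype=float)
    mask = ~np.isnan(x)
    filled = np.where(mask, x, np.nanmean(x))   # assumed temporary fill
    out = inverse_anscombe(haar_soft_denoise(anscombe(filled), thr))
    out[~mask] = np.nan                          # leave gaps for later completion
    return out, mask
```

The mask returned here marks the positions that a subsequent completion step would need to fill.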
5. The human cell viability data analysis system based on big data according to claim 1, characterized in that calculating the dose influence feature, time influence feature and vitality influence feature based on the shared sample coefficient matrix and the three-dimensional vitality tensor includes:
S401, calculating the dose influence slope of the dose coefficient matrix along the dose dimension to obtain the dose influence feature;
S402, calculating the time influence slope of the time coefficient matrix along the time dimension to obtain the time influence feature;
S403, multiplying the dose influence feature by the time influence feature to obtain the vitality influence feature.
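Steps S401 to S403 can be sketched as follows, assuming that each "slope" is the least-squares slope of a latent factor's coefficients against the corresponding grid axis and that the multiplication in S403 is element-wise per latent factor. Both readings, together with the function name `influence_features`, are assumptions; the claim does not fix how the slope is estimated.

```python
import numpy as np

def influence_features(B, C, dose_grid, time_grid):
    """Per-latent-factor slopes of the dose and time coefficient matrices.

    B: (doses, factors) dose coefficient matrix.
    C: (times, factors) time coefficient matrix.
    """
    # np.polyfit fits every column at once; row 0 holds the degree-1 slopes.
    dose_slope = np.polyfit(dose_grid, B, 1)[0]
    time_slope = np.polyfit(time_grid, C, 1)[0]
    vitality = dose_slope * time_slope   # assumed element-wise product
    return dose_slope, time_slope, vitality
```

To obtain one feature vector per sample for the regression stage, one possible reading is to scale each row of the shared sample coefficient matrix by this per-factor product, but that step is likewise an assumption.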
6. The human cell viability data analysis system based on big data according to claim 1, characterized in that the preset efficacy labels include: efficacy achieved and efficacy insufficient, represented by 1 and 0 respectively.
7. The human cell viability data analysis system based on big data according to claim 1, characterized in that obtaining the efficacy score specifically includes:
S501, constructing a linear regression model with the vitality influence features as independent variables and the efficacy labels as the dependent variable;
S502, obtaining the weight vector and bias term of the linear regression model by minimizing the weighted squared error with an added L2 regularization term;
S503, adaptively determining the regularization coefficient using a dual criterion of minimizing the reconstruction error of the training set and minimizing the prediction error of the validation set;
S504, outputting the efficacy scores of the samples to be evaluated using the trained linear regression model.