Statistical machine learning-based biochip data feature engineering method

A biochip data feature engineering method based on statistical machine learning addresses the curse of dimensionality and the difficulty of feature selection in biochip data analysis. The method supports analysis and dimensionality reduction of expression data for unknown genes, with the aim of improving the accuracy and efficiency of bioinformatics research.

CN114724633B · Active · Publication Date: 2026-06-26 · RENJI HOSPITAL AFFILIATED TO SHANGHAI JIAO TONG UNIV SCHOOL OF MEDICINE

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: RENJI HOSPITAL AFFILIATED TO SHANGHAI JIAO TONG UNIV SCHOOL OF MEDICINE
Filing Date: 2022-04-18
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

Biochip data analysis suffers from the curse of dimensionality, existing hypothesis testing methods cannot select a specific number of features for analysis, and traditional dimensionality reduction methods cannot effectively display genes with significant phenotypic influence, thus limiting the information display of unknown gene expression data in biochips.

Method used

We employed a biochip data feature engineering method based on statistical machine learning. Through steps such as generating a data matrix, z-score standardization, calculating F-values to screen gene data, generating a correlation coefficient matrix, screening gene pairs, and calculating multiple correlation coefficients, we performed feature selection and dimensionality reduction, and selected a certain number of genes with large inter-group differences for correlation analysis.

Benefits of technology

It effectively reduces data dimensionality, retains key gene information, can analyze the expression relationships of unnamed or unannotated genes, improves the accuracy and reliability of data analysis, reduces computational load, and avoids statistical errors.



Abstract

A biochip data feature engineering method based on statistical machine learning, comprising the following steps: generating a data matrix; performing z-score standardization; calculating F values and screening the gene data with large F values; generating correlation coefficient matrices; screening gene pairs; calculating multiple correlation coefficients; and marking the genes whose multiple correlation coefficient changes. The method facilitates analysis of the correlations within large volumes of biochip data and uses feature selection to pick, as required, a specified number of genes that reflect the differences between data groups. It uses the correlation coefficient, partial correlation coefficient and multiple correlation coefficient from correlation analysis for feature selection, which further reduces the data dimension and helps predict how the correlation between two genes changes under different experimental treatment conditions.

Description

Technical Field

[0001] This invention relates to the field of biomedicine, and more particularly to biochip technology, specifically a biochip data feature engineering method based on statistical machine learning.

Background Technology

[0002] Biochip technology is a commonly used method in epigenetics research. It utilizes the specific interactions between biomolecules (such as the complementary base pairing of nucleic acids, the specific binding between antigens and antibodies, etc.). The technical method involves preparing a portion of the interacting biomolecules (such as a series of DNA or RNA single strands with specific sequences, or antibodies with a series of corresponding antigens) as "probes." These probes are then immobilized on a substrate (a thin film made of nylon, nitrocellulose, etc.) using physical or chemical methods, forming a two-dimensional microarray. Subsequently, the biomolecules to be detected (such as RNA and proteins) from various biological samples (such as cell samples, tissue samples, etc.) are extracted and fully reacted with the probes on the substrate. After elution, the sample biomolecules that can react and bind with the probes are retained. Finally, the types and quantities of sample biomolecules bound to the probes are detected by isotope labeling or biofluorescence.

[0003] Biochip technology can obtain vast amounts of data at the transcriptional and translational levels from a single sample, with each sample capable of detecting thousands or tens of thousands of RNA and protein types. Applying this data to machine learning places extremely high demands on the sample size; the sample size should ideally match or exceed the number of RNA and protein types. Otherwise, the "curse of dimensionality" will occur, rendering many machine learning methods unsuitable. However, in reality, due to factors such as detection costs, the number of samples used for biochip detection is far less than the number of RNA or protein types that can be measured, making the "curse of dimensionality" seem unavoidable. Therefore, reducing the data dimensionality using dimensionality reduction methods becomes essential before performing machine learning.

[0004] Currently, commonly used dimensionality reduction methods include linear dimensionality reduction, represented by Principal Component Analysis (PCA), and manifold dimensionality reduction, represented by the t-SNE algorithm, both suitable for classification studies. However, the data obtained from biochip technology is not always used for classification; rather, it's necessary to determine which RNAs or proteins are worth studying and what biological mechanisms their behavior in various samples reflects. Therefore, feature selection is a more suitable dimensionality reduction approach. In machine learning algorithms, feature selection methods can be mainly divided into three categories: filtering, wrapping, and embedding, each with multiple specific algorithms suitable for different application scenarios. In the application of biochip data, feature selection involves using an algorithm to screen out biologically significant or interesting RNAs or proteins from a vast array of RNA and protein categories. Furthermore, biomedical research places great emphasis on the reproducibility of molecular biology experiments, making algorithm robustness a priority. These factors place specific requirements on feature selection algorithms for biochip data. Statistical machine learning algorithms can better meet these requirements.

[0005] The following problems exist in the existing technology:

[0006] 1. The F statistic computed on biochip data is commonly used for hypothesis testing. The F-test is well suited to determining whether the differences in the expression of an individual gene between sample groups are statistically significant. However, commonly used hypothesis testing methods cannot select a specified number of features for analysis. When analyzing large amounts of data, and especially the relationships between data, this makes it difficult to lay a foundation for quantitative analysis.

[0007] 2. After feature selection, the selected genes are often submitted to databases such as KEGG Pathway and Reactome for analysis. However, these databases mostly contain known genes and known gene-gene interactions, which limits how much information can be shown about unknown and unannotated genes in biochip expression data.

[0008] 3. Dimensionality reduction of biochip data usually relies on methods such as PCA. These methods are effective at showing the overall differences between samples, but they cannot single out for analysis the few genes that have a significant impact on phenotype.

Summary of the Invention

[0009] In view of the above-mentioned technical problems in the prior art, the present invention provides a biochip data feature engineering method based on statistical machine learning, which aims to solve the above-mentioned technical problem of difficulty in biochip data analysis in the prior art.

[0010] The present invention provides a biochip data feature engineering algorithm based on statistical machine learning, comprising the following steps:

[0011] Step S10: Generate a data matrix;

[0012] Step S20: Perform z-score standardization;

[0013] Step S30: Calculate the F values and screen the gene data with large F values;

[0014] Step S40: Generate the correlation coefficient matrix;

[0015] Step S50: Screen gene pairs;

[0016] Step S60: Calculate the multiple correlation coefficient;

[0017] Step S70: Mark the genes whose multiple correlation coefficient changes;

[0018] Furthermore, step S10 includes: after obtaining the biochip data, the genes are numbered $1, 2, \dots, m$; the groups to which the samples belong are numbered $1, 2, \dots, k$; and the samples within group $g$ are numbered $(g,1), (g,2), \dots, (g,n_g)$. Here $g$ and $j$ in $(g,j)$ are labels indicating that the sample belongs to group $g$ and is the $j$-th sample of that group, and $n_g$ is the number of samples contained in group $g$, so that the total number of samples is $N = \sum_{g=1}^{k} n_g$; $m$, $k$ and every $n_g$ are positive integers;

[0019] A data matrix $X^{\mathrm{orig}}$ is generated using gene IDs as row names and sample IDs as column names, with the position of each row and column name serving as the row and column index. Each element of the data matrix is the raw expression level of a single gene in a single sample on the biochip; the element in row $i$ and in the column of sample $(g,j)$ is denoted $x^{\mathrm{orig}}_{i,gj}$. The superscript "orig" (original) indicates that the data has not been processed, and the subscript "$i,gj$" gives the position of the element within the matrix; neither is a mathematical function. The tables in Appendix 1-1 and Appendix 1-2 show the detailed format of the data matrix described in this section.

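[0020] Furthermore, step S20 includes: using the formula

[0021] $$x^{\text{z-score}}_{i,gj} = \frac{x^{\mathrm{orig}}_{i,gj} - \mu_i}{\sigma_i}, \qquad \mu_i = \frac{1}{N}\sum_{g=1}^{k}\sum_{j=1}^{n_g} x^{\mathrm{orig}}_{i,gj}, \qquad \sigma_i = \sqrt{\frac{1}{N}\sum_{g=1}^{k}\sum_{j=1}^{n_g}\left(x^{\mathrm{orig}}_{i,gj} - \mu_i\right)^2}$$

[0022] z-score standardization is performed on each row of $X^{\mathrm{orig}}$, where $\mu_i$ and $\sigma_i$ are the mean and standard deviation of row $i$; this yields the standardized data matrix $X^{\text{z-score}}$. As before, the superscript "z-score" indicates that the data in the matrix has been z-score standardized. The tables in Appendix 2-1 and Appendix 2-2 show the detailed format of the data matrix described in this section.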

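[0023] Furthermore, step S30 includes: for the gene represented by row $i$ of $X^{\text{z-score}}$, let the total error be $SST_i$, the within-group error be $SSW_i$, the inter-group error be $SSB_i$, and the required statistic be $F_i$; $i$ is a subscript, not a mathematical function, and the same applies to the other labels in step S30. Below, $x_{i,gj}$ is shorthand for $x^{\text{z-score}}_{i,gj}$, $\bar{x}_i$ is the mean of row $i$ over all $N$ samples, and $\bar{x}_{i,g}$ is the mean of row $i$ over the $n_g$ samples of group $g$.

[0024] $SST_i$ is calculated as follows:

[0025] $$SST_i = \sum_{g=1}^{k}\sum_{j=1}^{n_g}\left(x_{i,gj} - \bar{x}_i\right)^2$$

[0026] $SSW_i$ is calculated as follows:

[0027] $$SSW_i = \sum_{g=1}^{k}\sum_{j=1}^{n_g}\left(x_{i,gj} - \bar{x}_{i,g}\right)^2$$

[0028] According to the relationship $SST_i = SSW_i + SSB_i$, $SSB_i$ is calculated as follows:

[0029] $$SSB_i = SST_i - SSW_i = \sum_{g=1}^{k} n_g\left(\bar{x}_{i,g} - \bar{x}_i\right)^2$$

[0030] The $F_i$ value is therefore calculated as follows:

[0031] $$F_i = \frac{SSB_i/(k-1)}{SSW_i/(N-k)}$$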

[0032] Calculate the $F_i$ value of each row of $X^{\text{z-score}}$, set a value $s$ ($s \in \mathbb{N}^{+}$, $2 \le s < m$), and extract the $s$ rows of $X^{\text{z-score}}$ with the largest $F_i$ values to form the screened standardized data matrix $X^{\text{z-score}}_{\mathrm{sel}}$. The genes retained after feature selection are renumbered in order as $1, 2, \dots, s$;

[0033] Compared with $X^{\text{z-score}}$, this matrix carries the additional subscript "sel" (selected), indicating that the data in the matrix has been selected once; it is not a mathematical function.

[0034] $s$ is the number of genes to be selected from the original $m$ genes. $\mathbb{N}^{+}$ denotes the set of positive integers, i.e. $s$ can only be a positive integer. At least two genes should be selected, otherwise the subsequent operations become meaningless, so $s \ge 2$. Because the genes can only be selected from the $m$ genes, $s$ cannot be greater than $m$. Furthermore, if $s = m$ the operation in step S30 becomes meaningless, so it is stipulated that $s < m$. The tables in Appendix 3-1 and Appendix 3-2 show the detailed format of the data matrix described in this section.

[0035] It should be noted that in this step the number of selected genes should be close to or less than the number of samples in each group. Otherwise, because the sample size is small, the high data dimensionality may cause numerical errors and unnecessary trouble during the subsequent correlation analysis.
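The F-value screening of step S30 can be sketched in NumPy as follows. This is a minimal illustration rather than the code of Appendix 4: the function names `f_values` and `select_top_f`, the argument `labels` (one group label per column) and the toy matrix are assumptions made for the example.

```python
import numpy as np

def f_values(Z, labels):
    """One-way ANOVA F value for every row (gene) of Z; columns are samples."""
    Z = np.asarray(Z, dtype=float)
    labels = np.asarray(labels)
    groups = np.unique(labels)
    n_total, k = Z.shape[1], groups.size
    grand_mean = Z.mean(axis=1, keepdims=True)
    sst = ((Z - grand_mean) ** 2).sum(axis=1)            # total error
    ssw = np.zeros(Z.shape[0])
    for g in groups:                                     # within-group error
        block = Z[:, labels == g]
        ssw += ((block - block.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
    ssb = sst - ssw                                      # inter-group error
    return (ssb / (k - 1)) / (ssw / (n_total - k))

def select_top_f(Z, labels, s):
    """Indices (in original row order) of the s rows with the largest F."""
    F = f_values(Z, labels)
    keep = np.sort(np.argsort(F)[::-1][:s])
    return keep, F

# Toy data (hypothetical): 2 genes, 2 groups of 3 samples each.
Z_demo = np.array([[1, 2, 3, 4, 5, 6],
                   [1, 6, 2, 5, 3, 4]], dtype=float)
labels_demo = np.array([0, 0, 0, 1, 1, 1])
keep_demo, F_demo = select_top_f(Z_demo, labels_demo, s=1)
```

Sorting the kept indices preserves the original gene order, which matches the renumbering "in order" described in the text.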

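[0036] Furthermore, step S40 includes: from the matrix $X^{\text{z-score}}_{\mathrm{sel}}$ (see Appendix 3 for details), take the data of the samples of group $g$ for all $s$ genes, i.e. the sub-matrix

[0037] $$\begin{pmatrix} x_{1,g1} & x_{1,g2} & \cdots & x_{1,gn_g} \\ x_{2,g1} & x_{2,g2} & \cdots & x_{2,gn_g} \\ \vdots & \vdots & & \vdots \\ x_{s,g1} & x_{s,g2} & \cdots & x_{s,gn_g} \end{pmatrix}$$

[0038] To calculate the correlation coefficients between all pairs of genes in this group, all samples of group $g$ (i.e. $j = 1, 2, \dots, n_g$) are used. To denote any pair among the $s$ genes, the letters $u$ and $v$ are introduced to represent any two genes.

[0039] Let the correlation coefficient matrix generated from the samples of group $g$ be $R^{(g)}$. The element in row $u$ and column $v$ of $R^{(g)}$ is the correlation coefficient $r^{(g)}_{u,v}$ between gene $u$ and gene $v$ across the samples of group $g$ (i.e. between rows $u$ and $v$ of the sub-matrix above).

[0040] The superscript "$(g)$" indicates that the correlation matrix is calculated from group $g$; it is not a mathematical function.

[0041] The subscript "$u,v$" indicates which pair of genes the correlation coefficient comes from; it is not a mathematical function.

[0042] The calculation formula is as follows, where $\bar{x}_{u,g}$ is the mean of gene $u$ over the samples of group $g$:

[0043] $$r^{(g)}_{u,v} = \frac{\sum_{j=1}^{n_g}\left(x_{u,gj}-\bar{x}_{u,g}\right)\left(x_{v,gj}-\bar{x}_{v,g}\right)}{\sqrt{\sum_{j=1}^{n_g}\left(x_{u,gj}-\bar{x}_{u,g}\right)^2}\,\sqrt{\sum_{j=1}^{n_g}\left(x_{v,gj}-\bar{x}_{v,g}\right)^2}}$$

[0044] The correlation coefficient matrix $R^{(g)}$ is therefore as shown in the following formula:

[0045] $$R^{(g)} = \begin{pmatrix} r^{(g)}_{1,1} & r^{(g)}_{1,2} & \cdots & r^{(g)}_{1,s} \\ r^{(g)}_{2,1} & r^{(g)}_{2,2} & \cdots & r^{(g)}_{2,s} \\ \vdots & \vdots & \ddots & \vdots \\ r^{(g)}_{s,1} & r^{(g)}_{s,2} & \cdots & r^{(g)}_{s,s} \end{pmatrix}$$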

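[0046] This collects the correlation coefficients between all genes within group $g$ in one place, which makes them easier to inspect and manipulate. The correlation coefficient matrix of each group can therefore be written out separately. Since the number of genes is the same, these matrices have the same shape and can be multiplied element-wise, matrix-multiplied, added to and subtracted from one another:

[0047] $$R^{(1)} = \left(r^{(1)}_{u,v}\right)_{s \times s},\quad R^{(2)} = \left(r^{(2)}_{u,v}\right)_{s \times s},\quad \dots,\quad R^{(k)} = \left(r^{(k)}_{u,v}\right)_{s \times s}$$

[0048] Clearly, the correlation coefficient $r^{(g)}_{u,v}$ has two properties:

[0049] $$r^{(g)}_{u,u} = 1$$

[0050] $$r^{(g)}_{u,v} = r^{(g)}_{v,u}$$

[0051] Then $R^{(g)}$ can be written in the following form:

[0052] $$R^{(g)} = \begin{pmatrix} 1 & r^{(g)}_{1,2} & \cdots & r^{(g)}_{1,s} \\ r^{(g)}_{1,2} & 1 & \cdots & r^{(g)}_{2,s} \\ \vdots & \vdots & \ddots & \vdots \\ r^{(g)}_{1,s} & r^{(g)}_{2,s} & \cdots & 1 \end{pmatrix}$$

[0053] This gives $R^{(1)}, R^{(2)}, \dots, R^{(k)}$, a total of $k$ correlation coefficient matrices. The values on the two sides of the diagonal are symmetric and equal, so to reduce the computational effort only the upper or lower triangular part needs to be used.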

[0054] Furthermore, step S50 includes:

[0055] Step S51: Calculate the set of differences:

[0056] Take the correlation coefficient matrices $R^{(1)}, R^{(2)}, \dots, R^{(k)}$ and subtract them pairwise.

[0057] To denote any two of the $k$ groups, the letters $a$ and $b$ are introduced to represent any two groups.

[0058] Similarly, to denote any two of the $s$ genes, the letters $u$ and $v$ are introduced to represent any two genes, where the element in row $u$ and column $v$ of $R^{(g)}$ is $r^{(g)}_{u,v}$.

[0059] Denote the difference between $R^{(a)}$ and $R^{(b)}$ as $D^{(a,b)}$, so that

[0060] $$D^{(a,b)} = R^{(a)} - R^{(b)}, \qquad d^{(a,b)}_{u,v} = r^{(a)}_{u,v} - r^{(b)}_{u,v}$$

[0061] $a$, $b$, $u$ and $v$ are all labels, not mathematical functions;

[0062] The correlation between a pair of genes can be positive or negative, with a positive or negative correlation coefficient respectively. To select the pronounced changes regardless of sign, take the absolute value $\left|d^{(a,b)}_{u,v}\right|$ of each element of the difference matrix.

[0063] Set a value $h$ ($h \in \mathbb{N}^{+}$, $1 \le h < \tfrac{s(s-1)}{2}$) and mark the $h$ elements of $D^{(a,b)}$ with the largest absolute value with the label "lar". As described above, only the correlation coefficients on one side of the diagonal are considered, so only the elements with subscripts $u < v$ are traversed. Traverse the "lar" elements and screen those greater than zero; the gene pairs given by their subscripts are denoted ${}^{\mathrm{pos}}(u,v)^{(a,b)}$, and the set containing all of them is denoted $H^{\mathrm{pos}}$. Traverse the "lar" elements and screen those less than zero; the gene pairs given by their subscripts are denoted ${}^{\mathrm{neg}}(u,v)^{(a,b)}$, and the set containing all of them is denoted $H^{\mathrm{neg}}$;

[0064] $h$ is a label, not a mathematical function. $\mathbb{N}^{+}$ is the set of positive integers. The value $h$ is the number of gene pairs to be selected, and at least 1 should be selected, so $h \ge 1$. The upper limit of $h$ is set to $\tfrac{s(s-1)}{2}$ because the values on the diagonal of the correlation coefficient matrix are all 1 and the values on the two sides of the diagonal are symmetric, so only one side is calculated and at most $\tfrac{s(s-1)}{2}$ elements are available; if $h = \tfrac{s(s-1)}{2}$, step S51 loses its meaning, so it is stipulated that $h < \tfrac{s(s-1)}{2}$;

[0065] The upper-left label "lar" (large) marks the set of the $h$ largest values selected; it is not a mathematical function;

[0066] The upper-left label "neg" (negative) indicates a negative value; it is not a mathematical function;

[0067] In ${}^{\mathrm{neg}}(u,v)^{(a,b)}$, neither the upper-left label "neg" nor the upper-right label "$(a,b)$" is a mathematical function, and their meanings are as described above. The parentheses enclose a pair of genes, which forms an element of the set;

[0068] $H$ is a mathematical set that contains the gene pairs with the specified subscripts. The superscript "neg" indicates that the screened value for the gene pairs it contains is negative; it is not a mathematical function.

[0069] "lar", "pos" and "$(a,b)$" are all labels, not mathematical functions;
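The difference screening of step S51 can be sketched as follows. The function name `top_difference_pairs` and the two toy matrices are assumptions made for illustration, and splitting the selected pairs by the sign of the signed difference is one reading of the "pos" and "neg" labels, not a confirmed detail of Appendix 4.

```python
import numpy as np

def top_difference_pairs(R_a, R_b, h):
    """Return (pos, neg) sets of gene pairs (u, v), u < v, taken from the
    h largest absolute differences between two correlation matrices."""
    D = np.asarray(R_a, dtype=float) - np.asarray(R_b, dtype=float)
    iu, iv = np.triu_indices_from(D, k=1)        # one side of the diagonal
    order = np.argsort(np.abs(D[iu, iv]))[::-1][:h]
    pos = {(int(iu[t]), int(iv[t])) for t in order if D[iu[t], iv[t]] > 0}
    neg = {(int(iu[t]), int(iv[t])) for t in order if D[iu[t], iv[t]] < 0}
    return pos, neg

# Toy correlation matrices for two groups (hypothetical values).
R_a = np.array([[1.0, 0.9, 0.1],
                [0.9, 1.0, -0.2],
                [0.1, -0.2, 1.0]])
R_b = np.array([[1.0, 0.1, 0.2],
                [0.1, 1.0, 0.7],
                [0.2, 0.7, 1.0]])
H_pos, H_neg = top_difference_pairs(R_a, R_b, h=2)
```

Gene indices here are zero-based, so the pair `(0, 1)` corresponds to genes 1 and 2 in the numbering of the text.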

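[0070] Let $\Delta^{(g)}_{u,v}$ denote the determinant of the part of the matrix $R^{(g)}$ that remains after removing the row and column in which the element $r^{(g)}_{u,v}$ is located.

[0071] That is, the remaining part is the matrix obtained by deleting row $u$ and column $v$ of $R^{(g)}$, and its determinant is

[0072] $$\Delta^{(g)}_{u,v} = \det\!\left(R^{(g)}\ \text{with row } u \text{ and column } v \text{ removed}\right)$$

[0073] Let the partial correlation coefficient be $p^{(g)}_{u,v}$. The calculation formula is as follows:

[0074] $$p^{(g)}_{u,v} = (-1)^{u+v+1}\,\frac{\Delta^{(g)}_{u,v}}{\sqrt{\Delta^{(g)}_{u,u}\,\Delta^{(g)}_{v,v}}}$$

[0075] For all $u < v$, calculate $p^{(a)}_{u,v}$ and $p^{(b)}_{u,v}$ separately and calculate their difference $p^{(a)}_{u,v} - p^{(b)}_{u,v}$; the $h$ differences with the largest absolute value are marked "lar". Traverse the "lar" differences and screen those greater than zero; the gene pairs given by their subscripts are labeled "pos", and the set containing all of them is recorded. Traverse the "lar" differences and screen those less than zero; the gene pairs given by their subscripts are labeled "neg", and the set containing all of them is recorded. As above, "$(a,b)$", "$u$", "$v$", "lar", "neg" and "pos" are all labels, not mathematical functions.

[0076] Comparing the elements of the four sets obtained above (the "pos" and "neg" sets from the correlation coefficient differences and the "pos" and "neg" sets from the partial correlation coefficient differences) yields gene pairs with different combinations of correlation coefficient and partial correlation coefficient behavior. Based on the corresponding subscripts, combining the labels of the same gene pair gives four combined sets of gene pairs, one for each combination of "pos" and "neg" in the two analyses, written with the labeling convention described above.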
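The partial correlation coefficient can be computed from determinants of sub-matrices, as in the text. The sketch below is an illustration only: the helper names `minor` and `partial_corr` are hypothetical, indices are zero-based, and the sign factor `(-1) ** (u + v)` is included because the determinant of the remaining sub-matrix is a minor rather than a cofactor.

```python
import numpy as np

def minor(R, u, v):
    """Determinant of R after deleting row u and column v."""
    sub = np.delete(np.delete(R, u, axis=0), v, axis=1)
    return np.linalg.det(sub)

def partial_corr(R, u, v):
    """Partial correlation of genes u and v given all other genes in R."""
    R = np.asarray(R, dtype=float)
    cofactor_uv = (-1) ** (u + v) * minor(R, u, v)
    return -cofactor_uv / np.sqrt(minor(R, u, u) * minor(R, v, v))

# Toy 3-gene correlation matrix (hypothetical values).
R_demo = np.array([[1.0, 0.5, 0.3],
                   [0.5, 1.0, 0.4],
                   [0.3, 0.4, 1.0]])
p01 = partial_corr(R_demo, 0, 1)

# Cross-check against the equivalent precision-matrix form.
P = np.linalg.inv(R_demo)
p01_inv = -P[0, 1] / np.sqrt(P[0, 0] * P[1, 1])
```

For three genes the result agrees with the textbook expression $(r_{01} - r_{02} r_{12}) / \sqrt{(1 - r_{02}^2)(1 - r_{12}^2)}$, which is a convenient sanity check.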

[0077] Step S52: Calculate the set of dot products:

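[0078] Similar to step S51, traverse the elements with $u < v$ of each correlation coefficient matrix $R^{(1)}, R^{(2)}, \dots, R^{(k)}$ separately, set a value $h$ ($h \in \mathbb{N}^{+}$, $1 \le h < \tfrac{s(s-1)}{2}$), take absolute values and retain the $h$ elements with the largest absolute value, and set all other elements to zero to form the new screened correlation coefficient matrices $R^{(1)}_{\mathrm{sel}}, R^{(2)}_{\mathrm{sel}}, \dots, R^{(k)}_{\mathrm{sel}}$, $k$ in total;

[0079] Take the screened matrices $R^{(1)}_{\mathrm{sel}}, R^{(2)}_{\mathrm{sel}}, \dots, R^{(k)}_{\mathrm{sel}}$ and form their pairwise dot products, that is, multiply the elements at the same positions: the dot product of $R^{(a)}_{\mathrm{sel}}$ and $R^{(b)}_{\mathrm{sel}}$ has elements $r^{(a)}_{\mathrm{sel},u,v}\, r^{(b)}_{\mathrm{sel},u,v}$, and its non-zero elements are labeled "nz". Traverse the non-zero elements and screen those greater than zero; the gene pairs given by their subscripts are labeled "pos", and the set containing all of them is recorded. Traverse the non-zero elements and screen those less than zero; the gene pairs given by their subscripts are labeled "neg", and the set containing all of them is recorded;

[0080] For all $u < v$, calculate the partial correlation coefficients $p^{(a)}_{u,v}$ and $p^{(b)}_{u,v}$ separately,

[0081] $$p^{(a)}_{u,v} = (-1)^{u+v+1}\,\frac{\Delta^{(a)}_{u,v}}{\sqrt{\Delta^{(a)}_{u,u}\,\Delta^{(a)}_{v,v}}}$$

[0082] $$p^{(b)}_{u,v} = (-1)^{u+v+1}\,\frac{\Delta^{(b)}_{u,v}}{\sqrt{\Delta^{(b)}_{u,u}\,\Delta^{(b)}_{v,v}}}$$

[0083] and find their product $p^{(a)}_{u,v}\, p^{(b)}_{u,v}$, where $\Delta^{(g)}_{u,v}$ is the determinant of $R^{(g)}$ with row $u$ and column $v$ removed. Traverse the products and screen those greater than zero; the gene pairs given by their subscripts are labeled "pos", and the set containing all of them is recorded. Traverse the products and screen those less than zero; the gene pairs given by their subscripts are labeled "neg", and the set containing all of them is recorded;

[0084] Comparing the elements of the four sets obtained above (the "pos" and "neg" sets from the correlation coefficient products and the "pos" and "neg" sets from the partial correlation coefficient products) yields gene pairs with different combinations of correlation coefficient and partial correlation coefficient behavior. Based on the corresponding subscripts, combining the labels of the same gene pair gives four combined dot product sets of gene pairs, one for each combination of "pos" and "neg" in the two analyses;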

[0085] Gene pairs from different sets can be selected for bioinformatics, molecular biology, and cell biology research, depending on the research needs.

[0086] Most of the labels in step S52 are the same as those in step S51; the newly added labels, such as "sel" and "nz", are likewise labels and not mathematical functions.
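The dot product screening of step S52 can be sketched as follows. The helper names `keep_top_h` and `product_pairs` and the toy matrices are assumptions made for illustration; ranking by absolute value before zeroing the remaining elements is also an assumption about how "largest" is meant.

```python
import numpy as np

def keep_top_h(R, h):
    """Keep the h upper-triangular entries of R with the largest absolute
    value and set every other entry (including the diagonal) to zero."""
    R = np.asarray(R, dtype=float)
    iu, iv = np.triu_indices_from(R, k=1)
    order = np.argsort(np.abs(R[iu, iv]))[::-1][:h]
    S = np.zeros_like(R)
    S[iu[order], iv[order]] = R[iu[order], iv[order]]
    return S

def product_pairs(R_a, R_b, h):
    """Element-wise product of two screened matrices, split by sign."""
    P = keep_top_h(R_a, h) * keep_top_h(R_b, h)
    pos = {(int(u), int(v)) for u, v in zip(*np.nonzero(P > 0))}
    neg = {(int(u), int(v)) for u, v in zip(*np.nonzero(P < 0))}
    return pos, neg

# Toy correlation matrices for two groups (hypothetical values).
R_a = np.array([[1.0, 0.9, 0.1],
                [0.9, 1.0, -0.2],
                [0.1, -0.2, 1.0]])
R_b = np.array([[1.0, 0.1, 0.2],
                [0.1, 1.0, 0.7],
                [0.2, 0.7, 1.0]])
prod_pos, prod_neg = product_pairs(R_a, R_b, h=2)
```

A pair survives only if it is among the top `h` in both groups, so a negative product flags a strong correlation that changes sign between the two groups.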

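[0087] Furthermore, step S60 includes: take the union of all gene pairs contained in the eight sets obtained in steps S51 and S52. Take all genes appearing in this union and renumber them as $1, 2, \dots, t$; extract the $t$ rows of these genes from the standardized data matrix to form a further screened standardized data matrix;

[0088] Let the correlation coefficient matrix generated from the samples of group $g$ for these $t$ genes be $R^{(g)}$. The element in row $u$ and column $v$ of $R^{(g)}$ is the correlation coefficient $r^{(g)}_{u,v}$ between gene $u$ and gene $v$ among the $t$ genes, across the samples of group $g$;

[0089] The calculation formula is as follows, where $x_{u,gj}$ is the standardized expression of gene $u$ in sample $(g,j)$ and $\bar{x}_{u,g}$ is its mean over group $g$:

[0090] $$r^{(g)}_{u,v} = \frac{\sum_{j=1}^{n_g}\left(x_{u,gj}-\bar{x}_{u,g}\right)\left(x_{v,gj}-\bar{x}_{v,g}\right)}{\sqrt{\sum_{j=1}^{n_g}\left(x_{u,gj}-\bar{x}_{u,g}\right)^2}\,\sqrt{\sum_{j=1}^{n_g}\left(x_{v,gj}-\bar{x}_{v,g}\right)^2}}$$

[0091] Clearly, the correlation coefficient $r^{(g)}_{u,v}$ has two properties:

[0092] $$r^{(g)}_{u,u} = 1$$

[0093] $$r^{(g)}_{u,v} = r^{(g)}_{v,u}$$

[0094] Then $R^{(g)}$ has the following form:

[0095] $$R^{(g)} = \begin{pmatrix} 1 & r^{(g)}_{1,2} & \cdots & r^{(g)}_{1,t} \\ r^{(g)}_{1,2} & 1 & \cdots & r^{(g)}_{2,t} \\ \vdots & \vdots & \ddots & \vdots \\ r^{(g)}_{1,t} & r^{(g)}_{2,t} & \cdots & 1 \end{pmatrix}$$

[0096] This gives $R^{(1)}, R^{(2)}, \dots, R^{(k)}$, a total of $k$ correlation coefficient matrices;

[0097] Let $\Delta^{(g)}_{u,u}$ be the determinant of the part of the matrix $R^{(g)}$ that remains after removing the row and column containing the element $r^{(g)}_{u,u}$. The multiple correlation coefficient $\rho^{(g)}_{\mathrm{Mul},u}$ of a single gene $u$ in group $g$ is calculated as: $$\rho^{(g)}_{\mathrm{Mul},u} = \sqrt{1 - \frac{\det\!\left(R^{(g)}\right)}{\Delta^{(g)}_{u,u}}}$$

[0098] "Mul", $g$ and $u$ are all labels. $\det(\cdot)$ denotes the determinant of the matrix within the parentheses, calculated by the method used in general linear algebra.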
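The multiple correlation coefficient of step S60 can be sketched with two determinants. The function name `multiple_corr` and the toy matrix are assumptions for illustration, and indices are zero-based.

```python
import numpy as np

def multiple_corr(R, u):
    """Multiple correlation coefficient of gene u with all other genes,
    computed as sqrt(1 - det(R) / det(R without row u and column u))."""
    R = np.asarray(R, dtype=float)
    minor_uu = np.linalg.det(np.delete(np.delete(R, u, axis=0), u, axis=1))
    # Guard against tiny negative values caused by floating-point rounding.
    return np.sqrt(max(0.0, 1.0 - np.linalg.det(R) / minor_uu))

# Toy 3-gene correlation matrix (hypothetical values).
R_demo = np.array([[1.0, 0.5, 0.3],
                   [0.5, 1.0, 0.4],
                   [0.3, 0.4, 1.0]])
rho_0 = multiple_corr(R_demo, 0)
```

For this matrix the squared value equals the coefficient of determination of regressing gene 0 on genes 1 and 2, i.e. $(r_{01}^2 + r_{02}^2 - 2 r_{01} r_{02} r_{12}) / (1 - r_{12}^2)$.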

[0099] Furthermore, step S70 includes: for each gene, calculate the absolute value of the difference between its multiple correlation coefficients in different groups, taking the groups pairwise. For groups $a$ and $b$, the difference between the multiple correlation coefficients of gene $u$ is $\rho^{(a)}_{\mathrm{Mul},u} - \rho^{(b)}_{\mathrm{Mul},u}$, and its absolute value is recorded. Set a value $q$ (a positive integer) and extract the $q$ genes with the largest absolute values. If the difference is greater than zero, record the two groups in the labels and generate one set; if the difference is less than zero, record the two groups in the labels and generate another set.
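Step S70 can be sketched as a comparison of per-gene multiple correlation coefficients between two groups. The names `multiple_corr_all` and `changed_genes`, the parameter `q` and the toy matrices are assumptions made for illustration; the two returned sets correspond to a positive and a negative signed difference.

```python
import numpy as np

def multiple_corr_all(R):
    """Multiple correlation coefficient of every gene in correlation matrix R."""
    R = np.asarray(R, dtype=float)
    det_R = np.linalg.det(R)
    out = np.empty(R.shape[0])
    for u in range(R.shape[0]):
        minor_uu = np.linalg.det(np.delete(np.delete(R, u, axis=0), u, axis=1))
        out[u] = np.sqrt(max(0.0, 1.0 - det_R / minor_uu))
    return out

def changed_genes(R_a, R_b, q):
    """Genes with the q largest absolute changes, split by sign of change."""
    diff = multiple_corr_all(R_a) - multiple_corr_all(R_b)
    order = np.argsort(np.abs(diff))[::-1][:q]
    higher_in_a = {int(u) for u in order if diff[u] > 0}
    higher_in_b = {int(u) for u in order if diff[u] < 0}
    return higher_in_a, higher_in_b

# Toy example (hypothetical): group a has correlated genes, group b has none.
R_a = np.array([[1.0, 0.5, 0.3],
                [0.5, 1.0, 0.4],
                [0.3, 0.4, 1.0]])
R_b = np.eye(3)
up_a, up_b = changed_genes(R_a, R_b, q=1)
```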

[0100] Compared with the prior art, the effects of this invention are positive and obvious.

[0101] 1. The technical solution of this invention, in order to facilitate the analysis of correlations between data in large datasets and to reduce computational load, utilizes feature selection to select a certain number of genes with large inter-group difference statistics (F-values) based on requirements. This allows for the selection of a specific number of genes for correlation analysis. In traditional bioinformatics transcriptomics research, when using ANOVA for hypothesis testing, the required number of genes cannot be clearly defined. This screening method is prone to statistical type II errors, resulting in too few selected genes and hindering further analysis.

[0102] 2. This invention utilizes correlation coefficients, partial correlation coefficients, and multiple correlation coefficients for feature selection, which helps to further reduce data dimensionality and is beneficial for inferring how the correlation between the expression of genes changes under different experimental treatments. Traditional bioinformatics transcriptome research requires placing the selected genes into databases such as KEGG Pathway and Reactome for association analysis, which cannot analyze unnamed or unannotated genes.

[0103] 3. The dimensionality reduction method used in this invention is feature selection (selecting a certain number of genes with large inter-group difference statistics F-values, and those with large correlation coefficients / partial correlation coefficients / multiple correlation coefficients in various correlation analyses), rather than dimensionality reduction methods such as PCA. This is beneficial for reducing data dimensionality while preserving the basis vectors of each dimension (in this invention, i.e., the genes detected in each sample under each group treatment and their expression levels) to prevent recombination. The dimensionality-reduced data can then be used for gene correlation analysis. Traditional bioinformatics uses dimensionality reduction methods such as PCA, which, while providing a good visualization of the overall gene expression situation, cannot select the main possible genes affecting phenotypes for analysis.

Attached Figure Description

[0104] Figure 1 is a schematic diagram of the PCA shown in Example 1.

Detailed Implementation

[0105] The present invention will be further described below with reference to embodiments, but the present invention is not limited to these embodiments. Any similar structures and similar variations of the present invention should be included within the scope of protection of the present invention. The gene names, group names, and the conditions, quantities, and names of the parameter settings in the present invention are used for descriptive convenience only and are not intended to limit the technical solutions of the present invention.

[0106] Example 1

[0107] The present invention provides a biochip data feature engineering method based on statistical machine learning, the code of which is detailed in Appendix 4, and includes the following steps:

[0108] I. Generating a Data Matrix

[0109] After obtaining the biochip data, the genes are numbered $1, 2, \dots, m$; the groups to which the samples belong are numbered $1, 2, \dots, k$; and the samples within group $g$ are numbered $(g,1), (g,2), \dots, (g,n_g)$. Here $g$ and $j$ in $(g,j)$ are labels indicating that the sample belongs to group $g$ and is the $j$-th sample of that group, and $n_g$ is the number of samples contained in group $g$, so that the total number of samples is $N = \sum_{g=1}^{k} n_g$; $m$, $k$ and every $n_g$ are positive integers;

[0110] A data matrix $X^{\mathrm{orig}}$ is generated using gene IDs as row names and sample IDs as column names, with the position of each row and column name serving as the row and column index. Each element of the data matrix is the raw expression level of a single gene in a single sample on the biochip; the element in row $i$ and in the column of sample $(g,j)$ is denoted $x^{\mathrm{orig}}_{i,gj}$. The superscript "orig" indicates that the data has not been processed and is not a mathematical function; the subscript "$i,gj$" gives the position of the element within the matrix.

[0111] To make it easy to generate the data matrix with Python code, a biochip dataset containing three groups of nine samples each was set up, with the expression levels of ten genes measured in each sample. The `np.random.rand(10,27)` function from the NumPy library in Python 3.9 was used to generate an array of random numbers matching these parameters, which was then organized into a data matrix (in this embodiment, all subsequent steps are calculated from this data matrix, and no new data matrix is generated during the calculation) (see Appendix 4 for details).
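The setup described in this paragraph might look like the sketch below. Only `np.random.rand(10,27)` comes from the text; the fixed seed, the gene and sample naming scheme and the variable names are assumptions made for illustration and are not taken from Appendix 4.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # hypothetical seed, only to make the sketch repeatable

n_genes, n_groups, n_per_group = 10, 3, 9
X_orig = np.random.rand(n_genes, n_groups * n_per_group)  # 10 genes x 27 samples

# Hypothetical IDs: gene_1..gene_10 as rows, g<group>_s<sample> as columns.
genes = [f"gene_{i + 1}" for i in range(n_genes)]
samples = [f"g{g + 1}_s{j + 1}"
           for g in range(n_groups) for j in range(n_per_group)]
table_orig = pd.DataFrame(X_orig, index=genes, columns=samples)

# Group label of every column, used later for the F test and per-group steps.
group_labels = np.repeat(np.arange(1, n_groups + 1), n_per_group)
```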

[0112] II. z-score standardization

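[0113] Using the formula

[0114] $$x^{\text{z-score}}_{i,gj} = \frac{x^{\mathrm{orig}}_{i,gj} - \mu_i}{\sigma_i}, \qquad \mu_i = \frac{1}{N}\sum_{g=1}^{k}\sum_{j=1}^{n_g} x^{\mathrm{orig}}_{i,gj}, \qquad \sigma_i = \sqrt{\frac{1}{N}\sum_{g=1}^{k}\sum_{j=1}^{n_g}\left(x^{\mathrm{orig}}_{i,gj} - \mu_i\right)^2}$$

[0115] z-score standardization is performed on each row of $X^{\mathrm{orig}}$, where $\mu_i$ and $\sigma_i$ are the mean and standard deviation of row $i$; this yields the standardized data matrix $X^{\text{z-score}}$. The superscript "z-score" indicates that the data in the matrix has been z-score standardized.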

[0116] The data matrix is z-score standardized using the preprocessing module of the sklearn software library. According to the sklearn manual, the code `preprocessing.scale(X,axis=1)` standardizes the data matrix, and this z-score standardization is the same as the calculation by the formula listed in step S20 of this invention (see Appendix 4 for details).
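The equivalence between the row-wise formula and `preprocessing.scale(X, axis=1)` can be checked with a short sketch. The random input and the variable names are assumptions for illustration; the manual version uses the population standard deviation (`ddof=0`), which is what `scale` uses.

```python
import numpy as np
from sklearn import preprocessing

X = np.random.default_rng(0).random((10, 27))   # hypothetical 10 genes x 27 samples

# Manual row-wise z-score: subtract the row mean, divide by the row std.
row_mean = X.mean(axis=1, keepdims=True)
row_std = X.std(axis=1, keepdims=True)          # population std (ddof=0)
X_manual = (X - row_mean) / row_std

# sklearn equivalent; axis=1 standardizes each row (gene) independently.
X_zscore = preprocessing.scale(X, axis=1)
```

After this step every gene has mean 0 and unit variance across the samples, so genes with different expression scales become comparable.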

[0117] III. Calculating F values and screening the gene data with large F values

[0118] For the gene represented by row $i$ of $X^{\text{z-score}}$, let the total error be $SST_i$, the within-group error be $SSW_i$, the inter-group error be $SSB_i$, and the required statistic be $F_i$; $i$ is a subscript, not a mathematical function. Below, $x_{i,gj}$ is shorthand for $x^{\text{z-score}}_{i,gj}$, $\bar{x}_i$ is the mean of row $i$ over all $N$ samples, and $\bar{x}_{i,g}$ is the mean of row $i$ over the $n_g$ samples of group $g$.

[0119] $SST_i$ is calculated as follows:

[0120] $$SST_i = \sum_{g=1}^{k}\sum_{j=1}^{n_g}\left(x_{i,gj} - \bar{x}_i\right)^2$$

[0121] $SSW_i$ is calculated as follows:

[0122] $$SSW_i = \sum_{g=1}^{k}\sum_{j=1}^{n_g}\left(x_{i,gj} - \bar{x}_{i,g}\right)^2$$

[0123] According to the relationship $SST_i = SSW_i + SSB_i$, $SSB_i$ is calculated as follows:

[0124] $$SSB_i = SST_i - SSW_i = \sum_{g=1}^{k} n_g\left(\bar{x}_{i,g} - \bar{x}_i\right)^2$$

[0125] The $F_i$ value is therefore calculated as follows:

[0126] $$F_i = \frac{SSB_i/(k-1)}{SSW_i/(N-k)}$$

[0127] Calculate the $F_i$ value of each row of $X^{\text{z-score}}$, set a value $s$ ($s \in \mathbb{N}^{+}$, $2 \le s < m$), and extract the $s$ rows of $X^{\text{z-score}}$ with the largest $F_i$ values to form the screened standardized data matrix $X^{\text{z-score}}_{\mathrm{sel}}$. The genes retained after feature selection are renumbered in order as $1, 2, \dots, s$;

[0128] The subscript "sel" indicates that the data in this matrix has been selected once; it is not a mathematical function.

[0129] $s$ is the number of genes to be selected from the original $m$ genes. $\mathbb{N}^{+}$ denotes the set of positive integers, i.e. $s$ can only be a positive integer, with $s \ge 2$ and $s$ less than $m$.

[0130] The data is organized using the DataFrame class of the pandas software library, with the code `table_1 = pd.DataFrame(X_zscore, index=genes, columns=groups)`. SelectKBest and f_classif from the sklearn software library are used to select the gene data with large F values. The method of calculating the F values and selecting the gene data with large F values is the same as the calculation by the formula listed in step S30 of this invention (see Appendix 4 for details).

[0131] In this example, 4 genes are selected from 10 genes.
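Selecting 4 of 10 genes with `SelectKBest` and `f_classif` might look like the following sketch. The random input, the label vector `y` and the variable names are assumptions for illustration. Because sklearn treats rows as samples and columns as features, the gene-by-sample matrix is transposed before fitting.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X_zscore = preprocessing.scale(rng.random((10, 27)), axis=1)  # genes x samples
y = np.repeat([1, 2, 3], 9)          # group label of each of the 27 samples

selector = SelectKBest(score_func=f_classif, k=4)
X_sel = selector.fit_transform(X_zscore.T, y).T   # back to genes x samples

selected_rows = selector.get_support(indices=True)  # kept gene indices, ascending
f_scores = selector.scores_                         # one-way ANOVA F per gene
```

`get_support(indices=True)` returns the kept genes in their original order, which corresponds to renumbering the retained genes in order.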

[0132] IV. Generating the Correlation Coefficient Matrix

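[0133] Let the correlation coefficient matrix generated from the samples of group $g$ be $R^{(g)}$. The element in row $u$ and column $v$ of $R^{(g)}$ is the correlation coefficient $r^{(g)}_{u,v}$ between gene $u$ and gene $v$ among the $s$ selected genes, across the samples of group $g$;

[0134] The superscript "$(g)$" indicates that the correlation matrix is calculated from group $g$; it is not a mathematical function.

[0135] The subscript "$u,v$" indicates which pair of genes the correlation coefficient comes from; it is not a mathematical function;

[0136] The calculation formula is as follows, where $x_{u,gj}$ is the standardized expression of gene $u$ in sample $(g,j)$ and $\bar{x}_{u,g}$ is its mean over group $g$:

[0137] $$r^{(g)}_{u,v} = \frac{\sum_{j=1}^{n_g}\left(x_{u,gj}-\bar{x}_{u,g}\right)\left(x_{v,gj}-\bar{x}_{v,g}\right)}{\sqrt{\sum_{j=1}^{n_g}\left(x_{u,gj}-\bar{x}_{u,g}\right)^2}\,\sqrt{\sum_{j=1}^{n_g}\left(x_{v,gj}-\bar{x}_{v,g}\right)^2}}$$

[0138] Clearly, the correlation coefficient $r^{(g)}_{u,v}$ has two properties:

[0139] $$r^{(g)}_{u,u} = 1$$

[0140] $$r^{(g)}_{u,v} = r^{(g)}_{v,u}$$

[0141] Then $R^{(g)}$ has the following form:

[0142] $$R^{(g)} = \begin{pmatrix} 1 & r^{(g)}_{1,2} & \cdots & r^{(g)}_{1,s} \\ r^{(g)}_{1,2} & 1 & \cdots & r^{(g)}_{2,s} \\ \vdots & \vdots & \ddots & \vdots \\ r^{(g)}_{1,s} & r^{(g)}_{2,s} & \cdots & 1 \end{pmatrix}$$

[0143] This gives $R^{(1)}, R^{(2)}, \dots, R^{(k)}$, a total of $k$ correlation coefficient matrices.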

[0144] The correlation coefficient matrix of the data is calculated using the `.corr()` method of the pandas library. The calculation method is the same as the formula listed in step S40 of this invention for generating the correlation coefficient matrix, resulting in three correlation coefficient matrices $R^{(1)}$, $R^{(2)}$ and $R^{(3)}$, stored in a Python dictionary (see Appendix 4 for details).
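A per-group use of `.corr()` that stores the results in a dictionary could be sketched as follows. The random input, the gene and sample names and the variable names are assumptions for illustration. `DataFrame.corr()` correlates columns, so each group's block is transposed to put genes in the columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
genes = [f"gene_{u + 1}" for u in range(4)]                # 4 selected genes
samples = [f"g{g}_s{j}" for g in (1, 2, 3) for j in range(1, 10)]
table_sel = pd.DataFrame(rng.random((4, 27)), index=genes, columns=samples)
group_of = pd.Series(np.repeat([1, 2, 3], 9), index=samples)

corr_by_group = {}
for g in (1, 2, 3):
    cols = group_of.index[group_of == g]                   # samples of group g
    corr_by_group[g] = table_sel[cols].T.corr()            # Pearson by default
```

Each value in `corr_by_group` is a symmetric 4 x 4 DataFrame with ones on the diagonal, indexed by gene name on both axes.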

[0145] V. Screening gene pairs - Calculating the difference set and dot product set

[0146] 1. Calculate the set of differences:

[0147] Pick , , ..., The correlation coefficient matrix is ​​subtracted pairwise; where and The difference between the two items The result is The Middle Okay, number Listed as Its absolute value is Set value , ,remember middle The largest Item, for ; Traversal ,filter Item, record Take the subscripts of each item. The corresponding gene pairs are denoted as , including all The set is denoted as ; Traversal ,filter Item, record Take the subscripts of each item. The corresponding gene pairs are denoted as , including all The set is denoted as ;

[0148] These are all subscripts and are not commonly used functions;

[0149] For a correlation coefficient matrix, take the determinant of the matrix that remains after removing the row and the column containing a given term. The partial correlation coefficient of a gene pair is calculated from these determinants. The calculation formula is as follows:

[0150]

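[0151] For each pair of groups and each gene pair, calculate the partial correlation coefficient in the two groups separately, and calculate their difference. Traverse the differences and select the items greater than 0; the subscripts of each selected item identify a gene pair, and the set containing all such gene pairs is recorded as the positive partial-correlation difference set. Traverse the differences again and select the items less than 0; the subscripts of each selected item identify a gene pair, and the set containing all such gene pairs is recorded as the negative partial-correlation difference set;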

[0152] It's a subscript, not a commonly used function. Refer to the set of integers, The value represents the number of gene pairs to be selected. , The upper limit is set to ;

[0153] The superscript "lar" (large) indicates the selection of a specific set of h largest correlation coefficients, which is not a commonly used function;

[0154] The top-left subscript `neg` (negative) indicates a negative value and is not a commonly used function;

[0155] Neither the top-left superscript "neg" nor the top-right superscript is a commonly used function; the parentheses enclose a pair of genes that forms an element of the set;

[0156] It is a mathematical set that contains the gene pairs with specific indices mentioned above. The upper right subscript "neg" indicates that the correlation coefficient of the gene pairs it contains is negative, and it is not a commonly used function.

[0157] "lar", "pos", and the other marks above are all subscripts, not commonly used functions;

[0158] Compare the elements of the four sets above. Gene pairs with different combinations of changes in the correlation coefficient and the partial correlation coefficient can then be obtained. According to the corresponding marks, the top-left and bottom-left marks of the same gene pair are combined, and the combined difference sets of the gene pairs are written using the mark notation described above, giving four combined sets.
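The selection rule for the correlation difference set can be illustrated with a short sketch. It uses the `R_C1` and `R_C3` matrices reported in the Output Results and keeps the $h = 2$ largest absolute differences. The helper name `difference_sets` and the use of `np.triu_indices` to visit each gene pair once are choices made for this illustration and are not taken from Appendix 4.

```python
import numpy as np
import pandas as pd

genes = ["G2", "G4", "G5", "G10"]
# Correlation matrices copied from the reported output (groups C1 and C3).
R_C1 = pd.DataFrame(
    [[1.000000, 0.274025, -0.610449, 0.096428],
     [0.274025, 1.000000, -0.372394, -0.071025],
     [-0.610449, -0.372394, 1.000000, 0.156379],
     [0.096428, -0.071025, 0.156379, 1.000000]],
    index=genes, columns=genes)
R_C3 = pd.DataFrame(
    [[1.000000, -0.646828, 0.162927, 0.048848],
     [-0.646828, 1.000000, 0.058486, 0.061938],
     [0.162927, 0.058486, 1.000000, -0.521304],
     [0.048848, 0.061938, -0.521304, 1.000000]],
    index=genes, columns=genes)


def difference_sets(r_a, r_b, h):
    """Split the h largest |r_a - r_b| entries into positive and negative sets."""
    diff = r_a - r_b
    rows, cols = np.triu_indices(len(diff), k=1)  # each unordered gene pair once
    pairs = [((diff.index[i], diff.columns[j]), float(diff.iat[i, j]))
             for i, j in zip(rows, cols)]
    # Rank by absolute difference and keep the h largest items.
    top = sorted(pairs, key=lambda item: abs(item[1]), reverse=True)[:h]
    pos = {pair: value for pair, value in top if value > 0}
    neg = {pair: value for pair, value in top if value < 0}
    return pos, neg


pos, neg = difference_sets(R_C1, R_C3, h=2)
print(pos)
print(neg)
```

With these inputs the sketch selects (G2, G4) with a difference of about 0.92 and (G2, G5) with a difference of about -0.77, which agrees with the `C1subC3` entries of `Hcorr_pos` and `Hcorr_neg` in the Output Results.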

[0159] 2. Calculate the set of dot products:

[0160] Traverse the items of each correlation coefficient matrix. Set a value for the number of items to keep; in each matrix, retain the items with the largest absolute values, up to that number, and set all other elements to zero, forming a new screened correlation coefficient matrix for each group;

[0161] Take the screened matrices and multiply them element-wise in pairs, i.e., multiply the elements at the same positions. For any two groups, record the non-zero items of the product. Traverse the non-zero items and select those greater than 0; the subscripts of each selected item identify a gene pair, and the set containing all such gene pairs is recorded as the positive dot-product set. Traverse the non-zero items again and select those less than 0; the subscripts of each selected item identify a gene pair, and the set containing all such gene pairs is recorded as the negative dot-product set;

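[0162] For each pair of groups and each gene pair, calculate the partial correlation coefficient in the two groups separately,

[0163]

[0164]

[0165] and calculate their product. Traverse the products and select the items greater than 0; the subscripts of each selected item identify a gene pair, and the set containing all such gene pairs is recorded as the positive partial-correlation dot-product set. Traverse the products again and select the items less than 0; the subscripts of each selected item identify a gene pair, and the set containing all such gene pairs is recorded as the negative partial-correlation dot-product set;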

[0166] Compare the elements of the four sets above. Gene pairs with different combinations of correlation coefficients and partial correlation coefficients can then be obtained. According to the corresponding marks, the top-left and bottom-left marks of the same gene pair are combined, and the combined dot-product sets of the gene pairs are written using the mark notation described above, giving four combined sets;

[0167] The marks above, including "sel" and "nz", are subscripts and are not commonly used functions.
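The dot-product screening can be sketched in the same way. The example below uses the `R_C1` and `R_C2` matrices reported in the Output Results and keeps the two largest absolute correlations in each group. The names `screen_top`, `dot_product_sets`, and `keep` are hypothetical, and restricting the screening to the upper triangle is an assumption made so that each gene pair is counted once.

```python
import numpy as np
import pandas as pd

genes = ["G2", "G4", "G5", "G10"]
# Correlation matrices copied from the reported output (groups C1 and C2).
R_C1 = pd.DataFrame(
    [[1.000000, 0.274025, -0.610449, 0.096428],
     [0.274025, 1.000000, -0.372394, -0.071025],
     [-0.610449, -0.372394, 1.000000, 0.156379],
     [0.096428, -0.071025, 0.156379, 1.000000]],
    index=genes, columns=genes)
R_C2 = pd.DataFrame(
    [[1.000000, 0.243180, -0.515472, -0.021548],
     [0.243180, 1.000000, -0.545216, 0.200719],
     [-0.515472, -0.545216, 1.000000, -0.151341],
     [-0.021548, 0.200719, -0.151341, 1.000000]],
    index=genes, columns=genes)


def screen_top(corr, keep):
    """Keep the `keep` largest |r| above the diagonal; zero everything else."""
    values = corr.to_numpy(dtype=float)
    rows, cols = np.triu_indices(len(values), k=1)
    order = np.argsort(-np.abs(values[rows, cols]))[:keep]
    screened = np.zeros_like(values)
    screened[rows[order], cols[order]] = values[rows[order], cols[order]]
    return pd.DataFrame(screened, index=corr.index, columns=corr.columns)


def dot_product_sets(r_a, r_b, keep):
    """Element-wise product of two screened matrices, split by sign."""
    product = screen_top(r_a, keep) * screen_top(r_b, keep)
    pos, neg = {}, {}
    for i in product.index:
        for j in product.columns:
            value = float(product.loc[i, j])
            if value > 0:
                pos[(i, j)] = value
            elif value < 0:
                neg[(i, j)] = value
    return pos, neg


pos, neg = dot_product_sets(R_C1, R_C2, keep=2)
print(pos)
print(neg)
```

Only gene pairs retained in both groups survive the multiplication. Here these are (G2, G5) and (G4, G5), with products of about 0.315 and 0.203, matching the `C1dotC2_Lcorr_pos` entry in the Output Results.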

[0168] The determinants were calculated using the `det()` function of the `numpy.linalg` module in the NumPy library, and the square roots were calculated using `np.sqrt()`. Combined with Python dictionaries, lists, tuples, and `for` and `if` statements, the difference sets were calculated, and two pairs of genes were selected for analysis. The results are shown below. For ease of code writing, the four difference sets described above are represented in the code by `Hcorr_pos`, `Hcorr_neg`, `Hpcorr_pos`, and `Hpcorr_neg`, respectively, and the four dot-product sets described above are represented by `Lcorr_pos`, `Lcorr_neg`, `Lpcorr_pos`, and `Lpcorr_neg`, respectively (see Appendix 4 for details).
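For reference, the sketch below shows one standard way to obtain partial correlation coefficients from determinants with `np.linalg.det()` and `np.sqrt()`. It assumes the textbook cofactor definition $\rho_{ij\cdot\text{rest}} = -C_{ij}/\sqrt{C_{ii}C_{jj}}$, where $C_{ij}$ is the $(i,j)$ cofactor of the correlation matrix. The formula of the invention is not reproduced in this text, so the convention used in Appendix 4 may differ; the function names are hypothetical.

```python
import numpy as np


def minor_det(matrix, i, j):
    """Determinant of `matrix` with row i and column j removed."""
    reduced = np.delete(np.delete(matrix, i, axis=0), j, axis=1)
    return np.linalg.det(reduced)


def partial_correlations(corr):
    """Partial correlation of every pair, controlling for all other variables.

    Uses rho_ij = -C_ij / sqrt(C_ii * C_jj), where
    C_ij = (-1)**(i + j) * minor_det(corr, i, j) is the (i, j) cofactor.
    """
    corr = np.asarray(corr, dtype=float)
    n = corr.shape[0]
    cof = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            cof[i, j] = (-1) ** (i + j) * minor_det(corr, i, j)
    scale = np.sqrt(np.outer(np.diag(cof), np.diag(cof)))
    pcorr = -cof / scale
    np.fill_diagonal(pcorr, 1.0)  # convention: a variable with itself
    return pcorr


# Three-variable check against the first-order formula
# rho_12.3 = (r12 - r13 * r23) / sqrt((1 - r13**2) * (1 - r23**2)).
R = np.array([[1.0, 0.5, 0.3],
              [0.5, 1.0, 0.4],
              [0.3, 0.4, 1.0]])
P = partial_correlations(R)
expected = (0.5 - 0.3 * 0.4) / np.sqrt((1 - 0.3 ** 2) * (1 - 0.4 ** 2))
print(P[0, 1], expected)
```

The same values can be obtained from the inverse of the correlation matrix, which is usually cheaper than computing every minor when the number of genes is large.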

[0169] VI. Calculate the multiple correlation coefficient

[0170] Take the union of all gene pairs contained in the eight sets obtained above (the four difference sets and the four dot-product sets). Renumber all genes in this union, and extract from the standardized data matrix the rows corresponding to these genes to form the screened standardized data matrix;

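[0171] Suppose that the correlation coefficient matrix generated from the $k$-th group of samples of the screened standardized data matrix is $R^{k}_{Mul}$. The element in row $i$, column $j$ of $R^{k}_{Mul}$ is the correlation coefficient $r^{k}_{ij}$ between the $i$-th gene and the $j$-th gene among the renumbered genes;

[0172] The calculation formula is as follows:

[0173] $$r^{k}_{ij}=\frac{\sum_{s=1}^{n_k}\left(z^{k}_{is}-\bar{z}^{k}_{i}\right)\left(z^{k}_{js}-\bar{z}^{k}_{j}\right)}{\sqrt{\sum_{s=1}^{n_k}\left(z^{k}_{is}-\bar{z}^{k}_{i}\right)^{2}}\,\sqrt{\sum_{s=1}^{n_k}\left(z^{k}_{js}-\bar{z}^{k}_{j}\right)^{2}}}$$ where $z^{k}_{is}$ is the standardized expression value of the $i$-th renumbered gene in the $s$-th sample of the $k$-th group, and $\bar{z}^{k}_{i}$ is its mean over the $n_k$ samples of that group.

[0174] Clearly, the correlation coefficient $r^{k}_{ij}$ has two properties:

[0175] $$r^{k}_{ii}=1$$

[0176] $$r^{k}_{ij}=r^{k}_{ji}$$

[0177] Then $R^{k}_{Mul}$ has the following form:

[0178] $$R^{k}_{Mul}=\begin{pmatrix}1 & r^{k}_{12} & \cdots & r^{k}_{1M}\\ r^{k}_{12} & 1 & \cdots & r^{k}_{2M}\\ \vdots & \vdots & \ddots & \vdots\\ r^{k}_{1M} & r^{k}_{2M} & \cdots & 1\end{pmatrix}$$ where $M$ is the number of renumbered genes.

[0179] This yields $R^{1}_{Mul}, R^{2}_{Mul}, \ldots, R^{K}_{Mul}$, a total of $K$ correlation coefficient matrices;

[0180] Let $D^{k}_{ii}$ denote the determinant of the matrix that remains after removing from $R^{k}_{Mul}$ the row and column containing the item $r^{k}_{ii}$. The multiple correlation coefficient $\mu^{k}_{i}$ of the single gene $i$ in the $k$-th group is calculated as: $\mu^{k}_{i}=\sqrt{1-\dfrac{\det\left(R^{k}_{Mul}\right)}{D^{k}_{ii}}}$;

[0181] "Mul" and the other marks above are all subscripts. $\det(\cdot)$ indicates that the determinant of the matrix within the parentheses is calculated.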

[0182] The determinant is calculated using the `det()` function of the `np.linalg` module in the NumPy library, and the square root is calculated using `np.sqrt()`. The multiple correlation coefficient is calculated by combining Python dictionaries, lists, tuples, and `for` and `if` statements (see Appendix 4 for details).
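A minimal sketch of this calculation is given below. It assumes the standard determinant form of the multiple correlation coefficient, $\sqrt{1-\det(R)/\det(R_{ii})}$, where $R_{ii}$ is $R$ with row and column $i$ removed, and applies it to the `mul_R_C1` matrix reported in the Output Results. The function name is hypothetical and the code is not the Appendix 4 implementation.

```python
import numpy as np
import pandas as pd

genes = ["G5", "G10", "G2", "G4"]
# 'mul_R_C1' as printed in the output (values rounded to six decimals).
mul_R_C1 = pd.DataFrame(
    [[1.000000, 0.156379, -0.610449, -0.372394],
     [0.156379, 1.000000, 0.096428, -0.071025],
     [-0.610449, 0.096428, 1.000000, 0.274025],
     [-0.372394, -0.071025, 0.274025, 1.000000]],
    index=genes, columns=genes)


def multiple_correlations(corr: pd.DataFrame) -> dict:
    """Multiple correlation of each gene with all remaining genes."""
    values = corr.to_numpy(dtype=float)
    full_det = np.linalg.det(values)
    result = {}
    for i, gene in enumerate(corr.index):
        # Remove row i and column i, then take the determinant of the rest.
        reduced = np.delete(np.delete(values, i, axis=0), i, axis=1)
        result[gene] = float(np.sqrt(1.0 - full_det / np.linalg.det(reduced)))
    return result


miu_C1 = multiple_correlations(mul_R_C1)
print(miu_C1)
```

Up to the rounding of the printed matrix, the values agree with the `miu_C1_genes` dictionary in the Output Results, for example about 0.676 for G5 and about 0.290 for G10.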

[0183] VII. Changes in the multiple correlation coefficient of marker genes

[0184] For each gene, calculate the pairwise differences of its multiple correlation coefficients between the different groups, and take the absolute value of each difference. Set a value for the number of differences to keep, and for each gene retain the differences with the largest absolute values, up to that number. If a retained difference is greater than 0, record the two groups being compared in its subscript and add it to the positive set; if a retained difference is less than 0, record the two groups being compared in its subscript and add it to the negative set.

[0185] In this way, the genes with the largest changes in the multiple correlation coefficient are selected according to the absolute value of the difference and labeled as described above. In the code, the two sets described above are referred to as `Wmcorr_pos` and `Wmcorr_neg`, respectively.

[0186] In this example, the value is set to 2 (see Appendix 4 for details).
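The labeling rule can be sketched as follows, using the multiple correlation coefficients reported in the Output Results and keeping the two largest absolute differences per gene. The function name `mark_changes` and the parameter name `top_w` are hypothetical; the key format `C1subC2` follows the reported output.

```python
from itertools import combinations

# Multiple correlation coefficients copied from the reported output.
miu = {
    "C1": {"G5": 0.6755572695097039, "G10": 0.2898321473209162,
           "G2": 0.6428100133791989, "G4": 0.37809113356040014},
    "C2": {"G5": 0.6769710983052993, "G10": 0.2337674056236672,
           "G2": 0.5261406355846606, "G4": 0.5589998934381061},
    "C3": {"G5": 0.6220146513887393, "G10": 0.5919009623087049,
           "G2": 0.7151773735432908, "G4": 0.7021106229456662},
}


def mark_changes(miu_by_group, top_w):
    """Keep, per gene, the top_w largest |differences| and split them by sign."""
    groups = list(miu_by_group)
    genes = list(miu_by_group[groups[0]])
    pos = {gene: {} for gene in genes}
    neg = {gene: {} for gene in genes}
    for gene in genes:
        # Difference of the gene's coefficient for every pair of groups.
        diffs = {f"{a}sub{b}": miu_by_group[a][gene] - miu_by_group[b][gene]
                 for a, b in combinations(groups, 2)}
        ranked = sorted(diffs.items(), key=lambda kv: abs(kv[1]), reverse=True)
        for name, value in ranked[:top_w]:
            if value > 0:
                pos[gene][name] = value
            elif value < 0:
                neg[gene][name] = value
    return pos, neg


wm_pos, wm_neg = mark_changes(miu, top_w=2)
print(wm_pos)
print(wm_neg)
```

For G2, for instance, the two largest changes are C2 versus C3 (negative) and C1 versus C2 (positive), so the gene appears in both dictionaries, as it does in the reported `Wmcorr_pos` and `Wmcorr_neg`.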

[0187] VIII. Output Results

[0188] Data Matrix

[0189] C1 C1 C1 ... C3 C3 C3

[0190] G1 0.680982 0.225255 0.656386 ... 0.200613 0.147603 0.709164

[0191] G2 0.960227 0.816656 0.134203 ... 0.314901 0.283624 0.753496

[0192] G3 0.994845 0.925523 0.488861 ... 0.417790 0.582145 0.856230

[0193] G4 0.478150 0.622302 0.135423 ... 0.384658 0.335714 0.008523

[0194] G5 0.654489 0.637569 0.907528 ... 0.991896 0.962021 0.922300

[0195] G6 0.399502 0.006074 0.518035 ... 0.293167 0.698163 0.846190

[0196] G7 0.321540 0.335946 0.100564 ... 0.679930 0.306557 0.297414

[0197] G8 0.553802 0.294821 0.826319 ... 0.029026 0.982559 0.864068

[0198] G9 0.620114 0.861853 0.625861 ... 0.000017 0.160131 0.968738

[0199] G10 0.829668 0.879815 0.625071 ... 0.290687 0.048892 0.868403

[0200] [10 rows x 27 columns]

[0201] z-score normalized data matrix

[0202] C1 C1 C1 ... C3 C3 C3

[0203] G1 1.154326 -0.785713 1.049622 ... -0.890613 -1.116278 1.274300

[0204] G2 1.685152 1.209049 -1.054069 ... -0.454846 -0.558567 0.999601

[0205] G3 1.694778 1.447628 -0.109183 ... -0.362570 0.223399 1.200582

[0206] G4 -0.182067 0.281412 -1.284003 ... -0.482662 -0.640025 -1.692010

[0207] G5 0.499656 0.447277 1.282981 ... 1.544157 1.451672 1.328711

[0208] G6 -0.036523 -1.427910 0.382676 ... -0.412584 1.019712 1.543223

[0209] G7 -0.474017 -0.427686 -1.184675 ... 0.678565 -0.522202 -0.551605

[0210] G8 0.057993 -0.768322 0.927493 ... -1.616378 1.425999 1.047937

[0211] G9 0.349620 1.093431 0.367305 ... -1.558368 -1.065709 1.422309

[0212] G10 1.088227 1.253745 0.412925 ... -0.690760 -1.488841 1.216078

[0213] [10 rows x 27 columns]

[0214] Data matrix selected by F-value

[0215] C1 C1 C1 ... C3 C3 C3

[0216] G2 1.685152 1.209049 -1.054069 ... -0.454846 -0.558567 0.999601

[0217] G4 -0.182067 0.281412 -1.284003 ... -0.482662 -0.640025 -1.692010

[0218] G5 0.499656 0.447277 1.282981 ... 1.544157 1.451672 1.328711

[0219] G10 1.088227 1.253745 0.412925 ... -0.690760 -1.488841 1.216078

[0220] [4 rows x 27 columns]

[0221] Correlation coefficient matrix

[0222] {'R_C1': G2 G4 G5 G10

[0223] G2 1.000000 0.274025 -0.610449 0.096428

[0224] G4 0.274025 1.000000 -0.372394 -0.071025

[0225] G5 -0.610449 -0.372394 1.000000 0.156379

[0226] G10 0.096428 -0.071025 0.156379 1.000000, 'R_C2': G2 G4 G5 G10

[0227] G2 1.000000 0.243180 -0.515472 -0.021548

[0228] G4 0.243180 1.000000 -0.545216 0.200719

[0229] G5 -0.515472 -0.545216 1.000000 -0.151341

[0230] G10 -0.021548 0.200719 -0.151341 1.000000, 'R_C3': G2 G4 G5 G10

[0231] G2 1.000000 -0.646828 0.162927 0.048848

[0232] G4 -0.646828 1.000000 0.058486 0.061938

[0233] G5 0.162927 0.058486 1.000000 -0.521304

[0234] G10 0.048848 0.061938 -0.521304 1.000000}

[0235] Hcorr_pos

[0236] {'C1subC2_Hcorr_pos': {('G5', 'G10'): 0.3077205435428334}, 'C1subC3_Hcorr_pos': {('G2', 'G4'): 0.9208531277714334}, 'C2subC3_Hcorr_pos': {('G2','G4'): 0.8900073149797143}}

[0237] Hcorr_neg

[0238] {'C1subC2_Hcorr_neg': {('G4', 'G10'): -0.2717432857952358}, 'C1subC3_Hcorr_neg': {('G2', 'G5'): -0.7733758507760573}, 'C2subC3_Hcorr_neg': {('G2','G5'): -0.6783988489241518}}

[0239] Hpcorr_pos

[0240] {'C1subC2_Hpcorr_pos': {('G5', 'G10'): 0.35399634516138545, ('G4', 'G10'): 0.16812501779498043}, 'C1subC3_Hpcorr_pos': {('G2', 'G4'):0.7655991993785126, ('G2', 'G5'): 0.9820777215580473}, 'C2subC3_Hpcorr_pos':{('G2', 'G4'): 0.6599655276925556, ('G2', 'G5'): 0.8618637407553214}}

[0241] Hpcorr_neg

[0242] {'C1subC2_Hpcorr_neg': {}, 'C1subC3_Hpcorr_neg': {}, 'C2subC3_Hpcorr_neg': {}}

[0243] Data matrix selected by F-value

[0244] C1 C1 C1 ... C3 C3 C3

[0245] G2 1.685152 1.209049 -1.054069 ... -0.454846 -0.558567 0.999601

[0246] G4 -0.182067 0.281412 -1.284003 ... -0.482662 -0.640025 -1.692010

[0247] G5 0.499656 0.447277 1.282981 ... 1.544157 1.451672 1.328711

[0248] G10 1.088227 1.253745 0.412925 ... -0.690760 -1.488841 1.216078

[0249] [4 rows x 27 columns]

[0250] Correlation coefficient matrix

[0251] {'R_C1': G2 G4 G5 G10

[0252] G2 1.000000 0.274025 -0.610449 0.096428

[0253] G4 0.274025 1.000000 -0.372394 -0.071025

[0254] G5 -0.610449 -0.372394 1.000000 0.156379

[0255] G10 0.096428 -0.071025 0.156379 1.000000, 'R_C2': G2 G4 G5 G10

[0256] G2 1.000000 0.243180 -0.515472 -0.021548

[0257] G4 0.243180 1.000000 -0.545216 0.200719

[0258] G5 -0.515472 -0.545216 1.000000 -0.151341

[0259] G10 -0.021548 0.200719 -0.151341 1.000000, 'R_C3': G2 G4 G5 G10

[0260] G2 1.000000 -0.646828 0.162927 0.048848

[0261] G4 -0.646828 1.000000 0.058486 0.061938

[0262] G5 0.162927 0.058486 1.000000 -0.521304

[0263] G10 0.048848 0.061938 -0.521304 1.000000}

[0264] Lcorr_pos

[0265] {'C1dotC2_Lcorr_pos': {('G2', 'G5'): 0.3146689577461114, ('G4', 'G5'): 0.20303523461364617}, 'C1dotC3_Lcorr_pos': {}, 'C2dotC3_Lcorr_pos': {}}

[0266] Lcorr_neg

[0267] {'C1dotC2_Lcorr_pos': {('G2', 'G5'): 0.3146689577461114, ('G4', 'G5'): 0.20303523461364617}, 'C1dotC3_Lcorr_pos': {}, 'C2dotC3_Lcorr_pos': {}}

[0268] Lpcorr_pos

[0269] {'C1dotC2_Lpcorr_pos': {('G2', 'G5'): 0.2844250234762425, ('G4', 'G5'): 0.12225511672689861}, 'C1dotC3_Lpcorr_pos': {}, 'C2dotC3_Lpcorr_pos': {}}

[0270] Lpcorr_neg

[0271] {'C1dotC2_Lpcorr_neg': {}, 'C1dotC3_Lpcorr_neg': {}, 'C2dotC3_Lpcorr_neg': {}}

[0272] Correlation coefficient matrix

[0273] {'mul_R_C1': G5 G10 G2 G4

[0274] G5 1.000000 0.156379 -0.610449 -0.372394

[0275] G10 0.156379 1.000000 0.096428 -0.071025

[0276] G2 -0.610449 0.096428 1.000000 0.274025

[0277] G4 -0.372394 -0.071025 0.274025 1.000000, 'mul_R_C2': G5 G10 G2 G4

[0278] G5 1.000000 -0.151341 -0.515472 -0.545216

[0279] G10 -0.151341 1.000000 -0.021548 0.200719

[0280] G2 -0.515472 -0.021548 1.000000 0.243180

[0281] G4 -0.545216 0.200719 0.243180 1.000000, 'mul_R_C3': G5 G10 G2 G4

[0282] G5 1.000000 -0.521304 0.162927 0.058486

[0283] G10 -0.521304 1.000000 0.048848 0.061938

[0284] G2 0.162927 0.048848 1.000000 -0.646828

[0285] G4 0.058486 0.061938 -0.646828 1.000000}

[0286] Multiple correlation coefficient

[0287] miu_C1_genes

[0288] {'G5': 0.6755572695097039, 'G10': 0.2898321473209162, 'G2': 0.6428100133791989, 'G4': 0.37809113356040014}

[0289] miu_C2_genes

[0290] {'G5': 0.6769710983052993, 'G10': 0.2337674056236672, 'G2': 0.5261406355846606, 'G4': 0.5589998934381061}

[0291] miu_C3_genes

[0292] {'G5': 0.6220146513887393, 'G10': 0.5919009623087049, 'G2': 0.7151773735432908, 'G4': 0.7021106229456662}

[0293] Wmcorr_pos

[0294] {'G5': {'C2subC3': 0.05495644691655999, 'C1subC3':0.0535426181209645}, 'G10': {}, 'G2': {'C1subC2': 0.11666937779453823}, 'G4':{}}

[0295] Wmcorr_neg

[0296] {'G5': {}, 'G10': {'C2subC3': -0.3581335566850377, 'C1subC3': -0.3020688149877887}, 'G2': {'C2subC3': -0.18903673795863019}, 'G4': {'C1subC3': -0.32401948938526604, 'C1subC2': -0.18090875987770594}}

[0297] IX. Comparison of Existing Technologies

[0298] 1. NetworkAnalyst

[0299] To demonstrate the advantages of the technology designed in this invention over traditional bioinformatics analysis techniques, the NetworkAnalyst tool developed by McGill University and the University of British Columbia (reference PMID: 30931480; website https://www.networkanalyst.ca/) was selected to analyze the random data matrix generated by the code in Appendix 4.

[0300] ① Using the Gene Expression Table tool, in the "Upload your gene expression table" box of the "Upload Data" section, set the Data type to "Microarray data (intensities)" and upload the data from the "Example 1.xlsx" file generated in the code (packaged as a .txt file) according to the tool's requirements. Leave other options as default, and then click the "Submit" and "Proceed" buttons in sequence. The uploaded text file content is as follows:

[0301] #Sample"S1,1""S1,2""S1,3""S1,4""S1,5""S1,6""S1,7""S1,8""S1,9""S2,1""S2,2""S2,3""S2,4 ""S2,5""S2,6""S2,7""S2,8""S2,9""S3,1""S3,2""S3,3""S3,4""S3,5""S3,6""S3,7""S3,8""S3,9"

[0302] #CLASSC1C1C1C1C1C1C1C1C1C2C2C2C2C2C2C2C2C2C3C3C3C3C3C3C3C3C3

[0303] G11.154325836-0.7857126181.049621980.3772990530.3065423420.10462347-0.2704747050.461786452-1.113199536-1.307171417-1.4062237721.7872665650.2610359760.6421030080.968430156-1.4458170030.5266307851.049190361-0.992704216-0.95851938-1.1841575680.6144183681.600388958-0.707090978-0.890613469-1.1162782331.274299584

[0304] G21.6851519321.209048866-1.054069255-1.2464985241.808257897-0.7964622311.5420596731.3421008480.2393541890.1803053610.2530549390.3615366261.5375249930.4077518390.030322462-0.693300879-1.044654882-1.395924706-0.256479593-0.569454349-0.756118593-1.376842902-1.062794032-0.330058043-0.454846307-0.5585666890.999601359

[0305] G31.6947776881.447628296-0.109182908-0.335453506-1.118952521-0.819194078-1.31 1440255-0.292110128-1.552076421.566554657-0.8516931931.4147420350.708330748-0. 7334448640.269141759-1.1007956220.679737308-1.4103049311.5268460610.383257386-0.4293298990.3782493780.105914113-1.172611724-0.3625695020.223398621.200581501

[0306] G4-0.1820667150.28141164-1.2840029350.769455459-0.270646174-0.6053105031.4497930410.1325804521.1002440081.0819240041.124541596-0.4185711541.0409279811.2816 65567-1.1833015770.252364891.476931372-0.572744462-0.078942178-1.624825937-1.5190819641.4275471720.157273265-1.022469394-0.482662481-0.640025005-1.692009966

[0307] G50.4996562250.4472773681.2829805810.046887353-1.5138783520.424891141-0.7927 4073-0.940699290.376222502-0.08194388-0.827338829-0.717297548-1.430941626-1.3 797344111.0475682380.1001864-0.012434915-0.404106642-1.481201361-1.0568324390.0616696531.418449135-0.8308618391.4396825081.5441571981.4516724371.328711125

[0308] G6-0.0365231-1.4279096060.3826757270.6828480981.887319952-1.206584566-0.72734 8717-0.076937287-0.572420974-0.023295823-1.2672612021.613589109-0.878723464-0. 911712005-0.8068040230.770761166-0.7380184030.6876506872.012931590.575145281-0.5777756210.202423868-1.391924949-0.322457307-0.4125840631.019712391.543223241

[0309] G7-0.474017452-0.427685862-1.1846746070.516468992-0.7751030160.8383691620.9618479850.704642044-0.9088162081.2772189790.0792289281.542774082-0.537358606-1.31 1351257-1.4135908091.4627787970.6676735680.754347246-1.4531468860.883902933-1.4692920950.1873169691.643195424-1.1694859830.678564891-0.522202424-0.551604795

[0310] G80.057993233-0.7683218190.927493123-1.2885006780.5342843151.4606011621.35781 2171-0.098349429-0.633140478-0.658822941-0.297183368-0.5327548861.2503223040.6 12641185-1.4115284360.0032740141.4660486620.319068737-1.2923635611.0024142970.159260275-1.504621203-0.339796429-1.183388096-1.6163777151.4259988231.04793674

[0311] G90.3496200581.0934314450.3673047031.359842892-1.2348606490.8706180140.407400 843-0.3559739250.7858165781.183292947-0.96173598-0.7685515881.051989498-0.9171 67781-0.116308671-1.5085871220.692240886-1.235473492-0.325735319-1.247997631-0.8155406461.4258217631.179637931-0.077316509-1.55836771-1.0657092341.422308698

[0312] G101.0882269221.2537454710.4129251250.95304361.383236727-1.146118135-1.03160 9472-0.954554280.777714884-1.564987517-0.70620249-0.01164374-1.1593852871.204 999515 -0.329130293 -0.683270881 0.486986222 -1.310259193 0.433231012 0.999997205 0.37016408 0.811056918 1.042657564 -1.357301212 -0.690759683 -1.488840696 1.216077635

[0313] ② In the “Quality Check” step, click the “Proceed” button.

[0314] ③ In the “Normalization” step, click the “Submit” and “Proceed” buttons in sequence according to the default options.

[0315] ④ In the “Differential Analysis” step, click “Pairwise comparisons” in the “Comparison of Interest” option bar, and then click the “Submit” and “Proceed” buttons in sequence. In this step, according to the tool manual, the “Pairwise comparisons” comparison method is consistent with the F-value calculation method of this invention and is comparable.

[0316] ⑤ In the “Sig. Genes” step, genes with differential expression were obtained based on the default selection parameters (i.e., “Adjusted p-value” set to 0.05 and “Log2 fold change” set to 1.0). The results showed that 0 genes showed differential expression in the three “Selected Comparison” data groups (C1.C2, C1.C3, and C2.C3), making further differential expression analysis impossible. In this step, “Adjusted p-value” was set to 0.05 and “Log2 fold change” was set to 1.0, which is the minimum acceptable requirement in mainstream bioinformatics analysis (i.e., “Adjusted p-value” cannot be greater than 0.05 and “Log2 fold change” cannot be less than 1.0).

[0317] Therefore, the gene difference analysis method designed in this invention is more suitable for the selection of genes in correlation analysis.

[0318] 2. ImageGP

[0319] To demonstrate the advantages of the technology designed in this invention over traditional bioinformatics analysis techniques, the ImageGP tool from EHB Biotechnology Co., Ltd. (website http://www.ehbio.com/Cloud_Platform/front/#/) was used to perform PCA analysis on the random data matrix generated by the code in Appendix 4.

[0320] ① Click the "PCA analysis" tool on the webpage. With other options at their default settings, paste the data in the "Paste datamatrix to text area" box in the following format:

[0321] SampleS1,1S1,2S1,3S1,4S1,5S1,6S1,7S1,8S1,9S2,1S2,2S2,3S2,4S2,5S2,6S2,7S2,8S2,9S3,1S3,2S3,3S3,4S3,5S3,6S3,7S3,8S3,9

[0322] G11.154325836-0.7857126181.049621980.3772990530.3065423420.10462347-0.2704747050.461786452-1.113199536-1.307171417-1.4062237721.7872665650.2610359760.6421 030080.968430156-1.4458170030.5266307851.049190361-0.992704216-0.95851938-1.1841575680.6144183681.600388958-0.707090978-0.890613469-1.1162782331.274299584

[0323] G21.6851519321.209048866-1.054069255-1.2464985241.808257897-0.7964622311.54205 96731.3421008480.2393541890.1803053610.2530549390.3615366261.5375249930.407751 8390.030322462-0.693300879-1.044654882-1.395924706-0.256479593-0.569454349-0.756118593-1.376842902-1.062794032-0.330058043-0.454846307-0.5585666890.999601359

[0324] G31.6947776881.447628296-0.109182908-0.335453506-1.118952521-0.819194078-1.31 1440255-0.292110128-1.552076421.566554657-0.8516931931.4147420350.708330748-0. 7334448640.269141759-1.1007956220.679737308-1.4103049311.5268460610.383257386-0.4293298990.3782493780.105914113-1.172611724-0.3625695020.223398621.200581501

[0325] G4-0.1820667150.28141164-1.2840029350.769455459-0.270646174-0.6053105031.4497930410.1325804521.1002440081.0819240041.124541596-0.4185711541.0409279811.2816 65567-1.1833015770.252364891.476931372-0.572744462-0.078942178-1.624825937-1.5190819641.4275471720.157273265-1.022469394-0.482662481-0.640025005-1.692009966

[0326] G50.4996562250.4472773681.2829805810.046887353-1.5138783520.424891141-0.7927 4073-0.940699290.376222502-0.08194388-0.827338829-0.717297548-1.430941626-1.3 797344111.0475682380.1001864-0.012434915-0.404106642-1.481201361-1.0568324390.0616696531.418449135-0.8308618391.4396825081.5441571981.4516724371.328711125

[0327] G6-0.0365231-1.4279096060.3826757270.6828480981.887319952-1.206584566-0.72734 8717-0.076937287-0.572420974-0.023295823-1.2672612021.613589109-0.878723464-0. 911712005-0.8068040230.770761166-0.7380184030.6876506872.012931590.575145281-0.5777756210.202423868-1.391924949-0.322457307-0.4125840631.019712391.543223241

[0328] G7-0.474017452-0.427685862-1.1846746070.516468992-0.7751030160.8383691620.9618479850.704642044-0.9088162081.2772189790.0792289281.542774082-0.537358606-1.31 1351257-1.4135908091.4627787970.6676735680.754347246-1.4531468860.883902933-1.4692920950.1873169691.643195424-1.1694859830.678564891-0.522202424-0.551604795

[0329] G80.057993233-0.7683218190.927493123-1.2885006780.5342843151.4606011621.35781 2171-0.098349429-0.633140478-0.658822941-0.297183368-0.5327548861.2503223040.6 12641185-1.4115284360.0032740141.4660486620.319068737-1.2923635611.0024142970.159260275-1.504621203-0.339796429-1.183388096-1.6163777151.4259988231.04793674

[0330] G90.3496200581.0934314450.3673047031.359842892-1.2348606490.8706180140.407400 843-0.3559739250.7858165781.183292947-0.96173598-0.7685515881.051989498-0.9171 67781-0.116308671-1.5085871220.692240886-1.235473492-0.325735319-1.247997631-0.8155406461.4258217631.179637931-0.077316509-1.55836771-1.0657092341.422308698

[0331] G101.0882269221.2537454710.4129251250.95304361.383236727-1.146118135-1.03160 9472-0.954554280.777714884-1.564987517-0.70620249-0.01164374-1.1593852871.204 999515 -0.329130293 -0.683270881 0.486986222 -1.310259193 0.433231012 0.999997205 0.37016408 0.811056918 1.042657564 -1.357301212 -0.690759683 -1.488840696 1.216077635

[0332] In the "Paste PhenoData (first column must match first column of datamatrix)" box, paste the grouping information in the following format:

[0333] Sample CLASS

[0334] S1,1 C1

[0335] S1,2 C1

[0336] S1,3 C1

[0337] S1,4 C1

[0338] S1,5 C1

[0339] S1,6 C1

[0340] S1,7 C1

[0341] S1,8 C1

[0342] S1,9 C1

[0343] S2,1 C2

[0344] S2,2 C2

[0345] S2,3 C2

[0346] S2,4 C2

[0347] S2,5 C2

[0348] S2,6 C2

[0349] S2,7 C2

[0350] S2,8 C2

[0351] S2,9 C2

[0352] S3,1 C3

[0353] S3,2 C3

[0354] S3,3 C3

[0355] S3,4 C3

[0356] S3,5 C3

[0357] S3,6 C3

[0358] S3,7 C3

[0359] S3,8 C3

[0360] S3,9 C3

[0361] Then, in the "Essential parameters" section, set "Point Shape variable" to "CLASS", and click the "Submit" button at the bottom of the tool to obtain the PCA image shown in Figure 1. In the scatter plot of Figure 1, "●", "▲", and "■" represent the samples of groups C1, C2, and C3, respectively. The x-axis is PC1 and the y-axis is PC2, the two principal components obtained by PCA dimensionality reduction. The clustering of the points in the scatter plot describes the overall differences in gene expression among the samples of each group (see Machine Learning, by Zhou Zhihua, Tsinghua University Press, pp. 229-232).

[0362] PCA dimensionality reduction captures the overall expression pattern, but since the data after dimensionality reduction contain only the principal components of the gene expression basis vectors, individual genes cannot be analyzed further.
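The point about principal components can be made concrete with a small sketch. It runs PCA through a singular value decomposition in NumPy on illustrative random data of the same shape as the example (27 samples, 10 genes); the seed and variable names are assumptions, and this is not the procedure used by ImageGP.

```python
import numpy as np

rng = np.random.default_rng(0)   # illustrative random data, fixed seed
X = rng.random((27, 10))         # 27 samples x 10 genes
Xc = X - X.mean(axis=0)          # center each gene

# PCA via singular value decomposition: rows of Vt are the principal axes.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:2].T           # sample coordinates on PC1 and PC2
loadings = Vt[:2]                # weight of every gene in PC1 and PC2

print(scores.shape)    # (27, 2)
print(loadings.shape)  # (2, 10)
```

Each of the two plotted coordinates is a weighted sum over all ten genes, so a point in the PC1/PC2 scatter plot does not by itself single out any particular gene or gene pair.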

[0363] Appendix 1-1

[0364]

[0365] Appendix 1-2

[0366]

[0367] Appendix 2-1

[0368]

[0369] Appendix 2-2

[0370]

[0371] Appendix 3-1

[0372]

[0373] Appendix 3-2

[0374]

[0375] Appendix 4

[0376] import numpy as np

[0377] import pandas as pd

[0378] import numpy.linalg as LA

[0379] from sklearn import preprocessing

[0380] from sklearn.feature_selection import SelectKBest

[0381] from sklearn.feature_selection import f_classif

[0382] #Set row and column names

[0383] genes = ['G1','G2','G3','G4','G5','G6','G7','G8','G9','G10']

[0384] groups = ['C1','C1','C1','C1','C1','C1','C1','C1','C1',

[0385] 'C2','C2','C2','C2','C2','C2','C2','C2','C2',

[0386] 'C3','C3','C3','C3','C3','C3','C3','C3','C3']

[0387] #Randomly generate gene expression data matrix

[0388] X = np.random.rand(10,27)

[0389] table_0 = pd.DataFrame(X,index=genes,columns=groups)

[0390] print('Data Matrix')

[0391] print(table_0)

[0392] #z-score standardization

[0393] X_zscore = preprocessing.scale(X,axis=1)

[0394] #Generate DataFrame

[0395] table_1 = pd.DataFrame(X_zscore,

[0396] index=genes,

[0397] columns=groups)

[0398] print('The data matrix normalized by z-score')

[0399] print(table_1)

[0400] #Export as Excel file

[0401] table_1.to_excel('Example 1.xlsx',sheet_name='Sheet1')

[0402] def FSelector(topF):

[0403]     #Construct a DataFrame from a data matrix

[0404]     table_0 = pd.DataFrame(X,

[0405]                            index=genes,

[0406]                            columns=groups)

[0407]     #z-score standardization

[0408]     X_zscore = preprocessing.scale(X,axis=1)

[0409]     #Generate DataFrame

[0410]     table_1 = pd.DataFrame(X_zscore,

[0411]                            index=genes,

[0412]                            columns=groups)

[0413]     #Select the top F genes with the largest F values based on the F-value.

[0414]     features = table_1.iloc[:,:].T

[0415]     labels = table_1.columns

[0416]     skb = SelectKBest(f_classif,k=topF)

[0417]     result = skb.fit_transform(features,labels)

[0418]     selected_genes_indexes = skb.get_support(indices=True)

[0419]     selected_genes = []

[0420]     for i in selected_genes_indexes:

[0421]         selected_genes.append(genes[i])

[0422]     #Regenerate DataFrame

[0423]     table_2 = pd.DataFrame(result.T,

[0424]                            index=selected_genes,

[0425]                            columns=groups)

[0426]     return table_2

[0427] def CorrelationMatrix(topF):

[0428]     #Get the table selected by the F value

[0429]     table_2 = FSelector(topF)

[0430]     print('Data matrix selected by F value')

[0431]     print(table_2)

[0432]     # Divide table_2 into multiple tables by group

[0433]     table_2a = table_2.iloc[:,0:9]

[0434]     table_2b = table_2.iloc[:,9:18]

[0435]     table_2c = table_2.iloc[:,18:27]

[0436]     # Calculate the correlation coefficient matrix for each group separately.

[0437]     R_C1 = table_2a.T.corr()

[0438]     R_C2 = table_2b.T.corr()

[0439]     R_C3 = table_2c.T.corr()

[0440]     #Pack the correlation coefficient matrix into a dictionary

[0441]     CorMat = {}

[0442]     CorMat['R_C1']=R_C1

[0443]     CorMat['R_C2']=R_C2

[0444]     CorMat['R_C3']=R_C3

[0445]     print('correlation coefficient matrix')

[0446]     print(CorMat)

[0447]     return CorMat

[0448] def Substraction(topF, topH):

[0449]     # Obtain the correlation coefficient matrix

[0450]     CorMat = CorrelationMatrix(topF)

[0451]     R_C1 = CorMat['R_C1']

[0452]     R_C2 = CorMat['R_C2']

[0453]     R_C3 = CorMat['R_C3']

[0454]     #Preset dictionary of correlation coefficient difference matrix set

[0455]     Hcorr_neg = {}

[0456]     Hcorr_pos = {}

[0457]     Hpcorr_neg = {}

[0458]     Hpcorr_pos = {}

[0459]     #Calculate the correlation coefficient difference matrix

[0460]     R_C1subC2 = CorMat['R_C1']-CorMat['R_C2']

[0461]     R_C1subC3 = CorMat['R_C1']-CorMat['R_C3']

[0462]     R_C2subC3 = CorMat['R_C2']-CorMat['R_C3']

[0463]     #--------------------------------------------------------------------

[0464]     #Take the absolute value of the correlation coefficient difference matrix

[0465]     R_C1subC2_abs = abs(R_C1subC2)

[0466]     R_C1subC3_abs = abs(R_C1subC3)

[0467]     R_C2subC3_abs = abs(R_C2subC3)

[0468]     #Forming an upper triangular matrix

[0469]     np_upTriangle = np.triu(np.ones((topF,topF)))

[0470]     # Multiply the correlation coefficient difference matrix with the upper triangular matrix

[0471]     R_C1subC2_upTri = pd.DataFrame(np.array(R_C1subC2)*np_upTriangle,

[0472]                                    index=R_C1subC2.columns,

[0473]                                    columns=R_C1subC2.columns)

[0474]     R_C1subC3_upTri = pd.DataFrame(np.array(R_C1subC3)*np_upTriangle,

[0475]                                    index=R_C1subC3.columns,

[0476]                                    columns=R_C1subC3.columns)

[0477]     R_C2subC3_upTri = pd.DataFrame(np.array(R_C2subC3)*np_upTriangle,

[0478]                                    index=R_C2subC3.columns,

[0479]                                    columns=R_C2subC3.columns)

[0480]     # Multiply the absolute value matrix of the correlation coefficient difference by the upper triangular matrix.

[0481]     R_C1subC2_abs_upTri = pd.DataFrame(np.array(R_C1subC2_abs)*np_upTriangle,

[0482]                                        index=R_C1subC2_abs.columns,

[0483]                                        columns=R_C1subC2_abs.columns)

[0484]     R_C1subC3_abs_upTri = pd.DataFrame(np.array(R_C1subC3_abs)*np_upTriangle,

[0485]                                        index=R_C1subC3_abs.columns,

[0486]                                        columns=R_C1subC3_abs.columns)

[0487]     R_C2subC3_abs_upTri = pd.DataFrame(np.array(R_C2subC3_abs)*np_upTriangle,

[0488]                                        index=R_C2subC3_abs.columns,

[0489]                                        columns=R_C2subC3_abs.columns)

[0490]     #Record the top H gene pairs with the largest absolute values of the correlation coefficient difference matrix using tuples.

[0491]     #And according to the sign of the value, put the gene pair and correlation coefficient value into the corresponding dictionary.

[0492]     C1subC2_Hcorr_pos = {}

[0493]     C1subC2_Hcorr_neg = {}

[0494]     R_C1subC2_abs_dict = {}

[0495]     R_C1subC2_abs_values = []

[0496]     R_C1subC2_abs_keys_sel = []

[0497]     for i in R_C1subC2_abs_upTri.index:

[0498]         for j in R_C1subC2_abs_upTri.columns:

[0499]             if R_C1subC2_abs_upTri.loc[i,j]>0:

[0500]                 R_C1subC2_abs_dict[(i,j)]=R_C1subC2_abs_upTri.loc[i,j]

[0501]     for v in R_C1subC2_abs_dict.values():

[0502]         R_C1subC2_abs_values.append(v)

[0503]     R_C1subC2_abs_values.sort(reverse=True)

[0504]     R_C1subC2_abs_values_sel = R_C1subC2_abs_values[0:topH] # The topH value can be set to different values in different groups as needed.

[0505]     for v in R_C1subC2_abs_values_sel:

[0506]         for k in R_C1subC2_abs_dict.keys():

[0507]             if v == R_C1subC2_abs_dict[k]:

[0508]                 R_C1subC2_abs_keys_sel.append(k)

[0509]     for k in R_C1subC2_abs_keys_sel:

[0510]         if R_C1subC2_upTri.loc[k[0],k[1]]>0:

[0511]             C1subC2_Hcorr_pos[k] = R_C1subC2_upTri.loc[k[0],k[1]]

[0512]         elif R_C1subC2_upTri.loc[k[0],k[1]]<0:

[0513]             C1subC2_Hcorr_neg[k] = R_C1subC2_upTri.loc[k[0],k[1]]

[0514]     C1subC3_Hcorr_pos = {}

[0515]     C1subC3_Hcorr_neg = {}

[0516]     R_C1subC3_abs_dict = {}

[0517]     R_C1subC3_abs_values = []

[0518]     R_C1subC3_abs_keys_sel = []

[0519]     for i in R_C1subC3_abs_upTri.index:

[0520]         for j in R_C1subC3_abs_upTri.columns:

[0521]             if R_C1subC3_abs_upTri.loc[i,j]>0:

[0522]                 R_C1subC3_abs_dict[(i,j)]=R_C1subC3_abs_upTri.loc[i,j]

[0523]     for v in R_C1subC3_abs_dict.values():

[0524]         R_C1subC3_abs_values.append(v)

[0525]     R_C1subC3_abs_values.sort(reverse=True)

[0526]     R_C1subC3_abs_values_sel = R_C1subC3_abs_values[0:topH] # The topH value can be set to different values in different groups as needed.

[0527]     for v in R_C1subC3_abs_values_sel:

[0528]         for k in R_C1subC3_abs_dict.keys():

[0529]             if v == R_C1subC3_abs_dict[k]:

[0530]                 R_C1subC3_abs_keys_sel.append(k)

[0531]     for k in R_C1subC3_abs_keys_sel:

[0532]         if R_C1subC3_upTri.loc[k[0],k[1]]>0:

[0533]             C1subC3_Hcorr_pos[k] = R_C1subC3_upTri.loc[k[0],k[1]]

[0534]         elif R_C1subC3_upTri.loc[k[0],k[1]]<0:

[0535]             C1subC3_Hcorr_neg[k] = R_C1subC3_upTri.loc[k[0],k[1]]

[0536] C2subC3_Hcorr_pos = {}

[0537] C2subC3_Hcorr_neg = {}

[0538] R_C2subC3_abs_dict = {}

[0539] R_C2subC3_abs_values = []

[0540] R_C2subC3_abs_keys_sel = []

[0541] for i in R_C2subC3_abs_upTri.index:

[0542] for j in R_C2subC3_abs_upTri.columns:

[0543] if R_C2subC3_abs_upTri.loc[i,j]>0:

[0544] R_C2subC3_abs_dict[(i,j)]=R_C2subC3_abs_upTri.loc[i,j]

[0545] for v in R_C2subC3_abs_dict.values():

[0546] R_C2subC3_abs_values.append(v)

[0547] R_C2subC3_abs_values.sort(reverse=True)

[0548] R_C2subC3_abs_values_sel = R_C2subC3_abs_values[0:topH] # The topH value can be set to different values in different groups as needed.

[0549] for v in R_C2subC3_abs_values_sel:

[0550] for k in R_C2subC3_abs_dict.keys():

[0551] if v == R_C2subC3_abs_dict[k]:

[0552] R_C2subC3_abs_keys_sel.append(k)

[0553] for k in R_C2subC3_abs_keys_sel:

[0554] if R_C2subC3_upTri.loc[k[0],k[1]]>0:

[0555] C2subC3_Hcorr_pos[k] = R_C2subC3_upTri.loc[k[0],k[1]]

[0556] elif R_C2subC3_upTri.loc[k[0],k[1]]<0:

[0557] C2subC3_Hcorr_neg[k] = R_C2subC3_upTri.loc[k[0],k[1]]

[0558] # Add to Hcorr collection

[0559] Hcorr_pos['C1subC2_Hcorr_pos']=C1subC2_Hcorr_pos

[0560] Hcorr_pos['C1subC3_Hcorr_pos']=C1subC3_Hcorr_pos

[0561] Hcorr_pos['C2subC3_Hcorr_pos']=C2subC3_Hcorr_pos

[0562] Hcorr_neg['C1subC2_Hcorr_neg']=C1subC2_Hcorr_neg

[0563] Hcorr_neg['C1subC3_Hcorr_neg']=C1subC3_Hcorr_neg

[0564] Hcorr_neg['C2subC3_Hcorr_neg']=C2subC3_Hcorr_neg

[0565] print('Hcorr_pos')

[0566] print(Hcorr_pos)

[0567] print('Hcorr_neg')

[0568] print(Hcorr_neg)

[0569] #--------------------------------------------------------------------

[0570] # Calculate the partial correlation coefficients of the gene pairs given by the selected keys, i.e. the pairs with the largest values in the absolute correlation coefficient difference matrix.

[0571] # Then, according to the sign of the difference, put the gene pair and its partial correlation coefficient difference into the corresponding dictionary.

[0572] C1subC2_Hpcorr_pos = {}

[0573] C1subC2_Hpcorr_neg = {}

[0574] for k in R_C1subC2_abs_keys_sel:

[0575] a = LA.det(np.array(R_C1.drop(k[0],axis=0).drop(k[1],axis=1)))

[0576] b = LA.det(np.array(R_C1.drop(k[0],axis=0).drop(k[0],axis=1)))

[0577] c = LA.det(np.array(R_C1.drop(k[1],axis=0).drop(k[1],axis=1)))

[0578] rho_C1 = a / np.sqrt(b*c)

[0579] a = LA.det(np.array(R_C2.drop(k[0],axis=0).drop(k[1],axis=1)))

[0580] b = LA.det(np.array(R_C2.drop(k[0],axis=0).drop(k[0],axis=1)))

[0581] c = LA.det(np.array(R_C2.drop(k[1],axis=0).drop(k[1],axis=1)))

[0582] rho_C2 = a / np.sqrt(b*c)

[0583] rho_C1subC2 = rho_C1-rho_C2

[0584] if rho_C1subC2 > 0:

[0585] C1subC2_Hpcorr_pos[k] = rho_C1subC2

[0586] elif rho_C1subC2 < 0:

[0587] C1subC2_Hpcorr_neg[k] = rho_C1subC2

[0588] C1subC3_Hpcorr_pos = {}

[0589] C1subC3_Hpcorr_neg = {}

[0590] for k in R_C1subC3_abs_keys_sel:

[0591] a = LA.det(np.array(R_C1.drop(k[0],axis=0).drop(k[1],axis=1)))

[0592] b = LA.det(np.array(R_C1.drop(k[0],axis=0).drop(k[0],axis=1)))

[0593] c = LA.det(np.array(R_C1.drop(k[1],axis=0).drop(k[1],axis=1)))

[0594] rho_C1 = a / np.sqrt(b*c)

[0595] a = LA.det(np.array(R_C3.drop(k[0],axis=0).drop(k[1],axis=1)))

[0596] b = LA.det(np.array(R_C3.drop(k[0],axis=0).drop(k[0],axis=1)))

[0597] c = LA.det(np.array(R_C3.drop(k[1],axis=0).drop(k[1],axis=1)))

[0598] rho_C3 = a / np.sqrt(b*c)

[0599] rho_C1subC3 = rho_C1-rho_C3

[0600] if rho_C1subC3 > 0:

[0601] C1subC3_Hpcorr_pos[k] = rho_C1subC3

[0602] elif rho_C1subC3 < 0:

[0603] C1subC3_Hpcorr_neg[k] = rho_C1subC3

[0604] C2subC3_Hpcorr_pos = {}

[0605] C2subC3_Hpcorr_neg = {}

[0606] for k in R_C2subC3_abs_keys_sel:

[0607] a = LA.det(np.array(R_C2.drop(k[0],axis=0).drop(k[1],axis=1)))

[0608] b = LA.det(np.array(R_C2.drop(k[0],axis=0).drop(k[0],axis=1)))

[0609] c = LA.det(np.array(R_C2.drop(k[1],axis=0).drop(k[1],axis=1)))

[0610] rho_C2 = a / np.sqrt(b*c)

[0611] a = LA.det(np.array(R_C3.drop(k[0],axis=0).drop(k[1],axis=1)))

[0612] b = LA.det(np.array(R_C3.drop(k[0],axis=0).drop(k[0],axis=1)))

[0613] c = LA.det(np.array(R_C3.drop(k[1],axis=0).drop(k[1],axis=1)))

[0614] rho_C3 = a / np.sqrt(b*c)

[0615] rho_C2subC3 = rho_C2-rho_C3

[0616] if rho_C2subC3 > 0:

[0617] C2subC3_Hpcorr_pos[k] = rho_C2subC3

[0618] elif rho_C2subC3 < 0:

[0619] C2subC3_Hpcorr_neg[k] = rho_C2subC3

[0620] # Put into the Hpcorr set

[0621] Hpcorr_pos['C1subC2_Hpcorr_pos'] = C1subC2_Hpcorr_pos

[0622] Hpcorr_pos['C1subC3_Hpcorr_pos'] = C1subC3_Hpcorr_pos

[0623] Hpcorr_pos['C2subC3_Hpcorr_pos'] = C2subC3_Hpcorr_pos

[0624] Hpcorr_neg['C1subC2_Hpcorr_neg'] = C1subC2_Hpcorr_neg

[0625] Hpcorr_neg['C1subC3_Hpcorr_neg'] = C1subC3_Hpcorr_neg

[0626] Hpcorr_neg['C2subC3_Hpcorr_neg'] = C2subC3_Hpcorr_neg

[0627] print('Hpcorr_pos')

[0628] print(Hpcorr_pos)

[0629] print('Hpcorr_neg')

[0630] print(Hpcorr_neg)

[0631] return [Hcorr_pos, Hcorr_neg, Hpcorr_pos, Hpcorr_neg]
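The selection logic in Substraction repeats the same pattern for each pair of groups. As a minimal sketch, and not the listing's own implementation, that pattern can be factored into one helper. The function name `top_h_difference_pairs` and the toy matrices are hypothetical. Unlike the listing, which matches keys back by value and may therefore collect extra pairs when several differences tie, this version returns at most `top_h` pairs.

```python
import numpy as np
import pandas as pd


def top_h_difference_pairs(r_a, r_b, top_h):
    """Split the top_h largest |r_a - r_b| gene pairs by the sign of the difference."""
    diff = r_a - r_b
    # Keep each unordered pair once: strictly upper triangle, diagonal excluded.
    rows, cols = np.triu_indices(len(diff), k=1)
    values = diff.to_numpy()[rows, cols]
    # Positions of the top_h largest absolute differences (stable order on ties).
    order = np.argsort(-np.abs(values), kind="stable")[:top_h]
    pos, neg = {}, {}
    for idx in order:
        pair = (diff.index[rows[idx]], diff.columns[cols[idx]])
        value = float(values[idx])
        if value > 0:
            pos[pair] = value
        elif value < 0:
            neg[pair] = value
    return pos, neg


# Hypothetical toy correlation matrices for two groups of three genes.
genes = ["g1", "g2", "g3"]
r_c1 = pd.DataFrame([[1.0, 0.9, 0.1], [0.9, 1.0, -0.2], [0.1, -0.2, 1.0]],
                    index=genes, columns=genes)
r_c2 = pd.DataFrame([[1.0, 0.2, 0.6], [0.2, 1.0, -0.1], [0.6, -0.1, 1.0]],
                    index=genes, columns=genes)
pos, neg = top_h_difference_pairs(r_c1, r_c2, top_h=2)
print(pos)
print(neg)
```

Calling the helper once per group pair (C1 and C2, C1 and C3, C2 and C3) would give the contents of the Hcorr_pos and Hcorr_neg dictionaries under these assumptions.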

[0632] def Dotproduct(topF, topL):

[0633] # Obtain the correlation coefficient matrix

[0634] CorMat = CorrelationMatrix(topF)

[0635] R_C1 = CorMat['R_C1']

[0636] R_C2 = CorMat['R_C2']

[0637] R_C3 = CorMat['R_C3']

[0638] # Preset the dictionaries for the correlation coefficient dot product sets

[0639] Lcorr_neg = {}

[0640] Lcorr_pos = {}

[0641] Lpcorr_neg = {}

[0642] Lpcorr_pos = {}

[0643] #-------------------------------------------------------

[0644] #Take the absolute value of the correlation coefficient matrix

[0645] R_C1_abs = abs(R_C1)

[0646] R_C2_abs = abs(R_C2)

[0647] R_C3_abs = abs(R_C3)

[0648] #Construct an upper triangular matrix and subtract the identity matrix

[0649] np_upTriangle = np.triu(np.ones((topF,topF)))-np.eye(topF)

[0650] # Multiply the correlation coefficient matrix (absolute value) by the upper triangular matrix (minus the identity matrix).

[0651] R_C1_abs_upTri = pd.DataFrame(np.array(R_C1_abs)*np_upTriangle,

[0652] index=R_C1_abs.columns,

[0653] columns=R_C1_abs.columns)

[0654] R_C2_abs_upTri = pd.DataFrame(np.array(R_C2_abs)*np_upTriangle,

[0655] index=R_C2_abs.columns,

[0656] columns=R_C2_abs.columns)

[0657] R_C3_abs_upTri = pd.DataFrame(np.array(R_C3_abs)*np_upTriangle,

[0658] index=R_C3_abs.columns,

[0659] columns=R_C3_abs.columns)

[0660] #Filter the topL items in each matrix by their correlation coefficients, and set all other items to zero.

[0661] R_C1_abs_dict = {}

[0662] R_C1_abs_values = []

[0663] R_C1_abs_keys_sel = []

[0664] for i in R_C1_abs_upTri.index:

[0665] for j in R_C1_abs_upTri.columns:

[0666] if R_C1_abs_upTri.loc[i,j]>0:

[0667] R_C1_abs_dict[(i,j)]=R_C1_abs_upTri.loc[i,j]

[0668] for v in R_C1_abs_dict.values():

[0669] R_C1_abs_values.append(v)

[0670] R_C1_abs_values.sort(reverse=True)

[0671] R_C1_abs_values_sel = R_C1_abs_values[0:topL] # The topL value can be set to different values in different groups as needed.

[0672] for v in R_C1_abs_values_sel:

[0673] for k in R_C1_abs_dict.keys():

[0674] if v == R_C1_abs_dict[k]:

[0675] R_C1_abs_keys_sel.append(k)

[0676] for k in R_C1_abs_keys_sel:

[0677] if R_C1.loc[k[0],k[1]]<0:

[0678] R_C1_abs_dict[k] = -R_C1_abs_dict[k]

[0679] R_C1_sel = pd.DataFrame(np.zeros((topF,topF)),# Here, R_C1_sel is defined

[0680] index=R_C1.columns,

[0681] columns=R_C1.columns)

[0682] for k in R_C1_abs_keys_sel:

[0683] R_C1_sel.loc[k[0],k[1]] = R_C1_abs_dict[k]

[0684] R_C2_abs_dict = {}

[0685] R_C2_abs_values = []

[0686] R_C2_abs_keys_sel = []

[0687] for i in R_C2_abs_upTri.index:

[0688] for j in R_C2_abs_upTri.columns:

[0689] if R_C2_abs_upTri.loc[i,j]>0:

[0690] R_C2_abs_dict[(i,j)]=R_C2_abs_upTri.loc[i,j]

[0691] for v in R_C2_abs_dict.values():

[0692] R_C2_abs_values.append(v)

[0693] R_C2_abs_values.sort(reverse=True)

[0694] R_C2_abs_values_sel = R_C2_abs_values[0:topL] # The topL value can be set to different values in different groups as needed.

[0695] for v in R_C2_abs_values_sel:

[0696] for k in R_C2_abs_dict.keys():

[0697] if v == R_C2_abs_dict[k]:

[0698] R_C2_abs_keys_sel.append(k)

[0699] for k in R_C2_abs_keys_sel:

[0700] if R_C2.loc[k[0],k[1]]<0:

[0701] R_C2_abs_dict[k] = -R_C2_abs_dict[k]

[0702] R_C2_sel = pd.DataFrame(np.zeros((topF,topF)), # Here R_C2_sel is defined.

[0703] index=R_C2.columns,

[0704] columns=R_C2.columns)

[0705] for k in R_C2_abs_keys_sel:

[0706] R_C2_sel.loc[k[0],k[1]] = R_C2_abs_dict[k]

[0707] R_C3_abs_dict = {}

[0708] R_C3_abs_values = []

[0709] R_C3_abs_keys_sel = []

[0710] for i in R_C3_abs_upTri.index:

[0711] for j in R_C3_abs_upTri.columns:

[0712] if R_C3_abs_upTri.loc[i,j]>0:

[0713] R_C3_abs_dict[(i,j)]=R_C3_abs_upTri.loc[i,j]

[0714] for v in R_C3_abs_dict.values():

[0715] R_C3_abs_values.append(v)

[0716] R_C3_abs_values.sort(reverse=True)

[0717] R_C3_abs_values_sel = R_C3_abs_values[0:topL] # The topL value can be set to different values in different groups as needed.

[0718] for v in R_C3_abs_values_sel:

[0719] for k in R_C3_abs_dict.keys():

[0720] if v == R_C3_abs_dict[k]:

[0721] R_C3_abs_keys_sel.append(k)

[0722] for k in R_C3_abs_keys_sel:

[0723] if R_C3.loc[k[0],k[1]]<0:

[0724] R_C3_abs_dict[k] = -R_C3_abs_dict[k]

[0725] R_C3_sel = pd.DataFrame(np.zeros((topF,topF)), # Here R_C3_sel is defined

[0726] index=R_C3.columns,

[0727] columns=R_C3.columns)

[0728] for k in R_C3_abs_keys_sel:

[0729] R_C3_sel.loc[k[0],k[1]] = R_C3_abs_dict[k]

[0730] # Pairwise element-wise multiplication (dot product) of the screened matrices

[0731] R_C1dotC2 = R_C1_sel*R_C2_sel

[0732] R_C1dotC3 = R_C1_sel*R_C3_sel

[0733] R_C2dotC3 = R_C2_sel*R_C3_sel

[0734] C1dotC2_nz = []

[0735] C1dotC3_nz = []

[0736] C2dotC3_nz = []

[0737] #Calculate the non-zero terms in the dot product matrix and put them into the corresponding set.

[0738] C1dotC2_Lcorr_pos = {}

[0739] C1dotC2_Lcorr_neg = {}

[0740] for i in R_C1dotC2.index:

[0741] for j in R_C1dotC2.columns:

[0742] if R_C1dotC2.loc[i,j]>0:

[0743] C1dotC2_Lcorr_pos[(i,j)] = R_C1dotC2.loc[i,j]

[0744] C1dotC2_nz.append((i,j))

[0745] elif R_C1dotC2.loc[i,j]<0:

[0746] C1dotC2_Lcorr_neg[(i,j)] = R_C1dotC2.loc[i,j]

[0747] C1dotC2_nz.append((i,j))

[0748] C1dotC3_Lcorr_pos = {}

[0749] C1dotC3_Lcorr_neg = {}

[0750] for i in R_C1dotC3.index:

[0751] for j in R_C1dotC3.columns:

[0752] if R_C1dotC3.loc[i,j]>0:

[0753] C1dotC3_Lcorr_pos[(i,j)] = R_C1dotC3.loc[i,j]

[0754] C1dotC3_nz.append((i,j))

[0755] elif R_C1dotC3.loc[i,j]<0:

[0756] C1dotC3_Lcorr_neg[(i,j)] = R_C1dotC3.loc[i,j]

[0757] C1dotC3_nz.append((i,j))

[0758] C2dotC3_Lcorr_pos = {}

[0759] C2dotC3_Lcorr_neg = {}

[0760] for i in R_C2dotC3.index:

[0761] for j in R_C2dotC3.columns:

[0762] if R_C2dotC3.loc[i,j]>0:

[0763] C2dotC3_Lcorr_pos[(i, j)] = R_C2dotC3.loc[i, j]

[0764] C2dotC3_nz.append((i, j))

[0765] elif R_C2dotC3.loc[i, j] < 0:

[0766] C2dotC3_Lcorr_neg[(i, j)] = R_C2dotC3.loc[i, j]

[0767] C2dotC3_nz.append((i, j))

[0768] # Put into the Lcorr set

[0769] Lcorr_pos['C1dotC2_Lcorr_pos'] = C1dotC2_Lcorr_pos

[0770] Lcorr_pos['C1dotC3_Lcorr_pos'] = C1dotC3_Lcorr_pos

[0771] Lcorr_pos['C2dotC3_Lcorr_pos'] = C2dotC3_Lcorr_pos

[0772] Lcorr_neg['C1dotC2_Lcorr_neg'] = C1dotC2_Lcorr_neg

[0773] Lcorr_neg['C1dotC3_Lcorr_neg'] = C1dotC3_Lcorr_neg

[0774] Lcorr_neg['C2dotC3_Lcorr_neg'] = C2dotC3_Lcorr_neg

[0775] print('Lcorr_pos')

[0776] print(Lcorr_pos)

[0777] print('Lcorr_neg')

[0778] print(Lcorr_neg)

[0779] #-----------------------------------------------------------

[0780] # For the gene pairs whose entries in the correlation coefficient dot product matrix are non-zero, calculate the partial correlation coefficients of each pair in the two groups and multiply them.

[0781] # Then, according to the sign of the product, put the gene pair and the product into the corresponding dictionary.

[0782] C1dotC2_Lpcorr_pos = {}

[0783] C1dotC2_Lpcorr_neg = {}

[0784] for k in C1dotC2_nz:

[0785] a = LA.det(np.array(R_C1.drop(k[0],axis=0).drop(k[1],axis=1)))

[0786] b = LA.det(np.array(R_C1.drop(k[0],axis=0).drop(k[0],axis=1)))

[0787] c = LA.det(np.array(R_C1.drop(k[1],axis=0).drop(k[1],axis=1)))

[0788] rho_C1 = a / np.sqrt(b*c)

[0789] a = LA.det(np.array(R_C2.drop(k[0],axis=0).drop(k[1],axis=1)))

[0790] b = LA.det(np.array(R_C2.drop(k[0],axis=0).drop(k[0],axis=1)))

[0791] c = LA.det(np.array(R_C2.drop(k[1],axis=0).drop(k[1],axis=1)))

[0792] rho_C2 = a / np.sqrt(b*c)

[0793] rho_C1dotC2 = rho_C1*rho_C2

[0794] if rho_C1dotC2 > 0:

[0795] C1dotC2_Lpcorr_pos[k] = rho_C1dotC2

[0796] elif rho_C1dotC2 < 0:

[0797] C1dotC2_Lpcorr_neg[k] = rho_C1dotC2

[0798] C1dotC3_Lpcorr_pos = {}

[0799] C1dotC3_Lpcorr_neg = {}

[0800] for k in C1dotC3_nz:

[0801] a = LA.det(np.array(R_C1.drop(k[0],axis=0).drop(k[1],axis=1)))

[0802] b = LA.det(np.array(R_C1.drop(k[0],axis=0).drop(k[0],axis=1)))

[0803] c = LA.det(np.array(R_C1.drop(k[1],axis=0).drop(k[1],axis=1)))

[0804] rho_C1 = a / np.sqrt(b*c)

[0805] a = LA.det(np.array(R_C3.drop(k[0],axis=0).drop(k[1],axis=1)))

[0806] b = LA.det(np.array(R_C3.drop(k[0],axis=0).drop(k[0],axis=1)))

[0807] c = LA.det(np.array(R_C3.drop(k[1],axis=0).drop(k[1],axis=1)))

[0808] rho_C3 = a / np.sqrt(b*c)

[0809] rho_C1dotC3 = rho_C1*rho_C3

[0810] if rho_C1dotC3 > 0:

[0811] C1dotC3_Lpcorr_pos[k] = rho_C1dotC3

[0812] elif rho_C1dotC3 < 0:

[0813] C1dotC3_Lpcorr_neg[k] = rho_C1dotC3

[0814] C2dotC3_Lpcorr_pos = {}

[0815] C2dotC3_Lpcorr_neg = {}

[0816] for k in C2dotC3_nz:

[0817] a = LA.det(np.array(R_C2.drop(k[0],axis=0).drop(k[1],axis=1)))

[0818] b = LA.det(np.array(R_C2.drop(k[0],axis=0).drop(k[0],axis=1)))

[0819] c = LA.det(np.array(R_C2.drop(k[1],axis=0).drop(k[1],axis=1)))

[0820] rho_C2 = a / np.sqrt(b*c)

[0821] a = LA.det(np.array(R_C3.drop(k[0],axis=0).drop(k[1],axis=1)))

[0822] b = LA.det(np.array(R_C3.drop(k[0],axis=0).drop(k[0],axis=1)))

[0823] c = LA.det(np.array(R_C3.drop(k[1],axis=0).drop(k[1],axis=1)))

[0824] rho_C3 = a / np.sqrt(b * c)

[0825] rho_C2dotC3 = rho_C2 * rho_C3

[0826] if rho_C2dotC3 > 0:

[0827] C2dotC3_Lpcorr_pos[k] = rho_C2dotC3

[0828] elif rho_C2dotC3 < 0:

[0829] C2dotC3_Lpcorr_neg[k] = rho_C2dotC3

[0830] # Put into the Lpcorr set

[0831] Lpcorr_pos['C1dotC2_Lpcorr_pos'] = C1dotC2_Lpcorr_pos

[0832] Lpcorr_pos['C1dotC3_Lpcorr_pos'] = C1dotC3_Lpcorr_pos

[0833] Lpcorr_pos['C2dotC3_Lpcorr_pos'] = C2dotC3_Lpcorr_pos

[0834] Lpcorr_neg['C1dotC2_Lpcorr_neg'] = C1dotC2_Lpcorr_neg

[0835] Lpcorr_neg['C1dotC3_Lpcorr_neg'] = C1dotC3_Lpcorr_neg

[0836] Lpcorr_neg['C2dotC3_Lpcorr_neg'] = C2dotC3_Lpcorr_neg

[0837] print('Lpcorr_pos')

[0838] print(Lpcorr_pos)

[0839] print('Lpcorr_neg')

[0840] print(Lpcorr_neg)

[0841] return [Lcorr_pos, Lcorr_neg, Lpcorr_pos, Lpcorr_neg]
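The screening and dot product steps of Dotproduct can be summarized in a short sketch. It is an illustration under stated assumptions, not the listing's own code: the helper names `keep_top_l` and `dot_product_pairs` and the toy matrices are hypothetical. Each matrix keeps only its `top_l` largest upper-triangle entries by absolute value (with their original sign), and the screened matrices are then multiplied element by element.

```python
import numpy as np
import pandas as pd


def keep_top_l(r, top_l):
    """Keep the top_l largest |r| upper-triangle entries (signed); zero the rest."""
    rows, cols = np.triu_indices(len(r), k=1)
    values = r.to_numpy()[rows, cols]
    order = np.argsort(-np.abs(values), kind="stable")[:top_l]
    screened = np.zeros(r.shape)
    screened[rows[order], cols[order]] = values[order]
    return pd.DataFrame(screened, index=r.index, columns=r.columns)


def dot_product_pairs(r_a, r_b, top_l):
    """Element-wise product of two screened matrices, split by sign."""
    product = keep_top_l(r_a, top_l) * keep_top_l(r_b, top_l)
    pos, neg = {}, {}
    for i in product.index:
        for j in product.columns:
            value = float(product.loc[i, j])
            if value > 0:
                pos[(i, j)] = value
            elif value < 0:
                neg[(i, j)] = value
    return pos, neg


# Hypothetical toy correlation matrices for two groups of three genes.
genes = ["g1", "g2", "g3"]
r_c1 = pd.DataFrame([[1.0, 0.6, 0.1], [0.6, 1.0, -0.5], [0.1, -0.5, 1.0]],
                    index=genes, columns=genes)
r_c2 = pd.DataFrame([[1.0, 0.7, 0.05], [0.7, 1.0, 0.4], [0.05, 0.4, 1.0]],
                    index=genes, columns=genes)
pos, neg = dot_product_pairs(r_c1, r_c2, top_l=2)
print(pos)
print(neg)
```

A positive product means the pair is strongly correlated in the same direction in both groups; a negative product means the direction flips between the groups.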

[0842] def multiCorrelationMatrix(topF, topHL, topW): # The topHL value can be set to different values for different groups as needed.

[0843] # Generate the data matrix used to calculate the multiple correlation coefficients, screening genes with the difference set and dot product set methods.

[0844] H_set = Substraction(topF, topHL)

[0845] L_set = Dotproduct(topF, topHL)

[0846] table_2 = FSelector(topF)

[0847] Set = H_set + L_set

[0848] geneListT = []

[0849] for i in Set:

[0850] for j in i.values():

[0851] for k in j.keys():

[0852] for l in k:

[0853] if l not in geneListT:

[0854] geneListT.append(l)

[0855] table_3 = table_2.loc[geneListT,:]

[0856] # Calculate the correlation coefficient matrix used to calculate the multiple correlation coefficient

[0857] # Divide table_3 into multiple tables by group

[0858] table_3a = table_3.iloc[:,0:9]

[0859] table_3b = table_3.iloc[:,9:18]

[0860] table_3c = table_3.iloc[:,18:27]

[0861] # Calculate the correlation coefficient matrix for each group separately.

[0862] mul_R_C1 = table_3a.T.corr()

[0863] mul_R_C2 = table_3b.T.corr()

[0864] mul_R_C3 = table_3c.T.corr()

[0865] #Pack the correlation coefficient matrix into a dictionary

[0866] mul_CorMat = {}

[0867] mul_CorMat['mul_R_C1']=mul_R_C1

[0868] mul_CorMat['mul_R_C2']=mul_R_C2

[0869] mul_CorMat['mul_R_C3']=mul_R_C3

[0870] print('correlation coefficient matrix')

[0871] print(mul_CorMat)

[0872] #Calculate the multiple correlation coefficient of each gene in each group and add it to a dictionary.

[0873] miu_C1_genes = {}

[0874] for i in mul_R_C1.index:

[0875] a = LA.det(np.array(mul_R_C1))

[0876] b = LA.det(np.array(mul_R_C1.drop(i,axis=0).drop(i,axis=1)))

[0877] miu_C1_genes[i] = np.sqrt(1-a / b)

[0878] miu_C2_genes = {}

[0879] for i in mul_R_C2.index:

[0880] a = LA.det(np.array(mul_R_C2))

[0881] b = LA.det(np.array(mul_R_C2.drop(i,axis=0).drop(i,axis=1)))

[0882] miu_C2_genes[i] = np.sqrt(1-a / b)

[0883] miu_C3_genes = {}

[0884] for i in mul_R_C3.index:

[0885] a = LA.det(np.array(mul_R_C3))

[0886] b = LA.det(np.array(mul_R_C3.drop(i,axis=0).drop(i,axis=1)))

[0887] miu_C3_genes[i] = np.sqrt(1-a / b)

[0888] print('Multiple Correlation Coefficient')

[0889] print('miu_C1_genes')

[0890] print(miu_C1_genes)

[0891] print('miu_C2_genes')

[0892] print(miu_C2_genes)

[0893] print('miu_C3_genes')

[0894] print(miu_C3_genes)

[0895] #Calculate the changes in the multiple correlation coefficient and add them to the corresponding set.

[0896] Wmcorr_pos = {}

[0897] Wmcorr_neg = {}

[0898] sub_dict_all = {}

[0899] for i in geneListT:

[0900] sub_dict = {}

[0901] C1subC2 = miu_C1_genes[i] - miu_C2_genes[i]

[0902] C1subC3 = miu_C1_genes[i] - miu_C3_genes[i]

[0903] C2subC3 = miu_C2_genes[i] - miu_C3_genes[i]

[0904] sub_keys = ['C1subC2', 'C1subC3', 'C2subC3']

[0905] sub_values = [C1subC2, C1subC3, C2subC3]

[0906] # Take the absolute value, sort, and select

[0907] C1subC2_abs = abs(C1subC2)

[0908] C1subC3_abs = abs(C1subC3)

[0909] C2subC3_abs = abs(C2subC3)

[0916] sub_dict_all[i]=sub_dict

[0917] '''print(sub_dict_all)'''

[0918] for i,j in sub_dict_all.items():

[0919] ij_dict = {}

[0920] ij_dict[i] = j

[0921] for k,l in ij_dict.items():

[0922] i_dict_pos = {}

[0923] i_dict_neg = {}

[0924] for m,n in l.items():

[0925] if n>0:

[0926] i_dict_pos[m]=n

[0927] elif n<0:

[0928] i_dict_neg[m]=n

[0929] Wmcorr_pos[i]=i_dict_pos

[0930] Wmcorr_neg[i]=i_dict_neg

[0931] print('Wmcorr_pos')

[0932] print(Wmcorr_pos)

[0933] print('Wmcorr_neg')

[0934] print(Wmcorr_neg)

[0935] multiCorrelationMatrix(4,2,2)
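The multiple correlation coefficient computed in multiCorrelationMatrix, sqrt(1 - det(R) / det(R with row and column i removed)), can be checked on a small matrix. The sketch below is a minimal illustration; the function name `multiple_correlation` and the toy matrix are hypothetical. It assumes a non-singular correlation matrix: when the number of genes exceeds the number of samples in a group minus one, the sample correlation matrix is singular and the determinant ratio becomes numerically unreliable.

```python
import numpy as np
import pandas as pd


def multiple_correlation(r):
    """Multiple correlation of each gene with all other genes in r."""
    det_full = np.linalg.det(r.to_numpy())
    result = {}
    for gene in r.index:
        # Determinant of r with the gene's own row and column removed.
        det_minor = np.linalg.det(r.drop(index=gene, columns=gene).to_numpy())
        result[gene] = float(np.sqrt(1.0 - det_full / det_minor))
    return result


# Hypothetical toy correlation matrix for one group of three genes.
genes = ["g1", "g2", "g3"]
r = pd.DataFrame([[1.0, 0.6, 0.1], [0.6, 1.0, -0.5], [0.1, -0.5, 1.0]],
                 index=genes, columns=genes)
mu = multiple_correlation(r)
print(mu)
```

For two genes the expression reduces to the absolute value of their correlation coefficient, which is a quick sanity check on the formula.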

Claims

1. A biochip data feature engineering method based on statistical machine learning, characterized in that it includes the following steps:

Step S10: generate a data matrix;

Step S20: perform z-score standardization;

Step S30: calculate the $F$ values and screen the gene data with large $F$ values;

Step S40: generate the correlation coefficient matrices;

Step S50: screen gene pairs;

Step S50 includes:

Step S51: calculate the difference sets: take the correlation coefficient matrices $R^{C_1}, R^{C_2}, \ldots, R^{C_m}$ and subtract them pairwise, where the difference between the $a$-th and the $b$-th matrix is $D^{ab} = R^{C_a} - R^{C_b}$; the element in row $i$, column $j$ of $D^{ab}$ is $d^{ab}_{ij}$ and its absolute value is $|d^{ab}_{ij}|$; set a value $h$, $h \in \mathbb{Z}^{+}$, and record the $h$ largest items of $|d^{ab}_{ij}|$ as $|d^{ab}_{ij}|^{\text{lar}}$; traverse $|d^{ab}_{ij}|^{\text{lar}}$, filter the items with $d^{ab}_{ij} > 0$, record the subscripts $i, j$ of each item, denote the corresponding gene pair as $(i, j)$, and denote the set of all such pairs as ${}^{\text{pos}}H^{ab}_{\text{corr}}$; traverse $|d^{ab}_{ij}|^{\text{lar}}$, filter the items with $d^{ab}_{ij} < 0$, record the subscripts $i, j$ of each item, denote the corresponding gene pair as $(i, j)$, and denote the set of all such pairs as ${}^{\text{neg}}H^{ab}_{\text{corr}}$; $a$, $b$, $i$ and $j$ are all subscripts;

let $R_{ij}$ be the determinant of the matrix that remains after the row and the column containing the term $r_{ij}$ are removed from the matrix $R$, and denote the partial correlation coefficient by $\rho_{ij}$, calculated as follows: $\rho_{ij} = \dfrac{R_{ij}}{\sqrt{R_{ii} R_{jj}}}$; for all gene pairs recorded from $|d^{ab}_{ij}|^{\text{lar}}$, calculate $\rho^{C_a}_{ij}$ and $\rho^{C_b}_{ij}$ separately and calculate their difference $\rho^{C_a}_{ij} - \rho^{C_b}_{ij}$, recorded as $\delta^{ab}_{ij}$; traverse $\delta^{ab}_{ij}$, filter the items with $\delta^{ab}_{ij} > 0$, record the subscripts of each item, denote the corresponding gene pair as $(i, j)$, and denote the set of all such pairs as ${}^{\text{pos}}H^{ab}_{\text{pcorr}}$; traverse $\delta^{ab}_{ij}$, filter the items with $\delta^{ab}_{ij} < 0$, record the subscripts of each item, denote the corresponding gene pair as $(i, j)$, and denote the set of all such pairs as ${}^{\text{neg}}H^{ab}_{\text{pcorr}}$;

$\mathbb{Z}^{+}$ refers to the set of positive integers; $h$ is the number of gene pairs to be selected, $h \in \mathbb{Z}^{+}$, and its upper limit is $f(f-1)/2$, where $f$ is the number of genes retained in Step S30; the mark "lar" indicates the $h$ largest items selected; the upper-left mark "pos" or "neg" indicates that the value is positive or negative; the parentheses represent a gene pair that forms an element of the set; each set is a mathematical set containing the gene pairs with the specific subscripts described above; "lar", "pos", "neg", "corr", "pcorr" and "$ab$" are all marks rather than exponents;

by comparing the elements of ${}^{\text{pos}}H^{ab}_{\text{corr}}$, ${}^{\text{neg}}H^{ab}_{\text{corr}}$, ${}^{\text{pos}}H^{ab}_{\text{pcorr}}$ and ${}^{\text{neg}}H^{ab}_{\text{pcorr}}$, gene pairs with each combination of signs of the correlation coefficient difference and the partial correlation coefficient difference can be obtained; combining the upper-left marks of the same gene pair gives, in the same notation, the four combined difference sets (positive and positive, positive and negative, negative and positive, negative and negative);

Step S52: calculate the dot product sets: traverse each item of each of the correlation coefficient matrices $R^{C_1}, R^{C_2}, \ldots, R^{C_m}$; set a value $l$, $l \in \mathbb{Z}^{+}$; calculate $|r^{C_k}_{ij}|$, retain the $l$ largest items and set all other elements to zero to form the new screened correlation coefficient matrices $R^{C_1}_{\text{sel}}, R^{C_2}_{\text{sel}}, \ldots, R^{C_m}_{\text{sel}}$, $m$ in total; take the screened matrices and form their pairwise dot products, that is, multiply the elements at corresponding positions, where the dot product of the $a$-th and the $b$-th screened matrix is $P^{ab} = R^{C_a}_{\text{sel}} \odot R^{C_b}_{\text{sel}}$, and the non-zero terms of $P^{ab}$ are denoted $p^{ab}_{ij,\text{nz}}$; traverse $p^{ab}_{ij,\text{nz}}$, filter the items with $p^{ab}_{ij,\text{nz}} > 0$, record the subscripts of each item, denote the corresponding gene pair as $(i, j)$, and denote the set of all such pairs as ${}^{\text{pos}}L^{ab}_{\text{corr}}$; traverse $p^{ab}_{ij,\text{nz}}$, filter the items with $p^{ab}_{ij,\text{nz}} < 0$, record the subscripts of each item, denote the corresponding gene pair as $(i, j)$, and denote the set of all such pairs as ${}^{\text{neg}}L^{ab}_{\text{corr}}$;

for all gene pairs of $p^{ab}_{ij,\text{nz}}$, calculate separately $\rho^{C_a}_{ij} = \dfrac{R^{C_a}_{ij}}{\sqrt{R^{C_a}_{ii} R^{C_a}_{jj}}}$ and $\rho^{C_b}_{ij} = \dfrac{R^{C_b}_{ij}}{\sqrt{R^{C_b}_{ii} R^{C_b}_{jj}}}$, and find their product $\rho^{C_a}_{ij}\,\rho^{C_b}_{ij}$, recorded as $\pi^{ab}_{ij}$; traverse $\pi^{ab}_{ij}$, filter the items with $\pi^{ab}_{ij} > 0$, record the subscripts of each item, denote the corresponding gene pair as $(i, j)$, and denote the set of all such pairs as ${}^{\text{pos}}L^{ab}_{\text{pcorr}}$; traverse $\pi^{ab}_{ij}$, filter the items with $\pi^{ab}_{ij} < 0$, record the subscripts of each item, denote the corresponding gene pair as $(i, j)$, and denote the set of all such pairs as ${}^{\text{neg}}L^{ab}_{\text{pcorr}}$;

by comparing the elements of ${}^{\text{pos}}L^{ab}_{\text{corr}}$, ${}^{\text{neg}}L^{ab}_{\text{corr}}$, ${}^{\text{pos}}L^{ab}_{\text{pcorr}}$ and ${}^{\text{neg}}L^{ab}_{\text{pcorr}}$, gene pairs with each combination of signs of the correlation coefficient product and the partial correlation coefficient product can be obtained; combining the upper-left marks of the same gene pair gives, in the same notation, the four combined dot product sets; "sel", "$ab$" and "nz" are all marks rather than exponents;

Step S60: calculate the multiple correlation coefficients;

Step S60 includes: take the set $G$ of all genes contained in the gene pairs of the eight sets ${}^{\text{pos}}H_{\text{corr}}$, ${}^{\text{neg}}H_{\text{corr}}$, ${}^{\text{pos}}H_{\text{pcorr}}$, ${}^{\text{neg}}H_{\text{pcorr}}$, ${}^{\text{pos}}L_{\text{corr}}$, ${}^{\text{neg}}L_{\text{corr}}$, ${}^{\text{pos}}L_{\text{pcorr}}$ and ${}^{\text{neg}}L_{\text{pcorr}}$; renumber all genes in $G$ as $1, 2, \ldots, t$; extract from $X^{\text{z-score}}_{\text{sel}}$ the rows of genes $1, 2, \ldots, t$ to form the screened standardized data matrix $X_{\text{Mul}}$; let the correlation coefficient matrix generated from the samples of the $k$-th group be $R^{C_k}_{\text{Mul}}$, whose element in row $i$, column $j$ is the correlation coefficient $r^{C_k}_{ij}$ between the $i$-th gene and the $j$-th gene among the $t$ genes represented by the rows of the $k$-th group's samples, calculated as follows: $r^{C_k}_{ij} = \dfrac{\sum_{s=1}^{q_k} (x_{is} - \bar{x}_i)(x_{js} - \bar{x}_j)}{\sqrt{\sum_{s=1}^{q_k} (x_{is} - \bar{x}_i)^2}\,\sqrt{\sum_{s=1}^{q_k} (x_{js} - \bar{x}_j)^2}}$, where $x_{is}$ is the value of gene $i$ in the $s$-th of the $q_k$ samples of the $k$-th group and $\bar{x}_i$ is its mean over that group; clearly, the correlation coefficient has two properties: $r^{C_k}_{ii} = 1$; $r^{C_k}_{ij} = r^{C_k}_{ji}$; $R^{C_k}_{\text{Mul}}$ is therefore a symmetric $t \times t$ matrix with ones on the diagonal; this gives $R^{C_1}_{\text{Mul}}, R^{C_2}_{\text{Mul}}, \ldots, R^{C_m}_{\text{Mul}}$, a total of $m$ correlation coefficient matrices; let $R^{C_k}_{\text{Mul},ii}$ be the determinant of the matrix that remains after the row and the column containing the term $r^{C_k}_{ii}$ are removed from $R^{C_k}_{\text{Mul}}$; the multiple correlation coefficient $\mu^{C_k}_{i}$ of a single gene $i$ in the $k$-th group is calculated as follows: $\mu^{C_k}_{i} = \sqrt{1 - \dfrac{\left|R^{C_k}_{\text{Mul}}\right|}{R^{C_k}_{\text{Mul},ii}}}$; "Mul", "$C_k$", "$i$" and "$ii$" are all marks rather than exponents; $|\cdot|$ denotes the determinant of the matrix it encloses;

Step S70: mark the changes in the multiple correlation coefficients of genes;

Step S70 includes: for each gene, calculate pairwise the absolute value of the difference between its multiple correlation coefficients in different groups, where the absolute value of the difference between the $a$-th and the $b$-th group is $\left|\mu^{C_a}_{i} - \mu^{C_b}_{i}\right|$, recorded as $|w^{ab}_{i}|$; set a value $w$, $w \in \mathbb{Z}^{+}$; for each gene extract the $w$ largest values of $|w^{ab}_{i}|$; if $w^{ab}_{i} > 0$, record the group classification of the two groups in the subscripts and generate the set ${}^{\text{pos}}W_{\text{mcorr}}$; if $w^{ab}_{i} < 0$, record the group classification of the two groups in the subscripts and generate the set ${}^{\text{neg}}W_{\text{mcorr}}$.
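The determinant-based partial correlation used in Step S51 and Step S52 can be illustrated as follows. This is a minimal sketch with hypothetical function names and a toy matrix. `partial_corr_minors` follows the expression used in the listing, a / sqrt(b * c) with plain minors. For comparison, `partial_corr_cofactor` applies the textbook cofactor form, which multiplies the same ratio by $-(-1)^{i+j}$; the two agree when $i + j$ is odd and differ in sign when $i + j$ is even, so the choice of convention should be kept in mind when the signs are interpreted.

```python
import numpy as np
import pandas as pd


def minor_det(r, row_gene, col_gene):
    """Determinant of r after removing one row and one column."""
    return np.linalg.det(r.drop(index=row_gene, columns=col_gene).to_numpy())


def partial_corr_minors(r, gene_i, gene_j):
    """Ratio of minors as written in the listing: a / sqrt(b * c)."""
    a = minor_det(r, gene_i, gene_j)
    b = minor_det(r, gene_i, gene_i)
    c = minor_det(r, gene_j, gene_j)
    return a / np.sqrt(b * c)


def partial_corr_cofactor(r, gene_i, gene_j):
    """Textbook form: the same ratio multiplied by -(-1)**(i + j)."""
    i = r.index.get_loc(gene_i)
    j = r.columns.get_loc(gene_j)
    # Zero-based positions have the same parity of i + j as one-based ones.
    sign = 1.0 if (i + j) % 2 == 1 else -1.0
    return sign * partial_corr_minors(r, gene_i, gene_j)


# Hypothetical toy correlation matrix for one group of three genes.
genes = ["g1", "g2", "g3"]
r = pd.DataFrame([[1.0, 0.6, 0.1], [0.6, 1.0, -0.5], [0.1, -0.5, 1.0]],
                 index=genes, columns=genes)
print(partial_corr_minors(r, "g1", "g2"), partial_corr_cofactor(r, "g1", "g2"))
print(partial_corr_minors(r, "g1", "g3"), partial_corr_cofactor(r, "g1", "g3"))
```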

2. The biochip data feature engineering method based on statistical machine learning according to claim 1, characterized in that Step S10 includes: after obtaining the biochip data, numbering the genes $1, 2, \ldots, i, \ldots, n$, with $i, n \in \mathbb{Z}^{+}$; numbering the groups to which the samples belong $C_1, C_2, \ldots, C_k, \ldots, C_m$, with $k, m \in \mathbb{Z}^{+}$; and numbering the data samples in the groups $C_{1,1}, C_{1,2}, \ldots, C_{k,s}, \ldots, C_{m,q_m}$, where "$1,1$", "$1,2$", ..., "$k,s$", ..., "$m,q_m$" are all subscripts indicating that the sample is located in the $k$-th group and is the $s$-th sample of that group, $q_k$ is the number of samples contained in the $k$-th group, and $1 \le s \le q_k$. A data matrix $X^{\text{orig}}$ is generated with the gene numbers as row names and the sample numbers as column names, the position of each row and column name serving as the row and column index. Each element of the data matrix is the raw expression level of a single gene in a single sample on the biochip; the element in row $i$, column $j$ is denoted $x^{\text{orig}}_{ij}$. The upper-right superscript "orig" indicates data before processing; the lower-right subscript "$ij$" gives the position of the element in the matrix.

3. The biochip data feature engineering method based on statistical machine learning according to claim 1, characterized in that Step S20 includes: using the formula $x^{\text{z-score}}_{ij} = \dfrac{x^{\text{orig}}_{ij} - \bar{x}_i}{s_i}$, where $\bar{x}_i$ and $s_i$ are the mean and the standard deviation of the $i$-th row of $X^{\text{orig}}$, perform z-score standardization on each row of $X^{\text{orig}}$; after every row has been standardized, the standardized data matrix $X^{\text{z-score}}$ is obtained from $X^{\text{orig}}$. The upper-right superscript "z-score" indicates that the data in the matrix have been standardized using the z-score.
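The row-wise standardization of Step S20 can be sketched as below. The function name `zscore_rows` and the toy values are hypothetical, and the use of the sample standard deviation (`ddof=1`, the pandas default) is an assumption; the population form (`ddof=0`) would only rescale each row by a constant factor.

```python
import pandas as pd


def zscore_rows(x, ddof=1):
    """Standardize each row (gene) to zero mean and unit standard deviation."""
    mean = x.mean(axis=1)
    std = x.std(axis=1, ddof=ddof)  # Assumption: sample standard deviation.
    return x.sub(mean, axis=0).div(std, axis=0)


# Hypothetical raw expression values: rows are genes, columns are samples.
x_orig = pd.DataFrame([[1.0, 2.0, 3.0], [10.0, 20.0, 60.0]],
                      index=["g1", "g2"], columns=["s1", "s2", "s3"])
x_z = zscore_rows(x_orig)
print(x_z.round(3))
```

Standardizing per row puts genes with very different expression ranges on a common scale before the $F$ values and correlation coefficients are computed.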

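4. The biochip data feature engineering method based on statistical machine learning according to claim 1, characterized in that Step S30 includes: for the gene represented by the $i$-th row of $X^{\text{z-score}}$, let the total error be $SST_i$, the within-group error be $SSE_i$, the between-group error be $SSA_i$, and the required statistic be $F_i$, where the lower-right mark "$i$" is a subscript. $SST_i$ is calculated as follows: $SST_i = \sum_{k=1}^{m} \sum_{s=1}^{q_k} \left(x_{i,ks} - \bar{x}_i\right)^2$; $SSE_i$ is calculated as follows: $SSE_i = \sum_{k=1}^{m} \sum_{s=1}^{q_k} \left(x_{i,ks} - \bar{x}_{i,k}\right)^2$; according to the relationship $SST_i = SSE_i + SSA_i$, $SSA_i$ is calculated as follows: $SSA_i = SST_i - SSE_i$; so the $F_i$ value is calculated as follows: $F_i = \dfrac{SSA_i / (m - 1)}{SSE_i / (q - m)}$, where $x_{i,ks}$ is the standardized value of gene $i$ in the $s$-th sample of the $k$-th group, $\bar{x}_i$ is the mean of row $i$ over all samples, $\bar{x}_{i,k}$ is the mean of row $i$ over the samples of the $k$-th group, $q_k$ is the number of samples in the $k$-th group, and $q = \sum_{k=1}^{m} q_k$ is the total number of samples. Calculate the $F$ value of each row, set a value $f$, $f \in \mathbb{Z}^{+}$, and extract from $X^{\text{z-score}}$ the $f$ rows with the largest $F$ values to form the screened standardized data matrix $X^{\text{z-score}}_{\text{sel}}$. In the process, the genes retained after feature selection, previously numbered within $1, 2, \ldots, n$, are renumbered in order as $1, 2, \ldots, f$. The lower-right subscript "sel" indicates that the data in this data matrix have been screened once; $f$ is the number of genes to be selected from the original genes; $\mathbb{Z}^{+}$ represents the set of positive integers, i.e. $f$ can only be a positive integer, and $f$ is less than or equal to $n$.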
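The $F$ value screening of Step S30 can be illustrated with a one-way analysis of variance computed per gene. The sketch below is not the patent's FSelector; the function name `f_values`, the group labels and the toy data are hypothetical, and the degrees of freedom (number of groups minus one, and number of samples minus number of groups) are the usual one-way ANOVA convention, assumed here.

```python
import pandas as pd


def f_values(x, groups):
    """One-way ANOVA F value for every row (gene) of x."""
    groups = pd.Series(groups, index=x.columns)
    k = groups.nunique()   # Number of groups.
    n = x.shape[1]         # Total number of samples.
    grand_mean = x.mean(axis=1)
    ss_total = x.sub(grand_mean, axis=0).pow(2).sum(axis=1)
    ss_between = pd.Series(0.0, index=x.index)
    for label in groups.unique():
        block = x[groups[groups == label].index]
        ss_between += block.shape[1] * (block.mean(axis=1) - grand_mean) ** 2
    ss_within = ss_total - ss_between  # Total = within + between.
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical data: two genes, three groups of three samples each.
samples = [f"s{i}" for i in range(1, 10)]
x = pd.DataFrame([[1, 2, 3, 2, 3, 4, 3, 4, 5],
                  [1, 2, 3, 1, 2, 3, 1, 2, 3]],
                 index=["gA", "gB"], columns=samples)
groups = ["C1"] * 3 + ["C2"] * 3 + ["C3"] * 3
f = f_values(x, groups)
top_genes = list(f.nlargest(1).index)  # Keep the topF = 1 gene.
print(f)
print(top_genes)
```

Gene gA, whose group means differ, gets the larger $F$ value and is retained; gene gB, whose group means are identical, gets $F = 0$.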

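5. The biochip data feature engineering method based on statistical machine learning according to claim 1, characterized in that Step S40 includes: letting the correlation coefficient matrix generated from the samples of the $k$-th group be $R^{C_k}$, the element in row $i$, column $j$ of $R^{C_k}$ is the correlation coefficient $r^{C_k}_{ij}$ between the $i$-th gene and the $j$-th gene among the $f$ genes represented by the rows of the $k$-th group's samples. The upper-right mark "$C_k$" indicates that the correlation matrix is calculated from the $k$-th group; the lower-right mark "$ij$" indicates that the correlation coefficient comes from a comparison between a pair of genes. $r^{C_k}_{ij}$ is calculated as follows: $r^{C_k}_{ij} = \dfrac{\sum_{s=1}^{q_k} (x_{is} - \bar{x}_i)(x_{js} - \bar{x}_j)}{\sqrt{\sum_{s=1}^{q_k} (x_{is} - \bar{x}_i)^2}\,\sqrt{\sum_{s=1}^{q_k} (x_{js} - \bar{x}_j)^2}}$, where $x_{is}$ is the screened standardized value of gene $i$ in the $s$-th of the $q_k$ samples of the $k$-th group and $\bar{x}_i$ is its mean over that group. Clearly, the correlation coefficient has two properties: $r^{C_k}_{ii} = 1$; $r^{C_k}_{ij} = r^{C_k}_{ji}$. $R^{C_k}$ then has the following form: $R^{C_k} = \begin{pmatrix} 1 & r^{C_k}_{12} & \cdots & r^{C_k}_{1f} \\ r^{C_k}_{21} & 1 & \cdots & r^{C_k}_{2f} \\ \vdots & \vdots & \ddots & \vdots \\ r^{C_k}_{f1} & r^{C_k}_{f2} & \cdots & 1 \end{pmatrix}$. This gives $R^{C_1}, R^{C_2}, \ldots, R^{C_m}$, a total of $m$ correlation coefficient matrices.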
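Step S40 can be sketched with pandas, whose `DataFrame.corr` computes Pearson correlation coefficients between columns by default; transposing the gene-by-sample block therefore gives gene-by-gene coefficients, as in the listing's `table.T.corr()` calls. The function name `group_correlation_matrices`, the group labels and the random toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def group_correlation_matrices(x_sel, groups):
    """Return one gene-by-gene Pearson correlation matrix per group."""
    groups = pd.Series(groups, index=x_sel.columns)
    matrices = {}
    for label in groups.unique():
        block = x_sel[groups[groups == label].index]
        # corr() correlates columns, so transpose to correlate genes.
        matrices[label] = block.T.corr()
    return matrices


# Hypothetical screened data: four genes, two groups of three samples each.
rng = np.random.default_rng(0)
x_sel = pd.DataFrame(rng.normal(size=(4, 6)),
                     index=["g1", "g2", "g3", "g4"],
                     columns=[f"s{i}" for i in range(1, 7)])
cor_mat = group_correlation_matrices(x_sel, ["C1"] * 3 + ["C2"] * 3)
print(cor_mat["C1"].round(2))
```

Each resulting matrix is symmetric with ones on its diagonal, matching the two properties stated in the claim.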