Use of a recombinant promoter to improve muscle performance in cattle
By constructing an adapter model and using a multi-strategy variant generation method, the invention addresses the low correlation of the Enformer model's cross-species predicted signals in bovine muscle tissue, enabling efficient optimization of MSTN gene promoter function in beef cattle and improving molecular breeding efficiency and gene expression levels.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- INSTITUTE OF ANIMAL SCIENCES OF CHINESE ACADEMY OF AGRICULTURAL SCIENCES
- Filing Date
- 2025-12-11
- Publication Date
- 2026-06-19
AI Technical Summary
In existing technologies, signals predicted by the Enformer model across species correlate poorly with measured gene expression signals in bovine muscle tissue, and there is no effective closed loop from predicted signals to designed functions, resulting in low efficiency in beef cattle breeding applications.
By constructing an adapter model and combining it with a multi-strategy variant generation method, high-dimensional epigenetic signals of the MSTN gene in beef cattle were extracted using deep learning. Promoter variants with enhanced or weakened expression functions were screened out, and their activities were verified through experiments.
This study achieved efficient and precise optimization of the MSTN gene promoter function in beef cattle, improving the efficiency and reliability of molecular breeding and significantly enhancing gene expression levels.
Figure CN121687205B_ABST
Description
Technical Field
[0001] This invention belongs to the field of animal genetics and relates to the application of a recombinant promoter in improving bovine muscle performance.
Background Technology
[0002] Promoters are core DNA elements that regulate the initiation of gene transcription. Their function directly determines the level of gene expression and they are key targets for achieving precise regulation of biological traits.
[0003] Myostatin (MSTN) is a negative regulator of muscle growth. In beef cattle breeding, the MSTN gene regulates muscle growth, and optimizing its promoter function has significant application value for improving beef cattle production performance.
[0004] In recent years, deep learning technology has provided new solutions for promoter function prediction and promoter improvement design. Among them, the Enformer model developed by DeepMind has made groundbreaking progress in this field. Trained on human and mouse genome data, this model can directly predict various epigenetic signals and gene expression levels with high accuracy from sequences, providing a powerful computational tool for understanding gene regulation mechanisms. However, there are no reports of directly applying general models like Enformer to the optimization of key economic trait genes in non-model organisms such as beef cattle.
[0005] First, there is a significant issue of species variability. The Enformer model, trained on human and mouse data, exhibits low correlation between predicted signals and experimentally measured signals (such as RNA-seq and ATAC-seq of bovine muscle tissue) when directly transferred to beef cattle, which have different genomic structures and cis-regulatory element distributions. This leads to a substantial decrease in prediction accuracy. Consequently, a large portion of the model's massive predicted trajectories becomes noise irrelevant to the actual biological processes of the target species. Second, there is a lack of an effective closed loop from predicted signals to designed functions. Existing methods largely focus on using models for prediction or analysis, lacking a systematic process to transform low-correlation predicted signals into a high-precision sequence-function mapping model that can guide promoter engineering design. This prevents the rapid and accurate screening of optimal candidates with the desired function (such as enhanced or weakened expression) from a vast number of virtual variants, limiting its efficiency in breeding practice.
[0006] Therefore, there is an urgent need in this field for a technical solution that can overcome cross-species prediction bias and realize sequence-function closed-loop intelligent design, thereby efficiently and accurately optimizing the promoter function of the MSTN gene in beef cattle and improving the feasibility and efficiency of molecular breeding applications.
Summary of the Invention
[0007] In order to solve the problems existing in the prior art, the first aspect of the present invention provides a biomaterial, wherein the biomaterial is any one of a first biomaterial, a second biomaterial and a third biomaterial, a combination of any two of them, or a combination of all three;
[0008] The first biological material is any one of the following: M1-1, M1-2, M1-3, M1-4, M1-5, M1-6 and M1-7;
[0009] M1-1: First promoter
[0010] The sequence of the first promoter is shown in SEQ ID NO.4;
[0011] M1-2: First gene expression cassette
[0012] The promoter of the first gene expression cassette is the first promoter described in M1-1;
[0013] M1-3: First Gene Engineering Vector
[0014] The first gene engineering vector contains a second gene expression cassette, which contains the first promoter described in M1-1, the first exogenous gene insertion site, and a terminator;
[0015] M1-4: The first recombinant gene engineering vector
[0016] The first exogenous gene is inserted into the first exogenous gene insertion site of the first gene engineering vector in M1-3 to form the first recombinant gene engineering vector.
[0017] M1-5: First genetically engineered cell
[0018] The first genetically engineered cell contains the first genetic engineering vector described in M1-3 or the first recombinant genetic engineering vector described in M1-4;
[0019] M1-6: First Composition
[0020] The first composition contains the first genetic engineering vector described in M1-3, the first recombinant genetic engineering vector described in M1-4, or the first genetic engineering cell described in M1-5;
[0021] M1-7: First Reagent Kit
[0022] The kit contains the first genetic engineering vector described in M1-3, the first recombinant genetic engineering vector described in M1-4, or the first genetic engineering cell described in M1-5;
[0023] The second biological material is any one of the following: M2-1, M2-2, M2-3, M2-4, M2-5, M2-6, and M2-7;
[0024] M2-1: Second promoter
[0025] The sequence of the second promoter is shown in SEQ ID NO.3;
[0026] M2-2: Third gene expression cassette
[0027] The promoter of the third gene expression cassette is the second promoter described in M2-1;
[0028] M2-3: Second genetic engineering vector
[0029] The second gene engineering vector contains a fourth gene expression cassette, which contains the second promoter described in M2-1, the second exogenous gene insertion site, and a terminator;
[0030] M2-4: Second recombinant gene engineering vector
[0031] The second exogenous gene is inserted into the second exogenous gene insertion site of the second gene engineering vector in M2-3 to form the second recombinant gene engineering vector.
[0032] M2-5: Second genetically engineered cell
[0033] The second genetically engineered cell contains the second genetic engineering vector described in M2-3 or the second recombinant genetic engineering vector described in M2-4;
[0034] M2-6: Second Composition
[0035] The second composition contains the second genetic engineering vector described in M2-3, the second recombinant genetic engineering vector described in M2-4, or the second genetic engineering cell described in M2-5;
[0036] M2-7: Second reagent kit
[0037] The kit contains the second gene engineering vector described in M2-3, the second recombinant gene engineering vector described in M2-4, or the second gene engineering cell described in M2-5;
[0038] The third biological material is any one of the following: M3-1, M3-2, M3-3, M3-4, M3-5, M3-6, and M3-7;
[0039] M3-1: Third promoter
[0040] The sequence of the third promoter is shown in SEQ ID NO.2;
[0041] M3-2: Fifth gene expression cassette
[0042] The promoter of the fifth gene expression cassette is the third promoter described in M3-1;
[0043] M3-3: Third Genetic Engineering Vector
[0044] The third gene engineering vector contains a sixth gene expression cassette, which contains the third promoter described in M3-1, the third exogenous gene insertion site, and a terminator;
[0045] M3-4: Third recombinant gene engineering vector
[0046] The third exogenous gene is inserted into the third exogenous gene insertion site of the third gene engineering vector in M3-3 to form the third recombinant gene engineering vector;
[0047] M3-5: Third genetically engineered cell
[0048] The third genetically engineered cell contains the third genetic engineering vector described in M3-3 or the third recombinant genetic engineering vector described in M3-4;
[0049] M3-6: Third Composition
[0050] The third composition contains the third gene engineering vector described in M3-3, the third recombinant gene engineering vector described in M3-4, or the third gene engineering cell described in M3-5;
[0051] M3-7: Third reagent kit
[0052] The kit contains the third gene engineering vector described in M3-3, the third recombinant gene engineering vector described in M3-4, or the third genetically engineered cell described in M3-5.
[0053] In some implementations, the material is selected from any one of the following: A1, A2, A3, A4, A5, A6, A7, A8, A9, A10, A11, A12, and A13.
[0054] A1: The biomaterials are selected from:
[0055] The first biological material;
[0056] A combination of the first biomaterial and the second biomaterial;
[0057] The combination of the first biomaterial and the third biomaterial;
[0058] A combination of the first biomaterial, the second biomaterial, and the third biomaterial;
[0059] A2: The backbone vector of the first gene engineering vector is the pLG3 vector;
[0060] A3: The backbone vector of the second gene engineering vector is the pLG3 vector;
[0061] A4: The backbone vector of the third gene engineering vector is the pLG3 vector;
[0062] A5: The first exogenous gene is the bovine MSTN gene;
[0063] A6: The second exogenous gene is the bovine MSTN gene;
[0064] A7: The third exogenous gene is the bovine MSTN gene;
[0065] A8: The host of the first genetically engineered cell is a muscle cell;
[0066] A9: The host of the second genetically engineered cell is a muscle cell;
[0067] A10: The host of the third genetically engineered cell is a muscle cell;
[0068] A11: The first exogenous gene also includes a signal peptide, a fluorescent protein tag, and / or the coding sequence of a tag peptide used to isolate and purify the protein;
[0069] A12: The second exogenous gene also includes a signal peptide, a fluorescent protein tag, and / or a tag peptide coding sequence for separating and purifying the protein;
[0070] A13: The third exogenous gene also includes the coding sequence of a signal peptide, a fluorescent protein tag, and / or a tag peptide for separating and purifying the protein.
[0071] In some implementations, the following B1 and / or B2 are selected;
[0072] B1: The amino acid sequence of the protein encoded by the bovine MSTN gene is shown in GenBank No. NP_001001525.1;
[0073] B2: The muscle cells described are muscle satellite cell lines derived from fetal bovine cells.
[0074] The second aspect of the present invention provides the use of the biomaterial described in the first aspect of the present invention in the preparation of a formulation for altering the genetic properties of bovine muscle, wherein the biomaterial is selected from a first genetic engineering vector, a first recombinant genetic engineering vector, a first genetic engineering cell, a first composition, a first reagent kit, a second genetic engineering vector, a second recombinant genetic engineering vector, a second genetic engineering cell, a second composition, a second reagent kit, a third genetic engineering vector, a third recombinant genetic engineering vector, a third genetic engineering cell, a third composition, and a third reagent kit;
[0075] Wherein, the first exogenous gene is the bovine MSTN gene; the second exogenous gene is the bovine MSTN gene; and the third exogenous gene is the bovine MSTN gene.
[0076] A third aspect of this invention provides a method for optimizing the design of the MSTN gene promoter in beef cattle based on deep learning and model adaptation, the method comprising the following steps:
[0077] S1: Promoter signal prediction
[0078] The promoter region sequence of the MSTN gene in beef cattle was obtained, and the promoter region sequence was input into the pre-trained Enformer deep learning model to extract high-dimensional epigenetic signal prediction data.
[0079] S2: Adapter Model Construction and Verification
[0080] Using the RNA-seq expression level of the MSTN gene as a supervision label and the high-dimensional epigenetic signal prediction data as input features, a machine learning adapter model is trained; when the prediction accuracy of the adapter model reaches a preset threshold, it is determined as a validation model.
[0081] S3: Variant Generation and Virtual Filtering
[0082] Based on the analysis of cis-regulatory elements in the promoter region of the MSTN gene, a multi-strategy variant generation method is used to construct a promoter variant sequence library of the MSTN gene; the promoter variant sequence library of the MSTN gene is input into the validation model in S2 for virtual high-throughput screening to predict the gene expression potential of each variant;
[0083] S4: Optimize variant output
[0084] Based on the results of the virtual high-throughput screening, a list of optimized variants of the MSTN gene promoter in beef cattle with enhanced or weakened expression functions is output.
[0085] S5: Experimental Verification
[0086] The promoter optimized variant obtained in step S4 is artificially synthesized, and then inserted upstream of the reporter gene coding sequence to obtain a recombinant vector containing the promoter optimized variant and the reporter gene in the same open reading frame. The recombinant vector is used to transfect host cells, and the promoter activity of the promoter optimized variant is verified by detecting the expression intensity of the reporter gene in the host cells.
[0087] In some implementations, one or more combinations selected from the following C1, C2, C3, C4, C5 and C6 are used;
[0088] C1: In step S1, analyze the ATAC-seq data of beef cattle to obtain the promoter regions in the euchromatin region of beef cattle that fall within ATAC-seq peak regions;
[0089] Analyze the ChIP-seq data of beef cattle to obtain information on the enrichment of beef cattle promoters in euchromatin or heterochromatin;
[0090] Analyze beef cattle RNA-seq data to obtain the correspondence between beef cattle gene expression levels and promoters;
[0091] Based on the aforementioned information, the promoter region sequence of the MSTN gene in beef cattle was obtained;
[0092] C2: In step S2, training a machine learning adapter model includes:
[0093] Feature vectors are obtained by extracting features from the high-dimensional epigenetic signal prediction data;
[0094] The random forest regression model was trained using the feature vector and the corresponding experimental RNA-seq expression levels.
[0095] C3: In step S3, the multi-strategy variant generation method includes at least one of the following strategies:
[0096] Active motif rearrangement strategy, key motif replication strategy, redundant motif deletion strategy, random interval sequence introduction strategy, and Monte Carlo random combination strategy;
[0097] C4: In step S3, the analysis of cis-regulatory elements in the promoter region specifically includes:
[0098] By combining ATAC-seq data and histone modification ChIP-seq data, active regulatory regions were screened within the promoter region, and transcription factor binding motifs within them were identified.
[0099] C5: In step S4, the list of output optimization variants includes:
[0100] Based on a comprehensive advantage evaluation, a list of Top-N candidate variants of the target gene is output; wherein the list includes at least the sequence information of the variant, the design strategy used, and its predicted expression score;
[0101] C6: In step S5, the host cell is a muscle cell.
[0102] In some implementations, a combination of one or more of the following D1, D2, D3, D4 and D5 is selected;
[0103] D1: In step S1, the high-dimensional epigenetic signal prediction data includes histone modifications, transcription factor binding regions, and chromatin accessibility in the MSTN gene region;
[0104] D2: In step S2, the feature extraction from the high-dimensional epigenetic signal prediction data specifically involves: performing mean pooling on the signals of multiple trajectories output by the Enformer model on the promoter region, aggregating the signals of each trajectory into a scalar value, which together constitute the feature vector;
[0105] D3: In step S3, the multi-strategy variant generation method specifically includes the operation definition and implementation method of at least one of the following strategies:
[0106] Active motif rearrangement strategy: Based on the active regulatory regions and motif identification results, multiple identified motifs are extracted from their original sequences, and their arrangement order is shuffled using a random rearrangement algorithm to generate a series of promoter variant sequences with new spatial arrangement order;
[0107] Key motif replication strategy: Based on the assessment of ordering or regulatory importance, at least one highly active core motif is selected from all identified motifs. Through sequence replication and insertion operations, a repeat of the core motif is created at or near the original position to enhance its regulatory signal strength.
[0108] Redundant motif deletion strategy: Based on sequence homology analysis and predicted binding affinity, identify and delete at least one motif in the promoter region that is determined to be functionally redundant or has the lowest predicted activity, in order to simplify the regulatory logic and remove potential inhibitory effects.
[0109] Random spacer sequence introduction strategy: Insert a DNA spacer sequence of 5 to 20 base pairs in length, generated by a random algorithm, between key functional motifs to systematically change the spatial distance and three-dimensional conformation between motifs;
[0110] Monte Carlo random combination strategy: Based on the aforementioned strategy, the subset of motifs to be operated on, the operation sites, and the interval length are randomly sampled and combined to generate a large-scale promoter variant sequence library with sequence diversity through iteration.
[0111] D4: In step S4, the comprehensive advantage evaluation step is as follows: based on the predicted expression score of the virtual high-throughput screening, all variants in the promoter variant sequence library are sorted; further, combining sequence complexity evaluation and the diversity of variant generation strategies, the Top-N variants from the high-score segment are selected to form the list.
[0112] D5: In step S5, a control recombinant vector is prepared, which contains a control reporter gene. The host cell is then co-transfected with the recombinant vector and the control recombinant vector.
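The mean pooling described in D2 can be illustrated with a minimal sketch. It assumes the model output is available as a plain list of rows (one row per output bin, one value per trajectory) and that each bin covers 128 bp; the function names, the bin size constant, and the toy numbers are illustrative assumptions, not values taken from this application.

```python
from statistics import fmean

BIN_SIZE = 128  # assumed number of base pairs covered by one output bin


def promoter_bins(window_start, prom_start, prom_end, bin_size=BIN_SIZE):
    """Indices of output bins overlapping the half-open interval [prom_start, prom_end)."""
    first = (prom_start - window_start) // bin_size
    last = (prom_end - 1 - window_start) // bin_size
    return range(first, last + 1)


def mean_pool_tracks(predictions, bins):
    """Collapse each trajectory to one scalar: its mean over the promoter bins.

    predictions: list of rows (one per bin); each row holds one value per trajectory.
    Returns the feature vector, one scalar per trajectory.
    """
    n_tracks = len(predictions[0])
    return [fmean(predictions[b][t] for b in bins) for t in range(n_tracks)]


# Toy output: 6 bins x 3 trajectories (made-up numbers, for illustration only).
predictions = [
    [0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0],
    [2.0, 10.0, 0.5],
    [4.0, 20.0, 1.5],
    [9.0, 9.0, 9.0],
    [9.0, 9.0, 9.0],
]
bins = promoter_bins(window_start=0, prom_start=256, prom_end=512)
features = mean_pool_tracks(predictions, bins)
print(features)
```

With real model output the feature vector would have one entry per predicted trajectory, and it would be this vector that is passed to the adapter model in step S2.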
[0113] In some implementations, a combination of one or more of the following E1, E2, E3, E4 and E5 is selected;
[0114] E1: In step S1, the histone modification includes at least one of H3K27ac, H3K4me1, and H3K4me3; the transcription factor binding region information is derived from ChIP-seq data integrated from beef cattle muscle tissue; the chromatin accessibility information is derived from the analysis of ATAC-seq data from beef cattle muscle tissue.
[0115] E2: In step S2, after training the random forest regression model, the following is also included:
[0116] Feature importance analysis is performed to identify the feature subset that contributes the most to the predicted gene expression level from the feature vector. This feature subset is defined as tissue-specific regulatory features.
[0117] E3: In step S3, the active motif rearrangement strategy uses the random.shuffle algorithm to randomly rearrange the motifs;
[0118] In the key motif replication strategy, by performing the [top_motif]*2 + motifs[1:] operation on the original motif sequence list, the core motif with the highest predicted activity is replicated and placed at the top of the list;
[0119] In the redundant motif removal strategy, the motif with the lowest predicted activity is removed by performing a motifs[:-1] slicing operation on the original motif sequence list.
[0120] In the random interval sequence introduction strategy, a random DNA sequence of a specific length is generated as an interval by calling the random_gap_seq() function;
[0121] In the Monte Carlo random combination strategy, the random.sample(motifs, num_motifs) operation is used to randomly sample a specified number of motifs from the original motif set to form a subset;
[0122] E4: In step S4, the steps of the structured determination system are as follows:
[0123] Based on the standard deviation of the predicted expression score of each variant relative to the predicted activity of the wild-type promoter and the score distribution of all variants, variants are classified into high-potential, medium-activity, and low-activity / inactive variants; among them, high-potential variants or low-activity variants are selected into the Top-N candidate variant list according to the screening criteria and output.
[0124] E5: In step S5, the reporter gene is the first luciferase gene, and the control reporter gene is the second luciferase gene.
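The list operations named in E3 can be sketched together in a few lines of Python. This is only an illustration under stated assumptions: the motif strings are placeholders (not the motifs identified for MSTN), the motif list is assumed to be sorted from highest to lowest predicted activity, random_gap_seq() is written here as a hypothetical helper matching the 5 to 20 bp range in D3, and variants are formed by simply concatenating motifs rather than by editing the full promoter sequence.

```python
import random

BASES = "ACGT"
# Placeholder motifs, assumed sorted from highest to lowest predicted activity.
MOTIFS = ["CAGCTG", "TATAAA", "GGGCGG", "CCAAT"]


def random_gap_seq(rng, min_len=5, max_len=20):
    """Random DNA spacer of 5 to 20 bp."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(BASES) for _ in range(length))


def rearrange(motifs, rng):
    """Active motif rearrangement: shuffle a copy of the motif order."""
    shuffled = list(motifs)
    rng.shuffle(shuffled)
    return shuffled


def replicate_top(motifs):
    """Key motif replication: [top_motif]*2 + motifs[1:]."""
    top_motif = motifs[0]
    return [top_motif] * 2 + motifs[1:]


def delete_weakest(motifs):
    """Redundant motif deletion: drop the last (lowest-activity) motif."""
    return motifs[:-1]


def insert_spacers(motifs, rng):
    """Random spacer introduction: one random spacer between adjacent motifs."""
    parts = [motifs[0]]
    for motif in motifs[1:]:
        parts += [random_gap_seq(rng), motif]
    return parts


def build_library(motifs, n_iter=1000, seed=0):
    """Return (strategy, sequence) pairs: four single-strategy variants plus
    n_iter Monte Carlo combinations of a random motif subset and random spacers."""
    rng = random.Random(seed)
    library = [
        ("rearrangement", "".join(rearrange(motifs, rng))),
        ("replication", "".join(replicate_top(motifs))),
        ("deletion", "".join(delete_weakest(motifs))),
        ("spacer", "".join(insert_spacers(motifs, rng))),
    ]
    for _ in range(n_iter):
        num_motifs = rng.randint(2, len(motifs))
        subset = rng.sample(motifs, num_motifs)
        library.append(("monte_carlo", "".join(insert_spacers(subset, rng))))
    return library


library = build_library(MOTIFS)
print(len(library), library[1])
```

Keeping the strategy name next to each sequence makes it straightforward to report the design strategy for every candidate, as the Top-N list in C5 requires.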
[0125] In some implementations, a combination of one or more of the following F1, F2, F3, F4 and F5 is selected;
[0126] F1: In step S1, the promoter region sequence is a DNA sequence extending upstream of the transcription start site of the MSTN gene by no less than 2000 base pairs; before being input into the Enformer model, the sequence is subjected to one-hot encoding and length normalization.
[0127] F2: In step S2, the feature importance analysis specifically involves: calculating the mean reduction in squared error caused by each feature in the random forest regression model when splitting at the decision tree nodes; defining the features ranking in the top 10% by this mean reduction as the tissue-specific regulatory features;
[0128] F3: In step S3, the Monte Carlo random combination strategy is executed at least 1000 times to ensure that the size of the variant sequence library is sufficient to cover the meaningful sequence space;
[0129] F4: In step S4, when the structured judgment system is specifically implemented, the judgment criterion for high-potential variants is further defined as: simultaneously satisfying Score_variant ≥ Score_WT + 1 SD and the predicted expression score being in the top 20% of all variants;
[0130] F5: In step S5, the first luciferase gene is the firefly luciferase gene, the second luciferase gene is the Renilla luciferase gene, the first luciferase substrate and the second luciferase substrate are added to the host cell, and the promoter strength of the promoter optimized variant is determined by detecting the luminescence intensity of the first luciferase substrate and the second luciferase substrate.
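The high-potential criterion in F4 combines two conditions, which the following sketch applies to a dictionary of predicted scores. It is a minimal illustration: the text does not say whether SD is a population or sample standard deviation (the population form is assumed here), and the variant names, scores, and wild-type score are made up.

```python
from statistics import pstdev


def high_potential_variants(scores, wt_score, top_fraction=0.20):
    """Return variant IDs meeting both F4 conditions, best first.

    scores: dict mapping variant ID -> predicted expression score.
    Condition 1: score >= wt_score + 1 SD (SD taken over all variant scores).
    Condition 2: score ranks within the top `top_fraction` of all variants.
    """
    sd = pstdev(list(scores.values()))  # assumption: population SD
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_top = max(1, int(len(ranked) * top_fraction))
    top_ids = set(ranked[:n_top])
    return [v for v in ranked if v in top_ids and scores[v] >= wt_score + sd]


# Made-up scores for ten hypothetical variants and a hypothetical wild-type score.
scores = {f"v{i}": float(i) for i in range(1, 11)}
selected = high_potential_variants(scores, wt_score=5.0)
print(selected)
```

In this toy case v8 clears the wild-type + 1 SD threshold but is excluded because it falls outside the top 20%, which shows why both conditions are checked.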
[0131] This application discloses a method for optimizing the promoter function of the MSTN gene in beef cattle based on deep learning and model adaptation, belonging to the fields of bioinformatics and genetic engineering. The method first obtains the promoter sequence of the MSTN gene in beef cattle and extracts its high-dimensional epigenetic prediction signal using a pre-trained Enformer model. An adapter model is trained using the RNA-seq expression level of MSTN as a label to obtain a validation model capable of accurately assessing sequence regulatory capacity. Combined with feature importance analysis, the regulatory features most relevant to MSTN expression are screened from the Enformer signal, and on this basis a multi-strategy variant generation method is used to functionally modify the promoter. After virtual high-throughput screening using the validation model, promoter variants with the potential to enhance or weaken expression are obtained. Furthermore, by artificially synthesizing candidate variants and constructing dual-luciferase reporter vectors, their transcriptional activity is verified at the cellular level. The results show that some optimized variants significantly improve gene expression levels. This application can efficiently obtain promoter variants that can precisely regulate MSTN gene expression, improving the efficiency and reliability of regulatory element optimization in beef cattle molecular breeding.
[0132] This invention relates to the fields of bioinformatics and computer-aided biological design, specifically to a method for optimizing the function of the MSTN gene promoter in beef cattle based on deep learning and model adaptation. The method integrates multi-omics data, constructs species- and tissue-specific adaptation models for beef cattle, and combines multi-strategy sequence variant generation and virtual high-throughput screening. Preliminary verification and prediction have shown that this method can achieve precise functional regulation of the MSTN gene promoter, providing an intelligent design tool for molecular breeding of key economic traits in beef cattle.
[0133] According to the specific embodiments provided in this application, this application has the following technical effects:
[0134] This application provides a method for optimizing the design of MSTN gene promoter function in beef cattle based on deep learning and model adaptation. By introducing species- and tissue-specific adapter models, the cross-species prediction signals of general deep learning models are precisely corrected, constructing a high-precision mapping relationship from DNA sequence to gene expression function. Based on this, combined with multi-strategy variant generation and virtual high-throughput screening, promoter variants that can precisely regulate MSTN gene expression can be systematically and efficiently designed and screened. This method overcomes the shortcomings of existing technologies that rely on trial and error and are inefficient, providing a powerful intelligent design tool and directly usable candidate sequences for beef cattle molecular breeding, significantly improving breeding efficiency and accuracy.
Attached Figure Description
[0135] Figure 1 is a flowchart illustrating a method for optimizing the function of the MSTN gene promoter in beef cattle based on deep learning and model adaptation, as provided in one embodiment of this application.
[0136] Figure 2 is a schematic diagram of the architecture of a model adapter provided in one embodiment of this application.
[0137] Figure 3 is a schematic diagram of a multi-strategy promoter variant generation method provided in an embodiment of this application.
[0138] Figure 4 is a schematic diagram of the dual-luciferase verification results.
Detailed Implementation
[0139] To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention will be described in further detail below with reference to the accompanying drawings.
[0140] Example 1: Functional Optimization Design Method for Promoter Region of MSTN Gene in Beef Cattle Based on Deep Learning Enformer Model
[0141] A flowchart of the method for optimizing the function of the MSTN gene promoter in beef cattle based on deep learning and model adaptation, as described in this invention, is shown in Figure 1. The detailed steps are as follows.
[0142] I. Promoter Signal Prediction
[0143] (I) Acquisition and Preprocessing of Multi-omics Data
[0144] Data Acquisition: The beef cattle (Hereford cattle) reference genome ARS-UCD1.2 and its annotation file (GTF format) were obtained from the public database (NCBI). Multi-omics functional genomic data of beef cattle muscle tissue were downloaded and integrated, including: ATAC-seq data for identifying open chromatin regions (i.e., euchromatin) (GSM4799630, GSM4799629); ChIP-seq data for analyzing the histone modification status of the genome (GSE253395); and RNA-seq data for obtaining gene expression levels (GSM4799579, GSM4799578).
[0145] Data preprocessing: (1) In ATAC-seq data processing, the peak files of the two ATAC-seq samples (GSM4799630, GSM4799629) were sorted by chromosome and genome coordinates (sort -k1,1 -k2,2n). Then, BEDTools merge was used to integrate the peak information of the two replicate samples to generate a unified, non-redundant open chromatin region file, Merge_ATAC_Muscle_Peaks.bed. The overlap between each promoter region and the ATAC-seq peaks was determined with the custom overlap analysis function overlap(start1,end1,start2,end2): regions with chromatin accessibility signals were marked, functional feature labels were established, and the intersection of the two intervals was found (i.e., the promoter regions in euchromatin that fall within ATAC-seq peak regions).
[0146] (2) In ChIP-seq data processing, the bigWig signal trajectory of each sample is first normalized with the custom function normalize_bigwig(). This function performs maximum-value normalization based on the sample's own maximum coverage intensity and then applies a log2(signal+1) transformation to reduce the influence of different sequencing depths and background noise on the signal amplitude, generating a standardized bigWig file with a uniform scale. For all positions within a promoter interval, any returned NaN value is replaced with zero, and the average coverage within the interval is calculated to quantify the histone modification enrichment level of the promoter. For each histone modification (H3K27ac, H3K4me1, H3K27me3), signal values were extracted from four biological replicates (the signal is the ChIP-seq read enrichment of the region, and its intensity reflects the epigenetic activity of the region). The average of the replicates was then calculated to construct a complete and robust promoter epigenetic profile (i.e., quantifying the promoter enrichment in euchromatin or heterochromatin), which was used for subsequent sequence modeling and functional prediction analysis.
[0147] (3) In RNA-seq data processing, the expression matrices of the beef cattle muscle tissue samples (GSM4799578, GSM4799579) were first merged with pd.concat() into a unified matrix, Merged_RNASeq_Muscle.txt. Each sample matrix is a gene × sample expression matrix: rows correspond to genes (identified by Ensembl ID), columns correspond to samples or replicates, and each element is the expression level of a gene in a sample (raw counts or a normalized expression value). describe() was then used to examine summary statistics such as the mean, maximum, and quantiles, and so judge whether the data are raw counts or already normalized: a very large mean and maximum (tens to tens of thousands) suggest unnormalized raw counts, whereas values concentrated in a small range (e.g., 0–100 or 0–1) suggest TPM / FPKM / RPKM or log2-normalized expression levels. Finally, map() was used to assign the normalized expression value of each gene to its corresponding promoter region, quantifying the promoter's potential regulatory activity. If a gene has multiple promoters, each promoter is assigned the transcriptional level of that gene. This establishes a functional feature matrix at the promoter level, completes data integration, and yields the correspondence between gene expression levels and promoters.
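Two of the custom helpers named in the preprocessing steps above, overlap() and the signal normalization performed by normalize_bigwig(), can be sketched as follows. The text gives only their names and purpose, so the bodies here are assumptions: overlap() is taken to return the intersection length of two half-open intervals, and the normalization is shown on a plain list of per-base signal values (named normalize_signal here) rather than on a bigWig file, with file reading and writing omitted.

```python
import math


def overlap(start1, end1, start2, end2):
    """Length in bp of the intersection of [start1, end1) and [start2, end2); 0 if disjoint."""
    return max(0, min(end1, end2) - max(start1, start2))


def normalize_signal(values):
    """Replace NaN with 0, scale by the sample maximum, then apply log2(x + 1)."""
    clean = [0.0 if math.isnan(v) else v for v in values]
    peak = max(clean, default=0.0)
    if peak <= 0:
        return [0.0] * len(clean)
    return [math.log2(v / peak + 1.0) for v in clean]


def mean_coverage(values):
    """Average signal over an interval, used as the promoter enrichment level."""
    return sum(values) / len(values) if values else 0.0


# Hypothetical promoter interval against a hypothetical ATAC-seq peak.
shared_bp = overlap(100, 2100, 1500, 1800)
in_open_chromatin = shared_bp > 0

# Made-up per-base coverage containing one missing value.
normalized = normalize_signal([0.0, 2.0, float("nan"), 4.0])
print(shared_bp, in_open_chromatin, mean_coverage(normalized))
```

After max scaling, the log2(x + 1) transform maps every value into the range 0 to 1, so enrichment levels from samples of different sequencing depth become comparable.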
[0148] (II) Extraction of MSTN gene promoter region
[0149] First, based on the genome annotation GTF file (beef cattle reference genome ARS-UCD1.2 and its annotation file), the MSTN gene entries were selected using the grep command, and the corresponding chromosomal location and transcription start site (TSS) information were extracted. The GTF file was then processed with an AWK script to extract the MSTN promoter region in reference genome coordinates: the exact TSS position was determined according to the gene strand direction, and the interval was extended 2000 bp upstream of the TSS to construct a 2000 bp promoter analysis interval. Boundary correction was performed to ensure that the coordinates did not exceed the chromosomal range. Finally, the promoter region was output in standard BED format as the file MSTN_promoter_2kb.bed, containing the chromosome, start and end positions, and gene name, providing an accurate genomic coordinate basis for subsequent multi-omics data analysis.
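The strand-aware interval arithmetic of [0149] can be illustrated in Python rather than AWK. This sketch assumes 1-based inclusive GTF coordinates as input and 0-based half-open BED coordinates as output, and it excludes the TSS base itself from the promoter; the function name promoter_bed and all coordinates in the usage lines are hypothetical.

```python
def promoter_bed(chrom, start, end, strand, name, chrom_len, upstream=2000):
    """Return a BED record for the region upstream of the TSS.

    start/end are the 1-based inclusive gene coordinates from the GTF file.
    """
    if strand == "+":
        tss = start                      # TSS is the gene start on the plus strand
        bed_start = tss - 1 - upstream   # upstream lies at smaller coordinates
        bed_end = tss - 1
    elif strand == "-":
        tss = end                        # TSS is the gene end on the minus strand
        bed_start = tss                  # upstream lies at larger coordinates
        bed_end = tss + upstream
    else:
        raise ValueError(f"unknown strand: {strand!r}")
    # Boundary correction: clamp the interval to the chromosome range.
    bed_start = max(0, bed_start)
    bed_end = min(chrom_len, bed_end)
    return (chrom, bed_start, bed_end, name, ".", strand)

# Made-up coordinates, not the real MSTN locus.
plus_record = promoter_bed("chrN", 5001, 9000, "+", "MSTN", 100_000)
minus_record = promoter_bed("chrN", 1000, 5000, "-", "MSTN", 100_000)
```

Joining a record with tabs gives one line of a BED file such as MSTN_promoter_2kb.bed.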
[0150] (III) Signal Prediction Using the Enformer Model
[0151] A schematic diagram of the model adapter architecture is shown in Figure 2. The specific steps are as follows.
[0152] First, the DNA sequence of the MSTN gene promoter region was extracted from the reference genome (beef cattle reference genome ARS-UCD1.2) and converted by one-hot encoding into a tensor with four channels, corresponding to the bases A, T, G, and C. Then, promoter sequences of varying lengths were standardized to the 196,608 base pairs required by the Enformer model through center alignment and end padding. The standardized sequence tensor was then input into a pre-trained Enformer deep neural network model to obtain high-dimensional epigenetic signal predictions. This model, pre-trained on large-scale epigenetic data from the human genome, learns the relationship between DNA sequence and epigenetic signals and can predict various regulatory signals from DNA sequence, including histone modifications, transcription factor binding, and chromatin accessibility. During pre-training, known sequences and their corresponding epigenetic signal matrices are provided as input and the network parameters are optimized, enabling the model to capture the relationship between cis-regulatory elements and regulatory signals.
[0153] Finally, cross-species comparable prediction patterns are preserved, forming a three-dimensional prediction data matrix with dimensions of 80 samples × 896 positions × 5313 epigenetic features. The features mainly include: ATAC-seq accessibility signal; H3K4me3 modification, marking active promoters; H3K27ac modification, marking enhancers or active regulatory regions; H3K4me1 modification, an auxiliary indicator of enhancer / promoter state; and transcription factor binding site (TFBS) prediction signals, representing the probability or activity of key transcription factor binding. The model thus converts the one-hot encoded input matrix (sequence length × 4 bases) into a three-dimensional prediction matrix (sample × sequence position × feature) that no longer has a base dimension and provides the feature basis for subsequent adapter model training.
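The one-hot encoding and length standardization described in [0152] can be sketched with NumPy as follows. Several details are assumptions: padding uses all-zero rows, unknown bases such as N are left all-zero, and the channel order simply follows the order in which the text lists the bases. The order actually expected by a given Enformer checkpoint should be confirmed against that model's documentation. The helper names one_hot and center_pad are hypothetical.

```python
import numpy as np

ENFORMER_LEN = 196_608          # input length stated in the text
BASES = "ATGC"                  # channel order as listed in the text (assumption)

def one_hot(seq):
    """Encode a DNA string as a (length, 4) array; unknown bases stay all-zero."""
    index = {b: i for i, b in enumerate(BASES)}
    out = np.zeros((len(seq), 4), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        if base in index:
            out[pos, index[base]] = 1.0
    return out

def center_pad(encoded, target_len=ENFORMER_LEN):
    """Center the sequence in a zero-padded window, cropping symmetrically if too long."""
    n = encoded.shape[0]
    if n >= target_len:
        start = (n - target_len) // 2
        return encoded[start:start + target_len]
    left = (target_len - n) // 2
    padded = np.zeros((target_len, encoded.shape[1]), dtype=encoded.dtype)
    padded[left:left + n] = encoded
    return padded

x = center_pad(one_hot("ACGTN" * 400))   # a 2000 bp toy sequence
```

The resulting array has shape (196608, 4); a batch dimension would be added before passing it to the model.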
[0154] II. Adapter Model Construction and Validation
[0155] (a) Feature Engineering and Data Preparation
[0156] First, the high-dimensional epigenetic signals predicted by the Enformer model were reduced in dimensionality. Specifically, mean pooling was performed over the 896 positions of each promoter region for each of the 5313 features, compressing the original three-dimensional tensor (sample × sequence position × feature) into a two-dimensional feature matrix of 80 samples × 5313 features. Each row is the feature vector of one promoter sample, and each column is the average activity level of one epigenetic feature across the samples. Simultaneously, a mapping between gene names and experimental expression levels was established from the repaired RNA-seq label file. Repair targeted missing (NaN) and infinite (Inf) values: the RNA expression value corresponding to each promoter was checked, and invalid entries were either removed or replaced by a recalculated mean, so that every promoter sample had a valid expression value and the repaired CSV file mapped promoters directly to RNA expression values with complete training labels. Subsequently, the feature matrix and label data were cleaned. Cleaning targeted the RNA label values (removing NaN or Inf) and the feature matrix (removing the rows of samples whose RNA values were invalid). The criterion strictly excludes invalid data rather than merely limiting the numerical range, ensuring that each input sample has complete features and labels. In the final standardized feature matrix, each value represents the average epigenetic activity of the corresponding promoter region; the matrix can be used directly to train the machine learning adapter model, providing a reliable input data foundation for subsequent prediction of bovine-specific promoter regulatory functions.
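The mean pooling and cleaning of [0156] amount to an average over the position axis followed by a validity mask. The sketch below uses a small random array as a stand-in for the 80 × 896 × 5313 prediction tensor; the function name pool_and_clean is hypothetical, and dropping (rather than re-imputing) invalid samples is one of the two repair options the text mentions.

```python
import numpy as np

def pool_and_clean(pred, labels):
    """Mean-pool over positions, then drop samples with invalid labels or features.

    pred:   (samples, positions, features) array of predicted signals
    labels: (samples,) array of RNA expression values
    """
    features = pred.mean(axis=1)                      # -> (samples, features)
    valid = np.isfinite(labels) & np.isfinite(features).all(axis=1)
    return features[valid], labels[valid]

rng = np.random.default_rng(0)
pred = rng.random((6, 896, 16))                       # toy stand-in tensor
labels = np.array([10.0, np.nan, 3.5, np.inf, 7.0, 1.0])
X, y = pool_and_clean(pred, labels)
```

Here the samples with NaN and Inf labels are removed, leaving four rows whose features and labels stay aligned.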
[0157] (ii) Training the machine learning adapter
[0158] A random forest regression algorithm was used to construct the prediction model. The feature matrix of 80 samples × 5313 epigenetic features was used as input, and the experimentally measured RNA-seq expression levels were used as supervision labels. The dataset was divided into training and test sets in an 8:2 ratio, with the training set used for model fitting and the test set used to evaluate generalization performance. Five-fold cross-validation was further employed during training: the training data were divided into five subsets, and in turn four subsets were used for training and one for validation. The R² value and mean squared error (MSE) of each fold were calculated and averaged to evaluate the model's stability and reliability. The trained bovine-specific random forest adapter model can capture the nonlinear relationship between input features and gene expression levels. The input features are essentially the comprehensive epigenetic state of each promoter region, reflecting the relative contributions of activation and repression marks, and the model can be used for downstream promoter optimization design or variant prioritization. The model performance criteria are: R² > 0.3 is considered suitable for initial design; R² of 0.1–0.3 indicates weaker performance that is still usable for feature ranking; R² < 0.1 suggests the need to optimize feature selection or increase the number of training samples.
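The training procedure of [0158] can be sketched with scikit-learn as follows. The data are synthetic (a 50-feature stand-in for the 5313 features), and the hyperparameters, including n_estimators=100 and the random seeds, are assumptions, since the text does not give them.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate, train_test_split

rng = np.random.default_rng(0)
X = rng.random((80, 50))                       # toy stand-in for 80 x 5313 features
y = 100 * X[:, 0] + rng.normal(0, 1, size=80)  # synthetic labels, not real data

# 8:2 split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0)

# Five-fold cross-validation on the training set, scoring R^2 and MSE per fold.
cv = cross_validate(model, X_train, y_train, cv=5,
                    scoring=("r2", "neg_mean_squared_error"))
fold_r2 = cv["test_r2"]
fold_mse = -cv["test_neg_mean_squared_error"]   # scikit-learn reports negated MSE

# Final fit on the full training set, then evaluation on the held-out 20%.
model.fit(X_train, y_train)
test_r2 = model.score(X_test, y_test)
```

With 64 training samples each validation fold holds only about 13 samples, so per-fold R² values can vary widely even when the held-out test score is high.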
[0159] (III) Model Performance Validation
[0160] The R² score and root mean square error (RMSE) were used as the core metrics to evaluate the trained random forest adapter model. The model performed well on the test set, with a test-set R² of 0.9788, a training-set R² of 0.9920, and a test-set RMSE of 174.14. These values are well above the pre-defined performance threshold of R² > 0.3. Although the five-fold cross-validation results fluctuated considerably (R² scores ranging from -3.0 to 1.0), the model was deemed valid on the basis of its strong performance on the test set and was serialized to the file bovine_expression_adapter_model_fixed.pkl, providing a validated predictive tool for subsequent virtual high-throughput screening of promoter variants.
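For reference, the two metrics in [0160] follow the standard definitions R² = 1 − SS_res / SS_tot and RMSE = sqrt(mean squared error). The short sketch below, using made-up numbers, shows why R² can be negative: whenever predictions are worse than simply predicting the mean of the labels, SS_res exceeds SS_tot. This is one way fold scores as low as -3.0 can arise on small validation folds.

```python
import math

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (labels must not all be equal)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean square error, in the same units as the labels."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

y_true = [1.0, 2.0, 3.0, 4.0]
good = r2_score(y_true, [1.0, 2.0, 3.0, 4.0])   # perfect predictions
bad = r2_score(y_true, [4.0, 3.0, 2.0, 1.0])    # worse than predicting the mean
error = rmse(y_true, [4.0, 3.0, 2.0, 1.0])
```

Because RMSE carries the units of the expression labels, a value such as 174.14 is only interpretable relative to the scale of those labels.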
[0161] III. Variant Generation and Virtual Filtering
[0162] (I) Analysis of Cis-Regulatory Elements
[0163] Based on the MSTN gene promoter region (2000bp upstream to 0bp upstream of TSS) in the bovine genome, a systematic motif scanning and functional activity assessment workflow was constructed. First, a comprehensive transcription factor binding site identification was performed on the 2000bp promoter sequence. For each candidate site, ATAC-seq openness and ChIP-seq epigenetic characteristics were integrated, and a comprehensive activity score (activity_score_final) was generated after multiple rounds of normalization and weighted calculation. Using the 75th percentile of all motif scores within the promoter as the high-activity threshold, 119 highly active motif instances were finally obtained, belonging to 10 different motif types (motif_MSTN_0 to motif_MSTN_9). Since the same motif type can appear multiple times in the sequence, duplicate instances at different positions were also preserved. Among these elements, motif_MSTN_6 (0.120), motif_MSTN_8 (0.099), and motif_MSTN_2 (0.085) showed significantly enhanced transcriptional regulatory potential. Most highly active motifs were located in highly open regions of the ATAC-seq dataset and were accompanied by enrichment of activating epigenetic markers, exhibiting an upward trend in activity. This suggests that they may be key functional cis-elements regulating MSTN expression, providing reliable biological support for the precise optimization design of subsequent promoter variants.
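The 75th-percentile filter in [0163] can be sketched with the standard library. The scores below are made up, the (motif type, position) keys are a hypothetical way of keeping repeated instances of the same motif type apart, and treating the threshold as inclusive (score ≥ Q3) is an assumption.

```python
import statistics

def high_activity(scores):
    """Keep motif instances whose score is at or above the 75th percentile."""
    q3 = statistics.quantiles(scores.values(), n=4, method="inclusive")[2]
    selected = {k: v for k, v in scores.items() if v >= q3}
    return selected, q3

# Keys are (motif type, position) so repeated instances are kept separately.
scores = {
    ("motif_MSTN_0", 150): 0.01,
    ("motif_MSTN_2", 420): 0.02,
    ("motif_MSTN_2", 910): 0.05,
    ("motif_MSTN_8", 1300): 0.085,
    ("motif_MSTN_6", 1750): 0.12,
}
selected, q3 = high_activity(scores)
```

The percentile is computed over all motif instances in the promoter, so the number of instances retained depends on how the scores are distributed and on how ties at the threshold are handled.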
[0164] (ii) Generation of multi-strategy variants
[0165] A schematic diagram of the multi-strategy promoter variant generation method is shown in Figure 3. The specific steps are as follows.
[0166] Based on the established highly active motif names, instead of using real DNA sequences, a standardized motif placeholder [motif_MSTN_x] was used as the basic module for variant generation. Five systematic engineering strategies were employed to construct a promoter variant library, and all generated variant sequences were systematically stored in the MSTN_promoter_variants.csv file. Specifically, this is achieved through the following technical solutions: First, based on the variants_per_strategy parameterized configuration system, the number of variants generated for each strategy is set (10 active sequential variants, 5 motif replication variants, 5 motif deletion variants, 10 random interval variants, and 10 Monte Carlo variants); Second, using the generate_variants_for_promoter core algorithm function, multi-strategy parallel variant generation is performed for each promoter region: In the active sequential variants, the random.shuffle(shuffled) algorithm is used to shuffle the motif arrangement order; in the motif replication variants, the specific repetition of high-activity elements is achieved through the [top_motif]*2+motifs[1:] operation; in the motif deletion variants, the motifs[:-1] slicing technique is used to remove low-activity elements; in the random interval variants, the random_gap_seq() function is combined to generate a 5-20bp random interval sequence; in the Monte Carlo variants, the random.sample(motifs,num_motifs) random sampling algorithm is used to construct diverse motif subsets. Finally, the variant information of all generated variants, including gene name, promoter index, variant ID, design strategy, variant motif order and variant sequence, is systematically saved into the variant sequence library by the variants_df.to_csv(output_csv_file,index=False) method. This provides a standardized and structured design space and data foundation for subsequent virtual high-throughput screening.
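The five strategies in [0166] can be sketched on placeholder motifs as follows. This is an illustration, not the generate_variants_for_promoter function itself: it produces one variant per strategy, assumes the motif list is ordered from most to least active (so that motifs[0] is the top motif and motifs[-1] the least active), and uses strategy labels and a minimum subset size of two that are not given in the text.

```python
import random

def random_gap_seq(rng, min_len=5, max_len=20):
    """Random spacer of 5-20 bp, the range given in the text."""
    return "".join(rng.choice("ACGT") for _ in range(rng.randint(min_len, max_len)))

def tag(motif):
    """Wrap a motif name as a placeholder token."""
    return f"[{motif}]"

def generate_variants(motifs, rng):
    """Return one variant per strategy for an activity-ordered motif list."""
    shuffled = motifs[:]
    rng.shuffle(shuffled)                               # shuffle the motif order
    duplicated = [motifs[0]] * 2 + motifs[1:]           # repeat the top motif
    deleted = motifs[:-1]                               # drop the least active motif
    spaced = []
    for m in motifs:                                    # random spacers between motifs
        spaced += [tag(m), random_gap_seq(rng)]
    subset = rng.sample(motifs, rng.randint(2, len(motifs)))   # Monte Carlo subset
    return {
        "order_shuffle": "".join(map(tag, shuffled)),
        "motif_duplication": "".join(map(tag, duplicated)),
        "motif_deletion": "".join(map(tag, deleted)),
        "random_spacing": "".join(spaced[:-1]),         # no spacer after the last motif
        "monte_carlo": "".join(map(tag, subset)),
    }

motifs = ["motif_MSTN_6", "motif_MSTN_8", "motif_MSTN_2", "motif_MSTN_0"]
variants = generate_variants(motifs, random.Random(42))
```

Passing an explicit random.Random instance keeps the generated library reproducible, which matters if the variant CSV has to be regenerated later.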
[0167] (III) Virtual High-Throughput Filtering
[0168] Based on the previously constructed promoter variant sequence library, a validated bovine-specific random forest adapter model (i.e., step two) is used for systematic functional evaluation. The specific technical solution is as follows: First, load the sequence library containing all designed promoter variants through the pd.read_csv(variant_csv_file) method; then load the serialized bovine-specific random forest adapter model using joblib.load(adapter_model_file); convert the variant sequences into Enformer-compatible 5313-dimensional feature vectors through the get_enformer_features(sequence_list) feature extraction function; use the adapter_model.predict(X_features) prediction interface to predict the expression activity of each variant, generate prediction scores and store them in the predicted_score field; finally, use the sort_values('predicted_score',ascending=False) sorting algorithm to sort all variants in descending order according to their expression potential. The complete results of the entire screening process, including key information such as variant sequences, design strategies, and prediction scores, are systematically saved to the MSTN_promoter_variants_predicted.csv file through the to_csv(predicted_csv_file,index=False) method. This structured data provides a reliable quantitative basis for subsequent screening of candidate top-type variants.
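The scoring and ranking flow of [0168] can be sketched without the real model files. In the sketch below, StubAdapter and stub_features are hypothetical stand-ins for the serialized adapter model and for get_enformer_features(); they produce meaningless scores and exist only to show how the pieces connect. File loading and saving are omitted.

```python
import pandas as pd

class StubAdapter:
    """Hypothetical stand-in for the serialized random forest adapter."""
    def predict(self, features):
        return [sum(row) for row in features]

def stub_features(sequences):
    """Hypothetical stand-in for get_enformer_features(): base counts per sequence."""
    return [[seq.count(b) for b in "ACGT"] for seq in sequences]

def screen(variants_df, model, featurize):
    """Score every variant and return the table sorted by descending score."""
    scored = variants_df.copy()
    features = featurize(scored["variant_sequence"].tolist())
    scored["predicted_score"] = model.predict(features)
    return scored.sort_values("predicted_score", ascending=False).reset_index(drop=True)

variants_df = pd.DataFrame({
    "variant_id": ["v1", "v2", "v3"],
    "strategy": ["monte_carlo", "random_spacing", "motif_deletion"],
    "variant_sequence": ["ACGT", "ACGTACGT", "AC"],
})
ranked = screen(variants_df, StubAdapter(), stub_features)
```

In the real pipeline the two stubs would be replaced by the loaded adapter model and the Enformer feature extractor, and the ranked table would be written out with to_csv().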
[0169] To achieve the structured screening of different variants, the present invention constructs a unified expression potential determination system based on the predicted activity of the wild-type MSTN promoter and the score distribution of all variants. Specifically, it includes: Score_variant≥Score_WT+1SD or being in the top 20% of the scores of all variants is determined as a high-potential enhancer promoter; Score_WT−0.5SD to Score_WT+1SD is defined as medium activity; Score_variant<Score_WT−0.5SD or being in the bottom 20% is determined as a low-activity or inactivated structure. Only variants in the high-potential group are selected for DNA synthesis and subsequent functional experiments to ensure the reliability and biological feasibility of the design results.
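The grading rules of [0169] can be written as a small classifier. Several points are assumptions because the text leaves them open: SD is taken as the population standard deviation of all variant scores, the top and bottom 20% are determined by rank, and the high-potential rule is checked before the low-activity rule when both could apply.

```python
import statistics

def classify(scores, wt_score):
    """Label each variant 'high', 'medium' or 'low' relative to the wild type."""
    values = sorted(scores.values())
    sd = statistics.pstdev(values)
    k = max(1, round(0.2 * len(values)))     # size of the top / bottom 20% groups
    top_cut, bottom_cut = values[-k], values[k - 1]
    labels = {}
    for vid, s in scores.items():
        if s >= wt_score + sd or s >= top_cut:
            labels[vid] = "high"
        elif s < wt_score - 0.5 * sd or s <= bottom_cut:
            labels[vid] = "low"
        else:
            labels[vid] = "medium"
    return labels

# Made-up scores, not model output.
scores = {"v1": 10.0, "v2": 20.0, "v3": 30.0, "v4": 40.0, "v5": 50.0}
labels = classify(scores, wt_score=30.0)
```

Only the variants labelled "high" would proceed to DNA synthesis under the scheme described above.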
[0170] IV. Optimized Variant Output
[0171] In the variant output optimization stage, this invention builds on the above prediction results and uses a systematic analysis method to output a candidate variant set. First, variants_df_sorted.head(20) is used to extract the 20 variants with the highest predicted expression activity from the sorted prediction results, generating the candidate variant list MSTN_top20_candidates.csv (the predicted scores range from 2577.730 to 2853.715; these values are model scores only and are not used as a direct measure of expression enhancement). Second, motif hotspot statistics are compiled: the frequency of motif occurrence in each category of variants is quantified using the pivot(index='motif',columns='category',values='frequency') pivot structure to identify the distribution pattern of key regulatory elements across the predicted grades. Third, generate_design_notes() is called to automatically generate detailed annotations for each variant, including key motif features, structural logic, and design origin. Finally, to_csv(final_candidate_csv, index=False) is used to output the complete candidate variant file, MSTN_final_candidate_variants.csv, containing variant_id, gene, predicted_score, variant_motif_order, strategy, variant_sequence, and design_notes. This provides a standardized and traceable sequence design basis for subsequent experimental verification and functional testing.
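The top-N extraction and motif hotspot pivot of [0171] can be sketched with pandas. The table below is made up, and storing variant_motif_order as a comma-separated string is an assumption about the CSV layout.

```python
import pandas as pd

ranked = pd.DataFrame({
    "variant_id": ["v1", "v2", "v3", "v4"],
    "category": ["high", "high", "low", "low"],
    "variant_motif_order": [
        "motif_MSTN_6,motif_MSTN_8", "motif_MSTN_6,motif_MSTN_2",
        "motif_MSTN_2", "motif_MSTN_2,motif_MSTN_8"],
    "predicted_score": [9.0, 8.0, 2.0, 1.0],
})

# Top candidates by predicted score (head(20) in the text; 2 here for brevity).
top = ranked.sort_values("predicted_score", ascending=False).head(2)

# One row per (variant, motif), then count motif occurrences per category.
long = ranked.assign(motif=ranked["variant_motif_order"].str.split(",")).explode("motif")
freq = long.groupby(["motif", "category"]).size().reset_index(name="frequency")

# Motif-by-category frequency table; absent combinations become 0.
hotspots = freq.pivot(index="motif", columns="category", values="frequency").fillna(0)
```

A motif that is frequent in the "high" column and rare in the "low" column is a candidate hotspot for the enhanced variants.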
[0172] V. Verification by Dual Fluorescence Experiment
[0173] (a) Design of recombinant promoters
[0174] This invention uses the genomic information of Hereford cattle. The original sequence of the MSTN gene promoter (i.e., 2000 bp upstream of the transcription start site) is named P0 (SEQ ID NO.1).
[0175] The amino acid sequence of the protein encoded by the MSTN gene in Hereford cattle can be found in GenBank (NP_001001525.1).
[0176] To verify the transcriptional effect of the promoter variants designed in step four at the cellular level, the three promoter variants with the highest predicted scores, P1 (sequence shown in SEQ ID NO.2), P2 (sequence shown in SEQ ID NO.3), and P3 (sequence shown in SEQ ID NO.4), were selected from the virtual high-throughput screening results in step four.
[0177] In this invention, all promoter sequences are represented using motif placeholders during the design phase, such as [motif_MSTN_1], [motif_MSTN_2], etc., as the basic modules for variant generation. Each motif placeholder represents a regulatory module that has been identified as having high activity, but its actual DNA sequence is not involved at this stage. Therefore, when synthesizing the P1, P2, and P3 sequences, the motif placeholders are removed, and the original DNA sequences are retained.
[0178] During the experimental validation phase, placeholders such as [motif_MSTN_x] serve only as module markers for variant design, guiding variant combination and generation strategies. Ultimately, when constructing the reporter vector or synthesizing the promoter fragment, the actual DNA sequence is generated according to the design scheme, and placeholders are no longer retained in the experimental sequence. This approach ensures complete consistency between the virtual design logic and the experimental construction, while avoiding the processing of lengthy DNA sequences during the computational phase, improving efficiency, and reducing costs and experimental risks.
[0179] In step three, P1 used the Monte Carlo strategy to sort the motifs, and the arrangement of the motifs is as follows:
[0180]
[0181] In step three, P2 used a random interval strategy to sort the motifs, and the arrangement of the motifs is as follows:
[0182]
[0183] In step three, P3 used a random interval strategy to sort the motifs, and the arrangement of the motifs is as follows:
[0184]
[0185] (ii) Synthesis of recombinant promoters
[0186] P0, P1, P2, and P3 were artificially synthesized, with Kpn I and Hind III restriction sites and protective bases added at the upstream and downstream ends, respectively, to obtain four promoter DNA fragments.
[0187] The four promoter DNA fragments and the firefly luciferase reporter vector pLG3 (purchased from Beijing Qingke Biotechnology Co., Ltd., catalog number ZT000097) were each digested with the restriction endonucleases Kpn I and Hind III. The digested promoter fragments were ligated into the linearized pLG3 vector using DNA ligase. The inserted sequences in the recombinant vectors were verified by Sanger sequencing using universal sequencing primers for the pLG3 system, confirming that the recombinants met the design expectations.
[0188] The four promoter sequences were thus cloned into the pLG3 vector to construct a reporter system for activity detection. The four recombinant vectors were named pLG3-P0, pLG3-P1, pLG3-P2, and pLG3-P3, respectively.
[0189] The TK plasmid carrying the Renilla luciferase gene (purchased from Beyotime, catalog number D2760-1μg) was used as an internal control to correct for differences in transfection efficiency. The negative control was the empty vector pLG3.
[0190] Preparation of fetal bovine muscle satellite cell lines: In a laminar flow hood, transfer the muscle tissue to sterile Petri dishes and rinse repeatedly with pre-cooled PBS (containing antibiotics at 2-3 times the usual concentration) until the washings are clear. Carefully remove all visible fat, connective tissue, and fascia. Cut the clean muscle tissue into small pieces of approximately 1-2 mm³. Transfer the tissue pieces to Erlenmeyer flasks containing a digestive enzyme mixture (collagenase I 1.5 mg / mL + dispase 1 U / mL, dissolved in PBS). Incubate at 37°C on a shaker for 60-90 minutes with low-speed shaking. Gently pipette every 20-30 minutes and check tissue dispersion. After digestion, add an equal volume of serum-containing growth medium (GM) to terminate the digestion. Filter the digest through a 100 μm cell strainer and collect the filtrate. Filter again through a 40 μm cell strainer to remove cell clumps and debris. Collect the filtrate, centrifuge at 300-400 × g for 5 minutes, and discard the supernatant. Resuspend the cells in an appropriate amount of GM and count them. The result is a crude single-cell suspension containing satellite cells, fibroblasts, endothelial cells, and erythrocyte fragments.
[0191] Satellite cell purification: Seed the crude cell suspension into uncoated cell culture dishes and incubate at 37°C for 1-2 hours. Most fibroblasts adhere rapidly, while satellite cells adhere more slowly. Carefully collect the non-adherent cell suspension, which is a partially enriched satellite cell population. Seed this cell suspension into Matrigel- or gelatin-coated culture dishes for formal culture, expansion, and cell line establishment.
[0192] Subsequently, the constructed pLG3-P0, pLG3-P1, pLG3-P2, and pLG3-P3 reporter vectors were each mixed with the TK internal control plasmid at a mass ratio of 10:1, with the total plasmid volume not exceeding 2 μL. Muscle satellite cells (MSCs) and the plasmids were then resuspended in commercially available P3 electroporation buffer (volumes as listed in the mixture table in [0194]) and thoroughly mixed. The pLG3 reporter vector and TK internal control plasmid were electroporated into the fetal bovine muscle satellite cell line using the CZ167 program. Cells were evenly seeded in 6-well plates and cultured. After 48 hours, the Firefly and Renilla luciferase activities of each group of cells were measured using a dual-luciferase assay system (purchased from Promega, catalog number E1910). The Firefly / Renilla luminescence intensity ratio was used as a quantitative indicator of promoter activity.
[0193] Cell culture and preparation: One day before transfection, cells in the logarithmic growth phase were digested with trypsin, counted, and seeded at an appropriate density in 6-well plates (5 × 10⁶ cells / well) so as to reach 70%-90% confluency at transfection. Once the cells have adhered, prepare the following mixture:
[0194]
Component | Ratio (mass) | Volume
pLG3 reporter vector | 10 (2 μg) | 1.8 μL
TK internal control plasmid | 1 (0.2 μg) | 0.2 μL
P3 electroporation buffer | | 100 μL
[0195] MSC cells were resuspended using this system and electroporated using the CZ167 electroporation program. After electroporation, the cells were evenly seeded in 6-well plates and cultured at 37°C in a 5% CO2 incubator.
[0196] Cell lysis and luminescence detection: Discard the old culture medium and gently wash the cells once with pre-cooled PBS. Add 100 μL of 1X Passive Lysis Buffer (PLB) to each well. Place the culture plate on a shaker and lyse at room temperature for 15–30 minutes. Transfer the lysate to a 1.5 mL centrifuge tube, centrifuge at 12,000 rpm for 5 minutes, and collect the supernatant for detection.
[0197] Using a multi-functional microplate reader, follow the procedure below for detection:
[0198] Firefly luciferase activity assay: Take 20 μL of cell lysis supernatant and add 100 μL of pre-mixed LAR II (firefly luciferase substrate). Immediately measure the chemiluminescence signal; the reading time is usually 10 seconds. Record the result as the F value.
[0199] Renilla Luciferase Activity Assay: Quickly add 100 μL of Stop & Glo® Reagent (to quench the firefly signal and initiate the Renilla reaction) to the above reaction wells. Immediately measure the chemiluminescence signal and record it as the R value.
[0200] Data processing: For each transfection well, the ratio of Firefly Luciferase activity to Renilla Luciferase activity (F / R) was calculated. This ratio eliminated differences in transfection efficiency and cell number. The relative activities of the experimental groups (P0, P1, P2, P3) were compared with pLG3 (background value) to obtain the promoter strength of P0, P1, P2, and P3.
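The data processing in [0200] can be sketched as follows. The luminescence readings are made up, the function name relative_activity is hypothetical, and expressing each construct as a fold value over the empty pLG3 vector (division rather than subtraction of background) is an assumption about how the comparison with the background value is made.

```python
def relative_activity(readings, background="pLG3", reference="P0"):
    """Compute mean F/R per construct, then scale to the empty vector and to P0.

    readings maps construct name -> list of (firefly, renilla) pairs, one per well.
    """
    ratios = {}
    for name, wells in readings.items():
        per_well = [f / r for f, r in wells]
        ratios[name] = sum(per_well) / len(per_well)
    # Fold activity over the empty-vector background.
    vs_background = {n: v / ratios[background] for n, v in ratios.items()}
    # Activity relative to the wild-type promoter P0.
    vs_reference = {n: v / vs_background[reference] for n, v in vs_background.items()}
    return ratios, vs_background, vs_reference

readings = {                       # made-up luminescence values, not measured data
    "pLG3": [(100.0, 1000.0), (120.0, 1000.0)],
    "P0":   [(1000.0, 1000.0), (1200.0, 1000.0)],
    "P3":   [(4000.0, 1000.0), (4800.0, 1000.0)],
}
ratios, vs_background, vs_reference = relative_activity(readings)
```

Taking the F / R ratio per well before averaging keeps each well's transfection-efficiency correction paired with its own firefly reading.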
[0201] The test results are shown in Figure 4. All designed promoter variants exhibited significant functional regulatory effects. The reporter activity of the enhanced variant P3 was significantly higher than that of the wild-type promoter P0, with a relative activity approximately 3.8 times that of P0, indicating that it can effectively increase the expression level of the MSTN gene, thereby strengthening the negative regulation of skeletal muscle growth.
[0202] Conversely, the attenuated variants P1 and P2 reduce MSTN transcriptional activity, weakening its inhibitory effect on muscle growth and promoting increased muscle mass and muscle fiber quantity. This demonstrates that different types of promoter variants can achieve precise regulation of MSTN expression, and the enhancing and attenuating variants provide feasible solutions for different molecular breeding strategies in beef cattle, showcasing the broad application potential of the promoter design method of this invention.
[0203] As is known from common technical knowledge, this invention can be implemented through other embodiments that do not depart from its spirit or essential characteristics. Therefore, the disclosed embodiments described above are merely illustrative in all respects and are not the only ones. All modifications within the scope of this invention or its equivalents are included in this invention.
Claims
1. A biomaterial, wherein the biomaterial is a second biomaterial or a third biomaterial;
The second biomaterial is any one of the following: M2-1, M2-2, M2-3, and M2-5;
M2-1: Second promoter. The sequence of the second promoter is shown in SEQ ID NO.3;
M2-2: Third gene expression cassette. The promoter of the third gene expression cassette is the second promoter described in M2-1;
M2-3: Second genetic engineering vector. The second genetic engineering vector contains a third gene expression cassette, which further contains a second exogenous gene insertion site and a terminator;
M2-5: Second genetically engineered cell. The second genetically engineered cell contains the second genetic engineering vector described in M2-3;
The third biomaterial is any one of the following: M3-1, M3-2, M3-3, and M3-5;
M3-1: Third promoter. The sequence of the third promoter is shown in SEQ ID NO.2;
M3-2: Fifth gene expression cassette. The promoter of the fifth gene expression cassette is the third promoter described in M3-1;
M3-3: Third genetic engineering vector. The third genetic engineering vector contains a fifth gene expression cassette, which also contains a third exogenous gene insertion site and a terminator;
M3-5: Third genetically engineered cell. The third genetically engineered cell contains the third genetic engineering vector described in M3-3.
2. The biomaterial of claim 1, wherein the biomaterial is selected from any one of the following: A3, A4, A6, A7, A9, A10, A12, and A13;
A3: The backbone vector of the second genetic engineering vector is the pLG3 vector;
A4: The backbone vector of the third genetic engineering vector is the pLG3 vector;
A6: The bovine MSTN gene is inserted into the second exogenous gene insertion site of the second genetic engineering vector described in M2-3 to form the second recombinant genetic engineering vector;
A7: The bovine MSTN gene is inserted into the third exogenous gene insertion site of the third genetic engineering vector described in M3-3 to form the third recombinant genetic engineering vector;
A9: The host of the second genetically engineered cell is a muscle cell;
A10: The host of the third genetically engineered cell is a muscle cell.
3. The biomaterial of claim 2, wherein the biomaterial is selected from B1 and / or B2 below;
B1: The amino acid sequence of the protein encoded by the bovine MSTN gene is shown in GenBank ID NP_001001525.1;
B2: The muscle cells are muscle satellite cell lines derived from fetal bovine cells.
4. The use of the biomaterial as described in claim 2 or 3 in the preparation of a formulation for altering the genetic properties of bovine muscle, wherein the biomaterial is selected from the second genetic engineering vector, the second genetically engineered cell, the third genetic engineering vector, and the third genetically engineered cell; wherein the bovine MSTN gene is inserted into the second exogenous gene insertion site, and the bovine MSTN gene is inserted into the third exogenous gene insertion site.