A method for constructing a sequence synthesis cycle prediction model and application thereof

By constructing a deep learning-based sequence synthesis cycle prediction model that comprehensively considers GC and AT content as well as the position of repetitive sequences, the accuracy problem of gene synthesis cycle prediction is solved, and more efficient gene synthesis management is achieved.

CN116665776B · Active · Publication Date: 2026-06-19 · SUZHOU HONGXUN BIOTECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SUZHOU HONGXUN BIOTECH CO LTD
Filing Date
2023-05-31
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies struggle to accurately predict gene synthesis cycles, especially due to the limited amount of gene sequence data used in training models and insufficient feature extraction, resulting in inaccurate models that are unable to handle large-scale, complex data.

Method used

A sequence synthesis period prediction model is constructed using Embedding technology, Transformer model and two neural networks in deep learning. It comprehensively considers GC and AT content, repetitive sequence position and enrichment, and uses distributed computing to improve training speed.

Benefits of technology

It improves the accuracy of gene sequence synthesis cycle prediction, can handle large-scale complex data, simplifies the operation process, and improves synthesis efficiency.

✦ Generated by Eureka AI based on patent content.

Abstract

This invention discloses a method for constructing a sequence synthesis cycle prediction model and its application. The method includes selecting several known gene sequences of different lengths and synthesis cycles, extracting sequence features from the known sequences, using the extracted sequence features and the known sequences as training data for a database, and then using embedding technology, a Transformer model, and two neural networks from deep learning to establish a sequence synthesis cycle prediction model. The method of this invention can predict the synthesis cycle of gene sequences of varying complexity, is simple to operate, has high accuracy, and is beneficial for the overall planning of gene synthesis, thereby improving synthesis efficiency.

Description

Technical Field

[0001] This invention belongs to the fields of molecular biology and bioinformatics, specifically relating to a method for constructing a sequence synthesis cycle prediction model and its application.

Background Technology

[0002] The integration of biotechnology and information technology has driven the adoption of gene synthesis technology at a rate exceeding Moore's Law. As a foundation of the life sciences, gene synthesis is applied in fields such as biomedicine and disease research, and market demand for it continues to rise. Gene synthesis does not rely on a sequence template; instead, double-stranded DNA is chemically synthesized in vitro. Synthesized fragments can be relatively long, reaching the kb level. The gene synthesis process is a trial-and-error process of continuous design, modification, verification, and correction. Currently, most commercial DNA synthesis companies use column synthesis, a four-step phosphoramidite chemical synthesis method that synthesizes oligonucleotides on a solid phase. These oligonucleotides can typically reach 100-200 nt, with an error rate of 0.5% or less and a coupling efficiency of up to 99% per monomer.

[0003] Currently, most gene synthesis is outsourced, with gene synthesis companies designing and synthesizing the genes. As the demand for gene synthesis grows, clients have increasingly specific requirements regarding delivery time. However, the gene sequences to be synthesized vary not only in length but also in synthesis difficulty, making it difficult to accurately estimate the synthesis cycle. Gene synthesis companies typically provide clients with approximate delivery times based on years of experience.

[0004] For example, CN111192629A discloses a gene sequence difficulty analysis model. This model uses several regression algorithms commonly used in machine learning to construct a quantitative prediction model. It is trained on a certain number of known sequences, and features extracted from a sequence are then input to predict the difficulty of the gene sequence and thus the synthesis cycle of the gene to be tested. However, the amount of gene sequence data used to train the model is limited, resulting in insufficient accuracy. Its sequence features consider only the GC content of the sequence, not the AT content or AT enrichment. For repeats, it essentially considers only length: although it also takes into account the proportion of forward and reverse repeats relative to the total sequence length and the repeat coverage area, these are still length-based measures. It does not consider where a repeat is located in the sequence, even though the position of a repeat may affect synthesis difficulty. In addition, traditional machine learning regression algorithms alone have difficulty handling large-scale, complex data.

Summary of the Invention

[0005] To address the shortcomings of existing technologies and practical needs, this invention provides a method for constructing a sequence synthesis cycle prediction model and its application. The method can predict the synthesis cycle of gene sequences of different complexities, is simple to operate, has high accuracy, and is conducive to the overall planning of gene synthesis, thereby improving synthesis efficiency.

[0006] To achieve the above objectives, the present invention adopts the following technical solution:

[0007] In a first aspect, the present invention provides a method for constructing a sequence synthesis cycle prediction model, comprising:

[0008] selecting several known sequences, the known sequences comprising gene sequences of known length and known synthesis cycle;

[0009] extracting sequence features from the known sequences, and using the extracted sequence features together with the known sequences as training data for a database; and

[0010] using the training data to establish a sequence synthesis cycle prediction model with Embedding technology, a Transformer model, and two neural networks from deep learning.

[0011] In deep learning, embedding is a commonly used technique that maps discrete input features into continuous vector representations so that neural networks can understand and process them, thereby improving the model's performance.

[0012] The Transformer is a deep learning model for processing sequential data, initially proposed by Vaswani et al. in 2017. Traditional sequence models, such as recurrent neural networks (RNNs) or convolutional neural networks (CNNs), suffer from problems when processing long sequences, such as vanishing and exploding gradients. In contrast, the Transformer model takes a completely different approach. Instead of using recurrent or convolutional layers, it employs an attention mechanism to process the input sequence. This mechanism mimics the information processing process of the human brain, focusing limited attention on key information to save resources and quickly obtain relevant information. Compared to traditional sequence models, the Transformer model has the following advantages: it supports parallel computation, thus improving computational efficiency; it supports long sequence modeling, considering all elements in the sequence simultaneously, thus improving model accuracy; and it has good generalization performance, achieving good results in tasks such as machine translation and text generation.

[0013] Preferably, the neural network includes a Linear neural network and a Dense neural network.

[0014] Preferably, the Linear neural network contains four linear transformation layers, and the Dense neural network contains three linear transformation layers.

[0015] Preferably, the sequence features include base type, sequence repetition status, AT / GC enrichment status, sequence length, total repetitive sequence score, AT enrichment score, GC enrichment score, and the length of the longest repetitive subsequence.

[0016] In a second aspect, the present invention provides the use of the prediction model constructed by the construction method of the first aspect in predicting the gene sequence synthesis cycle.

[0017] In a third aspect, the present invention provides a sequence synthesis cycle prediction device, comprising:

[0018] The sequence feature extraction unit is used to extract sequence features from known sequences; in addition to preparing training data for the prediction model unit, it also needs to provide services to the prediction unit.

[0019] The database unit is used to acquire gene sequences of known lengths and synthesis cycles, as well as sequence feature information obtained after processing by the sequence feature extraction unit. These sequences are divided into training and testing sets. This data will be input into the prediction model unit to train the model parameters and form the final prediction model.

[0020] The prediction model unit is used to train on the training set data in the database unit to build the prediction model.

[0021] The prediction unit is used to input the sequence to be tested, call the sequence feature extraction unit and the prediction model unit, and predict the synthesis period of the sequence.

[0022] Preferably, the prediction model unit includes: a Linear subunit, an Embedding subunit, an Encoder subunit, a Dense subunit, and a Represent subunit.

[0023] Preferably, the Embedding subunit includes a two-layer structure. The first layer uses the Embedding class in PyTorch, and the second layer adds the Embedding results and then uses nn.LayerNorm in PyTorch to implement layer normalization.

[0024] PyTorch is an open-source Python machine learning library based on Torch, providing a wealth of tools and interfaces for building various deep learning models, including convolutional neural networks, recurrent neural networks, variational autoencoders, and more. Furthermore, PyTorch offers many advanced features, such as automatic differentiation and distributed training, making deep learning easier and more efficient.

[0025] Compared with the prior art, the present invention has the following beneficial effects:

[0026] The method for constructing a sequence synthesis cycle prediction model provided by this invention builds a sequence database containing more than 20,000 sequences of different lengths and synthesis cycles, all derived from real business cases, which helps to build a more accurate prediction model. When extracting sequence features, this invention comprehensively considers the GC and AT content and their respective enrichment in the sequence, and assigns different values to the positions of non-repeating sequences, ordinary repeating sequences, and the longest repeating sequence. This distinguishes the different repeat situations and records each repeat position together with its repeat situation, further improving the accuracy of the prediction model. This invention uses high-performing models from deep learning, which can not only handle large-scale data but also use distributed computing to improve training speed.

Attached Figure Description

[0027] Figure 1 is a schematic diagram of the structure of the gene sequence synthesis cycle prediction model;

[0028] Figure 2 is a flowchart of data preparation for the database unit;

[0029] Figure 3 is a schematic diagram of the prediction model unit;

[0030] Figure 4 is a schematic diagram of the structure of the Linear module of the prediction model unit;

[0031] Figure 5 is a schematic diagram of the structure of the Embedding module of the prediction model unit;

[0032] Figure 6 is a schematic diagram of the structure of each layer of the Encoder module of the prediction model unit;

[0033] Figure 7 is a schematic diagram of the structure of the Dense module of the prediction model unit;

[0034] Figure 8 is a flowchart of the prediction unit's workflow.

Detailed Implementation

[0035] To further illustrate the technical means and effects of this invention, the following description, in conjunction with embodiments and accompanying drawings, provides a further explanation of the invention. It is understood that the specific embodiments described herein are merely illustrative of the invention and not intended to limit it.

[0036] Example 1: Construction of a gene sequence synthesis cycle prediction model

[0037] This embodiment provides a gene sequence synthesis cycle prediction model, which includes four modules, as shown in Figure 1. The structure and usage of each module are as follows:

[0038] 1. Sequence Feature Extraction Unit

[0039] For each gene synthesis sequence to be processed, the sequence feature extraction unit first obtains information such as forward repeats, inverted repeats, palindromes, AT sequences, and GC sequences. Forward repeats refer to two identical subsequences in the gene sequence; inverted repeats refer to two subsequences in the gene sequence that are reverse complements of each other; palindromes refer to a subsequence that is its own reverse complement; AT sequences refer to subsequences consisting only of A and T bases; and GC sequences refer to subsequences consisting only of G and C bases. These sequences must be at least 8 bp in length. Based on this information, the required sequence features are calculated.
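The repeat definitions above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, the scan is a simple fixed-window search rather than whatever repeat finder the applicant actually uses, and sequences are assumed to contain only the uppercase letters A, T, C, and G.

```python
# Illustrative helpers for the repeat definitions; all names are hypothetical.
COMPLEMENT = str.maketrans("ATCG", "TAGC")
MIN_LEN = 8  # minimum subsequence length in bp, taken from the description


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the result."""
    return seq.translate(COMPLEMENT)[::-1]


def is_forward_repeat(a: str, b: str) -> bool:
    """Two subsequences form a forward repeat when they are identical."""
    return a == b


def is_inverted_repeat(a: str, b: str) -> bool:
    """Two subsequences form an inverted repeat when one is the reverse complement of the other."""
    return b == reverse_complement(a)


def is_palindrome(subseq: str) -> bool:
    """A palindrome is a subsequence that equals its own reverse complement."""
    return subseq == reverse_complement(subseq)


def find_palindromes(seq: str, window: int = MIN_LEN) -> list:
    """Return the start index of every window of exactly `window` bases that is a palindrome."""
    return [
        start
        for start in range(len(seq) - window + 1)
        if is_palindrome(seq[start:start + window])
    ]


starts = find_palindromes("TTGGAATTCCAA")
```

A production repeat finder would also need to report longer matches and pair up forward and inverted repeats across the whole sequence; the sketch only shows what each definition means.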

[0040] 1) Comprehensive sequence features

[0041] The comprehensive sequence features include five elements. The first element is the character sequence length. The second element is the total repeat sequence score, which is the proportion of the gene sequence made up of the aforementioned forward repeat sequences, reverse repeat sequences, and palindromic sequences. The third element is the AT enrichment score, calculated from the number and length of the aforementioned AT sequences: one way to score an AT sequence of length n is n*min(n-8+1, 8), and the AT enrichment score is the sum of the scores of all AT sequences. The fourth element is the GC enrichment score, calculated from the number and length of the aforementioned GC sequences in a manner similar to the AT enrichment score. The fifth element is the length of the longest repeat subsequence.
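As a worked illustration of the enrichment score, the sketch below finds maximal AT or GC runs of at least 8 bp and sums n*min(n-8+1, 8) over them. The function names are hypothetical, the regular-expression search is one possible way to locate runs, and applying the identical formula to GC runs is an assumption, since the text only says the GC calculation is similar.

```python
import re

MIN_RUN = 8  # minimum run length in bp, from the description


def find_runs(seq: str, bases: str, min_len: int = MIN_RUN) -> list:
    """Return (start, length) for each maximal run made up only of `bases`."""
    pattern = re.compile(f"[{bases}]{{{min_len},}}")
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(seq)]


def run_score(n: int) -> int:
    """Score of a single run of length n: n * min(n - 8 + 1, 8)."""
    return n * min(n - 8 + 1, 8)


def enrichment_score(seq: str, bases: str) -> int:
    """Sum of the run scores over all qualifying runs in the sequence."""
    return sum(run_score(length) for _, length in find_runs(seq, bases))


seq = "ATATATATAT" + "GCGCGCGC"        # one AT run of 10 bp, one GC run of 8 bp
at_score = enrichment_score(seq, "AT")  # 10 * min(3, 8) = 30
gc_score = enrichment_score(seq, "GC")  # 8 * min(1, 8) = 8
```

The min(..., 8) term caps the per-base weight, so a run longer than 15 bp grows linearly in score (8 per base) instead of quadratically.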

[0042] 2) Sequence base characteristics

[0043] Read the complete sequence and convert the character sequence into a numeric sequence, base by base, according to the rules shown in Table 1 to obtain the sequence base characteristics.

[0044] Table 1

[0045]
Character    Number
A            1
T            2
C            3
G            4
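The mapping in Table 1 amounts to a simple lookup. A minimal sketch follows; the names `BASE_TO_INT` and `encode_bases` are hypothetical, and raising an error on characters outside A, T, C, and G is an assumption, since the text does not say how other characters are handled.

```python
# Mapping from Table 1; the value 0 is left free for padding.
BASE_TO_INT = {"A": 1, "T": 2, "C": 3, "G": 4}


def encode_bases(seq: str) -> list:
    """Convert a base string into the numeric sequence base characteristics."""
    try:
        return [BASE_TO_INT[base] for base in seq.upper()]
    except KeyError as err:
        # Assumption: anything other than A/T/C/G is treated as an input error.
        raise ValueError(f"unsupported base: {err.args[0]!r}") from None


encoded = encode_bases("ATCGGA")
```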

[0046] 3) Sequence repetition characteristics

[0047] First, read the relevant information (including repetition type, the first and second occurrences, subsequence length, and subsequence content) of all forward repeating sequences, reverse repeating sequences, and palindromic sequences with a length greater than 8. Then, sort them in descending order of length, with the first record being the longest repeating subsequence.

[0048] Subsequently, a numeric list with the same length as the gene sequence (denoted as the sequence repetition feature) is created, and all elements are first set to 1. Then, for each item in the repeat sequence list, the element values at the corresponding positions of the sequence repetition feature are set to 2. Finally, the elements at the positions of the longest repeated subsequence are set to 3.
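The marking procedure can be sketched as follows. The record layout `(first_start, second_start, length)` with 0-based positions is a hypothetical simplification of the repeat information listed above, and marking both occurrences of each repeat is an assumption; for a palindrome the two start positions can simply be equal.

```python
from typing import List, Tuple

# Hypothetical record layout: (first_start, second_start, length), 0-based.
Repeat = Tuple[int, int, int]


def _mark(feature: List[int], start: int, length: int, value: int) -> None:
    """Set feature[start:start+length] to value, staying inside the list."""
    for i in range(max(start, 0), min(start + length, len(feature))):
        feature[i] = value


def repetition_feature(seq_len: int, repeats: List[Repeat]) -> List[int]:
    """1 = no repeat, 2 = ordinary repeat, 3 = longest repeated subsequence."""
    feature = [1] * seq_len
    ordered = sorted(repeats, key=lambda r: r[2], reverse=True)
    for first, second, length in ordered:
        _mark(feature, first, length, 2)
        _mark(feature, second, length, 2)
    if ordered:
        first, second, length = ordered[0]  # longest repeat comes first
        _mark(feature, first, length, 3)
        _mark(feature, second, length, 3)
    return feature


# One forward repeat of 8 bp (at 0 and 20) and one palindrome of 9 bp (at 10).
feature = repetition_feature(30, [(0, 20, 8), (10, 10, 9)])
```

Writing the value 3 last means the longest repeat always wins where it overlaps an ordinary repeat, which matches the order of steps in the text.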

[0049] 4) Sequence enrichment features

[0050] First, read all relevant information (including the position of occurrence, subsequence length, and subsequence content) of AT and GC sequences with a length greater than 8 to form an enriched sequence list.

[0051] Subsequently, a numeric list with the same length as the gene sequence (denoted as the sequence enrichment feature) is created, and all elements are initially set to 1. Then, for each item in the enriched sequence list, if it is a GC sequence, the element values at the corresponding positions of the sequence enrichment feature are set to 2; if it is an AT sequence, the element values at the corresponding positions are set to 3.
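A minimal sketch of the enrichment feature is given below. The function name is hypothetical, runs are located with a regular expression as one possible approach, and the threshold is taken as at least 8 bp (the unit description says "at least 8" while this step says "greater than 8"; the sketch exposes the threshold as a parameter for that reason).

```python
import re


def enrichment_feature(seq: str, min_len: int = 8) -> list:
    """1 = not enriched, 2 = inside a GC run, 3 = inside an AT run."""
    feature = [1] * len(seq)
    for bases, value in (("GC", 2), ("AT", 3)):
        for match in re.finditer(f"[{bases}]{{{min_len},}}", seq):
            run_length = match.end() - match.start()
            feature[match.start():match.end()] = [value] * run_length
    return feature


seq = "ATATATATAT" + "GCGCGCGC" + "ACGT"
feature = enrichment_feature(seq)
```

Because a base cannot belong to both an AT run and a GC run, the two passes never overwrite each other.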

[0052] 2. Database Unit

[0053] The data preparation process for the database unit is shown in Figure 2.

[0054] First, business data needs to be collected, including actual gene synthesis sequences and actual synthesis cycles (in days). For each unprocessed sequence, the sequence feature extraction unit is invoked. After the invocation, the system will read three types of data:

[0055] 1) Actual synthesis cycle

[0056] This refers to the actual synthesis period of the current sequence, in days, and is a positive integer.

[0057] 2) Sequence base characteristics, sequence repetition characteristics, and sequence enrichment characteristics

[0058] These are the features obtained after specific calculations on the current sequence, which relate to the base type, sequence repetition, and AT / GC enrichment of the gene sequence. All three features are lists whose length (i.e., the number of elements) equals the length (i.e., the number of bases) of the current sequence, and whose element values are all taken from {1, 2, 3, 4}.

[0059] 3) Comprehensive sequence features

[0060] This refers to the features obtained after the current sequence has undergone specific calculations (see the sequence feature extraction unit). It contains five elements: the first element is the length of the character sequence; the second element is the total repeating sequence score; the third element is the AT enrichment score; the fourth element is the GC enrichment score; and the fifth element is the length of the longest repeating subsequence.

[0061] For the second type of data, the system needs to append zeros to the end of the list to make the total length of the list reach a given value, MAX_LENGTH. In practice, this value is usually set to 5000, which is sufficient for most cases. After appending, the three types of data are concatenated into a single set. The first element of this set is the base characteristics of the padded sequence, the second element is the repeat characteristics of the padded sequence, the third element is the enrichment characteristics of the padded sequence, the fourth element is the overall sequence characteristics, and the fifth element is the actual synthesis cycle. This set of data is then added to the results list.
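The padding and assembly step can be sketched in a few lines. The helper names are hypothetical, and rejecting sequences longer than MAX_LENGTH is an assumption; the text does not say how such sequences are handled.

```python
MAX_LENGTH = 5000  # value given in the description


def pad_feature(feature: list, max_length: int = MAX_LENGTH) -> list:
    """Append zeros so that the list reaches max_length elements."""
    if len(feature) > max_length:
        # Assumption: over-long sequences are rejected rather than truncated.
        raise ValueError("sequence is longer than MAX_LENGTH")
    return feature + [0] * (max_length - len(feature))


def build_record(base, repeat, enrich, comprehensive, cycle_days,
                 max_length: int = MAX_LENGTH) -> tuple:
    """Assemble the five-element record in the order described in the text."""
    return (
        pad_feature(base, max_length),
        pad_feature(repeat, max_length),
        pad_feature(enrich, max_length),
        list(comprehensive),
        cycle_days,
    )


record = build_record(
    base=[1, 2, 3], repeat=[1, 1, 1], enrich=[1, 1, 1],
    comprehensive=[3, 0.0, 0, 0, 0], cycle_days=7, max_length=5,
)
```

Padding with 0 works because the three per-base features only use the values 1 to 4, so 0 is unambiguous as "no base here".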

[0062] Repeat the above steps until all sequences have been processed.

[0063] Subsequently, the order of the results list was shuffled, and stratified sampling was used to extract 80% of the data as the training set and 20% as the test set. Ultimately, the system selected more than 20,000 data points as the training set.
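A stratified 80/20 split can be sketched with the standard library alone. The function name is hypothetical, and stratifying by the actual synthesis cycle is an assumption; the text does not state which variable defines the strata.

```python
import random
from collections import defaultdict


def stratified_split(records, key, test_fraction=0.2, seed=0):
    """Shuffle within each stratum, then send test_fraction of it to the test set."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)

    train, test = [], []
    for members in strata.values():
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])

    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Toy records of the form (sequence_id, synthesis_cycle_in_days).
records = [(i, cycle) for cycle in (5, 10) for i in range(10)]
train, test = stratified_split(records, key=lambda r: r[1])
```

Splitting inside each stratum keeps the distribution of synthesis cycles roughly the same in the training and test sets, which matters when some cycle lengths are rare.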

[0064] 3. Prediction Model Unit

[0065] 3.1 Model Structure

[0066] The structure of the prediction model unit is shown in Figure 3. The Linear module is a neural network with 4 linear transformation layers; the Embedding module is a two-layer structure that uses embedding and layer normalization techniques from deep learning; the Encoder module is a multi-layer network that uses an 8-layer Transformer encoder from deep learning; the Dense module is a neural network with 3 linear transformation layers; and the Represent module is a presentation module that processes the output results.

[0067] During model training (i.e., obtaining the values of the parameters used by each module), the system uses training set data from the database unit and employs early stopping; that is, it decides whether to stop training early by monitoring changes in the loss on the validation set. Simultaneously, the system utilizes GPU parallel training, significantly improving training speed. After training is complete, the model and optimal parameters are saved and converted to ONNX format for easy access by the prediction unit.
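The early stopping rule can be sketched independently of any training framework. The class name, the patience of 2 epochs, and the `min_delta` tolerance are all hypothetical; the text does not give the criterion actually used.

```python
class EarlyStopping:
    """Signal a stop when the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 2, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss      # improvement: remember it and reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1      # no improvement this epoch
        return self.bad_epochs >= self.patience


stopper = EarlyStopping(patience=2)
decisions = [stopper.step(loss) for loss in [1.0, 0.8, 0.9, 0.85]]
```

In a real training loop the parameters would be checkpointed whenever `best` is updated, so that the saved model corresponds to the lowest validation loss rather than the last epoch.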

[0068] 3.2 Linear Module

[0069] The Linear module is a neural network containing four linear transformation layers, as shown in Figure 4.

[0070] The input to the Linear module is the comprehensive sequence features (see the Database Unit section). The first layer has an input dimension of 5 and an output dimension of 256, using the ReLU (Rectified Linear Unit) activation function. The second and third layers both have an input and output dimension of 256, and both use the ReLU activation function. The fourth layer also has an input and output dimension of 256, but does not use an activation function and directly outputs the result.
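The layer dimensions above map directly onto PyTorch's `nn.Linear` and `nn.ReLU`. The following is a minimal sketch of such a stack, not the applicant's actual code; the variable names and the random example input are illustrative.

```python
import torch
from torch import nn

# Four linear layers: 5 -> 256 -> 256 -> 256 -> 256, ReLU after the first three only.
linear_module = nn.Sequential(
    nn.Linear(5, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 256),  # no activation on the last layer
)

# A batch of two comprehensive feature vectors (five elements each).
comprehensive = torch.rand(2, 5)
linear_out = linear_module(comprehensive)  # shape: (2, 256)
```

In practice the five comprehensive features have very different scales (a length in the thousands next to a proportion below 1), so some form of input scaling would probably help; the text does not say whether any is applied.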

[0071] The formula for calculating the ReLU function is as follows:

[0072] $\mathrm{ReLU}(x) = \max(0, x)$

[0073] 3.3 Embedding Module

[0074] The Embedding module consists of two layers, as shown in Figure 5. The first layer uses four Embedding classes from PyTorch. The second layer sums the Embedding results and then uses PyTorch's nn.LayerNorm to perform layer normalization.

[0075] In deep learning, embedding is a commonly used technique that maps discrete input features into continuous vector representations, enabling neural networks to understand and process them, thereby improving model performance. Sequence base features, sequence repetition features, and sequence enrichment features are all list-type data extracted from sequences, containing discrete integers, and therefore require processing through embedding layers. The fourth input, the sequence position feature, is not derived from the sequence itself; it is a system-preset tensor whose length equals the number of columns in the sequence base features and which contains the integers 0, 1, 2, 3…n.

[0076] In deep learning, normalization is a commonly used technique that ensures consistent feature propagation across the neural network for each sample, thereby improving model stability and training performance. Specifically, the `nn.LayerNorm` class can normalize the mean and variance of each input sample along a specified dimension, and then apply a linear transformation and bias to the normalized result to obtain the final output.
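The two-layer structure can be sketched with `nn.Embedding` and `nn.LayerNorm`. This is an illustration under assumptions: the class name is hypothetical, the embedding width of 256 is a guess chosen to match the Linear module's output, and using `padding_idx=0` for the padded positions is likewise an assumption.

```python
import torch
from torch import nn


class EmbeddingModule(nn.Module):
    """Four embeddings summed element-wise, followed by layer normalization."""

    def __init__(self, max_length: int = 5000, dim: int = 256):
        super().__init__()
        self.base = nn.Embedding(5, dim, padding_idx=0)    # values 0 (pad), 1-4
        self.repeat = nn.Embedding(4, dim, padding_idx=0)  # values 0 (pad), 1-3
        self.enrich = nn.Embedding(4, dim, padding_idx=0)  # values 0 (pad), 1-3
        self.position = nn.Embedding(max_length, dim)      # positions 0..max_length-1
        self.norm = nn.LayerNorm(dim)

    def forward(self, base, repeat, enrich):
        # Preset position indices 0, 1, 2, ... with one entry per column.
        positions = torch.arange(base.size(1), device=base.device).unsqueeze(0)
        summed = (self.base(base) + self.repeat(repeat)
                  + self.enrich(enrich) + self.position(positions))
        return self.norm(summed)


module = EmbeddingModule()
base = torch.tensor([[1, 2, 3, 4, 0, 0]])    # last two columns are padding
repeat = torch.tensor([[1, 1, 2, 3, 0, 0]])
enrich = torch.tensor([[1, 3, 3, 2, 0, 0]])
embedded = module(base, repeat, enrich)      # shape: (1, 6, 256)
```

Summing the four embeddings, rather than concatenating them, keeps the model width fixed regardless of how many per-position features are used.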

[0077] 3.4 Encoder Module

[0078] The Encoder module uses an 8-layer Transformer model Encoder.

[0079] The Transformer model is a deep learning model used to process sequential data, and it has wide applications in fields such as natural language processing. The encoder is used to transform the input sequence into a set of feature vectors.

[0080] Each encoder layer contains a multi-head attention module and a fully connected feedforward neural network module, as shown in Figure 6. In the multi-head attention module, the model performs attention calculations at each position in the input sequence to obtain a context vector representation. These context vector representations are then fed into a fully connected feedforward neural network module for processing, resulting in the final encoder output.
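An 8-layer encoder of this kind can be assembled from PyTorch's built-in `nn.TransformerEncoderLayer` and `nn.TransformerEncoder`. The sketch below is illustrative only: the model width of 256, the 8 attention heads, and the feedforward width of 1024 are assumptions, as the text specifies only the number of layers.

```python
import torch
from torch import nn


def build_encoder(dim: int = 256, heads: int = 8,
                  ff_dim: int = 1024, layers: int = 8) -> nn.TransformerEncoder:
    """Stack `layers` identical encoder layers (multi-head attention + feedforward)."""
    layer = nn.TransformerEncoderLayer(
        d_model=dim, nhead=heads, dim_feedforward=ff_dim, batch_first=True
    )
    # enable_nested_tensor=False keeps the output at the same padded length as the input.
    return nn.TransformerEncoder(layer, num_layers=layers, enable_nested_tensor=False)


encoder = build_encoder()
embedded = torch.rand(2, 10, 256)                  # (batch, positions, dim)
padding_mask = torch.zeros(2, 10, dtype=torch.bool)
padding_mask[:, 7:] = True                         # True marks padded positions to ignore
encoded = encoder(embedded, src_key_padding_mask=padding_mask)  # shape: (2, 10, 256)
```

Passing a padding mask stops the attention layers from attending to the zero-padded tail of each sequence, which is relevant here because every sequence is padded to MAX_LENGTH.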

[0081] 3.5 Dense Module

[0082] The Dense module is a neural network containing three linear transformation layers, as shown in Figure 7.

[0083] The Dense module takes the outputs of the Encoder and Linear modules as input. The first layer has an input dimension of 512 and an output dimension of 128, using the ReLU activation function. The second layer has an input dimension of 128 and an output dimension of 128, also using the ReLU activation function. The third layer has an input dimension of 128 and an output dimension of 1, and no activation function is used; the result is output directly.
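The Dense stack itself follows directly from the stated dimensions; how the two inputs are combined into a 512-dimensional vector is not spelled out. The sketch below assumes that the Encoder output is averaged over positions to 256 values and concatenated with the 256-dimensional Linear output. Both the mean pooling and the concatenation are guesses made for illustration.

```python
import torch
from torch import nn

# Three linear layers: 512 -> 128 -> 128 -> 1, ReLU after the first two only.
dense_module = nn.Sequential(
    nn.Linear(512, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 1),  # no activation on the last layer
)

encoder_out = torch.rand(2, 10, 256)  # stand-in for the Encoder module output
linear_out = torch.rand(2, 256)       # stand-in for the Linear module output

# Assumption: reduce the encoder output over positions, then concatenate.
pooled = encoder_out.mean(dim=1)                    # shape: (2, 256)
combined = torch.cat([pooled, linear_out], dim=-1)  # shape: (2, 512)
dense_out = dense_module(combined)                  # shape: (2, 1)
```

Other reductions, such as taking the first position or a mask-aware mean that skips padding, would fit the stated 512-dimensional input equally well.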

[0084] 3.6 Represent Module

[0085] The Represent module takes as input the Dense module's output and performs three main operations:

[0086] 1) Substitute the input result into the ReLU function to calculate the result, which can remove negative values.

[0087] 2) Squeeze dimensions, that is, remove dimensions of size one, so that the result is easier to extract.

[0088] 3) Calculate the result of the exponential function with the natural constant e as the base.
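The three operations can be sketched in plain Python on a nested list standing in for the Dense module's output of shape (batch, 1). The function name is hypothetical; in PyTorch the same steps would correspond to `torch.relu`, `squeeze`, and `torch.exp`.

```python
import math


def represent(dense_out: list) -> list:
    """Apply ReLU, drop the size-one dimension, then exponentiate with base e."""
    results = []
    for row in dense_out:          # each row holds a single value
        value = max(0.0, row[0])   # 1) ReLU removes negative values
        results.append(math.exp(value))  # 2) row collapsed to a scalar, 3) e ** value
    return results


predictions = represent([[-1.0], [0.0], [2.0]])
```

Because ReLU clips negative inputs to zero before the exponential, every output is at least e^0 = 1, so the module can never report a cycle shorter than one day.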

[0089] 4. Prediction Unit

[0090] The workflow of the prediction unit is shown in Figure 8. First, a real gene synthesis sequence is input. Then, the system calls the sequence feature extraction unit to extract the required features and monitors for errors during this process. If an error occurs, the error message is output and the process ends; otherwise, the ONNX Runtime framework loads the final prediction model in ONNX format generated by the prediction model unit and runs it to obtain the prediction result. Since this result is a list of decimals representing a series of outcomes from high to low probability, the first value is taken and rounded to obtain the integer result, which is the predicted cycle.

[0091] The applicant declares that the above description is merely an embodiment of the present invention and does not limit the scope of protection of the claims of the present invention. Any equivalent structural or procedural transformations made based on the content of the present invention specification and drawings, or direct or indirect applications in other related technical fields, are similarly included within the scope of protection of the claims of the present invention.

Claims

1. A method for constructing a sequence synthesis cycle prediction model, characterized in that it comprises:
selecting several known sequences, the known sequences comprising gene sequences of known length and synthesis cycle;
extracting sequence features from the known sequences, the sequence features including base type, sequence repetition status, AT / GC enrichment status, sequence length, total repetitive sequence score, AT enrichment score, GC enrichment score, and the length of the longest repetitive subsequence;
using the extracted sequence features and the known sequences as training data for a database; and
using the training data to establish a sequence synthesis cycle prediction model with Embedding technology, a Transformer model, and two neural networks from deep learning, the neural networks including a Linear neural network and a Dense neural network;
wherein establishing the sequence synthesis cycle prediction model specifically includes:
constructing a Linear module, which is a neural network containing four linear transformation layers, whose input is a comprehensive sequence feature that includes five elements: character sequence length, total repeat sequence score, AT enrichment score, GC enrichment score, and longest repeating subsequence length;
constructing an Embedding module, which is a two-layer structure, in which the first layer uses four Embedding classes to process sequence base features, sequence repeat features, sequence enrichment features, and sequence position features respectively, and the second layer adds the Embedding results of the first layer and then applies layer normalization;
constructing an Encoder module, which is an encoder using an eight-layer Transformer model; and
constructing a Dense module, which is a neural network containing three linear transformation layers, whose input is the output of the Encoder module and the Linear module.

2. The application of a prediction model constructed by the method described in claim 1 in predicting gene sequence synthesis cycles.