Methylation detection method and apparatus, computer device, and storage medium

By constructing target base sequences, aligning and fusing features, and applying a feature extraction network based on relative position encoding and a self-attention mechanism, the method addresses the low detection accuracy that results when HMM-based models ignore the sequence features of nanopore electrical signals, achieving higher-precision DNA methylation detection.

WO2026137328A1 · PCT designated stage · Publication Date: 2026-07-02 · BGI HANGZHOU CYCLONESEQ TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
BGI HANGZHOU CYCLONESEQ TECHNOLOGY CO LTD
Filing Date
2024-12-26
Publication Date
2026-07-02

AI Technical Summary

Technical Problem

Existing methylation detection methods based on HMM models ignore the sequence characteristics of nanopore electrical signals, resulting in low detection accuracy.

Method used

By constructing a target base sequence of a preset length, base features are extracted and aligned and fused with the original electrical signal. Contextual features are extracted using a feature alignment network and a feature extraction network, and methylation classification is performed by combining relative position encoding and self-attention mechanism.

Benefits of technology

By fully considering the low-dimensional feature space interaction between the target base sequence and the original electrical signal, as well as the influence between bases, the method improves the accuracy of DNA methylation detection.

Abstract

Provided are a methylation detection method and apparatus, a computer device, and a storage medium. The method comprises: constructing, for a target site of a base sequence to be detected, a target base sequence of a preset length (S202); extracting a base feature corresponding to the target base sequence (S204); aligning and fusing the base feature, the target base sequence, and the raw electrical signal of each base in the base sequence to be tested to obtain a fused sequence feature (S206); performing context feature extraction on the fused sequence feature to obtain a context feature (S208); and, on the basis of the context feature, performing methylation classification to obtain a classification result (S210).

Description

Methylation detection method and apparatus, computer device, and storage medium

Technical Field

[0001] This application relates to the field of medical testing technology, and in particular to a methylation detection method, apparatus, computer device, storage medium, and computer program product.

Background

[0002] DNA methylation is one of the most widely studied epigenetic modifications, and the methylation status of CpG islands in particular plays a crucial role in understanding biological processes. Current methylation detection methods primarily build Hidden Markov Models (HMMs) on *E. coli* data and then use these models to detect methylation in *Homo sapiens* nanopore reads. However, HMM-based methods ignore the sequence characteristics of nanopore electrical signals, which leads to low methylation detection accuracy.

Summary of the Invention

[0003] According to various embodiments of this application, a methylation detection method, apparatus, computer device, computer-readable storage medium, and computer program product are provided.

[0004] In a first aspect, this application provides a methylation detection method, executed by a computer device, the method comprising:

[0005] constructing, for a target site of a base sequence to be tested, a target base sequence of a preset length;

[0006] extracting base features corresponding to the target base sequence;

[0007] aligning and fusing the base features, the target base sequence, and the original electrical signal of each base in the base sequence to be tested to obtain fused sequence features;

[0008] performing context feature extraction on the fused sequence features to obtain context features;

[0009] performing methylation classification based on the context features to obtain a classification result.

[0010] In one embodiment, aligning and fusing each sub-feature in the base features, the target base sequence, and the original electrical signal of each base in the base sequence to be tested to obtain the fused sequence features includes:

[0011] The sub-features in the base features, the target base sequence, and the original electrical signal are mapped to the target vector space to obtain the fused sequence features.

[0012] In one embodiment, the step of extracting context features from the fused sequence features to obtain context features includes:

[0013] Position encoding is performed on the fused sequence features to obtain a relative position encoding, where the relative position encoding is used to capture the relative positions between bases;

[0014] The fused sequence features are weighted based on the relative position encoding to obtain weighted sequence features;

[0015] Context features are extracted from the weighted sequence features to obtain context features.

[0016] In one embodiment, the step of extracting context features from the weighted sequence features to obtain context features includes:

[0017] A linear transformation is performed on the weighted sequence features to obtain linearly transformed features; the linearly transformed features include query features, key features, and value features.

[0018] The key features and the value features are weighted using relative position weights to obtain weighted key features and weighted value features;

[0019] The context features are extracted based on the query features, the weighted key features, and the weighted value features.

[0020] In one embodiment, extracting the context features based on the query features, the weighted key features, and the weighted value features includes:

[0021] The attention score is determined based on the query features, the weighted key features, and the scaling factor;

[0022] The attention scores are normalized to obtain the attention weights;

[0023] The context features are obtained by weighting the weighted value features based on the attention weights.

[0024] In one embodiment, before constructing a target base sequence of a predetermined length at the target site of the base sequence to be tested, the method further includes:

[0025] The raw electrical signals of each base corresponding to the nucleic acid data in the nanopore are detected; the raw electrical signals are converted into the sequence of the bases to be tested.

[0026] In one embodiment, the method is applied to a detection model, which includes a feature alignment network and a feature extraction network; the feature alignment network is used to align and fuse the base features, the target base sequence, and the original electrical signal; the feature extraction network is used to extract contextual features from the fused sequence features.

[0027] In one embodiment, the method further includes:

[0028] Collect a set of DNA methylation samples;

[0029] Based on the target sites in the methylated sample sequences of the DNA methylation sample set, a sample base sequence of the preset length is constructed;

[0030] Feature extraction is performed on the sample base sequence to obtain the base features of the sample base sequence;

[0031] The detection model is trained based on the sample base sequence, the base characteristics of the sample base sequence, and the original electrical signal.

[0032] In a second aspect, this application also provides a methylation detection device, the device comprising:

[0033] The construction module is used to construct a target base sequence of a preset length at the target site of the base sequence to be tested;

[0034] The extraction module is used to extract the base features corresponding to the target base sequence;

[0035] The processing module is used to align and fuse the base features, the target base sequence, and the original electrical signal of each base in the base sequence to be tested to obtain fused sequence features;

[0036] The representation module extracts context features from the fused sequence features to obtain context features;

[0037] The classification module is used to perform methylation classification based on the context features to obtain the classification result.

[0038] In one embodiment, the processing module is used to map each sub-feature in the base features, the target base sequence, and the original electrical signal to a target vector space to obtain fused sequence features.

[0039] In one embodiment, the representation module is further configured to perform position encoding on the fused sequence features to obtain a relative position encoding, where the relative position encoding is used to capture the relative positions between bases; weight the fused sequence features based on the relative position encoding to obtain weighted sequence features; and extract context features from the weighted sequence features.

[0040] In one embodiment, the representation module is further configured to perform a linear transformation on the weighted sequence features to obtain linear transformation features; the linear transformation features include query features, key features, and value features; the key features and the value features are weighted using relative position weights to obtain weighted key features and weighted value features; and the context features are extracted based on the query features, the weighted key features, and the weighted value features.

[0041] In one embodiment, the representation module is further configured to determine an attention score based on the query features, the weighted key features, and the scaling factor; normalize the attention score to obtain an attention weight; and weight the weighted value features based on the attention weight to obtain the context features.

[0042] In one embodiment, the sequencing module is further configured to detect the original electrical signals of each base corresponding to the nucleic acid data in the nanopore; and convert the original electrical signals into the sequence of the bases to be tested.

[0043] In one embodiment, the detection model includes a feature alignment network and a feature extraction network; the feature alignment network is used to align and fuse the base features, the target base sequence, and the original electrical signal; the feature extraction network is used to extract contextual features from the fused sequence features.

[0044] In one embodiment, the device further includes:

[0045] The training module is used to collect a DNA methylation sample set; construct a sample base sequence of a preset length based on the target sites in the methylation sample sequences of the DNA methylation sample set; extract features from the sample base sequence to obtain the base features of the sample base sequence; and train the detection model based on the sample base sequence, the base features of the sample base sequence, and the original electrical signal.

[0046] In a third aspect, this application also provides a computer device, the computer device including a memory and a processor, the memory storing a computer program, and the processor executing the computer program to implement the steps of the methylation detection method.

[0047] In a fourth aspect, this application also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the methylation detection method.

[0048] In a fifth aspect, this application also provides a computer program product, which includes a computer program that, when executed by a processor, implements the steps of the methylation detection method.

[0049] Details of one or more embodiments of this application are set forth in the following drawings and description. Other features and advantages of this application will become apparent from the specification, drawings, and claims.

Brief Description of the Drawings

[0050] Figure 1 shows the application environment of the methylation detection method in one embodiment;

[0051] Figure 2 is a flowchart illustrating a methylation detection method in one embodiment;

[0052] Figure 3 is a flowchart illustrating the methylation detection method in another embodiment;

[0053] Figure 4 is a flowchart illustrating a model training or methylation detection method in one embodiment;

[0054] Figure 5 is a structural block diagram of a methylation detection device in one embodiment;

[0055] Figure 6 is a structural block diagram of the methylation detection device in another embodiment;

[0056] Figure 7 is an internal structure diagram of a computer device in one embodiment.

Detailed Description

[0057] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.

[0058] The methylation detection method provided in this application embodiment can be applied to the application environment shown in Figure 1. The terminal 102 communicates with the server 104 via a network. A data storage system can store the data that the server 104 needs to process. The data storage system can be integrated onto the server 104, or it can be located in the cloud or on another network server.

[0059] The terminal 102 can be a laptop computer, desktop computer, or other medical testing device with a methylation detection program installed, so that the steps of the methylation detection method of this application can be executed on the terminal 102.

[0060] Server 104 can be used to store medical data or as a server with a methylation detection program installed. Therefore, the steps of the methylation detection method of this application can be executed through server 104. Server 104 can be a standalone physical server or a service node in a blockchain system. The service nodes in this blockchain system form a peer-to-peer (P2P) network, and the P2P protocol is an application layer protocol running on top of the Transmission Control Protocol (TCP). Furthermore, server 104 can also be a server cluster consisting of multiple physical servers.

[0061] In one embodiment, as shown in Figure 2, a methylation detection method is provided. This method can be executed by a computer device, such as by a server or terminal as shown in Figure 1, or by a server and terminal working together. Taking the execution of this method by the terminal in Figure 1 as an example, it includes the following steps:

[0062] S202, construct a target base sequence of a preset length at the target site of the base sequence to be tested.

[0063] The base sequence to be tested can be a combination of bases arranged in a certain order, such as the DNA base sequence to be tested.

[0064] The target site can be any CG site in the sequence to be tested. The preset length can be a preset sequence length, which can be r, and r can be a positive integer greater than 13.

[0065] For example, the terminal maps the raw signals of basecalling sequencing reads onto the nucleotide sequence, and then constructs a k-mer based on a length of r (r=13) for each target site.
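The window construction described above can be sketched in plain Python. This is a minimal illustration, not the implementation of this application: the function name `build_kmers`, the choice to center the window on the C of each CG site, and the decision to skip sites too close to either end of the read are all assumptions.

```python
def build_kmers(sequence, r=13):
    """Return (site_index, kmer) pairs for every CG site with a full window.

    Assumption: the window of length r is centered on the C of the CG site,
    and sites without r // 2 bases on both sides are skipped.
    """
    half = r // 2
    kmers = []
    for i in range(len(sequence) - 1):
        if sequence[i] == "C" and sequence[i + 1] == "G":
            start, end = i - half, i + half + 1
            # Keep only sites whose window fits inside the read.
            if start >= 0 and end <= len(sequence):
                kmers.append((i, sequence[start:end]))
    return kmers
```

With r = 13, each retained site yields six bases on either side of the central C.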

[0066] In one embodiment, before S202, the terminal can sequence the nucleic acid data to obtain the base sequence to be tested.

[0067] The nucleic acid data can be deoxyribonucleic acid (DNA) or ribonucleic acid (RNA).

[0068] In one embodiment, the terminal detects the original electrical signals of each base corresponding to the nucleic acid data in the nanopore; and converts the original electrical signals into the sequence of the bases to be tested. The original electrical signals can be read signals.

[0069] S204, extract the base features corresponding to the target base sequence.

[0070] The base features can be statistical features of the bases, including but not limited to the mean, variance, and length of the electrical signal corresponding to each base in the target base sequence.
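As a minimal sketch of these per-base statistics, the following plain-Python function computes one (mean, spread, length) triple per base. The function name and the input layout (one list of signal samples per base) are assumptions; the spread is computed as a population standard deviation to match the feature name base_std, although the text also refers to it as the variance.

```python
import statistics


def base_features(signals_per_base):
    """Compute (base_mean, base_std, base_signal_len) for each base.

    signals_per_base: list of non-empty lists of floats, one list per base
    (an assumed layout for the read signal already mapped to bases).
    """
    features = []
    for samples in signals_per_base:
        mean = statistics.fmean(samples)
        std = statistics.pstdev(samples)  # population standard deviation
        features.append((mean, std, len(samples)))
    return features
```

For a k-mer of length r, the result has r rows of three values, matching the r*3 feature described later in the text.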

[0071] Table 1 below shows the feature extraction results of one CG site within a read signal. The middle CG site of the k-mer is the methylation site to be trained / inferred. The window before and after the CG site contains r / 2 bases, forming a k-mer sequence of length r (i.e., the target base sequence). base_mean, base_std, and base_signal_len are the statistical features corresponding to the k-mer sequence (i.e., the statistical features of the read signal corresponding to the k-mer sequence). k_signals are the read signals corresponding to the k-mer sequence, and label is the methylation label, where 0 indicates unmethylated and 1 indicates methylated.

[0072] Table 1

[0073] S206, align and fuse the base features, the target base sequence, and the original electrical signal of each base in the base sequence to be tested to obtain the fused sequence features.

[0074] In one embodiment, the terminal can input the base features, the target base sequence, and the original electrical signal of each base in the base sequence to be tested into the detection model. The feature alignment network in the detection model aligns and fuses these three inputs to obtain the fused sequence features.

[0075] The detection model can be a network model used to detect methylation of the base sequence to be tested, including a feature alignment network and a feature extraction network. The feature alignment network is used to align and fuse base features, target base sequence and original electrical signal. The feature extraction network is used to extract contextual features from the fused sequence features.

[0076] This feature alignment network can be an embedding-level feature alignment network composed of multiple DNNs (Deep Neural Networks), where each DNN is computed as follows:

$$y = \mathrm{GELU}(f(x)), \qquad f(x) = W x + b$$

[0077] Here, GELU is the activation function, and W and b are trainable network parameters. In addition, a dropout layer is introduced after each DNN to prevent the overfitting that may occur when training multi-layer DNNs.
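A single DNN layer of this form, followed by dropout, might look like the sketch below. It is written in plain Python for readability and is only an illustration: the function names are hypothetical, the exact (erf-based) form of GELU is assumed, and inverted dropout is one common convention that the text does not specify.

```python
import math
import random


def gelu(z):
    """Exact GELU: z * Phi(z), where Phi is the standard normal CDF."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def dense_gelu(x, W, b):
    """Compute y = GELU(W x + b) for a vector x, weight rows W, and bias b."""
    return [gelu(sum(w * xi for w, xi in zip(row, x)) + bias)
            for row, bias in zip(W, b)]


def dropout(y, p, rng, training=True):
    """Inverted dropout: zero each unit with probability p, rescale the rest."""
    if not training or p == 0.0:
        return list(y)
    return [0.0 if rng.random() < p else v / (1.0 - p) for v in y]
```

At inference time, `training=False` leaves the activations unchanged, which is why the surviving units are rescaled during training.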

[0078] In one embodiment, the terminal can input each sub-feature in the base features, the target base sequence, and the original electrical signal of each base in the base sequence to be tested into the feature alignment network. The feature alignment network then maps the base features, the target base sequence, and the original electrical signals to the same vector space, i.e., the target vector space, to obtain the fused sequence features.

[0079] In one embodiment, before input to the feature alignment network, the terminal can first encode the target base sequence to obtain the corresponding encoding vector. The encoding vector is then concatenated with the base features and the original electrical signal of each base in the base sequence to be tested. For example, the k-mer sequence "GACGACCGCCACC" is converted into indices in the encoding vocabulary, i.e., encoded as the token input "3123122311211". It is then concatenated with the base features (i.e., the three statistical features base_mean, base_std, and base_signal_len) to form an r*3 (r=13) vector feature, where r is the length of the k-mer. Finally, the corresponding read signals are padded to obtain an r*16 vector feature, where 16 is the length of the read signal corresponding to each base. If the read signal corresponding to a base is shorter than 16, it is padded with zeros; if it is longer than 16, it is truncated.
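The input preparation described above (token encoding, r*3 statistics, and r*16 padded signals) can be sketched as follows. The vocabulary is an assumption: A=1, C=2, and G=3 are inferred from the leading digits of the example token string, and T=4 is hypothetical. The function names are likewise illustrative.

```python
# Assumed vocabulary; only A, C, and G can be inferred from the example.
VOCAB = {"A": 1, "C": 2, "G": 3, "T": 4}


def encode_kmer(kmer):
    """Map each base of the k-mer to its index in the vocabulary."""
    return [VOCAB[base] for base in kmer]


def pad_or_truncate(samples, length=16):
    """Zero-pad a per-base signal to `length`, or truncate it if longer."""
    clipped = list(samples[:length])
    return clipped + [0.0] * (length - len(clipped))


def build_inputs(kmer, stats, signals, length=16):
    """Return tokens (r), statistics (r x 3), and padded signals (r x length)."""
    tokens = encode_kmer(kmer)
    signal_matrix = [pad_or_truncate(s, length) for s in signals]
    return tokens, [list(row) for row in stats], signal_matrix
```

Fixing the per-base signal length keeps every input the same shape, so all three inputs can be batched and projected into a common vector space.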

[0080] S208, extract context features from the fused sequence features to obtain context features.

[0081] In one embodiment, the terminal can input the fused sequence features into the feature extraction network in the detection model, which extracts contextual features from the fused sequence features to obtain contextual features.

[0082] The feature extraction network can be a BERT-based context feature extraction network, which mainly consists of multiple self-attention mechanisms and feedforward networks. The self-attention mechanism replaces the recurrent calculation of forward or backward sequence representations in RNNs, and obtains the contextual associations of the sequence by constructing attention weights over the sequence input itself.

[0083] Contextual feature extraction can be a methylation modification characterization process, and the contextual feature can be a methylation modification characterization vector.

[0084] In one embodiment, the terminal can encode the fused sequence features and then input the resulting embedding vector into a feature extraction network. For example, for a k-mer token sequence in the fused sequence features, each token in the k-mer token sequence can be embedded and encoded, so that an r*1 token is encoded into an r*d embedding vector, where d is the dimension of the embedding vector.

[0085] In another embodiment, the terminal can perform positional encoding on the fused sequence features to obtain a relative positional encoding, where the relative positional encoding is used to capture the relative positions between bases; the fused sequence features are weighted based on the relative positional encoding to obtain weighted sequence features; and context features are extracted from the weighted sequence features.

[0086] Position embedding is used to characterize the association between the target base and other bases. Considering the signal differences of bases near the target base—for example, the signal influence of bases farther away is smaller, and vice versa—position embedding is introduced to improve the accuracy of methylation modification characterization, thereby effectively improving the accuracy of methylation detection.

[0087] For example, the relative position code can be calculated using the formula for relative position coding, which is as follows:

[0088] Here, t represents position t in the fusion sequence feature, 2i represents the position an even number of units away from position t, and 2i+1 represents the position an odd number of units away from position t. PE can capture the relative positions between bases, therefore signal fluctuations closer to the target base are more easily captured.
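The encoding formula itself is not reproduced in this text. Purely as an illustration, and assuming it follows the widely used sinusoidal form in which even indices 2i use a sine and odd indices 2i+1 use a cosine, a position table could be built as in the sketch below. The constant 10000, the reading of 2i and 2i+1 as embedding dimension indices, and the function name are assumptions and are not confirmed by the source.

```python
import math


def sinusoidal_encoding(num_positions, d_model, base=10000.0):
    """Build a num_positions x d_model table of sinusoidal encodings.

    Assumed form: PE(t, 2i) = sin(t / base^(2i / d_model)),
                  PE(t, 2i+1) = cos(t / base^(2i / d_model)).
    """
    table = []
    for t in range(num_positions):
        row = []
        for k in range(d_model):
            angle = t / base ** ((2 * (k // 2)) / d_model)
            row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
        table.append(row)
    return table
```

Under this assumed form, the encoding at an offset position is a fixed linear function of the encoding at position t, which is what lets such a table express relative positions.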

[0089] In one embodiment, the terminal can perform a linear transformation on the weighted sequence features to obtain linear transformation features; the linear transformation features include query features, key features, and value features; the key features and value features are weighted using relative position weights to obtain weighted key features and weighted value features; and context features are extracted based on the query features, weighted key features, and weighted value features.

[0090] The relative position weight can be the weight of the relative positions between each base. Specifically, this relative position weight can be the relative position weight between the target base and other bases. Therefore, signal fluctuations closer to the target base are more easily captured. The target base can be the base in the target base sequence that needs to be detected for methylation.

[0091] When performing methylation modification characterization, the terminal can first perform a linear transformation on the weighted sequence features to obtain the query, key, and value features (Q, K, V).

[0092] In one embodiment, the terminal determines an attention score based on the query features, the weighted key features, and a scaling factor; the attention score is normalized to obtain attention weights; and the weighted value features are weighted based on these attention weights to obtain context features. It should be noted that the feature extraction network employs a self-attention mechanism, which can perform a linear transformation on the weighted sequence features and use the resulting (Q, K, V) as input to calculate the attention weights. The specific formula is as follows:

$$Q = W_q x, \qquad K = W_k x, \qquad V = W_v x$$

[0093] Here, $W_q$, $W_k$, and $W_v$ are all network parameters, and x can be a fused sequence feature.

[0094] Calculating Q, K, and V using the above formula may result in the loss of relative positional information. To avoid this problem, this application introduces relative positional weights into the attention mechanism, and the corresponding formula can be:

$$\mathrm{clip}(x, s) = \max(-s, \min(s, x))$$

[0095] Here, the relative position weights (relative position encoding) are defined between bases $x_i$ and $x_j$. In $\mathrm{clip}(x, s)$, x is the distance between bases $x_i$ and $x_j$, and s is the maximum relative position considered within the context. The function restricts the relative position to the range [-s, s]; that is, when calculating the relative position weights between $x_i$ and $x_j$, only the encoding vectors within the range [-s, s] are considered. $W^K$ and $W^V$ denote the embedding tables used to encode the relative positional information of K and V in the model, respectively. Because relative position weights are introduced again when calculating the attention weights, the vector output by the self-attention mechanism still carries relative positional information even if the linear transformation loses the relative position encoding, which helps to improve the accuracy of methylation detection.
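The clipping function can be written directly, together with a lookup matrix that turns each pair of positions into a row index of an embedding table. This is a small sketch: the helper `relative_index_matrix` and the convention of shifting clipped distances by s, so that they index a table of 2s+1 rows, are assumptions.

```python
def clip(x, s):
    """Restrict the relative distance x to the range [-s, s]."""
    return max(-s, min(s, x))


def relative_index_matrix(r, s):
    """Entry [i][j] is clip(j - i, s) + s, a row index into a (2s + 1)-row table."""
    return [[clip(j - i, s) + s for j in range(r)] for i in range(r)]
```

All pairs farther apart than s share the same table row, so the number of relative position embeddings stays fixed regardless of the k-mer length.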

[0096] Therefore, the above formula for calculating attention weights is optimized as follows:

[0097] Here, $K'$ can be the weighted key features obtained by weighting the key features using relative position weights, $K'^{T}$ is the transpose of the weighted key features, and $V'$ can be the weighted value features obtained by weighting the value features using relative position weights.
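The optimized formula is not reproduced in this text. One way the described steps (score from Q, K′ and a scaling factor, normalization, then weighting of V′) could fit together is sketched below. It assumes that the relative position weights are added to the keys and values, that the scaling factor is the square root of the feature dimension, and that normalization is a softmax; none of these choices is confirmed by the source, and all names are hypothetical.

```python
import math


def clip(x, s):
    return max(-s, min(s, x))


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]


def relative_attention(Q, K, V, rel_k, rel_v, s):
    """Q, K: r x d lists; V: r x d_v; rel_k, rel_v: (2s + 1)-row embedding tables."""
    r, d, d_v = len(Q), len(Q[0]), len(V[0])
    scale = math.sqrt(d)
    context = []
    for i in range(r):
        scores = []
        for j in range(r):
            a_k = rel_k[clip(j - i, s) + s]
            k_prime = [k + a for k, a in zip(K[j], a_k)]  # weighted key (assumed additive)
            scores.append(sum(q * k for q, k in zip(Q[i], k_prime)) / scale)
        weights = softmax(scores)  # attention weights for position i
        out = [0.0] * d_v
        for j, w in enumerate(weights):
            a_v = rel_v[clip(j - i, s) + s]
            for c in range(d_v):
                out[c] += w * (V[j][c] + a_v[c])  # weighted value (assumed additive)
        context.append(out)
    return context
```

With all-zero embedding tables this reduces to ordinary scaled dot-product attention, which makes the contribution of the relative position terms easy to isolate.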

[0098] S210, methylation classification is performed based on contextual features to obtain the classification result.

[0099] In one embodiment, the terminal uses contextual features as input to a classification layer (softmax), which performs methylation classification and outputs the classification result of the target base methylation.
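A two-class softmax layer of this kind can be sketched as follows. The label convention (0 for unmethylated, 1 for methylated) follows the feature table description; the function name, the layer parameters W and b, and the use of a single context vector as input are assumptions, since the text does not state how the r context vectors are reduced to one.

```python
import math


def classify(context_vector, W, b):
    """Return (label, probabilities) from a two-class softmax layer.

    Label 0 is unmethylated and label 1 is methylated.
    """
    logits = [sum(w * x for w, x in zip(row, context_vector)) + bias
              for row, bias in zip(W, b)]
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs.index(max(probs)), probs
```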

[0100] After obtaining the classification results, if the methylation detection method is applied to the scenario of model performance evaluation, the classification results can be mapped to the reads of the reference sequence to evaluate the accuracy, recall, and F1 score of all reference sequence target sites.
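For reference, the three metrics can be computed from per-site labels and predictions as in the sketch below, where methylated (label 1) is treated as the positive class. The function name and the choice to return 0.0 when a denominator is zero are assumptions.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, recall, and F1 score with label 1 (methylated) as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "recall": recall, "f1": f1}
```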

[0101] To better understand the scheme of this application, further explanation is provided here with reference to Figure 3, as follows:

[0102] 1) Data acquisition: Using nucleic acid data, the nucleic acid dataset is sequenced (basecalling) to obtain the base sequence to be tested.

[0103] 2) Feature extraction: The raw signals of basecalling sequencing reads are mapped onto nucleotide sequences. Then, a base sequence (k-mer) of length r (r=13) is constructed for each target site (CG site), and features are extracted. This yields the network inputs: the target base sequence, the mean and variance of the electrical signal, and the raw electrical signal (default length 16).

[0104] 3) Model Detection: First, an embedding-level feature alignment network is used to align and fuse the input data. The fused sequence features are then used as input to the BERT context feature extraction network and characterized by methylation modification. Finally, the methylation modification characterization is used as input to the softmax classification layer, which outputs the classification result of the target base methylation, which can be either methylated or unmethylated.

[0105] It should be understood that current deep learning-based methods for DNA methylation detection, while considering the sequence features of nanopore electrical signals, cannot simultaneously account for the influence of signals near the genomic location when modeling those signals. For example, bidirectional RNNs can only model forward and reverse sequence events separately, and cannot effectively distinguish the contributions of near and far signals to the target genomic location. In reality, electrical signals closer to the target base are affected more strongly, while those farther away are affected less. Furthermore, current solutions primarily model the target base and the nanopore electrical signals in a pairwise manner, constructing multiple RNN networks to extract features from each input separately. This captures only higher-dimensional feature interactions (such as the methylation modification characterization) while neglecting the interaction between the original base sequence and the electrical signal, resulting in low methylation detection accuracy.

[0106] In this embodiment, a target base sequence of a preset length is constructed by targeting the target site of the base sequence to be tested. Base features corresponding to the target base sequence are extracted, and the sub-features in the base features, the target base sequence, and the original electrical signals of each base in the base sequence to be tested are aligned and fused. The low-dimensional feature space interaction between the target base sequence and the original electrical signals is fully considered. Then, context features are extracted from the fused sequence features, and the influence between each base in the fused sequence features is fully considered. Finally, methylation classification is performed based on the obtained context features, which can effectively improve the accuracy of DNA methylation detection.

[0107] In one embodiment, the terminal can also collect a DNA methylation sample set; construct a sample base sequence of a preset length based on the target sites in the methylation sample sequences of the DNA methylation sample set; extract features from the sample base sequence to obtain the base features of the sample base sequence; and train a detection model based on the sample base sequence, the base features of the sample base sequence, and the original electrical signal.

[0108] The DNA methylation sample set may include methylation sample sequences and corresponding raw electrical signals; in addition, the sample base sequence may be constructed based on the target sites in the methylation sample sequences of the DNA methylation sample set.

[0109] For example, the terminal can first collect a standard DNA dataset with 100% methylation or 100% hydroxylation at the CG site. All or part of this standard DNA dataset can be used as a DNA methylation sample set. A training set can be extracted from this DNA methylation sample set. Based on the target sites in the methylation sample sequences of the training set, a sample base sequence of a preset length can be constructed. Feature extraction can be performed on the sample base sequence to obtain its base features. Through the feature alignment network in the initial detection model, the sub-features in the base features, the sample base sequence, and the original electrical signals of each base in the sample base sequence are aligned and fused to obtain the training fused sequence features. Through the feature extraction network in the detection model, contextual features are extracted from the training fused sequence features to obtain training contextual features. Methylation classification is performed based on the training contextual features to obtain the classification result. Based on the loss between the classification result and the labels in the training set, the parameters of the feature alignment network and the feature extraction network are optimized.

[0110] The DNA methylation sample set can be a standard DNA dataset of 5-methylcytosine and 5-hydroxymethylcytosine that can be used for training, validation and testing. Each CG site in the methylation sample sequence of the DNA methylation sample set is 100% unmodified, 100% methylated or 100% hydroxymethylated.
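The loss between the classification result and the training labels can be illustrated with a short sketch. The text does not name the loss function, so softmax cross-entropy is used here purely as an assumption, with the three site states listed above as classes. The names `CLASSES`, `softmax`, `cross_entropy`, and `mean_loss` are hypothetical.

```python
import math

# Hypothetical label order; the text only lists the three site states.
CLASSES = ("unmodified", "5mC", "5hmC")

def softmax(logits):
    """Numerically stable softmax over one vector of class scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Negative log-probability assigned to the true label (assumed loss)."""
    probs = softmax(logits)
    return -math.log(probs[CLASSES.index(label)])

def mean_loss(batch_logits, batch_labels):
    """Average loss over a batch; this is the value an optimizer would minimize."""
    losses = [cross_entropy(l, y) for l, y in zip(batch_logits, batch_labels)]
    return sum(losses) / len(losses)

print(mean_loss([[2.0, 0.1, 0.1], [0.0, 3.0, 0.5]], ["unmodified", "5mC"]))
```

In a real training run the gradient of this loss would be propagated through both the feature extraction network and the feature alignment network, which is what the parameter optimization step above describes.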

[0111] The DNA methylation sample set is divided into a training set, a validation set, and a test set. The ratio of the training set, validation set, and test set can be 8:1:1.
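A minimal sketch of such an 8:1:1 split is shown below. The shuffling, the fixed seed, and the function name `split_dataset` are assumptions for illustration; the text only specifies the ratio.

```python
import random

def split_dataset(samples, ratios=(8, 1, 1), seed=0):
    """Shuffle samples and split them into training, validation, and test sets."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # any remainder goes to the test set
    return train, val, test

train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))
```

Splitting by read or by genomic region, rather than by individual site, may be preferable in practice to avoid leakage between the sets; the text does not say which unit is used.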

[0112] Because this application involves a model training process, a graphics processing unit (GPU) is required to accelerate training. The specific hardware and software development environments are shown in Table 1 below:

[0113] Table 1

[0114] For details regarding the training process, please refer to Figure 4, as follows:

[0115] (1) Data acquisition:

[0116] For the DNA standard datasets of 5-methylcytosine and 5-hydroxymethylcytosine, a single-channel nanopore sequencing device and a PC were used. The nanopore sequencing device was used to collect data and write it to the PC hard drive in real time.

[0117] (2) Feature extraction:

[0118] The feature extraction process requires mapping DNA reads onto the original electrical signals, obtaining the original signal input corresponding to k-mers, as well as the mean, variance, and length of the electrical signals corresponding to each base position. The PyTorch and NumPy libraries are mainly used to extract and save the features.
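The per-base statistics can be sketched as follows, assuming the read-to-signal mapping has already produced one (start, end) sample range per base. That input format and the name `base_signal_features` are assumptions. The sketch uses only the Python standard library rather than PyTorch or NumPy, and it returns the standard deviation, matching the `base_std` feature named later in the text.

```python
import statistics

def base_signal_features(signal, boundaries):
    """Return (mean, std, length) of the signal segment assigned to each base.

    signal: raw current samples of one read.
    boundaries: one (start, end) index pair per base, end exclusive;
                every segment is assumed to contain at least one sample.
    """
    features = []
    for start, end in boundaries:
        segment = signal[start:end]
        mean = statistics.fmean(segment)
        std = statistics.pstdev(segment)  # population standard deviation
        features.append((mean, std, len(segment)))
    return features

signal = [1.0, 3.0, 2.0, 2.0, 2.0, 5.0]
boundaries = [(0, 2), (2, 5), (5, 6)]
for feature in base_signal_features(signal, boundaries):
    print(feature)
```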

[0119] (3) Main algorithm:

[0120] 1) Embedding-level feature alignment network:

[0121] Given a target base sequence, base features (mean, variance, and electrical signal length), and the original electrical signals corresponding to the bases, the target base sequence is first encoded. For example, each base of the k-mer sequence "GACGACCGCCACC" is converted into its index in the encoding vocabulary, i.e., the token input is encoded as "3123122311211". It is then concatenated with the base features (i.e., the three statistical features base_mean, base_std, and base_signal_len), which form an r×3 (r = 13) vector feature, where r is the length of the k-mer. Finally, the corresponding read signals are padded to form an r×16 vector feature, where 16 is the length of the read signal kept for each base: if the read signal corresponding to a base is shorter than 16, it is padded with zeros; if it is longer than 16, it is truncated.
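The three inputs described above can be assembled as in the following sketch. The vocabulary `VOCAB` is hypothetical, since the text does not give the encoding table, so the indices it produces need not match the example string in the text. The function names are likewise illustrative.

```python
# Hypothetical encoding vocabulary; the actual table is not given in the text.
VOCAB = {"<pad>": 0, "A": 1, "C": 2, "G": 3, "T": 4}
SIGNAL_LEN = 16  # number of signal samples kept per base

def encode_kmer(kmer):
    """Map each base of the k-mer to its vocabulary index."""
    return [VOCAB[base] for base in kmer]

def pad_or_truncate(segment, length=SIGNAL_LEN):
    """Zero-pad short segments and truncate long ones to a fixed length."""
    clipped = list(segment[:length])
    return clipped + [0.0] * (length - len(clipped))

def build_inputs(kmer, stats, segments):
    """Return tokens (r), statistics (r x 3), and signals (r x 16) for one site."""
    if not (len(kmer) == len(stats) == len(segments)):
        raise ValueError("one statistics row and one signal segment per base")
    tokens = encode_kmer(kmer)
    signals = [pad_or_truncate(segment) for segment in segments]
    return tokens, [tuple(row) for row in stats], signals

kmer = "GACGACCGCCACC"
stats = [(80.0, 1.5, 10)] * len(kmer)    # made-up (mean, std, length) rows
segments = [[80.0] * 10] * len(kmer)     # made-up raw signal segments
tokens, stats_rows, signals = build_inputs(kmer, stats, segments)
print(tokens)
print(len(signals), len(signals[0]))
```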

[0122] This feature alignment network maps multiple input features to the same vector space and extracts the feature relationships between their embedding representations. Specifically, the feature alignment network consists of multiple deep neural networks (DNNs), where each DNN is calculated using the following formula:

$$y = \mathrm{Gelu}(f(x)), \qquad f(x) = W x + b$$

[0123] Here, Gelu is the activation function, and W and b are the trainable parameters of the network. A dropout layer is introduced after each DNN to prevent the overfitting that may occur when training multi-layer DNNs.
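One DNN block of this kind, y = Gelu(W*x + b) followed by dropout, can be sketched in plain Python as below. The exact GELU variant, the inverted-dropout scaling, and all function names are assumptions; a real implementation would more likely use a deep learning framework such as the PyTorch library mentioned earlier.

```python
import math
import random

def gelu(z):
    """Exact GELU: z * Phi(z), where Phi is the standard normal CDF."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def dense_gelu(x, W, b):
    """Compute y = Gelu(W*x + b) for one input vector x."""
    return [gelu(sum(w * v for w, v in zip(row, x)) + bias)
            for row, bias in zip(W, b)]

def dropout(y, p, rng, training=True):
    """Inverted dropout: zero each unit with probability p, rescale the rest."""
    if not training or p == 0.0:
        return list(y)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in y]

W = [[1.0, 0.0], [0.0, -1.0]]
b = [0.0, 0.0]
hidden = dense_gelu([1.0, 1.0], W, b)
print(hidden)
print(dropout(hidden, p=0.1, rng=random.Random(0)))
```

In the alignment network, one such block per input type (tokens, statistics, raw signal) would project each input to the same hidden size before the results are fused; the fusion operator itself is not specified in this passage.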

[0124] 2) BERT-based context feature extraction network:

[0125] BERT primarily consists of multiple self-attention mechanisms and a feedforward network. The self-attention mechanism replaces the recursive computation of forward or backward sequence representations in RNNs, acquiring the contextual association of the sequence by constructing an attention-weighted sum of the sequence input itself. Furthermore, base signals near the target base differ in their influence: the farther a base signal is from the target base, the smaller its influence, and vice versa. Relative position encoding is therefore introduced into the BERT input, with the following formula:

[0126] Here, t represents position t in the sequence, 2i represents the position an even number of units away from position t, and 2i+1 represents the position an odd number of units away from position t. The position encoding (PE) captures the relative positions between bases, so base signal fluctuations closer to the target base are more easily captured.
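The formula itself is not reproduced in this text. The symbols t, 2i, and 2i+1 match the sinusoidal position encoding commonly used with Transformer models, so one plausible reading, offered only as an assumption, is the following, where d (the embedding dimension) and the constant 10000 are taken from that common formulation rather than from this document:

```latex
PE_{(t,\,2i)} = \sin\!\left(\frac{t}{10000^{2i/d}}\right), \qquad
PE_{(t,\,2i+1)} = \cos\!\left(\frac{t}{10000^{2i/d}}\right)
```

In that formulation 2i and 2i+1 index the even and odd components of the encoding vector at position t, and the encoding of position t+k is a fixed linear function of the encoding of position t, which is what lets it express relative offsets between bases.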

[0127] The target base sequence, base features, and original electrical signal features are processed by the feature alignment network and then weighted with the relative position encoding to form the input to BERT. It is worth noting that the original BERT uses a self-attention mechanism, which requires a linear transformation of the sequence input x to obtain the query, key, and value (Q, K, V) as inputs for calculating the attention weights. The specific formula is as follows:

$$Q = W_q x, \qquad K = W_k x, \qquad V = W_v x$$

[0128] It should be noted that, since the linear transformation of x is a prerequisite for the attention mechanism, Q, K, and V may lose relative position information after the linear transformation. Therefore, relative position encoding is introduced into the attention mechanism, and the specific formula is as follows:

$$\mathrm{clip}(x, s) = \max(-s, \min(s, x))$$

[0129] Here, the relative position weights (relative position encoding) are defined between $x_i$ and $x_j$. In clip(x, s), x is the distance between bases $x_i$ and $x_j$, and s bounds the relative position that is considered: the function restricts the relative position to the range [-s, s], that is, when calculating the relative position weights between $x_i$ and $x_j$, only the encoded vectors within the range [-s, s] are considered. $W^K$ and $W^V$ represent the embedding tables used to encode the relative position information of K and V in the model, respectively. By introducing relative position weights again when calculating the attention weights, the vector output of the self-attention mechanism still carries relative position information even if the linear transformation loses the relative position encoding, which helps to improve the accuracy of methylation detection.
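The clipping step can be made concrete with a small sketch that turns every pair of positions into a row index of an embedding table with 2s+1 rows. Taking the signed distance as j - i and shifting it by s to obtain a non-negative index are assumptions; the function names are illustrative.

```python
def clip(x, s):
    """Restrict a relative distance x to the range [-s, s]."""
    return max(-s, min(s, x))

def relative_index_matrix(r, s):
    """index[i][j] = clip(j - i, s) + s, a row index into a table with 2*s + 1 rows."""
    return [[clip(j - i, s) + s for j in range(r)] for i in range(r)]

for row in relative_index_matrix(5, 2):
    print(row)
```

All pairs farther apart than s share the same table row, so the number of position embeddings stays fixed at 2s+1 regardless of the k-mer length.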

[0130] Therefore, the above formula for calculating attention weights is optimized as follows:

[0131] Here, $K'$ can be the weighted key features obtained by weighting the key features using the relative position weights, $K'^{T}$ is the transpose of the weighted key features, and $V'$ can be the weighted value features obtained by weighting the value features using the relative position weights.
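A sketch of attention over the weighted keys and values is given below. It assumes scaled dot-product attention of the form softmax(Q K'^T / sqrt(d)) V', and it assumes that "weighting" means adding the relative-position vector selected by clip(j - i, s) to each key and value. Neither choice is confirmed by the text, and the table names `rel_k` and `rel_v` are hypothetical stand-ins for the two embedding tables.

```python
import math

def clip(x, s):
    return max(-s, min(s, x))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(v - m) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relative_attention(Q, K, V, rel_k, rel_v, s):
    """Self-attention with relative-position offsets on keys and values.

    Q, K, V: lists of r vectors of length d.
    rel_k, rel_v: embedding tables with 2*s + 1 rows of length d.
    """
    r, d = len(Q), len(Q[0])
    outputs = []
    for i in range(r):
        scores = []
        for j in range(r):
            a_k = rel_k[clip(j - i, s) + s]
            k_prime = [k + a for k, a in zip(K[j], a_k)]  # weighted key K'
            dot = sum(q * k for q, k in zip(Q[i], k_prime))
            scores.append(dot / math.sqrt(d))             # scaled attention score
        weights = softmax(scores)                         # attention weights
        z = [0.0] * d
        for j in range(r):
            a_v = rel_v[clip(j - i, s) + s]
            for t in range(d):
                z[t] += weights[j] * (V[j][t] + a_v[t])   # weighted value V'
        outputs.append(z)
    return outputs

zeros = [[0.0, 0.0] for _ in range(3)]  # 2*s + 1 rows for s = 1
Q = K = [[0.0, 0.0], [0.0, 0.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(relative_attention(Q, K, V, zeros, zeros, s=1))
```

With all-zero tables the computation reduces to ordinary scaled dot-product attention, which gives a simple check that the relative-position terms are the only difference.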

[0132] (4) Model training:

[0133] The algorithm has many parameters, the model training parameters directly affect the model results, and results obtained with different network structure parameters and training parameters are not comparable. It is therefore crucial to provide specific model network structure parameters and training parameters.

[0134] It should be noted that the feature extraction process of this application can also construct more base features as input to the feature alignment network; in addition, the model structure of the feature alignment network and the feature extraction network of this application can be fine-tuned, such as the number of network layers and the parameters of the hidden layers.

[0135] The embedding-level feature alignment network proposed in this application can fully consider the interaction between the target base sequence and the original electrical signal in the low-dimensional feature space. In addition, a BERT-based context feature extraction network is used to characterize the base sequence with context features, and relative position encoding is introduced to fully explore the correlation between bases near the target base, which can effectively improve the accuracy of DNA methylation detection.

[0136] It should be understood that although the steps in the flowcharts of the above embodiments are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the above embodiments may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages of other steps.

[0137] Based on the same inventive concept, this application also provides a methylation detection device for implementing the methylation detection method described above. The solution provided by this device is similar to the solution described in the above method; therefore, the specific limitations in one or more methylation detection device embodiments provided below can be found in the limitations of the methylation detection method described above, and will not be repeated here.

[0138] In one embodiment, as shown in Figure 5, a methylation detection device is provided, comprising: a construction module 502, an extraction module 504, a processing module 506, a characterization module 508, and a classification module 510, wherein:

[0139] Construction module 502 is used to construct a target base sequence of a preset length at the target site of the base sequence to be tested;

[0140] Extraction module 504 is used to extract base features corresponding to the target base sequence;

[0141] The processing module 506 is used to align and fuse each sub-feature in the base features, the target base sequence, and the original electrical signals of each base in the base sequence to be tested, to obtain fused sequence features;

[0142] The characterization module 508 is used to perform context feature extraction on the fused sequence features to obtain context features;

[0143] Classification module 510 is used to perform methylation classification based on contextual features to obtain classification results.
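The way the five modules chain together can be sketched as a simple pipeline. Everything here is illustrative: the class name `MethylationDetector`, the callable signatures, and the helper `construct_window`, which assumes the target base sequence is a window of odd length centred on the target site (13 matches the k-mer example given earlier, but the centring is an assumption).

```python
from dataclasses import dataclass
from typing import Any, Callable

def construct_window(sequence: str, site: int, length: int = 13) -> str:
    """Return the window of `length` bases centred on `site` (length assumed odd)."""
    half = length // 2
    start, end = site - half, site + half + 1
    if start < 0 or end > len(sequence):
        raise ValueError("target site is too close to the end of the read")
    return sequence[start:end]

@dataclass
class MethylationDetector:
    construct: Callable[[str, int], str]     # construction module 502
    extract: Callable[[str], Any]            # extraction module 504
    fuse: Callable[[Any, str, Any], Any]     # processing module 506
    characterize: Callable[[Any], Any]       # characterization module 508
    classify: Callable[[Any], str]           # classification module 510

    def detect(self, sequence: str, site: int, raw_signal: Any) -> str:
        target = self.construct(sequence, site)
        base_features = self.extract(target)
        fused = self.fuse(base_features, target, raw_signal)
        context = self.characterize(fused)
        return self.classify(context)

# Placeholder stages stand in for the trained networks.
detector = MethylationDetector(
    construct=construct_window,
    extract=lambda target: [(0.0, 0.0, 0)] * len(target),
    fuse=lambda feats, target, signal: (feats, target, signal),
    characterize=lambda fused: fused,
    classify=lambda context: "5mC",
)
read = "TT" + "GACGACCGCCACC" + "TT"
print(construct_window(read, 8))
print(detector.detect(read, 8, raw_signal=[]))
```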

[0144] In one embodiment, the processing module 506 is used to map each sub-feature in the base features, the target base sequence, and the original electrical signal to the target vector space to obtain fused sequence features.

[0145] In one embodiment, the characterization module 508 is further configured to perform position encoding on the fused sequence features to obtain a relative position encoding, wherein the relative position encoding is used to capture the relative positions between the bases; weight the fused sequence features based on the relative position encoding to obtain weighted sequence features; and perform context feature extraction on the weighted sequence features to obtain context features.

[0146] In one embodiment, the characterization module 508 is further configured to perform a linear transformation on the weighted sequence features to obtain linear transformation features; the linear transformation features include query features, key features, and value features; the key features and value features are weighted using relative position weights to obtain weighted key features and weighted value features; and context features are extracted based on the query features, weighted key features, and weighted value features.

[0147] In one embodiment, the characterization module 508 is further configured to determine an attention score based on the query features, the weighted key features, and a scaling factor; normalize the attention score to obtain attention weights; and weight the weighted value features based on the attention weights to obtain context features.

[0148] In one embodiment, the device further includes a sequencing module 512, which is used to detect the original electrical signals of each base corresponding to the nucleic acid data in the nanopore, and to convert the original electrical signals into the base sequence to be tested.

[0149] In one embodiment, the detection model includes a feature alignment network and a feature extraction network; the feature alignment network is used to align and fuse base features, target base sequences, and original electrical signals; the feature extraction network is used to extract contextual features from the fused sequence features.

[0150] In one embodiment, as shown in Figure 6, the device further includes:

[0151] Training module 514 is used to collect DNA methylation sample sets; construct sample base sequences of a preset length based on target sites in the methylation sample sequences of the DNA methylation sample sets; extract features from the sample base sequences to obtain base features of the sample base sequences; and train the detection model based on the sample base sequences, base features of the sample base sequences, and the original electrical signals.

[0152] In the above embodiments, a target base sequence of a preset length is constructed for the target site of the base sequence to be tested, and the base features corresponding to the target base sequence are extracted. The sub-features in the base features, the target base sequence, and the original electrical signals of each base in the base sequence to be tested are aligned and fused, so that the interaction between the target base sequence and the original electrical signal in the low-dimensional feature space is fully considered. Moreover, context features are extracted from the fused sequence features, taking into account the influence between the bases in the fused sequence features. Therefore, methylation classification based on the obtained context features can effectively improve the accuracy of methylation detection results.

[0153] Each module in the aforementioned methylation detection device can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device, or stored in the memory of a computer device as software, so that the processor can call and execute the operations corresponding to each module.

[0154] In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram is shown in Figure 7. The computer device includes a processor, memory, input/output interface, communication interface, display unit, and input device. The processor, memory, and input/output interface are connected via a system bus, and the communication interface, display unit, and input device are also connected to the system bus via the input/output interface. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and internal memory. The non-volatile storage medium stores an operating system and computer programs. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium. The input/output interface of the computer device is used for exchanging information between the processor and external devices. The communication interface of the computer device is used for wired or wireless communication with external terminals; wireless communication can be achieved through Wi-Fi, mobile cellular networks, NFC (Near Field Communication), or other technologies. When the computer program is executed by the processor, it implements a methylation detection method. The display unit of the computer device is used to form a visible image. It can be a display screen, a projection device, or a virtual reality imaging device. The display screen can be an LCD screen or an e-ink screen. The input device of the computer device can be a touch layer covering the display screen, or buttons, trackballs, or touchpads set on the casing of the computer device, or external keyboards, touchpads, or mice, etc.

[0155] Those skilled in the art will understand that the structure shown in Figure 7 is merely a block diagram of a portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.

[0156] In one embodiment, a computer device is provided, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps of the methylation detection method described above.

[0157] In one embodiment, a computer-readable storage medium is provided having a computer program stored thereon, which, when executed by a processor, implements the steps of the methylation detection method described above.

[0158] In one embodiment, a computer program product is provided, including a computer program that, when executed by a processor, implements the steps of the methylation detection method described above.

[0159] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data shall comply with the relevant laws, regulations and standards of the relevant countries and regions.

[0160] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. Any references to memory, databases, or other media used in the embodiments provided in this application can include at least one of non-volatile and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetic random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc. Volatile memory can include random access memory (RAM) or external cache memory, etc. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases involved in the embodiments provided in this application may include at least one type of relational database and non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases. The processors involved in the embodiments provided in this application may be general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, quantum computing-based data processing logic devices, etc., and are not limited to these.

[0161] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

[0162] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of this patent application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.

Claims

1. A methylation detection method, performed by a computer device, wherein the method comprises: constructing a target base sequence of a preset length at the target site of the base sequence to be tested; extracting base features corresponding to the target base sequence; aligning and fusing the base features, the target base sequence, and the original electrical signals of each base in the base sequence to be tested to obtain fused sequence features; performing context feature extraction on the fused sequence features to obtain context features; and performing methylation classification based on the context features to obtain a classification result.

2. The method according to claim 1, wherein aligning and fusing the base features, the target base sequence, and the original electrical signals of each base in the base sequence to be tested to obtain the fused sequence features comprises: mapping the sub-features in the base features, the target base sequence, and the original electrical signals to a target vector space to obtain the fused sequence features.

3. The method according to claim 1, wherein performing context feature extraction on the fused sequence features to obtain the context features comprises: performing position encoding on the fused sequence features to obtain a relative position encoding, wherein the relative position encoding is used to capture the relative positions between the bases; weighting the fused sequence features based on the relative position encoding to obtain weighted sequence features; and performing context feature extraction on the weighted sequence features to obtain the context features.

4. The method according to claim 3, wherein performing context feature extraction on the weighted sequence features to obtain the context features comprises: performing a linear transformation on the weighted sequence features to obtain linearly transformed features, the linearly transformed features including query features, key features, and value features; weighting the key features and the value features using relative position weights to obtain weighted key features and weighted value features; and extracting the context features based on the query features, the weighted key features, and the weighted value features.

5. The method according to claim 4, wherein extracting the context features based on the query features, the weighted key features, and the weighted value features comprises: determining an attention score based on the query features, the weighted key features, and a scaling factor; normalizing the attention score to obtain attention weights; and weighting the weighted value features based on the attention weights to obtain the context features.

6. The method according to any one of claims 1 to 5, wherein, before constructing the target base sequence of the preset length at the target site of the base sequence to be tested, the method further comprises: detecting the original electrical signals of each base corresponding to the nucleic acid data in the nanopore; and converting the original electrical signals into the base sequence to be tested.

7. The method according to any one of claims 1 to 5, applied to a detection model, wherein the detection model includes a feature alignment network and a feature extraction network; the feature alignment network is used to align and fuse the base features, the target base sequence, and the original electrical signals; and the feature extraction network is used to extract context features from the fused sequence features.

8. The method according to claim 7, wherein the method further comprises: collecting a DNA methylation sample set; constructing a sample base sequence of the preset length based on the target sites in the methylation sample sequences of the DNA methylation sample set; performing feature extraction on the sample base sequence to obtain the base features of the sample base sequence; and training the detection model based on the sample base sequence, the base features of the sample base sequence, and the original electrical signals.

9. A methylation detection device, wherein the device comprises: a construction module, used to construct a target base sequence of a preset length at the target site of the base sequence to be tested; an extraction module, used to extract base features corresponding to the target base sequence; a processing module, used to align and fuse the base features, the target base sequence, and the original electrical signals of each base in the base sequence to be tested to obtain fused sequence features; a characterization module, used to perform context feature extraction on the fused sequence features to obtain context features; and a classification module, used to perform methylation classification based on the context features to obtain a classification result.

10. A computer device comprising a memory and a processor, the memory storing a computer program, wherein, when the processor executes the computer program, it implements the steps of the method according to any one of claims 1 to 8.

11. A computer-readable storage medium having a computer program stored thereon, wherein, when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 8.

12. A computer program product comprising a computer program, wherein, when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 8.