Method and apparatus for base recognition, electronic device, and storage medium

By using a deep learning model to process the fluorescence matrix for spatial crosstalk and contextual error correction, the problem of insufficient base recognition accuracy in next-generation sequencing technology is solved, achieving efficient and reliable base recognition results, which are suitable for high-throughput sequencing environments.

CN119694407B · Active · Publication Date: 2026-06-19 · GENEMIND BIOSCIENCES CO LTD +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
GENEMIND BIOSCIENCES CO LTD
Filing Date
2024-10-22
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing next-generation sequencing technologies face challenges in base recognition accuracy, especially on high-density array chips. Due to signal crosstalk between fluorescent clusters, resolution limitations of optical imaging systems, and lag and lead phenomena in the sequencing reaction, data accuracy and reliability are insufficient.

Method used

A deep learning model is used to process the fluorescence matrix. By acquiring the position and brightness information of the fluorescence clusters, spatial crosstalk and contextual effects are corrected. The signal of each fluorescence cluster is processed independently, and the feature matrix is used as input to the model for base identification.

Benefits of technology

It improves the accuracy and efficiency of base identification, reduces sequencing errors, enhances data reliability and consistency, and is suitable for high-throughput sequencing environments.

Abstract

This application discloses a method, apparatus, computing device, and storage medium for base identification, belonging to the field of bioinformatics. The base identification method includes: acquiring a feature matrix of a sequencing-by-synthesis reaction, the feature matrix comprising a fluorescence matrix composed of the positional and brightness information of multiple fluorescent clusters; and inputting the feature matrix into a trained prediction model to output the base identification results of the multiple fluorescent clusters. This method can fully utilize the spatial positional information of the fluorescent clusters to correct the base signal, effectively improving the accuracy and reliability of the sequencing results.

Description

Technical Field

[0001] This application relates to the field of bioinformatics, and in particular to methods, apparatus, electronic devices and storage media for base identification.

Background Technology

[0002] In the field of bioinformatics, gene sequencing technology is a crucial scientific tool that allows researchers to accurately read the genetic information of organisms. With technological advancements, next-generation sequencing (NGS) has become mainstream, capable of generating large amounts of genetic data in a short time and at a relatively low cost. Examples of NGS techniques include sequencing by synthesis (SBS), which uses multiple rounds of detection of fluorescently labeled nucleotides to determine the base sequence on a DNA template.

[0003] However, despite significant advancements in efficiency and cost reduction, next-generation sequencing (NGS) technology still faces challenges in base identification accuracy. In practical sequencing processes, crosstalk from fluorescence signals, resolution limitations of optical imaging systems, and lag and lead phenomena in the sequencing reaction can all lead to base identification errors, affecting data accuracy and the reliability of subsequent analysis. These problems are particularly pronounced in high-density array-based chips, where the reduced distance between fluorescent clusters and spatial resolution limitations exacerbate signal crosstalk. In other words, achieving both high-throughput sequencing data and high-accuracy sequencing data is often a difficult balance to strike.

[0004] Therefore, current base identification methods still need further improvement.

Summary of the Invention

[0005] This application aims to solve at least one of the aforementioned technical problems. To this end, this application proposes a base identification method that effectively improves the accuracy of base identification.

[0006] Specifically, this application provides the following technical solution:

[0007] In a first aspect of this application, a method for base identification is proposed. According to an embodiment of this application, the method includes: acquiring a feature matrix of a sequencing-by-synthesis reaction, the feature matrix including a fluorescence matrix composed of positional and brightness information of multiple fluorescent clusters; and inputting the feature matrix into a trained prediction model to output base identification results of the multiple fluorescent clusters.

[0008] According to the embodiments of this application, the method can make full use of the spatial location information of fluorescent clusters to correct base signals, effectively improving the accuracy and reliability of sequencing results.

[0009] In a second aspect, this application proposes an apparatus for base identification. According to an embodiment of this application, the apparatus includes: a predictive feature acquisition unit for acquiring a feature matrix of a sequencing-by-synthesis reaction, the feature matrix including a fluorescence matrix composed of positional and brightness information of multiple fluorescent clusters; and a base identification unit for inputting the feature matrix into a trained predictive model and outputting base identification results for the multiple fluorescent clusters.

[0010] According to embodiments of this application, the device can effectively implement the base identification method of the first aspect, effectively improve base identification efficiency, and improve the accuracy and reliability of sequencing results.

[0011] In a third aspect of this application, a computer program product is provided. According to an embodiment of this application, the computer program product includes computer instructions that, when some or all of them are executed on a computer, cause the method for base recognition as described in the first aspect to be performed.

[0012] In a fourth aspect, this application proposes a computing device. According to an embodiment of this application, the device includes: a processor and a memory; the memory for storing a computer program; and the processor for executing the computer program to implement the method for base recognition as described in the first aspect.

[0013] In a fifth aspect, this application provides a computer-readable storage medium. According to an embodiment of this application, the storage medium includes computer instructions that, when executed by a computer, cause the computer to implement the method for base identification as described in the first aspect.

[0014] In some examples of this application, the aforementioned computer program products, computing devices, and computer-readable storage media achieve highly efficient automation through the automatic execution of base recognition methods using computer instructions, thereby improving recognition efficiency and accuracy. Furthermore, the instruction-based nature of these methods ensures high consistency and reliability across various environments.

[0015] It should be noted that the features and advantages described in this document for any one aspect apply equally to the other aspects.

[0016] Additional aspects and advantages of this application will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of this application.

Description of the Drawings

[0017] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0018] In the drawings:

[0019] Figure 1 shows a schematic flowchart of a base identification method according to an embodiment of this application;

[0020] Figure 2 shows a flowchart of invoking a prediction model to process input data and obtain base identification results according to an embodiment of this application;

[0021] Figure 3 shows a schematic diagram of converting an entire image into a set of blocks and writing the set of blocks as a matrix to serve as input data, according to an embodiment of this application;

[0022] Figure 4 shows a flowchart of training a prediction model according to an embodiment of this application;

[0023] Figure 5 shows a schematic flowchart of obtaining base identification results by processing input data with U-Net according to an embodiment of this application;

[0024] Figure 6 shows a schematic diagram of a U-Net structure with an added CBAM module according to an embodiment of this application;

[0025] Figure 7 shows a schematic diagram of the structure of an input convolutional layer according to an embodiment of this application;

[0026] Figure 8 shows a schematic diagram of the structure of a CBAM layer according to an embodiment of this application;

[0027] Figure 9 shows a schematic diagram of the structure of a downsampling layer according to an embodiment of this application;

[0028] Figure 10 shows a schematic diagram of the structure of an upsampling layer according to an embodiment of this application;

[0029] Figure 11 shows a schematic diagram of a base recognition device according to another embodiment of this application;

[0030] Figure 12 shows a schematic diagram of the structure of an electronic device according to another embodiment of this application.

Detailed Description

[0031] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of this application.

[0032] In this application, the terms "first," "second," etc., are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number or order of the indicated technical features. In the description of this application, "a plurality of" means two or more, unless otherwise expressly defined.

[0033] The term "sequencing" in this document refers to gene sequencing, that is, the determination of the nucleotide sequence of nucleic acid molecules, including DNA sequencing and/or RNA sequencing, and including long-fragment sequencing and/or short-fragment sequencing. Sequencing is the identification and determination of the types of bases at multiple consecutive or non-consecutive positions on a nucleic acid molecule. In some examples of this application, sequencing is sequencing by synthesis (SBS), which includes the process of nucleotides or nucleotide analogs binding to a template, also known as a base extension reaction. Furthermore, the term "sequencing by synthesis" as used in this application refers to the process of identifying the type of at least one base on a template through a single round of sequencing, and also covers sequencing approaches similar to SBS, such as sequencing by hybridization (SBH) and sequencing by ligation (SBL).

[0034] According to embodiments of this application, a single sequencing run typically involves multiple repeats. For example, in SBS using reversible terminators, one repeat includes an extension reaction, signal acquisition, and excision; such a repeat can also be referred to as a cycle. Sequencing generally includes multiple cycles, each cycle involving the identification of one or more bases linked to or incorporated into a series of templates. A sequencing cycle, also known as a "sequencing round," can be defined as the process of completing one extension of the four nucleotides/bases. In other words, a sequencing cycle can refer to the process of identifying and detecting a single base position on any template.

[0035] For example, a sequencing run includes the following steps: under conditions suitable for polymerization, four nucleotides (including any analogues) are sequentially or simultaneously contacted with a template to induce a base extension reaction. The signal from the base extension reaction is captured or acquired. According to base pairing principles, such as under the catalysis of DNA polymerase, the substrate (nucleotide or analogue) is ligated to sequencing primers and / or the template, pairing with a base at a specific position on the template. A sequencing run may include one or more base extensions (repeat). For example, if four nucleotides are added sequentially to the reaction system, with base extension and signal acquisition occurring after each addition, a sequencing run may contain four base extensions. If nucleotides are added in specific combinations (e.g., pairwise or triadic combinations), with base extension and signal acquisition occurring separately for each combination, a sequencing run may contain two base extensions. If the four nucleotides are added to the reaction system simultaneously, with base extension and signal acquisition completed in one step, a sequencing run contains one base extension.

[0036] In this application, the term "fluorescent cluster" refers to the nucleic acid molecule to be tested or the template; in some embodiments, it specifically refers to a collection of nucleic acid molecule clones that have generated a specified sequencing reaction signal. A so-called nucleic acid molecule clone to be tested contains multiple nucleic acid molecules with the same nucleotide sequence, and is also called an amplicon, cluster, group, DNA nanosphere, etc. These fluorescent clusters emit detectable signals in the reaction through fluorescently labeled bases or other markers, thereby enabling the identification of nucleotide sequences on the DNA template through signal detection. Specifically, the fluorescent clusters are, for example, imaged and detected during sequencing, and the base arrangement order of the DNA template at the corresponding position is determined based on the intensity and location information of the detected signal.

[0037] In this application, the term "fluorescence matrix" refers to matrix-like data composed of the position and brightness information of multiple fluorescent clusters. Each row or column represents a fluorescent cluster, and its position on the chip surface is indicated by position coordinates in the matrix, while the brightness information represents the signal intensity of that fluorescent cluster. The fluorescence matrix is constructed based on the original image obtained from chip imaging or the original image after processing, such as denoising and background removal.

[0038] The location and intensity (fluorescence brightness) of fluorescent clusters are determined from the original image or from the processed image. Fluorescence brightness can be a true, objective absolute value or a relative value. It includes pixel values or sub-pixel values at the cluster location, as well as transformations of those values, such as magnification, reduction, or rescaling according to a fixed scale or relationship, such as normalization. Generally, when the intensities or pixel values of several images, bright spots (fluorescent clusters), or locations are involved, they have all undergone the same processing, for example all being objective pixel values or all being pixel values after the same transformation. When information is extracted or analyzed from several images of the same region, such as the same field of view, it is preferable to first align these images into the same coordinate system. The term "pixel" or "pixel value" includes, but is not limited to, pixels and sub-pixels. In some examples of this application, the coordinates of the fluorescent clusters are accurate to the sub-pixel level. For example, the brightness value at the sub-pixel level can be determined by interpolation. This provides more accurate, stable, and reliable training data and/or input data, which is beneficial for building a model with accurate and reliable prediction results or for obtaining accurate prediction results based on the built model.
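As an illustration of determining sub-pixel brightness by interpolation, the following is a minimal sketch using bilinear interpolation. The application does not state which interpolation scheme is used, so bilinear interpolation, the function name `bilinear_brightness`, and the (x = column, y = row) convention are all assumptions made for this example.

```python
import math
from typing import List


def bilinear_brightness(image: List[List[float]], x: float, y: float) -> float:
    """Estimate brightness at sub-pixel coordinate (x, y); x is the column axis, y the row axis."""
    height, width = len(image), len(image[0])
    if not (0 <= x <= width - 1 and 0 <= y <= height - 1):
        raise ValueError("coordinate lies outside the image")
    # Integer pixel corners surrounding the sub-pixel location (clamped at the border).
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = x - x0, y - y0
    # Interpolate along x on the two bracketing rows, then along y between them.
    top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
    bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


image = [[10.0, 20.0],
         [30.0, 40.0]]
center = bilinear_brightness(image, 0.5, 0.5)  # midway between all four pixels
```

In this toy image the value midway between the four pixels is their average. A real pipeline would apply the same lookup at every cluster center after the images have been aligned to a common coordinate system.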

[0039] The aforementioned "bright spots" or "peaks" refer to locations in an image where the signal is greater than the background, with a typical bright spot occupying at least one pixel. At least a portion of the bright spots are fluorescent clusters reflecting a specified biochemical reaction. In some embodiments, fluorescent clusters and bright spots can be used interchangeably, as will be understood by those skilled in the art from the context.

[0040] In this application, the "prediction model" is a computational model or algorithm that automatically predicts classification results by processing and analyzing input data. In some examples of this application, the prediction model is a single model, such as a neural network capable of semantic segmentation (sometimes simply referred to as a semantic segmentation model); in other examples, the prediction model is a cascaded model, consisting of multiple cascaded machine learning sub-models. Each machine learning sub-model can employ various algorithms and techniques, such as deep learning / neural networks, support vector machines, decision trees, random forests, etc., and can be trained and optimized through supervised learning, unsupervised learning, or reinforcement learning.

[0041] The chip referred to in the embodiments of this application, also known as a surface, includes random chips and array chips. The array chip, also known as a patterned chip or surface, has a large number of tiny reaction units (such as holes or micropores) arranged regularly on it, allowing nucleic acid molecules or clusters to be tested to be attached within the pores, thereby achieving high-throughput detection and analysis of nucleic acid molecules. Specifically, for example, DNA molecules to be tested are attached to the chip surface, causing the nucleic acid molecules on the surface to react and detecting the reaction signals from the surface, for example, by imaging the surface, and based on the processing and analysis of these images, at least a portion of the sequence of the DNA molecule to be tested can be determined.

[0042] Furthermore, for patterned surfaces, the positions and relative distances of fluorescent clusters / pores are predetermined, or the relative positional relationship between the reaction region and the labeled region (tracer) is known. Generally, the position of the labeled region can be determined by identifying the signal from the labeled region, and then the position of each pore (fluorescent cluster) in the reaction region can be determined.
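The idea of deriving well positions from a located marker can be sketched as follows. This is a hypothetical example that assumes a square lattice with a constant pitch and a known offset from the marker to the first well; the function `well_centers` and all of its parameters are illustrative names, not taken from the application, and real chip layouts may differ (for example hexagonal packing).

```python
from typing import List, Tuple


def well_centers(marker_xy: Tuple[float, float], offset_xy: Tuple[float, float],
                 pitch: float, n_rows: int, n_cols: int) -> List[List[Tuple[float, float]]]:
    """Return (x, y) image coordinates of each well centre on an assumed square lattice."""
    # The first well sits at a known, design-defined offset from the detected marker.
    origin_x = marker_xy[0] + offset_xy[0]
    origin_y = marker_xy[1] + offset_xy[1]
    return [[(origin_x + col * pitch, origin_y + row * pitch) for col in range(n_cols)]
            for row in range(n_rows)]


centers = well_centers(marker_xy=(100.0, 50.0), offset_xy=(4.0, 4.0),
                       pitch=2.5, n_rows=2, n_cols=3)
```

Because the pitch need not be an integer number of pixels, the resulting centers are generally sub-pixel coordinates.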

[0043] Understandably, in pursuit of higher throughput, it is generally desirable to minimize, or even eliminate, the gaps between wells or to partially overlap them (multiple fluorescent clusters connected within a single well). However, limitations imposed by detection methods, such as the diffraction limit inherent in optical imaging, mean that smaller spacing between different fluorescent clusters makes the detection and identification of a specific cluster signal more susceptible to interference from signals from other clusters or its own context. Therefore, this application provides a base identification scheme based on a prediction model, which can eliminate or reduce the impact of various crosstalk interferences on the accuracy of base identification results.

[0044] This application is based on the inventor's discovery, testing, and verification of the following problems:

[0045] In high-throughput sequencing, sequencing results are affected by adjacent bases from the same or neighboring fluorescent clusters. Factors such as phase imbalance within the same fluorescent cluster (asynchronous reactions of multiple molecules with the same sequence within a single fluorescent cluster), color difference, energy transfer of signals from the same or different fluorescent clusters, crosstalk, and biochemical reaction efficiency often affect the accurate identification or detection of the signal from a given fluorescent cluster in a given round of reaction. Therefore, corrections are often performed for this type of signal interference, such as phasing/prephasing correction and crosstalk correction. In traditional algorithms for base identification in array-based chips, the location of the holes (fluorescent clusters) on the image is first determined, then the brightness information at each hole location is extracted, and finally the base identification result for each hole location is determined based on these brightness characteristics. Traditional base identification methods do also consider the influence of signals from preceding and following reactions on the detection of the current reaction signal, for example by performing phasing correction and prephasing correction. However, these methods typically use a uniform correction parameter, meaning that a single parameter is used to correct the brightness information of all fluorescent clusters in that reaction. This makes it difficult to accurately correct the signals of individual fluorescent clusters. Furthermore, traditional base identification algorithms usually process the signal of each fluorescent cluster independently, determining the base type by extracting fluorescence signal features. Therefore, traditional methods do not fully consider the influence of the spatial location information of each fluorescent cluster and the context of the cluster on the occurrence and identification of the current reaction signal for a given fluorescent cluster.
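To make the limitation concrete, here is a minimal sketch of a per-well call in the traditional style. The application does not specify the decision rule used by traditional algorithms; picking the brightest of four channels is an assumption made for illustration, and `call_brightest_channel` is a hypothetical name.

```python
from typing import Dict, List

CHANNELS = ("A", "C", "G", "T")


def call_brightest_channel(channel_matrices: Dict[str, List[List[float]]]) -> List[List[str]]:
    """For each well, call the base whose channel has the highest brightness."""
    n_rows = len(channel_matrices["A"])
    n_cols = len(channel_matrices["A"][0])
    calls = []
    for r in range(n_rows):
        row_calls = []
        for c in range(n_cols):
            # Only this well's four values are inspected: neighbours and other cycles are ignored.
            row_calls.append(max(CHANNELS, key=lambda base: channel_matrices[base][r][c]))
        calls.append(row_calls)
    return calls


channels = {"A": [[90.0, 5.0]], "C": [[10.0, 7.0]], "G": [[8.0, 80.0]], "T": [[3.0, 60.0]]}
calls = call_brightest_channel(channels)
```

In the second well the G and T channels are both bright. A rule that looks at one well in isolation cannot tell whether the T signal is genuine or leaked from a neighboring cluster, which is the gap the paragraph above describes.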

[0046] To address the aforementioned shortcomings, the inventors of this application propose a method for base identification using the brightness information of the preceding and following rounds on a single fluorescent cluster basis. Specifically, the inventors propose recording the fluorescence brightness of each fluorescent cluster in an array chip in a matrix form according to its spatial arrangement, processing this matrix using a deep learning model, and considering the contextual information of the brightness matrices of adjacent fluorescent clusters. Using the deep learning model, spatial fluorescence crosstalk between adjacent fluorescent clusters is corrected, while the contextual influence of each fluorescent cluster is processed independently, including phasing correction, prephasing correction, the influence of contextual bases (or bases on the template) on the current base fluorescence brightness, and the elimination of pattern error bias caused by contextual base arrangement characteristics. This completes the invention, and the inventors have found that this method has a significant advantage in processing efficiency. Because only the brightness information at the center of the fluorescent cluster is input into the model in matrix form, compared to directly inputting the original image into traditional algorithms, the data size is reduced, improving the algorithm's running efficiency and computation speed.

[0047] Specifically, in the first aspect, this application proposes a method for base recognition. Referring to Figures 1-9, the method includes:

[0048] S100: Obtain the feature matrix of the sequencing-by-synthesis reaction. The feature matrix includes a fluorescence matrix composed of the position and brightness information of multiple fluorescent clusters.

[0049] Sequencing by Synthesis (SBS) is a high-throughput sequencing technology commonly used in next-generation sequencing platforms. In SBS, DNA fragments are immobilized on a microarray, and complementary strands are synthesized by progressively adding nucleotides with different fluorescent labels. The type of nucleotide added each time (A, T, C, or G) is detected and recorded by the color of its fluorescent tag. As the sequencing process progresses, the base sequence of the target DNA template can be progressively constructed by continuously adding nucleotides and recording the fluorescence signal.

[0050] Sequencing by synthesis based on surface fluorescence imaging involves multiple rounds of sequencing reactions. Each round includes acquiring an image of a designated region (fluorescent clusters) on the surface after the extension reaction, and then analyzing and processing the image of the target region for that round, or the images from multiple rounds before and after it, to determine the extension reaction information of the target region. In this step, a feature matrix of the sequencing-by-synthesis reaction is obtained. This feature matrix is a simplified representation of the extension reaction information of the fluorescent clusters on the designated region image or surface, reflecting the target signal of the fluorescent clusters in one or more images of a certain surface region in a certain round. Specifically, the sequencing data is provided by the fluorescence information of each fluorescent cluster. A fluorescent cluster is a group of fluorescence signals generated by a single DNA template fixed on the chip and amplified. For array-type surfaces, the fluorescent clusters (or the holes to which fluorescent clusters are attached) are regularly arranged, and the spatial relationship between the reaction area where the fluorescent clusters are located and the labeled area (tracer area) is known. Thus, by identifying the location of the labeled area, the location of each fluorescent cluster can be quickly determined. According to the embodiments of this application, the brightness information refers to the intensity of the signal from the fluorescent cluster. Taking four-color fluorescence sequencing as an example, each field of view (FOV) yields four images in each round of sequencing, namely the fluorescence images taken under the four fluorescence channels A, G, C, and T. By further analyzing these fluorescence images, the signal intensity of each fluorescent cluster can be obtained and, combined with the location information of each fluorescent cluster, a series of fluorescence matrices can be obtained for subsequent analysis.

[0051] According to embodiments of this application, in this step, the obtained feature matrix includes a fluorescence matrix composed of the position and brightness information of multiple fluorescent clusters. The fluorescence matrix is a data structure containing the position and brightness information of all fluorescent clusters. Each element in the matrix corresponds to a specific fluorescent cluster. The position of the element in the matrix reflects the position of the cluster on the image or surface, and the value of the element reflects the brightness information of the cluster. By processing the fluorescence matrix input data through a prediction model, the type of bases connected to or incorporated into each fluorescent cluster in a specified field of view can be quickly and accurately predicted. According to embodiments of this application, the sequencing-by-synthesis reaction is performed on an array-type chip, and the position information is determined based on the center position of the aperture where the fluorescent cluster is located. According to embodiments of this application, the brightness information is determined based on the original fluorescence brightness of the fluorescent cluster.
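The element-to-cluster mapping described above can be sketched as follows. This is a minimal illustration that assumes each cluster has already been assigned an integer grid index (row, column) from its well position; the tuple layout, the fill value for empty wells, and the name `build_fluorescence_matrix` are hypothetical.

```python
from typing import List, Tuple

Cluster = Tuple[int, int, float]  # (grid_row, grid_col, brightness); hypothetical layout


def build_fluorescence_matrix(clusters: List[Cluster], n_rows: int, n_cols: int,
                              fill: float = 0.0) -> List[List[float]]:
    """Place each cluster's brightness at the matrix cell matching its grid position."""
    matrix = [[fill] * n_cols for _ in range(n_rows)]
    for row, col, brightness in clusters:
        if not (0 <= row < n_rows and 0 <= col < n_cols):
            raise ValueError(f"cluster at ({row}, {col}) lies outside the grid")
        matrix[row][col] = brightness
    return matrix


clusters = [(0, 0, 120.0), (0, 1, 15.5), (1, 1, 98.0)]
fluorescence = build_fluorescence_matrix(clusters, n_rows=2, n_cols=2)
```

One such matrix would be built per fluorescence channel and per sequencing round. Each matrix holds one value per well rather than one per pixel, so it is much smaller than the image it summarizes.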

[0052] According to embodiments of this application, the brightness information is obtained by performing at least one of background removal and correction operations on the original fluorescence brightness. The correction operation includes at least one of phasing correction, prephasing correction, and crosstalk correction. The correction operation is not particularly limited and can be performed using any method that determines correction coefficients. Specifically, in a platform for determining nucleic acid sequences based on optical imaging, after an image is acquired, the position of each hole on the fluorescence image is first determined according to the known chip design layout. Then the brightness information at the hole position, i.e., the fluorescence signal of the fluorescent cluster in that hole, is extracted. Next, the base identification result at each hole is determined from the brightness characteristics at that hole position. Then, by examining the brightness characteristics of the preceding and following cycles, phasing correction (i.e., for phase loss due to incomplete reaction) and prephasing correction (i.e., for phase loss due to failure to block or reaction of more than one base) are performed. For example, the brightness information of the preceding and following cycles can be statistically analyzed to represent the brightness distribution characteristics of all fluorescent clusters, and uniform correction parameters can be calculated. The brightness information of all fluorescent clusters in that sequencing round is then corrected for phasing and prephasing using the same set of correction parameters. According to the embodiments of this application, crosstalk correction can be performed in a similar manner. Specifically, the data of each fluorescent cluster can be corrected by calculating the crosstalk correction coefficient between the signals of each channel.
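A uniform phasing/prephasing correction of the kind described above can be sketched as follows. The application does not give a formula, so this example assumes a simplified first-order model in which a fraction `phasing` of the signal comes from the previous cycle and a fraction `prephasing` from the next cycle; the function names and the model itself are illustrative assumptions.

```python
from typing import List


def correct_phasing(trace: List[float], phasing: float, prephasing: float) -> List[float]:
    """First-order correction of one cluster's per-cycle brightness in a single channel."""
    keep = 1.0 - phasing - prephasing  # fraction of molecules assumed to be in phase
    if keep <= 0:
        raise ValueError("phasing + prephasing must be below 1")
    corrected = []
    for i, value in enumerate(trace):
        lagging = trace[i - 1] if i > 0 else 0.0               # leak from the previous cycle
        leading = trace[i + 1] if i + 1 < len(trace) else 0.0  # leak from the next cycle
        corrected.append((value - phasing * lagging - prephasing * leading) / keep)
    return corrected


def correct_all(traces: List[List[float]], phasing: float, prephasing: float) -> List[List[float]]:
    """Apply the same (uniform) parameters to every cluster's trace."""
    return [correct_phasing(t, phasing, prephasing) for t in traces]


corrected = correct_all([[90.0, 10.0], [45.0, 50.0]], phasing=0.1, prephasing=0.0)
```

`correct_all` passes the same two parameters to every cluster, which mirrors the uniform correction described in the paragraph above. A per-cluster scheme would need parameters that vary across the chip.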

[0053] According to embodiments of this application, the acquired images are processed, including simplifying and extracting information about the target signals on the images and writing it into a matrix form that is easily recognized and processed by machines. This significantly reduces the size of the input data, lowers the requirements for storage and / or computing power, and facilitates rapid processing of the input data to obtain base identification results. Through deep learning or other machine learning techniques, useful patterns and associations can be extracted from these feature matrices, thereby improving the accuracy and reliability of sequencing. This method is particularly suitable for eliminating or reducing sequencing errors in base identification results introduced by factors such as spatial and / or contextual crosstalk of fluorescence signals, including response lag and lead. This method allows for more effective correction of these errors, resulting in high-quality sequencing data.

[0054] According to embodiments of this application, the prediction model is a deep learning model that performs at least one of spatial crosstalk correction and context effect correction. According to embodiments of this application, the deep learning model outputs a base identification result matrix for the Nth extension reaction based on the fluorescence brightness matrices of the (N-m)th to (N+n)th extension reactions, where m and n are each independently integers, preferably m is not less than n, and more preferably m is greater than n. With this design, the prediction model can capture long-range dependencies in the sequencing process. By considering information from multiple rounds, the model can better understand and correct sequencing errors caused by signal crosstalk and context effects. Furthermore, the base identification in the current round may be affected by base additions in previous rounds. In this way, the model can more accurately predict the base type of each fluorescent cluster, thereby improving the accuracy and reliability of sequencing.
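Assembling the model input for the Nth extension reaction from the surrounding rounds can be sketched as follows. This is a hypothetical helper: the name `cycle_window`, the 0-based cycle index, and the zero-padding of rounds that fall before the first or after the last cycle are assumptions, since the application does not say how run boundaries are handled.

```python
from typing import List

Matrix = List[List[float]]


def cycle_window(cycles: List[Matrix], target: int, m: int, n: int) -> List[Matrix]:
    """Collect matrices for cycles target-m .. target+n (0-based), zero-padding missing cycles."""
    if not 0 <= target < len(cycles):
        raise IndexError("target cycle out of range")
    n_rows, n_cols = len(cycles[0]), len(cycles[0][0])
    zeros = [[0.0] * n_cols for _ in range(n_rows)]
    window = []
    for k in range(target - m, target + n + 1):
        window.append(cycles[k] if 0 <= k < len(cycles) else zeros)
    return window


# Five cycles of a 1x1 "chip" whose brightness equals the 1-based cycle number.
cycles = [[[float(i)]] for i in range(1, 6)]
window = cycle_window(cycles, target=1, m=2, n=1)
```

The window always holds m + n + 1 matrices, so the model sees a fixed number of input channels for any target cycle. Choosing m greater than n gives more weight to earlier rounds, in line with the stated preference.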

[0055] According to an embodiment of this application, the fluorescence matrix is obtained by photographing the array chip during the sequencing-by-synthesis process and processing the resulting photo (image).

[0056] According to an embodiment of this application, the processing includes: dividing the photo into multiple sub-regions; and identifying the multiple sub-regions separately to obtain multiple feature matrices.

[0057] According to embodiments of this application, dividing a photograph into multiple sub-regions can improve the locality of data processing, allowing related data (spatially close data points) to be stored and processed together, thereby reducing data access time and memory bandwidth requirements. Since sub-regions can be independently identified, this method allows data to be processed in parallel on multiple processors or computing cores, significantly improving data processing speed and efficiency. Through sub-region division, the system can flexibly adapt to different hardware configurations and computing resources. For example, with more computing resources, the number of sub-regions can be increased to improve parallelism. If there is a problem with the data in a certain sub-region, this method allows only that specific region to be reprocessed, rather than the entire image, thereby improving the system's fault tolerance. Independent identification of sub-regions can improve the accuracy of feature extraction because recognition parameters can be adjusted for the specific features of each region, rather than using a uniform processing method for the entire image. Furthermore, after obtaining multiple feature matrices, information from different sub-regions can be combined using feature fusion techniques to obtain more comprehensive data analysis results. Therefore, this method can adapt to fluorescent clusters of different densities and distributions because it allows the system to adjust the size and shape of the sub-regions according to the actual fluorescent cluster distribution. Furthermore, by processing only specific portions of the image, computational resource consumption can be reduced, especially when processing large-scale image data, where this method can significantly reduce the required computing power and storage demands. 
According to embodiments of this application, sub-region processing can reduce the impact of uneven illumination or localized damage that may exist on the array chip on the entire dataset, improving the overall robustness of the data. The resulting multiple feature matrices can be used for subsequent analysis and processing, such as base identification, variant detection, or other bioinformatics analyses, providing researchers with greater flexibility and control.

[0058] According to embodiments of this application, the plurality of feature matrices contain at least one piece of information determined based on the location of a sub-region. This facilitates the analysis and localization of the sub-regions. According to embodiments of this application, at least a portion of the plurality of sub-regions overlap. According to embodiments of this application, creating overlapping regions between sub-regions increases data redundancy. This redundancy helps improve robustness during data processing because even if the data in one region is affected by noise or corruption, the data in the overlapping regions can provide additional information to compensate for these losses. Overlapping regions ensure continuity of feature matrices between different sub-regions at boundaries. This is particularly important for analyses requiring continuous data input, such as in image stitching or feature tracking. When merging data from multiple sub-regions into a complete dataset, overlapping regions can act as buffers, reducing stitching errors. By comparing and aligning the data in overlapping regions, the data from individual sub-regions can be merged more accurately. During data analysis, overlapping regions can be used for error correction. For example, if the data quality in one region is poor, data from adjacent overlapping regions can be used for correction or replacement.
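The division into overlapping sub-regions can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `split_with_overlap`, the even grid, the fixed overlap width in pixels, and the 1-based row-major tile numbering are hypothetical choices, not details taken from the application.

```python
import numpy as np

def split_with_overlap(matrix, rows, cols, overlap):
    """Split a 2-D matrix into rows x cols tiles that overlap their neighbours.

    Each tile is extended by `overlap` pixels on every side (clipped at the
    image border). Returns a list of (tile_number, (top, left), tile), with
    tiles numbered from 1 in row-major order.
    """
    H, W = matrix.shape
    row_edges = np.linspace(0, H, rows + 1).astype(int)
    col_edges = np.linspace(0, W, cols + 1).astype(int)
    tiles = []
    for i in range(rows):
        for j in range(cols):
            top = max(int(row_edges[i]) - overlap, 0)
            bottom = min(int(row_edges[i + 1]) + overlap, H)
            left = max(int(col_edges[j]) - overlap, 0)
            right = min(int(col_edges[j + 1]) + overlap, W)
            tiles.append((i * cols + j + 1, (top, left), matrix[top:bottom, left:right]))
    return tiles

image = np.arange(100).reshape(10, 10)
tiles = split_with_overlap(image, rows=3, cols=3, overlap=1)
```

Keeping the (top, left) offset with each tile is one way to carry the position-derived information mentioned above, and it lets overlapping pixels be compared when tiles are merged.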

[0059] In a specific example, referring to Figure 3, the image is transformed into a set of sub-region matrices (block matrices). The entire image is divided into 3*3, i.e. 9, blocks. Since the dimensions of the blocks may differ, the edges of every block smaller than b*y are padded with 0 values to bring it to b*y size, so that all block matrices have the same size before being used as input. Finally, an additional positional encoding layer is added according to the block's position number in the entire image. For example, the bottom right block is numbered 9, so a Pos(P) positional encoding layer with all values equal to 9 is added to it, resulting in the final input matrix. This further improves the processing speed.
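The zero-padding and positional layer of this example can be sketched as below. The function name `pad_and_encode` and the choice to pad along the bottom and right edges are assumptions of this sketch; the text only states that edges are padded with 0 values and that the positional layer holds the block number.

```python
import numpy as np

def pad_and_encode(block, block_number, target_h, target_w):
    """Zero-pad a block to (target_h, target_w) and stack a constant
    positional layer whose every value is the block's number."""
    h, w = block.shape
    padded = np.zeros((target_h, target_w), dtype=np.float32)
    padded[:h, :w] = block  # padding goes to the bottom and right (assumed)
    pos = np.full((target_h, target_w), block_number, dtype=np.float32)
    return np.stack([padded, pos])  # shape (2, target_h, target_w)

# A 3 x 2 block from the bottom right corner (block number 9), padded to 4 x 4.
block = np.ones((3, 2), dtype=np.float32)
encoded = pad_and_encode(block, block_number=9, target_h=4, target_w=4)
```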

[0060] S200: After obtaining the feature matrix, the aforementioned feature matrix can be input into the trained prediction model to output the base recognition results of multiple fluorescent clusters.

[0061] Referring to Figures 2-5, according to embodiments of this application, by employing a feature matrix, the position and intensity information of multiple fluorescent clusters are integrated, enabling the model to simultaneously consider these factors for base identification, thus improving the accuracy of identification. According to embodiments of this application, compared to processing the entire original image, the feature matrix undergoes dimensionality reduction, reducing the amount of data and thereby lowering computational complexity and resource consumption. According to embodiments of this application, by employing a feature matrix, the model can learn and predict faster, especially in large-scale data processing, significantly improving computational efficiency. According to embodiments of this application, as mentioned above, the feature matrix is pre-corrected with phasing (lag), prephasing (lead), and crosstalk corrections, which help eliminate common errors in the sequencing process. Furthermore, according to embodiments of this application, employing a feature matrix format explicitly extracts and represents the position and intensity information of each fluorescent cluster, facilitating the analysis and interpretation of the model's prediction results. At the same time, the feature matrix effectively reflects the information of the targets (amplified clusters) on the image, and its data format is easily recognized and processed by computers, allowing for convenient integration with other models or algorithms and facilitating future technology upgrades and functional expansion. By taking into account multiple relevant factors, the feature matrix can help the model predict base sequences more accurately, thereby improving the quality of the final sequencing results.

[0062] According to embodiments of this application, the prediction model is a single model, specifically a network capable of semantic segmentation (referred to as a semantic segmentation model). In some examples, the semantic segmentation model is selected from at least one of U-Net, DeepLab, SegFormer, PSPNet, FCN, GCN, and their respective variants. Using this prediction model to process input data enables rapid and accurate prediction results, making it particularly suitable for work environments with robust computing capabilities.

[0063] According to embodiments of this application, the prediction model is a machine learning model, which includes at least one of the following: a first sub-model for spatial crosstalk correction; and a second sub-model for contextual effect correction. Optionally, according to embodiments of this application, the first sub-model is used to output a base classification prediction probability matrix for a single round of extension reaction based on the fluorescence matrix of that round; the second sub-model is used to output a base identification result matrix for the Nth round of extension reaction based on the base classification prediction probability matrices for the (N-m)th to (N+n)th rounds of extension reactions, which are generated by the first sub-model. Here, N and m are both positive integers, N is not less than m, and m and n are different integers; preferably m is not less than n, and more preferably m is greater than n.
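The data flow between the two sub-models can be sketched as follows. Both functions below are placeholders invented for illustration (a channel-wise softmax and a window average followed by argmax); they stand in for trained networks and are not the models of the application. Only the shapes and the order of the two stages follow the text.

```python
import numpy as np

def first_sub_model(fluor):
    """Stand-in for the spatial crosstalk sub-model: maps one round's
    fluorescence matrix (4, H, W) to base probabilities (4, H, W).
    Here it is just a softmax over the channel axis."""
    e = np.exp(fluor - fluor.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def second_sub_model(prob_window):
    """Stand-in for the context effect sub-model: maps the probabilities of
    rounds N-m .. N+n, shape (m+n+1, 4, H, W), to base indices (H, W)."""
    return prob_window.mean(axis=0).argmax(axis=0)

def call_round(fluor_rounds, N, m, n):
    """Two-stage call for round N (0-indexed; requires N-m >= 0 and N+n < R)."""
    probs = [first_sub_model(fluor_rounds[r]) for r in range(N - m, N + n + 1)]
    return second_sub_model(np.stack(probs))

# Toy data: 4 rounds, 4 channels, 1 x 2 clusters, channel 2 always brightest.
fluor_rounds = np.zeros((4, 4, 1, 2))
fluor_rounds[:, 2] = 5.0
calls = call_round(fluor_rounds, N=2, m=2, n=1)
```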

[0064] According to embodiments of this application, the prediction model consists of two sub-models, each responsible for a different correction task, further improving the accuracy and efficiency of base identification. The first sub-model focuses on correcting spatial crosstalk effects, which typically occur between high-density fluorescent clusters where signals from adjacent clusters may interfere with each other. By specifically correcting for spatial crosstalk, the accuracy of base identification for neighboring fluorescent clusters can be significantly improved. The second sub-model handles contextual effects, meaning that the identification of the current base may be influenced by bases preceding and following it in the sequence. This correction helps reduce identification errors due to sequence dependence. Furthermore, the first sub-model can independently generate a base classification prediction probability matrix based on a single-round fluorescence matrix. This allows the model to focus on specific features of each round, improving the accuracy of single-round base identification. The second sub-model, based on the multi-round base classification prediction probability matrix obtained from the first sub-model, considers information from multiple rounds, which can reveal long-term dependencies in base identification, improving overall identification accuracy. The term "long-term dependency" as used herein refers to the correlation or influence between elements or events that are far apart in the sequence. This dependency suggests that the current state or value in a sequence may be influenced by states observed earlier in the sequence, and these influences may span multiple sequence positions.

[0065] According to embodiments of this application, by analyzing the extension reactions from round N-m to round N+n, the model can learn both the short-term and long-term dependencies of the sequence, which helps it better understand and predict complex sequence patterns.

[0066] Furthermore, according to embodiments of this application, the design of the two sub-models allows for independent optimization and updates, making the models more flexible and easier to maintain. The modularity also facilitates adjustments and improvements for specific problems.

[0067] According to embodiments of this application, through a two-stage correction and prediction process, the model can more accurately predict the base type of each fluorescent cluster. This two-stage method can progressively refine the prediction results and improve the reliability of the final prediction.

[0068] Those skilled in the art will understand that merging spatial and contextual information into a single, complete input feature data for training results in excessively large input feature dimensions, requiring a larger amount of training data and increasing model complexity, thereby increasing the computational and time costs required for training. In this embodiment, however, by training and modeling spatial and contextual information separately, each sub-model has a smaller feature dimension, thus reducing the required amount of training data and complexity for each model. Independent training and operation of sub-models not only optimizes the use of computational resources and reduces overall computational costs but also improves data processing speed through parallel processing, fully leveraging the advantages of modern computing architectures. This method, which comprehensively utilizes spatial crosstalk correction and contextual influence correction, significantly improves the accuracy and efficiency of base identification, providing a powerful analytical tool for high-throughput sequencing.

[0069] Furthermore, according to embodiments of this application, a sub-model for contextual effect correction can first be used to predict the base classification prediction probability matrix for a given round of extension reactions, and a sub-model for spatial crosstalk correction can then be used to output the base identification result for that round. Specifically, according to embodiments of this application, the prediction model is a machine learning model, which may include at least one of the following: a third sub-model for contextual effect correction; and a fourth sub-model for spatial crosstalk correction. According to embodiments of this application, the third sub-model is used to output the base classification prediction probability matrix for the Nth round of extension reactions based on the fluorescence matrix of each fluorescence channel from the (N-m)th to the (N+n)th rounds of extension reactions; the fourth sub-model is used to output the base identification result matrix for the Nth round of extension reactions based on the base classification prediction probability matrix of the Nth round of extension reactions.

[0070] According to embodiments of this application, the aforementioned sub-models (including the first, second, third, and fourth sub-models) are trained independently using non-overlapping training samples. According to embodiments of this application, each sub-model uses an independent training dataset, reducing the risk of decreased generalization ability due to overfitting on the training data. Independent training helps ensure that each sub-model learns general features of the data, rather than simply memorizing specific noise or outliers from the training samples. According to embodiments of this application, since each sub-model is optimized on a different dataset, the entire system has better adaptability and robustness to sequencing data from different sources or types. Each sub-model can focus on learning to correct a specific type of sequencing error (such as spatial crosstalk or contextual effects), thereby improving the specialization and effectiveness of the correction. According to embodiments of this application, because the sub-models are trained independently, model training can be performed in parallel, which helps to accelerate the entire training process, especially when there are ample computing resources. Independently trained sub-models are easier to maintain and update. If improvements to a specific type of correction are needed, only the corresponding sub-model needs to be retrained. According to embodiments of this application, if the final prediction result is incorrect, it is easier to trace which sub-model may have malfunctioned, allowing for targeted adjustments. According to embodiments of this application, using different training samples ensures that each sub-model learns different aspects of the data, increasing the diversity of features learned by the model.
In summary, through this independent training method, each sub-model can achieve optimal performance in its specialized domain, while the entire system can integrate the advantages of each sub-model to provide more accurate and reliable base identification results.
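One simple way to obtain non-overlapping training samples for the sub-models is sketched below. The function name `disjoint_splits`, the round-robin assignment, and the fixed seed are assumptions made for illustration; the application does not say how the samples are partitioned.

```python
import random

def disjoint_splits(sample_ids, names, seed=0):
    """Shuffle the sample ids and deal them out round-robin so that no two
    sub-models share a training sample."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    return {name: ids[i::len(names)] for i, name in enumerate(names)}

splits = disjoint_splits(range(10), ["first_sub_model", "second_sub_model"])
```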

[0071] According to embodiments of this application, the sub-models (including the first, second, third, and fourth sub-models) each independently employ a U-Net-based neural network structure.

[0072] According to embodiments of this application, the sub-models independently employ at least one of a semantic segmentation network structure, an encoder-decoder network structure, or a Transformer network structure. According to embodiments of this application, the encoder-decoder network structure includes at least one of a U-Net network or a variant thereof, or DeepLab. According to embodiments of this application, the Transformer network structure includes SegFormer. According to embodiments of this application, at least one layer of the U-Net-based neural network structure is provided with an attention mechanism; optionally, the attention mechanism includes at least one of a channel attention unit and a spatial attention unit. According to embodiments of this application, the channel attention unit and the spatial attention unit independently employ at least one of an SE attention mechanism and a CBAM attention mechanism. According to embodiments of this application, both the channel attention unit and the spatial attention unit independently employ the CBAM attention mechanism. According to embodiments of this application, the spatial attention unit uses the output of the channel attention unit as its input feature. According to embodiments of this application, the attention mechanism only includes a channel attention unit employing the SE attention mechanism.

[0073] According to a more specific embodiment, at least one layer of the U-Net-based neural network structure includes an attention mechanism, which comprises at least one of a channel attention unit and a spatial attention unit. Thus, according to embodiments of this application, when setting an attention mechanism in the U-Net-based neural network structure, the crosstalk between channels and the spatial influence between fluorescent clusters are indeed taken into account. Specifically, the design of the channel attention unit allows the model to evaluate the importance of different channels, which can help reduce or correct the crosstalk between channels. Through global pooling layers and MLP subunits, the model can identify which channels contain more relevant information, thereby reducing noise and interference from irrelevant features during feature extraction. The spatial attention unit enables the model to focus on key spatial regions, which helps identify and correct spatial crosstalk caused by the close proximity of fluorescent clusters. By merging the two-dimensional matrix of each channel and applying a compression layer, the model can enhance the identification of the spatial distribution characteristics of fluorescent clusters, thereby improving the spatial analysis accuracy of sequencing data. According to embodiments of this application, by employing an attention mechanism, the model is designed to account for the complexity of channels and spatial dimensions in sequencing data, as well as their potential impact on base recognition accuracy. By independently focusing on channels and spatial dimensions, the model can more flexibly adapt to different data characteristics and improve robustness to various sequencing errors. Furthermore, the attention mechanism enables the model to more accurately locate and emphasize important fluorescence signals during feature extraction, while suppressing or reducing errors caused by crosstalk.

[0074] According to an embodiment of this application, the channel attention unit includes: a first pooling subunit, which uses a first global max pooling layer and a first global average pooling layer to merge channels in the input data of the channel attention unit, thereby reducing the number of channels in the input data; a neural network MLP subunit, which uses shared parameters to process the output data of the first global max pooling layer and the output data of the first global average pooling layer to obtain max pooling transformed data and average pooling transformed data, respectively; a convolution processing subunit, which uses multiple convolution kernels to restore channels in the max pooling transformed data and average pooling transformed data, respectively; and an output layer, which merges and activates the output data of the convolution processing subunit and multiplies it with the input data of the channel attention unit to obtain the output data of the channel attention unit. According to an embodiment of this application, the first pooling subunit reduces the number of channels through global max and average pooling, which helps the model identify the most important channel features. The MLP subunit further processes these features, enabling the model to encode the importance of each channel and selectively emphasize key information. By multiplying the results of the convolution processing subunit with the original input at the output layer, the channel attention unit can dynamically adjust the contribution of each channel to the final feature representation. This gating mechanism allows the model to adjust flexibly under different conditions, enhancing useful features and suppressing noise. Furthermore, the channel restoration step recovers the channel dimensions reduced by the pooling operations, while enhancing the expressiveness of features and ensuring that the model does not lose important information.
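For orientation, the sketch below follows the commonly published CBAM channel attention formulation: global max and average pooling over the spatial dimensions, a shared two-layer MLP, summation, a sigmoid, and multiplication with the input. Mapping the reduction and restoration steps of the text onto the two MLP weight matrices `w1` and `w2` is an interpretation, and the function name, argument shapes, and random weights are assumptions of this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w1, w2):
    """CBAM-style channel attention.

    x:  feature map of shape (C, H, W)
    w1: shared MLP weights that reduce C to C // r, shape (C // r, C)
    w2: shared MLP weights that restore C, shape (C, C // r)
    """
    max_desc = x.max(axis=(1, 2))    # global max pooling, shape (C,)
    avg_desc = x.mean(axis=(1, 2))   # global average pooling, shape (C,)

    def mlp(v):
        return w2 @ np.maximum(w1 @ v, 0.0)  # same weights for both descriptors

    weights = sigmoid(mlp(max_desc) + mlp(avg_desc))  # merge and activate
    return x * weights[:, None, None]                 # gate each channel

rng = np.random.default_rng(0)
x = rng.random((4, 3, 3))              # 4 channels, non-negative values
w1 = rng.standard_normal((2, 4))       # untrained weights, illustration only
w2 = rng.standard_normal((4, 2))
y = channel_attention(x, w1, w2)
```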

[0075] According to an embodiment of this application, the channel attention unit employs an SE attention mechanism, which may include:

[0076] A compression subunit, which compresses the feature matrix of each channel by global average pooling, optionally into a single value;

[0077] An activation subunit, which learns the weight relationship between the channels and outputs the weight of each channel through a fully connected layer. Optionally, the activation subunit includes a first fully connected layer, a ReLU activation layer, a second fully connected layer, and a Sigmoid activation layer;

[0078] A recalibration layer, which multiplies the weight of each channel by the feature matrix of that channel to obtain the output of the attention unit.

[0079] According to embodiments of this application, the SE attention mechanism can be used to enhance or suppress the features of certain channels.
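The three SE steps listed above (compression, activation, recalibration) can be written out in a few lines. This is a minimal sketch assuming a (channels, height, width) feature map; the function name `se_block` and the zero-valued weights in the example are illustrative only, whereas a trained network would learn `w1`, `b1`, `w2`, and `b2`.

```python
import numpy as np

def se_block(x, w1, b1, w2, b2):
    """Squeeze-and-Excitation over a feature map x of shape (C, H, W)."""
    squeezed = x.mean(axis=(1, 2))                        # compression: one value per channel
    hidden = np.maximum(w1 @ squeezed + b1, 0.0)          # first fully connected layer + ReLU
    weights = 1.0 / (1.0 + np.exp(-(w2 @ hidden + b2)))   # second fully connected layer + Sigmoid
    return x * weights[:, None, None], weights            # recalibration

# With all-zero parameters every channel weight is sigmoid(0) = 0.5.
x = np.ones((4, 2, 2))
y, weights = se_block(x, np.zeros((2, 4)), np.zeros(2), np.zeros((4, 2)), np.zeros(4))
```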

[0080] According to an embodiment of this application, the spatial attention unit includes: a second pooling subunit, which employs a second global max pooling layer and a second global average pooling layer to merge the two-dimensional matrices of each channel of the output data of the channel attention unit; a compression layer, which uses a convolution kernel to compress and activate the output data of the second global max pooling layer and the second global average pooling layer; and an output layer, which multiplies the output data of the compression layer with the output data of the channel attention unit to obtain the output data of the spatial attention unit. According to an embodiment of this application, the second pooling subunit performs spatial pooling on the output of the channel attention unit, which helps the model identify key spatial regions in the image. This operation allows the model to focus on the spatial features most critical to the prediction task. The compression layer compresses the output of the second pooling subunit, extracting representative information of the spatial features to form a more abstract feature representation. This helps the model capture broader spatial context information, not just local details. The output layer of the spatial attention unit multiplies the result of the compression layer with the output of the channel attention unit, achieving dynamic adjustment of the spatial features. This operation allows the model to balance its focus across different regions, enhancing the features of important regions while ignoring irrelevant background.
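A spatial attention unit of this kind can be sketched as follows: the channels are merged into a max map and a mean map, one convolution kernel compresses the two maps into a single map, a sigmoid activates it, and the result gates the input. The function name `spatial_attention`, the explicit loop convolution, the zero padding, and the kernel size are assumptions of this sketch.

```python
import numpy as np

def spatial_attention(x, kernel, bias=0.0):
    """CBAM-style spatial attention.

    x:      output of the channel attention unit, shape (C, H, W)
    kernel: convolution kernel of shape (2, k, k) with odd k
    """
    pooled = np.stack([x.max(axis=0), x.mean(axis=0)])   # merge channels, shape (2, H, W)
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))  # zero padding keeps H x W
    H, W = x.shape[1:]
    logits = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            logits[i, j] = (padded[:, i:i + k, j:j + k] * kernel).sum() + bias
    mask = 1.0 / (1.0 + np.exp(-logits))                 # activation, values in (0, 1)
    return x * mask[None], mask                          # gate every spatial position

# With an all-zero kernel every position receives the weight sigmoid(0) = 0.5.
x = np.ones((3, 4, 4))
y, mask = spatial_attention(x, np.zeros((2, 3, 3)))
```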

[0081] According to embodiments of this application, the structure of the channel attention unit focuses on identifying and emphasizing the most important features in the channel dimension, while the structure of the spatial attention unit focuses on capturing and emphasizing key regions in the spatial dimension. The channel attention unit performs complex feature transformations through an MLP, while the spatial attention unit abstracts and compresses spatial information through a compression layer. The output of the channel attention unit directly affects the contribution of each channel, while the output of the spatial attention unit affects the spatial feature representation of the entire image. This structural difference enables the channel attention unit and the spatial attention unit to work collaboratively, optimizing the selection and emphasis of feature channels on the one hand, and strengthening the correlation and expression of spatial features on the other, jointly improving the model's ability to analyze sequencing data and its prediction accuracy.

[0082] It should be noted that, according to the embodiments of this application, SE (Squeeze-and-Excitation) and CBAM (Convolutional Block Attention Module) are two different attention mechanisms, both of which can be used to improve the feature recognition ability of sequencing data, especially in the processing of fluorescence signals in gene sequencing. According to the embodiments of this application, the SE mechanism is implemented through global average pooling and two fully connected layers, resulting in a simple structure and high computational efficiency. Due to its simple structure, the SE module requires fewer parameters, reducing model complexity and the risk of overfitting. Furthermore, SE can learn the importance of different channels, automatically highlighting useful features and suppressing irrelevant features, and can be easily embedded into existing convolutional neural networks to improve existing models without major architectural adjustments. According to the embodiments of this application, CBAM considers attention in both channel and spatial dimensions, enabling more comprehensive feature capture. Through spatial attention, CBAM can identify key spatial locations in images, making it particularly effective for processing spatially correlated data. CBAM enhances the model's ability to express features by fine-grained control of channel and spatial features.

[0083] In the context of gene sequencing, the choice between SE (Squeeze-and-Excitation) and CBAM (Convolutional Block Attention Module) should be based on specific task requirements. If the task primarily focuses on inter-channel relationships and requires a lightweight solution, SE may be a better choice. However, if the task needs to consider both channel and spatial information, especially when performing sequencing data analysis on high-density array chips, CBAM may provide more comprehensive feature coverage, thereby improving sequencing accuracy and reliability. Although CBAM has a higher computational cost, the performance improvements it offers can be crucial in complex bioinformatics data analysis.

[0084] For ease of understanding, the training process of the model is described in detail below with reference to Figures 2 to 10:

[0085] Construct a training dataset with known, accurate base identification results (or highly reliable base sequences), and train a model on this dataset to learn the association between input data and base identification results. In short, referring to Figure 2 and Figure 3, the model training process includes a data extraction and preprocessing phase and a model training phase. More specifically, it can be carried out through the following steps:

[0086] Collect raw data generated by the sequencing platform, which typically includes raw images or signal intensity data obtained from the sequencing instrument;

[0087] Perform quality control on the raw data and remove low-quality data, such as sequencing data with excessively high error rates or noise;

[0088] Select a high-quality reference genome against which the sequencing data are aligned to determine the known base sequences;

[0089] The sequencing data is aligned with a reference sequence to determine the base type at each position in the sequencing data. This step may use various alignment tools, such as BWA, Bowtie, etc.

[0090] Based on the alignment results, a target matrix or label matrix is constructed, which contains the correct base for each sequencing position;

[0091] Features such as fluorescence intensity and signal patterns are extracted from the raw sequencing data and used as input to the model.

[0092] The input features are preprocessed, including normalization, background noise removal, and correction operations (such as lag (phasing) correction, lead (prephasing) correction, and crosstalk correction).

[0093] The dataset is divided into training, validation, and test sets to evaluate model performance and prevent overfitting.

[0094] Ensure that each sample in the training dataset has the correct label, i.e., a known base sequence.
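Two of the steps above, background removal with normalization and the division into training, validation, and test sets, can be sketched as follows. The function names, the fixed background value, the scaling by the maximum, and the 80/10/10 ratio are assumptions chosen for illustration; the application does not prescribe them.

```python
import random

def normalize(intensities, background):
    """Subtract a background estimate, clip at zero, and scale by the maximum."""
    corrected = [max(v - background, 0.0) for v in intensities]
    peak = max(corrected) or 1.0   # avoid dividing by zero on an empty signal
    return [v / peak for v in corrected]

def split_dataset(samples, train=0.8, val=0.1, seed=0):
    """Shuffle the samples and cut them into training, validation, and test sets."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * train)
    n_val = round(len(items) * val)
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]

scaled = normalize([10.0, 30.0, 50.0], background=10.0)
train_set, val_set, test_set = split_dataset(range(10))
```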

[0095] In addition, according to embodiments of this application, the diversity and coverage of the dataset can be increased by using data augmentation techniques to improve the generalization ability of the model. Where necessary, the dataset should be checked to ensure that it represents the diversity of the target genome and avoids bias, which is crucial for improving the applicability of the model to different samples.

[0096] According to embodiments of this application, data augmentation techniques include, but are not limited to, at least one of the following:

[0097] Rotation and flipping: Rotating or flipping image data to simulate different viewing angles helps the model learn more robust features.

[0098] Scaling and cropping: Adjust the size of an image or crop different parts of an image, which can simulate different fields of view and resolutions.

[0099] Color transformation: Transform the color space of an image, such as adjusting brightness, contrast, and saturation, so that the model can adapt to different lighting conditions.

[0100] Adding noise: Random noise is added to the data to simulate signal noise that may occur during actual sequencing, thereby enhancing the model's robustness to noise.

[0101] Sample interpolation: Generating new samples through interpolation techniques, for example, creating new data points in sequencing data by linear or non-linear interpolation between known samples.

[0102] Sequence perturbation: Small perturbations to the DNA sequence, such as substitution, insertion, or deletion of a single base, to simulate mutations that may occur during sequencing.

[0103] Mixed samples: Combining features or signals from multiple samples to simulate complex sequencing scenarios.

[0104] Time series jitter: If the dataset contains time series information, time jitter can be used to change the relative order of events.

[0105] Domain randomization: Randomizing data across different domains (such as different sequencing platforms or experimental conditions) improves the model's adaptability to new environments. The training dataset built through these steps provides the machine learning model with the necessary information to learn how to accurately predict base sequences from sequencing data.

[0106] This allows us to obtain high-quality training datasets, which is key to achieving efficient and accurate sequencing analysis.
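As an illustration of the rotation, flipping, and noise items in the list above, the sketch below applies the same geometric transform to the fluorescence matrix and to its label matrix, so that every cluster keeps its label, and adds noise to the signal only. The function name `augment`, the restriction to 90-degree rotations, and the Gaussian noise model are assumptions of this sketch.

```python
import numpy as np

def augment(fluor, labels, rng, noise_std=0.01):
    """Randomly rotate and flip a sample, then add Gaussian noise to the signal.

    fluor:  fluorescence matrix of shape (C, H, W)
    labels: label matrix of shape (H, W)
    """
    k = int(rng.integers(0, 4))               # number of 90-degree rotations
    fluor = np.rot90(fluor, k, axes=(1, 2))
    labels = np.rot90(labels, k)
    if rng.random() < 0.5:                    # horizontal flip
        fluor = fluor[:, :, ::-1]
        labels = labels[:, ::-1]
    fluor = fluor + rng.normal(0.0, noise_std, size=fluor.shape)
    return fluor, labels

rng = np.random.default_rng(0)
labels = rng.integers(0, 4, size=(3, 3))          # base index per cluster
fluor = np.eye(4)[labels].transpose(2, 0, 1)      # one bright channel per cluster
aug_fluor, aug_labels = augment(fluor, labels, rng)
```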

[0107] Specifically, according to the embodiments of this application, in order to construct the training dataset, the number of sequencing rounds is kept consistent with the largest number of sequencing rounds used in actual applications. To obtain more comprehensive training data, the aforementioned sequencing should be repeated multiple times on different instruments of the same model by different operators using different batches of reagents and chips to accumulate the training dataset. Taking four-color fluorescence sequencing as an example, each field of view (FOV) yields 4 images in each sequencing round, namely fluorescence images taken under the four fluorescence channels A, G, C, and T.

[0108] According to embodiments of this application, conventional base identification tools are used to analyze the obtained sequencing data to obtain base identification results, such as FASTQ files.

[0109] According to embodiments of this application, aligning a FASTQ file with a reference genome can generate a BAM or SAM file of the alignment results. The aforementioned reference genome is a pre-determined sequence, which can be a pre-assembled DNA and/or RNA sequence or a publicly available DNA and/or RNA sequence determined by another party. It can be any reference template from the biological category of the sample source individual or target individual, for example, all or at least a portion of a publicly available genome assembly sequence from the same biological category. If the sample source individual or target individual is human, its genome reference sequence (also called the reference genome or reference chromosome set) can be selected from human reference genomes provided by the UCSC, NCBI, or ENSEMBL databases, such as HG19, HG38, GRCh36, GRCh37, GRCh38, etc.

[0110] According to an embodiment of this application, after obtaining the alignment results, a target value matrix is constructed based on the alignment results, the correct base sequence corresponding to each fluorescent cluster is recorded, and low-quality or unreliable sequences are removed.

[0111] For example, for the Nth round of sequencing, the target value is the "correct answer" for base identification corresponding to each fluorescent cluster during the Nth round of sequencing, i.e., the corresponding base on the reference genome after alignment. The correct answers for base identification corresponding to each fluorescent cluster are arranged into a matrix according to the actual spatial position of the fluorescent cluster, i.e., the target value matrix. For example, assuming that the correct answer for the fluorescent cluster in the 2nd row and 8th column on the chip in the Nth round of sequencing should be A, then A is written at the [1,7] coordinate position (counting from 0) in the target value matrix.

[0112] It should be noted that, when constructing the target value matrix, sequences with poor sequencing quality or unreliable alignment positions were removed to ensure the reliability of the training data. For sequences that failed to align to the reference genome or whose alignment results were unreliable, the position of their corresponding fluorescent clusters in the target value matrix was marked as "N", and subsequent positions marked as N were not included in the training.
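The construction of the target value matrix can be sketched as follows, assuming the alignment step has already been reduced to a mapping from 0-based (row, column) coordinates to a reference base, with `None` for clusters whose alignment is missing or unreliable. The function name `build_target_matrix` and this input format are hypothetical.

```python
import numpy as np

def build_target_matrix(shape, cluster_calls):
    """Place each cluster's reference base at its 0-based (row, col) position.

    Positions with no reliable alignment keep the placeholder 'N'.
    """
    target = np.full(shape, "N", dtype="<U1")
    for (row, col), base in cluster_calls.items():
        if base in ("A", "C", "G", "T"):
            target[row, col] = base
    return target

# Cluster in the 2nd row, 8th column is 'A' -> written at [1, 7].
calls = {(1, 7): "A", (0, 0): "G", (2, 3): None}
target = build_target_matrix((4, 10), calls)
```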

[0113] Based on the array-type chip design layout, the center position of the hole on each round of fluorescence images is located, and the fluorescence intensity is obtained as the original fluorescence brightness. Brightness information is determined based on the original fluorescence brightness.

[0114] For example, in a four-color fluorescence sequencing scenario, each fluorescence cluster corresponds to four fluorescence intensity values for the four fluorescence channels (A, G, C, and T) in each round of sequencing. In some examples of this application, the fluorescence intensity on the original fluorescence image corresponding to the center position of each fluorescence cluster is referred to as the "original fluorescence brightness".

[0115] In some examples of this application, after the original fluorescence brightness is obtained, the process further includes at least one of a background removal operation and a correction operation. The background removal operation is applied to the original fluorescence brightness to remove the influence of the image background, yielding the background-removed fluorescence brightness. In some examples of this application, a correction operation is further performed on the background-removed fluorescence brightness to obtain the corrected fluorescence brightness. The correction operation eliminates or reduces errors and biases that may be introduced during sequencing, and includes at least one of phasing correction, pre-phasing correction, and crosstalk correction.

[0116] For example, based on the distribution characteristics of the background-removed fluorescence brightness of the (N-1)th and Nth rounds, phasing correction is performed on the background-removed fluorescence brightness of the Nth round; based on the distribution characteristics of the background-removed fluorescence brightness of the Nth and (N+1)th rounds, pre-phasing correction is further performed; and then, based on the crosstalk parameters of the fluorescence signals of the four fluorescence channels (determined by the spectra of the fluorescent molecules and the parameters of the optical instruments), the signal crosstalk between the fluorescence channels is corrected.
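The channel crosstalk step can be illustrated with a linear mixing model, in which the observed four-channel signal is assumed to equal a 4x4 crosstalk matrix applied to the true signal. The linear model, the matrix values, and the function name `correct_channel_crosstalk` are assumptions made for this sketch; this application does not specify the form of the crosstalk parameters.

```python
import numpy as np

def correct_channel_crosstalk(observed, mixing):
    """Undo linear channel crosstalk, assuming observed = mixing @ true.

    observed: (4, H, W) intensities for channels A, G, C, T.
    mixing:   (4, 4) assumed crosstalk matrix.
    """
    channels, height, width = observed.shape
    flat = observed.reshape(channels, height * width)
    corrected = np.linalg.solve(mixing, flat)
    return corrected.reshape(channels, height, width)

# Hypothetical crosstalk parameters (not from this application).
mixing = np.array([
    [1.0, 0.2, 0.0, 0.0],
    [0.1, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.3],
    [0.0, 0.0, 0.1, 1.0],
])
true_signal = np.zeros((4, 2, 2))
true_signal[0, 0, 0] = 100.0  # a cluster emitting in channel A
true_signal[3, 1, 1] = 80.0   # a cluster emitting in channel T
observed = np.einsum("ij,jhw->ihw", mixing, true_signal)
recovered = correct_channel_crosstalk(observed, mixing)
```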

[0117] Each fluorescent cluster receives a fluorescence intensity value in each fluorescence image. For each fluorescence image, the fluorescence intensity corresponding to each fluorescent cluster is arranged into a matrix according to the actual spatial location of the cluster. For example, the fluorescence intensity of the fluorescent cluster in the 2nd row and 8th column on the chip is written in the [1,7] coordinates of the matrix (counting from 0). Therefore, for each fluorescence image, the fluorescence matrix is determined based on the location and intensity information of the fluorescent clusters, resulting in a fluorescence intensity matrix. The location information is determined based on the center position of the hole where the fluorescent cluster is located. The fluorescence intensity matrix (including the original fluorescence intensity, the background-removed fluorescence intensity, or the corrected fluorescence intensity) is input as a feature parameter into the prediction model.
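The arrangement of cluster intensities into a matrix can be sketched as follows for a single fluorescence image. The function name `build_fluorescence_matrix`, the input tuples, and the intensity values are hypothetical.

```python
import numpy as np

def build_fluorescence_matrix(shape, clusters):
    """Write each cluster's intensity at its 0-based (row, col) chip position.

    clusters: iterable of (row, col, intensity) tuples for one image.
    """
    matrix = np.zeros(shape, dtype=np.float32)
    for row, col, intensity in clusters:
        matrix[row, col] = intensity
    return matrix

# The cluster in the 2nd row and 8th column is stored at [1, 7].
channel_a = build_fluorescence_matrix((4, 10), [(1, 7, 812.5), (0, 0, 40.0)])
```

Stacking one such matrix per fluorescence channel gives a four-channel array for each sequencing round.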

[0118] Referring to Figures 1-10, in some examples of this application, the aforementioned prediction model is trained as follows:

[0119] In some examples of this application, the aforementioned prediction model can be a machine learning model, including at least one of decision trees, random forests, logistic regression models, convolutional neural networks, generative adversarial networks, and recurrent neural networks. The trained prediction model is determined through the following steps:

[0120] Obtain first training set and first test set data, which have fluorescence brightness matrices and corresponding known base type matrices obtained in the aforementioned manner; input the fluorescence brightness matrix into the prediction model, use the known base type matrix as labels, and perform supervised training on the prediction model to obtain a trained prediction model.

[0121] Specifically, the model training process includes steps such as building a training dataset, data processing, building a deep learning model, model training, model evaluation, and optimization.

[0122] First, select appropriate FOVs (fields of view) and their corresponding rounds (e.g., round N) from the sequencing data as training samples. Use the fluorescence intensity matrices from round N-m to round N+n of each FOV as the features of the training samples. Use the target value matrix of round N of each FOV, i.e., the base sequence corresponding to the fluorescent clusters, as the target value of the training samples. Evaluate the model's performance based on the error rate of the output sequences.

[0123] Then, when constructing the training dataset, only data with high confidence in the alignment results are selected for training. For example, in step 4 above, when constructing the target value matrix, if the corresponding sequence in the SAM or BAM file contains insertions or deletions, the alignment position can be suspected of being insufficiently reliable. In this case, the target values of the entire corresponding sequence can be written as 'N' in the target value matrix. In subsequent training, positions whose target value is N can be excluded from training; for example, the positions corresponding to N can be ignored when calculating the loss function.
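Ignoring 'N' positions when computing the loss can be sketched as a masked cross-entropy. This is a minimal NumPy illustration and not the loss used in this application; the function name `masked_cross_entropy` and the channel order "AGCT" are assumptions.

```python
import numpy as np

BASES = "AGCT"  # assumed channel order

def masked_cross_entropy(probs, target):
    """Mean cross-entropy over positions whose target is A, G, C, or T.

    probs:  (4, H, W) predicted class probabilities per position.
    target: (H, W) array of 'A', 'G', 'C', 'T', or 'N'; 'N' is skipped.
    """
    losses = []
    for index, base in enumerate(BASES):
        mask = target == base
        losses.append(-np.log(probs[index][mask] + 1e-12))
    losses = np.concatenate(losses)
    return float(losses.mean()) if losses.size else 0.0

target = np.array([["A", "N"], ["T", "G"]])
uniform = np.full((4, 2, 2), 0.25)
loss = masked_cross_entropy(uniform, target)
```

With uniform probabilities the loss equals ln 4 for every counted position, so the 'N' entry has no effect on the result.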

[0124] Next, a prediction model is established, and the model architecture (including the number of network layers, the number of neurons, activation functions, etc.), loss function (such as cross-entropy loss function), optimizer (such as stochastic gradient descent optimizer), etc. are determined. The model is trained by dividing the training set and validation set, and the model parameters are continuously updated through the backpropagation algorithm so that the model gradually converges to the optimal solution.

[0125] According to the embodiments of this application, referring to Figure 2 and Figure 3, different models can be used for base identification. For example, Schemes 1, 2.1, and 2.2 shown in Figure 2 and Figure 3 can be used, with Schemes 2.1 and 2.2 being preferred.

[0126] Each of these schemes is described in detail below with reference to Figure 2 and Figure 4:

[0127] Scheme 1 (refer to Scheme 1 in Figure 2 and Figure 3)

[0128] In this approach, the fluorescence brightness matrices of the current round and all adjacent rounds are directly input into the deep learning model.

[0129] First, establish the training dataset. In short, taking the sequencing data of the Nth round of a certain FOV as an example, the target value of this data is the target value matrix of the Nth round of the FOV, and the feature values are all the fluorescence brightness matrices from the (N-m)th to the (N+m)th round, a total of (2m+1)*4 matrices (2m+1 rounds, with 4 fluorescence channels in each round).
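Assembling the Scheme 1 input for one FOV can be sketched as stacking a window of 2m+1 rounds centered on round N, with 4 channels per round, into a single multi-channel array. The function name `stack_round_features`, the array layout (rounds, 4, H, W), and the 0-based round index are assumptions.

```python
import numpy as np

def stack_round_features(intensity, n, m):
    """Stack rounds n-m..n+m (0-based) into a ((2m+1)*4, H, W) feature array.

    intensity: (rounds, 4, H, W) fluorescence brightness matrices of one FOV.
    """
    if n - m < 0 or n + m >= intensity.shape[0]:
        raise ValueError("window exceeds the available rounds")
    window = intensity[n - m : n + m + 1]            # (2m+1, 4, H, W)
    return window.reshape(-1, *intensity.shape[2:])  # ((2m+1)*4, H, W)

intensity = np.random.default_rng(0).random((10, 4, 6, 8))
features = stack_round_features(intensity, n=5, m=2)
```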

[0130] Next, a deep learning model is built using the training dataset described above. The final model's performance is evaluated by assessing the error rate of the output sequence.

[0131] Scheme 2 (refer to Schemes 2.1 and 2.2 in Figure 2 and Figure 4)

[0132] This scheme employs a two-step concatenation approach, combining a spatial crosstalk model and a context model.

[0133] Specifically, the main idea of Scheme 2 is to use a two-step approach: a spatial crosstalk correction model and a context effect correction model. Either spatial crosstalk or the context effect can be corrected first. Scheme 2.1, in which spatial crosstalk is corrected first and the context effect second, is used here as an example for explanation:

[0134] First, construct the training dataset for Model 1. Each training data item corresponds to the sequencing data of one round (assumed to be the Nth round) of one FOV. The target value of this data item is the target value matrix of the Nth round of the FOV, and the feature values are the fluorescence brightness matrices of the Nth round, 4 in total.

[0135] Next, based on the training dataset of Model 1, a deep learning model 1 is built. Model 1 outputs a matrix of base identification results for each round of sequencing, and also outputs the probability of base classification at each position in each round calculated by the model, that is, the probability that the base at each position is A, G, C, or T. The above base classification probabilities are also output in matrix form, with the matrix coordinates corresponding one-to-one with the spatial position of the fluorescent cluster.

[0136] Construct the training dataset for Model 2. Each data point corresponds to the sequencing data of a specific fluorescent cluster location in a particular round (let's say round N). The target value is the reference base (i.e., the "standard answer" after alignment) for that fluorescent cluster in round N. The feature values are the base classification probabilities from round N-m to round N+n, totaling (m+n+1)*4 values. Additionally, the feature values may include round number information (i.e., the N value), and parameters related to the distribution of the four base classification probabilities.

[0137] Next, Model 2 is trained using the training dataset for Model 2. Model 2 can be either a deep learning model or a machine learning model. The output of Model 2 is the final base identification result for each fluorescent cluster in each round.
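The per-cluster feature vector for Model 2 can be sketched as follows, assuming Model 1 has produced a probability array of shape (rounds, 4, H, W). The function name `context_features`, the parameter names `before` and `after` (standing in for m and n), and the choice to append only the round number as an extra feature are assumptions.

```python
import numpy as np

def context_features(probs, row, col, round_idx, before, after):
    """Collect (before+after+1)*4 probabilities for one cluster, plus the round number.

    probs: (rounds, 4, H, W) base classification probabilities from Model 1.
    """
    if round_idx - before < 0 or round_idx + after >= probs.shape[0]:
        raise ValueError("window exceeds the available rounds")
    window = probs[round_idx - before : round_idx + after + 1, :, row, col]
    return np.concatenate([window.ravel(), [float(round_idx)]])

probs = np.random.default_rng(1).random((10, 4, 3, 3))
vector = context_features(probs, row=1, col=2, round_idx=5, before=2, after=1)
```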

[0138] Based on the embodiments of this application, those skilled in the art will also understand that the order of Model 1 and Model 2 can be reversed. This only requires swapping the fluorescence intensity and the base classification probability in the above steps and building the models again. See Scheme 2.2 in Figure 2 and Figure 3.

[0139] It is important to note that, in order to prevent potential overfitting, the training data for Model 1 and Model 2 should be completely independent and disjoint.
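One simple way to keep the two training sets disjoint is to partition the data by FOV identifier before building either dataset. Splitting at the FOV level, the function name `split_disjoint`, and the split fraction are assumptions made for this sketch.

```python
import random

def split_disjoint(fov_ids, fraction_model1, seed=0):
    """Partition FOV identifiers into two non-overlapping sets."""
    ids = sorted(fov_ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * fraction_model1)
    return set(ids[:cut]), set(ids[cut:])

model1_fovs, model2_fovs = split_disjoint(range(10), fraction_model1=0.5)
```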

[0140] It should be noted that building the models in Schemes 1 and 2 involves numerous data selection and model parameter tuning methods. According to the embodiments of this application, when constructing the training dataset, only data with high confidence in the alignment results are selected for training. For example, as mentioned above, when constructing the target value matrix, if the corresponding sequence in the SAM or BAM file contains insertions or deletions, the alignment position can be suspected of being insufficiently reliable, and the target values of the entire corresponding sequence can be written as 'N' in the target value matrix. In subsequent training, positions whose target value is N can be excluded from training; for example, the positions corresponding to N can be ignored when calculating the loss function.

[0141] In some examples of this application, the loss function needs to be optimized according to the specific model, especially considering the parameters for evaluating sequencing quality in the sequencing scenario. For example, when calculating the loss function, in addition to considering the traditional loss function calculation method, parameters specific to the sequencing scenario can also be added, such as sequencing error rate, pattern error characteristic parameters, etc.

[0142] For sequencing with a large number of rounds, the sequencing quality and data characteristics may differ significantly between stages (e.g., between the early rounds and the later rounds). Different models can be built for different round ranges. For example, one model can be used for rounds 0-50, another for rounds 51-100, and so on.
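Selecting a model by round range can be sketched as a lookup over sorted upper bounds. The function name `select_model` and the string placeholders standing in for trained models are hypothetical.

```python
import bisect

def select_model(round_idx, upper_bounds, models):
    """Return the model whose round range contains round_idx.

    upper_bounds: sorted inclusive upper limits, e.g. [50, 100] for
                  the ranges 0-50 and 51-100.
    """
    position = bisect.bisect_left(upper_bounds, round_idx)
    if position >= len(models):
        raise ValueError(f"no model covers round {round_idx}")
    return models[position]

models = ["model_rounds_0_50", "model_rounds_51_100"]  # placeholders
chosen = select_model(51, [50, 100], models)
```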

[0143] During model training, the hyperparameters of the model can be adjusted based on metrics such as accuracy and loss value to improve the model's generalization ability.

[0144] Finally, the trained model is evaluated and optimized, including calculating various metrics (such as accuracy, precision, recall, etc.), analyzing performance, trying different optimization methods (such as different model architectures, loss functions, optimizers, etc.), and then applying it to new datasets for prediction.
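The evaluation metrics mentioned above can be computed per base as in the following sketch. Excluding positions whose reference is 'N' from the accuracy is an assumption, as are the function names.

```python
def accuracy(predicted, reference):
    """Fraction of matching calls, skipping positions whose reference is 'N'."""
    pairs = [(p, r) for p, r in zip(predicted, reference) if r != "N"]
    return sum(p == r for p, r in pairs) / len(pairs) if pairs else 0.0

def precision_recall(predicted, reference, base):
    """Precision and recall for one base type."""
    pairs = list(zip(predicted, reference))
    true_pos = sum(p == base and r == base for p, r in pairs)
    false_pos = sum(p == base and r != base for p, r in pairs)
    false_neg = sum(p != base and r == base for p, r in pairs)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall

predicted = "AAGCT"
reference = "AGGCT"
acc = accuracy(predicted, reference)
prec_a, rec_a = precision_recall(predicted, reference, "A")
```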

[0145] According to embodiments of this application, the model that can be used is a machine learning model, which includes at least one of the following: a first sub-model for spatial crosstalk correction; and a second sub-model for contextual effect correction. The first and second sub-models are each independently selected from at least one of decision trees, random forests, logistic regression models, convolutional neural networks, generative adversarial networks, and recurrent neural networks.

[0146] The aforementioned first sub-model is used to output the base classification prediction probability matrix of a single-round extension reaction based on the fluorescence matrix of that single-round extension reaction; the aforementioned second sub-model is used to output the base identification result matrix of the Nth round extension reaction based on the base classification prediction probability matrices of the (N-m)th to (N+n)th round extension reactions, which are generated by the aforementioned first sub-model.

[0147] This embodiment first constructs a first sub-model (correcting spatial crosstalk), and then constructs a second sub-model (correcting contextual effects) based on this. The training method for each sub-model is the same as above, the difference being that the training dataset is different.

[0148] The training dataset for the first sub-model includes: selecting FOVs (fields of view) and their corresponding rounds from sequencing data as training samples; determining the target value matrix for each FOV in round N, i.e., the base sequence corresponding to the fluorescent cluster, as the target value of the training sample; and determining the feature values, i.e., all fluorescence brightness matrices from round N-m to round N+n of each FOV, as the features of the training sample. Based on the above training dataset, the first sub-model is trained using the same method to obtain the trained model. The output data of the first sub-model is the base identification result matrix for each round of sequencing, and it can also output the probability of base classification at each position in each round calculated by the model, i.e., the probability that the base at each position is A, G, C, or T. The above base classification probabilities are also output in matrix form, with the matrix coordinates corresponding one-to-one with the spatial position of the fluorescent cluster.

[0149] The training dataset for the second sub-model includes: the target base for sequencing in the Nth round of any fluorescent cluster, and the base classification probability values for rounds N-m to N+m. In some examples of this application, it may further include round number information, the magnitude distribution relationship of the four base classification probabilities, and other related parameters. Based on the above training dataset, the second sub-model is trained in the same manner as described above. The output data of the second sub-model is the base identification result for each fluorescent cluster in each round.

[0150] The first sub-model trained and the second sub-model trained together constitute the trained prediction model.

[0151] In some further examples of this application, the aforementioned prediction model is a machine learning model, which includes at least one of the following: a third sub-model for performing contextual effect correction; and a fourth sub-model for performing spatial crosstalk correction. The third and fourth sub-models are each independently selected from at least one of decision trees, random forests, logistic regression models, convolutional neural networks, generative adversarial networks, and recurrent neural networks.

[0152] The aforementioned third sub-model is used to output the base classification prediction probability matrix of the Nth extension reaction based on the fluorescence matrix of each fluorescence channel in the (N-m)th to (N+n)th extension reactions; the aforementioned fourth sub-model is used to output the base identification result matrix of the Nth extension reaction based on the aforementioned base classification prediction probability matrix of the Nth extension reaction.

[0153] This embodiment first constructs a third sub-model (correcting contextual effects), and then constructs a fourth sub-model (correcting spatial crosstalk) based on this. The training method for each sub-model is the same as above, the difference being that the training dataset is different.

[0154] The training dataset for the third sub-model includes: the target bases for the Nth round of sequencing of any fluorescent cluster, and the base classification probability values for rounds N-m to N+n. In some examples of this application, it may further include round number information, the magnitude distribution relationship of the four base classification probabilities, and related parameters. Based on the above training dataset, the third sub-model is trained in the same way as described above. The output data of the third sub-model is the base identification result matrix for each round of sequencing, and it can also output the probability of base classification at each position in each round calculated by the model, that is, the probability that the base at each position is A, G, C, or T. The above base classification probabilities are also output in matrix form, with the matrix coordinates corresponding one-to-one with the spatial position of the fluorescent cluster.

[0155] The training dataset for the fourth sub-model includes: the target base for sequencing in the Nth round of any fluorescent cluster, and the base classification probability values for rounds N-m to N+n. In some examples of this application, it may further include round number information, the magnitude distribution relationship of the four base classification probabilities, and related parameters. Based on the above training dataset, the fourth sub-model is trained in the same manner as described above. The output data of the fourth sub-model is the base identification result for each fluorescent cluster in each round.

[0156] The trained third sub-model and the trained fourth sub-model together constitute the trained prediction model.

[0157] In some examples of this application, m is an integer less than or equal to N and greater than 0.

[0158] In some examples of this application, the first sub-model and the aforementioned second sub-model are trained independently using non-overlapping training samples.

[0159] In some examples of this application, the aforementioned sub-models independently employ at least one of a semantic segmentation network structure, an encoder-decoder network structure, or a Transformer network structure. The aforementioned encoder-decoder network structure includes at least one of the U-Net network or its variants, or DeepLab; the aforementioned Transformer network structure includes SegFormer.

[0160] In some examples of this application, at least one layer of the aforementioned U-Net-based neural network structure is provided with an attention mechanism, which includes at least one of a channel attention unit and a spatial attention unit.

[0161] In some examples of this application, the aforementioned channel attention unit includes: a first pooling subunit, which employs a first global max pooling layer and a first global average pooling layer to perform channel merging on the input data of the aforementioned channel attention unit, thereby reducing the number of channels in the input data; a neural network MLP subunit, which uses shared parameters to process the output data of the aforementioned first global max pooling sublayer and the output data of the aforementioned first global average pooling sublayer, respectively, to obtain max pooling transformed data and average pooling transformed data, respectively; a convolution processing subunit, which uses multiple convolution kernels to perform channel restoration on the max pooling transformed data and the average pooling transformed data, respectively; and an output layer, which merges and activates the output data of the aforementioned convolution processing subunit and multiplies it with the input data of the aforementioned channel attention unit to obtain the output data of the aforementioned channel attention unit.

[0162] In some examples of this application, the aforementioned spatial attention unit includes: a second pooling subunit, which employs a second global max pooling layer and a second global average pooling layer to merge the two-dimensional matrix of each channel of the output data of the aforementioned channel attention unit; a compression layer, which employs a convolution kernel to compress and activate the output data of the aforementioned second global max pooling layer and the second global average pooling layer; and an output layer, which multiplies the output data of the aforementioned compression layer with the output data of the aforementioned channel attention unit to obtain the output data of the aforementioned spatial attention unit.

[0163] The base identification scheme in this application utilizes a deep learning model, combined with the spatial arrangement information of fluorescent clusters, to automatically correct spatial fluorescence crosstalk between adjacent fluorescent clusters. At the same time, it can independently correct the contextual effects of each fluorescent cluster, effectively improving the accuracy of base identification. Compared with traditional methods, this method reduces computational cost and processing time by decreasing the input data size. It is well suited to array-based chips, enabling more efficient processing of their sequencing data and contributing to improved accuracy and reliability of sequencing results.

[0164] A working example of the U-Net-structured neural network according to the embodiments of this application is described in detail below with reference to Figures 4-9.

[0165] This model employs a U-Net-based network structure, incorporating spatial and channel attention mechanisms. Input data X of size 32*4*400*320 (B, C, H, W; B: batch size; C: channel; H: height; W: width) undergoes the following processing:

[0166] First, the data passes through an input layer (in_conv) and an attention mechanism (CBAM) layer, the structures of which are shown in Figures 6 and 7. The in_conv layer applies a convolution kernel that increases the channel dimension (e.g., a 3x3 convolution that changes the size to 32x32x400x320), followed by normalization and ReLU activation, and then another convolution kernel (e.g., a 3x3 convolution), normalization, and ReLU activation, yielding the channel-expanded X0 (e.g., X0 of size 32x32x400x320).

[0167] The data is then input into the CBAM layer. In the channel attention part of the CBAM layer, data X0 passes through a global max pooling layer and a global average pooling layer, and each pooled result then passes through a shared-parameter neural network (MLP). In the MLP, the data first passes through a dimensionality-reducing convolution kernel (e.g., a 1x1 convolution) that reduces the number of channels, for example to 1/16 of the original, then through a ReLU activation function, and finally through another convolution kernel (e.g., a 1x1 convolution) that restores the number of channels. Finally, the two results are added together and passed through a Sigmoid activation function to obtain X_0C, which is multiplied element-wise with X0 to obtain the channel attention output X_0MID.
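The channel attention computation described above can be sketched in NumPy as follows. Because a 1x1 convolution applied to a globally pooled vector is equivalent to a matrix product, the shared MLP is written with two weight matrices. The random weights, the small input size, and the function names are illustrative assumptions; a real implementation would use learned parameters in a deep learning framework.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w_reduce, w_restore):
    """Channel attention on a (B, C, H, W) array.

    w_reduce:  (C // r, C) weights of the channel-reducing 1x1 convolution.
    w_restore: (C, C // r) weights of the channel-restoring 1x1 convolution.
    """
    max_pooled = x.max(axis=(2, 3))    # (B, C) global max pooling
    avg_pooled = x.mean(axis=(2, 3))   # (B, C) global average pooling

    def shared_mlp(v):
        hidden = np.maximum(v @ w_reduce.T, 0.0)  # reduce channels, then ReLU
        return hidden @ w_restore.T               # restore channels

    weights = sigmoid(shared_mlp(max_pooled) + shared_mlp(avg_pooled))
    return x * weights[:, :, None, None]          # element-wise rescaling

rng = np.random.default_rng(0)
x0 = rng.normal(size=(2, 32, 8, 8))
w_reduce = rng.normal(size=(32 // 16, 32))   # reduce to 1/16 of the channels
w_restore = rng.normal(size=(32, 32 // 16))
x0_mid = channel_attention(x0, w_reduce, w_restore)
```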

[0168] In the spatial attention part of CBAM, the global average and global maximum of X_0MID are computed along the channel dimension. The two results are concatenated, processed by a convolution kernel (e.g., a 7x7 convolution) that compresses them into a single channel, and activated by a Sigmoid function to obtain X_0S. X_0S is multiplied element-wise with X_0MID to yield the final CBAM output X_0FIN.
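The spatial attention step can be sketched in NumPy as follows, using channel-wise average and maximum maps, which is the standard CBAM formulation and is assumed here. The explicit zero-padded convolution loop, the random kernel, and the function names are illustrative assumptions rather than the implementation of this application.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv2d_same(x, kernel):
    """Zero-padded convolution from (B, Cin, H, W) to one channel (B, H, W).

    kernel: (Cin, k, k) with odd k.
    """
    _, k, _ = kernel.shape
    pad = k // 2
    batch, _, height, width = x.shape
    padded = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((batch, height, width))
    for i in range(k):
        for j in range(k):
            patch = padded[:, :, i : i + height, j : j + width]
            out += np.einsum("bchw,c->bhw", patch, kernel[:, i, j])
    return out

def spatial_attention(x, kernel):
    """Spatial attention on a (B, C, H, W) array with a (2, k, k) kernel."""
    pooled = np.stack([x.mean(axis=1), x.max(axis=1)], axis=1)  # (B, 2, H, W)
    weights = sigmoid(conv2d_same(pooled, kernel))              # (B, H, W)
    return x * weights[:, None, :, :]

rng = np.random.default_rng(0)
x_mid = rng.normal(size=(2, 32, 8, 8))
kernel = rng.normal(size=(2, 7, 7)) * 0.1
x_fin = spatial_attention(x_mid, kernel)
```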

[0169] Next, the data passes through four downsampling stages. Each of the first three stages includes an attention mechanism (CBAM) layer and a downsampling layer (Down layer), while the last stage has only a Down layer. The structure of the Down layer is shown in Figure 8. Each time the data passes through a Down layer, the number of data channels C doubles (except for the last Down layer), while the height H and width W are halved.

[0170] Taking the first layer as an example (the input is X_0FIN and the output is X_1FIN):

[0171] X_0FIN is input to the downsampling layer, where it passes through a max pooling layer, a convolution kernel, normalization, and a ReLU activation function to obtain the downsampled X1. For example, it can pass through a 2*2 max pooling layer with a stride of 2, making the data 32*32*200*160; then through a 3*3 convolution, making the size 32*64*200*160, followed by normalization and ReLU activation; and finally through another 3*3 convolution, normalization, and ReLU activation, to obtain X1 of size 32*64*200*160.
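The 2*2 max pooling with a stride of 2 used in the Down layer halves the height and width while leaving the batch and channel dimensions unchanged. A minimal NumPy sketch follows; the function name `max_pool_2x2` is hypothetical.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a (B, C, H, W) array with even H and W."""
    batch, channels, height, width = x.shape
    if height % 2 or width % 2:
        raise ValueError("H and W must be even")
    blocks = x.reshape(batch, channels, height // 2, 2, width // 2, 2)
    return blocks.max(axis=(3, 5))

x = np.arange(16, dtype=float).reshape(1, 1, 4, 4)
pooled = max_pool_2x2(x)
```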

[0172] Then, the data passes through four upsampling layers, the structure of which is shown in Figure 9. Each upsampling layer has two inputs: one is the output of the previous layer (X4 for the first upsampling layer, Y1 for the second layer), and the other is the corresponding downsampling result (X_3FIN for the first layer; the last layer corresponds to X_0FIN). After each upsampling layer, the number of data channels C is halved (except for the last upsampling layer), while the height H and width W are doubled.

[0173] Taking the first upsampling layer as an example (the inputs are X4 and X_3FIN, and the output is Y1):

[0174] The input X4 is upsampled by bilinear interpolation and then concatenated with X_3FIN along the channel dimension C, followed by convolution kernel processing, normalization, and a ReLU activation function, finally yielding the upsampled Y1. For example, after bilinear interpolation X4 has size 32*256*50*40. X4 and X_3FIN are then concatenated along the channel dimension C, giving size 32*512*50*40. The result passes through a 3*3 convolution, reducing the size to 32*256*50*40, followed by normalization and ReLU activation, and finally through another 3*3 convolution, normalization, and ReLU activation, giving a final size of 32*128*50*40.

[0175] Y1 and X_2FIN are input to the next upsampling layer to obtain the output Y2, and so on. Finally, Y4 is input into a 1*1 convolution out_conv whose number of output channels equals the number of categories, resulting in the final prediction result Out: 32*5*400*320.
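The tensor sizes quoted in this working example can be checked with a small shape-tracing function. It only tracks (B, C, H, W) shapes under the rules stated above (channels double on every Down layer except the last and halve on every upsampling layer except the last); it performs no convolution. The function name and default arguments are assumptions chosen to match the example.

```python
def trace_unet_shapes(batch=32, in_channels=4, height=400, width=320,
                      base_channels=32, depth=4, n_classes=5):
    """Return the (B, C, H, W) shape of each stage of the U-Net example."""
    shapes = {
        "X": (batch, in_channels, height, width),
        "X0": (batch, base_channels, height, width),  # after in_conv
    }
    c, h, w = base_channels, height, width
    for level in range(1, depth + 1):        # downsampling path
        if level < depth:
            c *= 2                           # the last Down layer keeps C
        h, w = h // 2, w // 2
        shapes[f"X{level}"] = (batch, c, h, w)
    for level in range(1, depth + 1):        # upsampling path
        if level < depth:
            c //= 2                          # the last upsampling keeps C
        h, w = h * 2, w * 2
        shapes[f"Y{level}"] = (batch, c, h, w)
    shapes["Out"] = (batch, n_classes, h, w)  # 1*1 out_conv
    return shapes

shapes = trace_unet_shapes()
```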

[0176] In other examples of this application, a deconvolution kernel can be used instead of the bilinear interpolation method described above. The output matrix size of both methods is the same, and will not be elaborated further here.

[0177] In another aspect, this application proposes a device for base recognition. Referring to Figure 11, the device includes a predictive feature acquisition unit 100 and a base recognition unit 200. In some examples of this application, the aforementioned predictive feature acquisition unit 100 and base recognition unit 200 are connected, and the connection includes a network connection.

[0178] The predictive feature acquisition unit 100 is used to acquire the feature matrix of the sequencing-by-synthesis reaction. The aforementioned feature matrix includes a fluorescence matrix composed of the position information and brightness information of multiple fluorescent clusters.

[0179] The base recognition unit 200 is used to input the aforementioned feature matrix into the trained prediction model and output the base recognition results of the aforementioned multiple fluorescent clusters.

[0180] The device of this application is used to implement the above-mentioned base identification method. The advantages of the above-mentioned base identification method also apply to this aspect, and will not be repeated here.

[0181] In another aspect, this application proposes a computer program product comprising computer instructions that, when some or all of the aforementioned computer instructions are run on a computer, cause the method for base identification described in this application to be executed.

[0182] When implemented using software, it can be implemented entirely or partially as a computer program product. This computer program product includes one or more computer instructions. When these computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of this application are generated. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.

[0183] In another aspect, this application proposes a computing device, comprising: a processor and a memory; the aforementioned memory for storing a computer program; and the aforementioned processor for executing the aforementioned computer program to implement the method for base recognition as described in this application.

[0184] The term "electronic device" is intended to refer to various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Computing devices can also refer to various forms of mobile devices, such as personal digital processors, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementation of the present disclosure described and / or claimed herein.

[0185] As shown in Figure 12, the electronic device 500 includes a computing unit 501, which can perform various appropriate actions and processes based on a computer program stored in ROM (Read-Only Memory) 502 or a computer program loaded from storage unit 508 into RAM (Random Access Memory) 503. The RAM 503 can also store various programs and data required for the operation of the device 500. The computing unit 501, ROM 502, and RAM 503 are interconnected via a bus 504. An I/O (Input/Output) interface 505 is also connected to the bus 504.

[0186] Multiple components in device 500 are connected to I / O interface 505, including: input unit 506, such as keyboard, mouse, etc.; output unit 507, such as various types of monitors, speakers, etc.; storage unit 508, such as disk, optical disk, etc.; and communication unit 509, such as network card, modem, wireless transceiver, etc. Communication unit 509 allows device 500 to exchange information / data with other devices through computer networks such as the Internet and / or various telecommunications networks.

[0187] The computing unit 501 can be any of a variety of general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, CPUs (Central Processing Units), GPUs (Graphics Processing Units), various special-purpose AI (Artificial Intelligence) computing chips, various computing units running machine learning model algorithms, DSPs (Digital Signal Processors), and any suitable processor, controller, microcontroller, etc. The computing unit 501 performs the various methods and processes described above, such as the method for base identification. For example, in some embodiments, the method for base identification may be implemented as a computer software program tangibly contained in a machine-readable medium, such as storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed on device 500 via ROM 502 and/or communication unit 509. When the computer program is loaded into RAM 503 and executed by the computing unit 501, one or more steps of the methods described above may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured to perform the aforementioned method for base identification by any other suitable means (e.g., by means of firmware).

[0188] In another aspect, this application proposes a computer-readable storage medium comprising computer instructions that, when executed by a computer, cause the computer to implement a method for base identification.

[0189] In this application, the ordered list of executable instructions for implementing logical functions can be specifically implemented in any computer-readable medium for use by, or in conjunction with, an instruction execution system, apparatus, or device (such as a computer-based system, a processor-based system, or other system that can fetch and execute instructions from, or in conjunction with, an instruction execution system, apparatus, or device). For the purposes of this specification, "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transmit programs for use by, or in conjunction with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection having one or more wires (electronic device), a portable computer disk drive (magnetic device), random access memory (RAM), read-only memory (ROM), erasable and editable read-only memory (EPROM or flash memory), fiber optic devices, and portable optical disc read-only memory (CDROM). Furthermore, a computer-readable medium can even be paper or other suitable media on which the aforementioned program can be printed, because the aforementioned program can be obtained electronically, for example, by optically scanning the paper or other medium, followed by editing, interpreting, or otherwise processing as necessary, and then stored in a computer memory. The various computer-readable storage media described in this application can represent one or more devices and / or other machine-readable storage media for storing information. The term "machine-readable storage medium" can include, but is not limited to, wireless channels and various other media capable of storing, containing, and / or carrying instructions and / or data.

[0190] It should be understood that various parts of this application can be implemented using hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods can be implemented using software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, it can be implemented using any one or a combination of the following techniques known in the art: discrete logic circuits having logic gates for implementing logical functions on data signals, application-specific integrated circuits (ASICs) having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), etc.

[0191] Those skilled in the art will understand that all or part of the steps of the methods in the above embodiments can be implemented by a program instructing related hardware. The aforementioned program can be stored in a computer-readable storage medium, and when executed, the program includes one or a combination of the steps of the method embodiments.

[0192] Furthermore, the functional units in the various embodiments of this application can be integrated into a processing module, or each unit can exist physically separately, or two or more units can be integrated into a module. The integrated module can be implemented in hardware or as a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it can also be stored in a computer-readable storage medium.

[0193] It should be understood that the various forms of processes shown above can be used to rearrange, add, or delete steps. For example, the steps described in this disclosure can be executed in parallel, sequentially, or in different orders, as long as the desired result of the technical solution disclosed in this disclosure can be achieved, and this is not limited herein.

[0194] The following examples illustrate this application, but should not be construed as limiting the scope of the subject matter of this application to the following examples. All technologies implemented based on the above content of this application fall within the scope of this application.

[0195] Example 1: Application of Deep Learning Models in Gene Sequencing Data Analysis

[0196] This example aims to verify the potential of deep learning models in improving gene sequencing quality, with a particular focus on the correction of spatial fluorescence crosstalk. This example uses high-density array-based microarray sequencing images provided by Zhenmai Biotechnology and selects PE150 paired-end sequencing data within a chosen region as the research object. The training dataset consists of 1000 independent array-based microarray sequencing samples, originating from different experimental conditions, reagent batches, and microarray batches, ensuring data diversity and breadth.

[0197] Referring to Figures 5-9, this embodiment is based on the U-Net network structure and further integrates spatial attention and channel attention mechanisms to enhance the model's ability to recognize sequencing image features. Data processing: The input data consists of the A, C, G, and T images acquired by the camera in each sequencing cycle under a fixed field of view (FOV). The pore coordinates and light intensity information of the central region of the image are extracted to form a two-dimensional matrix (height: 381, width: 304). The matrices of the four color channels are merged to form a four-channel three-dimensional matrix, which serves as the model input. The input data is standardized, adjusted to a light intensity value range of 0-255, and edge-padded to meet the model's input size requirements.
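The merging and edge-padding step can be sketched as follows. This is a minimal illustration, not the implementation used in this embodiment: it assumes NumPy, zero-valued padding, and a network that needs each spatial side to be a multiple of 16 (a typical requirement for a U-Net with four downsampling stages). The name `build_input` and the constant `MULTIPLE` are hypothetical.

```python
import numpy as np

HEIGHT, WIDTH = 381, 304   # matrix size stated in the text
MULTIPLE = 16              # assumption: network needs sides divisible by 16


def build_input(channels: dict) -> np.ndarray:
    """Stack the A, C, G, T intensity matrices and pad the edges with zeros."""
    stacked = np.stack([channels[b] for b in "ACGT"], axis=0)  # (4, H, W)
    pad_h = (-stacked.shape[1]) % MULTIPLE
    pad_w = (-stacked.shape[2]) % MULTIPLE
    # Split the padding between the two sides of each spatial axis.
    pad = ((0, 0),
           (pad_h // 2, pad_h - pad_h // 2),
           (pad_w // 2, pad_w - pad_w // 2))
    return np.pad(stacked, pad, mode="constant", constant_values=0)


# Synthetic intensities in the 0-255 range stand in for real image data.
rng = np.random.default_rng(0)
channels = {b: rng.uniform(0, 255, size=(HEIGHT, WIDTH)) for b in "ACGT"}
x = build_input(channels)
print(x.shape)  # (4, 384, 304)
```

Under these assumptions the height is padded from 381 to 384, while the width of 304 is already a multiple of 16 and is left unchanged.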

[0198] Standardization and Training: The input data is standardized using the mean and variance calculated from a large dataset, ensuring a mean of 0 and a standard deviation of 1. Data labels are generated based on reference genome information, encoding the ACGT bases as numbers 1-4, and performing appropriate edge padding.
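A minimal sketch of the standardization and label encoding, assuming NumPy. The A/C/G/T to 1-4 mapping is the one stated above. Using 0 for padded or unlabeled positions is an assumption, as are the function names. In practice the mean and standard deviation would come from the large dataset, not from the sample being normalized as in this toy example.

```python
import numpy as np

BASE_TO_LABEL = {"A": 1, "C": 2, "G": 3, "T": 4}  # encoding stated in the text
PAD_LABEL = 0  # assumption: padded or unlabeled positions get 0


def standardize(x, mean, std):
    """Shift and scale intensities to zero mean and unit standard deviation."""
    return (x - mean) / std


def encode_labels(bases):
    """Map a grid of base letters (one string per row) to integer labels."""
    return np.array([[BASE_TO_LABEL.get(b, PAD_LABEL) for b in row]
                     for row in bases])


x = np.array([[10.0, 20.0], [30.0, 40.0]])
z = standardize(x, x.mean(), x.std())   # toy statistics, for illustration only
labels = encode_labels(["AC", "GT"])
print(labels.tolist())  # [[1, 2], [3, 4]]
```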

[0199] Model training: Stochastic gradient descent (SGD) was used as the optimizer, with an initial learning rate of 0.05, momentum of 0.9, and weight_decay of 1e-4.
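To make the three hyperparameters concrete, the sketch below applies one SGD update in plain Python. It follows one common formulation (weight decay added to the gradient, then a momentum buffer, as in PyTorch's `torch.optim.SGD` with zero dampening). The text does not state which framework or formulation was used, so this is an assumption, and `sgd_step` is a hypothetical name.

```python
LEARNING_RATE = 0.05   # initial learning rate stated in the text
MOMENTUM = 0.9         # momentum stated in the text
WEIGHT_DECAY = 1e-4    # weight_decay stated in the text


def sgd_step(weights, grads, velocity):
    """One SGD update with momentum and L2 weight decay, in place on the lists."""
    for i, (w, g) in enumerate(zip(weights, grads)):
        g = g + WEIGHT_DECAY * w                  # weight decay folded into gradient
        velocity[i] = MOMENTUM * velocity[i] + g  # momentum buffer
        weights[i] = w - LEARNING_RATE * velocity[i]
    return weights, velocity


# One parameter with value 1.0, gradient 2.0, and an empty momentum buffer.
weights, velocity = sgd_step([1.0], [2.0], [0.0])
print(weights[0])  # approximately 0.899995
```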

[0200] During training, a learning rate update strategy is applied to dynamically adjust the learning rate in order to optimize training results.

[0201] We use a weighted combination of cross-entropy loss, dice loss, and a custom miss_match loss as the loss function, focusing on loss calculation for the four ACGT classes.
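The weighted combination can be sketched with NumPy as follows. Only the cross-entropy and Dice terms are shown. The custom miss_match loss is not defined in the text, so it is left out here instead of guessed. The weights `w_ce` and `w_dice` and all function names are assumptions for illustration.

```python
import numpy as np


def cross_entropy(probs, onehot, eps=1e-7):
    """Mean cross-entropy over positions; probs and onehot are (4, H, W)."""
    return float(-(onehot * np.log(probs + eps)).sum(axis=0).mean())


def dice_loss(probs, onehot, eps=1e-7):
    """1 minus the mean soft Dice coefficient over the four classes."""
    inter = (probs * onehot).sum(axis=(1, 2))
    total = probs.sum(axis=(1, 2)) + onehot.sum(axis=(1, 2))
    return float(1.0 - ((2 * inter + eps) / (total + eps)).mean())


def combined_loss(probs, onehot, w_ce=1.0, w_dice=1.0):
    """Weighted sum of the two terms; the weights are illustrative assumptions."""
    return w_ce * cross_entropy(probs, onehot) + w_dice * dice_loss(probs, onehot)


labels = np.array([[0, 1], [2, 3]])             # class index per position
onehot = np.eye(4)[labels].transpose(2, 0, 1)   # (4, 2, 2) one-hot targets
uniform = np.full_like(onehot, 0.25)            # uninformative prediction

perfect_loss = combined_loss(onehot, onehot)
uniform_loss = combined_loss(uniform, onehot)
print(perfect_loss < uniform_loss)  # True
```

The Dice term rewards overlap per class, which can help when the four bases are not equally represented in a field of view.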

[0202] Performance Evaluation: After training, padding regions are removed by pruning, and the output is compared with the reference genome label to calculate the base recognition error rate. An early stopping strategy is implemented to prevent overfitting; training is stopped when the loss on the validation set fails to improve within 30 consecutive epochs.
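The evaluation and early-stopping logic described above can be sketched in plain Python. The patience of 30 epochs comes from the text. The helper names, the list-of-rows data layout, and the toy values are assumptions for illustration.

```python
def crop(grid, top, left, height, width):
    """Remove edge padding from a 2D grid given as a list of rows."""
    return [row[left:left + width] for row in grid[top:top + height]]


def error_rate(predicted, reference):
    """Fraction of positions whose predicted label differs from the reference."""
    pairs = [(p, r) for prow, rrow in zip(predicted, reference)
             for p, r in zip(prow, rrow)]
    return sum(p != r for p, r in pairs) / len(pairs)


class EarlyStopping:
    """Signal a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=30):
        self.patience = patience
        self.best = float("inf")
        self.stale_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
        return self.stale_epochs >= self.patience


# Toy prediction with a one-cell padded border around a 2 x 2 region.
padded_pred = [[0, 0, 0, 0], [0, 1, 2, 0], [0, 3, 3, 0], [0, 0, 0, 0]]
reference = [[1, 2], [3, 4]]
pred = crop(padded_pred, top=1, left=1, height=2, width=2)
print(error_rate(pred, reference))  # 0.25
```

In a training loop, `should_stop` would be called once per epoch with the validation loss, and training would end the first time it returns `True`.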

[0203] Model performance: On a test dataset with 100 FOVs, the model successfully reduced the sequencing error rate by 28% and improved the alignment rate by 1.2%, demonstrating the significant effect of deep learning models in improving the accuracy of gene sequencing.

[0204] Example 2: Deep Learning Model Considering Contextual Influence

[0205] The purpose of this embodiment is to verify the potential of deep learning / machine learning models in correcting the influence of context on current base recognition. In this embodiment, the influence of spatial fluorescence crosstalk is disregarded, and the focus is on the sequence information of fluorescent clusters.

[0206] We used high-density array-based microarray sequencing images provided by Zhenmai Biotechnology to focus on a fixed region within the PE150 paired-end sequencing pipeline. The training dataset consists of 80 different fields of view (FOVs), each derived from independent experiments, different reagents, and microarray batches, ensuring data diversity and breadth.

[0207] Model Construction: Training data is presented as individual fluorescent clusters, with each data point recording the fluorescence intensity value of a particular sequencing round, excluding spatial location information. Feature values include the corrected fluorescence intensity values of the fluorescent cluster in the current sequencing round and the three rounds before and after it, as well as the sequencing round number for the current round. The target value is the aligned reference sequence base corresponding to the fluorescent cluster in the current sequencing round.
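The feature construction for a single fluorescent cluster can be sketched as below. The window of three rounds on each side and the round number as an extra feature come from the text. Storing four intensity values (A, C, G, T) per round and zero-filling positions beyond the ends of the read are assumptions, and `cluster_features` is a hypothetical name.

```python
WINDOW = 3  # rounds before and after the current round, as stated in the text


def cluster_features(intensities, cycle):
    """Build the feature vector for one cluster at a 0-based cycle index.

    `intensities` holds one [A, C, G, T] corrected intensity entry per
    sequencing round (assumed layout).
    """
    features = []
    for c in range(cycle - WINDOW, cycle + WINDOW + 1):
        if 0 <= c < len(intensities):
            features.extend(intensities[c])
        else:
            features.extend([0.0] * 4)  # assumption: zero-fill past the read ends
    features.append(cycle + 1)  # 1-based sequencing round number
    return features


# Toy read of 10 rounds; the A-channel value equals the round index.
intensities = [[float(r), 0.0, 0.0, 0.0] for r in range(10)]
row = cluster_features(intensities, cycle=1)
print(len(row))  # 29 = 7 rounds x 4 channels + 1 round number
```

Each such row, paired with the aligned reference base as the target, would form one training sample for the gradient-boosting model.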

[0208] Model selection and results:

[0209] A machine learning model was built using the lightGBM algorithm to improve the accuracy of base identification.

[0210] The test dataset contains 20 FOVs, derived from different experimental conditions, and all samples are human genomes.

[0211] After the model was applied, the sequencing error rate was significantly reduced by 37%, and the alignment rate was improved by 2.3%, demonstrating the effectiveness of the model when considering the influence of context.

[0212] Example 3: Model Performance Verification

[0213] This embodiment is used to verify the effectiveness of the complete algorithm. This embodiment uses a fixed region from a pre-research-grade high-density array-based microarray sequencing image from Zhenmai Biotechnology, employing a four-color fluorescence next-generation sequencing workflow, PE150 (paired-end sequencing 150+150), and using a "corrected fluorescence brightness" matrix. The training dataset consists of human genome samples.

[0214] For the modeling process in this embodiment, refer to Figure 2 and Figure 3, Part 2.1 of the scheme.

[0215] The modeling method for Model 1 is essentially the same as in Example 1. Each training data point corresponds to the sequencing information for a specific FOV and a specific round of sequencing. The training data consists of array-based microarray sequencing data from 1000 FOVs, PE150 (150+150 paired-end sequencing), sourced from different experiments using different reagents and microarray batches. The modeling method for Model 2 is similar to that in Example 2. The training data is based on individual fluorescent clusters, with each training data point representing the sequencing information for a specific fluorescent cluster and a specific round of sequencing, excluding the spatial location information of the fluorescent clusters. The training data consists of array-based microarray sequencing data from 80 FOVs, PE150 (150+150 paired-end sequencing), sourced from different experiments using different reagents and microarray batches. Note that the 80 FOVs used to train Model 2 are entirely distinct from the 1000 FOVs used to train Model 1. Note also that the training data (sequencing data) for Model 2 must first be processed by the pre-trained Model 1 to obtain base classification probability values. The base classification probability values corresponding to each fluorescent cluster are then used as feature parameters and input into Model 2 for training. Here, Model 2 is again constructed using the lightGBM algorithm.
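The data flow between the two models can be sketched as follows, assuming NumPy. Model 1 is replaced here by a plain softmax stand-in, not the attention U-Net of Example 1, and Model 2 (lightGBM in this embodiment) is not shown. The window of three rounds, the uniform fill value at the read ends, and all names are assumptions for illustration only.

```python
import numpy as np


def model1_probabilities(fluorescence):
    """Stand-in for Model 1: softmax over the channel axis of a (4, H, W) matrix."""
    shifted = fluorescence - fluorescence.max(axis=0, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)


def model2_features(prob_by_cycle, cycle, row, col, window=3):
    """Collect one cluster's Model 1 probabilities over neighboring rounds."""
    n_cycles = prob_by_cycle.shape[0]
    feats = []
    for c in range(cycle - window, cycle + window + 1):
        if 0 <= c < n_cycles:
            feats.extend(prob_by_cycle[c, :, row, col])
        else:
            feats.extend([0.25] * 4)  # assumption: uniform probabilities off the ends
    return np.array(feats)


# Synthetic data: 10 rounds, 4 channels, an 8 x 6 grid of cluster positions.
rng = np.random.default_rng(0)
fov = rng.normal(size=(10, 4, 8, 6))
probs = np.stack([model1_probabilities(f) for f in fov])  # per-round output
x = model2_features(probs, cycle=5, row=2, col=3)
print(x.shape)  # (28,)
```

The vector `x` is what a second-stage classifier would receive for one cluster and one round. Because the two stages are trained on different FOVs, the probabilities Model 2 sees during training resemble those it will see at inference time.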

[0216] The test data consisted of 20 FOV array-based microarray sequencing data, PE150 (150+150 paired-end sequencing), from different experiments using different reagents and microarray batches. All samples were human genome samples. After using the above model, the overall sequencing error rate decreased by 49%, and the alignment rate improved by 3.1%.

[0217] In the description of this specification, references to terms such as "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., indicate that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples.

[0218] Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention. Those skilled in the art can make changes, modifications, substitutions and variations to the above embodiments within the scope of the present invention without departing from the principles and spirit of the present invention.

Claims

1. A method for base recognition, characterized in that it comprises: obtaining a feature matrix of a sequencing-by-synthesis reaction, the feature matrix including a fluorescence matrix composed of the position information and brightness information of multiple fluorescent clusters; and inputting the feature matrix into a trained prediction model and outputting the base identification results of the multiple fluorescent clusters; wherein the prediction model is a cascaded model comprising a first sub-model and a second sub-model: the first sub-model is used for spatial crosstalk correction; the first sub-model is used to output a base classification prediction probability matrix for a single round of extension reaction based on the fluorescence matrix of that single round of extension reaction; the second sub-model is used for contextual effect correction; the second sub-model is used to output a base identification result matrix for the Nth round of extension reaction based on the base classification prediction probability matrices for the (N-m)th to (N+n)th rounds of extension reaction, wherein the base classification prediction probability matrices for the (N-m)th to (N+n)th rounds of extension reaction are generated by the first sub-model; or the prediction model is a cascaded model comprising a third sub-model and a fourth sub-model: the third sub-model is used for contextual effect correction; the third sub-model is used to output the base classification prediction probability matrix of the Nth round of extension reaction based on the fluorescence matrix of each fluorescence channel in the (N-m)th to (N+n)th rounds of extension reaction; the fourth sub-model is used for spatial crosstalk correction; the fourth sub-model is used to output the base recognition result matrix of the Nth round of extension reaction based on the base classification prediction probability matrix of the Nth round of extension reaction; and the sub-models are trained independently using non-overlapping training samples.

2. The method according to claim 1, characterized in that, The sequencing-by-synthesis reaction is performed on an array-type chip, and the location information is determined based on the center position of the aperture where the fluorescent cluster is located.

3. The method according to claim 2, characterized in that, The fluorescence matrix is obtained by taking pictures of the array chip and processing the resulting pictures during the sequencing-by-synthesis process.

4. The method according to claim 3, characterized in that, The process includes: The photograph is divided into multiple sub-regions; and Processing the multiple sub-regions separately includes determining the location and brightness information of the fluorescent clusters in each sub-region in the photograph, in order to obtain multiple feature matrices.

5. The method according to claim 4, characterized in that, The plurality of feature matrices contain at least one piece of information reflecting the location determination of the sub-region in the photograph.

6. The method according to claim 4, characterized in that, At least a portion of the plurality of sub-regions overlap.

7. The method according to claim 1, characterized in that, The prediction model includes a semantic segmentation network, which is selected from at least one of networks with an encoder-decoder structure or a Transformer-based network.

8. The method according to claim 7, characterized in that, The network with the encoder-decoder structure is selected from at least one of U-Net or its variants and DeepLab.

9. The method according to claim 8, characterized in that, The Transformer-based network includes Segformer or its variants.

10. The method according to claim 9, characterized in that, The U-Net or U-Net variants, or Segformer or Segformer variants, add an attention mechanism.

11. The method according to claim 10, characterized in that, The attention mechanism includes at least one of a channel attention unit and a spatial attention unit.

12. The method according to claim 11, characterized in that, The attention mechanism is selected from at least one of the SE attention mechanism and the CBAM attention mechanism.

13. The method according to claim 11, characterized in that, The spatial attention unit uses the output of the channel attention unit as its input feature.

14. The method according to any one of claims 11-13, characterized in that, The channel attention unit includes: The first pooling subunit employs a pooling layer to reduce the number of input data channels; The neural network MLP subunit is used to process the output data of the first global max pooling sublayer and the output data of the first global average pooling sublayer using shared parameters, respectively, to obtain max pooling transformation data and average pooling transformation data. The convolution processing subunit is used to perform channel restoration on at least one of the max pooling transformed data and the average pooling transformed data using multiple convolution kernels; The output layer is used to merge and activate the output data of the convolution processing subunit and then multiply it with the input data of the channel attention unit to obtain the output data of the channel attention unit.

15. The method according to claim 14, characterized in that, The pooling layer includes an average pooling layer.

16. The method according to claim 14, characterized in that, The pooling layer further employs a first global max pooling layer and a first global average pooling layer to perform channel merging on the input data of the channel attention unit.

17. The method according to any one of claims 11-13, characterized in that, The channel attention unit includes: A compression subunit, wherein the compression subunit compresses the feature matrix of each channel by global average pooling; An activation subunit learns the weight relationships between the channels and outputs the weights of each channel through a fully connected layer. A recalibration layer is used to multiply the weights of each channel by the feature matrix of each channel to obtain the output of the attention unit.

18. The method according to claim 17, characterized in that, The compression includes compressing the data into a single numerical value.

19. The method according to claim 17, characterized in that, The activation subunit further includes a first fully connected layer, a ReLU activation layer, a second fully connected layer, and a Sigmoid activation layer.

20. The method according to any one of claims 11-13, characterized in that, The spatial attention unit includes: The second pooling subunit employs a second global max pooling layer and a second global average pooling layer to merge the two-dimensional matrix of each channel of the output data of the channel attention unit. A compression layer, wherein the compression layer uses convolutional kernels to compress and activate the output data of the second global max pooling layer and the second global average pooling layer; The output layer is used to multiply the output data of the compression layer with the output data of the channel attention unit to obtain the output data of the spatial attention unit.

21. A device for base recognition, characterized in that it comprises: a predictive feature acquisition unit, used to acquire a feature matrix of a sequencing-by-synthesis reaction, wherein the feature matrix includes a fluorescence matrix composed of the position information and brightness information of multiple fluorescent clusters; and a base recognition unit, used to input the feature matrix into a trained prediction model and output the base recognition results of the multiple fluorescent clusters; wherein the prediction model is a cascaded model comprising a first sub-model and a second sub-model: the first sub-model is used for spatial crosstalk correction; the first sub-model is used to output a base classification prediction probability matrix for a single round of extension reaction based on the fluorescence matrix of that single round of extension reaction; the second sub-model is used for contextual effect correction; the second sub-model is used to output a base identification result matrix for the Nth round of extension reaction based on the base classification prediction probability matrices for the (N-m)th to (N+n)th rounds of extension reaction, wherein the base classification prediction probability matrices for the (N-m)th to (N+n)th rounds of extension reaction are generated by the first sub-model; or the prediction model is a cascaded model comprising a third sub-model and a fourth sub-model: the third sub-model is used for contextual effect correction; the third sub-model is used to output the base classification prediction probability matrix of the Nth round of extension reaction based on the fluorescence matrix of each fluorescence channel in the (N-m)th to (N+n)th rounds of extension reaction; the fourth sub-model is used for spatial crosstalk correction; the fourth sub-model is used to output the base recognition result matrix of the Nth round of extension reaction based on the base classification prediction probability matrix of the Nth round of extension reaction; and the sub-models are trained independently using non-overlapping training samples.

22. A computer program product, characterized in that it comprises computer instructions; when some or all of the computer instructions are executed on a computer, the method for base identification as described in any one of claims 1 to 20 is performed.

23. A computing device, characterized in that, include: Processor and memory; The memory is used to store computer programs; The processor is configured to execute the computer program to implement the method for base recognition as described in any one of claims 1 to 20.

24. A computer-readable storage medium, characterized in that, The storage medium includes computer instructions that, when executed by a computer, cause the computer to implement the method for base identification as described in any one of claims 1 to 20.