Base classification method, gene sequencer, computer readable storage medium
By applying an end-to-end base classification method based on the brightness data features of fluorescence images and a neural network model in a gene sequencer, the complexity of existing base classification algorithms is reduced, which simplifies parameter settings and improves sequencing accuracy.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- MGI SHENZHEN SOFTWARE TECH CO LTD
- Filing Date
- 2021-04-16
- Publication Date
- 2026-06-26
AI Technical Summary
The base classification algorithms of existing second-generation gene sequencers are complex and rely on many manually set parameters, which limits sequencing accuracy and makes the algorithms difficult to simplify.
A base classification method is adopted, which obtains the brightness data features of fluorescence images and uses a neural network model to perform end-to-end base classification, reducing parameter settings and simplifying the algorithm process.
It achieves efficient base classification, simplifies the algorithm process and parameter settings, and improves sequencing accuracy.
Figure CN115240189B_ABST
Description
Technical Field
[0001] This invention relates to the field of gene sequencing, specifically to a base classification method, a gene sequencer, and a computer-readable storage medium.
Background Technology
[0002] This section is intended to provide background or context for implementing the embodiments of the invention as set forth in the claims and detailed description. The description herein is not intended to imply that it is prior art simply because it is included in this section.
[0003] The research and application of gene sequencing technology covers almost all areas of human society, including health, agriculture, energy, and national defense and security. It has played a huge role in promoting life sciences, biomedicine and related industries, and has a more profound impact than the information economy era.
[0004] Base classification is the most critical part of gene sequencer algorithms, and its accuracy directly determines sequencing quality. The sequencing algorithms of current mainstream second-generation gene sequencers mainly consist of steps such as fluorescence image acquisition, data processing, and clustering. The data processing often includes registration, correction, and normalization, which introduces a significant number of manually set parameters and makes the algorithm quite complex.
Summary of the Invention
[0005] In view of the above, it is necessary to propose a base classification method, a gene sequencer, and a computer-readable storage medium that can achieve end-to-end classification from the brightness data of fluorescence images to the base categories. While ensuring the base classification effect, it requires very few parameters, which can greatly simplify the current algorithm.
[0006] The base classification method includes: acquiring a fluorescence image to be identified; identifying the location of each DNA nanosphere in the fluorescence image; extracting brightness data features of the fluorescence image, the brightness data features including brightness data in M dimensions corresponding to the location of each DNA nanosphere, wherein the brightness data in M dimensions corresponding to the location of each DNA nanosphere includes: brightness data of the location of each DNA nanosphere, brightness data of each DNA nanosphere in multiple neighborhoods of the location of each DNA nanosphere, and the average brightness of the trajectory intersection points corresponding to the block where each DNA nanosphere is located; and inputting the extracted brightness data features into a preset base recognition model to obtain the base category corresponding to the fluorescence image.
[0007] Optionally, the brightness data of the M dimensions further includes: brightness data of the DNA nanospheres corresponding to the positions of each DNA nanosphere in the previous and subsequent cycles.
[0008] Optionally, before extracting the brightness data features of the fluorescence image, the method further includes: identifying, from the fluorescence image, first trajectory lines that are parallel to each other in the horizontal direction and second trajectory lines that are parallel to each other in the vertical direction; determining blocks based on two adjacent first trajectory lines and two adjacent second trajectory lines, and determining the trajectory intersection points corresponding to the blocks; determining the block where each DNA nanosphere is located according to the location of each DNA nanosphere, and determining the trajectory intersection points corresponding to the block where each DNA nanosphere is located.
[0009] Optionally, the method further includes training the base recognition model, including: acquiring a preset number of training samples, each training sample including brightness data features of a sample image and a label of the base category, the brightness data features of each sample image including M-dimensional brightness data corresponding to the location of each DNA nanosphere in each sample image, including: brightness data of the location of each DNA nanosphere in each sample image, brightness data of each DNA nanosphere in the eight neighborhood of the location of each DNA nanosphere, the average brightness of the trajectory intersection point corresponding to the block where each DNA nanosphere is located, and brightness data of the DNA nanosphere corresponding to the location of each DNA nanosphere in the previous and next cycles; dividing the preset number of sample data into a training set and a validation set, and using the training set to train a neural network to obtain the base recognition model, and using the validation set to verify the accuracy of the base recognition model; and using preset training strategies such as early stopping, learning rate decay, etc. to train the neural network, and automatically terminating the training when the accuracy no longer improves.
[0010] Optionally, the neural network includes an input layer, a hidden layer, and an output layer; the input layer includes M neurons, which correspond to the brightness data of the M dimensions respectively; the hidden layer includes D neurons; and the output layer includes four neurons, which correspond to the four base categories A, T, C, and G respectively.
[0011] Optionally, the method further includes: normalizing the brightness data features of each sample image in the hidden layer; using a linear rectified function as the activation function in the hidden layer; using a sigmoid / softmax function as the activation function in the output layer, and restricting the output result to (0, 1), so that the output result corresponds to the confidence value of the four base categories A, T, C, and G.
[0012] Optionally, the optimizer used in this method is SGD, RMSprop, or a combination of Adam and RMSprop, and the loss function used is the cross-entropy loss function.
[0013] Optionally, the method retrieves the base class label corresponding to each training sample from a pre-stored file.
[0014] Optionally, the method further includes: obtaining the correspondence between the copy number of the DNA nanospheres and the accuracy of the base recognition model; determining the abnormal point of the copy number corresponding to the abnormality of the accuracy based on the correspondence; and performing segmented training on the base recognition model based on the abnormal point of the copy number.
[0015] The gene sequencer includes a processor for implementing the base classification method when executing a computer program stored in a memory.
[0016] The computer-readable storage medium has a computer program stored thereon, which, when executed by a processor, implements the base classification method.
[0017] The base classification method, gene sequencer, and computer-readable storage medium described in this embodiment of the invention can achieve end-to-end classification from brightness data of fluorescence images to base categories. While ensuring the base classification effect, it can greatly simplify the current algorithm because very few parameters need to be set.
Attached Figure Description
[0018] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on the provided drawings without creative effort.
[0019] Figure 1 is a flowchart of a base classification method provided in a preferred embodiment of the present invention.
[0020] Figure 2A is a schematic diagram of the region containing the DNA nanospheres and the corresponding trajectory intersections provided in this embodiment of the invention.
[0021] Figure 2B is a schematic diagram of the DNA nanospheres within the eight-neighborhood provided in this embodiment of the invention.
[0022] Figure 2C is a schematic diagram of a neural network provided in an embodiment of the present invention.
[0023] Figure 2D is a schematic diagram of the distribution of different types of bases before and after copy number calibration provided in an embodiment of the present invention.
[0024] Figure 2E is a distribution diagram of DNB luminance values before and after normalization calibration provided in an embodiment of the present invention.
[0025] Figure 2F is a schematic diagram illustrating the processing effect of PE150 data provided in an embodiment of the present invention.
[0026] Figure 2G is a schematic diagram of segmented processing of long read data provided in an embodiment of the present invention.
[0027] Figure 3 is a schematic diagram of a gene sequencer provided in a preferred embodiment of the present invention.
[0028] Figure 4 is a functional block diagram of the base recognition system provided in a preferred embodiment of the present invention.
[0029] The following detailed description, in conjunction with the accompanying drawings, will further illustrate the present invention.
Detailed Implementation
[0030] To better understand the above-mentioned objects, features, and advantages of the present invention, the present invention will be described in detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, unless otherwise specified, the embodiments of the present invention and the features thereof can be combined with each other.
[0031] Numerous specific details are set forth in the following description to provide a thorough understanding of the invention. The described embodiments are merely some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort are within the scope of protection of the invention.
[0032] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
[0033] Figure 1 is a flowchart of the base classification method provided in the embodiments of the present invention.
[0034] This embodiment describes the application of the base classification method to a gene sequencer. The base classification method specifically includes the following steps; the order of these steps in the flowchart can be changed, and some steps can be omitted, depending on different needs.
[0035] Step S1: The gene sequencer acquires the fluorescence image to be identified.
[0036] In this embodiment, the fluorescence image to be identified can refer to an image of a field of view (FOV) obtained by the gene sequencer using a micro-imaging optical system to take a single picture of the biochip during gene sequencing. That is, the fluorescence image to be identified may be a fluorescence image obtained by the micro-imaging optical system when the fluorescent groups of A, G, C, or T bases are excited.
[0037] It should also be noted that the biochip can be any version. Taking version V1 as an example, the size of the biochip is approximately 75mm*25mm. The total number of fields of view (FOV, referring to the range that the objective lens of the micro-imaging optical system can observe in a single operation) included by the biochip can be, for example, 576 FOVs. This is for illustrative purposes only and should not be construed as a limitation of the present invention.
[0038] Step S2: The gene sequencer identifies the location of each DNA nanoball in the fluorescence image.
[0039] In one embodiment, a gene sequencer can obtain the coordinates of DNA nanospheres within each block by precisely locating the intersections of the trajectories.
[0040] It should be noted that the biochip is an arrayed chip with track lines distributed parallel to each other in the horizontal and vertical directions. Therefore, the field of view (FOV) image obtained by the micro-imaging optical system in a single shot of the biochip will include these track lines distributed in the horizontal and vertical directions. For ease of explanation, the track line distributed in the horizontal direction will be referred to as the first track line, and the track line distributed in the vertical direction will be referred to as the second track line.
[0041] In this embodiment, the intersection of the first trajectory line and the second trajectory line is called the trajectory intersection point. The region between two adjacent first trajectory lines and two adjacent second trajectory lines is called a block. Several array spots are evenly distributed on each block, and one DNA nanosphere can be loaded on each spot. The array spots can be ordered arrays, such as rectangles (including squares), chessboards (the positions of the dots adjacent to the corners rather than the edges, such as the black boxes on a chessboard), or hexagonal dots, or arranged in a disordered manner. The adjacent spots of the DNA nanosphere at a specified spot are called neighborhood spots. The number of neighborhood spots can be multiple depending on the arrangement of the spots. For example, three neighborhood spots may be distributed in a triangle around the specified DNA nanosphere, or eight neighborhood spots may be arranged in a rectangle around the specified DNA nanosphere, etc. The number of neighborhood spots is not limited.
[0042] Step S3: The gene sequencer extracts the brightness data features of the fluorescence image. The brightness data features include brightness data in M dimensions corresponding to the location of each DNA nanosphere. In this embodiment, M is a positive integer.
[0043] In this embodiment, the brightness data in the M dimensions corresponding to the location of each DNA nanosphere includes, but is not limited to, the brightness data of the location of each DNA nanosphere, the brightness data of each DNA nanosphere in multiple neighborhoods, such as eight neighborhoods, of the location of each DNA nanosphere, and the average brightness of the trajectory intersection points corresponding to the block where each DNA nanosphere is located.
[0044] In this embodiment, the brightness data in the M dimensions corresponding to the location of each DNA nanosphere further includes the brightness data of the DNA nanosphere corresponding to the position of each DNA nanosphere in the preceding and following cycles. Therefore, if Two-Color sequencing technology is used, i.e., two colors of fluorescence are used to label different types of bases, then M equals 24.
[0045] In this embodiment, before extracting the brightness data features of the fluorescence image, the gene sequencer performs the following operations: identifying first trajectory lines parallel to each other in the horizontal direction and second trajectory lines parallel to each other in the vertical direction from the fluorescence image; determining blocks based on two adjacent first trajectory lines and two adjacent second trajectory lines, and determining the trajectory intersection points corresponding to the blocks; determining the block where each DNA nanosphere is located based on the location of each DNA nanosphere, and determining the trajectory intersection points corresponding to the blocks where each DNA nanosphere is located. Thus, the gene sequencer can obtain the brightness data of the location of each DNA nanosphere in the fluorescence image, the brightness data of each DNA nanosphere within the eight neighborhoods of the location of each DNA nanosphere, and the average brightness of the trajectory intersection points corresponding to the blocks where each DNA nanosphere is located.
[0046] For example, suppose the image shown in Figure 2A is the fluorescence image obtained at the 10th field of view (FOV) of the biochip. Using an image recognition algorithm, such as template matching, the region where the DNA nanosphere 110 is located is identified as region 1210. This region 1210 corresponds to four trajectory intersection points: 1211, 1212, 1213, and 1214. Therefore, the average brightness of the trajectory intersection points corresponding to region 1210 is the average brightness of the four trajectory intersection points 1211, 1212, 1213, and 1214.
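The block lookup and the averaging of the four intersection points can be sketched as follows. This is a minimal illustration rather than the method of the invention: it assumes the track lines have already been detected and are given as sorted pixel coordinates, and the names `locate_block`, `block_corner_mean`, and `brightness_at` are hypothetical.

```python
from bisect import bisect_right


def locate_block(x, y, vertical_lines, horizontal_lines):
    """Return the (col, row) index of the block that contains point (x, y).

    vertical_lines: sorted x-coordinates of the second (vertical) track lines.
    horizontal_lines: sorted y-coordinates of the first (horizontal) track lines.
    """
    col = bisect_right(vertical_lines, x) - 1
    row = bisect_right(horizontal_lines, y) - 1
    inside = (0 <= col < len(vertical_lines) - 1
              and 0 <= row < len(horizontal_lines) - 1)
    if not inside:
        raise ValueError("point lies outside the track-line grid")
    return col, row


def block_corner_mean(col, row, vertical_lines, horizontal_lines, brightness_at):
    """Average the brightness at the four track-line intersections of a block.

    brightness_at is an assumed callable (x, y) -> brightness.
    """
    corners = [(vertical_lines[c], horizontal_lines[r])
               for c in (col, col + 1)
               for r in (row, row + 1)]
    return sum(brightness_at(cx, cy) for cx, cy in corners) / 4.0
```

For a nanosphere at a given pixel position, `locate_block` plays the role of finding region 1210, and `block_corner_mean` plays the role of averaging intersection points 1211 to 1214.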
[0047] To clearly illustrate the brightness data of each DNA nanosphere within its eight-neighborhood, refer to Figure 2B. As shown, for DNA nanosphere 110, the DNA nanospheres in the eight-neighborhood of the location of DNA nanosphere 110 include 111, 112, 113, 114, 115, 116, 117, and 118. Therefore, the brightness data of each DNA nanosphere in the eight-neighborhood of the location of DNA nanosphere 110 includes the brightness data of the locations of DNA nanospheres 111, 112, 113, 114, 115, 116, 117, and 118, respectively.
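As an illustration of the eight-neighborhood, the sketch below collects the brightness of the eight surrounding spots on a rectangular spot array. It assumes the brightness values are already arranged as a row-by-column grid of spots; the function name and the `fill` value used for spots that fall outside the grid are assumptions, since the source does not describe edge handling.

```python
# Offsets (d_row, d_col) of the eight neighbors on a rectangular spot array.
EIGHT_NEIGHBOR_OFFSETS = [(dr, dc)
                          for dr in (-1, 0, 1)
                          for dc in (-1, 0, 1)
                          if (dr, dc) != (0, 0)]


def eight_neighbor_brightness(grid, row, col, fill=0.0):
    """Return the brightness of the eight spots around grid[row][col].

    Neighbors that fall outside the grid are replaced by `fill`
    (an assumed convention, not taken from the source).
    """
    values = []
    for dr, dc in EIGHT_NEIGHBOR_OFFSETS:
        r, c = row + dr, col + dc
        inside = 0 <= r < len(grid) and 0 <= c < len(grid[0])
        values.append(grid[r][c] if inside else fill)
    return values
```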
[0048] It should also be noted that, in this embodiment, for any DNA nanosphere in the fluorescence image to be identified, the brightness data of the DNA nanosphere corresponding to the position in the preceding and following cycles includes: brightness data of a first target position in the first target fluorescence image and brightness data of a second target position in the second target fluorescence image; wherein, the first target fluorescence image refers to a fluorescence image obtained before the fluorescence image to be identified is captured, and the second target fluorescence image refers to a fluorescence image obtained after the fluorescence image to be identified is captured; wherein, the coordinates of the first target position are the same as the coordinates of the DNA nanosphere in the fluorescence image to be identified; and the coordinates of the second target position are the same as the coordinates of the DNA nanosphere in the fluorescence image to be identified.
[0049] For example, assuming the fluorescence image to be identified is a fluorescence image obtained by the micro-imaging optical system in the nth cycle of a certain FOV of the biochip, then the first target fluorescence image refers to the fluorescence image obtained in the (n-1)th cycle of that certain FOV of the biochip, and the second target fluorescence image is the fluorescence image obtained in the (n+1)th cycle of that certain FOV of the biochip.
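The source states that M equals 24 for two-color sequencing but does not itemize the 24 values. One reading that reproduces this count is sketched below: for each of the two channels, the nanosphere's own brightness, its eight neighbors, and the block's average intersection brightness (10 values per channel, 20 in total), plus the nanosphere's own brightness in both channels for cycle n-1 and cycle n+1 (4 values). The ordering of the features and the function name are assumptions.

```python
def build_feature_vector(center, neighbors, intersection_mean,
                         previous_center, next_center):
    """Assemble the brightness features of one DNA nanosphere.

    Every argument holds one entry per fluorescence channel:
      center[ch]            brightness at the nanosphere position (cycle n)
      neighbors[ch]         list of eight neighbor brightness values (cycle n)
      intersection_mean[ch] average brightness of the block's intersections
      previous_center[ch]   brightness at the same coordinates in cycle n-1
      next_center[ch]       brightness at the same coordinates in cycle n+1
    """
    features = []
    for ch in range(len(center)):
        features.append(center[ch])
        features.extend(neighbors[ch])
        features.append(intersection_mean[ch])
    features.extend(previous_center)
    features.extend(next_center)
    return features
```

With two channels this yields 2 x (1 + 8 + 1) + 2 + 2 = 24 values, matching the stated M.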
[0050] Step S4: The gene sequencer inputs the extracted brightness data features into a preset base recognition model to obtain the base category corresponding to the fluorescence image.
[0051] In this embodiment, the gene sequencer is pre-trained to obtain the base recognition model. The steps for the gene sequencer to obtain the base recognition model include (a)-(b):
[0052] (a) Obtain a preset number of training samples. Each training sample includes the brightness data features of a sample image and the label of the base category. The brightness data features of each sample image include the brightness data of M dimensions corresponding to the location of each DNA nanosphere in each sample image, including: the brightness data of the location of each DNA nanosphere in each sample image, the brightness data of each DNA nanosphere in the eight neighborhood of the location of each DNA nanosphere, and the average brightness of the trajectory intersection point corresponding to the block where each DNA nanosphere is located.
[0053] In one embodiment, the brightness data of the M dimensions further includes: brightness data of the DNA nanospheres corresponding to the position of each DNA nanosphere in the previous and next cycles.
[0054] In this embodiment, the sample image is a fluorescence image. In one embodiment, the gene sequencer can obtain the base class label corresponding to each sample from a pre-stored file (e.g., a SAM file, which can be generated by other sequencing algorithms (such as clustering)).
[0055] In one embodiment, the pre-stored file records the base class corresponding to each sample.
[0056] In one embodiment, if the base classification of the sample recorded in the pre-stored file is incorrect, the gene sequencer will only use the correctly classified samples as training samples.
[0057] In other embodiments, the gene sequencer may also use a portion of the samples and their corresponding labels to train a neural network to obtain a label recognition model, then use the label recognition model to generate labels for other samples, and then use the portion of the samples and their corresponding labels, along with the other samples and their corresponding labels, as training samples to train the neural network to obtain the base recognition model.
[0058] (b) Divide the preset number of training samples into a training set and a validation set, and use the training set to train a neural network to obtain the base recognition model, and use the validation set to verify the accuracy of the base recognition model.
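A minimal sketch of step (b) is shown below. The source does not give a split ratio, so the 20% validation fraction, the fixed seed, and the function name are assumptions made for illustration.

```python
import random


def split_samples(samples, validation_fraction=0.2, seed=0):
    """Shuffle the samples and split them into (training_set, validation_set)."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_validation = max(1, int(round(len(shuffled) * validation_fraction)))
    return shuffled[n_validation:], shuffled[:n_validation]
```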
[0059] In this embodiment, the neural network includes an input layer, a hidden layer, and an output layer. The input layer includes M neurons, each corresponding to one of the M dimensions of brightness data. The hidden layer includes D neurons, and the output layer includes four neurons, each corresponding to one of the four base classes: A, T, C, and G. M and D are positive integers. The number of neurons D in the hidden layer is greater than the number of neurons in the output layer but less than twice the number of neurons in the input layer, and is generally adjusted based on experience and actual training results.
[0060] In this embodiment, as shown in Figure 2C, the M neurons in the input layer correspond to the brightness data in the M dimensions. For example, M can be equal to 24, and D can be equal to 14.
[0061] In this embodiment, the neural network normalizes the brightness data features of each sample image in the hidden layer; a linear rectified function is used as the activation function in the hidden layer; and a sigmoid / softmax function is used as the activation function in the output layer, limiting the output result to (0, 1), so that the output result corresponds to the confidence value of the four base categories A, T, C, and G.
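The forward pass of such a network can be sketched in plain Python as follows, using the example sizes M = 24 and D = 14, a rectified linear activation in the hidden layer, and softmax in the output layer. The weights here are random placeholders rather than trained values, the normalization step is omitted because the source does not specify its form, and all function names are assumptions.

```python
import math
import random

BASES = ("A", "T", "C", "G")


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    # Subtract the maximum for numerical stability.
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    """Fully connected layer: one weight row and one bias per output neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def init_layer(n_out, n_in, rng):
    """Random placeholder weights; a trained model would load real ones."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def predict(features, hidden, output):
    """Return one confidence value per base category, each in (0, 1)."""
    hidden_activations = relu(dense(features, *hidden))
    return softmax(dense(hidden_activations, *output))


rng = random.Random(0)
hidden = init_layer(14, 24, rng)   # D = 14 hidden neurons, M = 24 inputs
output = init_layer(4, 14, rng)    # four output neurons: A, T, C, G
confidences = predict([0.5] * 24, hidden, output)
called_base = BASES[confidences.index(max(confidences))]
```

The called base is taken here as the category with the highest confidence, which is a common convention and not something the source states explicitly.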
[0062] In this embodiment, the optimizer used by the neural network is SGD (Stochastic Gradient Descent), RMSprop (Root Mean Square Propagation), or a combination of Adam (Adaptive Moment Estimation) and RMSprop, and the loss function used is the cross-entropy loss function.
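For reference, the cross-entropy loss for a single sample and a plain SGD update can be written as below. This is a generic textbook sketch, not the training code of the embodiment; the function names and the default learning rate are assumptions. The gradient helper relies on the standard result that, for a softmax output with cross-entropy loss, the gradient with respect to the output-layer pre-activations is the predicted probability vector minus the one-hot label.

```python
import math


def cross_entropy(probabilities, true_index, eps=1e-12):
    """Cross-entropy loss of one sample given its true base index."""
    return -math.log(max(probabilities[true_index], eps))


def softmax_cross_entropy_grad(probabilities, true_index):
    """Gradient of the loss with respect to the output-layer pre-activations."""
    return [p - (1.0 if i == true_index else 0.0)
            for i, p in enumerate(probabilities)]


def sgd_step(params, grads, learning_rate=0.01):
    """One plain stochastic-gradient-descent update on a flat parameter list."""
    return [p - learning_rate * g for p, g in zip(params, grads)]
```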
[0063] In this embodiment, training strategies such as early stopping and learning rate decay can be used to train the neural network, and training will automatically terminate when the accuracy no longer improves.
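Early stopping combined with learning rate decay can be sketched as a small training loop. The exponential decay schedule, the patience of five epochs, and the callables `train_one_epoch` and `evaluate` are assumptions chosen for illustration; the source names the strategies but not their settings.

```python
def fit(train_one_epoch, evaluate, initial_lr=0.01, decay_rate=0.95,
        patience=5, max_epochs=200):
    """Train until validation accuracy stops improving.

    train_one_epoch(learning_rate) runs one pass over the training set.
    evaluate() returns the current accuracy on the validation set.
    Returns (best_accuracy, best_epoch, epochs_run).
    """
    best_accuracy = float("-inf")
    best_epoch = -1
    epochs_run = 0
    for epoch in range(max_epochs):
        # Exponential learning rate decay.
        learning_rate = initial_lr * decay_rate ** epoch
        train_one_epoch(learning_rate)
        accuracy = evaluate()
        epochs_run = epoch + 1
        if accuracy > best_accuracy:
            best_accuracy, best_epoch = accuracy, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop automatically.
            break
    return best_accuracy, best_epoch, epochs_run
```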
[0064] It should be noted that, in other embodiments, the gene sequencer may also collect data (i.e., fluorescence images) for two FOVs respectively; use the data collected for one FOV as the training sample, and use the training sample to train and validate the base recognition model; and use the data collected for the other FOV to test the base recognition model.
[0065] It should also be noted that, in one embodiment, in step (a), the gene sequencer can also calibrate the brightness data of the location of each DNA nanosphere in each sample image.
[0066] Specifically, because DNA fragments are of varying lengths when broken down, during rolling circle amplification (RCA) to generate DNBs within time t, shorter fragments replicate more times, resulting in higher light intensity, while longer fragments replicate fewer times, resulting in lower light intensity. Therefore, copy number calibration is necessary. The gene sequencer can first calculate the copy number coefficient. The calculation process includes: training a first machine learning model using data without copy number calibration; using this first machine learning model to determine the bases in the first X (e.g., 10) cycles of each DNB; counting the number of cycles in the first X cycles of each DNB that are classified as non-G bases, denoted as n; calculating the sum of the channel brightness values for A, T, and C bases, denoted as m; and calculating the copy number coefficient for each DNB as n / m.
[0067] The gene sequencer then multiplies the brightness value of each DNB by its copy number coefficient, thereby calibrating the brightness value of each DNB. For each DNB brightness value, normalization calibration is then performed separately for the H and L channel data of different cycles.
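The copy number coefficient n / m and its application can be sketched as follows. This is one reading of the description: m is taken as the sum of all channel brightness values over the cycles whose preliminary call is A, T, or C. The source does not explain why G is excluded, and the sketch simply follows it. The fallback coefficient of 1.0 for a DNB with no usable cycles and the function names are assumptions, and the subsequent per-cycle normalization is not shown because its form is not specified.

```python
def copy_number_coefficient(calls, brightness, excluded_base="G"):
    """Compute n / m for one DNB over its first X cycles.

    calls:      preliminary base call per cycle, e.g. ["A", "G", "C", ...]
    brightness: per cycle, a list of channel brightness values
    """
    lit = [sum(channels)
           for call, channels in zip(calls, brightness)
           if call != excluded_base]
    n = len(lit)          # number of non-G cycles
    m = sum(lit)          # summed channel brightness of those cycles
    if n == 0 or m <= 0:
        return 1.0        # assumed fallback: leave the DNB uncalibrated
    return n / m


def calibrate(brightness, coefficient):
    """Multiply every brightness value of a DNB by its coefficient."""
    return [[value * coefficient for value in channels]
            for channels in brightness]
```

A DNB with many copies has a large m and therefore a small coefficient, so its brightness is scaled down, while a DNB with few copies is scaled up.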
[0068] The gene sequencer then trains a neural network based on the luminance data of each calibrated DNB to obtain the base recognition model.
[0069] Figure 2D shows the DNB luminance value distribution before and after copy number calibration: the left side shows the distribution before calibration, and the right side shows the distribution after calibration. This demonstrates that copy number calibration increases the distance between base classes and decreases the distance within each class, thus improving the accuracy of base classification.
[0070] Figure 2E shows the distribution of DNB luminance values for 100 cycles and 130 cycles before and after normalization calibration, respectively. This demonstrates that normalization makes the base distribution more consistent, which helps improve the accuracy of base classification.
[0071] It should also be noted that, because the enzyme controlling the reaction has a probability of not reacting, this effect becomes increasingly pronounced as the number of cycles grows. This is reflected in the test results, where the mismatch rate for long-read test samples increases more significantly with each additional cycle. For example, as shown in Figures 2F and 2G, taking PE150 as an example, mismatches increased too rapidly after 100 cycles. To improve the accuracy of the results, in this embodiment, the gene sequencer can divide the PE150 data into segments of 1-100 cycles and 101-150 cycles for machine learning training and testing. Similarly, for longer read data, if there are N points with excessively rapid mismatch growth, the model can be divided into N+1 segments for corresponding processing. That is, in this invention, the gene sequencer can obtain the correspondence between the copy number of the DNA nanospheres and the accuracy of the base recognition model, determine the abnormal copy number points corresponding to the abnormal accuracy based on the correspondence, and perform segmented training of the base recognition model based on the abnormal copy number points.
[0072] In this embodiment, N is a positive integer.
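The segmentation into N+1 parts can be sketched as a lookup from cycle number to segment. The sketch assumes each breakpoint is the last cycle of its segment (100 in the PE150 example) and that a separate model is trained per segment; the function names are hypothetical.

```python
from bisect import bisect_left


def segment_ranges(total_cycles, breakpoints):
    """Return the (first_cycle, last_cycle) range of each of the N+1 segments.

    breakpoints: sorted list of the N cycles after which mismatches grow
    too rapidly; each breakpoint is the last cycle of its segment.
    """
    starts = [1] + [b + 1 for b in breakpoints]
    ends = list(breakpoints) + [total_cycles]
    return list(zip(starts, ends))


def segment_index(cycle, breakpoints):
    """Return the index of the segment (and thus the model) for a cycle."""
    return bisect_left(breakpoints, cycle)
```

For PE150 with a single breakpoint at cycle 100, `segment_ranges(150, [100])` gives the two segments 1-100 and 101-150 described above.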
[0073] Figure 3 is a schematic diagram of the internal structure of the gene sequencer provided in an embodiment of the present invention.
[0074] In a preferred embodiment of the present invention, the gene sequencer 3 includes, but is not limited to, a memory 31, at least one processor 32, a microscopic imaging optical system 33, and a base recognition system 30 that are directly electrically connected to each other.
[0075] Those skilled in the art should understand that the structure of the gene sequencer 3 shown in Figure 3 is not intended to limit the embodiments of the present invention. The structure of the gene sequencer 3 can be either a bus topology or a star topology. The gene sequencer 3 may also include more or fewer other hardware or software than shown, or different component arrangements.
[0076] Although not shown, the gene sequencer 3 may also include a power supply (such as a battery) to power the various components. Preferably, the power supply can be logically connected to the at least one processor 32 via a power management device, thereby enabling functions such as charging, discharging, and power consumption management. The power supply may also include one or more DC or AC power supplies, recharging devices, power fault detection circuits, power converters or inverters, power status indicators, and other arbitrary components. The gene sequencer 3 may also include other components, such as biochips, sensors, Wi-Fi modules, etc., which will not be described in detail here.
[0077] It should be understood that the embodiments described are for illustrative purposes only, and the scope of the patent application is not limited to this structure.
[0078] In some embodiments, the microscopic imaging optical system 33 is used to capture fluorescence images of each field of view (FOV) of the biochip.
[0079] In some embodiments, the memory 31 is used to store program code and various data, such as the base recognition system 30 installed in the gene sequencer 3, and to achieve high-speed and automatic access to programs or data during the operation of the gene sequencer 3. The memory 31 includes read-only memory (ROM), random access memory (RAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), one-time programmable read-only memory (OTPROM), electrically erasable programmable read-only memory (EEPROM), compact disc read-only memory (CD-ROM) or other optical disc storage, disk storage, magnetic tape storage, or any other computer-readable storage medium capable of carrying or storing data.
[0080] In some embodiments, the at least one processor 32 may be composed of integrated circuits, such as a single-packaged integrated circuit or multiple integrated circuits packaged with the same or different functions, including combinations of one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, and various control chips. The at least one processor 32 is the control unit of the gene sequencer 3, connecting various components of the gene sequencer 3 via various interfaces and lines. It executes programs or modules stored in the memory 31 and calls data stored in the memory 31 to perform various functions of the gene sequencer 3 and process data, for example, the base recognition function shown in Figure 1.
[0081] As shown in Figure 4, the base recognition system 30 may include one or more computer instructions in the form of a program, which are stored in the memory 31 and executed by the at least one processor 32. In one embodiment, the base recognition system 30 may be integrated into the at least one processor 32. In other embodiments, the base recognition system 30 may be independent of the processor 32. As also shown in Figure 4, the base recognition system 30 may include one or more modules, such as the acquisition module 301 and the execution module 302.
[0082] Specifically, the acquisition module 301 can acquire a fluorescence image to be identified; the execution module 302 can identify the location of each DNA nanosphere in the fluorescence image; and extract the brightness data features of the fluorescence image, the brightness data features including brightness data in M dimensions corresponding to the location of each DNA nanosphere. The execution module 302 can also input the extracted brightness data features into a preset base recognition model to obtain the base category corresponding to the fluorescence image.
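The feature extraction performed by the execution module 302 can be illustrated with a minimal sketch. It is not the patented implementation. It assumes that the "neighborhood" brightness values are the eight pixels adjacent to each DNA nanosphere, that pixels outside the image are padded with 0.0, and that the average brightness of the trajectory intersection points for each nanosphere's block has already been computed elsewhere. The names `pixel`, `extract_features`, `dnb_positions`, and `intersection_means` are hypothetical, and the resulting layout gives M = 10.

```python
from typing import List, Sequence, Tuple

Image = Sequence[Sequence[float]]  # row-major grayscale fluorescence image


def pixel(image: Image, row: int, col: int) -> float:
    """Return the brightness at (row, col), or 0.0 outside the image (assumed padding)."""
    if 0 <= row < len(image) and 0 <= col < len(image[0]):
        return float(image[row][col])
    return 0.0


def extract_features(
    image: Image,
    dnb_positions: Sequence[Tuple[int, int]],
    intersection_means: Sequence[float],
) -> List[List[float]]:
    """Build one M-dimensional brightness vector per DNA nanosphere.

    Assumed layout: [center, eight neighbors in row order, block intersection mean].
    """
    features = []
    for (row, col), track_mean in zip(dnb_positions, intersection_means):
        vector = [pixel(image, row, col)]  # brightness at the nanosphere itself
        for d_row in (-1, 0, 1):
            for d_col in (-1, 0, 1):
                if (d_row, d_col) != (0, 0):  # skip the center, already added
                    vector.append(pixel(image, row + d_row, col + d_col))
        vector.append(float(track_mean))  # mean brightness of the block's intersections
        features.append(vector)
    return features
```

Each returned vector would then be passed to the preset base recognition model. Brightness values from the previous and next cycles could be appended to the same vector in the same way.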
[0083] The term "module" as used in this specification refers to hardware or firmware, or to a set of software instructions written in a programming language such as Java or C. One or more software instructions in a module may be embedded in firmware, such as in an erasable programmable memory. The module described in this embodiment can be implemented as a software and/or hardware module and can be stored in any type of non-transitory computer-readable storage medium or other storage medium, such as memory 31, and executed by the at least one processor 32 to implement, for example, the base recognition function shown in Figure 1. It should be noted that in this embodiment, the modules described as separate components may or may not be physically separate. The components shown as modules may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs.
[0084] Furthermore, the functional modules in the various embodiments of the present invention can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or in the form of hardware plus software functional modules.
[0085] It will be apparent to those skilled in the art that the present invention is not limited to the details of the exemplary embodiments described above, and that the invention can be implemented in other specific forms without departing from the spirit or essential characteristics of the invention. Therefore, the embodiments should be considered illustrative and non-limiting in all respects, and the scope of the invention is defined by the appended claims rather than the foregoing description. Thus, all variations falling within the meaning and scope of equivalents of the claims are intended to be embraced within the present invention. No reference numerals in the claims should be construed as limiting the scope of the claims. Furthermore, it is clear that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. Multiple elements or devices recited in the apparatus claims may also be implemented by a single element or device in software or hardware. The terms "first," "second," etc., are used to indicate names and do not indicate any particular order.
[0086] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.
Claims
1. A method for classifying bases, characterized in that, The method includes: acquiring a fluorescence image to be identified; identifying the location of each DNA nanosphere in the fluorescence image; identifying, from the fluorescence image, first trajectory lines that are parallel in the horizontal direction and second trajectory lines that are parallel in the vertical direction; determining blocks based on two adjacent first trajectory lines and two adjacent second trajectory lines, and determining the trajectory intersection points corresponding to the blocks; determining the block where each DNA nanosphere is located according to the location of each DNA nanosphere, and determining the trajectory intersection points corresponding to the block where each DNA nanosphere is located; extracting the brightness data features of the fluorescence image, the brightness data features including M-dimensional brightness data corresponding to the location of each DNA nanosphere, wherein the M-dimensional brightness data corresponding to the location of each DNA nanosphere includes: the brightness data at the location of each DNA nanosphere, the brightness data of each DNA nanosphere within multiple neighborhoods of the location of each DNA nanosphere, and the average brightness of the trajectory intersection points corresponding to the block where each DNA nanosphere is located; and inputting the extracted brightness data features into a preset base recognition model to obtain the base category corresponding to the fluorescence image, wherein the base recognition model is a machine learning model that learns, through training, the mapping relationship between the brightness data features of the M dimensions and the base category.
2. The base classification method as described in claim 1, characterized in that, The brightness data in the M dimensions also includes the brightness data of the DNA nanospheres corresponding to the positions of each DNA nanosphere in the previous and next cycles.
3. The base classification method as described in claim 2, characterized in that, For any DNA nanosphere in the fluorescence image, the brightness data of the DNA nanosphere corresponding to the position in the preceding and following cycles includes: brightness data of a first target position in the first target fluorescence image and brightness data of a second target position in the second target fluorescence image; wherein, the first target fluorescence image refers to a fluorescence image obtained before the first fluorescence image is captured, and the second target fluorescence image refers to a fluorescence image obtained after the first fluorescence image is captured; wherein, the coordinates of the first target position are the same as the coordinates of the DNA nanosphere in the fluorescence image; and the coordinates of the second target position are the same as the coordinates of the DNA nanosphere in the fluorescence image.
4. The base classification method as described in claim 1, characterized in that, The method also includes training the base recognition model, including: obtaining a preset number of training samples, wherein each training sample includes the brightness data features of a sample image and a label of the base category, and the brightness data features of each sample image include the brightness data of M dimensions corresponding to the location of each DNA nanosphere in each sample image, including: the brightness data at the location of each DNA nanosphere in each sample image, the brightness data of each DNA nanosphere in the eight-neighborhood of the location of each DNA nanosphere, the average brightness of the trajectory intersection points corresponding to the block where each DNA nanosphere is located, and the brightness data of the DNA nanosphere corresponding to the position of each DNA nanosphere in the previous and next cycles; dividing the preset number of training samples into a training set and a validation set; and training the neural network using the training set to obtain the base recognition model, and verifying the accuracy of the base recognition model using the validation set, which includes: training the neural network using a preset training strategy, and automatically terminating the training when the accuracy no longer improves.
5. The base classification method as described in claim 4, characterized in that, The neural network includes an input layer, a hidden layer, and an output layer; the input layer includes M neurons, which correspond to the brightness data of the M dimensions respectively; the hidden layer includes D neurons; and the output layer includes four neurons, which correspond to the four base classes A, T, C, and G respectively.
6. The base classification method as described in claim 5, characterized in that, The method further includes: normalizing the brightness data features of each sample image in the hidden layer; using a rectified linear unit (ReLU) function as the activation function in the hidden layer; and using a sigmoid or softmax function as the activation function in the output layer to restrict the output result to (0, 1), so that the output result corresponds to the confidence values of the four base categories A, T, C, and G.
7. The base classification method as described in claim 5, characterized in that, The method employs SGD, RMSprop, or a combination of Adam and RMSprop as optimizers, and uses a cross-entropy loss function.
8. The base classification method as described in claim 4, characterized in that, The method also includes: Obtain the correspondence between the copy number of the DNA nanospheres and the accuracy of the base recognition model, determine the abnormal points of the copy number corresponding to the abnormal accuracy based on the correspondence, and perform segmented training on the base recognition model based on the abnormal points of the copy number.
9. A gene sequencer, characterized in that, The gene sequencer includes a processor for executing a computer program stored in a memory to implement the base classification method as described in any one of claims 1 to 8.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that, When the computer program is executed by the processor, it implements the base classification method as described in any one of claims 1 to 8.
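The network structure recited in claims 5 to 7 can be illustrated with a minimal forward-pass sketch. It is an illustration under stated assumptions, not the claimed implementation: the hidden width D, the weight initialization range, and every function name (`init_layer`, `dense`, `forward`, `cross_entropy`, `call_base`) are hypothetical, the softmax variant of the output activation is chosen, and the normalization step and the optimizer-driven training loop are omitted.

```python
import math
import random
from typing import List, Sequence, Tuple

BASES = ("A", "T", "C", "G")  # order of the four output neurons (assumed)
Layer = Tuple[List[List[float]], List[float]]  # (weights, biases)


def init_layer(n_out: int, n_in: int, rng: random.Random) -> Layer:
    """Create small random weights and zero biases (illustrative initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(inputs: Sequence[float], weights, biases) -> List[float]:
    """Fully connected layer: one weighted sum plus bias per output neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values: Sequence[float]) -> List[float]:
    """Rectified linear activation used in the hidden layer."""
    return [max(0.0, v) for v in values]


def softmax(values: Sequence[float]) -> List[float]:
    """Map scores to confidences in (0, 1) that sum to 1."""
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def forward(features: Sequence[float], hidden: Layer, output: Layer) -> List[float]:
    """M inputs -> D hidden neurons (ReLU) -> four base confidences (softmax)."""
    return softmax(dense(relu(dense(features, *hidden)), *output))


def cross_entropy(probs: Sequence[float], label_index: int) -> float:
    """Cross-entropy loss for a single sample with a one-hot label."""
    return -math.log(max(probs[label_index], 1e-12))


def call_base(probs: Sequence[float]) -> str:
    """Pick the base category with the highest confidence."""
    return BASES[max(range(len(probs)), key=probs.__getitem__)]
```

With M = 10 and D = 8, for example, `forward([0.5] * 10, init_layer(8, 10, rng), init_layer(4, 8, rng))` returns four confidence values, and `call_base` maps them to one of A, T, C, or G. During training, the cross-entropy loss would be minimized by one of the optimizers named in claim 7.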