Base sequence determination device

The base sequencing device addresses peak identification issues in electrophoresis chromatograms by dynamically adjusting sigma values to minimize error, ensuring accurate base sequence determination.

WO2026140110A1 · PCT designated stage · Publication Date: 2026-07-02 · HITACHI HIGH TECH CORP

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
HITACHI HIGH TECH CORP
Filing Date
2024-12-25
Publication Date
2026-07-02

AI Technical Summary

Technical Problem

Conventional DNA sequencing techniques using electrophoresis chromatograms face challenges in accurately identifying peaks due to fixed sigma variation ranges, leading to overestimation of base numbers and difficulty in estimating the correct sigma value, especially in regions with multiple local minima.

Method used

A base sequencing device that fits a point spread function to a chromatogram, calculates a sigma estimation function for each measurement point, and adjusts the sigma value range dynamically to accurately identify peaks by minimizing error between the Gaussian distribution function and the chromatogram.

Benefits of technology

Enables precise peak identification in electrophoresis chromatograms by dynamically varying the sigma value range, thereby accurately determining the base sequence of a biological sample.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure JP2024045908_02072026_PF_FP_ABST

Abstract

The purpose of the present invention is to provide technology capable of accurately identifying the peaks of a chromatogram obtained by electrophoresis of a biological sample. A base sequence determination device according to the present invention fits a point spread function to a chromatogram, calculates a sigma estimation function that defines a standard deviation of the point spread function for each measurement point of the chromatogram, and estimates the correct standard deviation by searching the region surrounding the sigma estimation function (see fig. 13).

Description

DNA sequencing machine

[0001] This invention relates to a technique for determining the base sequence of a biological sample using a chromatogram obtained by electrophoresis of the biological sample.

[0002] There is a technique for estimating the DNA sequence of a biological sample by using a chromatogram obtained by electrophoresis of the DNA-containing biological sample in a capillary tube. In this technique, for example, a Gaussian distribution function is fitted to the chromatogram, and the peaks of that Gaussian distribution function are identified as peaks in the chromatogram.

[0003] Patent Document 1 below describes a technology that includes: (1) a mobility correction unit that outputs a mobility correction signal obtained by correcting the mobility of a time-series signal of the wavelength spectrum corresponding to each base; (2) a deconvolution unit that performs the following processes for a plurality of parameter candidates of the point spread function: calculating the deconvolution signal of the mobility correction signal for each of the mobility correction signals, calculating the variance of the peak interval for the calculated deconvolution signal, identifying the parameters of the point spread function using the calculated variance, and outputting the deconvolution signal corresponding to the point spread function having the identified parameters as an updated deconvolution signal; (3) a peak extraction unit that extracts peak waveforms from the updated deconvolution signal and outputs an updated peak extraction signal; and (4) a sequence identification unit that receives the updated peak extraction signal and determines the base sequence. (See abstract).

[0004] Patent Document 2 below aims to "enable accurate analysis of the base sequence even for electrophoresis data containing degraded parts." It describes a technique of "determining the base sequence of a nucleic acid by including the following steps (A) to (C) in this order. (A) A base peak extraction step of extracting a base peak from electrophoresis data containing peaks of four types of base species obtained by electrophoretic separation of a sample nucleic acid. (B) A condition setting step of setting a search start point base peak and a peak interval reference value for starting a search in time series data composed of the extracted base peaks. (C) In the time series data, starting from the search start point base peak, sequentially scanning the intervals between adjacent base peaks in the forward and backward directions of the time series, comparing the interval between the base peaks with the peak interval reference value, and adding interpolation peaks to the peak missing section to determine the base sequence." (See the abstract).

[0005] Patent Document 1: WO2017/130349; Patent Document 2: WO2008/050426

[0006] Patent Document 1 evaluates the error between the deconvolution signal and the measurement signal. This error tends to be smaller as the standard deviation of the point spread function (for example, the Gaussian distribution function) is smaller. Consequently, there is a strong tendency to select a small standard deviation as the candidate for the correct value, which may result in the number of bases being overestimated.

[0007] Patent Document 2 compares the interval between base peaks with the peak interval reference value and adds interpolation peaks to the peak missing section. This method may add an excessive number of interpolation peaks in a section where the peak interval reference value is small. As a result, there is a possibility that the number of bases is overestimated compared to the actual situation.

[0008] The present invention has been made in view of the above problems, and an object thereof is to provide a technique capable of accurately identifying the peaks of a chromatogram obtained by electrophoresis of a biological sample.

[0009] The base sequencing device according to the present invention fits a point spread function to a chromatogram, calculates a sigma estimation function that defines the standard deviation of the point spread function for each measurement point of the chromatogram, and estimates the correct standard deviation by searching the region surrounding the sigma estimation function.

[0010] The base sequencing apparatus according to the present invention can accurately identify peaks in a chromatogram obtained by electrophoresis of a biological sample. Other problems, configurations, and advantages of the present invention will become clear from the following description of embodiments.

[0011] Figure 1 shows an example of a chromatogram obtained by electrophoresis of a biological sample.
Figure 2 shows the blocks contained within the chromatogram.
Figure 3 shows the types of blocks.
Figure 4 is a conceptual diagram of deconvolution.
Figure 5 is a diagram illustrating the evaluation criteria for sigma values.
Figure 6 is a diagram showing the challenges in conventional peak estimation.
Figure 7 shows an example of plotting measurement points and correct sigma values.
Figure 8 is a diagram showing the range in which the base sequence determination device according to Embodiment 1 varies the sigma value.
Figure 9 is a block diagram of the base sequence determination device 1 according to Embodiment 1.
Figure 10 is a diagram explaining the operation of the block extraction unit 11.
Figure 11 is a diagram explaining the operation of the fitting unit 12.
Figure 12 is a diagram explaining the operation of the sigma estimation function calculation unit 13.
Figure 13 is a diagram explaining the operation of the sigma evaluation unit 14.
Figure 14 shows an example of plotting the error between the composite waveform approximated by the Gaussian distribution function and the chromatogram for each sigma value.
Figure 15 is a flowchart outlining the operation of the base sequence determination device 1 in Embodiment 2.
Figure 16 is a diagram explaining the operation of the sigma evaluation unit 14 in Embodiment 2.
Figure 17 is a diagram explaining the operation of the sigma evaluation unit 14 in Embodiment 2.

[0012] <Problems with the Conventional Technology> First, the problems related to peak identification in chromatograms using conventional base sequencing devices will be described below. After that, embodiments of the present invention will be described.

[0013] Figure 1 shows an example of a chromatogram obtained by electrophoresis of a biological sample. The vertical axis of the figure represents the fluorescence signal intensity, and the horizontal axis represents the measurement scan point (measurement time). A base sequencer receives the chromatogram as input, identifies the peak positions of the signal intensity within the chromatogram, and determines the base sequence of the biological sample by assigning base labels to those peak positions.

[0014] Figure 2 shows the blocks contained within a chromatogram. A chromatogram is composed of one or more blocks. A block refers to a continuous region of the chromatogram from the rising edge to the falling edge of the measured signal. However, the measured signal may contain noise. Therefore, the start and end levels of a block do not necessarily have to be zero, and a series of signals that rise from some reference value and fall to another reference value can be considered a block.
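
The block definition above can be sketched in code. This is a minimal illustration, not the method of the specification: it assumes a single hypothetical `baseline` threshold for both the rising and the falling edge, whereas the text allows the start and end reference values to differ. The function name and the sample signal are invented for the example.

```python
def extract_blocks(signal, baseline=0.0):
    """Return (start, end) index pairs, end exclusive, of contiguous runs above baseline."""
    blocks = []
    start = None
    for i, value in enumerate(signal):
        if value > baseline and start is None:
            start = i                      # rising edge: a block begins
        elif value <= baseline and start is not None:
            blocks.append((start, i))      # falling edge: the block ends
            start = None
    if start is not None:                  # signal ended while still inside a block
        blocks.append((start, len(signal)))
    return blocks


# Synthetic signal with two separated blocks (illustrative values only).
signal = [0, 0, 1, 4, 2, 0, 0, 3, 5, 6, 2, 0]
blocks = extract_blocks(signal, baseline=0.5)
```

Using a nonzero threshold reflects the remark that noise keeps the block edges from reaching exactly zero.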

[0015] Figure 3 shows the types of blocks. There are two types of blocks: blocks composed of the waveform of a single base (Figure 3 left) and blocks composed of a composite waveform of multiple bases (Figure 3 right). Peak position estimation of the chromatogram is typically performed for each block. For example, if a block is regarded as a composite of Gaussian distribution functions, each centered on a true peak position, those Gaussian distribution functions are estimated by deconvolution. The center of each estimated Gaussian distribution function is taken to be a peak position of the block.

[0016] Figure 4 is a conceptual diagram of deconvolution. In deconvolution, the center of the Gaussian distribution function (i.e., the peak of the block) is estimated while varying the standard deviation (sigma) of the Gaussian distribution function. Generally, a smaller sigma results in a larger estimated number of peaks, while a larger sigma results in a smaller estimated number of peaks. To derive the correct number of peaks, it is necessary to accurately evaluate whether the sigma value is appropriate or not.

[0017] Figure 5 illustrates the criteria for evaluating sigma values. For example, one evaluation criterion for sigma values is the error between the composite waveform of the estimated Gaussian distribution functions and the chromatogram (block). A smaller error indicates a more appropriate sigma value. The number of peaks obtained with the smallest error can be considered the correct number of peaks.
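
The error criterion can be made concrete with a small sketch. It assumes a sum-of-squares error and a sigma shared by all peaks in the block; neither choice is stated in the text. The peak positions are held fixed here purely for illustration, whereas in the described procedure the peaks themselves are re-estimated for each sigma. All names and numbers are hypothetical.

```python
import math


def gaussian(t, amplitude, center, sigma):
    return amplitude * math.exp(-((t - center) ** 2) / (2.0 * sigma ** 2))


def composite(t, peaks, sigma):
    """Sum of Gaussians sharing one sigma; peaks is a list of (amplitude, center)."""
    return sum(gaussian(t, a, c, sigma) for a, c in peaks)


def squared_error(block, peaks, sigma):
    """Sum of squared differences between the block and the composite waveform."""
    return sum((y - composite(t, peaks, sigma)) ** 2 for t, y in enumerate(block))


# Synthetic two-base block generated with sigma = 3.0 (illustrative values only).
true_peaks = [(1.0, 10.0), (0.8, 18.0)]
block = [composite(t, true_peaks, 3.0) for t in range(30)]

# Evaluate the error for a few candidate sigma values and keep the smallest.
errors = {s: squared_error(block, true_peaks, s) for s in (2.0, 3.0, 4.0)}
best_sigma = min(errors, key=errors.get)
```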

[0018] Figure 6 illustrates the challenges of conventional peak estimation. In conventional techniques, the number of peaks is estimated by repeatedly evaluating the error while varying the sigma within a certain range (the distribution range in the figure). This is a search process that assumes the existence of a minimum error within the range in which the sigma is varied, as shown on the left of Figure 6.

[0019] However, in reality, as shown in Figure 6 (right), the correct sigma value may exist outside the sigma variation range. Furthermore, multiple local minima may exist. In such cases, estimating the correct sigma value becomes difficult. This is because the sigma variation range is fixed in advance. Therefore, it is desirable to be able to appropriately change the sigma variation range for each measurement point (e.g., for each block or for each Gaussian distribution function). This is not considered in conventional techniques.

[0020] Figure 7 shows an example of plotting measurement points and the correct sigma value. The sigma variation range is, for example, 2.8 to 4.0. In this example, the correct sigma value exists outside the sigma variation range near the intermediate measurement point. Therefore, it is difficult to estimate the correct sigma value near the intermediate measurement point. The problem described in Figure 6 manifests itself in this form. Patent Document 1 is also thought to have a similar problem.

[0021] <Embodiment 1> Figure 8 is a diagram showing the range in which the sigma value varies in the base sequencing device according to Embodiment 1 of the present invention. Similar to Figure 7, the horizontal axis of the figure represents the measurement point, and the vertical axis represents the sigma value. In Embodiment 1, the range of sigma variation is set for each measurement point. Specifically, a function (called the sigma estimation function in Embodiment 1) that defines the relationship between the measurement point and the range of sigma value variation is set, and the region surrounding the sigma value represented by the sigma estimation function is used as the range of sigma value variation. The curve in Figure 8 is an example of the sigma estimation function, and the region encompassing the curve is the range of sigma value variation. The peak estimation procedure using the sigma estimation function will be described below.

[0022] Figure 9 is a block diagram of the base sequence determination device 1 according to Embodiment 1. The base sequence determination device 1 is a device that determines the base sequence of a biological sample using a chromatogram obtained by electrophoresis of the biological sample. The base sequence determination device 1 comprises a block extraction unit 11, a fitting unit 12, a sigma estimation function calculation unit 13, a sigma evaluation unit 14, and a sequence determination unit 15. The operation of each unit will be described later.

[0023] Figure 10 is a diagram illustrating the operation of the block extraction unit 11. The block extraction unit 11 extracts blocks from the chromatogram that consist of only one base. The upper part of Figure 10 shows an example of a chromatogram before extraction, and the lower part of Figure 10 shows the result of block extraction. As for the extraction conditions, for example, if a block has only one maximum value, that block can be considered to consist of only one base. Other appropriate extraction conditions may also be used.
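
The example extraction condition (a block with exactly one maximum) might be checked as follows. This is a sketch under the assumption that "maximum" means an interior local maximum of the sampled signal; real data would likely need smoothing first, which is omitted. Function names and sample values are hypothetical.

```python
def count_local_maxima(block):
    """Count interior samples that exceed the left neighbour and are not below the right one."""
    count = 0
    for i in range(1, len(block) - 1):
        if block[i - 1] < block[i] >= block[i + 1]:
            count += 1
    return count


def is_single_base_block(block):
    """Treat a block with exactly one local maximum as a single-base block."""
    return count_local_maxima(block) == 1


single = [0.1, 0.5, 1.0, 0.6, 0.2]   # one maximum
double = [0.1, 0.8, 0.4, 0.9, 0.2]   # two maxima
```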

[0024] Empirically, the sigma value generally correlates with the measurement point. For a waveform composed of multiple peaks, however, the number of peaks and the optimal sigma must be estimated together, which makes estimation considerably more difficult. For a block with only one peak, by contrast, it is relatively easy to estimate the sigma corresponding to the measurement point. Therefore, Embodiment 1 first estimates the sigma of single-peak blocks and carries out the following steps starting from those estimates.

[0025] Figure 11 illustrates the operation of the fitting unit 12. The fitting unit 12 approximates each block extracted by the block extraction unit 11 with a single Gaussian distribution function, thereby estimating the standard deviation of the Gaussian distribution function such that the error between the block and the Gaussian distribution function is minimized.
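
A simplified version of this single-Gaussian fit is sketched below. It is not the fitting procedure of the specification: as an assumption, the amplitude and center are taken directly from the largest sample and only sigma is searched over a hypothetical grid, whereas a practical implementation would more likely use nonlinear least squares over all three parameters.

```python
import math


def fit_single_gaussian_sigma(block, sigma_grid):
    """Return the sigma on sigma_grid that minimizes the squared error to the block.

    Amplitude and center are fixed to the largest sample (a simplifying assumption).
    """
    center = max(range(len(block)), key=lambda i: block[i])
    amplitude = block[center]

    def sse(sigma):
        return sum(
            (y - amplitude * math.exp(-((t - center) ** 2) / (2.0 * sigma ** 2))) ** 2
            for t, y in enumerate(block)
        )

    return min(sigma_grid, key=sse)


# Synthetic single-base block with sigma = 3.2 (illustrative values only).
block = [2.0 * math.exp(-((t - 12) ** 2) / (2.0 * 3.2 ** 2)) for t in range(25)]
grid = [2.0 + 0.1 * k for k in range(31)]  # candidate sigma values from 2.0 to 5.0
sigma_hat = fit_single_gaussian_sigma(block, grid)
```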

[0026] Figure 12 illustrates the operation of the sigma estimation function calculation unit 13. The sigma estimation function calculation unit 13 calculates the sigma estimation function based on the sigma of each block estimated by the fitting unit 12. Specifically, it calculates the sigma estimation function by plotting each sigma estimated by the fitting unit 12 in a two-dimensional space with the horizontal axis representing measurement points and the vertical axis representing sigma, and then finding the function that best approximates the plot.

[0027] The form of the sigma estimation function is not limited, but as an example, it can be defined as a polynomial function, a rational function, or a sum thereof. In this case, the sigma estimation function calculation unit 13 calculates the sigma estimation function by finding the coefficients of each term that best approximate the plot.
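
For the polynomial case, finding the coefficients that best approximate the plot can be sketched as an ordinary least-squares fit. The example below assumes a quadratic and solves the normal equations directly; the degree, the function names, and the synthetic (measurement point, sigma) pairs are all assumptions. Measurement points are given on a small rescaled axis because normal equations become ill-conditioned for large raw scan indices.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial coefficients [c0, c1, ...] via the normal equations."""
    n = degree + 1
    # Augmented matrix (A^T A | A^T y) for the Vandermonde design matrix A.
    rows = [
        [sum(x ** (i + j) for x in xs) for j in range(n)]
        + [sum(y * x ** i for x, y in zip(xs, ys))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(n):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [rows[i][n] / rows[i][i] for i in range(n)]


def sigma_estimate(coeffs, point):
    """Evaluate the fitted sigma estimation function at a measurement point."""
    return sum(c * point ** k for k, c in enumerate(coeffs))


# Synthetic sigma values of single-base blocks on a rescaled measurement axis.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [4.0 - 0.8 * x + 0.2 * x * x for x in xs]
coeffs = fit_polynomial(xs, ys, degree=2)
```

Once fitted, `sigma_estimate(coeffs, point)` gives an expected sigma for any measurement point, including points that lie inside multi-base blocks.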

[0028] Figure 13 illustrates the operation of the sigma evaluation unit 14. The sigma evaluation unit 14 sets the region surrounding the sigma value represented by the sigma estimation function as the sigma value variation range. In the lower part of Figure 13, the region enclosing the function is the sigma value variation range. For example, the variation range can be defined as a predetermined range above and below the value of the sigma estimation function. The width of the variation range may be the same for each measurement point, or it may differ for each measurement point, as shown in Figure 8.
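
One way to turn the variation range into concrete search candidates is shown below. This is a sketch assuming a symmetric band of hypothetical half-width `half_width` around the value of the sigma estimation function, sampled at evenly spaced points; the text does not specify the sampling.

```python
def sigma_search_range(sigma_center, half_width, steps):
    """Evenly spaced sigma candidates in [center - half_width, center + half_width]."""
    low = max(sigma_center - half_width, 1e-6)  # keep sigma strictly positive
    high = sigma_center + half_width
    if steps == 1:
        return [sigma_center]
    return [low + (high - low) * k / (steps - 1) for k in range(steps)]


# sigma_center would come from the sigma estimation function at this measurement point.
candidates = sigma_search_range(sigma_center=3.4, half_width=0.4, steps=5)
```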

[0029] The sigma evaluation unit 14 estimates the correct number of peaks and peak positions for each block while changing the sigma value within a set range of variation. This allows each block to be accurately fitted using a Gaussian distribution function. Based on the fitting results, the sigma evaluation unit 14 identifies each peak in the chromatogram.

[0030] The sequence determination unit 15 determines the base sequence of the biological sample by assigning base labels to each peak identified by the sigma evaluation unit 14. This procedure is well known, so its explanation is omitted here.

[0031] <Embodiment 1: Summary> The base sequence determination device 1 according to Embodiment 1 estimates the standard deviation of each Gaussian distribution function by fitting a Gaussian distribution function to each block extracted from a chromatogram, and calculates a sigma estimation function based on the estimated standard deviations. The correct sigma value of the Gaussian distribution function is estimated by varying the sigma within the region surrounding the sigma estimation function. This makes it possible to vary the sigma value within an appropriate range for each measurement point, thereby more reliably estimating the correct sigma value.

[0032] <Embodiment 2> Figure 14 shows an example of plotting the error between a composite waveform approximated by a Gaussian distribution function and a chromatogram for each sigma value. The upper part of Figure 14 is the same as the left part of Figure 6. As explained in Figure 6, the conventional technique assumes that there is a sigma value at which the error is minimized. However, the relationship between the actual error and the sigma value is such that the smaller the sigma, the smaller the error tends to be.

[0033] A small sigma corresponds to a narrow Gaussian distribution function. In other words, fitting with a small sigma means fitting with a larger number of narrower Gaussian distribution functions, which naturally reduces the error. Therefore, a search for a sigma value that yields a local minimum of the error will simply drift toward smaller sigma values, and a meaningful search cannot be performed. In other words, searching for the sigma value that minimizes the error has limitations as a method for finding the correct sigma value.

[0034] Therefore, in Embodiment 2 of the present invention, a new method for evaluating whether the sigma value is appropriate is proposed. Since the configuration of the base sequencing device 1 is the same as in Embodiment 1, the following will mainly describe the parts that differ from Embodiment 1.

[0035] Figure 15 is a flowchart illustrating the operation of the base sequence determination device 1 in Embodiment 2. As in Embodiment 1, the functional units operate in the following order: block extraction unit 11 (S1501), fitting unit 12 (S1502), sigma estimation function calculation unit 13 (S1503), sigma evaluation unit 14 (S1504), and sequence determination unit 15 (S1505). However, the processing by the sigma evaluation unit 14 (S1504) searches for an appropriate sigma by looping within the sigma variation range, as will be described later. The other steps are the same as in Embodiment 1.

[0036] Figures 16 and 17 illustrate the operation of the sigma evaluation unit 14 in Embodiment 2. The sigma evaluation unit 14 estimates the Gaussian distribution function by performing deconvolution on any block (any number of bases) of the chromatogram using sigma values within the variation range set by the method described in Embodiment 1. The sigma evaluation unit 14 detects the peak position of the block (i.e., the center of each Gaussian distribution function) from the waveform obtained by deconvolution. Without changing the number of detected peaks, the sigma evaluation unit 14 fits the peak intensity, sigma value, peak position, etc., so as to minimize the error between the composite waveform and the block. This fitting is called "CurveFit" in Figure 16 and is a separate process from the fitting performed by the fitting unit 12.

[0037] The sigma evaluation unit 14 calculates the error between the sigma estimated by CurveFit and the sigma estimated by the method using the sigma estimation function described in Embodiment 1. The sigma evaluation unit 14 searches for the number of peaks that minimizes this error. Specifically, it changes the number of peaks and refits the Gaussian distribution function to the block (CurveFit), and recalculates the error with the sigma obtained in Embodiment 1. The sigma evaluation unit 14 estimates the correct number of peaks for each block by repeating this process.
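
The selection step of this search can be sketched in isolation. The example assumes that CurveFit has already been run for several candidate peak counts and that the error is the absolute difference between the fitted sigma and the value of the sigma estimation function; the fitted values shown are hypothetical placeholders, not results from the specification.

```python
def select_peak_count(fitted_sigma_by_count, expected_sigma):
    """Return the peak count whose fitted sigma is closest to the expected sigma."""
    return min(
        fitted_sigma_by_count,
        key=lambda n: abs(fitted_sigma_by_count[n] - expected_sigma),
    )


# Hypothetical CurveFit results for one block: peak count -> fitted sigma.
fitted = {1: 6.1, 2: 3.5, 3: 2.2}
expected = 3.4  # value of the sigma estimation function at this block (assumed)
best = select_peak_count(fitted, expected)
```

Too few peaks forces each Gaussian to be wide and too many forces each to be narrow, so in this illustration only the matching count yields a sigma near the expected value.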

[0038] The lower part of Figure 17 plots the sigma values obtained by performing CurveFit using the correct number of peaks and the sigma values obtained by performing CurveFit using an incorrect number of peaks. × indicates the sigma value calculated with an incorrect number of peaks, and ● indicates the sigma value calculated with the correct number of peaks. The curve represents the sigma estimation function. When the correct number of peaks was used, values close to the sigma estimation function were obtained, whereas when an incorrect number of peaks was used, values deviating from the sigma estimation function were obtained. Therefore, it was found that the smaller the error between the sigma estimated by CurveFit and the sigma estimated by the method using the sigma estimation function described in Embodiment 1, the more accurately the correct number of peaks can be estimated.

[0039] <Embodiment 2: Summary> The base sequence determination device 1 according to Embodiment 2 estimates the number of peaks for each block so as to minimize the error between the sigma given by the sigma estimation function described in Embodiment 1 and the sigma obtained by fitting the Gaussian distribution function. Instead of using the evaluation method explained with Figure 14, which minimizes the error between the fitting result and the chromatogram, it first estimates the correct number of peaks by comparing the sigma values obtained by these two independent procedures, and then performs fitting again. In other words, the number of peaks is determined first and then fitting is performed. This suppresses the tendency of the search to drift toward underestimating the sigma, as shown in Figure 14, and thus allows the correct sigma to be estimated.

[0040] <Regarding Variations of the Invention> The present invention is not limited to the embodiments described above, and includes various variations. For example, the embodiments described above are described in detail to make the present invention easier to understand, and are not necessarily limited to those having all the described configurations. Furthermore, it is possible to replace a part of the configuration of one embodiment with the configuration of another embodiment, and it is also possible to add the configuration of another embodiment to the configuration of one embodiment. In addition, it is possible to add, delete, or replace parts of the configuration of each embodiment with other configurations.

[0041] In the embodiments described above, fitting the fluorescence signal waveform using a Gaussian distribution function was explained. The function used for fitting is not limited to the Gaussian distribution function; any other point spread function may be used.

[0042] In the embodiments described above, the variation range for sigma was set to be the peripheral region above and below the function value of the sigma estimation function. The width of this peripheral region does not have to be the same for each measurement point. For example, the width of the variation range may be set to a value obtained by multiplying the function value by a predetermined ratio, and the variation range may be set both above and below the function value (or only above or only below it). Empirically, it is easy to correctly estimate sigma at the middle of the electrophoresis process (when sigma is small), whereas it is difficult to correctly estimate sigma at both ends of the electrophoresis process (when sigma is large). Therefore, the variation range may be set to be wider at both ends of the electrophoresis process.
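
The ratio-based width can be illustrated with a short sketch. The ratio of 0.1 and the sigma values are assumptions chosen for the example. Because the half-width scales with the function value, the range widens on its own wherever the expected sigma is large, which matches the stated preference for wider ranges at both ends of the electrophoresis process.

```python
def proportional_range(sigma_center, ratio):
    """Variation range whose half-width is ratio * sigma_center."""
    half_width = ratio * sigma_center
    return (sigma_center - half_width, sigma_center + half_width)


# Illustrative values: smaller expected sigma mid-run, larger at the ends.
low_mid, high_mid = proportional_range(3.0, 0.1)
low_end, high_end = proportional_range(5.0, 0.1)
```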

[0043] In the embodiments described above, each functional unit of the base sequence determination device 1 can be configured by hardware such as circuit devices that implement these operations, or by a computing device such as a CPU (Central Processing Unit) executing software that implements these operations.

[0044]
1: Base sequence determination device
11: Block extraction unit
12: Fitting unit
13: Sigma estimation function calculation unit
14: Sigma evaluation unit
15: Sequence determination unit

Claims

1. A base sequence determination apparatus for determining the base sequence of a biological sample using a chromatogram obtained by electrophoresis of the biological sample, comprising: a fitting unit for fitting a first point spread function to the chromatogram; a sigma estimation function calculation unit for calculating a sigma estimation function that defines a first standard deviation of the first point spread function for each measurement point of the chromatogram; and a sigma evaluation unit for estimating the standard deviation of the first point spread function when the chromatogram is best approximated by the first point spread function by searching the surrounding region of the first standard deviation for each measurement point represented by the sigma estimation function.

2. The base sequence determination apparatus according to claim 1, characterized in that the sigma evaluation unit sets a different surrounding region for each of the measurement points.

3. The base sequence determination apparatus according to claim 1, further comprising a block extraction unit for extracting blocks consisting of a single base peak from the chromatogram, wherein the fitting unit fits the first point spread function to the block.

4. The base sequence determination apparatus according to claim 3, characterized in that the fitting unit performs the fitting using a single first point spread function for each block.

5. The base sequence determination apparatus according to claim 1, characterized in that the sigma estimation function calculation unit calculates, as the sigma estimation function, a function that approximates the relationship between the measurement point and the first standard deviation obtained by fitting for each measurement point.

6. The base sequence determination apparatus according to claim 1, characterized in that the sigma evaluation unit defines the surrounding region by performing at least one of addition or subtraction on the first standard deviation for each measurement point in the sigma estimation function.

7. The base sequence determination apparatus according to claim 1, characterized in that the sigma evaluation unit defines the surrounding region by adding to or subtracting from the first standard deviation for each measurement point a value obtained by multiplying that first standard deviation in the sigma estimation function by a predetermined percentage.

8. The base sequence determination apparatus according to claim 1, characterized in that the sigma evaluation unit fits a second point spread function to the chromatogram, and the sigma evaluation unit repeatedly fits the second point spread function to the chromatogram again so as to reduce the error between the first standard deviation estimated using the sigma estimation function and the second standard deviation obtained by the fitting performed by the sigma evaluation unit, thereby estimating the second standard deviation when the chromatogram is best approximated by the second point spread function.

9. The base sequence determination apparatus according to claim 8, characterized in that the sigma evaluation unit performs deconvolution on the chromatogram using the standard deviation within the surrounding region, the sigma evaluation unit identifies the peaks of the signal waveform obtained by the deconvolution, and the sigma evaluation unit fits the second point spread function to the chromatogram without changing the number of peaks.

10. The base sequence determination apparatus according to claim 9, characterized in that the sigma evaluation unit estimates the number of peaks when the chromatogram is best approximated by the second point spread function by repeatedly fitting the second point spread function to the chromatogram while changing the number of peaks.

11. The base sequence determination apparatus according to claim 1, further comprising a sequence estimation unit that estimates the base sequence of the biological sample corresponding to the peak position in the chromatogram based on the estimation results by the sigma evaluation unit.