Information processing device
The information processing apparatus addresses the challenge of detecting structural variations in genomes by using a polynomial correction and scoring function to align short labeled intervals, improving the accuracy of genome mapping.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- HITACHI HIGH TECH CORP
- Filing Date
- 2025-07-04
- Publication Date
- 2026-06-18
AI Technical Summary
Current DNA sequencing technologies struggle to accurately detect structural variations in genomes due to measurement errors and limitations in reading large-scale genomic changes, particularly when dealing with short labeled intervals.
An information processing apparatus that uses a polynomial correction function to adjust labeled intervals and a scoring function to minimize alignment errors, enabling accurate alignment and detection of structural variations by integrating multiple measurement data sets.
The apparatus effectively corrects measurement errors and aligns short labeled intervals, allowing for precise identification of structural variations and differences between genomes, enhancing the accuracy of genome mapping.
Abstract
Description
Information processing device 【0001】 This invention relates to a technique for identifying the location of a subsequence of a nucleic acid sequence. 【0002】 An individual genome contains numerous differences, or mutations, from a standard reference genome. The genomes of cancerous cells are expected to contain even more mutations, including those that are pathogenic. Among these mutations, structural variants (SVs), which typically involve large changes in the base sequence of several thousand bases or more, are less numerous than smaller mutations but play a particularly important role in disease. However, current major DNA sequencing technologies are limited to reading only a few hundred bases at a time, making it difficult to capture such large-scale genomic changes. Therefore, low-cost technologies are needed to analyze broader regions of the genome. 【0003】 A technique called genome mapping can be used for these purposes. In genome mapping, a specific short base sequence of about seven bases (hereinafter referred to as a marker) is labeled with fluorescence or other means on the genome. Then, DNA obtained from a subject is amplified and cut to generate many DNA fragments consisting of hundreds of thousands of bases. On each of these DNA fragments, the approximate base position from the beginning of the DNA where the marker appears, i.e., the marker position, is measured. Based on the obtained marker positions, the location on the genome from which each DNA fragment originates can be identified from the pattern of marker spacing. Optical genome mapping, which uses fluorescence labeling, is commonly used, but other methods such as electrical detection also exist. In genome mapping, the observed marker positions for each DNA fragment can be represented as a numerical sequence. This numerical sequence will be referred to as measurement data below. 【0004】As a document that discloses a technology similar to such a technology, there is Patent Document 1. 
This publication describes, "In a developed or extended chromosomal DNA, a nucleic acid is hybridized to one type of repetitive base sequence, and by using a labeling substance introduced into the hybridized nucleic acid, the mutual distance on the chromosomal DNA is measured for a plurality of sets of the repetitive base sequences of the chromosomal DNA, and then based on the characteristics of the measured distance, determining the region or position on the chromosome of the set and the repetitive base sequence included in the set. A method for mapping positions on chromosomal DNA." (Claim 1). 【0005】 The process of comparing each measurement data obtained by genome mapping with the labeled positions obtained from the reference genome sequence, in order to clarify the common parts and the different parts, is called alignment. If no genomic mutations or measurement errors have occurred, each labeled position indicated by the measurement data corresponds to one of the labeled positions on the reference genome. When a structural variation occurs, however, the corresponding positions become discontinuous between the labels on the measurement data and the labels on the reference genome. By detecting such abnormalities in the labeled positions, structural variations can be detected. 【0006】 As a technology related to the present application, the inventor of the present invention provided an alignment method using the ratio of adjacent label intervals, which is a quantity invariant to the apparent expansion and contraction of molecules, in Japanese Patent Application No. 2023-051508 (hereinafter referred to as Prior Application 1). Furthermore, the inventor of the present invention provided an alignment method that handles short label intervals, which are difficult to handle with the technology of Prior Application 1, in PCT/JP2023/043763 (hereinafter referred to as Prior Application 2). 【0007】 Japanese Patent Laid-Open No. 2009-022274
【0008】 In order to detect structural variations, alignment must be performed accurately, and for that purpose the errors contained in the measurement data must be addressed. One such error is that, in the measurement data, the overall length of each DNA fragment may appear to increase or decrease. This is because the movement speed of the molecules during measurement is not uniform. Although Patent Document 1 discloses labeling repetitive sequences on the genome and identifying positions in the genome, it does not disclose a method for collating the labeled positions between the measured labeled intervals and the reference genome. 【0009】 Prior Application 1 attempts to perform alignment appropriately even in the presence of such errors by using the ratio of labeled intervals. However, since the label positions measured in genome mapping contain an error of several hundred bases, when a labeled interval is as short as several hundred bases, the value of the ratio may change significantly due to the error and alignment may become impossible. Prior Application 2 therefore provided means to enable alignment even when a short labeled interval is observed. However, it provides no means for detecting structural variations when a large number of measurement data, each derived from a single molecule, are obtained. 【0010】 The present invention has been made in view of the above problems, and aims to align a large number of measured molecules to a reference genome and detect structural variations. 
【0011】 When the size of the labeled interval of the reference nucleic acid sequence or the labeled interval of the target nucleic acid sequence is below a lower limit value, the information processing apparatus according to the present invention uses a correction function constituted by a polynomial having the interval as an argument to correct the interval so that it becomes the lower limit value or a value larger than the lower limit value. 【0012】 According to the information processing apparatus of the present invention, even when a portion including a short labeled interval is included, it is possible to perform alignment capable of dealing with the apparent elongation and contraction of the nucleic acid molecule and identify the difference from the reference genome. Other problems, configurations, and effects than those described above will be clarified by the description of the following embodiments. 【0013】This is a block diagram showing an example configuration of a genome labeling position alignment device according to Embodiment 1. This is a diagram showing an example of the data configuration of measurement data 140. This is a flowchart showing an example of processing by the genome labeling position alignment device. This is a diagram illustrating a method for correcting the labeling interval performed in step S0301. This shows an example of the correction function in this application. This is a flowchart showing an example of the alignment processing performed in step S0304. This is a flowchart showing an example of the labeling alignment processing performed in steps S0405, S0407, and S0410. This shows an example of normalization in S0502. This shows an example of the score function in S0503. This is a flowchart showing an example of the processing of the alignment integration unit 124 performed in step S0312. This is an explanatory diagram showing an example of when the processing in Figure 8 is performed. 
This is an example of a user interface provided by the genome labeling position alignment device according to Embodiment 2. 【0014】 The genome labeling position alignment device according to an embodiment of the present invention will be described below. In the following figures, common components are denoted by the same reference numerals, and redundant explanations are omitted. 【0015】 When calculating the ratio of label spacing to accommodate the apparent elongation or contraction of the overall length of a DNA fragment, if the label spacing is short, the value of the ratio can be greatly disturbed due to measurement errors of several hundred base pairs at the label position. Furthermore, in order to detect structural variations based on genome mapping results, it is necessary to integrate the alignment results of measurement data from numerous molecules and extract the differences from the reference genome that appear in common across multiple molecules. The genome labeling alignment device of this embodiment realizes alignment processing that can address these problems. 【0016】<Embodiment 1> Figure 1 is a block diagram showing an example configuration of a genome labeling position alignment device according to Embodiment 1 of the present invention. Figure 1 is identical to Figure 1 of Prior Application 2 except for some of the processing performed by the CPU, but will be described again in this specification. The genome labeling position alignment device (information processing device) is composed of a computer 100. The computer 100 includes, for example, a CPU (Central Processing Unit) 110, memory 111, auxiliary storage device 112, and interfaces 113 to 115. The hardware included in the computer 100 is electrically connected, for example, via internal communication lines such as a bus. 【0017】 The CPU 110 reads programs and data stored in memory 111 and executes programs stored in memory 111. The CPU 110 includes a processor. 
The CPU 110 includes, for example, an index building unit 121, an index search unit 122, a label alignment unit 123, and an alignment integration unit 124, all of which are functional units. The computer 100 functions as a genome label position alignment device when the CPU 110 performs processing. 【0018】 Memory 111 temporarily stores programs executed by the CPU 110 and data used during program execution. Memory 111 includes non-volatile memory elements such as ROM (Read Only Memory) and volatile memory elements such as RAM (Random Access Memory). ROM stores immutable programs (e.g., BIOS (Basic Input / Output System)). RAM is a high-speed, volatile memory element such as DRAM (Dynamic Random Access Memory) and temporarily stores programs executed by the CPU 110 and data used during program execution. 【0019】 Memory 111 stores programs that implement, for example, the index construction unit 121, the index search unit 122, the label alignment unit 123, and the alignment integration unit 124. Memory 111 also stores the reference genome labeling position 130 and the measurement data 140. 【0020】 For example, the CPU 110 functions as an index construction unit 121 by operating according to an index construction program loaded into memory 111, as an index search unit 122 by operating according to an index search program loaded into memory 111, as an indicator alignment unit 123 by operating according to an indicator alignment program loaded into memory 111, and as an alignment integration unit 124 by operating according to an alignment integration program loaded into memory 111. 【0021】 The auxiliary storage device 112 non-volatilely stores the program executed by the CPU 110 and the data used during program execution. That is, the program is read from the auxiliary storage device 112, loaded into memory 111, and executed by the CPU 110. 
【0022】 The auxiliary storage device 112 is a high-capacity, non-volatile storage device such as an HDD (Hard Disk Drive) or SSD (Solid State Drive). The auxiliary storage device 112 stores programs that implement the functions of the index construction unit 121, the index search unit 122, the label alignment unit 123, and the alignment integration unit 124. The auxiliary storage device 112 also stores the reference genome labeling position 130 and the measurement data 140. Here, the labeling position refers to the position in the sequence where, for example, a predefined short subsequence of 10 bases or less appears. 【0023】Interfaces 113 to 115 are devices that mediate the transmission and reception of signals and perform protocol conversion, and are connected to external devices. Interface 113 is an I / O interface connected to the input / output device 102 via a wired or wireless line. The input / output device 102 includes input devices such as a keyboard, mouse, touch panel, numeric keypad, scanner, microphone, and sensors, as well as output devices such as a display device, printer, and speaker. Interface 113 acquires input information from the operator received by the input / output device 102. Interface 113 also outputs the program execution results to the input / output device 102 in a format that can be viewed by the operator. 【0024】 Interface 115 is a network interface connected to the external storage device 101 via network 105. Interface 115 controls communication with other devices according to a predetermined protocol. 【0025】 The external storage device 101 is a non-temporary storage device that stores data handled by the computer 100. The external storage device 101 includes, for example, storage devices such as HDDs and SSDs. The external storage device 101 can store reference genome labeling locations 130 and measurement data 140. 
【0026】 Data transmission and reception between the external storage device 101 and the computer 100 are performed via the network 105. The network 105 includes, for example, a LAN (Local Area Network) and the Internet. However, the type of network 105 is not limited to those described above. The network 105 may be wired or wireless. 【0027】 Interface 114 is connected to a drive device that performs reading and writing to the removable media 103. Interface 114 includes, for example, a serial interface such as USB (Universal Serial Bus). 【0028】The removable media 103 is a non-temporary storage medium for storing data handled by the computer 100. The removable media 103 includes optical discs such as CDs and DVDs, magnetic discs, and semiconductor memory. The removable media 103 can store reference genome labeling locations 130 and measurement data 140. 【0029】 Some or all of the program executed by the CPU 110 may be provided to the computer 100 via the interface 114 from the removable media 103, which is a non-temporary storage medium, or via the network 105 from the external storage device 101, which is a non-temporary storage device, or from an external computer equipped with the external storage device 101, and stored in the non-volatile auxiliary storage device 112, which is a non-temporary storage medium. 【0030】 In Figure 1, the computer 100, which constitutes the genome labeling position alignment apparatus, is connected to an external storage device 101 and removable media 103. However, these external devices can be omitted if they are not needed. Furthermore, an input / output device 102 equipped with an external storage device 101 may be connected to the computer 100 via a network 105. Alternatively, the computer 100 may have a built-in device with input / output functionality instead of being connected to the input / output device 102. 
【0031】 Hereinafter, in this specification, a k-tuple is a combination of k integer values (where k is a predefined parameter). The index construction unit 121 constructs an index based on k-tuples as described in Prior Applications 1 and 2. The index construction unit 121 constructs an index using k-tuples based on the ratio of the marking intervals, similar to Prior Applications 1 and 2, rather than an index using k-tuples that represent the interval between marking positions (hereinafter also simply called the marking interval). Furthermore, the index constructed by the index construction unit 121 is designed to handle cases where some marks are missing, similar to the Prior Applications, in order to address the possibility of missed markings. 【0032】As described in Prior Applications 1 and 2, the index search unit 122 uses an index based on k-tuples constructed by the index construction unit 121 to identify the location on the reference genome that is most likely to correspond to each measurement data. During the search, the index search unit 122 performs processing to address false detections of labels. 【0033】 The label alignment unit 123 can find correspondences of surrounding labels in addition to the correspondences of labels found by the index search unit 122 for the label locations of the measurement data and the reference genome. The index search unit 122 can identify locations where k-tuples match, but it cannot associate other labels. Labels that the index search unit 122 could not associate are associated using a method that extends a known dynamic programming alignment method (Nagarajan N et al. Bioinformatics. 24(10):1229-35, 2008) to alignment using ratios (Figure 4, described later). 
【0034】 The alignment integration unit 124 calculates the alignment of measurement data and the reference genome for multiple molecules, then compares and integrates these alignments to identify labels common to the reference genome, labels present in the reference genome but only in some of the measurement data, labels only present in the measurement data, and labels only present in the reference genome, thereby clarifying the differences between the subject's genome and the reference genome. 【0035】 Before the genome labeling position alignment process begins, the computer 100 has the reference genome labeling position 130 and measurement data 140 input and stored in it. The CPU 110 may, for example, read the reference genome labeling position 130 and measurement data 140 when the computer 100 is started up or when processing is executed, and load them into memory 111. 【0036】The reference genome marker locations 130 and measurement data 140 may be stored in all of the auxiliary storage device 112, the external storage device 101, and the removable media 103, or in only some of them. When the computer 100 is stopped or when there is insufficient free space in the auxiliary storage device 112, this data may be moved or copied to the external storage device 101 or the removable media 103. Therefore, it is desirable that the reference genome marker locations 130 stored in different storage devices all contain the same information. The same applies to the measurement data 140. 【0037】 Similar to Prior Applications 1 and 2, this application improves alignment accuracy by mitigating changes in the ratio due to measurement errors, even when the marking interval is short. 【0038】 The reference genome marker position 130 includes numerical data representing the position of a marker present on each of the multiple reference genomes. The measurement data 140 includes data obtained by measuring the marker position on each of the numerous DNA fragments. 
【0039】 Figure 2 shows an example of the data structure of measurement data 140, and is a reproduction of Figure 2 from Prior Application 2. Measurement data 140 shows the ID that identifies each DNA fragment (an example of the target nucleic acid sequence), the molecular length (length of the base sequence) of each DNA fragment, and the position of the label (an example of a subsequence) on each DNA fragment. Blank spaces in measurement data 140 indicate that the label was not observed. 【0040】 For example, the molecular length of a DNA fragment with ID "2" is 44,951 bases, and four markers are measured in this DNA fragment. The positions of these four markers are, respectively, the 10,844th base, 19,749th base, 23,353rd base, and 35,735th base from the beginning of the DNA fragment. The position of the markers in the DNA fragment indicated by the measurement data 140 may, for example, indicate the position at the beginning or the position at the end of the marker. 【0041】In this embodiment, an example of a short base sequence to be labeled on a DNA fragment is GCTCTTC, which is recognized by an enzyme called Nt. BspQI. As described above, the measurement data 140 shows the labeling positions of GCTCTTC on each DNA fragment in ascending order. In other words, the measurement data 140 shows information in which the labeling positions of each DNA fragment have been converted into an ascending numerical sequence (an example of a second numerical sequence). 【0042】 In this application, similar to Prior Applications 1 and 2, an index construction unit 121 is used to construct k-tuples for any label position on the reference genome based on the ratio of label intervals, and an index is constructed showing the correspondence between the constructed k-tuples and the label positions on the reference genome. 
The alignment process after referencing the k-tuple index is the same as in the Prior Applications, and it is possible to associate some labels with the measurement data and the reference genome. 【0043】 One method for constructing k-tuples is, for example, as described in Prior Application 2, to use a set of k ratio values d(j, j+1) / d(i, i+k) (j = i, i+1, ..., i+k-1) based on labels i to i+k to form a k-tuple. Here, d(i, j) is the label interval between label i and label j. In this specification, for simplicity, only the case where the DNA reading method is limited to one direction (5' end → 3' end) will be described below. Furthermore, this specification assumes that there is only one chromosome in the genome and that any position in the genome can be represented by a single integer coordinate value. In actual use of the present invention, it is necessary to be able to handle multiple chromosomes, including the reverse direction (3' end → 5' end), but it is easy for those skilled in the art to extend the description in this application. For example, to handle the reverse direction, one can add measurement data with the label positions calculated so that the label intervals are in reverse order. To handle multiple chromosomes, one can concatenate all chromosomes into one and prohibit alignment of measurement data across two chromosomes at the concatenation point. 【0044】 Figure 3A is a flowchart showing an example of processing using a genome labeling position alignment device. The outline of the processing carried out in this embodiment will be described with reference to Figure 3A. 【0045】S0301: Since there is an error of several hundred base pairs in the label positions measured by the genome mapping device, the ratio of the measured label interval to the actual label interval may differ significantly when the label interval is short. Therefore, the label interval of the reference genome is corrected using a correction function, as described later. 
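The k-tuple construction of paragraph 【0043】 can be sketched as follows. This is a minimal illustration, not the implementation of the Prior Applications: the function names `ratio_ktuple` and `build_index`, the dictionary layout of the index, and the rounding of each ratio to an integer after scaling by a total r are assumptions made for the example. Handling of missing labels and of the reverse reading direction is omitted. The input positions are those of the DNA fragment with ID "2" in paragraph 【0040】.

```python
from collections import defaultdict


def ratio_ktuple(positions, i, k, r=100):
    """Return the k-tuple for labels i..i+k.

    Each entry is d(j, j+1) / d(i, i+k), scaled by r and rounded to an
    integer (the scaling and rounding are assumptions of this sketch).
    """
    span = positions[i + k] - positions[i]
    return tuple(
        round(r * (positions[j + 1] - positions[j]) / span)
        for j in range(i, i + k)
    )


def build_index(positions, k, r=100):
    """Map each k-tuple to the start labels where it occurs (hypothetical layout)."""
    index = defaultdict(list)
    for i in range(len(positions) - k):
        index[ratio_ktuple(positions, i, k, r)].append(i)
    return index


# Label positions of fragment ID "2" from the measurement data example.
fragment = [10844, 19749, 23353, 35735]
print(ratio_ktuple(fragment, 0, 3))
print(dict(build_index(fragment, 2)))
```

Because every entry is a ratio to the span d(i, i+k), multiplying all positions by a common factor leaves the k-tuple unchanged, which is what makes it robust to apparent stretching of the molecule.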
【0046】 S0302: The index construction unit 121 constructs an index based on the labeling positions of the reference genome. The processing performed by the index construction unit 121 is the same as in Prior Applications 1 and 2, except that the labeling interval is corrected by a correction function. 【0047】 S0303: In the following steps, all measurement data Mi (i = 0, 1, ..., n-1, where n is the number of measurement data) are compared one by one with the reference genome to establish a correspondence between the label positions of the measurement data and the label positions of the reference genome. First, i is initialized to 0. 【0048】 S0304: The label alignment unit 123 aligns the measurement data Mi to the reference genome. The processing details will be described later with reference to Figure 4. 【0049】 S0305: If alignment fails in S0304, skip S0306. 【0050】 S0306: Add the measurement data Mi and its alignment Ai against the reference genome to set S. Alignment Ai is a list of pairs of label numbers from the measurement data and label numbers from the reference genome. 【0051】 S0307: Increment the subscript i of the measurement data by 1. 【0052】 S0308: If i < n, return to step S0304. 【0053】 S0309: The alignments of measurement data aligned to various intervals [R1, R2] on the genome are then integrated. There are various ways to select the intervals [R1, R2]. As an example, given a parameter $R_{size}$, choosing $R1 = R_{size} \cdot j / 2$ and $R2 = R_{size} / 2 \cdot (j + 2)$ (j = 0, 1, ...) selects intervals of size $R_{size}$ over the entire genome, with the position shifted by $R_{size} / 2$ each time. 【0054】 S0310: From the alignment set S, extract those alignments in which at least one label on the genome falls within the interval [R1, R2], and construct a subset S'. 【0055】 S0311: If S' constructed in S0310 is an empty set, skip S0312. 
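The interval selection of S0309 and the subset extraction of S0310 can be sketched as below. This is only an illustration under stated assumptions: the names `windows` and `subset_for_window`, the stopping rule based on a `genome_length` argument, and the representation of S as a list of (measurement ID, alignment) entries are hypothetical. Each alignment is taken to be a list of (measurement label number, reference label number) pairs, following S0306.

```python
def windows(genome_length, r_size):
    """Yield intervals [R1, R2] with R1 = r_size*j/2 and R2 = r_size/2*(j+2).

    Consecutive intervals have size r_size and overlap by r_size/2.
    Stopping once R1 reaches genome_length is an assumption of this sketch.
    """
    j = 0
    while r_size * j / 2 < genome_length:
        yield (r_size * j / 2, r_size / 2 * (j + 2))
        j += 1


def subset_for_window(S, ref_positions, r1, r2):
    """S0310: keep alignments with at least one reference label inside [r1, r2]."""
    return [
        (m_id, A)
        for m_id, A in S
        if any(r1 <= ref_positions[p] <= r2 for _, p in A)
    ]


ref_positions = [500, 1800, 3900, 5200, 9100]   # illustrative reference labels
S = [("M0", [(0, 0), (1, 1)]), ("M1", [(0, 3), (1, 4)])]

for r1, r2 in windows(10000, 4000):
    s_prime = subset_for_window(S, ref_positions, r1, r2)
    if s_prime:                                  # S0311: skip empty subsets
        print((r1, r2), [m_id for m_id, _ in s_prime])
```

The half-size shift means every genomic position is covered by two intervals, so a variation near an interval boundary still lies well inside the neighbouring interval.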
【0056】 S0312: The alignment integration unit 124 integrates the alignments included in set S'. The processing of the alignment integration unit 124 will be described later with reference to Figure 8. 【0057】 S0313: If there are still intervals that require processing, return to S0309. 【0058】 Figure 3B illustrates the method for correcting the label interval, which is performed in step S0301. As mentioned above, in genome mapping there is a measurement error of several hundred bases in the label position. As an example, Figure 3B shows molecules with labels at 500-base intervals and molecules with labels at 5000-base intervals, which is 10 times that spacing. Consider the case where the second and third labels are observed 200 bases upstream of their actual positions, and the fifth label is observed 200 bases downstream. If a k-tuple is constructed by converting the ratios of its label intervals into integers whose sum is r = 100, the result for the 500-base intervals is 12:20:28:28:12, which is far from the original 20:20:20:20:20. In contrast, the result for the 5000-base intervals is 19:20:21:21:19, which is close to the original k-tuple. Thus, when the label interval is short, the effect of measurement error becomes strong. Prior Application 2 therefore provided a correction function to correct the label intervals. 【0059】 Figure 3C shows an example of the correction function in the present application. The present application provides a correction function that enables more flexible and natural correction than that of Prior Application 2. As shown in Figure 3C, let parameter A be the maximum value of the label intervals to be corrected, and parameter B be the lower limit of the label intervals after correction. When the corrected label interval for the label interval x is c(x), c(x) should satisfy the following conditions. 
【0060】 (1) So that the label intervals before and after correction match at the end of the correction range, c(A) = A. (2) So that the correction function c(x) connects smoothly to the uncorrected label interval x, c'(A) = 1 (the slope of c at A is 1, where c'(x) is the derivative of c(x)). (3) So that the order of magnitude among label intervals is preserved by the correction, c'(x) ≥ 0 (0 ≤ x < A). (4) So that B is the lower limit, c(x) ≥ B (0 ≤ x < A). 【0061】 The inventors found that, when 1 - 2B/A ≥ 0, the polynomial function $c(x) = (B/A^2)x^2 + (1 - 2B/A)x + B$ satisfies these conditions. When A = eB, the exponential function $c(x) = B \exp(x/(eB))$ was also found to satisfy these conditions. However, since A = eB is required, A and B cannot be given independently. A polynomial function, which allows A and B to be selected independently (subject only to 1 - 2B/A ≥ 0), is therefore preferable. Hereinafter, the definition of the label interval d(i, j) is replaced with c(d(i, j)), to which the correction function is applied. 【0062】 Figure 4 is a flowchart showing an example of the alignment process executed in step S0304. The outline of the alignment process performed in step S0304 will be described with reference to Figure 4. This process includes the processing of the index search unit 122 and the processing of the label alignment unit 123. The content of each step in the alignment process shown in Figure 4 is described below. 【0063】 S0401: In the measurement data M, the label intervals are corrected using the function c(x). Let $F'_i$ be the position of the i-th label before correction; then the position $F_i$ of the label after correction is given by the following formula. 
$F_0 = F'_0, \quad F_i = F'_0 + \sum_{1 \le i' \le i} c(F'_{i'} - F'_{i'-1})$ 
Note that the label positions $G_i$ of the reference genome are assumed to have been corrected in the same way in advance in S0301, before the index construction unit 121 is executed. 【0064】 S0402: Using the index search unit of Prior Application 2, a region of the reference genome similar to the measurement data M is searched for, and a list H associating the labels of the two is created. Let |H| be the length of list H. Each element of H is an integer pair $(p_x, q_x)$ (x = 0, 1, ..., |H|-1), which indicates that the k-tuple starting from the $p_x$-th label of the reference genome has been mapped to the k-tuple starting from the $q_x$-th label in M. If H is short, it is highly likely that the measurement data M is not correctly mapped to the location of the reference genome from which it originates, so, for a parameter $H_{len}$, satisfying $|H| \ge H_{len}$ is a condition for determining alignment success. In addition, to prevent excessive mismatches between labels, for a parameter $H_{dist}$ set as the upper limit of the acceptable number of missing labels, satisfying $p_x < p_{x+1} \le p_x + H_{dist}$ and $q_x < q_{x+1} \le q_x + H_{dist}$ (x = 0, 1, ..., |H|-2) is a condition for determining alignment success. Furthermore, let $P_j$ be the j-th integer value of the k-tuple P starting from label $p_x$ in the reference genome, and $Q_j$ be the j-th integer value of the k-tuple Q starting from label $q_x$ in the measurement data M. To allow for slight differences while preventing large deviations, for a parameter $H_{diff}$, satisfying $|P_j - Q_j| \le H_{diff}$ (j = 0, 1, ..., k-1) is a condition for determining alignment success. 【0065】 S0403: If no H that satisfies the conditions for determining alignment success can be found in S0402, the alignment is determined to be a failure. 【0066】 S0404: The expansion ratio of the measurement data M is calculated as $R_M \leftarrow (G_{q_{|H|-1}} - G_{q_1}) / (F_{p_{|H|-1}} - F_{p_1})$. H is a list of labels mapped with k-tuples, but in the following steps a list A is created that also maps the other labels. First, A is initialized as an empty list. 
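The polynomial correction function of paragraph 【0061】 and the position update of S0401 can be sketched as follows. This is an illustrative sketch, not the apparatus's implementation: the function names, the choice to leave intervals of A or more unchanged (suggested by conditions (1) and (2) but not stated as code in the text), and the parameter values A = 2000 and B = 1000 are assumptions. The observed positions are the 500-base example of Figure 3B with the measurement errors applied.

```python
def correct_interval(x, A, B):
    """Polynomial correction c(x) = (B/A^2)x^2 + (1 - 2B/A)x + B for 0 <= x < A.

    Intervals of A or more are returned unchanged (assumption of this sketch).
    """
    if 1 - 2 * B / A < 0:
        raise ValueError("the polynomial requires 1 - 2B/A >= 0")
    if x >= A:
        return x
    return (B / A**2) * x**2 + (1 - 2 * B / A) * x + B


def correct_positions(raw, A, B):
    """S0401: rebuild label positions by accumulating corrected intervals."""
    corrected = [raw[0]]
    for prev, cur in zip(raw, raw[1:]):
        corrected.append(corrected[-1] + correct_interval(cur - prev, A, B))
    return corrected


A, B = 2000.0, 1000.0                 # illustrative parameter values only
observed = [0, 300, 800, 1500, 2200, 2500]

print(correct_interval(0, A, B))      # lower limit B at x = 0
print(correct_interval(A, A, B))      # c(A) = A
print(correct_positions(observed, A, B))
```

With these example values the corrected intervals are all at least B, so an error of a few hundred bases changes the interval ratios far less than it does for the raw 300 to 700 base intervals.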
【0067】 S0405: For the leftmost part of the measurement data M, the labels from the $(p_0 - q_0 - W)$-th to the $p_0$-th in the reference genome are matched with the labels from the 0th to the $q_0$-th in the measurement data M. Here, W is a parameter that represents the maximum allowable excess or deficiency of labels. So that the alignment starts from the label pair represented by $(p_0, q_0)$, i1 and j1, which represent the starting points of the label alignment, are set to $p_0$ and $q_0$. The scaling factor of the measurement data M is set to $R = R_M$. The alignment type a can be set to either open, for handling the edges of the measurement data, or close, for handling everything else. Since S0405 handles the left edge of the measurement data M, it is set to open. 【0068】 S0406: From this point onward, processing proceeds sequentially from the 0th label pair in H until just before processing the rightmost end of the measurement data M. The alignment type is set to close. In this case, the start and end positions of the alignment are determined by the k-tuples, so there is no need to consider the scaling factor. Therefore, R is set to 1.0 to ignore the scaling factor. 【0069】 S0407: The labels from the $p_x$-th to the $p_{x+1}$-th in the reference genome are matched with the labels from the $q_x$-th to the $q_{x+1}$-th in the measurement data M. 【0070】 S0408: Increment the value of x, the variable that specifies the label pair in list H, by 1. 【0071】 S0409: Repeat steps S0407 onwards until the end of list H is reached. 【0072】 S0410: For the rightmost part of the measurement data M, the labels from the $p_{|H|-1}$-th to the $(p_{|H|-1} + |F| - q_{|H|-1} + W)$-th in the reference genome are matched with the labels from the $q_{|H|-1}$-th to the $(|F| - 1)$-th in the measurement data M. The scaling factor of the measurement data M is set to $R = R_M$. The alignment type a is set to open because this is the right edge of the measurement data. 
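The way steps S0405 to S0410 split one measurement into a left edge, a series of anchored middle segments, and a right edge can be sketched as follows. The function name `plan_segments` and the tuple layout are hypothetical, and the sketch does not clamp the label ranges to the ends of the reference genome, which a real implementation would presumably need to do.

```python
def plan_segments(H, n_labels, W, R_M):
    """List the label ranges handed to the label alignment step.

    H        : list of (p, q) pairs from the index search
    n_labels : |F|, the number of labels in the measurement data
    W        : maximum allowable excess or deficiency of labels
    R_M      : expansion ratio of the measurement data
    Each entry is (ref_start, ref_end, meas_start, meas_end, R, type).
    Ranges are not clamped to valid indices (assumption of this sketch).
    """
    p0, q0 = H[0]
    p_last, q_last = H[-1]
    segments = [(p0 - q0 - W, p0, 0, q0, R_M, "open")]              # S0405
    for (p, q), (p_next, q_next) in zip(H, H[1:]):                   # S0406 to S0409
        segments.append((p, p_next, q, q_next, 1.0, "close"))
    segments.append((p_last, p_last + n_labels - q_last + W,
                     q_last, n_labels - 1, R_M, "open"))             # S0410
    return segments


for seg in plan_segments([(10, 2), (12, 4)], n_labels=8, W=1, R_M=1.05):
    print(seg)
```

Only the two open segments use the scaling factor R_M, because their outer ends are not pinned by a k-tuple match.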
【0073】 Figure 5 is a flowchart showing an example of the label alignment process performed in steps S0405, S0407, and S0410. It extends a known method (Nagarajan N et al. Bioinformatics. 24(10):1229-35, 2008) so that both open and close alignments can be processed and the scaling factor can be specified. Figure 6 shows an example of the normalization in S0502. Figure 7 shows an example of the score function in S0503. The processing in each step of the label alignment unit 123 is described with reference to Figures 5 to 7.
【0074】 S0501: A three-dimensional array D with dimensions |i2-i1|, |j2-j1|, and 3 is created. In D, for example, the first dimension corresponds to the label position number in the reference genome, and the second dimension corresponds to the label position number in the measurement data. Elements where the two correspond well are given good scores (e.g., negative values), and elements where they do not correspond are given bad scores (e.g., positive values). The correspondence between label positions in the two sequences is explored by following the elements that give good scores from the top left (i.e., [0,0]) toward the bottom right of the array. If only the best score had to be recorded, D could be a two-dimensional array; however, to save, for each element, the path from the top-left element of D that yields its best score, two additional integer values must be recorded, so D is a three-dimensional array. In this step D is initialized, and in the following steps the elements of D are filled in from the top left toward the bottom right.
【0075】 S0502: The alignment range of the measurement data M is normalized so that its total length is R, and the position of each marker is denoted f_j (0 ≤ j ≤ |j2 - j1|). The alignment range of the reference genome is normalized so that its total length is 1.0, and the position of each marker is denoted g_i (0 ≤ i ≤ |i2 - i1|). An example of normalization is shown in Figure 6. Both variables i and j are initialized to 1.
【0076】 S0503: Among the i' and j' satisfying 0 ≤ i' < i and 0 ≤ j' < j, the pair that minimizes S = D[i' - 1, j' - 1, 0] + s(g_{i+1} - g_{i'}, f_{j+1} - f_{j'}) is sought. Here, D[0,0,0] = 0 and D[i,0,0] = D[0,j,0] = ∞ (i, j > 0). s(x,y) is the score function and is defined as in Equation 1 of Figure 7. s(x,y) takes a large negative value when the ratio of the two marker intervals is close to 1, and a large positive value when the ratio deviates from 1. The boundary between positive and negative is given by the parameter r_0. To adjust the balance between the scores when the ratio is close to 1 and when it deviates from 1, a negative score can be multiplied by a coefficient J given as a parameter. An example of the score function is shown in Figure 7. Intuitively, when the label g_i of the reference genome and the label f_j of the measurement data M are to be associated, this step extends the range of label correspondences by selecting, from all the labels to their left, the most appropriate pair as the preceding correspondence.
【0077】 S0504: Using the i' and j' obtained in S0503, the following values are stored in D. D[i, j, 0] is the minimized sum of the scores of the marker intervals when the markers from the starting point up to marker i of the reference genome and marker j of the measurement data are associated, and corresponds to the optimal alignment up to marker i of the reference genome and marker j of the measurement data. By following the values stored in D[i, j, 1] and D[i, j, 2], the optimal alignment can be traced back on D.
D[i, j, 0] ← S
D[i, j, 1] ← i'
D[i, j, 2] ← j'
【0078】 S0505: S0503 to S0505 are executed for all possible values of j.
【0079】 S0506: S0503 to S0506 are executed for all possible values of i.
【0080】 S0507: Whether the alignment type is open or close is determined in order to decide the end position of the output label alignment.
【0081】 S0508: When the alignment type is open, D is scanned to find the i and j that give the minimum value of D[i, j, 0], and that position is taken as the endpoint. Open alignment is used when the endpoint of each sequence being compared is not fixed, so the endpoint must be determined from the result of the search. The element with the best score is therefore regarded as the endpoint, and the marker positions on each sequence that yielded the best score are identified by tracing back through D from that provisional endpoint toward the starting point.
【0082】 S0509: When the alignment type is close, the bottom-right element of D is used as the endpoint. Close alignment is used when both the start and end points of each sequence being compared are fixed, so the element of D corresponding to the end point can be used as the endpoint. The marker positions on each sequence that yielded the best score are identified by tracing back through D from that endpoint toward the starting point.
【0083】 S0510: The output label alignment A is initialized as a list [(G_i, F_j)] that holds only the pair of label numbers at the endpoint.
【0084】 S0511: S0511 to S0513 are repeated until the starting point of the alignment is reached.
【0085】 S0512: The optimal preceding marker position recorded in D is read out.
【0086】 S0513: Whether the starting point on the reference genome side is on the left is determined by whether i1 < i2.
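The fill and traceback in S0501 to S0513 can be illustrated with a short dynamic-programming sketch. It makes several assumptions: it uses its own zero-based indexing in which D[i][j] refers to the pair (g_i, f_j), it takes the explicit score form from claim 9 (J × (rd - r0) when rd < r0, otherwise rd - r0), it only handles the left-to-right case i1 < i2, and the names `score` and `align` and the default values of r0 and J are hypothetical. Marker positions are assumed to be strictly increasing so that every interval is positive.

```python
import math

def score(x, y, r0, J):
    """Score two positive intervals: negative if their ratio is below r0."""
    rd = max(x, y) / min(x, y)          # ratio of larger to smaller, always >= 1
    d = rd - r0
    return J * d if d < 0 else d

def align(g, f, R=1.0, closed=True, r0=1.2, J=2.0):
    """Return the list of (i, j) label pairs with the smallest total score."""
    n, m = len(g) - 1, len(f) - 1
    gs = [(v - g[0]) / (g[-1] - g[0]) for v in g]        # reference: length 1.0
    fs = [R * (v - f[0]) / (f[-1] - f[0]) for v in f]    # measurement: length R
    # D[i][j] = (best score, previous i, previous j); (0, 0) is the start.
    D = [[(math.inf, -1, -1)] * (m + 1) for _ in range(n + 1)]
    D[0][0] = (0.0, -1, -1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            for ip in range(i):                          # candidate predecessors
                for jp in range(j):
                    if D[ip][jp][0] == math.inf:
                        continue
                    s = D[ip][jp][0] + score(gs[i] - gs[ip], fs[j] - fs[jp], r0, J)
                    if s < D[i][j][0]:
                        D[i][j] = (s, ip, jp)
    if closed:
        i, j = n, m                                      # fixed endpoint
    else:
        cells = [(a, b) for a in range(n + 1) for b in range(m + 1)]
        i, j = min(cells, key=lambda c: D[c[0]][c[1]][0])  # best-scoring endpoint
    path = []
    while i >= 0:                                        # trace back to the start
        path.append((i, j))
        _, i, j = D[i][j]
    return path[::-1]
```

For a reference with markers at 0, 1000, 2500, 4000, 7000 and a measurement with markers at 0, 1010, 3990, 7050, the sketch skips the reference marker at 2500, because pairing the remaining intervals gives three ratios close to 1 and therefore three negative scores.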
【0087】 S0514: In the normal case, i1 < i2, and the label position of the number (i, j) that was read out is added to the beginning of list A.
【0088】 S0515: If i1 > i2, the label position of the number (i, j) that was read out is added to the end of list A.
【0089】 Figure 8 is a flowchart showing an example of the processing performed by the alignment integration unit 124 in step S0312. Figure 9 is an explanatory diagram showing an example of this processing. The processing of the alignment integration unit 124 is described with reference to these figures.
【0090】 S0801: Step a in Figures 8 and 9 is executed. In this step, the alignment of each measurement data is obtained from the set S' of alignments that overlap the processing target section, constructed in S3010.
【0091】 S0802: Step b in Figures 8 and 9 is executed. In this step, each alignment A included in S' is examined, and for each reference genome label associated with measurement data in A, the number of associated measurement data is counted. When the final count is at or below a certain value I_TN, the label is considered not to be present in the subject's genome. The value of I_TN can be given as a fixed parameter, but it is preferable to construct a probabilistic model based on the distribution of the measurement data and their labels, and to set I_TN to the upper limit of the number of false positives that is acceptable for a label not present in the subject's genome.
【0092】 S0803: Step c in Figures 8 and 9 is executed. In this step, each alignment A is examined, and for each label that could not be mapped to the reference genome, an estimated position on the reference genome is calculated from the positions of the preceding and succeeding labels that were mapped to the reference genome. Labels whose estimated positions on the reference genome are within the distance given by the parameter I_denovo of one another are considered to be the same label. For each such label, the number of associated measurement data is counted. When the final count is at or above a certain value I_TP, the label is considered to be actually present in the subject's genome. The value of I_TP can be given as a fixed parameter, but it is preferable to construct a probabilistic model based on the distribution of the measurement data and their labels, and to judge the label to be present in the subject's genome only when the probability of false detection during label detection is sufficiently low.
【0093】 S0804: Step d in Figures 8 and 9 is executed. In this step, based on the results of Steps b and c, regions of the reference genome that are not present in the subject and regions that are present only in the subject are identified. The former are removed from the reference genome and the latter are added to it, constructing a new reference genome for the subject.
【0094】 S0805: Step e in Figures 8 and 9 is executed. In this step, the measurement data underlying each alignment included in set S' is realigned to the subject genome constructed in Step d. In this alignment, it is preferable to align each measurement data directly to the reference genome using the label alignment unit 123, without using the index search unit 122. The alignment type is set to close, and the alignment is executed so as to fill in between the outermost pair of labels that were associated before the reconstruction of the reference genome. It is then determined whether the alignment has changed. If the alignment has changed, the processing from S0802 onward is repeated.
【0095】 S0806: If the alignment has not changed, the genome for the subject is regarded as having converged and the processing ends.
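Steps b to d can be sketched as a single update of the reference label positions. This is a simplified illustration under stated assumptions: linear interpolation between the flanking mapped labels is assumed for the position estimate, unmapped labels are grouped by chaining neighbours that lie within I_denovo of each other, each estimated position is assumed to come from a different measurement data, and the names `estimate_position`, `update_reference`, and `support` are hypothetical. The realignment and convergence test of S0805 and S0806 are not modeled.

```python
import math

def estimate_position(meas_pos, left, right):
    """Place an unmapped label on the reference by linear interpolation.

    left and right are (measurement position, reference position) pairs of the
    nearest mapped labels before and after the unmapped label.
    """
    (m0, r0), (m1, r1) = left, right
    return r0 + (meas_pos - m0) * (r1 - r0) / (m1 - m0)

def update_reference(ref_labels, support, novel_positions, i_tn, i_tp, i_denovo):
    """Return the label positions of a subject-specific reference genome."""
    # Step b: drop reference labels supported by i_tn or fewer measurement data.
    kept = [pos for pos in ref_labels if support.get(pos, 0) > i_tn]
    # Step c: group estimated positions whose neighbours lie within i_denovo,
    # and accept a group seen in at least i_tp measurement data.
    added, cluster = [], []
    for pos in sorted(novel_positions) + [math.inf]:   # sentinel flushes the last group
        if cluster and pos - cluster[-1] > i_denovo:
            if len(cluster) >= i_tp:
                added.append(sum(cluster) / len(cluster))
            cluster = []
        cluster.append(pos)
    # Step d: combine the surviving and the newly added labels.
    return sorted(kept + added)
```

In a full implementation this update would be repeated, with realignment in between, until the alignments stop changing.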
【0096】 <Embodiment 1: Summary> The genome labeling position alignment device according to Embodiment 1, when given two numerical values A and B as parameters, corrects each labeling interval less than or equal to A by a value calculated using a polynomial function, so that the interval is treated as being at least B. This suppresses the influence of measurement errors when alignment is performed using the ratio of labeling intervals, even when a labeling interval is short.
【0097】 The genome labeling position alignment device according to Embodiment 1 performs label alignment using labeling intervals normalized by the stretching ratio of the measurement data. A score function is used that assigns scores of different signs depending on whether the labeling intervals are judged to be similar or dissimilar. The labels are then associated, i.e., aligned, so that the sum of the scores of all associated labeling intervals is minimized. This makes it possible to compare the measurement data and the reference genome appropriately, regardless of whether open alignment or close alignment is used.
【0098】 The genome labeling position alignment device according to Embodiment 1 identifies, based on multiple measurement data derived from the subject, labels common to the reference genome, labels present in the reference genome but only in some molecules of the measurement data, labels unique to the measurement data, and labels unique to the reference genome. This effectively eliminates measurement errors and alignment errors that cannot be eliminated by aligning a single measurement data, and allows the differences between the reference genome and the subject's genome to be estimated.
【0099】 <Embodiment 2> Figure 10 shows an example of a user interface provided by a genome labeling position alignment device according to Embodiment 2 of the present invention. The genome labeling position alignment device described in Embodiment 1 can provide a user interface 1001 such as the one shown in Figure 10 via, for example, an input / output device 102.
【0100】 The user interface 1001 includes, for example, an input data setting area 1002 and an alignment result display area 1003. The input data setting area 1002 specifies the data file to be input to the computer 100. The alignment result display area 1003 displays on screen the correspondence between the label positions of the reference genome and the label positions of the measurement data, obtained as a result of the computer 100 performing alignment using the label alignment unit 123 and other units. The alignment displayed here also shows the difference between the label positions of the new reference genome and the label positions of the measurement data, which is the calculation result of the alignment integration unit 124 described in Embodiment 1. In the example of Figure 10, the parameters used in the present invention can be set arbitrarily. Because the contents of the alignment result display area 1003 are updated according to the input parameters, the effect of changing a parameter can be confirmed easily.
【0101】 <Regarding Variations of the Invention> The present invention is not limited to the embodiments described above, and includes various variations. For example, the embodiments described above are described in detail to make the present invention easier to understand, and the invention is not necessarily limited to embodiments having all the configurations described. Part of the configuration of one embodiment can be replaced with the configuration of another embodiment, and the configuration of another embodiment can be added to the configuration of one embodiment. Furthermore, for part of the configuration of each embodiment, other configurations can be added, deleted, or substituted.
【0102】 Furthermore, each of the above configurations, functions, processing units, and processing means may be implemented in hardware, either partially or entirely, by designing them as integrated circuits, for example. Alternatively, each of the above configurations and functions may be implemented in software by having the processor interpret and execute programs that implement each function. Information such as programs, tables, and files that implement each function can be stored in memory, a recording device such as a hard disk or SSD (Solid State Drive), or a recording medium such as an IC card, SD card, or DVD.
【0103】 Furthermore, the control lines and information lines shown are those deemed necessary for explanatory purposes, and not all control lines and information lines are necessarily shown in the actual product. In reality, it is safe to assume that almost all components are interconnected.
【0104】 100 Computer, 102 Input / Output device, 110 CPU, 111 Memory, 112 Auxiliary storage device, 113 Interface, 121 Index construction unit, 122 Index search unit, 123 Label alignment unit, 130 Reference genome label position, 140 Measurement data
Claims
1. An information processing device for identifying the position of a sub-sequence of a reference nucleic acid sequence corresponding to a target nucleic acid sequence, wherein the position of a sub-sequence represented by a predetermined specific sequence of 5 to 10 bases is called a label position, and the distance between adjacent sub-sequences is called a label interval, the device comprises: a storage device for storing data describing the label positions of the target nucleic acid sequence; and a processor for identifying the position of the sub-sequence using the data, wherein the processor identifies the interval between labels of the target nucleic acid sequence and the reference nucleic acid sequence; the processor identifies the position of the sub-sequence of the target nucleic acid sequence in the reference nucleic acid sequence by comparing the interval of the reference nucleic acid sequence with the ratio of two or more such intervals; if the interval of the reference nucleic acid sequence or the interval of the target nucleic acid sequence is less than or equal to a lower limit, the processor corrects the interval of the reference nucleic acid sequence so that it is greater than the lower limit, and then compares it with the ratio; and the processor uses a correction function composed of a polynomial with the interval as an argument when correcting the interval of the reference nucleic acid sequence.
2. The information processing apparatus according to claim 1, characterized in that the correction function is configured such that the upper limit of the interval corrected by the correction function is the same before and after the correction by the correction function.
3. The information processing apparatus according to claim 1, characterized in that the correction function is configured such that the slope of the correction function is 1 at the upper limit of the interval corrected by the correction function.
4. The information processing apparatus according to claim 1, characterized in that the correction function is configured such that the slope of the correction function is 0 or greater when the interval corrected by the correction function is less than the upper limit of the interval corrected by the correction function.
5. The information processing apparatus according to claim 1, characterized in that, where x is the interval, c(x) is the corrected interval, A is the upper limit of the label interval to be corrected, and B is the lower limit of the corrected label interval, the correction function is represented by c(x) = (B / A^2)x^2 + (1 - 2B / A)x + B.
6. An information processing device for identifying the position of a sub-sequence of a reference nucleic acid sequence corresponding to a target nucleic acid sequence, wherein the position of a sub-sequence represented by a predetermined specific sequence of 5 to 10 bases is called a label position, and the distance between adjacent sub-sequences is called a label interval, the device comprises: a storage device for storing data describing the label positions of the target nucleic acid sequence; a processor for associating the label positions of the target nucleic acid sequence with the label positions of the reference nucleic acid sequence using the data, wherein the processor is configured to extract a portion of the labels of the target nucleic acid sequence and a portion of the labels of the reference nucleic acid sequence, and associate them while maintaining their order of appearance in each sequence; the processor calculates the interval between the extracted labels in the target nucleic acid sequence and the reference nucleic acid sequence, respectively; the processor normalizes the interval of the labels of the target nucleic acid sequence by a first coefficient; the processor normalizes the interval of the labels of the reference nucleic acid sequence by a second coefficient; the processor calculates a score using a score function that increases monotonically as the discrepancy between the label interval of the target nucleic acid sequence normalized by the first coefficient and the label interval of the reference nucleic acid sequence normalized by the second coefficient increases; and the processor associates the label positions of the target nucleic acid sequence with the label positions of the reference nucleic acid sequence by searching for a correspondence between the label positions on the reference nucleic acid sequence and the label positions on the target nucleic acid sequence such that the sum of the scores of all associated label intervals is smallest.
7. The information processing apparatus according to claim 6, wherein if the endpoints of the extracted target nucleic acid sequence and the reference nucleic acid sequence are not determined, the processor considers the label position where the correspondence relationship that results in the smallest score is obtained as the endpoint, and if the endpoints of the extracted target nucleic acid sequence and the reference nucleic acid sequence are determined, the processor terminates the search when the endpoint is reached.
8. The information processing apparatus according to claim 6, characterized in that the score function is configured to take values with different signs when the deviation is less than a reference value and when the deviation is equal to or greater than the reference value.
9. The information processing apparatus according to claim 8, wherein the score function is a function that takes the two label intervals to be compared as arguments, and, where rd is the value obtained by dividing the larger of the two arguments by the smaller, r0 is the reference value, and J is a coefficient, the score is J × (rd - r0) if rd < r0, and the score is (rd - r0) if rd ≥ r0.
10. An information processing device for inferring the structure of a subject's genome from one or more measurement data obtained from the subject, comprising a processor for inferring the structure, wherein the measurement data includes coordinate values representing the position of a label represented by a predetermined specific sequence of 5 to 10 bases in a DNA fragment obtained by fragmenting the subject's genomic DNA, the processor associates the position of the label in the measurement data with the position of the label in a reference genome, the processor deletes the associated label from the reference genome if the number of measurement data having a label associated with the position of the label in the reference genome is less than a first integer value, the processor infers the position on the reference genome of each label in the measurement data that was not associated with the position of a label in the reference genome, and adds a new label position to the reference genome at an inferred position if the number of measurement data having an unassociated label at that same inferred position is greater than or equal to a second integer value, and the processor associates the label positions of the new reference genome obtained by the deletion or addition with the label positions in the measurement data.
11. The information processing apparatus according to claim 10, wherein the processor associates the labeling positions of the new reference genome obtained by the deletion or addition with the labeling positions of the measurement data, and if a different result is obtained from the association before the deletion or addition, the processor repeats the process of associating the new reference genome obtained by the deletion and addition with the labeling positions of the measurement data, and if no different result is obtained, the processor outputs the new reference genome obtained by the deletion or addition as the processing result.
12. The information processing apparatus according to claim 1 or 7, further comprising an input / output device that provides a user interface, wherein the user interface presents the result of the processor identifying the correspondence between the labeling position of the reference nucleic acid sequence and the labeling position of the target nucleic acid sequence.
13. The information processing apparatus according to claim 11, further comprising an input / output device that provides a user interface, wherein the user interface presents the difference between the labeling position of the new reference genome obtained by the deletion or addition and the labeling position of the measurement data.
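The correction function recited in claim 5 can be checked numerically against the properties recited in claims 2 to 4. The sketch below assumes that intervals larger than A are left unchanged, which is consistent with the function returning A at x = A with slope 1; the names `correct_interval` and `corrected_positions` are hypothetical. Since the slope at x = 0 is 1 - 2B / A, the slope stays non-negative over the whole corrected range only when A ≥ 2B.

```python
def correct_interval(x, A, B):
    """c(x) = (B / A^2) x^2 + (1 - 2B / A) x + B for x <= A, else x (assumed)."""
    if x > A:
        return x
    return (B / A**2) * x**2 + (1 - 2 * B / A) * x + B

def corrected_positions(raw, A, B):
    """Rebuild label positions by accumulating the corrected intervals."""
    out = [raw[0]]                       # the first label keeps its position
    for prev, cur in zip(raw, raw[1:]):
        out.append(out[-1] + correct_interval(cur - prev, A, B))
    return out
```

With A = 1000 and B = 200, an interval of 0 is lifted to 200, an interval of 500 becomes approximately 550, and an interval of 1000 is unchanged, so short intervals are stretched while the function joins the identity smoothly at A.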