Information processing device

The information processing device corrects label intervals and uses a scoring function to accurately align measurement data with a reference genome, addressing measurement errors and detecting structural variations in genomes.

JP2026100866A · Pending · Publication Date: 2026-06-22 · HITACHI HIGH TECH CORP

Patent Information

Authority / Receiving Office
JP · JP
Patent Type
Applications
Current Assignee / Owner
HITACHI HIGH TECH CORP
Filing Date
2024-12-10
Publication Date
2026-06-22

AI Technical Summary

Technical Problem

Current DNA sequencing technologies struggle to accurately detect structural variations in genomes due to measurement errors that cause stretching or contraction of DNA fragments, making it difficult to align label positions with a reference genome, especially when label intervals are short.

Method used

An information processing device uses a polynomial correction function to adjust label intervals and a scoring function to minimize alignment errors, enabling accurate alignment of measurement data with a reference genome by correcting apparent stretching and shrinking of nucleic acid molecules.

Benefits of technology

The device effectively aligns large numbers of measurement molecules to a reference genome, identifying structural variations and differences by mitigating measurement errors, even with short label intervals.

✦ Generated by Eureka AI based on patent content.

Abstract

[Problem] To align a large number of molecular data, each consisting of numerical values representing label positions, to a reference genome and thereby detect structural variations. [Solution] When the size of the label interval of a reference nucleic acid sequence or the label interval of a target nucleic acid sequence is less than or equal to a lower limit, the information processing device according to the present invention corrects the interval to a value greater than the lower limit using a correction function composed of a polynomial that takes the interval as its argument.

Description

[Technical Field]

[0001] This invention relates to a technique for identifying the location of a subsequence of a nucleic acid sequence.

[Background technology]

[0002] An individual genome contains numerous differences, or mutations, from a standard reference genome. The genomes of cancerous cells are thought to contain even more mutations, including those that are pathogenic. Among these mutations, structural variants (SVs), which typically involve large changes in the base sequence of several thousand or more bases, are less numerous than smaller mutations but play a particularly important role in disease. However, current major DNA sequencing technologies are limited to reading only a few hundred bases at a time, making it difficult to capture such large-scale genomic changes. Therefore, low-cost technologies are needed to analyze broader regions of the genome.

[0003] A technique called genome mapping can be used for these purposes. In genome mapping, a short, specific base sequence of about seven bases (hereinafter referred to as a marker) is labeled on the genome using fluorescence or other means. Then, DNA obtained from a subject is amplified and cut to generate numerous DNA fragments consisting of hundreds of thousands of bases. On each of these DNA fragments, the approximate base position from the beginning of the DNA where the marker appears, i.e., the marker position, is measured. Based on the obtained marker positions, the location on the genome from which each DNA fragment originates can be identified from the pattern of marker spacing. Optical genome mapping, which uses fluorescence labeling, is commonly used, but other methods such as electrical detection also exist. In genome mapping, the observed marker positions for each DNA fragment can be represented as a numerical sequence. This numerical sequence will be referred to as measurement data below.

[0004] As a document that discloses a technology similar to such a technology, there is Patent Document 1. This publication describes, "A method for mapping positions on chromosomal DNA, which includes hybridizing a nucleic acid to one type of repetitive base sequence in unfolded or extended chromosomal DNA, measuring the mutual distance on chromosomal DNA for a plurality of sets of the repetitive base sequences by using a labeling substance introduced into the hybridized nucleic acid, and then determining the region or position on the chromosome of the set and the repetitive base sequence included in the set based on the characteristics of the measured distance." (Claim 1).

[0005] The process of comparing each measurement data obtained by genome mapping with the label positions derived from the reference genome sequence, and thereby clarifying which parts are common and which differ, is called alignment. If there are no genomic variations or measurement errors, each label position in the measurement data corresponds to one of the label positions on the reference genome. When a structural variation is present, however, the correspondence between the labels in the measurement data and the labels on the reference genome becomes discontinuous. Structural variations can be detected by finding such anomalies in the label positions.

[0006] As a technology related to the present application, the inventor of the present invention provided an alignment method using the ratio of adjacent label intervals, which is an amount invariant to the apparent stretching and shrinking of molecules, in Japanese Patent Application No. 2023-051508 (hereinafter referred to as Prior Application 1). Furthermore, the inventor of the present invention provided an alignment method corresponding to short label intervals, which is difficult to handle with the technology of Prior Application 1, in PCT / JP2023 / 043763 (hereinafter referred to as Prior Application 2).

Prior Art Documents

Patent Documents

[0007]

Patent Document 1

Summary of the Invention

Problems to be Solved by the Invention

[0008] To detect structural variations, accurate alignment is necessary, which requires addressing errors in the measurement data. One such error is that the overall length of each DNA fragment may appear stretched or contracted in the measurement data. This is due to the non-uniformity of molecular migration speeds during measurement. While Patent Document 1 discloses labeling repeat sequences on the genome and identifying their positions within the genome, it does not disclose a method for matching the label positions between the measured label intervals and a reference genome.

[0009] Prior Application 1 attempts to perform alignment appropriately despite the above-mentioned errors by using the ratio of label intervals. However, since the label positions measured by genome mapping contain errors of several hundred bases, when a label interval is as short as several hundred bases the ratio may change significantly due to the error, making alignment impossible. Prior Application 2 therefore provides a means of enabling alignment even when short label intervals are observed. However, it does not provide a means of detecting structural variations when a large number of measurement data, each originating from a single molecule, are obtained.

[0010] This invention was made in view of the above-mentioned problems, and aims to detect structural variations by aligning a large number of measurement molecules to a reference genome.

[Means for solving the problem]

[0011] When the size of the label interval of the reference nucleic acid sequence or the label interval of the target nucleic acid sequence is less than or equal to a lower limit, the information processing device according to the present invention corrects the interval to a value greater than the lower limit using a correction function composed of a polynomial that takes the interval as its argument.

[Effects of the Invention]

[0012] According to the information processing device of the present invention, even when the labeling interval is short, alignment can be performed that can handle the apparent stretching and shrinking of nucleic acid molecules, and differences from the reference genome can be identified. Other issues, configurations, and effects will be clarified by the following description of embodiments.

[Brief explanation of the drawing]

[0013]
[Figure 1] A block diagram showing an example configuration of a genome labeling position alignment device according to Embodiment 1.
[Figure 2] An example of the data structure of measurement data 140.
[Figure 3A] A flowchart showing an example of processing using the genome labeling position alignment device.
[Figure 3B] A diagram illustrating the method for correcting the label spacing, which is performed in step S0301.
[Figure 3C] An example of a correction function in this application.
[Figure 4] A flowchart showing an example of the alignment process performed in step S0304.
[Figure 5] A flowchart showing an example of the label alignment process performed in steps S0405, S0407, and S0410.
[Figure 6] An example of normalization in S0502.
[Figure 7] An example of a score function in S0503.
[Figure 8] A flowchart showing an example of the processing performed by the alignment integration unit 124 in step S0312.
[Figure 9] An explanatory diagram illustrating an example of what happens when the process shown in Figure 8 is executed.
[Figure 10] An example of a user interface provided by a genome labeling position alignment device according to Embodiment 2.

[Modes for carrying out the invention]

[0014] The genome labeling position alignment device according to an embodiment of the present invention will be described below. In the following figures, common components are denoted by the same reference numerals, and redundant explanations are omitted.

[0015] When calculating the ratio of label spacing to accommodate the apparent elongation or contraction of the overall length of a DNA fragment, if the label spacing is short, the value of the ratio can be greatly disturbed due to measurement errors of several hundred base pairs at the label position. Furthermore, in order to detect structural variations based on genome mapping results, it is necessary to integrate the alignment results of measurement data from numerous molecules and extract the differences from the reference genome that appear in common across multiple molecules. The genome labeling alignment device of this embodiment realizes alignment processing that can address these problems.

[0016] <Embodiment 1> Figure 1 is a block diagram showing an example configuration of a genome labeling position alignment device according to Embodiment 1 of the present invention. Figure 1 is identical to Figure 1 of Prior Application 2 except for some of the processing performed by the CPU, but will be described again in this specification. The genome labeling position alignment device (information processing device) is composed of a computer 100. The computer 100 includes, for example, a CPU (Central Processing Unit) 110, memory 111, auxiliary storage device 112, and interfaces 113 to 115. The hardware included in the computer 100 is electrically connected, for example, via internal communication lines such as a bus.

[0017] The CPU 110 reads programs and data stored in memory 111 and executes programs stored in memory 111. The CPU 110 includes a processor. The CPU 110 includes, for example, an index building unit 121, an index search unit 122, a label alignment unit 123, and an alignment integration unit 124, all of which are functional units. The computer 100 functions as a genome label position alignment device by having the CPU 110 perform processing.

[0018] Memory 111 temporarily stores programs executed by the CPU 110 and data used during program execution. Memory 111 includes non-volatile memory elements such as ROM (Read Only Memory) and volatile memory elements such as RAM (Random Access Memory). ROM stores immutable programs (e.g., BIOS (Basic Input / Output System)). RAM is a high-speed, volatile memory element such as DRAM (Dynamic Random Access Memory) and temporarily stores programs executed by the CPU 110 and data used during program execution.

[0019] Memory 111 stores programs that implement, for example, the index construction unit 121, the index search unit 122, the label alignment unit 123, and the alignment integration unit 124. Memory 111 also stores the reference genome labeling position 130 and the measurement data 140.

[0020] For example, the CPU 110 functions as an index construction unit 121 by operating according to an index construction program loaded into memory 111, as an index search unit 122 by operating according to an index search program loaded into memory 111, as a label alignment unit 123 by operating according to a label alignment program loaded into memory 111, and as an alignment integration unit 124 by operating according to an alignment integration program loaded into memory 111.

[0021] The auxiliary storage device 112 non-volatilely stores the program executed by the CPU 110 and the data used during program execution. That is, the program is read from the auxiliary storage device 112, loaded into memory 111, and executed by the CPU 110.

[0022] The auxiliary storage device 112 is a high-capacity, non-volatile storage device such as an HDD (Hard Disk Drive) or SSD (Solid State Drive). The auxiliary storage device 112 stores programs that implement the functions of the index construction unit 121, the index search unit 122, the label alignment unit 123, and the alignment integration unit 124. The auxiliary storage device 112 also stores the reference genome labeling position 130 and the measurement data 140. Here, the labeling position refers to the position in the sequence where, for example, a predefined short subsequence of 10 bases or less appears.

[0023] Interfaces 113 to 115 are devices that mediate the transmission and reception of signals and perform protocol conversion, and are connected to external devices. Interface 113 is an I / O interface connected to the input / output device 102 via a wired or wireless line. The input / output device 102 includes input devices such as keyboards, mice, touch panels, numeric keypads, scanners, microphones, and sensors, as well as output devices such as display devices, printers, and speakers. Interface 113 acquires input information from the operator received by the input / output device 102. Interface 113 also outputs the program execution results to the input / output device 102 in a format that can be viewed by the operator.

[0024] Interface 115 is a network interface connected to the external storage device 101 via network 105. Interface 115 controls communication with other devices according to a predetermined protocol.

[0025] The external storage device 101 is a non-temporary storage device that stores data handled by the computer 100. The external storage device 101 includes, for example, storage devices such as HDDs and SSDs. The external storage device 101 can store reference genome labeling locations 130 and measurement data 140.

[0026] Data transmission and reception between the external storage device 101 and the computer 100 are performed via the network 105. The network 105 includes, for example, a LAN (Local Area Network) and the Internet. However, the type of network 105 is not limited to those described above. The network 105 may be wired or wireless.

[0027] Interface 114 is connected to a drive device that performs reading and writing to the removable media 103. Interface 114 includes, for example, a serial interface such as USB (Universal Serial Bus).

[0028] The removable media 103 is a non-temporary storage medium for storing data handled by the computer 100. The removable media 103 includes optical discs such as CDs and DVDs, magnetic discs, and semiconductor memory. The removable media 103 can store reference genome labeling locations 130 and measurement data 140.

[0029] Some or all of the programs executed by the CPU 110 may be provided to the computer 100 via the interface 114 from the removable media 103, which is a non-temporary storage medium, or via the network 105 from the external storage device 101, which is a non-temporary storage device, or from an external computer equipped with the external storage device 101, and stored in the non-volatile auxiliary storage device 112, which is a non-temporary storage medium.

[0030] In Figure 1, the computer 100, which constitutes the genome labeling position alignment apparatus, is connected to an external storage device 101 and removable media 103. However, these external devices can be omitted if they are not needed. Furthermore, an input / output device 102 equipped with the external storage device 101 may be connected to the computer 100 via a network 105. Alternatively, the computer 100 may have a built-in device with input / output functionality instead of being connected to the input / output device 102.

[0031] Hereinafter, in this specification, a k-tuple is a combination of k integer values (where k is a predefined parameter). The index construction unit 121 constructs an index based on k-tuples as described in Prior Applications 1 and 2. As in Prior Applications 1 and 2, the index uses k-tuples based on the ratios of label intervals, rather than k-tuples that represent the intervals between label positions themselves (hereinafter also simply called label intervals). Furthermore, as in the Prior Applications, the index constructed by the index construction unit 121 is designed to handle cases where some labels are missing, in order to cope with labels that failed to be detected.

[0032] As described in Prior Applications 1 and 2, the index search unit 122 uses an index based on k-tuples constructed by the index construction unit 121 to identify the location on the reference genome that is most likely to correspond to each measurement data. During the search, the index search unit 122 performs processing to address false detections of labels.

[0033] The label alignment unit 123 can find correspondences of surrounding labels in addition to the correspondences of labels found by the index search unit 122 for the label locations in the measurement data and the reference genome. The index search unit 122 can identify locations where k-tuples match, but it cannot associate other labels. Labels that the index search unit 122 could not associate are associated using a known dynamic programming alignment method (Nagarajan N et al. Bioinformatics. 24(10):1229-35, 2008), which has been extended to alignment using ratios (Figure 4, described later).

[0034] The alignment integration unit 124 calculates the alignment of measurement data and the reference genome for multiple molecules, then compares and integrates these alignments to identify labels common to the reference genome, labels present in the reference genome but only in some of the measurement data, labels only present in the measurement data, and labels only present in the reference genome, thereby clarifying the differences between the subject's genome and the reference genome.

[0035] Before the genome labeling position alignment process begins, the computer 100 has the reference genome labeling position 130 and measurement data 140 input and stored in it. The CPU 110 may, for example, read the reference genome labeling position 130 and measurement data 140 when the computer 100 is started up or when processing is executed, and load them into memory 111.

[0036] The reference genome marker locations 130 and measurement data 140 may be stored in all of the auxiliary storage device 112, the external storage device 101, and the removable media 103, or in only some of them. These data may be moved or copied to the external storage device 101 or the removable media 103 when the computer 100 is shut down or when the auxiliary storage device 112 runs out of space. Therefore, it is desirable that the reference genome marker locations 130 stored in different storage devices all contain the same information. The same applies to the measurement data 140.

[0037] Similar to Prior Applications 1 and 2, this application improves alignment accuracy by mitigating changes in the ratio due to measurement errors, even when the marking interval is short.

[0038] The reference genome marker position 130 includes numerical data representing the location of the marker present on each of the multiple reference genomes. The measurement data 140 includes data obtained by measuring the marker position on each of the numerous DNA fragments.

[0039] Figure 2 shows an example of the data structure of measurement data 140, and is a reproduction of Figure 2 from Prior Application 2. Measurement data 140 shows the ID that identifies each DNA fragment (an example of the target nucleic acid sequence), the molecular length (length of the base sequence) of each DNA fragment, and the position of the label (an example of a subsequence) on each DNA fragment. Blank spaces in measurement data 140 indicate that the label was not observed.

[0040] For example, the molecular length of a DNA fragment with ID "2" is 44,951 bases, and four markers are measured in this DNA fragment. The positions of these four markers are, respectively, the 10,844th base pair, the 19,749th base pair, the 23,353rd base pair, and the 35,735th base pair from the beginning of the DNA fragment. Note that the position of the markers in the DNA fragment indicated by measurement data 140 may, for example, indicate the position at the beginning or the position at the end of the marker.
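As a minimal sketch of how one row of measurement data 140 might be held in memory, the record for ID "2" above can be written as follows. The class name and field names are illustrative assumptions and do not come from the application.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MoleculeRecord:
    """One row of measurement data (hypothetical field names)."""
    molecule_id: str
    length: int                 # molecular length in bases
    label_positions: List[int]  # ascending positions from the fragment start

    def label_intervals(self) -> List[int]:
        """Distances between consecutive observed labels."""
        p = self.label_positions
        return [b - a for a, b in zip(p, p[1:])]


# The DNA fragment with ID "2" described in the text.
record = MoleculeRecord("2", 44951, [10844, 19749, 23353, 35735])
intervals = record.label_intervals()
print(intervals)
```

Four labels yield three label intervals, which are the quantities the later ratio-based steps operate on.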

[0041] In this embodiment, an example of a short base sequence to be labeled on a DNA fragment is GCTCTTC, which is recognized by an enzyme called Nt.BspQI. As described above, the measurement data 140 shows the labeling positions of GCTCTTC on each DNA fragment in ascending order. In other words, the measurement data 140 shows information converted into a numerical sequence (an example of a second numerical sequence) in ascending order of the labeling positions of each DNA fragment.

[0042] In this application, similar to Prior Applications 1 and 2, an index construction unit 121 is used to construct k-tuples for any label location on the reference genome based on the ratio of label intervals, and an index is constructed showing the correspondence between the constructed k-tuples and the label locations on the reference genome. The alignment process after referencing the k-tuple index is the same as in the Prior Applications, and it is possible to associate some labels with the measurement data and the reference genome.

[0043] One method for constructing a k-tuple, as described in Prior Application 2, is to form it from the set of k ratio values d(j,j+1) / d(i,i+k) (j = i, i+1, ..., i+k-1) based on labels i to i+k. Here, d(i,j) is the label interval between labels i and j. In this specification, for simplicity, only the case where the DNA reading direction is limited to one direction (5' end → 3' end) is described. Furthermore, this specification assumes that the genome contains only one chromosome and that any position in the genome can be represented by a single integer coordinate value. In actual use of the present invention, it is necessary to handle the reverse direction (3' end → 5' end) and multiple chromosomes, but those skilled in the art can easily extend the description in this application to do so. For example, to handle the reverse direction, one can add measurement data whose label positions are calculated so that the label intervals are in reverse order. To handle multiple chromosomes, one can concatenate all chromosomes into one and prohibit alignment of measurement data across two chromosomes at the concatenation point.
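The ratio-based k-tuple can be sketched as below. This is an illustration under assumptions: the function name is hypothetical, and scaling the ratios to integers with a target sum r = 100 is borrowed from the example in paragraph [0058] rather than prescribed here.

```python
from typing import List, Tuple


def ratio_k_tuple(positions: List[float], i: int, k: int, r: int = 100) -> Tuple[int, ...]:
    """Return the k values d(j, j+1) / d(i, i+k) for j = i .. i+k-1,
    scaled by r and rounded to integers (the scaling is an assumption)."""
    if i < 0 or i + k >= len(positions):
        raise ValueError("labels i .. i+k must all exist")
    span = positions[i + k] - positions[i]  # d(i, i+k)
    if span <= 0:
        raise ValueError("positions must be strictly ascending")
    return tuple(
        round(r * (positions[j + 1] - positions[j]) / span)
        for j in range(i, i + k)
    )


# The same label pattern, once as-is and once uniformly stretched by 10 %.
original = ratio_k_tuple([0, 1000, 3000, 4000], i=0, k=3)
stretched = ratio_k_tuple([0, 1100, 3300, 4400], i=0, k=3)
print(original, stretched)
```

Because every interval is divided by the same span, a uniform stretch of the molecule cancels out, which is the invariance the index relies on.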

[0044] Figure 3A is a flowchart showing an example of processing using a genome labeling site alignment device. Referring to Figure 3A, the outline of the processing performed in this embodiment will be described.

[0045] S0301: Since there is an error of several hundred base pairs in the label positions measured by genome mapping devices, the ratio of the measured label interval to the actual label interval may differ significantly when the label interval is short. Therefore, the label interval of the reference genome is corrected using a correction function, as described later.

[0046] S0302: The index construction unit 121 constructs an index based on the label positions of the reference genome. The processing performed by the index construction unit 121 is the same as in prior applications 1 and 2, except that the label interval is corrected by a correction function.

[0047] S0303: Next, compare each of all the measurement data Mi (i = 0, 1,..., n - 1, where n is the number of measurement data) with the reference genome, and associate the labeled positions of the measurement data with the labeled positions of the reference genome. First, initialize i to 0.

[0048] S0304: The label alignment unit 123 aligns the measurement data Mi with the reference genome. The processing details of the label alignment unit 123 will be described later while referring to FIG. 4.

[0049] S0305: If the alignment fails in S0304, skip S0306.

[0050] S0306: Add the alignment Ai obtained from the measurement data Mi and the reference genome to the set S. The alignment Ai is a list of pairs of the labeled numbers of the measurement data and the labeled numbers of the reference genome.

[0051] S0307: Increment i, which is the subscript of the measurement data, by 1.

[0052] S0308: If i < n, return to step S0304.

[0053] S0309: Next, for various intervals [R1, R2] on the genome, the alignments of the measurement data aligned to the interval are integrated. There are various ways to select the interval [R1, R2]; as one example, for a parameter R_size, let R1 = R_size × j / 2 and R2 = R_size × (j + 2) / 2 (j = 0, 1, ...). With this choice, intervals of size R_size are taken across the entire genome while shifting the position by R_size / 2 each time.
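The interval selection in this example can be sketched as a generator. The function name, the `genome_length` parameter, and the stopping rule (stop once R1 reaches the genome length) are assumptions added for illustration; an even R_size is assumed so that integer division is exact.

```python
from typing import Iterator, Tuple


def half_overlapping_intervals(genome_length: int, r_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (R1, R2) with R1 = r_size * j / 2 and R2 = r_size * (j + 2) / 2."""
    if r_size <= 0 or r_size % 2 != 0:
        raise ValueError("r_size is assumed to be a positive even integer")
    j = 0
    while r_size * j // 2 < genome_length:
        r1 = r_size * j // 2
        r2 = r_size * (j + 2) // 2
        yield r1, r2
        j += 1


windows = list(half_overlapping_intervals(genome_length=250, r_size=100))
print(windows)
```

Each position on the genome is covered by two consecutive intervals, so a variation near the edge of one interval falls well inside the next.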

[0054] S0310: Extract from the set S of alignments those whose labels on the genome are included in the interval [R1, R2] even for one label, and construct a subset S'.
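A minimal sketch of this extraction follows, assuming each alignment is a list of (measurement label number, reference label number) pairs as in S0306 and that `G` holds the reference label positions. The function and variable names are hypothetical.

```python
from typing import List, Sequence, Tuple

Alignment = List[Tuple[int, int]]  # (measurement label number, reference label number)


def select_alignments(S: List[Alignment], G: Sequence[int], r1: int, r2: int) -> List[Alignment]:
    """Keep every alignment with at least one reference label inside [r1, r2]."""
    return [
        A for A in S
        if any(r1 <= G[ref_idx] <= r2 for _, ref_idx in A)
    ]


G = [100, 600, 1200, 5000]         # illustrative reference label positions
S = [[(0, 0), (1, 1)], [(0, 3)]]   # two illustrative alignments
S_prime = select_alignments(S, G, 0, 1000)
print(S_prime)
```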

[0055] S0311: If the subset S' constructed in S0310 is an empty set, skip S0312.

[0056] S0312: The alignment integration unit 124 integrates the alignments included in set S'. The processing of the alignment integration unit 124 will be described later with reference to Figure 8.

[0057] S0313: If there are still sections that need processing, return to S0309.

[0058] Figure 3B illustrates the method for correcting the label spacing, which is performed in step S0301. As mentioned earlier, label positions in genome mapping contain measurement errors of several hundred base pairs. As an example, Figure 3B shows a molecule with labels spaced 500 base pairs apart and a molecule with labels spaced 5000 base pairs apart (10 times the spacing). Consider the case where the second and third labels are observed 200 base pairs upstream of their actual positions, and the fifth label is observed 200 base pairs downstream. If a k-tuple is constructed by converting the ratios of the label spacings to integers that sum to r = 100, the result for the 500 base pair spacing is 12:20:28:28:12, which is far from the original 20:20:20:20:20. In contrast, the result for the 5000 base pair spacing is 19:20:21:21:19, which is closer to the original k-tuple. Thus, the effect of measurement error becomes stronger when the label spacing is short. For this reason, Prior Application 2 provided a correction function for correcting the label interval.
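The arithmetic of this example can be reproduced with a short script. The helper names are hypothetical, and simple rounding to the nearest integer is an assumption; it happens to give a sum of 100 here but would not in general.

```python
from typing import List


def integer_ratios(positions: List[int], r: int = 100) -> List[int]:
    """Label-interval ratios scaled so that they would ideally sum to r."""
    total = positions[-1] - positions[0]
    return [round(r * (b - a) / total) for a, b in zip(positions, positions[1:])]


def observed_positions(spacing: int) -> List[int]:
    """Six equally spaced labels with the measurement errors from the text:
    labels 2 and 3 shifted 200 bp upstream, label 5 shifted 200 bp downstream."""
    true_positions = [spacing * n for n in range(6)]
    shifts = [0, -200, -200, 0, 200, 0]
    return [t + s for t, s in zip(true_positions, shifts)]


short_case = integer_ratios(observed_positions(500))
long_case = integer_ratios(observed_positions(5000))
print(short_case)
print(long_case)
```

The same absolute error of 200 base pairs moves the 500 bp ratios by up to 8 points but the 5000 bp ratios by only 1 point.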

[0059] Figure 3C shows an example of the correction function in this application. This application provides a correction function that allows more flexible and natural correction than Prior Application 2. As shown in Figure 3C, parameter A is the maximum value of the label interval to be corrected, and parameter B is the lower limit of the corrected label interval. Letting c(x) be the corrected label interval for a label interval x, c(x) should satisfy the following conditions.

[0060]
(1) To make the label interval identical before and after correction at the end of the range in which the label interval is corrected, $c(A) = A$.
(2) So that the correction function $c(x)$ connects smoothly to the uncorrected label interval $x$, $c'(A) = 1$ (the slope of $c$ at $A$ is 1, where $c'(x)$ is the derivative of $c(x)$).
(3) To preserve the order relation of label intervals before and after correction, $c'(x) \geq 0$ for $0 \leq x < A$.
(4) To set the lower limit to $B$, $c(x) \geq B$ for $0 \leq x < A$.

[0061] The inventors found that, when $1 - 2B/A \geq 0$, the polynomial function $c(x) = \frac{B}{A^2}x^2 + \left(1 - \frac{2B}{A}\right)x + B$ satisfies these conditions. They also found that, when $A = eB$, the exponential function $c(x) = B \exp\left(\frac{x}{eB}\right)$ satisfies these conditions. In the exponential case, however, $A = eB$ is required, so A and B cannot be specified independently. The polynomial function, which allows A and B to be chosen independently, is therefore superior. Hereinafter, the label interval $d(i,j)$ is redefined as $c(d(i,j))$, i.e., the value after the correction function is applied.
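A minimal sketch of the polynomial correction function follows. The function name and the values A = 2000 and B = 500 are illustrative assumptions, and leaving intervals of A or more unchanged is inferred from A being the maximum interval to be corrected.

```python
def polynomial_correction(x: float, A: float, B: float) -> float:
    """c(x) = (B / A^2) x^2 + (1 - 2B / A) x + B for 0 <= x < A."""
    if A <= 0 or B < 0:
        raise ValueError("A must be positive and B non-negative")
    if 1 - 2 * B / A < 0:
        raise ValueError("requires 1 - 2B/A >= 0, i.e. B <= A / 2")
    if x >= A:
        return x  # assumption: intervals at or above A are left uncorrected
    return (B / A**2) * x**2 + (1 - 2 * B / A) * x + B


A, B = 2000.0, 500.0  # illustrative values only
samples = [polynomial_correction(x, A, B) for x in (0, 500, 1000, 1500, 2000, 3000)]
print(samples)
```

Expanding the polynomial gives the equivalent form c(x) = x + B(1 - x/A)^2, which makes it easy to see that the correction shrinks smoothly to zero as x approaches A.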

[0062] Figure 4 is a flowchart showing an example of the alignment process executed in step S0304. While referring to Figure 4, the outline of the alignment process implemented in step S0304 will be described. This process includes the processing of the index search unit 122 and the processing of the label alignment unit 123. Hereinafter, the content of each step in the alignment process shown in Figure 4 will be described.

[0063] S0401: In the measurement data M, the label intervals are corrected using the function $c(x)$. Letting $F'_i$ be the position of the i-th label before correction, the position $F_i$ of the label after correction is given by the following formula. Note that the positions $G_i$ of the labels on the genome are assumed to have been corrected in the same manner beforehand in S0301, before the index construction unit 121 is executed.
$F_0 = F'_0$
$F_i = F'_0 + \sum_{i'=1}^{i} c\left(F'_{i'} - F'_{i'-1}\right) \quad (i \geq 1)$
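The position correction in this step amounts to accumulating corrected intervals from the first label onward. The sketch below assumes that reading; the function names and the parameter values A = 2000 and B = 500 are illustrative assumptions, and the small correction function is included only to keep the example self-contained.

```python
from typing import Callable, List


def corrected_positions(raw: List[float], c: Callable[[float], float]) -> List[float]:
    """Keep the first label fixed and rebuild the rest from corrected intervals."""
    if not raw:
        return []
    out = [raw[0]]
    for prev, cur in zip(raw, raw[1:]):
        out.append(out[-1] + c(cur - prev))
    return out


def example_c(x: float, A: float = 2000.0, B: float = 500.0) -> float:
    """Illustrative polynomial correction; intervals >= A are left unchanged."""
    if x >= A:
        return x
    return (B / A**2) * x**2 + (1 - 2 * B / A) * x + B


F = corrected_positions([0.0, 1000.0, 4000.0], example_c)
print(F)
```

Only the short first interval (1000) is lengthened; the long second interval (3000) is carried over unchanged, so every later label shifts by the same amount.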

[0064] S0402: Using the index search unit of Prior Application 2, regions of the reference genome similar to the measurement data M are searched for, and a list H that associates the labels of the two is created. Let $|H|$ be the length of list H. Each element of H is an integer pair $(p_x, q_x)$ $(x = 0, 1, \ldots, |H|-1)$, indicating that the k-tuple starting from the $p_x$-th label of the reference genome has been associated with the k-tuple starting from the $q_x$-th label of M. If H is short, it is highly likely that the measurement data M has not been correctly mapped to the reference genome location from which it originates, so for a parameter $H_{len}$, $|H| \geq H_{len}$ is a condition for judging the alignment successful. In addition, to prevent excessive mismatches between labels, for a parameter $H_{dist}$ that is the upper limit of the acceptable number of missing labels, $p_x < p_{x+1} \leq p_x + H_{dist}$ and $q_x < q_{x+1} \leq q_x + H_{dist}$ $(x = 0, 1, \ldots, |H|-2)$ is a condition for judging the alignment successful. Furthermore, let $P_j$ be the j-th integer value of the k-tuple P starting from label $p_x$ in the reference genome, and let $Q_j$ be the j-th integer value of the k-tuple Q starting from label $q_x$ in the measurement data M. To allow slight differences while preventing large deviations, for a parameter $H_{diff}$, $|P_j - Q_j| \leq H_{diff}$ $(j = 0, 1, \ldots, k-1)$ is a condition for judging the alignment successful.
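The success conditions of this step can be sketched as a single check. This reads the paragraph as imposing three conditions: a minimum length of H, bounded gaps between consecutive label numbers, and bounded element-wise differences between paired k-tuples. The function name and the dictionary-based lookup of k-tuples are assumptions for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def hits_acceptable(
    H: List[Tuple[int, int]],
    ref_tuples: Dict[int, Sequence[int]],
    mol_tuples: Dict[int, Sequence[int]],
    h_len: int,
    h_dist: int,
    h_diff: int,
) -> bool:
    """Return True if list H of (p_x, q_x) pairs passes all three conditions."""
    # Condition 1: H must be long enough.
    if len(H) < h_len:
        return False
    # Condition 2: label numbers increase, with gaps of at most h_dist.
    for (p, q), (p_next, q_next) in zip(H, H[1:]):
        if not (p < p_next <= p + h_dist):
            return False
        if not (q < q_next <= q + h_dist):
            return False
    # Condition 3: paired k-tuples differ by at most h_diff in every element.
    for p, q in H:
        if any(abs(a - b) > h_diff for a, b in zip(ref_tuples[p], mol_tuples[q])):
            return False
    return True


ref_tuples = {10: (20, 30, 50), 11: (30, 50, 20)}   # illustrative k-tuples (k = 3)
mol_tuples = {0: (21, 29, 50), 1: (31, 48, 21)}
H = [(10, 0), (11, 1)]
print(hits_acceptable(H, ref_tuples, mol_tuples, h_len=2, h_dist=2, h_diff=3))
```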

[0065] S0403: If no H satisfying the conditions for determining alignment success is found in S0402, the alignment is determined to be a failure.

[0066] S0404: The stretching ratio R_M of the measurement data M is calculated as R_M ← (G_{q_{|H|-1}} - G_{q_1}) / (F_{p_{|H|-1}} - F_{p_1}). H is a list of markers mapped with k-tuples; in the following steps, a list A that also maps the other markers is created. First, A is initialized as an empty list.

[0067] S0405: For the leftmost part of the measurement data M, the markers from the (p_0 - q_0 - W)-th to the p_0-th in the reference genome are associated with the markers from the 0-th to the q_0-th in the measurement data M. W is a parameter representing the maximum allowable excess or deficiency of markers. The alignment starts from the marker pair (p_0, q_0), so p_0 and q_0 are assigned to i1 and j1, which represent the starting point of the marker alignment. The stretching ratio of the measurement data M is set to R = R_M. The alignment type a is either open, for handling the edges of the measurement data, or close, for handling everything else. Because S0405 handles the left edge of the measurement data M, a is set to open.

[0068] S0406: From this point onward, processing proceeds sequentially from the 0th marker pair in H until just before processing the rightmost end of the measurement data M. The alignment type is set to close. In this case, the start and end positions of the alignment are determined by the k-tuple, so there is no need to consider the scaling factor. Therefore, R is set to 1.0 to ignore the scaling factor.

[0069] S0407: The markers from the p_x-th to the p_{x+1}-th in the reference genome are associated with the markers from the q_x-th to the q_{x+1}-th in the measurement data M.

[0070] S0408: Increment the value of x, the variable that specifies the marker in list H, by 1.

[0071] S0409: Repeat steps S0407 onwards until the end of list H is reached.

[0072] S0410: For the rightmost part of the measurement data M, the markers from the p_{|H|-1}-th to the (p_{|H|-1} + |F| - q_{|H|-1} + W)-th in the reference genome are associated with the markers from the q_{|H|-1}-th to the (|F| - 1)-th in the measurement data M. The stretching ratio of the measurement data M is set to R = R_M. The alignment type a is set to open because this is the right edge of the measurement data.
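Steps S0405 to S0410 split one molecule into a left open segment, a run of closed segments between consecutive anchor pairs in H, and a right open segment. The sketch below only plans those segments. The `Segment` type and all names are hypothetical, the left edge is assumed to run from (p_0, q_0) back to (p_0 - q_0 - W, 0), and the reference indices are not clamped to the valid range, which a real implementation would need to do.

```python
from typing import List, NamedTuple, Tuple


class Segment(NamedTuple):
    i1: int       # start label index on the reference genome
    i2: int       # end label index on the reference genome
    j1: int       # start label index in the measurement data
    j2: int       # end label index in the measurement data
    ratio: float  # stretching ratio R used for normalization
    kind: str     # "open" or "close"


def plan_segments(hits: List[Tuple[int, int]], n_meas: int, w: int, r_m: float) -> List[Segment]:
    """Plan the label alignments of S0405 (left), S0407 (middle) and S0410 (right)."""
    segments = []
    p0, q0 = hits[0]
    # Left edge: starts at the first anchor and extends leftwards, open-ended.
    segments.append(Segment(p0, p0 - q0 - w, q0, 0, r_m, "open"))
    # Between anchors: both ends are fixed by k-tuple hits, so R is set to 1.0.
    for (p, q), (p_next, q_next) in zip(hits, hits[1:]):
        segments.append(Segment(p, p_next, q, q_next, 1.0, "close"))
    # Right edge: starts at the last anchor and extends rightwards, open-ended.
    p_last, q_last = hits[-1]
    segments.append(Segment(p_last, p_last + n_meas - q_last + w, q_last, n_meas - 1, r_m, "open"))
    return segments
```

Note that the left segment has i1 > i2, which is the case handled separately when the traced-back pairs are added to list A.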

[0073] Figure 5 is a flowchart showing an example of the label alignment process performed in steps S0405, S0407, and S0410. It is an extension of a known method (Nagarajan N et al. Bioinformatics. 24(10):1229-35, 2008) that can handle both open and closed alignment and allows for the specification of the scaling factor. Figure 6 shows an example of normalization in S0502. Figure 7 shows an example of the score function in S0503. Referring to Figures 5 to 7, the processing content in each step of the label alignment unit 123 will be described.

[0074] S0501: Create a 3D array D with dimensions |i2-i1|, |j2-j1|, and 3. In array D, for example, the first dimension corresponds to the label position number in the reference genome, and the second dimension corresponds to the label position number in the measurement data. Elements where the two correspond well are assigned good scores (e.g., negative values), and elements where they do not correspond are assigned bad scores (e.g., positive values). The correspondence between label positions in the two sequences is found by following the elements that give good scores from the top left of the array (i.e., [0,0]) toward the bottom right. If only the best score had to be recorded, D could be a 2D array; however, to save the path from the top-left element of D that produced the best score for each element, two additional integer values must be recorded, so D is a 3D array. In this step, D is initialized, and in the following steps, each element of D is filled in toward the bottom right of the array.

[0075] S0502: The alignment range of the measurement data M is normalized so that its total length is R, and the position of each marker is denoted f_j (0 ≤ j ≤ |j2-j1|). The alignment range of the reference genome is normalized so that its total length is 1.0, and the position of each marker is denoted g_i (0 ≤ i ≤ |i2-i1|). An example of normalization is shown in Figure 6. Both variables i and j are then initialized to 1.

[0076] S0503: For i' and j' satisfying 0 ≤ i' < i and 0 ≤ j' < j, find the i' and j' that minimize S = D[i'-1, j'-1, 0] + s(g_{i+1} - g_{i'}, f_{j+1} - f_{j'}). Here, D[0,0,0] = 0 and D[i,0,0] = D[0,j,0] = ∞ (i, j > 0). s(x,y) is the score function and is defined by Equation 1 in Figure 7. s(x,y) takes a large negative value when the ratio of the two marker intervals is close to 1, and a large positive value when it deviates from 1. The boundary between positive and negative is given by the parameter r0. To adjust the balance between the score when the ratio is close to 1 and the score when it deviates from 1, a negative score may be multiplied by the coefficient J given as a parameter. An example of the score function is shown in Figure 7. Intuitively, when the marker g_i of the reference genome and the marker f_j of the measurement data M are associated, this step extends the range of marker correspondences by selecting, from all markers to their left, the most appropriate ones as the immediately preceding correspondence.
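Figure 7 is not reproduced in this text, but claim 9 states the score function in words: with rd the larger interval divided by the smaller, the score is J × (rd - r0) when rd < r0 and (rd - r0) otherwise. A minimal sketch under that reading follows; the function and parameter names are hypothetical, and rejecting non-positive intervals is an assumption of this sketch.

```python
def score(x: float, y: float, r0: float, j_coef: float) -> float:
    """Score two label intervals x and y; negative means similar, positive means dissimilar."""
    if x <= 0 or y <= 0:
        raise ValueError("label intervals must be positive")
    rd = max(x, y) / min(x, y)  # ratio of larger to smaller, always >= 1
    diff = rd - r0
    # Negative (good) scores are weighted by J; positive (bad) scores are not.
    return j_coef * diff if rd < r0 else diff
```

Because rd is at least 1, the best possible score for a single interval pair is J × (1 - r0), reached when the two intervals are equal.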

[0077] S0504: Using the i' and j' obtained in S0503, the following values are stored in D: D[i, j, 0] ← S, D[i, j, 1] ← i', D[i, j, 2] ← j'. D[i, j, 0] is the minimum of the sum of the scores for each marker interval when the markers from the starting point up to marker i of the reference genome and marker j of the measurement data are associated, and corresponds to the optimal alignment up to marker i of the reference genome and marker j of the measurement data. Furthermore, by referring to the values stored in D[i, j, 1] and D[i, j, 2], the optimal alignment can be traced on D.

[0078] S0505: For all possible values of j, execute the processes from S0503 to S0505.

[0079] S0506: For all possible values of i, execute the processes from S0503 to S0506.

[0080] S0507: Determine whether the alignment type is open or close in order to determine the end position of the marker alignment to be output.

[0081] S0508: When the alignment type is open, D is scanned to find the i and j that give the minimum value of D[i, j, 0], and that position is taken as the endpoint. Open alignment is the alignment process used when the endpoint of each sequence to be compared is not determined, so the endpoint must be determined from the result of the search. Therefore, the array element with the best score is regarded as the endpoint, and the marker positions on each sequence that gave the best score are identified by tracing back through array D from that provisional endpoint toward the starting point.

[0082] S0509: When the alignment type is close, the lower right end of D is used as the endpoint. Closed alignment is the alignment process used when both the start point and the endpoint of each sequence to be compared are determined. Therefore, the end of D corresponding to that endpoint can be used as the endpoint. By tracing back through array D from the endpoint of the search to the start point, the marker positions on each sequence where the best score is obtained are identified.

[0083] S0510: Initialize the output marker alignment A as a list [(Gi,Fj)] that holds only the pair of marker numbers at the end point.

[0084] S0511: Repeat the processing from S0511 to S0513 until reaching the start point of the alignment.

[0085] S0512: Read out from D the immediately preceding marker position that was recorded as optimal.

[0086] S0513: Determine whether the start point on the reference genome side is on the left by checking whether i1 < i2.

[0087] S0514: Normally, i1 < i2. In this case, add the marker position with the read-out numbers (i,j) to the beginning of list A.

[0088] S0515: When i1 > i2, add the marker position with the read-out numbers (i,j) to the end of list A.
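Steps S0501 to S0515 together form a dynamic-programming alignment with traceback. The sketch below is one possible reading, not the applicant's implementation: it uses conventional indexing in which best[i][j] is the minimum total score of an alignment that starts at (0, 0) and ends by pairing reference label i with measurement label j, it handles only the left-to-right case (i1 < i2), and it assumes strictly increasing positions with at least two labels on each side. Two separate tables stand in for the three layers of D. All names and default parameter values are hypothetical.

```python
import math
from typing import List, Optional, Tuple


def score(x: float, y: float, r0: float, j_coef: float) -> float:
    rd = max(x, y) / min(x, y)
    return j_coef * (rd - r0) if rd < r0 else rd - r0


def normalize(positions: List[float], total: float) -> List[float]:
    """Shift to start at 0 and scale so the covered range has length `total` (S0502)."""
    span = positions[-1] - positions[0]
    return [(p - positions[0]) * total / span for p in positions]


def align_labels(ref: List[float], meas: List[float], ratio: float = 1.0,
                 kind: str = "close", r0: float = 1.5,
                 j_coef: float = 1.0) -> List[Tuple[int, int]]:
    g = normalize(ref, 1.0)     # reference range has total length 1.0
    f = normalize(meas, ratio)  # measurement range has total length R
    n, m = len(g), len(f)
    best = [[math.inf] * m for _ in range(n)]  # plays the role of D[i, j, 0]
    prev: List[List[Optional[Tuple[int, int]]]] = [[None] * m for _ in range(n)]  # D[i, j, 1..2]
    best[0][0] = 0.0
    for i in range(1, n):
        for j in range(1, m):
            # Choose the best preceding pair among all labels to the left (S0503).
            for ip in range(i):
                for jp in range(j):
                    if best[ip][jp] == math.inf:
                        continue
                    s = best[ip][jp] + score(g[i] - g[ip], f[j] - f[jp], r0, j_coef)
                    if s < best[i][j]:
                        best[i][j], prev[i][j] = s, (ip, jp)  # S0504
    if kind == "close":
        end = (n - 1, m - 1)  # S0509: both endpoints are fixed
    else:
        # S0508: the cell with the minimum score becomes the endpoint
        end = min(((i, j) for i in range(n) for j in range(m)),
                  key=lambda ij: best[ij[0]][ij[1]])
    if best[end[0]][end[1]] == math.inf:
        return []
    path = []
    node: Optional[Tuple[int, int]] = end
    while node is not None:  # S0511 to S0514: trace back to the start point
        path.append(node)
        node = prev[node[0]][node[1]]
    path.reverse()
    return path
```

For example, aligning reference positions [0, 100, 200, 300] with measurement positions [0, 100, 300] in closed mode pairs reference labels 0, 1 and 3 with measurement labels 0, 1 and 2, leaving reference label 2 unmatched. The four nested loops make this sketch quartic in the number of labels, so a practical implementation would likely limit how far back i' and j' may reach.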

[0089] Figure 8 is a flowchart showing an example of the processing of the alignment integration unit 124 executed in step S0312. Figure 9 is an explanatory diagram showing an example when this processing is executed. The processing of the alignment integration unit 124 will be described while referring to these.

[0090] S0801: Execute Step a in Figures 8 and 9. In this step, obtain the alignment of each measurement data from the set S' of alignments overlapping with the processing target interval constructed in S3010.

[0091] S0802: Perform Step b in Figures 8 and 9. In this step, refer to each alignment A included in S' and, for each reference genome label associated with the measurement data in A, calculate the number of associated measurement data. Finally, when the resulting number of measurement data is less than or equal to a certain value I_TN, the label is considered not to be present in the subject's genome. The value of I_TN can be given as a fixed parameter, but it is preferable to construct a probabilistic model based on the distribution of the measurement data and its labels, and to set I_TN as the upper limit of the number of false positives that are acceptable for labels not present in the subject's genome.
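The counting in Step b can be sketched as below. Each alignment is assumed to be a list of (reference label index, measurement label index) pairs from one molecule, and `candidate_labels` is assumed to be the set of reference labels inside the interval being processed; both are illustrative choices, as are the names. The threshold follows the paragraph above (I_TN or fewer), whereas claim 10 is worded as "less than a first integer value".

```python
from collections import Counter
from typing import Iterable, List, Tuple

Alignment = List[Tuple[int, int]]  # (reference label index, measurement label index)


def unsupported_ref_labels(alignments: Iterable[Alignment],
                           candidate_labels: Iterable[int],
                           i_tn: int) -> List[int]:
    """Return reference labels supported by I_TN or fewer molecules."""
    support = Counter()
    for alignment in alignments:
        # Count each molecule at most once per reference label.
        for ref_idx in {ref for ref, _ in alignment}:
            support[ref_idx] += 1
    return [idx for idx in candidate_labels if support[idx] <= i_tn]
```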

[0092] S0803: Perform Step c in Figures 8 and 9. In this step, referring to each alignment A, the estimated position on the reference genome is calculated for each label that could not be mapped to the reference genome, based on the positions of the preceding and succeeding labels that were mapped to the reference genome. Labels for which the distance between the estimated position on the reference genome and the position on the measurement data is less than or equal to the parameter I_denovo are considered to be the same label. For the labels of the reference genome associated with the measurement data in A, the number of associated measurement data is calculated. Finally, when the number of measurement data in which the label appears is greater than or equal to a certain value I_TP, the label is considered to be actually present in the subject's genome. The value of I_TP can be given as a fixed parameter, but it is preferable to construct a probabilistic model based on the distribution of the measurement data and its labels, and to determine that the label is present in the subject's genome only when the possibility of false detection during label detection is sufficiently low.
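Step c can be illustrated with the sketch below. The text does not say how the position is estimated from the flanking labels or how labels are grouped, so two assumptions are made here: the estimate is a linear interpolation between the nearest mapped labels on either side, and estimates are grouped greedily when each lies within I_denovo of the previous one in sorted order. The names and the choice of the cluster mean as the new label position are hypothetical.

```python
from typing import List, Tuple


def estimate_ref_position(meas_pos: float,
                          left: Tuple[float, float],
                          right: Tuple[float, float]) -> float:
    """Interpolate a reference position for an unmapped label.

    left and right are (measurement position, reference position) pairs of the
    nearest mapped labels before and after the unmapped label.
    """
    (f_left, g_left), (f_right, g_right) = left, right
    return g_left + (meas_pos - f_left) * (g_right - g_left) / (f_right - f_left)


def supported_novel_labels(estimates: List[Tuple[str, float]],
                           i_denovo: float, i_tp: int) -> List[float]:
    """Group (molecule id, estimated position) pairs and keep well-supported groups."""
    clusters: List[List[Tuple[str, float]]] = []
    for item in sorted(estimates, key=lambda e: e[1]):
        if clusters and item[1] - clusters[-1][-1][1] <= i_denovo:
            clusters[-1].append(item)
        else:
            clusters.append([item])
    result = []
    for cluster in clusters:
        molecules = {mol for mol, _ in cluster}
        if len(molecules) >= i_tp:  # seen in I_TP or more molecules
            result.append(sum(pos for _, pos in cluster) / len(cluster))
    return result
```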

[0093] S0804: Perform Step d in Figures 8 and 9. In this step, based on the results of Steps b and c, identify regions in the reference genome that are not present in the subject and regions that are present only in the subject. The former are removed from the reference genome, and the latter are added to the reference genome to construct a new reference genome for the subject.

[0094] S0805: Execute Step e in Figures 8 and 9. In this step, the measurement data that formed the basis of each alignment included in set S' is realigned to the subject genome constructed in Step d. In this alignment process, it is preferable to directly align each measurement data to the reference genome using the label alignment unit 123, without using the index search unit 122. The alignment type is set to close, and it is executed to fill in the gaps between the furthest pairs of labels that were associated before the reconstruction of the reference genome. Then, it is determined whether the alignment has changed or not. If the alignment has changed, the process from S0802 onwards is repeated.

[0095] S0806: If the alignment has not changed, the genome for the subject is considered to have converged and the process is terminated.

[0096] <Embodiment 1: Summary> The genome labeling position alignment device according to Embodiment 1, when given two numerical values A and B as parameters, corrects the labeling interval with a value calculated using a polynomial function so that any labeling interval less than or equal to A is treated as being at least B. This makes it possible to suppress the influence of measurement errors when performing alignment using the ratio of labeling intervals, even when the labeling interval is short.

[0097] The genome labeling position alignment device according to Embodiment 1 performs labeling alignment using labeling intervals normalized using the stretching ratio of the measurement data. At this time, a scoring function is used that assigns different signs depending on whether the labeling intervals are judged to be similar or dissimilar. Then, the labeling, i.e., alignment is performed so that the sum of the scores of all associated labeling intervals is minimized. This makes it possible to appropriately compare the measurement data and the reference genome, regardless of whether open alignment or closed alignment is used for comparison.

[0098] The genome labeling position alignment device according to Embodiment 1 identifies labels common to the reference genome, labels present in the reference genome but only in some molecules in the measurement data, labels unique to the measurement data, and labels unique to the reference genome, based on multiple measurement data derived from the subject. This effectively eliminates measurement errors and alignment errors that cannot be eliminated by alignment of a single measurement data, and allows for the estimation of differences between the reference genome and the subject's genome.

[0099] <Embodiment 2> Figure 10 shows an example of a user interface provided by a genome labeling position alignment device according to Embodiment 2 of the present invention. The genome labeling position alignment device described in Embodiment 1 can provide a user interface 1001 like the one shown in Figure 10, for example, via an input / output device 102.

[0100] The user interface 1001 includes, for example, an input data setting area 1002 and an alignment result display area 1003. The input data setting area 1002 specifies the data file to be input to the computer 100. The alignment result display area 1003 outputs the correspondence between the label positions of the reference genome and the label positions of the measurement data, obtained as a result of the computer 100 performing alignment using the label alignment unit 123, etc., in the form of a screen display. The alignment displayed here also shows the difference between the label positions of the new reference genome and the label positions of the measurement data, which is the calculation result of the alignment integration unit 124 described in Embodiment 1. In the example in Figure 10, the parameters used in the present invention can be set arbitrarily. By updating the contents of the alignment result display area 1003 according to the input parameters, the effect of parameter modification can be easily confirmed.

[0101] <Regarding variations of the present invention> The present invention is not limited to the embodiments described above, and various modifications are included. For example, the embodiments described above are described in detail to make the present invention easier to understand, and are not necessarily limited to those having all the configurations described. Furthermore, it is possible to replace parts of the configuration of one embodiment with the configuration of another embodiment, and it is also possible to add configurations from other embodiments to the configuration of one embodiment. In addition, it is possible to add, delete, or replace parts of the configuration of each embodiment with other configurations.

[0102] Furthermore, each of the above configurations, functions, processing units, and processing means may be implemented in hardware, either partially or entirely, by designing them as integrated circuits, for example. Alternatively, each of the above configurations and functions may be implemented in software by having the processor interpret and execute programs that implement each function. Information such as programs, tables, and files that implement each function can be stored in memory, a recording device such as a hard disk or SSD (Solid State Drive), or a recording medium such as an IC card, SD card, or DVD.

[0103] Furthermore, the control lines and information lines shown are those deemed necessary for explanatory purposes, and not all control lines and information lines are necessarily shown in the actual product. In reality, it is safe to assume that almost all components are interconnected.

[Explanation of Symbols]

[0104] 100 Computer, 102 Input / Output Device, 110 CPU, 111 Memory, 112 Auxiliary Storage Device, 113 Interface, 121 Index Construction Unit, 122 Index Search Unit, 123 Label Alignment Unit, 130 Reference Genome Label Position, 140 Measurement Data

Claims

1. An information processing device for identifying the position of a subsequence of a reference nucleic acid sequence corresponding to a target nucleic acid sequence, wherein, when the position of a subsequence represented by a predetermined sequence of 5 to 10 bases is called the label position and the distance between adjacent subsequences is called the label interval, the device comprises: a storage device that stores data describing the label positions of the target nucleic acid sequence; and a processor that uses the data to identify the position of the subsequence, wherein the processor identifies the label intervals of the target nucleic acid sequence and of the reference nucleic acid sequence; the processor identifies the position of the subsequence of the target nucleic acid sequence in the reference nucleic acid sequence by comparing the intervals in the reference nucleic acid sequence with the ratio of two or more such intervals; when the interval in the reference nucleic acid sequence or the interval in the target nucleic acid sequence is less than or equal to a lower limit, the processor corrects the interval in the reference nucleic acid sequence to be greater than the lower limit and then compares it with the ratio; and the processor uses a correction function composed of a polynomial that takes the interval as an argument when correcting the interval in the reference nucleic acid sequence.

2. The information processing device according to claim 1, wherein the correction function is configured such that the upper limit of the interval corrected by the correction function is the same before and after the correction by the correction function.

3. The information processing device according to claim 1, wherein the correction function is configured such that the slope of the correction function is 1 at the upper limit of the interval corrected by the correction function.

4. The information processing device according to claim 1, wherein the correction function is configured such that the slope of the correction function is 0 or greater when the interval to be corrected by the correction function is less than the upper limit of the interval to be corrected by the correction function.

5. The information processing device according to claim 1, wherein, when x is the interval, c(x) is the corrected interval, A is the upper limit of the label interval to be corrected, and B is the lower limit of the corrected label interval, the correction function is expressed as c(x) = (B / A^2) x^2 + (1 - 2B / A) x + B.

6. An information processing device for identifying the position of a subsequence of a reference nucleic acid sequence corresponding to a target nucleic acid sequence, wherein, when the position of a subsequence represented by a predetermined sequence of 5 to 10 bases is called the label position and the distance between adjacent subsequences is called the label interval, the device comprises: a storage device that stores data describing the label positions of the target nucleic acid sequence; and a processor that uses the data to associate the label positions of the target nucleic acid sequence with the label positions of the reference nucleic acid sequence, wherein the processor extracts a portion of the labels of the target nucleic acid sequence and a portion of the labels of the reference nucleic acid sequence and associates them while maintaining their order of appearance in each sequence; the processor calculates the intervals between the extracted labels in the target nucleic acid sequence and in the reference nucleic acid sequence; the processor normalizes the label intervals of the target nucleic acid sequence by a first coefficient; the processor normalizes the label intervals of the reference nucleic acid sequence by a second coefficient; the processor calculates a score using a score function that increases monotonically as the discrepancy between the label interval of the target nucleic acid sequence normalized by the first coefficient and the label interval of the reference nucleic acid sequence normalized by the second coefficient increases; and the processor associates the label positions of the target nucleic acid sequence with the label positions of the reference nucleic acid sequence by searching for a correspondence between them such that the sum of the scores of all associated label intervals is smallest.

7. The information processing device according to claim 6, wherein, when the endpoints of the extracted target nucleic acid sequence and reference nucleic acid sequence are not determined, the processor regards the label position at which the correspondence giving the smallest score is obtained as the endpoint, and, when the endpoint of each sequence is determined, the processor terminates the search upon reaching the endpoints of the extracted target nucleic acid sequence and reference nucleic acid sequence.

8. The information processing device according to claim 6, wherein the score function is configured to take values with different signs depending on whether the discrepancy is less than a reference value or is equal to or greater than the reference value.

9. The information processing device according to claim 8, wherein the score function takes as arguments the two label intervals to be compared, and, where rd is the value obtained by dividing the larger of the two arguments by the smaller, r0 is the reference value, and J is a coefficient, the score function is J × (rd - r0) when rd < r0 and (rd - r0) when rd ≥ r0.

10. An information processing device for inferring the structure of a subject's genome from one or more measurement data obtained from the subject, the device comprising a processor that infers the structure, wherein the measurement data includes coordinate values representing the position of a label, which is represented by a specific sequence of 5 to 10 base pairs, in a DNA fragment obtained by fragmenting the subject's genomic DNA; the processor associates the positions of the labels in the measurement data with the positions of the labels in the reference genome; when the number of measurement data having a label associated with the position of a label in the reference genome is less than a first integer value, the processor deletes the associated label from the reference genome; the processor estimates the position on the reference genome of the labels in the measurement data that could not be associated with the position of a label in the reference genome and, for labels that could not be so associated but are located at the same position in a number of measurement data equal to or greater than a second integer value, adds new label positions to the reference genome at the estimated positions; and the processor associates the label positions of the new reference genome obtained by the deletion or addition with the label positions of the measurement data.

11. The information processing device according to claim 10, wherein, when associating the label positions of the new reference genome obtained by the deletion or addition with the label positions of the measurement data yields a result different from the association before the deletion or addition, the processor repeats the deletion and addition and associates the label positions of the resulting new reference genome with the label positions of the measurement data, and, when no different result is obtained, the processor outputs the new reference genome obtained by the deletion or addition as the processing result.

12. The information processing device according to claim 1 or 7, further comprising an input / output device that provides a user interface, wherein the user interface presents the result of the processor associating the label positions of the reference nucleic acid sequence with the label positions of the target nucleic acid sequence.

13. The information processing device according to claim 11, further comprising an input / output device that provides a user interface, wherein the user interface presents the difference between the label positions of the new reference genome obtained by the deletion or addition and the label positions of the measurement data.