Method and device for the earliest determination of a phenotypic trait from a stream of genomic sequences
The method addresses sequencing time and error rate issues by using iterative k-mer detection and stability criteria to accelerate phenotypic prediction, optimizing sequencing depth and computational efficiency.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- BIOMERIEUX SA
- Filing Date
- 2025-12-10
- Publication Date
- 2026-06-18
AI Technical Summary
Current nucleic acid sequencers, such as SBS and nanopore sequencers, are limited by sequencing time and error rates, which prevents them from surpassing traditional phenotypic methods. They require long processing times and significant computational resources, and because nanopore sequencing errors are frequent, high sequencing depths are needed for reliable results.
A method and system for phenotypic characterization based on iterative processing of nucleic acid sequences delivered in real time or in batches, using k-mer detection and stability criteria to determine a stopping point, which reduces the need for extensive sequencing and computational resources.
Accelerates phenotypic prediction by identifying stable results early, reducing sequencing time and data volume, saving computational resources and reagents, and ensuring robust results through noise smoothing and stability analysis.
Description
[0001]
[0002] FIELD OF INVENTION
[0003] The invention relates to the field of microbial characterization, in particular the determination of phenotypic traits of microorganisms, based on the sequencing of their genomes carried out using a sequencer delivering sequences in real time or in batches frequently or regularly.
[0004] STATE OF THE ART
[0005] The rapid decrease in costs and the significant improvement in the performance of second- and third-generation sequencers, such as SBS sequencers from Illumina Inc. or nanopore sequencers from Oxford Nanopore Technologies, have fostered the rapid development of genomic-to-phenotypic science. This discipline aims to predict the phenotypic characteristics of organisms based on their genomes. In microbiology, for example, it is now possible to predict whether a bacterium is resistant to an antibiotic through genomic analysis. This approach aims to complement, or even replace, traditional phenotypic methods based on microbial culture, which deliver results too slowly for serious pathologies such as sepsis.
[0006] Current sequencers, however, have a significant limitation that prevents them from surpassing traditional methods. Indeed, the sequencing time required to obtain reliable results with sequencing errors compatible with genomic-phenotypic analysis is still too long, at best between one and two days. Firstly, SBS sequencers or equivalents deliver their results all at once, at the very end of the sequencing process. They do, however, offer the advantage of an extremely low sequencing error rate, which becomes virtually zero once the reads (i.e., the digital nucleic acid sequences produced by a sequencer) they generate are assembled. Secondly, unlike short-fragment sequencers, nanopore sequencers deliver their reads continuously, either in real time or in batches at regular intervals. This allows for immediate analysis, and the user can, in theory, stop sequencing as soon as the desired confidence level is reached, thus optimizing turnaround time. However, nanopore sequencing errors are more frequent (3%–5% versus 0.1% for SBS sequencers), so sequencing depths of at least 50x are currently required to achieve the necessary confidence level. Ultimately, the time saved by this continuous analysis is minimal.
DESCRIPTION OF THE INVENTION
[0007] The aim of the present invention is to provide a method and system for phenotypic characterization of an organism based on its nucleic material (DNA, RNA, protein, etc.), material digitized by a sequencer delivering reads continuously, either in real time or in batches at regular and periodic intervals, thus reducing:
[0008] - the time required to obtain a result as reliable as that produced by deep and / or long-duration sequencing;
[0009] - the quantity of reads required to obtain such a result, which allows (a) saving computing resources (amount of memory, number of processors, etc.) and / or computation time thanks to a reduced volume of data to be processed, (b) saving reagents used for sequencing (such as reagents for library preparation) and / or limiting wear and tear on the sequencing platform and its consumables (for example, sequencing cells, or "flow cells") by sequencing only what is necessary to achieve the result.
[0010] To this end, the invention relates to a method for predicting a phenotypic trait of an organism based on its genome, comprising:
[0011] • a sequencing of said genome so as to produce digital nucleic acid sequences representative of said genome, hereinafter "reads", said sequencing being configured to deliver the reads a) in real time and / or b) in batches at regular or periodic intervals, said reads being stored in a computer memory;
[0012] • an iterative process implemented by computer, comprising for each iteration:
[0013] - an update of a set of reads based on reads delivered in real time or in batches;
[0014] - a calculation of the sequencing depth of said genome as a function of said set of reads;
[0015] - a detection of predetermined length digital reference nucleic acid sequences, hereinafter "k-mers", in said set of reads, a k-mer being detected when its number of occurrences in said set is greater than a predetermined fraction of the calculated sequencing depth;
[0016] - a prediction of phenotypic character based on the k-mers detected in said set of reads; and
[0017] - a calculation of a stopping criterion and a storage of the predicted phenotypic trait as the final phenotypic trait when said stopping criterion is reached. According to the invention, the calculation of the stopping criterion includes the calculation of the variation in the calculated sequencing depth between two iterations, the variation in the number of k-mers detected between two iterations, and the variation in the predicted phenotypic trait between two iterations, and the stopping criterion is reached:
[0018] A. if the variation in the number of detected k-mers is stable during a first predetermined number of iterations; and
[0019] B. following the stability of the variation in the number of k-mers detected during the first number of iterations:
[0020] B.1. if the variation in the phenotypic trait is stable during a second number of iterations; and
[0021] B.2. if the variation in sequencing depth calculated during the second predetermined number of iterations is greater than a predetermined threshold.
[0022] In other words, to accelerate the delivery of the phenotypic prediction, the invention makes it possible a) to detect as early as possible when the latter is likely to stabilize on a result consistent with the content of the genome, b) to identify this stability in a rapid and robust manner.
[0023] At the beginning of sequencing, both the number of detected k-mers and the prediction results fluctuate considerably due to the shallow sequencing depth. Consequently, if a stopping criterion is based solely on phenotypic prediction, a wide iteration window is necessary for stability analysis to prevent transient stabilization from being interpreted as an expected result. However, the wider the window, the longer the delay in obtaining already stabilized results. The invention therefore proposes an initial stabilization of the number of detected k-mers to identify the actual or imminent end of this unstable period, thus enabling the stability analysis of the prediction to be performed over a smaller window. Furthermore, due to potentially high sequencing noise (up to 5% on some platforms), the number of k-mers can vary at greater sequencing depths, even after the initial phase. To prevent this noise from affecting the stability of the phenotypic characterization, the stability window is defined to smooth out this noise without altering the final result. To this end, a minimum sequencing depth is imposed between two iterations of the window or across the entire window. This ensures a robust final result that can be confidently reported.
[0024] For example, by choosing the first number of iterations equal to 1, the second equal to 4, and the sequencing depth increment equal to 3x, the prediction results are delivered at around 20x. This represents a saving of about 30x in sequencing depth compared to a sequencing stop typically performed around 50x. Complete sequencing can, however, be delayed depending on the results of other analyses performed in parallel, particularly by multiplexing. In this type of analysis, sequencing is stopped when all multiplexed samples have reached predictive stability.
[0025] According to one embodiment, the predetermined fraction of the sequencing depth is between 0.08 and 0.18. According to one embodiment, the first number of iterations is equal to 1. According to one embodiment, the second number of iterations is greater than or equal to 4. According to one embodiment, the threshold is greater than or equal to 3x.
[0026] According to one embodiment, the variation in the number of detected k-mers is stable when it is less in absolute value than a first stability threshold, and the variation in the phenotypic character is stable when it is less than a second predetermined stability threshold, and the first and second thresholds are less than or equal to 0.1.
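By way of non-limiting illustration, the stopping criterion (conditions A, B.1 and B.2) may be sketched as follows in Python. The names `Iteration`, `relative_variation` and `stop_index` are hypothetical, as is the choice of measuring the depth increase across the whole window (the description also allows measuring it between two iterations of the window). The default values are those of the embodiments given above.

```python
from dataclasses import dataclass


@dataclass
class Iteration:
    depth: float           # calculated sequencing depth, e.g. 12.5 for 12.5x
    n_kmers: int           # number of detected k-mers
    pred_variation: float  # variation of the prediction vs. the previous iteration


def relative_variation(previous: float, current: float) -> float:
    """Relative change between two iterations (assumed definition)."""
    if previous == 0:
        return 0.0 if current == 0 else float("inf")
    return (current - previous) / previous


def stop_index(history, n_kmer=1, n_pred=4, s_kmer=0.1, s_pred=0.1,
               min_depth_gain=3.0):
    """Return the index of the iteration where the criterion is met, else None."""
    # Condition A: the k-mer count is stable during n_kmer successive iterations.
    stable_run, kmer_stable_at = 0, None
    for i in range(1, len(history)):
        var = relative_variation(history[i - 1].n_kmers, history[i].n_kmers)
        stable_run = stable_run + 1 if abs(var) < s_kmer else 0
        if stable_run >= n_kmer:
            kmer_stable_at = i
            break
    if kmer_stable_at is None:
        return None
    # Condition B: after A, look for a window of n_pred iterations in which
    # the prediction is stable (B.1) and the depth grew enough (B.2).
    for end in range(kmer_stable_at + n_pred, len(history)):
        window = history[end - n_pred + 1:end + 1]
        pred_stable = all(abs(it.pred_variation) < s_pred for it in window)
        depth_gain = history[end].depth - history[end - n_pred].depth
        if pred_stable and depth_gain > min_depth_gain:
            return end
    return None
```

In this sketch the prediction window can only start once the k-mer count has stabilized, which is what allows the window to remain narrow.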
[0027] According to one embodiment, the sequencing is configured to deliver reads with an average length greater than 500kb.
[0028] According to one embodiment, the reads are not assembled.
[0029] According to one embodiment, prior to sequencing, the process involves the production of a sample from a microorganism isolate, with sequencing being carried out on said sample.
[0030] In one embodiment, the organism is a microorganism, and the phenotypic characteristic is a resistance or sensitivity profile of the microorganism to at least one antimicrobial. In particular, the microorganism is taken from a patient or animal suspected of having a microbial infection, and the method includes determining an antimicrobial therapy based on said profile and administering said therapy to the patient or animal.
[0031] According to one embodiment, when the stopping criterion is reached, the sequencing is stopped and / or the final phenotypic character is displayed on a screen.
[0032] The invention also relates to a system for predicting a phenotypic trait of an organism based on its genome, comprising:
[0033] • a nucleic acid sequencer, said sequencer being configured to sequence said genome so as to produce digital nucleic acid sequences representative of said genome, hereinafter "reads", said sequencer being configured to deliver the reads a) in real time and / or b) in batches at regular or periodic intervals, said reads being stored in a computer memory;
[0034] • an information processing unit connected to the sequencer to receive said reads and comprising one or more microprocessors and a computer memory storing computer instructions which, when executed by the microprocessor(s), implement iterative processing, comprising for each iteration:
[0035] - an update of a set of reads based on reads delivered in real time or in batches;
[0036] - a calculation of the sequencing depth of said genome as a function of said set of reads;
[0037] - a detection of predetermined length digital reference nucleic acid sequences, hereinafter "k-mers", in said set of reads, a k-mer being detected when its number of occurrences in said set is greater than a predetermined fraction of the calculated sequencing depth;
[0038] - a prediction of phenotypic character based on the k-mers detected in said set of reads; and
[0039] - a calculation of a stopping criterion and a storage of the predicted phenotypic trait as the final phenotypic trait when said stopping criterion is reached.
[0040] According to the invention, the calculation of the stopping criterion includes calculating the variation in the calculated sequencing depth between two iterations, the variation in the number of k-mers detected between two iterations, and the variation in the predicted phenotypic trait between two iterations, and the stopping criterion is reached:
[0041] A. if the variation in the number of detected k-mers is stable during a first predetermined number of iterations; and
[0042] B. following the stability of the variation in the number of k-mers detected during the first number of iterations:
[0043] B.1. if the variation in the phenotypic trait is stable during a second number of iterations; and
[0044] B.2. if the variation in sequencing depth calculated during the second predetermined number of iterations is greater than a predetermined threshold.
[0045] According to one embodiment, the computer memory stores computer instructions which, when executed by the microprocessor(s), implement a process of the type described above. The invention also relates to a computer program product comprising a computer memory storing computer instructions which, when executed by one or more microprocessors, implement the iterative processing described above.
[0046] The invention also relates to a method for producing a computer memory for predicting a phenotypic trait of an organism based on its genome, comprising:
[0047] • the iterative updating of a set of digital nucleic acid sequences representative of said genome, hereinafter "reads", based on new reads representative of said genome;
[0048] • computer-implemented exploration of a space defined by at least a first number of iterations, a second number of iterations, and a threshold, hereinafter referred to as a "triplet," the exploration comprising, for each value of a set of triplets of said space, (a) a calculation of a sequencing depth of said genome as a function of said set of reads, a detection of reference digital nucleic acid sequences of predetermined length, hereinafter referred to as "k-mers," in said set of reads, a k-mer being detected when its number of occurrences in said set is greater than a predetermined fraction of the calculated sequencing depth, (b) a prediction of the phenotypic character as a function of the k-mers detected in said set of reads, and (c) a calculation of a stopping criterion, the calculation of the stopping criterion comprising the calculation of the variation in the calculated sequencing depth between two iterations, of the variation in the number of k-mers detected between two iterations and of the variation in the predicted phenotypic trait between two iterations;
[0049] • the storage in computer memory (a) of a triplet as first number of iterations, second number of iterations, and threshold and (b) of computer instructions which, when executed by one or more microprocessors, implement the iterative processing according to any one of claims 1 to 12, when said triplet fulfills the following conditions I and II:
[0050] I. A. if the variation in the number of detected k-mers is stable during the first predetermined number of iterations; and
[0051] B. following the stability of the variation in the number of k-mers detected during the first number of iterations:
[0052] B.1. if the variation in the phenotypic trait is stable during the second number of iterations; and
[0053] B.2. if the variation in sequencing depth calculated during the second predetermined number of iterations is greater than the threshold;
II. when condition I is met:
[0054] C.1. if the detected k-mers are also detected in a reference genome of the organism;
[0055] C.2. and if the detected phenotypic trait is identical to a reference phenotypic trait for the organism.
[0056] BRIEF DESCRIPTION OF THE FIGURES
[0057] The invention will be better understood upon reading the following description, given solely by way of example, and in conjunction with the accompanying drawings, in which identical reference numerals designate identical or analogous elements, and in which:
- Figure 1 illustrates a computer and hardware architecture for implementing a microbiological analysis according to the invention;
- Figure 2 illustrates the invention within a complete microbiological analysis workflow;
- Figure 3 illustrates a bi-class prediction according to the invention, corresponding to the resistance or sensitivity of a given bacterial strain to a given antibiotic, with a zoom on the initial period of instability and a zoom on the stability analysis window of the prediction;
- Figure 4 illustrates a method for determining the value of stability parameters according to the invention;
- Figure 5 illustrates the ROC curve for a given strain and for different values of X_kmer;
- Figures 6A and 6B illustrate AUC values for optimal values of the fraction X_kmer for a set of 55 bacterial strains as a function of the average sequencing depth, with 3 iterations represented;
- Figure 7 illustrates a distribution of the fraction X_kmer for the 55 bacterial strains according to intervals of average sequencing depths;
- Figure 8 illustrates in the form of a box plot the average sequencing depth at which the phenotypic prediction according to the invention is delivered;
- Figure 9 illustrates the time saving between a 50x metagenomic prediction and a metagenomic prediction according to the invention;
- Figure 10 illustrates this time saving for the 215 strains tested for their resistance / sensitivity to an antibiotic;
- Figure 11 is a confusion matrix between a metagenomic prediction according to the invention and a 50x metagenomic prediction;
- Figure 12 illustrates the average sequencing depth at which the metagenomic prediction according to the invention is detected, stabilized and thus delivered; and
- Figure 13 illustrates a variant of the software and hardware architecture for the implementation according to the invention.
[0058] DETAILED DESCRIPTION OF THE INVENTION
[0059] METHOD OF IMPLEMENTATION
[0060] Figure 1 illustrates a computer and hardware architecture for implementing a microbiological analysis according to the invention.
[0061] A first organization, 1000, for example, a microbiological analysis laboratory in a hospital, houses a sequencer, 1002. This sequencer comprises a sequencing platform, which generates raw signals based on nucleic acid bases contained in a sample being analyzed, and a first computer processing unit, connected to the platform (for example, via a network if this unit is remote, or by cable when it is in the same chassis as the platform, as illustrated in this figure), which translates these raw signals into digital sequences of nucleic acids, or "reads" (this signal transformation is usually called "base calling"). As an example, the sequencing platform is a GridION, and the flow cell and library preparation are those marketed under the name "R9.4," sold by Oxford Nanopore Technologies.
[0062] The sequencer 1002 is also connected to a second processing unit 1004, which implements the front-end of a cloud-hosted SaaS (Software-as-a-Service) analysis. A second organization 2000, for example, the Applicant, hosts a computer server 2002. This server is remotely connected via the Internet 1006 to the sequencer 1002 to receive its reads 1008 (in the form of a FASTQ file, for example) and process them to predict one or more phenotypic characteristics 2004, for example, a gAST antibiogram. The server 2002, also connected to the processing unit 1004, communicates this prediction to it for display and local storage. The server 2002 also includes a reference database 2006 for implementing this prediction, as described below. Finally, server 2002 monitors the stability of the prediction in order to detect it as early as possible and sends a stop signal 2008 ("stop") to either sequencer 1002 or unit 1004, which then controls the sequencer, along with the final prediction result. The computer unit embedded in the sequencer and the second unit 1004, for example, have classic personal computer architectures. Server 2002 preferentially has greater computing power due to the amount of data to be processed, for example, in the form of processing nodes.
[0063] Figure 2 illustrates the invention within a complete microbiological analysis workflow, from the collection of a sample from a patient or animal suspected of having a bacterial infection, for example, to the delivery of the antibiogram of the bacterial strains in the sample. While an embodiment of the invention is described as being applied to predicting the resistance or sensitivity of a bacterial strain to an antibiotic, it is understood that this application does not limit the invention, which can be used for any type of phenotypic characterization, such as predicting its species, virulome, serotype, genomic similarity with another microbial strain (or "typing"), etc.
[0064] This process begins, in step 10, with the collection of the biological sample, which undergoes, in step 12, an initial preparation phase to isolate a bacterial strain with sufficient biomass for sequencing, to extract the genetic material from the cells corresponding to this strain (by chemical or mechanical lysis, for example), and to concentrate and purify this material. A second preparation of the sample thus produced then consists, in step 14, of preparing the genetic material for sequencing by the sequencer 1002, a step known as "library preparation".
[0065] Once the final sample is loaded into the sequencer 1002 and sequencing has started in 16, the sequencer 1002 delivers, in 18, packets of reads 1008 at fixed time steps or at regular time intervals (for example, the packets are of fixed sizes, a packet being sent when it is full) for their storage and analysis by the server 2002. The reception of a packet of reads by the server 2002 starts a new iteration of computer processing 20, which includes an optional step 22 of preprocessing the reads in order to characterize their quality and to filter out reads deemed to be of poor quality (as non-limiting examples: filtering out reads having a length less than a predefined threshold, filtering out reads having a quality score less than a predefined threshold, filtering out reads which correspond to the adapters used during the preparation of the library).
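As a non-limiting sketch, the optional preprocessing step 22 could be written as follows in Python. The `Read` type, the per-read mean quality score, the default thresholds and the exact-match test against adapter sequences are all assumptions made for the illustration; the description does not specify them.

```python
from dataclasses import dataclass


@dataclass
class Read:
    sequence: str
    mean_quality: float  # hypothetical per-read mean quality score


def filter_reads(reads, min_length=200, min_quality=7.0, adapters=frozenset()):
    """Keep reads that are long enough, of sufficient quality, and not adapters."""
    kept = []
    for read in reads:
        if len(read.sequence) < min_length:
            continue  # read too short
        if read.mean_quality < min_quality:
            continue  # read quality too low
        if read.sequence in adapters:
            continue  # read corresponds to a library adapter
        kept.append(read)
    return kept
```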
[0066] The preprocessed reads are then stored, in step 24, in a set of reads consisting of filtered reads from previous iterations. Next, the newly added reads are translated into k-mers, that is, nucleic acid sequences of fixed length, called "k", preferably between 15 and 50, for example, 31. This length allows for k-mers to be distinct from one another while robustly describing microbial genetic diversity. This translation is performed, for example, by sliding a window of length k along each read in steps of 1 and storing each detected k-mer in a set of k-mers containing k-mers from previous iterations.
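The sliding-window translation described above may be sketched as follows in Python. The function names are hypothetical, and the sketch counts k-mers exactly as they appear in the reads; it does not merge a k-mer with its reverse complement, which is a simplification.

```python
from collections import Counter


def extract_kmers(read: str, k: int = 31):
    """Yield every k-mer of a read by sliding a window of length k in steps of 1."""
    return (read[i:i + k] for i in range(len(read) - k + 1))


def update_kmer_counts(counts: Counter, new_reads, k: int = 31) -> Counter:
    """Add the k-mers of the newly delivered reads to the running counts."""
    for read in new_reads:
        counts.update(extract_kmers(read, k))
    return counts
```

Only the newly added reads are processed at each iteration, so the cost of an iteration depends on the size of the packet rather than on the total number of reads received so far.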
[0067] The process continues, in step 26, with the detection of the k-mers actually contained within the genetic material of the sample being tested. To do this, a sequencing depth is calculated by the server 2002, either directly from the set of reads or directly from the set of k-mers. As an example, a pan-genome assembled from the species of the bacterial strain being characterized is stored in the reference database 2006, and the k-mers are aligned to this reference genome (for example, with a perfect match using tools such as MUMmer4, available at https://mummer4.github.io/, and described in the article "MUMmer4: A fast and versatile genome alignment system" by G. Marçais et al., PLoS Computational Biology (2018)). Once the alignment is achieved, an average, or local, sequencing depth as described in patent application number EP22166649 is calculated. The k-mers of the set of k-mers are then enumerated, and if the number of occurrences of a k-mer exceeds a threshold equal to a fraction X_kmer of the calculated sequencing depth, advantageously a fraction X_kmer between 0.08 and 0.18, then server 2002 detects this k-mer as indeed present in the sequenced genetic material. Once the k-mers are detected, a phenotypic prediction is implemented by server 2002, for example, the prediction of an antibiogram as described in applications WO2021180771 and WO2021180768, a prediction which is returned to unit 1004. Note that this prediction is performed directly on the k-mers, without the reads ever being assembled into contigs, as assembly is a very costly operation in terms of time and computing resources.
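The detection rule of step 26 may be illustrated by the following minimal Python sketch. The function name `detect_kmers` and the default fraction of 0.1 are assumptions; the sequencing depth is taken as an input, its calculation by alignment not being reproduced here.

```python
from collections import Counter


def detect_kmers(counts: Counter, depth: float, x_kmer: float = 0.1) -> set:
    """Return the k-mers whose occurrence count exceeds x_kmer * depth."""
    threshold = x_kmer * depth
    return {kmer for kmer, n in counts.items() if n > threshold}
```

With a depth of 20x and a fraction of 0.1, for example, a k-mer seen once is discarded as probable sequencing noise, while a k-mer seen three or more times is retained.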
[0068] In parallel with the iterative processing 20, server 2002 implements a computer task to analyze the stability of the phenotypic prediction. To this end, this processing performs an initial stability test 30 on the number of detected k-mers, with the aim of detecting the actual or imminent end of the first phase of instability. Specifically, if the variation in the number of detected k-mers between two successive iterations, expressed as a percentage, is less in absolute value than a threshold S_kmer during N_kmer successive iterations, then said end is identified. Advantageously, the threshold S_kmer is equal to 0.1 and the number of iterations N_kmer is equal to 1.
[0069] Once this first stability criterion is met, the server 2002 performs a second stability analysis on the phenotypic prediction. If the latter is stable over a window of N_pred successive iterations, and if the increase in average sequencing depth between two successive iterations of this window is greater than a threshold ΔX, then the server 2002 determines that the prediction is stable. There are multiple ways to calculate the variation of the prediction. The prediction provides an output vector (one component per tested bacterium-antibiotic pair, for example), which can be discrete (such as resistance or sensitivity to an antibiotic) or continuous (such as the minimum inhibitory concentration of an antibiotic for a bacterial strain).
[0070] The variation is, for example, calculated as a normalized Euclidean norm of the difference between two successive vectors provided by the prediction. The variation is then considered stable if the calculated norm is less than a threshold S_pred. Or, for a multi-label prediction, the variation is equal to the number of prediction outputs that have changed since the previous iteration, and if the variation of the prediction between two successive iterations, expressed as a percentage, is less in absolute value than a threshold S_pred during N_pred successive iterations, then the prediction is considered stable. Advantageously, the number of iterations N_pred is equal to 4, the threshold S_pred is equal to 0.1, and the increase in average sequencing depth over the window is advantageously fixed at 3x, which allows for effective smoothing of sequencing noise from biological nanopore-based technologies such as those developed by Oxford Nanopore Technologies, and thus yields a robust result.
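The two ways of measuring the variation of the prediction may be sketched as follows in Python. The normalization of the Euclidean norm by the square root of the vector length, and the expression of the multi-label variation as a fraction of the outputs, are assumptions; the description leaves both open.

```python
import math


def normalized_euclidean_variation(prev, curr) -> float:
    """Euclidean norm of (curr - prev), divided by sqrt(len) (assumed normalization)."""
    if len(prev) != len(curr):
        raise ValueError("prediction vectors must have the same length")
    if not prev:
        return 0.0
    squared = sum((c - p) ** 2 for p, c in zip(prev, curr))
    return math.sqrt(squared / len(prev))


def changed_fraction(prev_labels, curr_labels) -> float:
    """Fraction of multi-label outputs that changed since the previous iteration."""
    if len(prev_labels) != len(curr_labels):
        raise ValueError("label vectors must have the same length")
    if not prev_labels:
        return 0.0
    changed = sum(p != c for p, c in zip(prev_labels, curr_labels))
    return changed / len(prev_labels)
```

For four bacterium-antibiotic pairs of which one flips from sensitive to resistant, `changed_fraction` returns 0.25, which exceeds a threshold of 0.1 and therefore counts as unstable.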
[0071] Server 2002 then sends, in step 34, a sequencing stop signal to sequencer 1002, which stops the ongoing sequencing, and sends the final prediction to processing unit 1004. A clinician is notified, by email, SMS or other means, and determines an antibiotic therapy for the patient based on this prediction; the therapy is then administered to the patient.
[0072] Figure 3 illustrates a bi-class prediction corresponding to the resistance or sensitivity of a given bacterial species strain to a given antibiotic, with a zoom on the initial period of instability and a zoom on the stability analysis window of the prediction.
[0073] INVENTION PARAMETERS
[0074] To detect the stability of the prediction as early as possible, the invention therefore has the following parameters which can be set individually or in combination:
[0075] - the way in which k-mers are detected, in particular the fraction X_kmer; and
[0076] - the way in which prediction stability is detected, in particular the threshold S_pred, the width of the stability window N_pred and the increase ΔX of the mean sequencing depth between two successive iterations.
A method for determining the value of these parameters is now described in relation to Figure 4. This method begins, in step 40, with the establishment of a database 42, called the "genomic truth" database, and a read database 44. This method is implemented by a computer unit, for example, server 2002. More specifically, for each strain in a set of previously collected bacterial strains:
i. A reference genome is determined. Preferably, the reference genome is the result of a process aimed at minimizing, and preferably eliminating, sequencing errors. For example, each strain is sequenced with a short-read sequencer from Illumina with a large sequencing depth (at least 50x, preferably greater than 100x) and an assembly is performed. This assembly of short reads leads to a very low error rate, or even zero with current assembly tools. The assembled genome is then translated into k-mers (hereafter "reference k-mers") by sliding a window of length k, in steps of 1, over the contigs as described previously, and the reference k-mers are stored in base 42. When phenotypic prediction uses a reference genome, this can be the one used to determine the parameter values.
ii. The strain genome is sequenced with a sequencer of the same type as that used for microbiological analysis (same type of platform, flow cell, library preparation and base calling as those used for phenotypic characterization) to produce read packets, and this is done for a significant sequencing depth (at least 50x, preferably greater than 100x). These read packets are optionally filtered and then translated into k-mers (hereafter "ONT k-mers") and stored in a set of k-mers as described previously.
Base 44 therefore stores iterations of the set of ONT k-mers, and thus a set of ONT k-mers that grows according to the sequencing depth.
[0077] The process continues, in 46, with the adjustment of the fraction X_kmer, and therefore of the number of copies a k-mer must have in the set of k-mers to conclude that it has been detected.
[0078] To this end, for each previously collected bacterial strain, and for each iteration of the ONT k-mers set, a ROC curve is calculated as a function of several values of the fraction X_kmer. This fraction is, for example, incremented from 0 in regular steps (e.g., 0.005). For each value of this fraction, ONT k-mers from the set of k-mers are judged to be detected or not. The detected ONT k-mers are then compared to the reference k-mers, and if a detected ONT k-mer is a reference k-mer, then it is a true positive; otherwise, it is a false positive. Figure 5 illustrates the ROC curve for a given strain and for different values of X_kmer (labeled "Thresh" in the figure). For each of the ROC curves, the value of the fraction X_kmer maximizing the area under the ROC curve (denoted "AUC") is stored (in the example in Figure 5, the value Thresh = 0.075). This yields, for each strain, a set of optimal values for the fraction X_kmer depending on the iterations. Figures 6A and 6B illustrate AUC values for the optimal values of the fraction X_kmer for a set of 55 bacterial strains (in this example, Staphylococcus aureus strains) as a function of the average sequencing depth, 3 iterations (noted "n Jqs") being represented.
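One possible reading of this adjustment is sketched below in Python. It rests on two assumptions that the description does not state: the population being classified is the set of ONT k-mers observed at the iteration considered, and the AUC attached to one value of the fraction is the area under the ROC curve reduced to that single operating point, that is (TPR + 1 - FPR) / 2. The function names are hypothetical.

```python
from collections import Counter


def single_point_auc(tpr: float, fpr: float) -> float:
    """Area under a ROC curve made of one operating point (assumed AUC definition)."""
    return (tpr + 1.0 - fpr) / 2.0


def best_fraction(counts: Counter, reference_kmers: set, depth: float, fractions):
    """Return (fraction, auc) for the candidate fraction with the highest AUC."""
    positives = [k for k in counts if k in reference_kmers]      # true k-mers
    negatives = [k for k in counts if k not in reference_kmers]  # error k-mers
    best = None
    for x in fractions:
        threshold = x * depth
        tp = sum(counts[k] > threshold for k in positives)  # detected and true
        fp = sum(counts[k] > threshold for k in negatives)  # detected but false
        tpr = tp / len(positives) if positives else 0.0
        fpr = fp / len(negatives) if negatives else 0.0
        auc = single_point_auc(tpr, fpr)
        if best is None or auc > best[1]:
            best = (x, auc)
    return best
```

When several fractions reach the same AUC, this sketch keeps the first one encountered, so the order of `fractions` decides ties.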
[0079] The optimal values for the fraction X kmcr are then grouped by sequencing depth intervals, for example, lOx intervals, to obtain a distribution of said fraction. Such a distribution for the 55 bacterial strains is illustrated in Figure 7 as a box plot, considering the median and the 5 e and 95 e percentiles. The number 1 on the abscissa corresponds to the interval of average sequencing depths [Ox; 10x[, the number 2 corresponding to the interval [1Ox; 20x[, etc.
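The grouping by depth interval may be sketched as follows in Python, using only the standard library. The function names and the input format (pairs of average depth and optimal fraction) are assumptions, and only the median is computed; the 5th and 95th percentiles are omitted for brevity.

```python
from collections import defaultdict
from statistics import median


def depth_bin(depth: float, width: float = 10.0) -> int:
    """Interval number as on the abscissa: 1 for [0x; 10x[, 2 for [10x; 20x[, etc."""
    return int(depth // width) + 1


def median_fraction_by_bin(observations, width: float = 10.0) -> dict:
    """Group (depth, optimal fraction) pairs by interval and take the median."""
    bins = defaultdict(list)
    for depth, fraction in observations:
        bins[depth_bin(depth, width)].append(fraction)
    return {b: median(values) for b, values in sorted(bins.items())}
```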
[0080] The median values from interval 2 onward, and therefore for average sequencing depths greater than 10x, define an optimal fraction range according to the invention, in particular the range [0.08; 0.18], the values of which lead to AUCs greater than 0.99.
[0081] The process continues, in step 48, with the determination of the width of the stability window N_pred and of the increment ΔX of the mean sequencing depth between two successive iterations. More specifically, this step involves calculating values for these parameters such that the phenotypic prediction at the end of the stability window is substantially identical to the phenotypic prediction obtained for a mean sequencing depth at which the prediction is known to be stabilized. For example, the phenotypic prediction at a mean sequencing depth greater than or equal to 50x, e.g., 50x, is chosen as the reference. A search space is then defined, consisting of values of X_kmer within the optimal fraction interval (for example, a discretization of said interval in steps of 0.05), values of the width N_pred (for example, integers between 1 and 10), and values of the increment ΔX (for example, from 1x to 10x in steps of 1x). For each triplet of values X_kmer, N_pred and ΔX, the phenotypic prediction is implemented for the iteration marking the end of the stability window and compared to the reference phenotypic prediction. This comparison is preferably based on a metric measuring the accuracy of this prediction relative to the reference prediction, for example, balanced accuracy, or the mean of the balanced accuracies if the prediction is multi-class. The following table illustrates the balanced accuracy for a binary prediction (i.e., resistant or not to an antibiotic) as described in the applications WO2021180771 and WO2021180768 and applied to the k-mers of the 55 S. aureus strains, for X_kmer = 0.1 and different values of N_pred and ΔX.
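The parameter search described in this paragraph can be sketched as follows. The example assumes binary labels (True for resistant). The callback `predict_at_window_end` is hypothetical: it stands for running the prediction up to the end of the stability window for a given triplet and is not defined by the text.

```python
from itertools import product


def balanced_accuracy(reference, predicted):
    """Mean of sensitivity and specificity for binary labels (True = resistant)."""
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r and p)
    tn = sum(1 for r, p in pairs if not r and not p)
    positives = sum(1 for r in reference if r)
    negatives = len(reference) - positives
    sensitivity = tp / positives if positives else 0.0
    specificity = tn / negatives if negatives else 0.0
    return (sensitivity + specificity) / 2.0


def grid_search(x_values, n_pred_values, delta_x_values, predict_at_window_end, reference):
    """Score every (x_kmer, n_pred, delta_x) triplet against the reference prediction."""
    scores = {}
    for triplet in product(x_values, n_pred_values, delta_x_values):
        predicted = predict_at_window_end(*triplet)  # hypothetical callback
        scores[triplet] = balanced_accuracy(reference, predicted)
    return scores
```

The triplets whose score matches the reference with the smallest product of window width and depth increment are those that deliver the prediction earliest.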
[0082] The inventors noted that the value of the metric reaches a plateau when N_pred is greater than or equal to 4 and ΔX is greater than or equal to 3x, and this holds true for all values of X_kmer in the interval [0.08; 0.18].
[0083] The stability window is thus advantageously chosen such that N_pred = 4 and ΔX ≥ 3x, preferably ΔX ≤ 4x, these pairs of values corresponding to the fastest delivery of the phenotypic prediction. Even more preferably, ΔX = 3x. Indeed, the performance gain for 4x compared to 3x is marginal, while it implies a later delivery.
[0084] Figure 8 illustrates, in the form of a box plot (median, 5th and 95th percentiles), the average sequencing depth at which the phenotypic prediction according to the invention is delivered with X_kmer = 0.1, N_kmer = 1, S_kmer = 0.1, N_pred = 4, S_pred = 0.1 and ΔX = 3x, for 55 strains of S. aureus, 80 strains of E. coli and 80 strains of P. aeruginosa. A gain of at least 15x in sequencing depth is observed compared to a phenotypic prediction delivered at 50x, and even a gain greater than 30x for some microbial species.
[0085] The invention also includes parameters for detecting the end of the initial period of instability, in particular the threshold S_kmer and the number of iterations N_kmer. The inventors found that S_kmer = 0.1 and N_kmer = 1 allow for robust detection of the end of the instability period. Other values have been tested, with a marginal gain in the speed of prediction delivery.
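The two-stage stopping criterion can be sketched as follows, as an illustration only. Several points are assumptions: the prediction is taken to be a numeric score, the variation in the number of detected k-mers is taken to be relative to the previous iteration, the depth increment is compared with "greater than or equal to", and the default S_pred = 0.1 is simply a value within the bound given in the claims. The class and attribute names are hypothetical.

```python
class StoppingCriterion:
    """Stage A: detected k-mer count settles. Stage B: prediction stays stable."""

    def __init__(self, s_kmer=0.1, n_kmer=1, s_pred=0.1, n_pred=4, delta_x=3.0):
        self.s_kmer, self.n_kmer = s_kmer, n_kmer
        self.s_pred, self.n_pred = s_pred, n_pred
        self.delta_x = delta_x
        self.prev = None            # (depth, n_kmers, prediction) of the last iteration
        self.kmer_run = 0           # consecutive iterations with a stable k-mer count
        self.kmers_settled = False  # True once the initial instability period is over
        self.pred_run = 0           # consecutive iterations with a stable prediction

    def update(self, depth, n_kmers, prediction):
        """Feed one iteration; return True when the prediction can be delivered."""
        if self.prev is None:
            self.prev = (depth, n_kmers, prediction)
            return False
        prev_depth, prev_n, prev_pred = self.prev
        self.prev = (depth, n_kmers, prediction)
        if not self.kmers_settled:
            # Assumed relative variation of the number of detected k-mers.
            rel = abs(n_kmers - prev_n) / prev_n if prev_n else float("inf")
            self.kmer_run = self.kmer_run + 1 if rel < self.s_kmer else 0
            self.kmers_settled = self.kmer_run >= self.n_kmer
            return False
        stable = (abs(prediction - prev_pred) < self.s_pred
                  and depth - prev_depth >= self.delta_x)
        self.pred_run = self.pred_run + 1 if stable else 0
        return self.pred_run >= self.n_pred
```

With the default values, the earliest possible delivery is N_pred iterations of at least ΔX each after the k-mer count has settled, which is 12x of additional depth.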
[0086] EXTENSION OF THE TEACHING OF THE EMBODIMENT
i. A prediction performed on an isolate of a bacterial strain has been described. The invention also applies to metagenomics, which consists of characterizing microorganisms in the same sample without isolating them as described previously. The prediction includes, for example, an additional step of assigning reads to a particular species, one or more of whose genomes are stored in the 2006 database, a step known as "pooling". Each pool of reads, assigned to a species, is then processed in the manner described above. In such a case, sequencing is stopped when all the predictions are stable. Figure 9 illustrates the time saved for these adjustments between a 50x prediction and a prediction according to the invention. On average, a time saving of 3 hours is observed. Figure 10 illustrates this same phenomenon for the 215 strains tested, as described above, for antibiotic resistance / sensitivity. For example, 20% of the predictions are delivered only a few tens of minutes after the start of sequencing, with approximately 70% of the predictions delivered twice as fast as a prediction made on 50x reads. Figure 11 is a confusion matrix between the invention and a 50x prediction, and Figure 12 illustrates the average sequencing depth at which the prediction according to the invention is detected as stabilized and thus delivered.
ii. A phenotypic prediction as described in documents WO2021180771 and WO2021180768 has been described. The invention is not limited to such predictions. For example, the invention encompasses the prediction of a resistome (i.e., the delivery of a set of genetic markers suspected of being involved in the resistance or sensitivity of a strain to an antibiotic) or of a virulome.
iii. A specific hardware and software architecture has been described for implementing the invention. In particular, the computer processing units and the server(s) comprise computer memory (cache, RAM, ROM, etc.) and one or more microprocessors or processors (CPU and / or GPU), organized or not as computing nodes, necessary for executing computer instructions stored in the memory for implementing the method according to the invention. It will be understood that, with regard to computing, any type of computer architecture can be suitable and that the description above and below should not be construed as limiting the scope of the invention. In one variant (Figure 13), the architecture is hosted by a single organization, for example a hospital, with the SaaS being implemented on-premise.
Claims
1. A method for predicting a phenotypic trait of an organism based on its genome, comprising:
• a sequencing of said genome so as to produce digital nucleic acid sequences representative of said genome, hereinafter "reads", said sequencing being configured to deliver the reads a) in real time and / or b) in batches at regular or periodic intervals, said reads being stored in a computer memory;
• an iterative process implemented by computer, comprising for each iteration:
- an update of a set of reads based on reads delivered in real time or in batches;
- a calculation of the sequencing depth of said genome as a function of said set of reads;
- a detection of predetermined length digital reference nucleic acid sequences, hereinafter "k-mers", in said set of reads, a k-mer being detected when its number of occurrences in said set is greater than a predetermined fraction of the calculated sequencing depth;
- a prediction of the phenotypic trait based on the k-mers detected in said set of reads; and
- a calculation of a stopping criterion and a storage of the predicted phenotypic trait as the final phenotypic trait when said stopping criterion is reached,
characterized in that the calculation of the stopping criterion includes the calculation of the variation in the calculated sequencing depth between two iterations, the variation in the number of k-mers detected between two iterations, and the variation in the predicted phenotypic trait between two iterations, and in that the stopping criterion is reached:
A. if the variation in the number of detected k-mers is stable during a first predetermined number of iterations; and
B. following the stability of the variation in the number of k-mers detected during the first number of iterations:
B.1. if the variation in the phenotypic trait is stable during a second number of iterations; and
B.2. during the second number of iterations, if the variation in sequencing depth calculated between two iterations is greater than a predetermined threshold.
2. A method according to claim 1, characterized in that the predetermined fraction of the sequencing depth is between 0.08 and 0.18.
3. Method according to claim 1 or 2, characterized in that the first number of iterations is equal to 1.
4. A method according to any one of the preceding claims, characterized in that the second number of iterations is greater than or equal to 4.
5. A method according to any one of the preceding claims, characterized in that the threshold is greater than or equal to 3x.
6. A method according to any one of the preceding claims, characterized in that the variation in the number of detected k-mers is stable when it is less in absolute value than a first stability threshold, and the variation in the phenotypic character is stable when it is less than a second predetermined stability threshold, and in that the first and second thresholds are less than or equal to 0.1.
7. A method according to any one of the preceding claims, characterized in that sequencing is configured to deliver reads with an average length greater than 500kb.
8. A method according to any one of the preceding claims, characterized in that the reads are not assembled.
9. A method according to any one of the preceding claims, characterized in that, prior to sequencing, the method comprises the production of a sample from a microorganism isolate, the sequencing being carried out on said sample.
10. A method according to any one of the preceding claims, characterized in that the organism is a microorganism, and in that the phenotypic character is a resistance or sensitivity profile of the microorganism to at least one antimicrobial.
11. A method according to claim 10, characterized in that the microorganism is taken from a patient or animal suspected of having a microbial infection, and in that the method comprises determining an antimicrobial therapy based on said profile and administering said therapy to the patient or animal.
12. A method according to any one of the preceding claims, characterized in that when the stopping criterion is reached, the sequencing is stopped and / or the final phenotypic character is displayed on a screen.
13. System for predicting a phenotypic trait of an organism based on its genome, comprising:
• a nucleic acid sequencer, said sequencer being configured to sequence said genome so as to produce digital nucleic acid sequences representative of said genome, hereinafter "reads", said sequencer being configured to deliver the reads a) in real time and / or b) in batches at regular or periodic intervals, said reads being stored in a computer memory;
• an information processing unit connected to the sequencer to receive said reads and comprising one or more microprocessors and a computer memory storing computer instructions which, when executed by the microprocessor(s), implement iterative processing, comprising for each iteration:
- an update of a set of reads based on reads delivered in real time or in batches;
- a calculation of the sequencing depth of said genome as a function of said set of reads;
- a detection of predetermined length digital reference nucleic acid sequences, hereinafter "k-mers", in said set of reads, a k-mer being detected when its number of occurrences in said set is greater than a predetermined fraction of the calculated sequencing depth;
- a prediction of the phenotypic trait based on the k-mers detected in said set of reads; and
- a calculation of a stopping criterion and a storage of the predicted phenotypic trait as the final phenotypic trait when said stopping criterion is reached,
characterized in that the calculation of the stopping criterion includes the calculation of the variation in the calculated sequencing depth between two iterations, the variation in the number of k-mers detected between two iterations, and the variation in the predicted phenotypic trait between two iterations, and in that the stopping criterion is reached:
A. if the variation in the number of detected k-mers is stable during a first predetermined number of iterations; and
B. following the stability of the variation in the number of k-mers detected during the first number of iterations:
B.1. if the variation in the phenotypic trait is stable during a second number of iterations; and
B.2. if the variation in sequencing depth calculated during the second predetermined number of iterations is greater than a predetermined threshold.
14. Prediction system according to claim 13, characterized in that the computer memory stores computer instructions which, when executed by the microprocessor(s), implement a method according to any one of claims 2 to 12.
15. Computer program product comprising a computer memory storing computer instructions which, when executed by one or more microprocessors, implement iterative processing according to any one of claims 1 to 12.
16. A method for producing a computer memory for predicting a phenotypic trait of an organism based on its genome, comprising:
i. the iterative updating of a set of digital nucleic acid sequences representative of said genome, hereinafter referred to as "reads", based on new reads representative of said genome;
ii. computer-implemented exploration of a space defined by at least a first number of iterations, a second number of iterations, and a threshold, hereinafter referred to as a "triplet", the exploration comprising, for each value of a set of triplets of said space, (a) a calculation of a sequencing depth of said genome based on said set of reads, and the detection of reference digital nucleic acid sequences of predetermined length, hereinafter referred to as "k-mers", in said set of reads, a k-mer being detected when its number of occurrences in said set exceeds a predetermined fraction of the calculated sequencing depth, (b) a prediction of the phenotypic trait based on the k-mers detected in said set of reads, and (c) a calculation of a stopping criterion, the calculation of the stopping criterion comprising calculating the change in the calculated sequencing depth between two iterations, the change in the number of k-mers detected between two iterations, and the change in the predicted phenotypic trait between two iterations;
iii. the storage in computer memory of (a) a triplet as the first number of iterations, second number of iterations, and threshold, and (b) computer instructions which, when executed by one or more microprocessors, implement the iterative processing according to any one of claims 1 to 12, when said triplet fulfills the following conditions I and II:
I. A. if the variation in the number of detected k-mers is stable during the first predetermined number of iterations; and
B. following the stability of the variation in the number of k-mers detected during the first number of iterations:
B.1. if the variation in the phenotypic trait is stable during the second number of iterations; and
B.2. if the variation in sequencing depth calculated during the second predetermined number of iterations is greater than the threshold,
II. when condition I is met:
C.1. if the detected k-mers are also detected in a reference genome of the organism; and
C.2. if the detected phenotypic trait is identical to a reference phenotypic trait for the organism.