Predicting packaging of single stranded nucleic acid

A computer-based method applies virtual forces to ssNA molecules to predict their compaction into AAV capsids, addressing the challenge of packaging efficiency by iteratively compacting and using machine learning, thus enhancing the selection of ssNA molecules for gene therapy.

WO2026137036A1 · PCT designated stage · Publication Date: 2026-07-02 · COMMONWEALTH SCI & IND RES ORG +1

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
COMMONWEALTH SCI & IND RES ORG
Filing Date
2025-12-19
Publication Date
2026-07-02

AI Technical Summary

Technical Problem

Current methods struggle to determine whether a single stranded nucleic acid (ssNA) molecule will fit into an adeno-associated virus (AAV) capsid, as the size of the genetic information is limited by the capsid, making it difficult to predict packaging efficiency.

Method used

A computer-implemented method using a simulator environment to apply virtual forces to ssNA molecules, iteratively compacting them to calculate spatial characteristics, such as radius of gyration, and optionally using machine learning to predict compaction into AAV capsids.

Benefits of technology

Enables accurate prediction of ssNA compaction into AAV capsids without structural damage, reducing the need for costly lab experiments and accelerating the selection of suitable ssNA molecules for gene therapy.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure AU2025051468_02072026_PF_FP_ABST
Patent Text Reader

Abstract

This disclosure relates to a computer-implemented method for determining spatial characteristics of a single stranded nucleic acid (ssNA) molecule for packaging the ssNA molecule into a vector. The method comprises determining a three-dimensional structure of the ssNA molecule; applying a virtual force to elements of the ssNA molecule in a simulator environment using the three-dimensional structure, the virtual force being applied towards an interior point of the three-dimensional structure, the simulator environment being configured to simulate a chemical environment within the vector; and simulating, in the simulator environment, a compaction of the three-dimensional structure by the force applied to the elements of the ssNA molecule to calculate a spatial characteristic, related to packaging the ssNA molecule into the vector, of a compacted three-dimensional structure of the ssNA molecule.

Description

"Predicting packaging of single stranded nucleic acid"

Cross-Reference to Related Applications

[0001] The present application claims priority from Australian Provisional Patent Application No 2024904307 filed on 24 December 2024, the contents of which are incorporated herein by reference in their entirety.

Technical Field

[0002] This disclosure relates to determining spatial characteristics of a single stranded nucleic acid (ssNA) molecule for packaging the ssNA molecule into a vector.

Background

[0003] Adeno-associated viruses (AAV) comprise a capsid (protein “shell”) and a genome containing genetic information encoded in nucleic acids, within the capsid. AAVs can be used for gene therapy by packaging therapeutic deoxyribonucleic acid (DNA) into the capsid. The nucleic acid of a single stranded nucleic acid molecule, such as an ssDNA molecule, is compacted, i.e. spatially compressed, so that it fits within the capsid.

[0004] There are protocols for introducing the desired genome into the AAV capsid but the amount of genetic information in the package is limited by the size of the capsid.

[0005] Throughout this specification the word "comprise", or variations such as "comprises" or "comprising", will be understood to imply the inclusion of a stated element, integer or step, or group of elements, integers or steps, but not the exclusion of any other element, integer or step, or group of elements, integers or steps.

[0006] Any discussion of documents, acts, materials, devices, articles or the like which has been included in the present specification is not to be taken as an admission that any or all of these matters form part of the prior art base or were common general knowledge in the field relevant to the present disclosure as it existed before the priority date of each of the appended claims.

Summary

[0007] This disclosure provides a computer-implemented method for determining spatial characteristics of a single stranded nucleic acid (ssNA) molecule for packaging the ssNA molecule into a vector. The method comprises determining a three-dimensional structure of the ssNA molecule; applying a virtual force to elements of the ssNA molecule in a simulator environment using the three-dimensional structure, the virtual force being applied towards an interior point of the three-dimensional structure, the simulator environment being configured to simulate a chemical environment within the vector; and simulating, in the simulator environment, a compaction of the three-dimensional structure by the force applied to the elements of the ssNA molecule to calculate a spatial characteristic, related to packaging the ssNA molecule into the vector, of a compacted three-dimensional structure of the ssNA molecule.

[0008] It is an advantage that applying the virtual force compacts the ssNA molecule in the simulator. This way, it is possible to calculate a compacted molecule to determine its spatial characteristics.

[0009] In some embodiments, simulating the compaction comprises iteratively simulating the compaction until a convergence criterion is met. It is an advantage that the iterative simulation prevents breaking the molecule in the simulator.

[0010] In some embodiments, iteratively simulating the compaction comprises, in each iteration, increasing the force.

[0011] In some embodiments, the convergence criterion is a loss of structural integrity of the molecule.

[0012] In some embodiments, applying the virtual force comprises creating a boundary structure in the simulator environment and applying the force to elements of the ssNA molecule outside the boundary structure. It is an advantage that the boundary structure enables computationally efficient evaluation of the applied force on the molecule.

[0013] In some embodiments, the boundary structure has a size and simulating the compaction comprises iteratively reducing the boundary size.

[0014] In some embodiments, the boundary structure comprises a first boundary and a second boundary and a first force is applied to elements outside the first boundary and a second force is applied to elements outside the second boundary, wherein the first force and the second force are towards the interior point of the three-dimensional structure. It is an advantage that both forces are applied towards the interior point because this way, the compaction can be calculated reliably and without tangling up the molecule before the final compaction is achieved.

[0015] In some embodiments, the first boundary and the second boundary are spherical boundaries. It is an advantage that spherical boundaries mimic the final shape of a virus usable for gene therapy.

[0016] In some embodiments, the ssNA molecule comprises genetic information for a virus and genetic information to provide gene therapy.

[0017] In some embodiments, the virus has a size and the spatial characteristic is an indication of whether the ssNA molecule compacts into the size of the virus. It is an advantage that a given ssNA strand can be tested on whether it will fit into the virus. This way, the given ssNA strand can be evaluated in silico without the expense and delay of a lab experiment.

[0018] In some embodiments, the spatial characteristic comprises a radius of gyration.

[0019] In some embodiments, the method further comprises repeatedly performing the steps of determining the three-dimensional structure, applying the virtual force, and simulating the compaction for multiple different ssNA molecules to calculate multiple respective spatial characteristics; and using the multiple respective spatial characteristics to train a machine learning model configured to receive, at an input, a description of the ssNA molecule and generate, at an output, the spatial characteristic to determine a trained machine learning model.

[0020] In some embodiments, the method further comprises applying the trained machine learning model to a new ssNA molecule to generate spatial characteristics of the new ssNA molecule, wherein applying the trained machine learning model is computationally less expensive than performing the steps of determining the three-dimensional structure, applying the virtual force, and simulating the compaction of the new ssNA molecule.

[0021] In some embodiments, the method further comprises validating the spatial characteristic against packaging efficiency determined in a lab experiment.

[0022] In some embodiments, the method further comprises iteratively swapping codons in the ssNA sequence to optimise the spatial characteristic.

[0023] A computer-implemented machine learning method for determining spatial characteristics of a single stranded nucleic acid (ssNA) molecule for packaging the ssNA molecule into a vector comprises accessing a trained machine learning model, the trained machine learning model being configured to receive, at an input, a description of the ssNA molecule and generate, at an output, the spatial characteristic of a compacted three-dimensional structure of the ssNA molecule, the machine learning model being trained on multiple training samples; and applying the trained machine learning model to the ssNA molecule to generate the spatial characteristics of the compacted three-dimensional structure of the ssNA molecule.

[0024] In some embodiments, each of the multiple samples comprises a description of an ssNA molecule for the input and corresponding simulated spatial characteristics as labels for the output.

[0025] In some embodiments, the method comprises training the machine learning model by adjusting parameters of the machine learning model to reduce an error between an output of the machine learning model and the simulated spatial characteristics.
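
The parameter adjustment described in [0025] can be sketched with a deliberately simple stand-in model. The example below assumes a linear model trained by gradient descent on the mean squared error; the function names, the single numeric feature and the toy labels are hypothetical and are not taken from this disclosure, which elsewhere refers to a hybrid CNN-RNN model.

```python
def predict(features, weights, bias):
    """Linear stand-in for the machine learning model (an assumption)."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def mean_squared_error(samples, labels, weights, bias):
    """Error between model outputs and the simulated spatial characteristics."""
    return sum((predict(x, weights, bias) - y) ** 2
               for x, y in zip(samples, labels)) / len(samples)


def train(samples, labels, learning_rate=0.05, epochs=5000):
    """Adjust the parameters step by step so that the error decreases."""
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(samples, labels):
            err = predict(x, weights, bias) - y
            for j in range(n_features):
                grad_w[j] += 2.0 * err * x[j] / n
            grad_b += 2.0 * err / n
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias


# Toy data: one hypothetical feature per molecule, label = simulated characteristic.
toy_samples = [[0.0], [1.0], [2.0], [3.0]]
toy_labels = [1.0, 3.0, 5.0, 7.0]
toy_weights, toy_bias = train(toy_samples, toy_labels)
```

In practice the input would be a richer description of the ssNA sequence and the labels would come from the compaction simulation; only the idea of reducing an error by adjusting parameters carries over.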

[0026] Software, when executed by a computer, causes the computer to perform the above method.

[0027] A computer system comprises one or more processors configured to perform the above method.

[0028] An adeno-associated virus for gene therapy comprises the above ssNA molecule.

Brief Description of Drawings

[0029] An example will now be described with reference to the following drawings:

[0030] Figure 1 illustrates an example pipeline for determining spatial characteristics of a single stranded nucleic acid (ssNA) molecule for packaging the ssNA molecule into a vector.

[0031] Figure 2 illustrates boundary conditions at various steps of the proposed method.

[0032] Figure 3 illustrates a computer-implemented method 300 for determining spatial characteristics of a single stranded nucleic acid (ssNA) molecule for packaging the ssNA molecule into a vector, such as an AAV.

[0033] Figure 4 illustrates a computer system 400 for determining spatial characteristics of a single stranded nucleic acid (ssNA) molecule for packaging the ssNA molecule into a vector.

[0034] Figure 5 illustrates a computer-implemented machine learning method for determining spatial characteristics of a single stranded nucleic acid (ssNA) molecule for packaging the ssNA molecule into a vector.

[0035] Figure 6 is a graph to illustrate the relationship between the length of DNA sequences and their radius of gyration.

[0036] Figure 7 illustrates the performance evaluation of the hybrid CNN-RNN model with sequence input.

[0037] Figure 8 is a flow chart of a method for codon optimisation using the exemplary method of Figure 5.

Description of Embodiments

[0038] This disclosure relates to methods for packaging single strand nucleic acid molecules into a vector, such as an adeno-associated virus. Within this disclosure, a “vector” refers to a viral vector unless stated otherwise, such as in the case of a support vector, which is a data structure in a software program. It is further noted that some examples disclosed herein relate to an adeno-associated virus (AAV). However, the disclosed solution is equally applicable to other viruses. Therefore, the reference to AAV in this disclosure can be replaced by other suitable viruses that have a spherical capsid such as an icosahedral capsid. In addition, the disclosed solution is applicable to viruses that do not have perfectly spherical capsids. In particular, this includes lentiviruses as well as all other members of the Retroviridae family. Further, the disclosed solution is well-suited for all viruses in the Parvoviridae family, which includes AAV, and for all viruses in the Adenoviridae family, such as adenoviruses, which may be used as vectors in gene therapy. The disclosed solution can also be applicable to other families, such as Picornaviridae (e.g., poliovirus), Flaviviridae (e.g., dengue and zika), Togaviridae (e.g., chikungunya), Hepadnaviridae (e.g., hepatitis B virus), Cytomegalovirus and Paramyxoviridae (e.g., Hendra virus, Newcastle disease virus).

[0039] The process of genome packaging for adeno-associated viruses (AAVs) begins with the preparation of the AAV genomes, which are flanked by inverted terminal repeats (ITRs). These genomes are packaged as single-stranded DNA into preformed empty capsids. This packaging mechanism is intricately linked with the replication process, ensuring that the genomes are correctly inserted into the capsids. The small Rep proteins play a role in this process, facilitating the translocation of the genome into the capsid.

[0040] For genes that exceed the packaging limit, it may be possible to employ single oversized AAV vectors and dual AAV vectors. Single oversized AAV vectors involve packaging a larger genome into a single capsid, while dual AAV vectors split the genome into two separate vectors that can recombine within the target cell. This approach allows for the delivery of larger genes that would otherwise be too large for a single AAV vector. Additionally, self-complementary AAV vectors are used to expedite the onset of transgene expression. These vectors package both complementary strands of the genome within a single capsid, enabling rapid and efficient expression of the transgene.

[0041] Replication and packaging are tightly coupled processes, with the replication of the AAV genome occurring concurrently with its packaging into the capsid. This coupling ensures that the genome is replicated and packaged in a coordinated manner, minimizing the risk of errors and maximizing the efficiency of the process. The Rep proteins not only facilitate the translocation of the genome into the capsid but also play a role in the replication of the genome, further highlighting their importance in the overall process.

[0042] It is understood that there is a packaging limit to what fits into an AAV capsid. The packaging limit may be formulated as a number of bases, such as 4.7 kb. However, the actual limit depends on the question of whether a given molecule can be compacted to fit within the size of the capsid. The answer to this question depends on the actual bases in the molecule because they determine the internal forces and different structures. As a result, there may be molecules that have fewer bases than the packaging limit but do not fit into the capsid, while there may be molecules that have more bases than the packaging limit that do fit.

[0043] It is a major obstacle to drug development that it is currently difficult to determine whether the molecule will fit or not.

Computational pipeline

[0044] Therefore, this disclosure provides a computer-implemented method for determining spatial characteristics of a single stranded nucleic acid (ssNA) molecule. This ssNA molecule may be a single stranded deoxyribonucleic acid (ssDNA) molecule, a single stranded ribonucleic acid (ssRNA) molecule or any other ssNA molecule. The determined spatial characteristic may be indicative of whether the ssNA molecule fits within a given capsid.

[0045] It is noted that when reference is made to an ssNA molecule, this is meant to refer to an actual chemical molecule. When reference is made to an ssNA sequence, this is meant to refer to the identification of bases in the sequence, such as a string of the letters AGCTU stored or processed in a computer (i.e. in silico).

[0046] A single stranded nucleic acid (ssNA) molecule is a type of nucleic acid molecule consisting of a single linear chain of nucleotides. Unlike double stranded nucleic acids, which form stable double helix structures through complementary base pairing, single stranded nucleic acids exist as single sequences of bases (adenine, guanine, cytosine, thymine, or uracil) without forming such pairings within the same molecule. There are primarily two types of single stranded nucleic acids, classified based on the sugar present in their nucleotides: single stranded deoxyribonucleic acid (ssDNA) and single stranded ribonucleic acid (ssRNA). The former contains the sugar deoxyribose and includes the bases adenine (A), guanine (G), cytosine (C), and thymine (T). The latter contains the sugar ribose and includes the bases adenine (A), guanine (G), cytosine (C), and uracil (U) instead of thymine. Single stranded nucleic acids have several distinct characteristics. Structurally, ssNAs do not form a double helix. Instead, they may fold into various secondary and tertiary structures due to intramolecular base pairing and other interactions. Functionally, ssNAs play roles in various biological processes. For example, ssRNA molecules include messenger RNA (mRNA), which carries genetic information from DNA to ribosomes for protein synthesis, and ribosomal RNA (rRNA) and transfer RNA (tRNA), which are involved in the translation process. In terms of stability, ssNAs are generally less stable than their double stranded counterparts due to the lack of hydrogen bonding between complementary strands. However, certain secondary structures, such as hairpins and stem-loops, can increase their stability.

[0047] The process of arriving at the final result may comprise a number of different steps, also referred to as a pipeline. Depending on available software packages, the pipeline may comprise a number of different programs, or software packages, such as different binaries that pass data to each other in the form of files stored on a file system or by accessing a shared database, such as a relational (e.g., SQL) or a graph database.

[0048] Figure 1 illustrates an example pipeline 100, comprising a first part 101 (potentially including multiple software packages) used to determine a three-dimensional structure of the ssNA molecule. That is, the ssNA molecule is simulated to determine its three-dimensional structure in the absence of a capsid, so in free space. In that free space example, the three dimensional structure is dependent on the bonds, electrons, etc. between the different atoms of the molecule. This three-dimensional structure does not yet yield sufficient information on whether that molecule can fit into a given capsid.

[0049] In one example, the pipeline comprises a software package designed to predict the secondary structure of nucleotide sequences by calculating the minimum free energy (MFE) structure of these sequences, which influences their stability. The software reads nucleotide sequences and computes their MFE structures using a dynamic programming algorithm. This algorithm allows the software to efficiently predict the most stable secondary structure for a given sequence. More particularly, the algorithm systematically evaluates all possible secondary structures of the sequence to find the one with the lowest free energy. The MFE structure is considered the most stable configuration of the ssNA molecule. The output may be presented in bracket notation. This algorithm facilitates the prediction of nucleic acid folding patterns, leveraging thermodynamic parameters to provide insights into probable secondary structures and base-pairing probabilities. With these preliminary folding patterns as the foundation, the pipeline then advances to a more intricate three-dimensional structural analysis. In one example, the first part 101 may be implemented using the RNAfold program in the ViennaRNA Package from the University of Vienna (https://www.tbi.univie.ac.at/RNA/).
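
To illustrate the bracket notation mentioned above, the following sketch converts a dot-bracket string into a list of paired positions. It assumes the common convention in which "(" and ")" mark the two partners of a base pair and "." marks an unpaired base; the function name and the example string are hypothetical, and the sketch does not perform any folding or energy calculation itself.

```python
def parse_dot_bracket(structure):
    """Return sorted (i, j) index pairs for a dot-bracket string.

    Indices are zero-based. Raises ValueError for unbalanced brackets
    or characters other than '(', ')' and '.'.
    """
    open_positions = []  # stack of indices of unmatched '('
    pairs = []
    for index, symbol in enumerate(structure):
        if symbol == "(":
            open_positions.append(index)
        elif symbol == ")":
            if not open_positions:
                raise ValueError(f"unmatched ')' at position {index}")
            pairs.append((open_positions.pop(), index))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {index}")
    if open_positions:
        raise ValueError(f"unmatched '(' at position {open_positions[-1]}")
    return sorted(pairs)


# A small hairpin: two stacked base pairs closing a two-base loop.
hairpin_pairs = parse_dot_bracket("((..))")
```

The resulting pair list is the kind of information a later tertiary-structure step can consume as a folding constraint.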

[0050] The output of this first part 101 of the pipeline may be fed into a second part 102 of the pipeline to predict the tertiary structure of the ssNA molecule. This may be achieved by biophysical modelling by detailed sampling and energetic estimates of structural and mutational degrees of freedom. The algorithms in the second part 102 of the pipeline may handle full-atom representations of the ssNA molecule, which allows for high-resolution modelling. More specifically, the second part may employ a robust energy function to evaluate the stability of nucleic acid structures. This energy function includes terms for van der Waals interactions, electrostatics, hydrogen bonding, and solvation effects. By minimizing this energy function, the algorithm can identify the most stable conformations of nucleic acids. This process is useful for predicting the secondary and tertiary structures of the ssNA molecule. In one example, the second part 102 may be implemented using the RosettaDNA software available from Rosetta Commons (https://rosettacommons.org/software/) or more specifically, a parameter-optimized version of Fragment Assembly of RNA with Full Atom Refinement (FARFAR2) (https://docs.rosettacommons.org/docs/latest/FARFAR2). This tool enables the extrapolation of the spatial configurations of these molecules by employing an energy-based approach and optimizing the configuration of the nucleic acid molecules to reach energetically favourable conformations.

[0051] The output of this second part 102 of the pipeline may be fed into a third part 103 of the pipeline to address solvation and ionization. The third stage may comprise a software package designed to create initial configurations for molecular dynamics simulations by packing molecules within defined regions of space. The software package may ensure that short-range repulsive interactions do not disrupt the simulations, which is beneficial for obtaining accurate and reliable results. One of the software’s functions is to generate initial coordinates for a system of molecules, ensuring that they are appropriately spaced to avoid overlaps and unrealistic interactions. This is particularly beneficial in molecular dynamics simulations, where the initial configuration can significantly impact the accuracy and efficiency of the simulation.

[0052] To solvate and ionize the ssNA molecule, the process defines the molecule’s coordinates, obtained from the second part 102 of the pipeline. The molecule is then placed in a simulation sphere, which represents the environment in which the simulation will occur. A user may specify the dimensions of this sphere and the number of solvent molecules, such as water, to be added. The software uses an algorithm to distribute the solvent molecules around the ssNA, ensuring that they are evenly spaced and do not overlap with the ssNA or each other. This step is beneficial for accurately modelling the solvation shell around the ssNA, which plays a role in its structural stability and interactions with other molecules.

[0053] In this step the system is also ionized. This involves adding ions, such as sodium or chloride, to neutralize the charge of the ssNA and mimic physiological conditions. The user may specify the number and type of ions to be added, as well as their placement within the simulation sphere. The software uses a similar algorithm to distribute the ions, ensuring that they are appropriately spaced and do not overlap with the ssNA or solvent molecules. This step is beneficial for accurately modelling the electrostatic interactions within the system, which can significantly impact the ssNA’s packaging. In one example, the third step may be implemented using the software package Packmol from the Universities of Campinas and Sao Paulo available at https://m3g.github.io/packmol/. Solvating the extended form of nucleic acids using VMD may result in excessively large molecular systems, often in the millions of atoms, due to cubic solvation of the maximum and minimum Cartesian coordinates of the nucleic acids. Instead, Packmol was utilized to construct a sphere around the nucleic acid structure that was then filled uniformly with a definite number of water molecules and magnesium ions. The quantity of water molecules and magnesium ions is chosen according to the length of the nucleic acid chain so as not to overload the system.

[0054] Once solvation and ionization are accounted for, which is also referred to herein as determining a three-dimensional structure of the ssNA molecule, a fourth part 104 of the pipeline performs the compaction of the molecule.

[0055] The fourth part 104 may comprise a molecular dynamics software package suitable to simulate large biomolecular systems, involving millions of atoms, by leveraging parallel computing techniques, which allows it to efficiently utilize hundreds to thousands of processor cores, making it suitable for high-performance computing environments. One functionality of the software may involve calculating the forces and subsequent movements of atoms within a molecular system based on classical mechanics principles. It may employ a variety of algorithms to handle different aspects of molecular dynamics simulations, such as energy minimization, temperature and pressure control, and long-range electrostatics. For instance, the software may use the Particle Mesh Ewald (PME) method to efficiently compute long-range electrostatic interactions.

Compaction

[0056] When it comes to compacting the ssNA molecule, the software package can be utilized to perform steered molecular dynamics simulations. In these simulations, virtual forces are applied to elements of the ssNA, such as specific atoms or groups of atoms, to induce conformational changes in the molecule. This happens in the simulator environment using the three-dimensional structure from the third part 103. The virtual force is applied towards an interior point of the three-dimensional structure. This could be the geometrical centre, centre of gravity, centre of the sphere of third part 103 or another interior point of the three-dimensional structure. The simulator environment is configured, by way of adding water, ions, etc., to simulate a chemical environment within the vector, such as within the AAV capsid. The software then simulates, in the simulator environment, the compaction of the three-dimensional structure by the force applied to the elements of the ssNA molecule to calculate a spatial characteristic, related to packaging the ssNA molecule into the vector, of a compacted three-dimensional structure of the ssNA molecule. For example, to compact the ssNA molecule, one might apply a pulling force to the ends of the ssNA strand, gradually bringing them closer together. This process involves defining a reaction coordinate, such as the distance between the ends of the ssNA. The force applied can be controlled to mimic experimental or production conditions, allowing researchers to observe how the ssDNA molecule responds to compaction forces at an atomic level.
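
One spatial characteristic named in this disclosure is the radius of gyration. The sketch below computes it from atom coordinates using the standard definition (the root of the mass-weighted mean squared distance from the centre of mass). The function name, the optional equal-mass default and the toy coordinates are assumptions made for illustration; the disclosure does not specify how the value is extracted from the simulation output.

```python
import math


def radius_of_gyration(coords, masses=None):
    """Radius of gyration of a set of 3D points.

    coords: sequence of (x, y, z) tuples.
    masses: optional sequence of positive weights; equal weights if omitted.
    """
    if not coords:
        raise ValueError("at least one coordinate is required")
    if masses is None:
        masses = [1.0] * len(coords)
    total_mass = sum(masses)
    # Centre of mass, one component at a time.
    centre = [sum(m * c[k] for m, c in zip(masses, coords)) / total_mass
              for k in range(3)]
    # Mass-weighted sum of squared distances from the centre of mass.
    weighted_sq = sum(
        m * sum((c[k] - centre[k]) ** 2 for k in range(3))
        for m, c in zip(masses, coords)
    )
    return math.sqrt(weighted_sq / total_mass)


# Two points one unit either side of the origin.
rg_pair = radius_of_gyration([(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)])
```

Comparing such a value before and after compaction gives a single number that tracks how much the structure has shrunk.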

[0057] During the simulation, the software calculates the forces acting on each atom in the ssNA molecule at each time step, updating their positions and velocities accordingly. The software uses a combination of bonded and non-bonded force calculations to account for interactions within the molecule and with the surrounding environment. Bonded interactions include bond stretching, angle bending, and dihedral angle torsions, while non-bonded interactions encompass van der Waals forces and electrostatic interactions. By iteratively solving Newton's equations of motion for all atoms in the system, the software simulates the dynamic behaviour of the ssNA molecule as it undergoes compaction. One package that can be used in the fourth part of the pipeline is the software NAMD, which stands for Nanoscale Molecular Dynamics, from the University of Illinois Urbana-Champaign (https://www.ks.uiuc.edu/Research/namd/).
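
The per-time-step update of positions and velocities can be illustrated with a velocity Verlet step, a widely used scheme for integrating Newton's equations of motion. This is a minimal sketch under stated assumptions: the disclosure does not say which integrator the simulation package applies, and the function names and the harmonic toy force are hypothetical.

```python
def velocity_verlet_step(positions, velocities, masses, force_fn, dt):
    """Advance all atoms by one time step dt.

    positions, velocities: lists of [x, y, z] lists, one per atom.
    masses: list of atom masses.
    force_fn: callable returning a list of [fx, fy, fz] for given positions.
    """
    forces = force_fn(positions)
    # Half-step velocity update using the forces at the current positions.
    half_vel = [[v[k] + 0.5 * dt * f[k] / m for k in range(3)]
                for v, f, m in zip(velocities, forces, masses)]
    # Full-step position update.
    new_pos = [[p[k] + dt * hv[k] for k in range(3)]
               for p, hv in zip(positions, half_vel)]
    # Second half-step velocity update using the forces at the new positions.
    new_forces = force_fn(new_pos)
    new_vel = [[hv[k] + 0.5 * dt * f[k] / m for k in range(3)]
               for hv, f, m in zip(half_vel, new_forces, masses)]
    return new_pos, new_vel


def harmonic_force(positions):
    """Toy force pulling each atom towards the origin (spring constant 1)."""
    return [[-c for c in p] for p in positions]


pos, vel = [[1.0, 0.0, 0.0]], [[0.0, 0.0, 0.0]]
for _ in range(1000):
    pos, vel = velocity_verlet_step(pos, vel, [1.0], harmonic_force, 0.01)
```

In a real simulation the force function would combine the bonded and non-bonded terms described above with any applied compaction force.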

[0058] In some examples, it has been found that a direct application of a force sufficiently strong to compact the ssNA molecule can lead to structural damage of the ssNA molecule before it is suitably compacted in the simulator. To address this problem, it is possible to simulate the compaction by iteratively simulating the compaction until a convergence criterion is met and in each iteration, increasing the force applied to the elements of the ssNA molecule. This provides for a gentler compaction that leads to a greater reduction in size before the molecule breaks. This breaking of the molecule, i.e. loss of structural integrity, can be used as a convergence criterion in the sense that the result of the last iteration is used, where the molecule still has its structural integrity.

Boundary conditions

[0059] One particularly beneficial way of applying the force iteratively is by defining a boundary structure in the simulator environment. This way, the elements of the molecule are subjected to a force that depends on their location relative to the boundary. The result is a more homogenous and well controlled application of the force and thus a more advantageous compaction result. In some examples, the force is applied to the elements of the molecule that are outside the boundary so that they are forced into the boundary. In other examples, the force is applied to elements of the molecule that are within the boundary to further compact the elements of the molecule that are within the boundary. Once the result of the applied force is calculated, it is possible to reduce the size of the boundary structure. It is another advantage of the boundary structure that its size can be reduced relatively easily by adjusting a single parameter, such as a radius of a spherical boundary or an edge length of a cubical boundary. The force is then increased implicitly since the elements of the molecule now have a changed relative location with respect to the boundary condition.

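[0060] The simulator can use spherical boundary conditions with one or two potential functions. The potential energy is:

$$V(\vec{r}_i) = \begin{cases} 0, & |\vec{r}_i - \vec{r}_c| \le r_b \\ k\,\left(|\vec{r}_i - \vec{r}_c| - r_b\right)^{e}, & |\vec{r}_i - \vec{r}_c| > r_b \end{cases}$$

where
$k$ = force constant
$\vec{r}_i$ = current position of atom $i$
$\vec{r}_c$ = centre of the spherical boundary
$r_b$ = radius of the spherical boundary
$e$ = exponent for the potential function

The force applied is:

$$\vec{F}_i = k\,e\,\left(|\vec{r}_i - \vec{r}_c| - r_b\right)^{e-1}\,\hat{r}_i \quad \text{for } |\vec{r}_i - \vec{r}_c| > r_b$$

where
$\hat{r}_i$ = unit vector in the direction from atom $i$ to the centre of the sphere.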

[0061] A positive force constant pulls atoms to the sphere's centre. In one example, spherical boundary conditions are used to impose a spheroidal configuration on the system. The geometrical centre of this configuration may be anchored at the origin, (0, 0, 0).
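
As a hedged illustration of such a spherical boundary, the sketch below assumes the potential is zero inside the boundary and equals k times (distance beyond the boundary) raised to the exponent outside it, with the force obtained as the negative gradient of that potential. The function name and argument names are hypothetical; they are not the configuration keywords of any particular simulation package.

```python
import math


def spherical_boundary_force(position, centre, radius, k, exponent):
    """Return (energy, force_vector) for one atom under the assumed potential.

    The force points from the atom towards the centre when k is positive,
    and is zero for atoms inside or on the boundary.
    """
    # Vector from the atom to the centre of the sphere.
    to_centre = [c - p for p, c in zip(position, centre)]
    distance = math.sqrt(sum(d * d for d in to_centre))
    excess = distance - radius
    if excess <= 0.0:
        return 0.0, (0.0, 0.0, 0.0)
    energy = k * excess ** exponent
    magnitude = k * exponent * excess ** (exponent - 1)
    unit = [d / distance for d in to_centre]
    return energy, tuple(magnitude * u for u in unit)


# Atom 2 units outside a sphere of radius 10 centred at the origin.
outside_energy, outside_force = spherical_boundary_force(
    (12.0, 0.0, 0.0), (0.0, 0.0, 0.0), 10.0, 0.5, 2
)
```

With an exponent of 2 the restoring force grows linearly with the distance beyond the boundary, so atoms far outside are pulled in more strongly than atoms just outside.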

[0062] In a further improvement, the boundary structure may comprise a first boundary and second boundary. The first boundary may be smaller than the second boundary and the first boundary may lie within the second boundary. In that sense, the first boundary may be referred to as an inner boundary and the second boundary may be referred to as an outer boundary. The force applied by the two boundary conditions is calculated as set out above. Two concentric spherical boundaries can be defined comprising an inner one with a radius of 10 units and an outer one extending up to 50 units. Initially, both these boundaries are subjected to a nominal force constant of 0.000001, characterized by an exponential factor of 2. As the simulation progresses, the force constants are systematically increased in stages, culminating in a value of 0.025. This final state represents a compact, spheroidal nucleic acid structure, akin to a tightly packed ball. When a loss in structural integrity is observed, the molecule from the previous iteration is used for further modelling. That is, the loss of structural integrity is used as a convergence criterion of the iterative method.
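
The staged increase of the force constant and the fallback to the last intact structure can be sketched as a simple driver loop. The geometric spacing of the stages, the number of stages, and the callbacks run_stage and is_intact are assumptions introduced for illustration; the disclosure gives only the starting value (0.000001) and the final value (0.025).

```python
def force_constant_schedule(start, end, stages):
    """Geometrically spaced force constants from start to end (an assumption)."""
    if stages < 2:
        raise ValueError("at least two stages are required")
    ratio = (end / start) ** (1.0 / (stages - 1))
    return [start * ratio ** i for i in range(stages)]


def compact_in_stages(structure, schedule, run_stage, is_intact):
    """Apply increasing force constants until integrity is lost.

    run_stage(structure, k): hypothetical callback that runs one compaction
        stage with force constant k and returns the resulting structure.
    is_intact(structure): hypothetical callback that returns False once the
        molecule has lost its structural integrity.
    Returns the last structure that was still intact.
    """
    last_intact = structure
    for k in schedule:
        candidate = run_stage(last_intact, k)
        if not is_intact(candidate):
            break  # convergence criterion: keep the previous iteration
        last_intact = candidate
    return last_intact


schedule = force_constant_schedule(1e-6, 0.025, 5)
# Toy stand-ins: the "structure" is just a size that halves at every stage
# and counts as broken once it drops below 1.0.
final_size = compact_in_stages(8.0, schedule, lambda s, k: s * 0.5, lambda s: s >= 1.0)
```

In an actual pipeline each stage would be a molecular dynamics run with both boundaries active, and the integrity check would inspect the bonded geometry of the molecule.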

[0063] Figure 2 illustrates the concept of boundaries, starting with a nucleotide structure 201 generated by Packmol, including ionisation and solvation, and going through packaging in a first step 202, second step 203, third step 204 and fourth step 205, resulting in a final structure 206. As can be seen in the figure, there is an outer boundary 207 and an inner boundary 208 labelled in first step 202. In each step, the size, such as the radius, of the outer boundary 207 and the inner boundary 208 is reduced. Further, the force acting on the atoms relative to the boundaries is increased in each step. This means that the force at the beginning of the compaction at first step 202 is relatively weak so as not to break the molecule. However, it has been found that the molecule can withstand stronger forces towards the end of the compaction without breaking.

[0064] It is noted that while some examples herein relate to spherical boundary conditions and spherical potential functions, other configurations are equally possible. That is, the boundaries may have any shape, including the shape of an ellipsoid, cylinder, polyhedron, egg and others. The force applied to each element may be governed by any force field. This force may depend on the distance from the centre equally in any direction, or differently for each direction, which is advantageous in combination with ellipsoid boundaries, such as $ax^2 + by^2 + cz^2$, where x, y, z are the distances from the centre in the x, y, z directions respectively, which may be along the ellipsoid axes. The result may be normalised. Further, a general force field can be used that is described by an arbitrary function. In that case, there may be no boundary since the relationship between distance and force may be defined explicitly in the force field definition. This allows the selection of boundary conditions and force fields that best suit the application. For example, the boundary condition can be chosen so that it matches the shape of the capsid of the vector to be used, e.g. prolate boundary conditions for a prolate capsid shape.
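For the ellipsoidal case, the direction-dependent distance measure can be sketched as below. The sketch assumes that the coefficients are chosen as a = 1/A^2, b = 1/B^2 and c = 1/C^2 for semi-axes A, B, C, so that the normalised value equals 1 on the boundary; this choice, the function name and the example semi-axes are assumptions made for illustration only.

```python
import math

def ellipsoid_level(point, semi_axes, centre=(0.0, 0.0, 0.0)):
    """Return sqrt(a*x^2 + b*y^2 + c*z^2) with a=1/A^2, b=1/B^2, c=1/C^2.

    Values above 1 lie outside the ellipsoid, values below 1 inside.
    """
    return math.sqrt(
        sum(((p - c) / s) ** 2 for p, c, s in zip(point, centre, semi_axes))
    )

prolate = (10.0, 10.0, 15.0)  # hypothetical semi-axes, long axis along z
inside = ellipsoid_level((0.0, 0.0, 12.0), prolate)   # below 1: inside the boundary
outside = ellipsoid_level((12.0, 0.0, 0.0), prolate)  # above 1: outside the boundary
```

The same displacement of 12 units is inside the boundary along the long axis but outside it along a short axis, which is the behaviour wanted for a prolate capsid shape.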

[0065] Subsequent to the formation of the nucleic acid into a spherical (or differently shaped) structure, there may be a series of canonical production simulations utilizing the NAMD software package to facilitate molecular equilibration and stabilization. In particular, the final solvated and compacted structure is used for a series of NAMD simulations to achieve molecular equilibration. More specifically, the NAMD simulation starts with an equilibration step to remove small tangles. It then goes through production runs until the radius of gyration (rg) reaches a plateau, which means that the molecule is relatively stable. Throughout these simulations, the rg is monitored to identify the stabilization point of the molecule, which is marked by a plateau in its structural changes. This enables the determination of the stability of the molecule by assessing the changes in rg over time.
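One possible way to detect the plateau in the radius of gyration is sketched below. The text does not specify how the plateau is identified, so the sliding window, the relative tolerance and the example trajectory are all assumptions chosen for illustration.

```python
def plateau_index(rg_series, window=5, tolerance=0.01):
    """Return the first index at which rg varied by less than `tolerance`
    (relative to the window mean) over the last `window` samples, else None."""
    for end in range(window, len(rg_series) + 1):
        chunk = rg_series[end - window:end]
        mean = sum(chunk) / window
        if mean > 0 and (max(chunk) - min(chunk)) / mean < tolerance:
            return end - 1
    return None

# Hypothetical rg values sampled along successive production runs
rg = [40.0, 35.0, 31.0, 28.5, 27.0, 26.2, 26.1, 26.15, 26.1, 26.12, 26.11]
idx = plateau_index(rg)
```

If no window satisfies the tolerance, the function returns `None`, which would indicate that further production runs are needed.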

[0066] In conclusion, the disclosed approach offers a comprehensive computational model for nucleic acid encapsulation, highlighting the intricate dynamics of nucleotide packaging. This framework has the potential to guide experimental designs, validate theoretical models, and contribute to innovations in therapeutic and diagnostic applications. It represents a significant advancement towards understanding nucleic acid dynamics in various bioencapsulation settings.

Computer implemented method

[0067] Figure 3 illustrates a computer-implemented method 300 for determining spatial characteristics of a single stranded nucleic acid (ssNA) molecule for packaging the ssNA molecule into a vector, such as an AAV. The method 300 comprises determining 301 a three-dimensional structure of the ssNA molecule. As described above, this may comprise a pipeline of software applications, such as RNAfold, Rosetta and Packmol. This three-dimensional structure may be the structure that the ssNA molecule assumes under realistic environmental conditions, including the presence of water molecules and ions, such as salts. This three-dimensional structure may be determined for the situation where there is an absence of external forces acting on the molecule, such as an absence of spatial constraints.

[0068] The method 300 then comprises applying 302 a virtual force to elements of the ssNA molecule. These elements of the ssNA molecule may be atoms, such that the virtual force is applied to every atom in the molecule or to every atom that is within a spatial area (inside or outside a boundary). In other examples, the force is only applied to specific atoms, such as phosphorus atoms. In yet another example, the force is applied to parts of the molecule consisting of multiple atoms, such as hydroxyl groups or other parts of the phosphate-deoxyribose backbone. The virtual force is applied in a simulator environment (in silico) using the three-dimensional structure provided by the previous step. This is the meaning of ‘virtual’ force in that the force is not physically applied to the molecule in a lab but rather the force is applied virtually, by a computer in the software-based simulator environment. It is noted that the computer architecture for the simulation of the molecule may be of a range of different types, such as program modules written in source code, such as C++, to be compiled for a central processing unit (CPU), or interpretable code, such as Python, executed by an interpreter. Other options are field programmable gate arrays (FPGAs), graphics processing units (GPUs) and application-specific integrated circuits (ASICs). Even non-digital architectures are possible, including quantum computing or quantum simulation where the Hamiltonian of the system is replicated by a number of qubits.

[0069] Further, the virtual force is applied towards an interior point of the three-dimensional structure. As mentioned above, the interior point may be the geometric centre, centre of gravity or any other point that is interior of the molecule such that the force moves the elements of the molecule closer together. It is further noted that the simulator environment is configured to simulate the chemical environment within the vector, which may mean that chemical constituents of the chemical environment are also added virtually to the simulator environment. These constituents may be the same as or different from those added for estimating the three-dimensional structure of the molecule. That is, the chemical environment may comprise water and ions and those may already be present as a result of the previous steps of the pipeline.

[0070] The simulator then simulates 303, in the simulator environment, a compaction of the three-dimensional structure by the force applied to the elements of the ssNA molecule. That is, the simulator calculates how the molecule becomes smaller so as to determine whether it can be compacted to the size that is available in the vector. The calculation is performed with the aim of calculating a spatial characteristic of the compacted three-dimensional structure of the ssNA molecule. This spatial characteristic, such as a radius of gyration, is related to packaging the ssNA molecule into the vector. Once the spatial characteristic is calculated, it can be used to decide whether to proceed with manufacturing the given ssNA molecule for clinical trials, for example. This way, the method may comprise repeating the steps 301, 302, 303 for each candidate ssNA molecule. The advantage is that a large number of molecules can be tested for possible compaction before a difficult lab experiment confirms their suitability.

Computer system

[0071] Figure 4 illustrates a computer system 400 for determining spatial characteristics of a single stranded nucleic acid (ssNA) molecule for packaging the ssNA molecule into a vector. The computer system 400 comprises a processor 401, a non-volatile computer readable medium working as program memory 402, and data memory 403. Program memory 402 has program code stored thereon to cause processor 401 to execute the methods disclosed herein and in particular method 300. The computer system 400 may be implemented with a single processor, multiple processor cores, multiple processors and other computing platforms as set out above. Data memory 403 may also be non-transitory, such as a hard disk, solid state disk or cloud memory. Data memory 403 may store a range of data, including ssNA sequence data, spatial characteristics, training samples and a machine learning model. The machine learning model may be stored on data memory 403 in the form of parameter values, which may be adjusted in the course of training the machine learning model.

[0072] By executing the instructions stored on program memory 402, processor 401 determines a three-dimensional structure of the ssNA molecule and applies a virtual force to elements of the ssNA molecule in a simulator environment using the three-dimensional structure. Again, the virtual force is applied towards an interior point of the three-dimensional structure and the simulator environment is configured to simulate a chemical environment within the vector. Further, processor 401 simulates, in the simulator environment, a compaction of the three-dimensional structure by the force applied to the elements of the ssNA molecule to calculate a spatial characteristic. The spatial characteristic is related to packaging the ssNA molecule into the vector. Further, the spatial characteristic is of a compacted three-dimensional structure of the ssNA molecule.

Gene therapy

[0073] As mentioned above, the ssNA molecule can be used in a vector for applying gene therapy. Gene therapy is a technique that aims to treat or prevent diseases by introducing genetic material into cells. The genetic material can be either DNA or RNA, and it can carry instructions to correct or replace faulty genes, enhance the expression of beneficial genes, or interfere with the activity of harmful genes. One of the challenges of gene therapy is to deliver the genetic material efficiently and safely to the target cells, without causing unwanted immune responses or side effects. One possible way to achieve this is to use a vector, which is a vehicle that can carry and protect the ssNA inside. Vectors can be derived from viruses, bacteria, plasmids, or synthetic materials, and they have different advantages and disadvantages depending on their size, stability, specificity, immunogenicity, and transfection efficiency. The choice of the vector depends on the type and location of the target cells, the nature and amount of the genetic material, and the desired duration and outcome of the gene therapy. Within the vector, the ssNA molecule can fold into a compact and stable three-dimensional structure that can fit inside the vector and resist degradation by enzymes. The compacted three-dimensional structure of the ssNA molecule can also influence its interaction with the target cells and the expression of the encoded genes.

[0074] The size of a vector virus affects the use of gene therapy in several ways. First, the size of the vector virus determines how much genetic material it can carry and deliver to the target cells. Larger vector viruses can accommodate longer or more complex ssNA molecules, which may be desirable for certain gene therapy applications. However, larger vector viruses may also have lower transfection efficiency, meaning that they may infect fewer target cells or transfer fewer ssNA molecules per cell. Second, the size of the vector virus influences its stability and immunogenicity, which are factors for the safety and efficacy of gene therapy. Smaller vector viruses may be more stable and less prone to degradation by enzymes or environmental factors, which can improve their shelf life and viability. Smaller vector viruses may also elicit fewer immune responses or less inflammation, which can reduce the risk of adverse reactions or complications. However, smaller vector viruses may also have lower specificity, meaning that they may infect non-target cells or tissues, which can cause unwanted effects or off-target gene expression. Therefore, the size of the vector virus may be a trade-off between different aspects of gene therapy. This disclosure provides a methodology for determining whether a given ssNA molecule fits into a given virus so as to aid the selection of the most promising avenue for gene therapy.

[0075] The size of the vector virus may be measured and quantified by the diameter of the viral capsid, which is the protein shell that surrounds and protects the genetic material inside. The diameter of the viral capsid can be determined by various methods, such as electron microscopy, dynamic light scattering, or atomic force microscopy. These methods can provide information about the shape, size distribution, and surface morphology of the viral capsids. The diameter of the viral capsid is a parameter for assessing the suitability of a virus for gene therapy, as it affects the amount of ssNA that can be packaged into the vector, as well as the stability, immunogenicity, and specificity of the vector.

[0076] One of the spatial characteristics of a ssNA molecule that can be used to determine whether it can fit into a vector is the radius of gyration. The radius of gyration is a measure of the size and shape of a polymer, such as a ssNA molecule, in solution. It is defined as the root mean square distance of the segments of the polymer from its centre of mass. The radius of gyration can help to determine whether a ssNA molecule can fit into a vector by comparing it with the diameter of the vector. If the radius of gyration of the ssNA molecule is smaller than half of the diameter of the vector, it means that the ssNA molecule can be packaged into the vector without being stretched or compressed. However, if the radius of gyration of the ssNA molecule is larger than half of the diameter of the vector, it means that the ssNA molecule cannot be packaged into the vector without being distorted or damaged. Therefore, the radius of gyration can provide a criterion for selecting the appropriate vector for a given ssNA molecule, or for designing the optimal ssNA molecule for a given vector.

Machine learning

[0077] While the above disclosure provides for a single run of calculating the spatial characteristic of a compacted ssNA molecule, it is also possible to repeatedly perform the steps of determining the three-dimensional structure, applying the virtual force, and simulating the compaction for multiple different ssNA molecules to calculate multiple respective spatial characteristics. As a result, there are a number of data samples that each have an input ssNA sequence and an output spatial characteristic. The output spatial characteristic can be considered a label for the input ssNA data sample.

[0078] As such, it is possible to use the multiple respective spatial characteristics to train a machine learning model. The structure of the machine learning model is such that it is configured to receive, at an input, a description of the ssNA molecule, such as the ssNA sequence. While in some examples the ssNA sequence is the only input to the machine learning model, other models may receive other data at their input. At its output the ML model is configured to generate the spatial characteristic, such as the radius of gyration. The training is iterative so that the parameters of the machine learning model are adjusted to minimise the error between the current output and the label of the current sample. The result is a trained machine learning model that can be used for later evaluation.

[0079] More particularly, the trained machine learning model may be applied to a new ssNA molecule to generate spatial characteristics of the new ssNA molecule. In that case, applying the trained machine learning model is computationally less expensive than performing the steps of determining the three-dimensional structure, applying the virtual force, and simulating the compaction of the new ssNA molecule. Therefore, the ML model can significantly accelerate the calculation and evaluation of many ssNA molecules. In one example, the ML method may be less accurate but much faster than the simulation described herein. Therefore, the ML method can be used to screen a very large number of molecules quickly. The more accurate simulation method disclosed herein can then be applied to the remaining molecules for a more accurate assessment.

[0080] The ML model may comprise Support Vector Regression, Random Forest or logistic regression. The ML model may also comprise a neural network, such as a deep neural network, recurrent neural network or convolutional neural network. Other types of ML model may equally be usable.

[0081] Figure 5 illustrates a computer-implemented machine learning method 500 for determining spatial characteristics of a single stranded nucleic acid (ssNA) molecule for packaging the ssNA molecule into a vector. Method 500 may be performed by processor 401 or a different computer system. Processor 401 accesses 501 a trained machine learning model, such as by reading adjusted parameters from data memory 403. The trained machine learning model is configured to receive, at an input, a description of the ssNA molecule and generate, at an output, the spatial characteristic of a compacted three-dimensional structure of the ssNA molecule. Further, the machine learning model is trained on multiple training samples, which may be generated by the methods disclosed herein, including method 300. Processor 401 then applies 502 the trained machine learning model to the ssNA sequence to generate the spatial characteristics of the compacted three-dimensional structure of the ssNA molecule as disclosed herein.

[0082] It may be technically difficult to validate the output of the method because determining a radius of gyration experimentally is difficult for some labs. However, the packaging efficiency, or the yield, can be determined more readily. Therefore, it is possible to validate the calculated spatial characteristic against packaging efficiency or yield determined in a lab experiment.

[0083] During evaluation, processor 401 accesses the trained machine learning model. The trained machine learning model is configured to receive, at an input, a description of the ssNA molecule and generate, at an output, the spatial characteristic of a compacted three-dimensional structure of the ssNA molecule. Further, the machine learning model is trained on multiple training samples calculated as described with reference to Figure 3. Further, processor 401 applies the trained machine learning model to the ssNA molecule to generate the spatial characteristics of the compacted three-dimensional structure of the ssNA molecule. Each of the multiple samples may comprise a description of an ssNA molecule as the input and a corresponding simulated spatial characteristic as the label for the output.

Experiments

[0084] Figure 6 illustrates the relationship between the length of DNA sequences and their radius of gyration. As can be observed in this plot, a sequence of 1000 nucleotides can have the same radius of gyration as a sequence of 3000 nucleotides. This indicates the usefulness of the disclosed method: because it calculates the radius of gyration reliably, sequence length does not need to be the deciding factor.

[0085] Different models, such as Support Vector Regression, Random Forest and logistic regression, have been tried as the ML model to train in accordance with this disclosure. The Random Forest algorithm initially showed promise, with performance metrics that suggested a high level of accuracy. However, a closer look at the learning curves revealed that the model was overfitting. In contrast, the Support Vector Regression model not only provided robust metrics but also displayed a learning curve indicative of excellent generalization, without discernible overfitting. The table below provides some results:

Model                 CV Avg MSE   CV Std MSE   MSE      MAE     RMSE
Random Forest         119.95       72.75        83.64    6.87    9.14
Logistic regression   1000.27      994.85       676.67   20.36   23.7
SVR                   325.21       266.5        219.8    9.42    14.8

[0086] Our machine learning is based on the identification of appropriate features and target values. In evaluating ssNA packaging within AAV capsids, we considered various compactness measures as potential target values, including volume and radius of gyration (Rg). Rg is a measure that describes the distribution of the components of an object around its centre of mass. In the context of ssNA sequences, Rg provides insight into how tightly or loosely the sequence is packed. For example, a lower Rg value indicates that the nucleic acid sequence is more compactly folded, whereas a higher Rg value suggests a more extended conformation. This metric is particularly useful for understanding the spatial organization of ssNA within confined environments such as viral capsids. Rg was selected as the primary target variable due to its superior ability to capture spatial distribution differences. Unlike volume measurements, Rg effectively quantifies how ssNA molecules occupy three-dimensional space. Rg was measured for all sequences in the dataset.
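The radius of gyration used as the target value can be computed directly from atomic coordinates as the mass-weighted root mean square distance from the centre of mass. The sketch below is a minimal pure-Python version of that definition; the function name and the toy coordinates are illustrative, and equal masses are assumed when none are given.

```python
import math

def radius_of_gyration(coords, masses=None):
    """Mass-weighted root mean square distance of points from their centre of mass."""
    if masses is None:
        masses = [1.0] * len(coords)  # assumption: equal masses if none are supplied
    total = sum(masses)
    com = [sum(m * c[i] for m, c in zip(masses, coords)) / total for i in range(3)]
    weighted_sq = sum(
        m * sum((c[i] - com[i]) ** 2 for i in range(3))
        for m, c in zip(masses, coords)
    )
    return math.sqrt(weighted_sq / total)

# Toy example: the eight corners of a cube with edge length 2 centred at the origin
corners = [(x, y, z) for x in (-1.0, 1.0) for y in (-1.0, 1.0) for z in (-1.0, 1.0)]
rg = radius_of_gyration(corners)  # every corner is sqrt(3) from the centre
```

Because the distances are taken relative to the centre of mass, the value does not change when the whole structure is translated.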

[0087] We engineered a set of features to capture the multifaceted attributes of ssNA sequences that potentially influence capsid interactions. We represented the ssNA sequence composition by nucleotide frequency, GC content, sequence conservation and sequence entropy. Furthermore, we incorporated features derived from the ssNA's secondary structure, including quantitative measures of specific structural elements such as loops, unpaired nucleotides and free energy. Additionally, we incorporated features representing tertiary structure and diverse molecular properties. For tertiary structure, we included metrics such as solvent-accessible surface area, volume, and sphericity. Lastly, we employed feature extraction tools like Seq2Feature to capture a broader range of physicochemical, energetic, and conformational properties of the DNA sequences.

Machine learning model training and testing

[0088] Scikit-learn, a free Python machine learning library, was used to conduct all model training and testing procedures. Other Python libraries, including numpy, pandas, and matplotlib, were also used for data analysis and presentation. Specifically, matplotlib.pyplot was used to generate the learning curve, actual vs. predicted plot, residual plot and Q-Q plot. For all models, the random state was fixed at 42.

[0089] To evaluate the performance of all models, the training dataset was first subjected to fivefold cross-validation using multiple machine learning algorithms. LinearRegression was used to construct the linear-regression model. RandomForestRegressor was used to construct the random forest model with 100 decision trees. Permutation importance was performed to carry out feature selection.

[0090] For the support vector regression (SVR) model, a radial basis function (RBF) kernel was chosen, and the model was configured with a regularization parameter C = 1.0 and an epsilon value of 0.1. For all deep neural network models, we employed the Adam optimizer and utilized mean squared error as the loss function.
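The cross-validation set-up described above can be sketched with scikit-learn as follows. This is a hedged sketch rather than the code used for the reported results: the synthetic features and target are stand-ins for the real dataset, shuffling of the folds is an assumption, and permutation-based feature selection is omitted. The model settings (100 trees, RBF kernel, C = 1.0, epsilon = 0.1, random state 42, five folds) follow the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVR

rng = np.random.default_rng(42)
X = rng.normal(size=(60, 8))  # synthetic stand-in for the extracted features
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=60)  # synthetic target

models = {
    "linear regression": LinearRegression(),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "SVR (RBF)": SVR(kernel="rbf", C=1.0, epsilon=0.1),
}
# Five-fold cross-validation; shuffling is an assumption not stated in the text
cv = KFold(n_splits=5, shuffle=True, random_state=42)

results = {}
for name, model in models.items():
    # scikit-learn reports negated MSE so that larger scores are always better
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    mse = -scores
    results[name] = (mse.mean(), mse.std())
```

Each entry of `results` holds the mean and standard deviation of the per-fold MSE, which correspond to the "CV Avg MSE" and "CV Std MSE" style of reporting.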

[0091] For the Convolutional Neural Network (CNN) analysis, we implemented four distinct approaches. The first approach utilized only the nucleotide sequence as input. The second approach employed structural information alone. In the third approach, we combined nucleotide sequence and structural data into a single integrated feature for model training. Finally, we implemented a dual-input strategy where sequence and structure were processed as separate features within the same model architecture. For each approach, the first convolutional layer used 32 filters, a kernel size of 3, and a ReLU activation function. The convolutional layer was followed by a max pooling layer with a pool size of 2, and then by a fully connected dense layer with 64 units and a ReLU activation function.

[0092] For the Recurrent Neural Network (RNN) analysis, we implemented two distinct approaches: one utilizing only the nucleotide sequence as input, and another using solely the structural information as the feature set. Architectures including Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) were used with an input size of 5, hidden size of 32, 2 layers, embedding size of 8, and an output size of 1.

[0093] For the hybrid CNN-RNN model, we implemented the same four approaches as in the CNN models. This hybrid architecture comprised a 1D convolutional layer with 32 filters and a kernel size of 3, followed by a ReLU activation function and a max pooling layer with a kernel size of 2. The output was then fed into a Long Short-Term Memory (LSTM) layer with 64 units. Finally, two fully connected layers were added: one with 64 units and another with a single unit for output.

Results

[0094] The dataset comprised 163 ssDNA sequences with lengths ranging from 286 to 5760 nucleotides. From these sequences, 130 features were extracted based on their sequence, secondary, and tertiary structures.
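As an illustration of the sequence-derived part of such a feature set, the sketch below computes nucleotide frequencies, GC content and a sequence entropy for one sequence. The use of Shannon entropy over the four nucleotide frequencies is an assumption, since the text does not define the entropy measure, and the function name and example sequence are hypothetical.

```python
import math
from collections import Counter

def sequence_features(seq):
    """Nucleotide frequencies, GC content and Shannon entropy (in bits) of a DNA sequence."""
    seq = seq.upper()
    counts = Counter(seq)
    n = len(seq)
    freqs = {base: counts.get(base, 0) / n for base in "ACGT"}
    gc_content = freqs["G"] + freqs["C"]
    # Assumption: entropy is the Shannon entropy of the nucleotide frequencies
    entropy = -sum(f * math.log2(f) for f in freqs.values() if f > 0)
    return {"freq": freqs, "gc_content": gc_content, "entropy": entropy}

features = sequence_features("ACGTACGTGGCC")
```

A sequence with all four nucleotides in equal proportion reaches the maximum entropy of 2 bits, while a homopolymer has an entropy of 0.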

[0095] Given the number of samples and the uncertainty regarding linearity or non-linearity, three regression algorithms were applied to the training dataset to systematically compare their predictive performances: linear regression (LR), random forest (RF), and support vector regression (SVR).

[0096] All models were subjected to five-fold cross-validation using four filtering criteria to balance model bias and variance. Target values were scaled using standardization to ensure a mean of zero and a standard deviation of one. Using standardized training data, the LR model yielded an R² score of -27.36 and a Mean Squared Error (MSE) of 2680.61. In contrast, the RF model with 100 decision trees demonstrated better metrics overall with an average MSE of 19.79, Mean Absolute Error (MAE) of 3.4, and an average Root Mean Squared Error (RMSE) of 3.95.
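The standardization step and the reported metrics can be made concrete with a short sketch. It uses the population standard deviation for scaling and the usual definitions of MSE, MAE, RMSE and R²; the Rg values and the deliberately poor predictions are hypothetical and only serve to show how R² can become strongly negative when a model fits worse than predicting the mean.

```python
import math

def standardize(values):
    """Scale values to zero mean and unit (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values], mean, std

def regression_metrics(actual, predicted):
    """Return MSE, MAE, RMSE and R^2 for paired actual and predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    ss_res = sum(e * e for e in errors)
    mse = ss_res / n
    mae = sum(abs(e) for e in errors) / n
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return {"mse": mse, "mae": mae, "rmse": math.sqrt(mse), "r2": 1.0 - ss_res / ss_tot}

rg_values = [20.0, 25.0, 30.0, 35.0, 40.0]  # hypothetical Rg values
scaled, mean, std = standardize(rg_values)
perfect = regression_metrics(scaled, scaled)
poor = regression_metrics(scaled, [-3.0 * v for v in scaled])  # anti-correlated predictions
```

For the anti-correlated predictions the residual sum of squares is sixteen times the total sum of squares, so R² is far below zero even though the targets themselves have unit variance.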

[0097] For the SVR model, an RBF kernel was applied to evaluate its predictive performance. The SVR model exhibited variability in performance across different folds, with an average RMSE of 7.12 and a standard deviation of 5.95.

[0098] The Convolutional Neural Network (CNN) model using the sequence as the sole feature showed a good fit with an average R² of 0.74, indicating it explained a substantial amount of variance in the data. It had a low error, with an average MAE of 0.30 and RMSE of 0.47, showing that the model's predictions were close to the actual values. While there was some variability in the performance metrics, the overall performance was consistent. The learning curves suggested no overfitting, as the validation loss closely followed the training loss across all folds.

[0099] The CNN model using structure as the feature demonstrated a good fit with an average R² of 0.66, explaining a substantial amount of variance in the data. The average RMSE was 0.47, and the MAE was 0.37, indicating that the model's predictions were reasonably close to the actual values. The standard deviations of these metrics suggested some variability in performance across the folds. The average MSE was 0.28, with a standard deviation of 0.66, reflecting the average squared error. The learning curves showed stable training and validation loss, indicating no significant overfitting. Residuals and Q-Q plots indicated some deviations from normality but overall demonstrated that the model captured the underlying data distribution well. The performance was consistent, with low standard deviations in the metrics, indicating robustness across different folds.

[0100] The CNN model using combined DNA sequence and structure as inputs through two channels demonstrates good predictive performance with an average RMSE of 0.57 and MAE of 0.49, indicating accurate predictions. The average R² value of 0.52 suggests that the model explains a substantial amount of variance in the data. Learning curves show stable training and validation loss, indicating the model is not overfitting. Residual and Q-Q plots display some deviations from normality but overall suggest the model captures the data distribution reasonably well.

[0101] The CNN model using both sequence and structure inputs through two channels demonstrates a good fit with average RMSE and MAE values of 0.54 and 0.44, respectively. The model performs well in capturing the general trend between actual and predicted values, as seen in the scatter plots. The learning curves indicate effective learning with no significant overfitting.

[0102] The Recurrent Neural Network (RNN) model using DNA sequence as input showed moderate predictive accuracy with an average RMSE of 0.96 and MAE of 0.80. The RNN model using DNA structure as input showed moderate predictive accuracy with an average RMSE of 0.97 and MAE of 0.81.

[0103] The hybrid CNN-RNN model using the sequence as the sole feature demonstrated high predictive accuracy with an average RMSE of 0.24 and MAE of 0.17. The model explained a substantial portion of the variance in the data, as indicated by the high average R² value of 0.93. The learning curves showed rapid convergence of both training and validation losses, suggesting effective model fitting and generalization. Figure 7 illustrates the performance evaluation of the hybrid CNN-RNN model with sequence input: (A) Q-Q plot of residuals, (B) actual vs. predicted plot, (C) learning curves, (D) residuals plot.

[0104] The hybrid CNN-RNN model using DNA structure as the sole feature showed reasonable predictive accuracy with an average RMSE of 1, an average MAE of 0.89, and an average MSE of 1.01. The hybrid CNN-RNN model using DNA sequence and structure as inputs through two channels demonstrated moderate performance with an average RMSE of 0.98 and MAE of 0.85. The hybrid CNN-RNN model using DNA sequence and structure as inputs through one channel exhibited moderate predictive accuracy with an average RMSE of 0.63 and MAE of 0.45, along with an average R² value of 0.52.

Discussion

[0105] This disclosure provides a machine learning (ML) tool to predict the spatial occupation of single-stranded nucleic acid (ssNA) sequences within biotechnological vectors, aiming to overcome the limitations of traditional methodologies. Utilizing molecular dynamics (MD) simulations, a comprehensive dataset is created encompassing a wide range of ssNA sequences, capturing features for model training.

[0106] Generating this dataset uses significant computational resources and time. MD simulations, while providing detailed and accurate structural data, are computationally intensive and difficult to scale up to produce large datasets rapidly. This constraint highlights the importance of optimizing the disclosed ML models to perform effectively with limited data.

[0107] Our results demonstrated the effectiveness of different ML models. Deep learning models, particularly Convolutional Neural Networks (CNNs), outperformed other algorithms. The CNN model using only sequence features achieved a high R² and low error rates, highlighting its capability to capture complex relationships. Incorporating structural features diminished performance. Recurrent Neural Networks (RNNs), including LSTM and GRU architectures, showed moderate accuracy.

[0108] The hybrid CNN-RNN model exhibited the best performance overall, particularly when using only sequence data as input. This model achieved high predictive accuracy, demonstrating effective model fitting and generalization. The hybrid architecture was able to leverage the strengths of both CNNs and RNNs, capturing both spatial and temporal features of the ssNA sequences, leading to superior performance. It effectively captures the intricate, non-linear relationships between sequence, structure, and spatial occupation, making it highly valuable for therapeutic vector design.

[0109] Integrating this ML tool into therapeutic vector design can improve packaging efficiency, reduce manufacturing costs, and expand the therapeutic scope of gene therapy and vaccine development. The disclosed methods provide a rapid, accurate, and scalable solution for predicting ssNA packaging space occupancy, representing a significant advancement in biotechnology. The insights gained from this study can facilitate the development of more effective ssNA-based therapeutics, contributing to advancements in gene therapy and vaccine delivery.

Example

[0110] Provided below is an example with further implementation details that may be altered without departing from the scope of this disclosure.

Sequence generation and secondary structure prediction

Single-stranded DNA sequences were generated computationally to span a range from 100 to 4,700 nucleotides. Each sequence was submitted to RNAfold (ViennaRNA Package v2.4.18) for secondary structure prediction, using the DNA-specific Mathews 2004 thermodynamic model (dna_mathews2004.par) (12,13). The prediction was performed with the flags -p, --noLP, -d2, and --noconv to compute base-pairing probabilities, suppress lone pairs, and prevent nucleotide conversion. The resulting dot-bracket notation of the minimum free energy (MFE) structure was extracted and cleaned for compatibility with downstream modelling.

Three-dimensional structure modelling

Tertiary structures were modelled using the rna_denovo protocol from the Rosetta software suite (14). The cleaned dot-bracket secondary structure and the DNA sequence were provided as inputs. A single 3D structure per sequence was generated (-nstruct 1), with energy minimization enabled via the -minimize_rna flag. The resulting low-energy structure was extracted from the Rosetta silent file and exported in PDB format.

Topology construction with VMD

The Rosetta-derived PDB structure was converted into simulation-ready topology and coordinate files using VMD’s psfgen module, executed through a custom script. This script loaded CHARMM36-compatible topology files for nucleic acids, proteins, lipids, carbohydrates, and solvent ions. Residue aliases were defined to reconcile naming differences between the input PDB and topology library (e.g., A to ADE, T to THY, HIS to HSE). A DNA segment (segment A) was built and patched with the DEOX modifier on all residues to ensure correct sugar representation. Bonded parameters were regenerated with regenerate angles dihedrals, and any missing atoms were added using guesscoord. The processed structure was saved as both PSF and PDB files.

Solvation and ionic environment construction

The DNA topology files were solvated using Packmol. A spheroidal volume was created and populated with TIP3P water molecules and Mg2+ ions, scaled proportionally to DNA length. Ionic concentrations of 5 mM, 15 mM, 50 mM, and 100 mM were tested to evaluate their influence on DNA compaction. Output files from Packmol were post-processed using a custom Python script to separate DNA, water layers, and ion groups. These components were then recombined into simulation-ready files, yielding final PDB and PSF files.

Molecular dynamics simulations

Each prepared system was simulated using a multi-step workflow invoking NAMD. The simulation pipeline included initial minimisation, thermal equilibration, and production runs. All simulations were performed under NPT conditions (310 K, 1 atm) using the CHARMM36 force field and explicit solvent. To mimic confinement within encapsulated environments, harmonic spherical restraints were applied. Two concentric spheres centred at the origin were defined, with radii of 10 and 50 units. Restraint forces were gradually increased in eight steps from 0.000001 to 0.025 kcal mol⁻¹ Å⁻², inducing progressive compaction of the DNA structure toward the centre.

Structural stability and compaction analysis

Simulation trajectories were analysed to evaluate compaction and structural stability. The radius of gyration (Rg) was computed every ten frames during the simulation. To quantify convergence, a numerical gradient of Rg was calculated over ten-frame intervals using the formula:

$$\text{Gradient} = \frac{R_g(t_{i+1}) - R_g(t_i)}{t_{i+1} - t_i}$$

where $t_i$ and $t_{i+1}$ are simulation frames separated by ten steps. A gradient magnitude below 0.02 was used as the threshold for equilibrium. This allowed identification of stable plateau phases indicating successful molecular compaction.

Codon optimisation
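By way of illustration only, the spatial characteristic on which the codon optimisation relies, namely the radius of gyration and its convergence test from the Example, may be sketched as follows. The function names, the optional mass weighting and the list-based inputs are assumptions made for this sketch; the disclosure specifies only that Rg is sampled every ten frames and that a gradient magnitude below 0.02 indicates equilibrium.

```python
import math


def radius_of_gyration(coords, masses=None):
    """Rg of a set of (x, y, z) points about their centre of mass.

    Equal masses are assumed when none are given (an assumption of this sketch).
    """
    if masses is None:
        masses = [1.0] * len(coords)
    total = sum(masses)
    centre = [sum(m * c[k] for m, c in zip(masses, coords)) / total for k in range(3)]
    weighted_sq = sum(
        m * sum((c[k] - centre[k]) ** 2 for k in range(3))
        for m, c in zip(masses, coords)
    )
    return math.sqrt(weighted_sq / total)


def rg_gradients(rg_values, frames):
    """Finite-difference gradient of Rg between consecutive sampled frames."""
    return [
        (rg_values[i + 1] - rg_values[i]) / (frames[i + 1] - frames[i])
        for i in range(len(rg_values) - 1)
    ]


def first_plateau_frame(rg_values, frames, threshold=0.02):
    """Return the first sampled frame whose incoming gradient magnitude is
    below the threshold, or None if no plateau is reached."""
    for i, gradient in enumerate(rg_gradients(rg_values, frames)):
        if abs(gradient) < threshold:
            return frames[i + 1]
    return None
```

In this sketch, `frames` would hold the frame indices at which Rg was sampled (for example 0, 10, 20, ...), so that consecutive entries are ten frames apart as described above.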

[0111] The above-described methods for determining the spatial characteristic of ssDNA can be used as a component in an optimisation method that optimises the package size over different codons.

[0112] A “codon” may be defined as a discrete nucleotide triplet unit within a nucleic acid sequence, wherein each codon comprises a contiguous sequence of three nucleotides selected from the canonical nucleotide bases. In the context of single-stranded nucleic acid molecules, the codon may be positioned sequentially along the polynucleotide chain, such that its constituent nucleotides are oriented in the 5’ to 3’ direction. The identity of each codon may be determined by the specific sequence arrangement of its three nucleotide bases, which may include any combination of adenine (A), cytosine (C), guanine (G), and thymine (T) for DNA, or uracil (U) in place of thymine for RNA.

[0113] Each codon may function as an informational unit that is interpreted during the process of translation, wherein cellular or cell-free translational machinery may utilise the codon sequence to specify the incorporation of an amino acid into a polypeptide chain. The mapping from codon sequence to amino acid identity may be governed by the genetic code, which is a defined correspondence between nucleotide triplets and amino acids.

[0114] It is noted that the same amino acid may be expressed by multiple different codons. Therefore, there is a degree of freedom in choosing the most appropriate codon for a particular amino acid. In the context of the current disclosure, this means that the codons can be swapped to reduce the size of the package. In other words, the ssNA molecule may be simulated to calculate the packaging size. Then, one codon may be swapped for a different one that expresses the same amino acid and the altered ssNA simulated again. If the packaging size is smaller, the altered codon is kept. If not, the original codon is kept. This way, this optimisation loop iterates over different codons to find a minimum packaging size.
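By way of illustration only, the swap-and-keep loop described above may be sketched as follows. The names `SYNONYMOUS`, `greedy_codon_swap` and `toy_packaging_size` are hypothetical. The codon table is an excerpt of the standard genetic code covering five amino acids only, and the toy scoring function merely stands in for the simulation or trained model that would calculate the packaging size in practice.

```python
# Excerpt of the standard genetic code (DNA alphabet); a real run would need
# all amino acids. Keys are one-letter amino acid codes.
SYNONYMOUS = {
    "A": ["GCT", "GCC", "GCA", "GCG"],
    "G": ["GGT", "GGC", "GGA", "GGG"],
    "K": ["AAA", "AAG"],
    "F": ["TTT", "TTC"],
    "M": ["ATG"],
}
CODON_TO_AA = {codon: aa for aa, codons in SYNONYMOUS.items() for codon in codons}


def greedy_codon_swap(codons, packaging_size):
    """Try each synonymous alternative at each position and keep a swap only
    if it lowers the value returned by packaging_size (a callable taking the
    nucleotide sequence as a string)."""
    best = list(codons)
    best_size = packaging_size("".join(best))
    for pos in range(len(best)):
        amino_acid = CODON_TO_AA[best[pos]]
        for alternative in SYNONYMOUS[amino_acid]:
            if alternative == best[pos]:
                continue
            trial = best[:pos] + [alternative] + best[pos + 1:]
            size = packaging_size("".join(trial))
            if size < best_size:  # smaller package: keep the altered codon
                best, best_size = trial, size
    return best, best_size


def toy_packaging_size(sequence):
    """Placeholder score (G + C count), NOT a model of compaction."""
    return sequence.count("G") + sequence.count("C")
```

A call such as `greedy_codon_swap(["ATG", "GCC", "AAG"], toy_packaging_size)` leaves the encoded amino acids unchanged while lowering the toy score. Because each position is treated in turn, this sketch does not account for dependencies between positions.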

[0115] One difficulty that may arise is that there may be multiple codons for multiple loci in the ssNA molecule. Further, those loci may not be independent. This means that a particular codon at one locus may reduce the size only if there is a particular codon at another locus. One option for finding the global minimum would be to iterate over all combinations of codons and select the combination that has the minimum packaging size.

[0116] However, in some examples with a large number of codons, the total number of combinations may be computationally prohibitive. In those cases, approximations or heuristics may be used, such as a genetic optimisation algorithm.

[0117] Figure 8 is a flow chart of an exemplary method 800 for codon optimisation, using the above-described methods for determining the spatial characteristic of ssDNA. Method 800 commences with a transgene amino acid sequence 801 that may correspond to 4.7 kb, i.e. about 1,566 amino acids. Generating all possible sequences with different codons 802 would result in about 10^791 combinations, which would involve heavy computational resources and a long time to simulate. Therefore, method 800 generates a predefined number, such as 1,000, of possible sequences at step 802 and passes them through the method disclosed herein 803 to calculate the packaging size for each sequence. The sequences in step 802 may be generated by sampling them randomly from all possible sequences, such as by sampling the codon for each locus independently.
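By way of illustration only, the size of the search space and the independent per-locus sampling of step 802 may be sketched as follows. The names `CODONS`, `log10_search_space` and `sample_sequences` are hypothetical; the table is the standard genetic code in the DNA alphabet, without stop codons.

```python
import math
import random

# Standard genetic code: one-letter amino acid -> synonymous DNA codons.
CODONS = {
    "A": ["GCT", "GCC", "GCA", "GCG"],
    "R": ["CGT", "CGC", "CGA", "CGG", "AGA", "AGG"],
    "N": ["AAT", "AAC"], "D": ["GAT", "GAC"], "C": ["TGT", "TGC"],
    "Q": ["CAA", "CAG"], "E": ["GAA", "GAG"],
    "G": ["GGT", "GGC", "GGA", "GGG"],
    "H": ["CAT", "CAC"], "I": ["ATT", "ATC", "ATA"],
    "L": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],
    "K": ["AAA", "AAG"], "M": ["ATG"], "F": ["TTT", "TTC"],
    "P": ["CCT", "CCC", "CCA", "CCG"],
    "S": ["TCT", "TCC", "TCA", "TCG", "AGT", "AGC"],
    "T": ["ACT", "ACC", "ACA", "ACG"],
    "W": ["TGG"], "Y": ["TAT", "TAC"],
    "V": ["GTT", "GTC", "GTA", "GTG"],
}


def log10_search_space(protein):
    """log10 of the number of distinct coding sequences for the protein,
    i.e. the sum of log10(number of synonymous codons) over all positions."""
    return sum(math.log10(len(CODONS[aa])) for aa in protein)


def sample_sequences(protein, count, rng):
    """Draw `count` coding sequences, choosing the codon at each locus
    independently and uniformly among its synonyms."""
    return [
        "".join(rng.choice(CODONS[aa]) for aa in protein)
        for _ in range(count)
    ]
```

For example, `sample_sequences(protein, 1000, random.Random(0))` would yield a reproducible batch of 1,000 candidates. As a rough plausibility check under the assumption of about 3.2 synonymous codons per position on geometric average, 1,566 × log10(3.2) ≈ 791, which is consistent with the order of magnitude stated above; the exact figure depends on the amino acid composition.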

[0118] What follows is a filtering step 804 in which the top sequences are kept (selection). For example, the filter may keep 5-10% of the sequences with the highest spatial characteristic, such as the highest radius of gyration Rg. The method returns to step 802 where a further 1,000 sequences are generated through mutation and crossover of the sequences from filtering step 804. This means a number of codons are swapped for alternatives. The method then proceeds to step 803 of calculating the packaging size to close the loop, for example as described above with reference to Figure 5. The loop may continue until a maximum number of sequences has been considered or the improvement of packaging size between iterations is below an improvement threshold.
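By way of illustration only, the generate, score, filter, and mutate/crossover loop of method 800 may be sketched as follows. All names and default values (`optimise`, `mutation_rate`, `keep_fraction`, and so on) are assumptions made for this sketch, the codon table is a short excerpt of the standard genetic code, and `packaging_size` is a placeholder callable standing in for the simulation or trained model. Whether the filter keeps the smallest or the largest values is exposed as the `minimise` parameter, since that depends on how the spatial characteristic is defined. Carrying the selected parents over unchanged into the next generation is a further assumption of this sketch.

```python
import random

# Excerpt of the standard genetic code; a real run would need all amino acids.
CODONS = {
    "A": ["GCT", "GCC", "GCA", "GCG"],
    "G": ["GGT", "GGC", "GGA", "GGG"],
    "L": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],
    "K": ["AAA", "AAG"],
    "M": ["ATG"],
}


def random_codons(protein, rng):
    return [rng.choice(CODONS[aa]) for aa in protein]


def mutate(codons, protein, rng, mutation_rate):
    """Swap each codon for a random synonym with probability mutation_rate."""
    return [
        rng.choice(CODONS[aa]) if rng.random() < mutation_rate else codon
        for codon, aa in zip(codons, protein)
    ]


def crossover(first, second, rng):
    """Single-point crossover at a codon boundary."""
    if len(first) < 2:
        return list(first)
    point = rng.randrange(1, len(first))
    return first[:point] + second[point:]


def optimise(protein, packaging_size, rng, population_size=1000,
             keep_fraction=0.1, mutation_rate=0.05, max_generations=20,
             improvement_threshold=1e-6, minimise=True):
    population = [random_codons(protein, rng) for _ in range(population_size)]
    previous_best = None
    for _ in range(max_generations):
        # Score every candidate (step 803), then keep the top fraction (step 804).
        scored = sorted(
            ((packaging_size("".join(c)), c) for c in population),
            key=lambda pair: pair[0],
            reverse=not minimise,
        )
        keep = max(2, int(keep_fraction * population_size))
        parents = [c for _, c in scored[:keep]]
        best_score, best = scored[0]
        if previous_best is not None and abs(previous_best - best_score) < improvement_threshold:
            break  # improvement between iterations is below the threshold
        previous_best = best_score
        # New candidates by crossover and mutation of the kept sequences (step 802).
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng),
                   protein, rng, mutation_rate)
            for _ in range(population_size - keep)
        ]
        population = parents + children
    return best, best_score
```

Because mutation only ever substitutes synonymous codons and crossover cuts at codon boundaries, every candidate in this sketch encodes the same amino acid sequence.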

[0119] It will be appreciated by persons skilled in the art that numerous variations and/or modifications may be made to the above-described embodiments, without departing from the broad general scope of the present disclosure. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive.

Claims

CLAIMS:

1. A computer-implemented method for determining spatial characteristics of a single stranded nucleic acid (ssNA) molecule for packaging the ssNA molecule into a vector, the method comprising:
determining a three-dimensional structure of the ssNA molecule;
applying a virtual force to elements of the ssNA molecule in a simulator environment using the three-dimensional structure, the virtual force being applied towards an interior point of the three-dimensional structure, the simulator environment being configured to simulate a chemical environment within the vector; and
simulating, in the simulator environment, a compaction of the three-dimensional structure by the force applied to the elements of the ssNA molecule to calculate a spatial characteristic, related to packaging the ssNA molecule into the vector, of a compacted three-dimensional structure of the ssNA molecule.

2. The method of claim 1, wherein simulating the compaction comprises iteratively simulating the compaction until a convergence criterion is met.

3. The method of claim 2, wherein iteratively simulating the compaction comprises, in each iteration, increasing the force.

4. The method of claim 2 or 3, wherein the convergence criterion is a loss of structural integrity of the molecule.

5. The method of any one of the preceding claims, wherein applying the virtual force comprises creating a boundary structure in the simulator environment and applying the force to elements of the ssNA molecule outside the boundary structure.

6. The method of claim 5, wherein the boundary structure has a size and simulating the compaction comprises iteratively reducing the boundary size.

7. The method of claim 5 or 6, wherein the boundary structure comprises a first boundary and a second boundary and a first force is applied to elements outside the first boundary and a second force is applied to elements outside the second boundary, wherein the first force and the second force are towards the interior point of the three-dimensional structure.

8. The method of claim 7, wherein the first boundary and the second boundary are spherical boundaries.

9. The method of any one of the preceding claims, wherein the ssNA molecule comprises genetic information for a virus and genetic information to provide gene therapy.

10. The method of claim 9, wherein the virus has a size and the spatial characteristic is an indication of whether the ssNA molecule compacts into the size of the virus.

11. The method of any one of the preceding claims, wherein the spatial characteristic comprises a radius of gyration.

12. The method of any one of the preceding claims, wherein the method further comprises:
repeatedly performing the steps of determining the three-dimensional structure, applying the virtual force, and simulating the compaction for multiple different ssNA molecules to calculate multiple respective spatial characteristics; and
using the multiple respective spatial characteristics to train a machine learning model configured to receive, at an input, a description of the ssNA molecule and generate, at an output, the spatial characteristic to determine a trained machine learning model.

13. The method of claim 12, wherein the method further comprises applying the trained machine learning model to a new ssNA molecule to generate spatial characteristics of the new ssNA molecule, wherein applying the trained machine learning model is computationally less expensive than performing the steps of determining the three-dimensional structure, applying the virtual force, and simulating the compaction of the new ssNA molecule.

14. The method of any one of the preceding claims, wherein the method further comprises validating the spatial characteristic against packaging efficiency determined in a lab experiment.

15. The method of any one of the preceding claims, wherein the method further comprises iteratively swapping codons in the ssNA sequence to optimise the spatial characteristic.

16. A computer-implemented machine learning method for determining spatial characteristics of a single stranded nucleic acid (ssNA) molecule for packaging the ssNA molecule into a vector, the method comprising:
accessing a trained machine learning model, the trained machine learning model being configured to receive, at an input, a description of the ssNA molecule and generate, at an output, the spatial characteristic of a compacted three-dimensional structure of the ssNA molecule, the machine learning model being trained on multiple training samples; and
applying the trained machine learning model to the ssNA molecule to generate the spatial characteristics of the compacted three-dimensional structure of the ssNA molecule.

17. The method of claim 16, wherein each of the multiple samples comprises a description of an ssNA molecule for the input and corresponding simulated spatial characteristics as labels for the output.

18. The method of claim 17, wherein the method comprises training the machine learning model by adjusting parameters of the machine learning model to reduce an error between an output of the machine learning model and the simulated spatial characteristics.

19. Software that, when executed by a computer, causes the computer to perform the method of any one of the preceding claims.

20. A computer system comprising one or more processors configured to perform the method of any one of claims 1 to 18.

21. An adeno-associated virus for gene therapy comprising the ssNA molecule of any one of claims 1 to 18.