Electrostatics-based representation of biological macromolecules and functions thereof, based on structure-related descriptors

WO2026132855A1 · PCT designated stage · Publication Date: 2026-06-25 · ATP SIMBIO KUTATÓ CSOPORT KFT

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
ATP SIMBIO KUTATÓ CSOPORT KFT
Filing Date
2025-12-18
Publication Date
2026-06-25

[Figure HU2025050111_25062026_PF_FP_ABST]

Abstract

The invention relates to a computer-implemented method performed by one or more data processing apparatus for assessing a biological feature of a biological macromolecule, preferably a protein molecule, by providing an electrostatic data based in silico representation of the biological macromolecule. The biological feature may be, without limitation, multiple function-related states of the macromolecule, differences in the conformation, e.g. in the secondary structure, patterns based on primary structure, folded nature of the protein including, among others, π-π relationships and relationship between the genetic code and the electrostatic state of the environment of the encoded amino acid(s). The invention can be used among others in the field of drug research, medical diagnosis, genetic engineering and macromolecular modeling.

Description

[0001] Electrostatics-based representation of biological macromolecules and functions thereof, based on structure-related descriptors

[0002] FIELD OF INVENTION

[0003] The invention relates to a computer-implemented method performed by one or more data processing apparatus for assessing a biological feature of a biological macromolecule, preferably a protein molecule, by providing an electrostatic data based in silico representation of the biological macromolecule. The invention can be used in the field of drug research, medical diagnosis and macromolecular modeling.

[0004] BACKGROUND OF THE INVENTION

[0005] Drug research and development is based on biomolecular experiments. Owing to the large sums invested in this field of science, many biological pathways have been discovered (Kramer et al., 2014). The most important elements of these pathways are proteins.

[0006] In humans, proteins are built from 20 different amino acids. The primary structure of a protein, i.e. the order of its amino acids, is coded in the DNA. The number of possible configurations of an average-size protein is enormous; however, it is known that even a single point mutation can have severe effects on functionality, resulting in an abnormal health state, i.e. disease.

[0007] In humans, the primary structure depends on a mechanism called RNA splicing. The primary transcript contains all the information of its gene and, depending on the desired protein, some parts, called introns, are cleaved from the mRNA. The exact mechanism is not yet known, but its effect is seen in the larger variety of possible configurations coded in one gene.

[0008] The large number of possible configurations is not the only limiting factor in understanding protein development. After translation, the newly born protein chain needs to fold and acquire its secondary and tertiary structure and, in the case of protein complexes, its quaternary structure. The precise mechanism by which each group of spatial structures develops is not yet known; furthermore, the dependency between the order of the amino acids and the final native structure has not been clarified. The spatial structure is very important for the desired activity of a protein.

[0009] The mentioned processes must follow very strictly determined rules, given the enormous number of possible configurations and conformations. If the processes were random, only a very small percentage of the possible protein forms would have appeared in the lifetime of the Universe, probably without any enzymatic function at all. This statement is known as Levinthal’s paradox (Zwanzig et al., 1992), and its only possible solution is that folding is not a spontaneous process; however, there is very little information on its regularities.

[0010] Due to the lack of available knowledge on the importance of the primary structure of proteins, the dependency of the conformation on the configuration, and the mechanism of folding, including the folding of large protein complexes with vital roles, the field of molecular biology remains a purely experimental science.

[0011] Drug development and research is mainly based on molecular biology and requires intensive experimentation at extremely high cost. Furthermore, the final product in most cases offers therapy to reduce the symptoms rather than to eliminate the causes behind the malfunctions. In order to make drug research more effective, designable and affordable, the biological mechanism is to be explored and a technical utilization must be developed.

[0012] Docking methods are well-known computer-implemented techniques that aid the search for the best-fitting ligand. There are several types, but their common characteristic is to find the optimal protein structure or pose for the docking and then to score the interaction and binding affinity between the ligand and the target protein based on several energetic components of the binding.

[0013] For the evaluation, computationally intensive molecular dynamics simulations are used in most cases, but other approaches with low or moderate computational cost are available.

[0014] One example of this is mentioned in PCT publication WO 2023/104969 A1 (Jelsch et al., 2023), where the determination of the interior and exterior electrostatic energy densities of the complex is utilized for the calculation of a score value that is proportional to the stability of the determined complex.

[0015] That invention defines the interacting surfaces on the given structural complex between the protein and the ligand and then estimates the electrostatic energy density of those surfaces. The evaluation is carried out by the visualization of scatter plots. The points at which the energy values are interpreted correspond to geometrical points of the surface.

[0016] In general, the docking method of Jelsch et al. comprises the following steps:

[0017] (i) receiving a structure of at least one complex constituted of a molecule and its environment;

[0018] (ii) defining from the structure received in step (i) a surface delimiting said molecule from said environment;

[0019] (iii) determining for each point of a plurality of points of the surface defined in step (ii) an electrostatic energy density type parameter to obtain a score characterizing protein stability.

[0020] The method is applied on a case-by-case basis and is not prepared to use mathematical models (and machine learning models in particular), in the sense that no score is compared with reference scores.

[0021] Electrostatic forces account for a significant proportion of the interaction energy between molecule surfaces and may be responsible for the selection of a pairing mode.

[0022] Thus, the method does not utilize charge-related quantities themselves assigned to atoms or other charge-related entities.

[0023] WO2008033100A1 (Tong J. C. et al., 2008) discloses a method for predicting a physico-chemical property of a target polypeptide, wherein descriptor values are formed for a given amino acid property, thereby categorizing the amino acids into classes, such that the amino acids of one class all have the property to a higher level than all the amino acids of the other. The descriptor values may indicate how many amino acids in the region are in one or more of the classes and/or how the amino acids in one or more of the classes are distributed along the region.

[0024] The process does not start from electrostatic potential maps, and charge is in theory mentioned as an amino acid property but not actually used.

[0025] The determination of protein structures begins with determining the amino acid sequence. The methods providing structural information are based on this, with the help of which the atomic positions of the peptide chain(s) in the given protein are determined.

[0026] Single crystal X-ray diffraction (XRD) is the most often used method to determine the spatial atomic positions (Guo et al., 2024). Here, a crystal structure is produced from a large number of proteins, whereas there are protein molecules in every unit cell of the crystal lattice. Based on the intensity distribution of the diffraction image of the crystal structure, it is possible to determine the atomic positions and their occupancy. Essentially, this provides the opportunity for three-dimensional mapping of the electron cloud of the protein molecule in the lattice with atomic precision. The disadvantage of the XRD method is that the creation of a lattice containing many proteins is needed for the mapping and this likely changes the native structure of the proteins. As an additional measurement for certain smaller structural elements, nuclear magnetic resonance (NMR) can be used, and the structural information gathered from multiple sources is docked into one three-dimensional protein structure.

[0027] With the significant development of transmission electron microscopy techniques, the Cryo-EM (Frank, 2002) method also became involved in structure determination, which is capable of providing structural information about the native shape of the protein under conditions similar to the aqueous medium of the proteins. The atomic positions are determined by different fitting methods into the experimentally obtained electrostatic potential density maps. This step might be carried out manually or with automated deep learning tools such as DeepTracer (Pfab et al., 2021). Algorithms for resolving the atomic coordinates by fitting into different electrostatic parameter distributions are not the subjects of the present invention.

[0028] The structural information determined by Cryo-EM usually gives a lower-resolution image of the atomic groups. However, methods to determine the all-atom structure of a protein complex based solely on its Cryo-EM map and amino acid sequence are continuously being developed (Pfab et al., 2021), and the differences in resolution between Cryo-EM and XRD methods have been significantly reduced in recent years due to developments in sample preparation techniques and in electron microscopy. It is therefore expected that within a few years Cryo-EM will completely replace XRD methods in protein structure research.

[0029] There are also several methods of varying accuracy for determining the electrostatic potential field currently available.

[0030] The first group of methods is based on quantum mechanical and quantum chemistry methods (ab initio) (Friesner, 2005). These algorithms are capable of accurately approximating the conformations and charge distributions of small molecules with immense computational cost. However, since these methods are based on determining the wave function of the electron, they are usually exponentially dependent on the number of atoms in the given molecule. Therefore, their scope of application typically only covers dipeptides or oligopeptides made up of a few amino acids.

[0031] The second group of methods are forcefield methods (Cieplak et al., 2009) based on quantum mechanical simulations of small molecules, capable of describing the interactions between atoms and approximating charge rearrangements. Since, with the help of a few parameters per atom, the interactions give an independent description of the phenomenological conformation, these methods have moderate computational cost but are incapable of analyzing the effects of the folding of the protein.

[0032] Thus, for several decades, protein structures could be determined with high precision by experimental methods only, such as X-ray diffraction (XRD), Nuclear Magnetic Resonance (NMR) or, recently, Cryogenic Electron Microscopy (Cryo-EM), but these are immensely time- and resource-consuming.

[0033] Since there are considerably more protein sequences than experimentally determined structures, the need arose for a method to predict structures with reasonable accuracy and less effort; thus AlphaFold2 was created, which will be referred to simply as AlphaFold (Jumper et al., 2021) for the rest of the document.

[0034] Theoretically, it is possible to determine the structure of a protein from the amino acid sequence, but it is extremely challenging because of the colossal number of possible conformations.

[0035] Very recently, artificial intelligence (AI) has emerged and provided hope of arriving at a solution to this problem. Machine learning is capable of identifying complex relationships and patterns in large datasets. AlphaFold is an artificial intelligence system with multiple components, capable of predicting three-dimensional protein structures from the amino acid sequence of the protein. It utilizes machine learning to learn patterns in a huge dataset. After being trained on this dataset, it can be used on new data not included in its training dataset. AlphaFold uses deep learning, a subset of machine learning that focuses on neural networks. Neural networks are made up of artificial neurons (nodes), which can receive, process and send signals to other neurons. These nodes make up layers, and the layers form a complete neural network that can process information and learn patterns. The input of these layers of nodes is the multiple sequence alignment (MSA), which is the input amino acid sequence aligned with multiple similar sequences queried from databases. For the exact layout and implementation of the neural network that AlphaFold uses, refer to the open-source code (Senior et al., 2024), (Senior et al., 2021). The training data for AlphaFold was the Protein Data Bank (PDB) (Berman, 2000), which contains more than two hundred thousand experimentally determined protein structures along with the amino acid sequences of said proteins.

[0036] The input of AlphaFold is usually a single amino acid sequence. The model aligns this with sequences of other similar proteins to identify sections that tend to change together in evolutionary time; these are likely to interact with each other and be physically close in the spatial structure. A separate model, AlphaFold-Multimer, exists for protein-protein complexes, both homo-multimers and hetero-multimers.

[0037] AlphaFold is also capable of identifying intrinsically disordered regions, but is unable to predict disordered or dynamic substructures. AlphaFold is less accurate with highly variable sequences. Moreover, AlphaFold is not sensitive to point mutations. This is because it lacks data on the effect of these variations; the focus of the model is on patterns rather than on calculating physical forces.

[0038] Additionally, proteins tend to undergo structural changes when performing their function, and these conformations can almost never be found in the PDB. Since the model is designed for static structures, it cannot capture these changes; however, there exist methods to generate different conformations, such as using templates to model a protein in a particular state.

[0039] There are also things AlphaFold is incapable of doing, for example the model is unaware of other (non-protein) molecules that interact with the proteins, thus unable to do any calculations with them.

[0040] AlphaFold may also predict a ligand- or ion-bound form of the structure even in the absence of said ligand or ion. It is also unaware of membrane planes thus it cannot correctly model the relative orientations of any transmembrane domains and other domains of trans-membrane proteins (Perrakis & Sixma, 2021).

[0041] The prior art methods typically focus on protein structure as the origin of biochemical function. Extensive research has been going on for decades in this field, which is primarily based on mapping the structure of proteins and determining the positions of atoms in proteins. From the accumulated information and knowledge here, it is hoped that in the future, the protein mechanisms would be definable.

[0042] However, the present inventors have unexpectedly found that electrostatic parameters are more sensitive than structural parameters to the biochemical functions of macromolecules, in particular proteins, as well as to alterations thereof. This led to a novel type of representation of these macromolecules, which provides a set of in silico tools for finding relationships between this representation and biochemical function, and for uses in the field of medicine, including diagnosis, drug design including research, and biochemistry.

[0043] BRIEF DESCRIPTION OF THE INVENTION

[0044] According to the present invention an in silico computational system and method is provided for the prediction of a given biological feature of a biological macromolecule, preferably a protein, with the ultimate use of electrostatic properties of the molecule. The macromolecule is made up of a multiplicity of atoms and its geometric structure is usually defined on the level of atoms. The geometric structure may originate from Cryo-EM, XRD, NMR or even in silico structure prediction methods.

[0045] The present invention, by creating a novel in silico representation of biological macromolecules, e.g. proteins, provides a tool to assess the biochemical function of the macromolecules, and thereby information on their mechanisms, by utilizing, instead of the geometric structure, a corresponding electrostatic field. The electrostatic values have a spatial arrangement and fit into an area or spatial compartment, i.e. a volume given by positions (usually given by coordinates), and preferably form an electrostatic quantity map having spatial volume parts or spatial units (which may also be called map crops or voxels).

[0046] The electrostatic data are organized into descriptors (mathematically formulated as vectors or arrays) of the novel in silico representation of the invention. When preparing the descriptors of the representation, macromolecular geometric structure is fitted with the data of the electrostatic field or values, preferably with the electrostatic quantity map, for the purpose to obtain electrostatic data-derived parameters.

[0047] For example, in traditional Cryo-EM evaluation the atoms of the macromolecular (e.g. protein) structure are fitted into the electrostatic local maxima (in most cases with the help of the sequence) to give the positions of the atoms. The quantity map and the quantitative electrostatic values (i.e. the amplitudes) are not utilized later on and, in fact, are discarded once the structure is obtained.

[0048] To the contrary, according to the core concept of the invention, the descriptors use the quantitative electrostatic values, and the geometric structural data may be omitted, i.e. the geometric coordinates / angles / distances do not inherently form part of the descriptors themselves. The electrostatic data-derived parameters are assigned to atoms, groups of atoms or bonds, together called herein submolecular local entities. These submolecular local entities are tagged, i.e. supplied with values for proper handling and to render them useful in the descriptors. Inevitably, the submolecular local entities (like the atoms of a protein), and thus the local entity values, are originally identified by structure data; however, in the present invention the submolecular local entity values serve merely for assignment of the electrostatic data-derived parameters in the descriptors of the in silico representation according to the invention.

[0049] By way of illustration, the electrostatic data-derived parameters can be, among others, electrostatic potential density, electron density etc. These quantities form a field (distribution) in the space, and these fields can be experimentally determined or can be calculated e.g. by forcefield or ab initio methods.

[0050] By way of explanation, the submolecular local entity values, e.g. the atoms of the macromolecule, serve to select (identify or “pick”) those quantitative electrostatic data values, i.e. electrostatic data-derived parameters, which are then utilized for the creation of descriptors. Preferably, a multitude of the descriptors forms an essential element of the in silico representation of the macromolecule.

[0051] In the present invention a set of electrostatic data-derived parameters derived from the electrostatic distributions, preferably from the electrostatic quantity map of the biological macromolecule, is used, each electrostatic data-derived parameter being assigned to a single submolecular local entity within the set of submolecular local entities, said set of submolecular local entities being derived from but made distinct or independent of the structure data (typically expressed in geometric parameters, e.g. coordinates), wherein in the descriptor the set of submolecular local entity values (like an identifier of an atom or a part of the molecule) relates to a set of descriptor-related submolecular local entities (like the atoms or a part of the molecule). In the representation a multiplicity of descriptors is utilized to explore the interactions of electrostatic data-derived parameters (like a local charge) between the parameters assigned to the corresponding submolecular local entity values.

[0052] Thus, the key feature in the present invention is to represent protein spatial subunits (like a specific group of submolecular local entities) with their electrostatic parameters derived from the electrostatic field. For the definition of this representation the inventors utilize the structure of the given protein subunit only for determining the submolecular positions on which data samples are extracted from the electrostatic field. The electrostatic data selected this way are assigned to the submolecular local entity values but not to the spatial positions given by coordinates themselves.

[0053] Thus, the structure is used for sampling from the electrostatic field (preferably the quantity map). The structure is a tool herein to select the electrostatic data-derived parameters from the map. After the electrostatic data-derived parameters are obtained, and the local entity values are defined or assigned, the geometric coordinates are not used to create the descriptors, and preferably may even be discarded.
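
By way of illustration only, the sampling step described above can be sketched as follows. This is not the applicants' implementation: it assumes that the electrostatic quantity map is a regular three-dimensional grid with a known origin and cubic voxel size, and that a nearest-voxel lookup is an acceptable way to pick a value. The function name `sample_map`, the labels such as `"1:CA"`, and all numbers are hypothetical.

```python
# Illustrative sketch only: nearest-voxel sampling of a gridded electrostatic
# quantity map at atom positions. Grid layout, names and numbers are assumptions.

def sample_map(grid, origin, voxel_size, atoms):
    """Return {entity_value: map_value} for each (entity_value, (x, y, z)) atom."""
    shape = (len(grid), len(grid[0]), len(grid[0][0]))
    sampled = {}
    for entity_value, position in atoms:
        # Convert the Cartesian position (angstroms) to the nearest voxel index.
        index = tuple(
            int(round((p - o) / voxel_size)) for p, o in zip(position, origin)
        )
        if any(i < 0 or i >= n for i, n in zip(index, shape)):
            raise ValueError(f"{entity_value} lies outside the map")
        ix, iy, iz = index
        # Only the entity value and the sampled quantity are kept;
        # the coordinates are not carried into the result.
        sampled[entity_value] = grid[ix][iy][iz]
    return sampled


# Toy 2 x 2 x 2 map whose value encodes its own index (100*ix + 10*iy + iz).
toy_map = [
    [[100 * ix + 10 * iy + iz for iz in range(2)] for iy in range(2)]
    for ix in range(2)
]
toy_atoms = [("1:N", (0.1, 0.9, 0.0)), ("1:CA", (1.2, 0.2, 0.8))]
sampled = sample_map(toy_map, origin=(0.0, 0.0, 0.0), voxel_size=1.0, atoms=toy_atoms)
print(sampled)
```

In this sketch the returned mapping is keyed by the submolecular local entity value alone, so the coordinates can be dropped after the lookup, in line with the paragraph above. An interpolating lookup could be substituted for the nearest-voxel rule without changing the interface.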

[0054] The submolecular local entities can represent atoms, but can represent other spatial volume parts as well, e.g. bonds between atoms, or positions on the surface of the protein, inside the protein, or anywhere else. Through the data extraction, electrostatic data-derived parameters become available, and they are organized into descriptors.

[0055] The descriptor can be understood as a vector comprising the electrostatic data-derived parameters, which are ordered or arranged by the set of submolecular local entity values they are assigned to. To give an example, the descriptor or vector may comprise local charge values (as electrostatic data-derived parameters) arranged by atoms of the peptide bonds in a row (e.g. in the N- to C-terminal direction), like the first alpha carbon atom, the first nitrogen, the first oxygen, hydrogen and then the neighbouring second alpha carbon, the second nitrogen etc.
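
As a minimal sketch of such an ordering, the following arranges already-sampled parameters into a vector by entity value. The label scheme (`"residue:atom"`), the within-residue atom order and the toy charge values are assumptions made for illustration and are not taken from the source.

```python
# Illustrative sketch only: order electrostatic data-derived parameters into a
# descriptor vector by submolecular local entity value. Labels, atom order and
# numbers are hypothetical.

BACKBONE_ORDER = ("N", "CA", "C", "O")  # assumed order within each residue


def build_descriptor(sampled, n_residues, atom_order=BACKBONE_ORDER):
    """Return (entity_values, descriptor), both in N- to C-terminal order."""
    entity_values = [
        f"{residue}:{atom}"
        for residue in range(1, n_residues + 1)
        for atom in atom_order
    ]
    # The vector holds only the parameters; no coordinates are involved.
    descriptor = [sampled[entity] for entity in entity_values]
    return entity_values, descriptor


# Toy local-charge values keyed by entity value, deliberately unordered.
toy_sampled = {
    "2:O": -0.55, "1:CA": 0.05, "2:N": -0.40, "1:O": -0.57,
    "1:N": -0.42, "2:C": 0.52, "1:C": 0.51, "2:CA": 0.06,
}
entity_values, descriptor = build_descriptor(toy_sampled, n_residues=2)
print(entity_values)
print(descriptor)
```

The entity values fix the position of each parameter in the vector, so two descriptors built with the same order can be compared element by element.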

[0056] Thus, the set of submolecular local entity values provide a kind of ranking of or assignation to the set of electrostatic data-derived parameters.

[0057] In one description, the first dimension of the vector is thus a submolecular local entity value, related to a volume part (a segment or compartment of the space in which the macromolecule is present), unrelated or at least not directly connected to atomic coordinates. In the vector the present inventors use the electrostatic data-derived parameters, and there is no need for structure coordinates, torsion angles or anything else relating directly to the geometric structure.

[0058] The descriptor (vector) may serve as the set of predictor parameters (independent variables) in a machine learning model. The electrostatic data-derived parameters may be transformed in order to maximize the accuracy of the prediction. Such transformations can be among others the correction, normalization (as defined herein) or even the calculation of linear connection between the electrostatic data-derived parameters, that characterizes the connection between submolecular local entity values (e.g. the first and last amino acid of given alpha helices).
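
Two such transformations can be sketched as follows. The normalization "as defined herein" is not reproduced in this section, so a z-score is assumed here as a stand-in, and the Pearson correlation coefficient is assumed as one possible measure of a linear connection between the parameters of two entity values across a population of descriptors. The function names and numbers are hypothetical.

```python
# Illustrative sketch only: z-score normalization and Pearson correlation as
# assumed examples of the transformations mentioned above.
import math
import statistics


def z_score(values):
    """Centre values on zero and scale them to unit population spread."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return [0.0 for _ in values]
    return [(v - mean) / spread for v in values]


def pearson(xs, ys):
    """Linear correlation between two equally long series of parameters."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Toy parameters of the first and last residue of a helix in four descriptors.
first_residue = [0.1, 0.2, 0.3, 0.4]
last_residue = [-0.2, -0.4, -0.6, -0.8]
print(z_score(first_residue))
print(pearson(first_residue, last_residue))
```

In this toy population the two series are perfectly anti-correlated; with real descriptors the coefficient would be one candidate derived feature among many.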

[0059] The dimension of the descriptor (vector) is defined by the set of submolecular local entities. To give an example, the submolecular local entities may be the chiral carbon atom of each amino acid in a peptide segment. In the case of a 21-amino-acid peptide the vector is a 21-dimensional vector. If each peptide atom (C for the carbonyl carbon, CA for the alpha carbon, N for the amide nitrogen, O for the carbonyl oxygen) is considered, the dimensionality of the vector is 84.
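
Written out, with the symbols below introduced here only for illustration (d for the dimension of the descriptor, n_res for the number of residues and n_ent for the number of submolecular local entities taken per residue), the two cases above are:

```latex
d = n_{\mathrm{res}} \cdot n_{\mathrm{ent}}, \qquad
d_{\mathrm{chiral}} = 21 \cdot 1 = 21, \qquad
d_{\mathrm{peptide}} = 21 \cdot 4 = 84
```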

[0060] The data points of sample extraction are selected purposely, depending on the definitive biochemical feature. Merely to give examples, this biochemical feature can be the type of a given amino acid inside the structure, the secondary structure state of the given structure, a torsion angle on a given structure, the tertiary structure state of the given structure (e.g. π-π or disulfide bonds), the genetic origin of a given amino acid inside the structure, or the biological state of the protein (e.g. active, inactive, misfolded, aggregated, allosterically modified etc.). This chosen definitive biochemical feature is the dependent variable that is predicted by using the representation of the invention, e.g. via a machine learning model with the help of the electrostatic data-derived parameters.

[0061] By way of example, the machine learning model is built upon a population of descriptors (vectors) with their known biological feature. This model computes the significant and unique trends (patterns) in the electrostatic pattern that characterizes the biological feature of interest. The model is preferably capable of the backtracking of the most important features of the prediction in order to give us a comprehensive mechanistic picture of the given biological feature.

[0062] The exemplary machine learning model is also capable of the significant determination (with a given probability) of the unknown biological feature of a given descriptor (vector). This determination score can be interpreted as a (pseudo)-quantitative diagnostic method.
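
A minimal sketch of this workflow is given below. The source does not prescribe a particular model here, so a plain logistic regression fitted by gradient descent is used purely as a stand-in, and ranking the learned weights by magnitude is assumed as a crude form of the backtracking mentioned above. The descriptors, labels, hyperparameters and function names are all hypothetical.

```python
# Illustrative sketch only: logistic regression as a stand-in model that is
# trained on descriptors with a known binary feature and returns a score
# between 0 and 1 for a new descriptor. All data are toy values.
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(descriptor, weights, bias):
    """Probability-like score for the feature of interest."""
    return sigmoid(sum(w * x for w, x in zip(weights, descriptor)) + bias)


def train_logistic(descriptors, labels, learning_rate=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n_features = len(descriptors[0])
    weights, bias = [0.0] * n_features, 0.0
    n = len(descriptors)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(descriptors, labels):
            error = score(x, weights, bias) - y
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


# Toy population: the known feature (0 or 1) follows the sign of element 0.
train_x = [[-0.5, 0.1, 0.0], [-0.4, -0.1, 0.1], [-0.6, 0.0, -0.1],
           [0.5, 0.1, 0.0], [0.4, -0.1, 0.1], [0.6, 0.0, -0.1]]
train_y = [0, 0, 0, 1, 1, 1]
weights, bias = train_logistic(train_x, train_y)

# Crude "backtracking": rank descriptor positions by absolute weight.
ranking = sorted(range(len(weights)), key=lambda j: abs(weights[j]), reverse=True)
score_high = score([0.55, 0.05, 0.0], weights, bias)
score_low = score([-0.55, 0.05, 0.0], weights, bias)
print(ranking, score_high, score_low)
```

Because each weight is attached to one position of the descriptor, and each position to one submolecular local entity value, the ranking points back to the entities that drive the score. Any other regression or classification model with inspectable parameters could take the place of the one assumed here.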

[0063] The model is called a “mechanistic” model: each and every feature and step can be seen and monitored when the model is at work to obtain the biochemical feature in the form of a score. Due to this mechanistic property of the model, the present invention can be used for the definition of a descriptor (vector) that represents the given biological feature.

[0064] To give an example, the biochemical feature is the virulence of a bacteriophage, and the model is used to find a more virulent form of a bacteriophage in comparison with a reference bacteriophage. Once the model has been trained on a set of bacteriophage-related biological macromolecules with a spectrum of effectivities, it characterizes the relationship between the electrostatic status and the virulence of the bacteriophages, so that a given test bacteriophage can be characterized by its expected virulence. This methodology can also be used to design a more virulent bacteriophage.

[0065] Surprisingly, the model can differentiate not only between amino acid mutants but also between synonymous mutations. The electrostatic state of the given descriptor can be remodelled inside the protein via e.g. a synonymous or missense mutation. This allows us to design multiple, different protein states, even with the same primary sequence.

[0066] Thus, the invention relates to solutions according to the following numbered paragraphs.

[0067] Paragraph 1

[0068] A computer-implemented method performed by one or more data processing apparatus, for assessing (or determining) a biochemical feature of a biological macromolecule (preferably a protein molecule) by providing an electrostatic data based (preferably charge-related) in silico representation of the biological macromolecule, wherein the biochemical feature assessed by the method is suitable to be brought into relationship with the biological function of the biological macromolecule and preferably thereby with its biological mechanism (of action), said biological macromolecule comprising a multiplicity of submolecular local entities (selected from atoms, bonds and groups of atoms (preferably consisting of e.g. at most 8, 5 or 4 atoms) wherein the nucleus of the atom or the nuclei of the group of atoms are within sphere of not more than 6 angstroms, or in a particular embodiment not more than 3 angstroms) and having a structure described by electrostatic quantities organized into an electrostatic quantity map, wherein said in silico representation comprises a descriptor or a multiplicity thereof, said descriptor (optionally a multiplicity of descriptors) comprising o a set of submolecular local entity values, identifying a set of descriptor-related submolecular local entities each submolecular local entity value belonging to said descriptor, o a set of electrostatic data-derived parameters derived from the electrostatic quantities, preferably charge-related parameters (preferably characterizing local charges), preferably obtained from the electrostatic quantity map of the biological macromolecule, each electrostatic data-derived parameter being assigned to a single submolecular local entity value within the set of submolecular local entities values, a set of biochemical feature values, comprising one or more value(s) associated with a biochemical feature of the macromolecule, wherein optionally the biochemical feature values can be considered as target variables, 
(as explained in more detail in the detailed description), a mathematical model for the relationship between the set of electrostatic data-derived parameters and the biochemical function of the biological macromolecule, said model being set up on a multitude of descriptors (preferably defining interaction patterns between the electrostatic data-derived parameters), the electrostatic data-derived parameters being predictor variables and the biochemical feature values being the target variables in the model, to obtain a biochemical feature score to evaluate the relationship between the multitude of descriptors and the biochemical feature values; said computing-implemented method comprising the steps of obtaining a biochemical feature score of interest as the result of the model utilized (or applied) on at least one descriptor, optionally on a multiplicity of descriptors, both the biochemical feature score and the at least one descriptor being related to the biological macromolecule of interest, comparing the biochemical feature score of interest with at least one multitude of reference biochemical feature scores, each of said reference biochemical feature scores obtained as the result of the mathematical model set up on a multitude of reference descriptors of biological macromolecules with known biochemical features (wherein the corresponding known biochemical feature values are target variables of the model), assessing (or determining) a biochemical feature of the biological macromolecule of interest by comparing the biochemical feature score of interest with the at least one multitude of reference biochemical feature scores or with a reference interval derived therefrom, wherein preferably a relationship / connection is established between the biochemical feature (assessed by comparison of the biochemical feature score of interest with the at least one multitude of reference biochemical feature scores) and the biological function of the biological macromolecule and 
thereby its biological mechanism (of action).

[0069] The submolecular local entity is preferably an atom or a small group of atoms consisting of e.g. at most 8, 5 or 4 atoms. In a preferred embodiment the submolecular local entity is smaller than an amino acid residue.

[0070] Preferably the submolecular local entity is an atom.

[0071] The submolecular local entity is thus different from a shape defined by geometrical data, e.g. coordinates, or a surface consisting of data points of a space.

[0072] The electrostatic data-derived parameters derived from the electrostatic quantities are preferably charge-related parameters or parameters characterizing charges, i.e. the local charge assigned to the submolecular local entity.

[0073] The score is in a mathematical relationship (if desired, also a statistical relationship) with the biochemical feature value, so that based on the score the actual biochemical feature value can be estimated or calculated, optionally with a given probability. Thereby, in the in silico representation, a multiplicity of descriptors is utilized to explore, via application of the model, the interactions of the electrostatic data-derived parameters (like a local charge) between and / or within the descriptors. These interactions are the key features of the electrostatic pattern that characterize the biochemical function of a biological macromolecule, like a protein, and the novel representation of the invention provides a tool to characterize them. By obtaining the interactions, the mechanistic (i.e. mechanism-focused) analysis of said electrostatic patterns becomes possible.

[0074] The model creates a mathematical relationship between the electrostatic data-derived parameters of the descriptor, as predictor variables, and the biochemical feature values, which provide a quantitative value to the biochemical features, as target variables. Thus, the mathematical result of the model is a score value which is in a mathematical relationship with (or aligns with) the target variables, and thereby the score evaluates the relationship between the multitude of descriptors and the biochemical feature values. Such models are known to a person skilled in the art of numeric regressions and classifications and can be appropriately selected in view of the electrostatic data-derived parameters, the biochemical features and, in turn, the biochemical feature values.

[0075] Thus, the model itself is a quantification (or a quantitative or mathematical description) of the interactions between the electrostatic data-derived parameters, and thereby describes the electrostatic features (i.e. the electrostatic pattern) within the biological macromolecule.

[0076] In other words, the in silico representation of the invention (via the plurality of descriptors) provides (“encodes”) electrostatic pattern-dependent functional determinants that correspond to mechanistic contributions governing the biochemical feature of interest, wherein the functional determinants are considered as components of the functional feature, resulting in a set of functional feature values used by the mathematical model as target variables, and wherein the model-provided relationship between the plurality of descriptors and the biochemical feature values is preferably further analyzed to identify, decompose, and / or quantify electrostatic interaction patterns associated with the biochemical mechanism of the biological macromolecule.

[0077] The interactions can also be considered as an interplay between the electrostatic (e.g. charge-related) parameters assigned to the submolecular local entities, preferably atoms, which reflects the molecular mechanisms of the biological macromolecule, e.g. the protein.

[0078] The submolecular local entities are preferably selected from the group consisting of atoms, groups, moieties, bonds, including bonds formed via primary and / or secondary interactions etc.

[0079] The submolecular local entity values can be understood as an indexed or tagged set of such entities, e.g. atoms, wherein the value (i.e. the technical information) given in the descriptor about the submolecular local entity is not its geometric position, but an identification “tag”.

[0080] The structure of a biological macromolecule relates to the fact that the macromolecule has a native form which correlates with its functional state. The term structure of a biological macromolecule implies the order, including patterns within the macromolecule and both its geometric arrangement and electrostatic field map which are associated with its function.

[0081] In an alternative wording the computer-implemented method used for assessing a biological feature of a biological macromolecule provides an electrostatic data based (preferably charge-related) in silico representation of the biological macromolecule, said biological macromolecule comprising a multiplicity of submolecular local entities (e.g. atoms, groups, moieties, bonds, including bonds formed via primary and / or secondary interactions) and having a structure described by geometric structure data and electrostatic quantities preferably organized into an electrostatic quantity map; or said biological macromolecule comprising a multiplicity of submolecular local entities (e.g. atoms, groups, moieties, bonds, including bonds formed via primary and / or secondary interactions) and having an electrostatic field described by an electrostatic quantity map comprising geometric structure data; or said biological macromolecule having a geometric structure described by geometric structure data (coordinates etc.) and having an electrostatic field preferably organized into an electrostatic quantity map, said in silico representation being characterized as described above and said computer-implemented method comprising the steps as described above.

[0082] Geometric data can be e.g. backbone torsion angles and / or atomic coordinates.

[0083] By way of explanation, the set of submolecular local entity values belonging to the descriptors are different from geometric structure data and submolecular local entities are identified without such data.

[0084] Preferably the descriptor is related to in silico representation of the biological macromolecule of interest. The score of interest is the result of the model utilized on at least one descriptor.

[0085] Preferably the value(s) assigned to the known biochemical features (biochemical feature values) are the target variables of the model.

[0086] In an alternative wording the invention relates to a computer-implemented method performed by one or more data processing apparatus for assessing a biological feature of a biological macromolecule by providing an electrostatic data based representation of the biological macromolecule, said biological macromolecule comprising a multiplicity of submolecular local entities and having a structure described by o geometric structure data defining the structure of the biological macromolecule or substructure thereof and o electrostatic quantities preferably organized into an electrostatic quantity map, characterized in that said in silico representation comprises a descriptor comprising o a set of submolecular local entity values, defined by the structure data, relating to a set of descriptor-related submolecular local entities each belonging to said descriptor, o a set of electrostatic data-derived parameters derived from the electrostatic quantities, preferably from the electrostatic quantity map of the biological macromolecule, each electrostatic data-derived parameter being assigned to a single submolecular local entity within the set of submolecular local entities, a set of biochemical feature values, comprising one or more value(s) associated with a biochemical feature of the macromolecule, a model, said model being set up on a multitude of descriptors, taking the electrostatic data-derived parameters as predictor variables and the biochemical feature values as target variables, to obtain a score being useful to evaluate the relationship between the multitude of descriptors and the biochemical feature values; said computing-implemented method comprising the steps of obtaining a score of interest as the result of the model utilized on at least one descriptor of (in silico representation of) a biological macromolecule of interest, comparing the score of interest with at least one multitude of reference scores, each of said reference scores obtained as the 
result of the model set up on a multitude of reference descriptors of (in silico representations of) biological macromolecules with known biochemical features, assessing a biological feature of the biological macromolecule of interest by comparing the score of interest with the at least one multitude of reference scores or with a reference interval derived therefrom.
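By way of a non-limiting illustration, the comparison of a score of interest with reference scores, or with a reference interval derived therefrom, might be sketched as follows. The interval rule (mean plus or minus k sample standard deviations), the function names and the numeric reference scores are assumptions made only for this sketch and are not taken from the description.

```python
from statistics import mean, stdev


def reference_interval(reference_scores, k=2.0):
    """Return (low, high) as mean +/- k sample standard deviations.

    The interval rule and the default k are illustrative assumptions.
    """
    m = mean(reference_scores)
    s = stdev(reference_scores)
    return (m - k * s, m + k * s)


def assess_feature(score_of_interest, reference_sets):
    """Return labels of every reference set whose interval contains the score."""
    matches = []
    for label, scores in reference_sets.items():
        low, high = reference_interval(scores)
        if low <= score_of_interest <= high:
            matches.append(label)
    return matches


# Hypothetical reference scores for two known biochemical features.
references = {
    "state_A": [0.9, 1.0, 1.1, 1.0],
    "state_B": [2.9, 3.0, 3.1, 3.0],
}

print(assess_feature(1.05, references))
```

Other interval definitions (e.g. percentile-based) could be substituted without changing the overall comparison step.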

[0087] Paragraph 2

[0088] The method according to Paragraph 1 (or any paragraph regarding preferred methods, in particular according to any of Paragraphs 12, 13, 14 or 15), wherein o (Paragraph 2.1) the model type is an alignment-type model and the obtained score values (and / or optionally the target variables) are values calculated between descriptors, preferably a difference-type model and the difference type values are calculated between two descriptors, or o (Paragraph 2.2) the model type is a regression-type model, and the obtained score values (and / or optionally the target variables) are the weighted combination (preferably sum) of the predictor variables, or o (Paragraph 2.3) the model type is a classification-type model, and the obtained score values (and / or optionally the target variables) are the weighted combination of the predictor variables.

[0089] The obtained score values reflect the relationship between the electrostatic data-derived parameters and the biochemical feature value, and thereby characterize the macromolecular function by its electrostatic properties.

[0090] To give a few examples, alignment-type models can be used to evaluate similarities, e.g. to assess the similarity of a state of a biological macromolecule, like a protein, to another state or to a reference state, or to compare structure-related biochemical features with those of other molecules, e.g. reference molecules.

[0091] Regression-type models also have multiple uses, e.g. finding relationships between electrostatic features of submolecular local entities or electrostatic feature patterns of descriptors.

[0092] Classification-type models can be used when the functional feature is related to multiple categories applied to the biological macromolecule, as in the assessment of silent mutations relating to the degeneracy of the genetic code, or when a protein can exist in multiple states.
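As a hedged illustration of a classification-type model in which the score is derived from a weighted combination of the predictor variables, a two-category sketch is given below. The logistic link, the threshold and the weights are assumptions chosen for the example; the description itself does not prescribe them.

```python
import math


def weighted_combination(predictors, weights, bias=0.0):
    """Weighted sum of predictor variables (e.g. local charge parameters)."""
    return bias + sum(w * x for w, x in zip(weights, predictors))


def classification_score(predictors, weights, bias=0.0):
    """Map the weighted combination to (0, 1) with a logistic link (assumption)."""
    z = weighted_combination(predictors, weights, bias)
    return 1.0 / (1.0 + math.exp(-z))


def classify(predictors, weights, bias=0.0, threshold=0.5):
    """Return 1 or 0, i.e. a Bernoulli-type target for two states."""
    return 1 if classification_score(predictors, weights, bias) >= threshold else 0


# Hypothetical weights for two electrostatic data-derived parameters.
weights = [1.0, -2.0]
print(classify([2.0, 0.0], weights), classify([0.0, 1.0], weights))
```

In practice the weights would be fitted on descriptors of macromolecules with known categories rather than set by hand.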

[0093] Paragraph 3

[0094] The method according to any of Paragraphs 1 or 2 (or according to any paragraphs in particular any of Paragraphs 12, 13, 14 or 15), wherein the biochemical feature comprises multiple states of the macromolecule (preferably a protein molecule), preferably function-related states of the macromolecule, preferably of the protein, preferably at least two or two different states.

[0095] In a preferred embodiment the function-related states are quantified or numerically assigned by biochemical feature values as target variables.

[0096] In an embodiment, while the electrostatic data are local, the method yields a score relating to the whole protein, which describes a state of that protein. The target variables may be the mathematical representation of the two states.

[0097] When the biochemical feature values (as target variables) describe electrostatic quantities or features of the macromolecule, or interactions between such quantities or features associated with parts of the molecule or submolecular entities thereof, a regression-type model may be advisable. This use may be appropriate even when the ultimate result of the change or difference in electrostatic features is multiple states or variants of the macromolecule.

[0098] Paragraph 4

[0099] The method according to Paragraph 3 or any of the paragraphs herein, wherein the difference between the states is due to an effect selected from the group consisting of: mutation; silent (synonymous) mutation in a protein; biochemical modification, said modification preferably selected from a regulatory type modification, preferably phosphorylation, and a change in the glycosylation pattern; and external effect, said external effect preferably being selected from protein-ligand interaction, protein-protein interaction, domain-domain interaction, or other external effects like chemical effects, preferably a change in pH, ionic strength, polarity or hydrophobicity of the environment, presence of chaotropic agents, etc.; or physical effects, like light, temperature, pressure, etc.

[0100] Paragraph 5

[0101] The method according to Paragraph 3 or 4, wherein the difference between the states is due to, is related to, or is reflected in the alteration of the spatial structure of the biological macromolecule (preferably a protein molecule). Preferably the states are function-related states.

[0102] Paragraph 6

[0103] The method according to any of Paragraphs 3 to 5, or any of the previous paragraphs, wherein the model is an alignment-type model and the target variables are values calculated between descriptors, preferably a difference-type model and the difference type values are calculated between two descriptors, and in an embodiment the target variables are variables which take discrete values (discrete variables), wherein preferably said discrete values have a discrete probability distribution (wherein the variables may be random variables), the target variables preferably being Bernoulli variables.

[0104] In a preferred embodiment the descriptor comprises electrostatic data-derived parameters along the polypeptide chain, the model type preferably being a Euclidean difference-type model or dynamic time warping (DTW).
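For illustration only, the two alignment-type measures named above can be sketched for two sequences of electrostatic data-derived parameters taken along a polypeptide chain. The absolute-difference local cost in the DTW recursion and the function names are assumptions of this sketch; the numeric profiles are invented.

```python
import math


def euclidean_difference(a, b):
    """Euclidean distance between two equal-length parameter profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must have equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(a, b):
    """Classic dynamic time warping cost with |x - y| as local cost.

    Profiles may differ in length, e.g. after an insertion in the chain.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: step in a, step in b, or step in both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Hypothetical per-atom charge profiles of two states of the same segment.
state_1 = [0.0, 1.0, 2.0]
state_2 = [0.0, 1.0, 1.0, 2.0]
print(dtw_distance(state_1, state_2))
```

The Euclidean measure requires profiles of equal length, whereas DTW tolerates local stretching, which may be why both options are mentioned.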

[0105] Paragraph 7

[0106] The method according to any of Paragraphs 1 or 2 (or any of the previous paragraphs, preferably according to any of Paragraphs 12, 13, 14 or 15), wherein the biochemical feature comprises a difference in the conformation, e.g. (in certain embodiments preferably) a difference in the secondary structure, preferably said biochemical feature is selected from the group consisting of alpha helical structure, beta-sheet, termination thereof, or other conformation type, including turn structures, preferably the biochemical feature values are selected from torsion angles or any representation thereof, Bernoulli variables relating to different structures, etc.

[0107] The biochemical feature values are not necessarily directly related or linked to a biochemical or biological function. In this embodiment the focus is on the chemical side of a biochemical feature, e.g. secondary structure elements or termination thereof, or their manifestation in electrostatic features. However, such a biochemical feature may well have an effect on biochemical function, as in prions, in which, in the diseased (impaired) form, beta sheets replace alpha helices and no way to reverse this process is known.

[0108] By way of example, either alignment type or classification type models can be used in these embodiments.

Paragraph 8

[0109] The method according to any of the previous paragraphs, in particular any of Paragraphs 2 to 4 or Paragraph 7, wherein the model type is a regression-type model, and the target variables are the weighted combination (preferably sum) of the predictor variables, optionally a multiple linear regression model.

[0110] In a particularly preferred embodiment a multiplicity of descriptors is utilized to explore the model-described interaction of the electrostatic data-derived parameters (like a local charge) between and / or within the descriptors, and a regression type model is used.
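A minimal sketch of a multiple linear regression in which the score is a weighted sum of the predictor variables is given below, assuming ordinary least squares solved through the normal equations. The fitting procedure, the helper names and the toy data are assumptions of this illustration, not features stated in the description.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; predictors may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_linear_model(rows, targets):
    """Least-squares fit; returns [intercept, w1, w2, ...]."""
    design = [[1.0] + list(r) for r in rows]
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def score(coefficients, predictors):
    """Score = intercept + weighted sum of the predictor variables."""
    return coefficients[0] + sum(w * x for w, x in zip(coefficients[1:], predictors))


# Toy data: each row holds two hypothetical charge parameters of one descriptor;
# targets follow 1 + 2*x1 - 3*x2 exactly, so the fit should recover these weights.
rows = [[0.1, -0.2], [0.3, 0.0], [-0.1, 0.4], [0.2, 0.2]]
targets = [1.8, 1.6, -0.4, 0.8]
coefficients = fit_linear_model(rows, targets)
print(score(coefficients, [0.0, 0.0]))
```

The fitted weights indicate how strongly each electrostatic data-derived parameter contributes to the score, which is one way the interactions between parameters could be inspected.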

[0111] Paragraph 9

[0112] The method according to any of Paragraphs 1 or 2 (or any of the previous paragraphs, or according to any of Paragraphs 12, 13, 14 or 15), wherein the biological feature comprises a pattern based on primary structure, preferably said biological feature is selected from the group consisting of alpha helical structure, beta-sheet, termination thereof, or other conformation type, preferably the biochemical feature values are selected from torsion angles or any representation thereof, discrete variables, Bernoulli variables relating to different structures, etc.

[0113] Alternatively, the biological feature comprises any pattern of physical quantities along the polypeptide chain.

[0114] In a simple embodiment the pattern is a fingerprint-like pattern.

[0115] Paragraph 10

[0116] The method according to any of Paragraphs 1 or 2 (or any of the previous paragraphs, or according to any of Paragraphs 12, 13, 14 or 15), wherein the biological feature is a feature which indicates the folded nature of the protein.

[0117] This can be, without limitation, a general arrangement of electrostatic data-derived parameters indicating the folded state. It may also be a pattern as defined in Paragraph 9. Alternatively, it may be a more specific interaction related to the folded state, like the presence of relationships or interactions like hydrogen bonds or hydrophobic interactions.

[0118] The method according to this Paragraph, wherein the biochemical feature is the tertiary structure of the protein, preferably characterized by π-π relationships.

[0119] Paragraph 11

[0120] The method according to any of Paragraphs 1 or 2 (or any of the previous paragraphs, or according to any of Paragraphs 12, 13, 14 or 15), wherein the biochemical feature is a relationship between the genetic code and the electrostatic state of the environment of the encoded amino acid(s), especially of the same amino acid but with different coding codon origins.

[0121] Paragraph 12

[0122] A computer-implemented method performed by one or more data processing apparatus for providing, in silico, an electrostatic (preferably charge-related) data based descriptor relating to a biochemical feature of a biological macromolecule (preferably a protein molecule), wherein preferably the biochemical feature assessed by the method is suitable to be brought into relationship with the biological function of the biological macromolecule and thereby with its biological mechanism, said biological macromolecule comprising a multiplicity of submolecular local entities and having a structure described by o geometric structure data (e.g. backbone torsion angles and / or atomic coordinates) and o electrostatic quantities preferably organized into an electrostatic quantity map, preferably said biological macromolecule comprising a multiplicity of submolecular local entities selected from atoms, bonds and groups of atoms, preferably consisting of e.g. at most 8, 5 or 4 atoms, and having a structure described by electrostatic quantities organized into an electrostatic quantity map, wherein said descriptor (optionally a multiplicity of descriptors) comprises o a set of submolecular local entity values, identifying a set of descriptor-related submolecular local entities, each submolecular local entity value belonging to said descriptor, o a set of electrostatic data-derived parameters derived from the electrostatic quantities, preferably charge-related parameters characterizing local charges, obtained from the electrostatic quantity map of the biological macromolecule, each electrostatic data-derived parameter being assigned to a single submolecular local entity value within the set of submolecular local entity values, said descriptor being suitable for setting up a model thereon wherein said electrostatic data-derived parameters are used as predictor variables and values assigned to the biochemical feature are used as target variables, preferably said descriptor carrying information about the biochemical feature, said method comprising the steps of providing a structure of the biological macromolecule described by electrostatic quantities organized into an electrostatic quantity map, or, in a particular embodiment, o geometric structure data defining the structure of the biological macromolecule or substructure thereof, preferably a macromolecular structure data map (preferably protein structure data; preferably based on experimental data) o electrostatic quantities related to the structure of the biological macromolecule, preferably a reconstructed electrostatic-quantity map (preferably in case of Cryo-EM: the electrostatic quantity is potential, in case of XRD: electron density, in case of in silico calculations: calculated electrostatic potential, e.g. calculated electrostatic-quantity map) (preferably an electrostatic potential map (Coulomb-potential map), a charge-density map or an electron-density map), and preferably from experimental or computed physical data, if desired, transforming the electrostatic quantities (Cryo-EM: potential, XRD: electron density, in silico: calculated electrostatic potential), preferably quantities of the reconstructed electrostatic-quantity map, into transformed electrostatic quantities, preferably forming a transformed electrostatic-quantity map (e.g. electrostatic potential is transformed into charge distribution), (preferably a quantity comprising local charge information) defining a set of biochemical feature values, said values being related to a biochemical feature of the macromolecule, forming a descriptor for the biochemical feature, by o selecting a set of submolecular local entities optionally using the geometric structure data (preferably of the macromolecular structure data map), o if desired, assigning submolecular local entity values to the submolecular local entities, o selecting a set of electrostatic quantities related to the set of submolecular local entity values (preferably transformed electrostatic quantities) and assigning each electrostatic quantity to a related submolecular local entity value in the set of submolecular local entity values, to obtain electrostatic data-derived parameters ( / electrostatic data related / electrostatic type parameters), wherein the set of submolecular local entity values and the set of electrostatic data-derived parameters form the descriptor for the biochemical feature (wherein thus the geometric structure data are preferably not used for the descriptor), if desired, transforming the electrostatic data-derived parameters into transformed electrostatic data-derived parameters (preferably a different parameter comprising local charge information, highly preferably ICD in case of Cryo-EM), if desired, processing the electrostatic-type parameters into processed electrostatic-type parameters (e.g. scaling, normalizing or correction; like ICD corrected, normalized, FO density), optionally defining a subset of the set of submolecular local entity values, thereby defining a part, e.g. a domain, substructure or moiety of the macromolecule and a related subset of the electrostatic-type parameters (e.g. an oligopeptide, e.g. tripeptide, a domain (crop), or a set of selected atoms (trimmed) or the subset).
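The descriptor-forming steps above (selecting entities, assigning one electrostatic quantity to each entity value, and optionally cropping to a subset) might be sketched as follows. The class layout, the tag format such as "GLY1:N" and the charge values are hypothetical and serve only to show that the descriptor holds identification tags and parameters rather than coordinates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptor:
    entity_values: tuple  # identification tags of atoms; no geometric data
    parameters: tuple     # one electrostatic data-derived parameter per tag


def form_descriptor(quantity_by_entity, selected_entities):
    """Pair each selected entity tag with its assigned electrostatic quantity."""
    tags = tuple(selected_entities)
    params = tuple(quantity_by_entity[t] for t in tags)
    return Descriptor(tags, params)


def subset_descriptor(descriptor, keep):
    """Crop the descriptor to the tags accepted by the predicate `keep`."""
    pairs = [
        (t, p)
        for t, p in zip(descriptor.entity_values, descriptor.parameters)
        if keep(t)
    ]
    return Descriptor(tuple(t for t, _ in pairs), tuple(p for _, p in pairs))


# Hypothetical local charge values read from an electrostatic quantity map.
charges = {"GLY1:N": -0.30, "GLY1:CA": 0.05, "ALA2:N": -0.28, "ALA2:CA": 0.07}
full = form_descriptor(charges, ["GLY1:N", "GLY1:CA", "ALA2:N", "ALA2:CA"])
backbone_n = subset_descriptor(full, lambda tag: tag.endswith(":N"))
print(backbone_n)
```

How the quantities are read from the map at each entity is outside this sketch and would depend on the map type (e.g. Cryo-EM potential or XRD electron density).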

[0123] The quantities and parameters are preferably charge-related.

[0124] In an alternative wording said biological macromolecule comprising a multiplicity of submolecular local entities selected from atoms, bonds and groups of atoms, preferably consisting of e.g. at most 8, 5 or 4 atoms, and having a structure described by electrostatic quantities organized into an electrostatic quantity map.

[0125] The biological macromolecule is preferably a protein molecule.

[0126] The submolecular local entity is preferably an atom or a small group of atoms consisting of e.g. at most 8, 5 or 4 atoms. In a preferred embodiment the submolecular local entity is smaller than an amino acid residue.

[0127] Preferably the submolecular local entity is an atom.

[0128] The submolecular local entity is thus different from a shape defined by geometrical data, e.g. coordinates, or a surface consisting of data points of a space.

[0129] Preferably the electrostatic quantities are selected from the group consisting of, in case of Cryo-EM: electrostatic potential; in case of XRD: electron density; in case of in silico calculations: calculated electrostatic potential; wherein preferably quantities of the reconstructed electrostatic-quantity map are transformed into transformed electrostatic quantities, preferably forming a transformed electrostatic-quantity map (e.g. electrostatic potential is transformed into charge distribution), (preferably a quantity comprising local charge information).
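The description does not state which transformation converts an electrostatic potential into a charge distribution. Purely as an assumed example, one standard relationship from electrostatics is Poisson's equation, in which the charge density equals minus the permittivity times the Laplacian of the potential. A one-dimensional finite-difference sketch with unit permittivity is shown below; the grid, the units and the function name are assumptions.

```python
def charge_density_1d(potential, spacing, permittivity=1.0):
    """Estimate charge density at interior grid points from a sampled potential.

    Uses rho_i = -permittivity * (phi[i-1] - 2*phi[i] + phi[i+1]) / spacing**2,
    a central-difference form of Poisson's equation (illustrative assumption).
    """
    rho = []
    for i in range(1, len(potential) - 1):
        laplacian = (potential[i - 1] - 2.0 * potential[i] + potential[i + 1]) / spacing ** 2
        rho.append(-permittivity * laplacian)
    return rho


# Toy potential phi(x) = x**2 sampled at x = 0..4; its second derivative is 2.
phi = [float(x * x) for x in range(5)]
print(charge_density_1d(phi, spacing=1.0))
```

A real map is three-dimensional and noisy, so any practical transformation would need a 3D stencil and appropriate filtering.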

[0130] If desired, the electrostatic data-derived parameters are transformed into transformed electrostatic data-derived parameters (preferably a different parameter comprising local charge information, highly preferably ICD in case of Cryo-EM).

[0131] If desired, the electrostatic-type parameters are processed into processed electrostatic-type parameters (e.g. selected from the group consisting of scaling, normalizing or correction; like ICD corrected, normalized, FO density).
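As a generic illustration of the scaling and normalizing options mentioned above, two common choices are sketched below: min-max scaling and z-score normalization. These are assumed examples and are not asserted to be the specific corrections (such as the ICD correction) referred to in the description.

```python
from statistics import mean, pstdev


def scale_min_max(values):
    """Rescale parameters linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def normalize_z(values):
    """Centre parameters on zero and divide by the population standard deviation."""
    m, s = mean(values), pstdev(values)
    if s == 0:
        return [0.0 for _ in values]
    return [(v - m) / s for v in values]


# Hypothetical electrostatic-type parameters of one descriptor.
params = [1.0, 2.0, 3.0]
print(scale_min_max(params))
print(normalize_z(params))
```

Normalizing per descriptor, as here, makes profiles from different maps comparable in shape but discards absolute magnitude, which may or may not be desirable for a given model.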

[0132] Preferably the set of submolecular local entity values are defined by the structure data, but not given by geometric structure data.

[0133] Preferably the submolecular local entities are atoms.

[0134] In alternative embodiments the submolecular local entity is a group, e.g. an aromatic side chain group, a peptide bond, a covalent bond etc.

[0135] In an alternative wording (in this and other paragraphs) said descriptor (optionally a multiplicity of descriptors) comprises o a set of submolecular local entity values relating to a set of descriptor-related submolecular local entities, each submolecular local entity value belonging to said descriptor, o a set of electrostatic data-derived parameters derived from the electrostatic quantities, preferably from the electrostatic quantity map (preferably an experimental electrostatic quantity map) of the biological macromolecule, each electrostatic data-derived parameter being assigned to a single submolecular local entity value within the set of submolecular local entity values.

[0136] Paragraph 13

[0137] A computer-implemented method performed by one or more data processing apparatus for providing an electrostatic data based (preferably charge-related) in silico representation of a biological macromolecule (preferably a protein molecule) having a biochemical feature, said biological macromolecule comprising a multiplicity of submolecular local entities and having a structure described by o geometric structure data (e.g. backbone torsion angles and / or atomic coordinates) defining the structure of the biological macromolecule or substructure thereof and o electrostatic quantities preferably organized into an electrostatic quantity map, characterized in that said in silico representation comprises a descriptor (including a multiplicity of descriptors) comprising o a set of submolecular local entity values, identifying a set of descriptor-related submolecular local entities, each submolecular local entity value belonging to said descriptor, o a set of electrostatic data-derived parameters derived from the electrostatic quantities, preferably charge-related parameters characterizing local charges, obtained from the electrostatic quantity map of the biological macromolecule, each electrostatic data-derived parameter being assigned to a single submolecular local entity value within the set of submolecular local entity values, a set of biochemical feature values, comprising one or more value(s) associated with a biochemical feature of the macromolecule, wherein optionally the biochemical feature values can be considered as target variables, a mathematical model for the relationship between the set of electrostatic data-derived parameters and the biochemical function of the biological macromolecule, said model being set up on a multitude of descriptors (preferably defining interactions (patterns) between the electrostatic data-derived parameters), the electrostatic data-derived parameters being predictor variables and the biochemical feature values being the target variables in the model, to obtain a biochemical feature score to evaluate the relationship between the multitude of descriptors and the biochemical feature values; said method comprising the steps of providing a structure of the biological macromolecule described by electrostatic quantities organized into an electrostatic quantity map, or, in a particular embodiment, o geometric structure data defining the structure of the biological macromolecule or substructure thereof, preferably a macromolecular structure data map (preferably protein structure data; preferably based on experimental data) o electrostatic quantities related to the structure of the biological macromolecule, preferably a reconstructed electrostatic-quantity map (Cryo-EM: potential, XRD: electron density, in silico: calculated electrostatic potential) (preferably an electrostatic potential map (Coulomb-potential map), a charge-density map or an electron-density map), and preferably from experimental or computed physical data, if desired, transforming the electrostatic quantities (Cryo-EM: potential, XRD: electron density), preferably quantities of the reconstructed electrostatic-quantity map, into transformed electrostatic quantities, preferably forming a transformed electrostatic-quantity map (preferably a quantity comprising local charge information), defining a set of biochemical feature values, said values being related to a biochemical feature of the macromolecule, forming a descriptor for the biochemical feature, by o selecting a set of submolecular local entities using the geometric structure data (preferably of the macromolecular structure data map), o if desired, assigning submolecular local entity values to the submolecular local entities, o selecting a set of electrostatic quantities related to the set of submolecular local entity values (preferably transformed electrostatic quantities), each quantity being assigned to a related submolecular local entity value in the set of submolecular local entity values, to obtain electrostatic data-derived parameters ( / electrostatic data related / electrostatic type parameters), wherein the set of submolecular local entity values and the set of electrostatic data-derived parameters form the descriptor for the biochemical feature, if desired, transforming the electrostatic data-derived parameters into transformed electrostatic data-derived parameters (preferably a different parameter comprising local charge information, highly preferably ICD in case of Cryo-EM), if desired, processing the electrostatic-type parameters into processed electrostatic-type parameters (e.g. scaling, normalizing or correction; like ICD corrected, normalized, FO density), optionally defining a subset of the set of submolecular local entity values, thereby defining a part, e.g. a domain, substructure or moiety of the macromolecule and a related subset of the electrostatic-type parameters (e.g. an oligopeptide, e.g. tripeptide, a domain (crop), or a set of selected atoms (trimmed) or the subset), setting up a mathematical model for the relationship between the set of electrostatic data-derived parameters and the biochemical function of the biological macromolecule, said model being set up on a multitude of descriptors (preferably defining interactions (patterns) between the electrostatic data-derived parameters), the electrostatic data-derived parameters being predictor variables and the biochemical feature values being the target variables in the model, to obtain a biochemical feature score to evaluate the relationship between the multitude of descriptors and the biochemical feature values, thereby providing the in silico representation of the macromolecule.

[0138] In an alternative wording said biological macromolecule comprising a multiplicity of submolecular local entities selected from atoms, bonds and groups of atoms, preferably consisting of e.g. at most 8, 5 or 4 atoms, and having a structure described by electrostatic quantities organized into an electrostatic quantity map. In an alternative wording the said model being set up on a multitude of descriptors, taking the electrostatic data-derived parameters as predictor variables and the biochemical feature values as target variables, to obtain a score being useful to evaluate the relationship between the multitude of descriptors and the biochemical feature values.

[0139] Preferably, using the in silico representation of the macromolecule, a relationship / connection is established between the biochemical feature (assessed by comparison of the biochemical feature score of interest with the at least one multitude of reference biochemical feature scores) and the biological function of the biological macromolecule and thereby its biological mechanism (of action).

[0140] The electrostatic quantities (in case of Cryo-EM: potential, in case of XRD: electron density, in case of in silico calculations: calculated electrostatic potential) relate to the structure of the biological macromolecule and are preferably provided as a reconstructed electrostatic-quantity map (preferably an electrostatic potential map (e.g. Coulomb-potential map), a charge-density map or an electron-density map), preferably obtained from experimental or computed physical data.

[0141] If desired, the electrostatic quantities (Cryo-EM: potential, XRD: electron density, in case of in silico calculations: calculated electrostatic potential), preferably quantities of the reconstructed electrostatic-quantity map, are transformed into transformed electrostatic quantities, preferably forming a transformed electrostatic-quantity map (preferably a quantity comprising local charge information).

[0142] Paragraph 14

[0143] A computer-implemented method performed by one or more data processing apparatus for providing a reference multitude of score values for estimation of a value related to a known biochemical feature (known biochemical feature value) of a reference biological macromolecule (preferably a multitude of reference biological macromolecules) and for assessing the biochemical feature of a biological macromolecule of interest, using an in silico representation of the reference biological macromolecule, wherein preferably the biological macromolecule is a protein, said biological macromolecule comprising a multiplicity of submolecular local entities and having a structure described by
  o geometric structure data (e.g. backbone torsion angles and / or atomic coordinates) and
  o electrostatic quantities preferably organized into an electrostatic quantity map,
wherein said in silico representation comprises
a descriptor (including a multiplicity of descriptors) comprising
  o a set of submolecular local entity values, identifying a set of descriptor-related submolecular local entities, each submolecular local entity value belonging to said descriptor,
  o a set of electrostatic data-derived parameters derived from the electrostatic quantities, preferably charge-related parameters characterizing local charges, obtained from the electrostatic quantity map of the biological macromolecule, each electrostatic data-derived parameter being assigned to a single submolecular local entity value within the set of submolecular local entity values,
a set of biochemical feature values, comprising one or more value(s) associated with a biochemical feature of the macromolecule,
a model, said model being set up on a multitude of descriptors, taking the electrostatic data-derived parameters as predictor variables and the biochemical feature values as target variables, to obtain a score being useful to evaluate the relationship between the multitude of descriptors and the biochemical feature values;
said method comprising the steps of
providing a structure of the biological macromolecule described by electrostatic quantities organized into an electrostatic quantity map, or, in a particular embodiment,
  o geometric structure data defining the structure of the biological macromolecule or substructure thereof, preferably a macromolecular structure data map (preferably protein structure data; preferably based on experimental data)
  o electrostatic quantities related to the structure of the biological macromolecule, preferably a reconstructed electrostatic-quantity map (Cryo-EM: potential, XRD: electron density, in silico: calculated electrostatic potential) (preferably an electrostatic potential map (Coulomb-potential map), a charge-density map or an electron-density map), and preferably from experimental or computed physical data,
if desired transforming the electrostatic quantities (Cryo-EM: potential, XRD: electron density, in silico: calculated electrostatic potential), preferably quantities of the reconstructed electrostatic-quantity map, into transformed electrostatic quantities, preferably forming a transformed electrostatic-quantity map (e.g. electrostatic potential is transformed into charge distribution) (preferably a quantity comprising local charge information),
defining a set of biochemical feature values, each value being related to a known biochemical feature of a known macromolecule,
forming reference descriptors for the biochemical feature of a (multitude of) biological macromolecule(s), by
  o selecting a set of submolecular local entities for each descriptor, using the geometric structure data (preferably of the macromolecular structure data map),
  o if desired assigning submolecular local entity values to the submolecular local entities,
  o selecting a set of electrostatic quantities related to the set of submolecular local entity values for the given descriptor (preferably transformed electrostatic quantities) and assigning each electrostatic quantity to a related submolecular local entity value in the set of submolecular local entity values, to obtain electrostatic data-derived parameters (also referred to as electrostatic data-related or electrostatic-type parameters), wherein the set of submolecular local entity values and the set of electrostatic data-derived parameters form the descriptor for the biochemical feature,
if desired, transforming the electrostatic data-derived parameters into transformed electrostatic data-derived parameters (preferably a different parameter comprising local charge information, highly preferably ICD in case of Cryo-EM),
if desired, processing the electrostatic-type parameters into processed electrostatic-type parameters (e.g. scaling, normalizing or correction; like ICD corrected, normalized, FO density),
optionally defining a subset of the set of submolecular local entity values, thereby defining a part, e.g. a domain, substructure or moiety of the macromolecule, and a related subset of the electrostatic-type parameters (e.g. an oligopeptide, e.g. tripeptide, a domain (crop), or a set of selected atoms (trimmed) or the subset),
setting up a mathematical model for the relationship between the set of electrostatic data-derived parameters and the biochemical function of the biological macromolecule, said model being set up on a multitude of descriptors (preferably defining interactions (patterns) between the electrostatic data-derived parameters), the electrostatic data-derived parameters being predictor variables and the biochemical feature values being the target variables in the model, to obtain a biochemical feature score to evaluate the relationship between the multitude of descriptors and the biochemical feature values,
thereby providing the in silico representation of the macromolecule,
applying the model to calculate a reference multitude of scores.

[0144] Preferably, using the in silico representation of the macromolecule, a relationship / connection is established between the biochemical feature (assessed by comparison of the biochemical feature score of interest with the at least one multitude of reference biochemical feature scores) and the biological function of the biological macromolecule and thereby its biological mechanism (of action).

[0145] In an alternative wording said biological macromolecule comprising a multiplicity of submolecular local entities selected from atoms, bonds and groups of atoms, preferably consisting of e.g. at most 8, 5 or 4 atoms, and having a structure described by electrostatic quantities organized into an electrostatic quantity map.

[0146] In an alternative wording the model is set up on the electrostatic data-derived parameters of the descriptors as predictor variables and the known biochemical feature values as target variables, to obtain a multitude of scores being useful to evaluate the relationship between the multitude of descriptors and the biochemical feature values.

[0147] In the method of the invention preferably a biochemical feature score of interest is obtained by the model of the invention and using the in silico representation of the invention, wherein the biochemical feature score of interest is compared with the at least one multitude of reference biochemical feature scores, each of said reference biochemical feature scores obtained as the result of the mathematical model set up on a multitude of reference descriptors of biological macromolecules with known biochemical features (wherein the corresponding known biochemical feature values are target variables of the model).

[0148] Paragraph 15

[0149] The method according to any of Paragraphs 12, 13, 14 or 15, wherein
  o the model type is an alignment-type model and the target variables are values calculated between descriptors, preferably a difference-type model in which the difference-type values are calculated between two descriptors, or
  o the model type is a regression model, and the target variables are the weighted combination (preferably sum) of the predictor variables, or
  o the model type is a classification-type model, and the obtained score values are the weighted combination of the predictor variables.
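By way of illustration only, a difference-type value between two descriptors can be sketched with dynamic time warping (DTW), which the figure descriptions mention for comparing Cryo-EM measurements. The sketch assumes each descriptor is reduced to an ordered list of per-atom parameter values. The function name `dtw_distance` and the sample numbers are hypothetical and are not taken from the application.

```python
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic dynamic-time-warping distance with an absolute-difference cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # advance in a only
                cost[i][j - 1],      # advance in b only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]


# Hypothetical per-atom parameter profiles of two chains (illustrative numbers only).
chain_up = [0.10, 0.40, 0.35, 0.80]
chain_down = [0.10, 0.10, 0.40, 0.35, 0.80]
difference_score = dtw_distance(chain_up, chain_down)
```

Because DTW tolerates local stretching, the repeated first value in `chain_down` does not contribute to the difference here. A plain element-wise difference would require descriptors of equal length.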

[0150] Paragraph 16

[0151] A use of the in silico representation as defined in any of Paragraphs 1 to 15, preferably 1, 12, 13, 14, or 15, or the multitude of score values for estimation of a value related to a known biochemical feature, in any of the following: drug design, preferably drug design by studying the interaction between a molecule, e.g. a drug candidate and a biological macromolecule of interest, drug screening (whereas drug screening may be included in drug design), diagnosis of a patient by assessing the biochemical feature of a macromolecule or a protein of interest of said patient in silico, preferably to assess whether said macromolecule is native or correctly folded or functional, protein engineering, in particular in designing mutations, e.g. which are non-toxic or which leave the protein structure folded and functional or which alter the structure or function in a desired way; in a preferred embodiment antibody engineering or channel protein engineering; in a preferred embodiment gene manipulations including designing silent mutations.

[0152] Paragraph 17

[0153] The use according to Paragraph 16 of the in silico representation as defined in any of Paragraphs 1 to 15, preferably 1, 12, 13, 14 or 15, or of the multitude of score values for estimation of a value related to a known biochemical feature, wherein the score of interest is the result of the model utilized on at least one descriptor, optionally on multiple descriptors, both the score and the at least one descriptor being related to the biological macromolecule of interest, the score of interest is compared with at least one multitude of reference scores, each of said reference scores obtained as the result of the model set up on a multitude of reference descriptors of biological macromolecules with known biochemical features, and the biochemical feature of the biological macromolecule of interest is assessed by comparing the score of interest with the at least one multitude of reference scores or with a reference interval derived therefrom.

[0154] The use according to Paragraph 16, of the in silico representation as defined in any of Paragraphs 1 to 15, preferably 1, 12, 13, 14 or 15, wherein any one of the computer-implemented methods of Paragraphs 1 to 15 is carried out.

[0155] Paragraph 18

[0156] The use according to Paragraph 16 or 17 of the in silico representation as defined in any of Paragraphs 1 to 15, preferably 1, 7, 8, 12, 13, 14 or 15, in particular 7 or 8, wherein said biochemical feature comprises difference in the conformation, e.g. (in certain embodiment preferably) difference in the secondary structure, preferably selected from the group consisting of alpha helical structure, beta-sheet, termination thereof, or other conformation type, including turn structures, and wherein preferably the biochemical feature values are selected from torsion angles or any representation thereof, Bernoulli variables relating to different structures, etc.

[0157] By way of example, either alignment type or classification type models can be used in these embodiments, or, alternatively, the model type is a regression-type model, and the target variables are the weighted combination (preferably sum) of the predictor variables, optionally a multiple linear regression model.
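A minimal sketch of the regression-type option follows, assuming an ordinary least-squares fit in which the score is an intercept plus a weighted sum of the predictor variables. The helper names `fit_linear` and `score` and the toy data are hypothetical. The application does not specify a fitting procedure, so the normal-equation solver is an assumption made only for illustration.

```python
from typing import List, Sequence


def fit_linear(X: Sequence[Sequence[float]], y: Sequence[float]) -> List[float]:
    """Ordinary least squares via the normal equations; returns [intercept, w1, ..., wk]."""
    rows = [[1.0, *x] for x in X]  # prepend a constant column for the intercept
    k = len(rows[0])
    # Augmented system (A^T A | A^T y).
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        if abs(aug[col][col]) < 1e-12:
            raise ValueError("singular design matrix")
        for r in range(k):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [v - f * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


def score(weights: Sequence[float], x: Sequence[float]) -> float:
    """Score = intercept + weighted sum of the predictor variables."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], x))


# Hypothetical descriptors: two electrostatic data-derived parameters each.
X = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0]]
y = [3.0, 3.0, 5.0, 7.0]  # toy targets generated from y = 1 + 2*x1 + 2*x2
weights = fit_linear(X, y)
```

For realistic descriptor sizes a numerical library would normally replace the hand-written solver. The pure-Python version is used here only to keep the sketch self-contained.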

[0158] Paragraph 19

[0159] The use according to Paragraph 16 or 17 of the in silico representation as defined in any of Paragraphs 1 to 15, preferably 1, 9, 12, 13, 14 or 15, in particular 9, wherein the biological feature comprises a pattern based on primary structure, preferably said biological feature is selected from the group consisting of alpha helical structure, beta-sheet, termination thereof, or other conformation type, preferably the biochemical feature values are selected from torsion angles or any representation thereof, discrete variables, Bernoulli variables relating to different structures, etc. Alternatively, the biological feature comprises any pattern of physical quantities along the polypeptide chain. In a simple embodiment the pattern is a fingerprint-like pattern.

[0160] Paragraph 20

[0161] The use according to Paragraph 16 or 17 of the in silico representation as defined in any of Paragraphs 1 to 15, preferably 1, 10, 12, 13, 14 or 15, in particular 10, wherein the biological feature is a feature which indicates the folded nature of the protein, preferably a general arrangement of electrostatic data-derived parameters indicating the folded state, for example a pattern, e.g. a fingerprint-like pattern as defined in Paragraph 19.

[0162] Alternatively, it may be a more specific interaction related to the folded state, like the presence of relationships or interactions like hydrogen bonds or hydrophobic interactions.

[0163] The method according to this Paragraph, wherein the biochemical feature is the tertiary structure of the protein, preferably characterized by π-π relationships.

[0164] Paragraph 21

[0165] The use according to Paragraph 16 or 17 of the in silico representation as defined in any of Paragraphs 1 to 15, preferably 1, 11, 12, 13, 14 or 15, in particular 11, wherein the biological macromolecule is a protein encoded by a genetic code and wherein the biochemical feature is a relationship between the genetic code and the electrostatic state of the environment of the encoded amino acid(s), especially of the same amino acid but with different coding codon origins.

[0166] Paragraph 22

[0167] The use according to any of Paragraphs 16 to 21 of the in silico representation as defined in any of Paragraphs 1 to 15, for assessing the biochemical feature of a biological macromolecule of interest, using the in silico representation of the reference biological macromolecule, wherein preferably the computer-implemented method of any of Paragraphs 1 to 15 is performed.

[0168] Preferably, at least the following steps are carried out: obtaining a score of interest as the result of the model utilized on at least one descriptor, optionally on multiple descriptors, both the score and the at least one descriptor being related to the biological macromolecule of interest, comparing the score of interest with at least one multitude of reference scores, each of said reference scores obtained as the result of the model set up on a multitude of reference descriptors of biological macromolecules with known biochemical features (wherein the corresponding known biochemical feature values are target variables of the model), assessing (or determining) a biological feature of the biological macromolecule of interest by comparing the score of interest with the at least one multitude of reference scores or with a reference interval derived therefrom.
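The comparison step can be sketched as follows, assuming that the reference interval is taken as the central 95% range of the reference scores. That choice, the helper names (`percentile`, `reference_interval`, `within_reference`) and the sample scores are assumptions made for illustration. The text above only requires a comparison with the reference scores or with an interval derived from them.

```python
from typing import Sequence, Tuple


def percentile(sorted_vals: Sequence[float], q: float) -> float:
    """Empirical quantile (q in [0, 1]) with linear interpolation between ranks."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac


def reference_interval(ref_scores: Sequence[float], coverage: float = 0.95) -> Tuple[float, float]:
    """Central interval covering the given fraction of the reference scores."""
    ordered = sorted(ref_scores)
    tail = (1.0 - coverage) / 2.0
    return percentile(ordered, tail), percentile(ordered, 1.0 - tail)


def within_reference(score_of_interest: float, interval: Tuple[float, float]) -> bool:
    """True if the score of interest falls inside the reference interval."""
    low, high = interval
    return low <= score_of_interest <= high


# Hypothetical reference scores from macromolecules with a known biochemical feature.
reference_scores = [float(v) for v in range(101)]
interval = reference_interval(reference_scores)
typical = within_reference(50.0, interval)
atypical = within_reference(99.0, interval)
```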

[0169] Paragraph 23

[0170] The use according to any of Paragraphs 16 to 22 of the in silico representation as defined in any of Paragraphs 1 to 15, in any of the following: drug design, preferably drug design by studying the interaction between a molecule, e.g. a drug candidate and a biological macromolecule of interest, drug screening (whereas drug screening may be included in drug design), diagnosis of a patient by assessing the biochemical feature of a macromolecule or a protein of interest of said patient in silico, preferably to assess whether said macromolecule is native or correctly folded or functional, protein engineering, in particular in designing mutations, e.g. which are non-toxic or which leave the protein structure folded and functional or which alter the structure or function in a desired way; in a preferred embodiment antibody engineering or channel protein engineering; in a preferred embodiment gene manipulations including designing silent mutations, as disclosed in the detailed description or in the corresponding examples.

[0171] Thus, the embodiments of the invention are useful among others in drug design and diagnostics as well as protein engineering, drug screening, antibody engineering, gene manipulations.

[0172] DEFINITIONS

[0173] Geometric structural representation - data modelling the 3D geometric state of the biological macromolecule or substructure thereof. Such representation contains e.g. the coordinates of submolecular local entities or torsion angles or bond lengths.

[0174] Structural representation - data modelling the functioning state of the biological macromolecule or substructure thereof. Such representation can be the geometric structural representation and / or a given type of electrostatic representation, further defined in the Brief description of the invention.

[0175] Quantity map - is a spatial distribution (or in other words an electrostatic field) of an electrostatic parameter from which electrostatic data-derived parameters can be derived. It can be an experimental measurement (e.g. ESP distribution or ED distribution) or an in silico model or a local resolution distribution map.

[0176] Submolecular local entity - is a selected atom or group of atoms, with geometric location defined by geometric structure parameters (i.e. the entity is local), within a biological macromolecule (i.e. it is submolecular), wherein the nucleus of the atom or the nuclei of the group of atoms are within a sphere of not more than 9 angstroms, preferably not more than 7 angstroms, preferably not more than 6 angstroms, or in a particular embodiment not more than 3 angstroms. For example, the submolecular local entity can be an atom, a residue, a side chain, an aromatic group, a peptide bond, a covalent bond etc. The submolecular local entity may be characterized by e.g. the information on the geometric structure or the submolecular local entity value.

[0177] The submolecular local entity is preferably an atom or a small group of atoms consisting of e.g. at most 8, 5 or 4 atoms. In a preferred embodiment the submolecular local entity is smaller than an amino acid residue.

[0178] Preferably the submolecular local entity is an atom.

[0179] The submolecular local entity is thus different from a shape defined by geometrical data, e.g. coordinates, or a surface consisting of data points of a space.

[0180] Submolecular local entity value - is the identifier tag of a selected atom or group of atoms within a molecule, preferably a biological macromolecule (i.e. it is submolecular). The submolecular local entity value preferably does not contain parameters of the geometric structural representation (e.g. torsion angles or coordinates).

[0181] Descriptor - is an organized data set or collection of data assigned to one or more submolecular local entities comprising electrostatic parameters, and which sensitively describes inherent properties of the biological macromolecule or a selected part of said macromolecule, and on which a mathematical model can be utilized. Further definition can be found in the Brief description of the invention.

[0182] Model - Mathematical algorithm calibrated and fitted on a multitude of descriptors used as predictor variables and the biochemical feature values as target variables. The model is built upon a multiplicity of reference descriptors (preferably on a multiplicity of reference macromolecules or proteins) to obtain a multiplicity of reference scores. The model then can be used to determine the biochemical feature score of interest of the biochemical feature of interest. Thus, the set up model is used to obtain a score from at least one descriptor, thus making estimation of the biochemical feature value possible, and to obtain information on the biochemical feature. Further definition can be found in the Brief description of the invention.

[0183] Score (value) - Dimensionless number / parameter, that is calculated from a descriptor, and is used to estimate the biological feature value of interest. The score value is the output of a model set up on a multitude of descriptors. Further definition can be found in the Brief description of the invention.

[0184] Electrostatic - Electrostatic refers to phenomena or parameters related to stationary or infinitesimally slow-moving electric charges. These effects arise from the forces that charges exert on one another when they are at rest or in quasi-static equilibrium. The term often applies to the fields, forces, and potentials associated with electric charges that do not involve significant motion or current flow. Charge-related electrostatic quantities comprise charge as a physical quantity i.e. are not independent therefrom.

[0185] Electrostatic data-derived parameter - is a mathematical representation, having a numerical value assigned to a submolecular local entity, based on the matching of the quantity map and the structural representation. Preferably the assignment is made by linear interpolation. The numerical value can be given by measuring certain stationary (i.e. time-invariant) or slow-moving electric phenomena surrounding a physical entity, such as Coulomb potential or electron density. Thus, the electrostatic data-derived parameter is different from energy or energy density.
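The assignment by linear interpolation can be sketched as trilinear interpolation on a regular grid. The sketch assumes a cubic voxel spacing, a map indexed as `grid[ix][iy][iz]` with at least two samples per axis, and atomic coordinates given in the same Cartesian frame as the map origin. The function name `interpolate`, the atom tag and all numbers are hypothetical.

```python
from math import floor
from typing import Dict, Sequence, Tuple

Point = Tuple[float, float, float]
Grid = Sequence[Sequence[Sequence[float]]]  # indexed as grid[ix][iy][iz]


def interpolate(grid: Grid, origin: Point, spacing: float, point: Point) -> float:
    """Trilinear interpolation of a regular 3D quantity map at a Cartesian point."""
    shape = (len(grid), len(grid[0]), len(grid[0][0]))
    base, frac = [], []
    for p, o, size in zip(point, origin, shape):
        value = (p - o) / spacing  # fractional grid index along this axis
        if not 0 <= value <= size - 1:
            raise ValueError("point lies outside the map")
        i = min(int(floor(value)), size - 2)  # keep i + 1 inside the grid
        base.append(i)
        frac.append(value - i)
    (i, j, k), (tx, ty, tz) = base, frac
    result = 0.0
    # Weighted sum over the eight corners of the enclosing voxel.
    for di, wx in ((0, 1 - tx), (1, tx)):
        for dj, wy in ((0, 1 - ty), (1, ty)):
            for dk, wz in ((0, 1 - tz), (1, tz)):
                result += wx * wy * wz * grid[i + di][j + dj][k + dk]
    return result


# Toy 2x2x2 map whose value is linear in the grid indices (illustrative only).
grid = [[[ix + 2 * iy + 4 * iz for iz in range(2)] for iy in range(2)] for ix in range(2)]
atoms: Dict[str, Point] = {"ALA1:CA": (0.5, 0.5, 0.5)}  # hypothetical entity tag
parameters = {tag: interpolate(grid, (0.0, 0.0, 0.0), 1.0, xyz) for tag, xyz in atoms.items()}
```

The dictionary keys play the role of submolecular local entity values, and the interpolated numbers play the role of electrostatic data-derived parameters.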

[0186] The electrostatic data-derived parameters are preferably charge-related parameters.

[0187] ESP (ElectroStatic Potential) or Coulomb Potential - The amount of work / energy needed per unit of electric charge to move the charge from a reference point to a specific point in an electric field.
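In standard textbook notation, which the application does not itself use, this definition can be written as follows. Here $W$ is the work done against the field on a test charge $q$, $\mathbf{E}$ is the electric field, and the second line gives the familiar Coulomb form for point charges $q_i$ at positions $\mathbf{r}_i$ with the reference point at infinity.

```latex
V(\mathbf{r}) \;=\; \frac{W_{\mathrm{ref}\to\mathbf{r}}}{q}
            \;=\; -\int_{\mathbf{r}_{\mathrm{ref}}}^{\mathbf{r}} \mathbf{E}\cdot d\mathbf{l},
\qquad
V(\mathbf{r}) \;=\; \frac{1}{4\pi\varepsilon_0}\sum_i \frac{q_i}{\lvert \mathbf{r}-\mathbf{r}_i \rvert}
```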

[0188] Biochemical feature value - is a numerical value describing the state of certain biological or chemical properties of a biological macromolecule, preferably a protein. Such value can be e.g. the activity of a specific domain inside of the macromolecule, or the logical value (i.e. 1 or 0) of whether or not an amino acid causes helix termination in a protein, or a descriptor-independent electrostatic data-derived parameter.

[0189] Predictor variable - is an independent variable in a predictive model that is used to forecast the outcome of another variable based on its value. In our case it is generally a continuous or quantitative variable such as an electrostatic data-derived parameter.

[0190] Target variable - is a dependent variable in a predictive model whose values are predicted by other variables. In our case it is generally a discrete or qualitative variable such as a biochemical feature value.

[0191] Inverse Charge Density (ICD) - a transformed charge-related electrostatic parameter derived from the difference between the electrostatic data-derived parameter value at the position of the nucleus and the value at the estimated position of the nearby local minimum (representing the highest density of electrons in the cloud), both taken from the charge distribution quantity map. This parameter is proportional to the broadening / volume of the electron cloud.
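The difference underlying this definition can be sketched under strong assumptions. The sketch treats the map as a callable `charge_at`, searches a spherical volume around the nucleus on a fixed step, and takes the lowest sampled value as the local minimum. The search strategy, the radius and step values, the function name and the toy map are all hypothetical. The exact ICD formula, including any further transformation of this difference, is not given in this section and is not reproduced here.

```python
from itertools import product
from math import dist
from typing import Callable, Tuple

Point = Tuple[float, float, float]


def nucleus_minus_local_minimum(
    charge_at: Callable[[Point], float],
    nucleus: Point,
    search_radius: float = 1.0,  # assumed search volume, in map length units
    step: float = 0.25,          # assumed sampling step
) -> Tuple[float, Point]:
    """Return (value at nucleus - lowest sampled value, position of that minimum)."""
    n = int(round(search_radius / step))
    p_charge = charge_at(nucleus)
    e_charge, e_pos = p_charge, nucleus
    for di, dj, dk in product(range(-n, n + 1), repeat=3):
        offset = (di * step, dj * step, dk * step)
        if sum(c * c for c in offset) > search_radius ** 2:
            continue  # stay inside the spherical search volume
        candidate = (nucleus[0] + offset[0], nucleus[1] + offset[1], nucleus[2] + offset[2])
        value = charge_at(candidate)
        if value < e_charge:
            e_charge, e_pos = value, candidate
    return p_charge - e_charge, e_pos


def toy_map(p: Point) -> float:
    """Toy radial profile with a minimum 0.5 length units from the origin."""
    return (dist(p, (0.0, 0.0, 0.0)) - 0.5) ** 2


difference, minimum_position = nucleus_minus_local_minimum(toy_map, (0.0, 0.0, 0.0))
```

In practice `charge_at` would be an interpolation of the charge distribution quantity map at the requested point.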

[0192] Resolution - In the context of structural biology, resolution is the ability to distinguish the presence or absence of atoms or groups of atoms in a biomolecular structure. Generally, the structure originates from methods such as X-ray crystallography, electron crystallography, or Cryo-EM. Resolution is used here as usually understood in the art.

[0193] Local resolution - during Cryo-EM imaging not every segment of the produced quantity map possesses the same resolution. Local resolution describes the resolution of the Cryo-EM image at a certain point or in a smaller volume, which is crucial information e.g. during the construction of an ESP map of a molecule.

[0194] Local Resolution Distribution (u distribution) - is a quantity map originated from the transformation of the experimental Cryo-EM ESP maps preferably with the MonoRes algorithm. The physical information of this quantity map is the local resolution. Atomic local resolution value (n) - physical information assigned to each atom. Calculated through linear interpolation - done on the discrete grid of the local resolution distribution - in relation to a selected set of at least one submolecular local entity and their spatial coordinates described by a structural 3D model of the molecule.

[0195] Contribution plot - a diagram used to represent the subgroups of the weighted predictor variables in order to show the different effects of addends of the estimations of target variables. Such representation is useful in the analysis of the relations of the biochemical features and the electrostatic states of biological macromolecules.
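The quantity shown in such a plot can be sketched as follows, assuming a linear score whose addends are the products of weight and predictor value, summed within named subgroups. The function name `group_contributions`, the group labels and the numbers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Sequence


def group_contributions(
    weights: Sequence[float],
    values: Sequence[float],
    groups: Sequence[str],
) -> Dict[str, float]:
    """Sum the addends w_i * x_i of a linear score within each named subgroup."""
    totals: Dict[str, float] = defaultdict(float)
    for w, x, g in zip(weights, values, groups):
        totals[g] += w * x
    return dict(totals)


# Hypothetical weights and parameter values for atoms of three neighboring residues.
weights = [0.5, -1.0, 2.0, 1.0]
values = [2.0, 1.0, 0.5, 3.0]
groups = ["preceding", "central", "central", "following"]
contributions = group_contributions(weights, values, groups)
total_score = sum(contributions.values())  # equals the full weighted sum (no intercept)
```

Plotting the per-group sums side by side shows which subgroup drives the estimate of the target variable.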

[0196] BRIEF DESCRIPTION OF THE FIGURES

[0197] Figure 1. The change of the interpolated charge value along the line drawn between the atomic position and the hypothetical electron cloud position. The nucleus ideally is at the local maximum, while the electron is at the local minimum.

[0198] Figure 2. The derived parameters and their dependency on ICD: Upper left: “pcharge”: atomic position charge values; Lower left: “ppotential”: potential (ESP) value interpolated at the atomic position; Upper right: “echarge”: charge value interpolated at a given point of the hypothesized electron cloud in the minimum searching volume; Lower right: “pcharge” plus “echarge”. The scales are not relevant and are omitted for clarity.

[0199] Figure 3. The calculated atomic (CA) ICD values of two chains of a SARS-CoV-2 spike protein (7KDL) plotted in the order of the amino acid index. The two curves present two different protein states. The up state (marked with full circle) and the down state (marked with x) show significant differences on the receptor binding domain (300-500).

[0200] Figure 4. The calculated atomic (CA) local resolution values of two chains of a SARS-CoV-2 spike protein (7KDL) plotted in the order of the amino acid index. The two curves present two different protein states. The up state (marked with full circle) and the down state (marked with x) show a significant difference on the receptor binding domain (300-500).

[0201] Figure 5. The atomic local resolution and ICD values of an example SARS-CoV-2 spike protein (7KDL) measured by Cryo-EM. As the local resolution value increases (i.e. the resolution becomes worse), the maximum value of the ICD decreases. Because of this correlation, values cannot be compared globally between different sites of the protein.

[0202] Figure 6. The atomic local resolution and ICDcorrected values of an example SARS-CoV-2 spike protein (7KDL) measured by Cryo-EM. The relationship is removed; there is no correlation between the parameters. The cut-off threshold is indicated by dashed lines.

[0203] Figure 7. The pattern of CA atoms of the ICDcorrected values of two unique chains of a SARS-CoV-2 spike (7KDL) Cryo-EM measurement. The difference between the values of the same positions can be used in further analysis.

[0204] Figure 8. The atomic local resolution and ICDnormal values of an example SARS-CoV-2 spike protein (7KDL) measured by Cryo-EM. Only the atoms of the peptide chain are presented. The relationship is removed; there is no meaningful correlation between the parameters. A threshold can be determined to cut off the values where the ratio is extremely large.

[0205] Figure 9. The pattern of the CA atoms of ICDnormal values of two unique chains of a SARS-CoV-2 spike (7KDL) Cryo-EM measurement. The difference between the values of the same positions can be used in further analysis.

[0206] Figure 10. The autocorrelation plot of the investigated SARS-CoV-2 spike protein (7KDL) active protein subunit. The horizontal lines in the plot correspond to 95% (full) and 99% (dashed) confidence bands. The spikes indicate the periodicity of the peptide backbone (N-CA-C).
Figure 11. The autocorrelation plot of the investigated SARS-CoV-2 spike protein (7KDL) active protein subunit. The horizontal lines in the plot correspond to 95% (full) and 99% (dashed) confidence bands. As there are only CA atoms, the spikes are not present.

[0207] Figure 12. The estimation pattern of the ICDcorrected values on the CA atoms. The estimation is unbiased; the plot shows a strong connection between the measured and estimated variable. In the estimation the ICDcorrected values of 11 neighboring atoms were used.

[0208] Figure 13. Change of the Pearson correlation coefficient as the estimated atomic ICDcorrected parameter gets further away from the input information.

[0209] Figure 14. Boxplot of the grouped calculated DTW differences between 11 independent Cryo-EM measurements. The grey segments represent the median and the notch indicates the confidence interval of the medians (alpha = 0.05). There are significant effects detected in the electrostatic change if the biological state is different.

[0210] Figure 15. Separation of the score values based on the structural information of alpha helix structures filtered for alanine and asparagine amino acids. The score value is a combination of the ICDcorrected values.

[0211] Figure 16. The contribution plot of the main chain ICDcorrected data. The dotted line refers to the CA, C, N, O atoms of the given ASN. The dashed line refers to the same atoms of the amino acid before the investigated ASN, while the full line refers to the atoms of the following amino acid. The curves represent the weighted sum of the atomic ICDcorrected values. Note that there is a moving average applied on these curves.

[0212] Figure 17. The contribution plot of the side chain ICDcorrected data. The dotted line refers to the side chain atoms of the given ASN. The dashed line refers to the CB atom of the preceding amino acid, while the full line refers to the CB atom of the following amino acid. The curves represent the weighted sum of the atomic ICDcorrected values. Note that there is a moving average applied on these curves.

[0213] Figure 18. Separation of the score values based on the structural information of tertiary structures filtered for histidine amino acid in beta structures. The score value is a combination of the ICDcorrected values.

[0214] Figure 19. The contribution plot of the main chain ICDcorrected data. The dotted line refers to the CA, C, N, O atoms of the given HIS. The dashed line refers to the same atoms of the amino acid before the investigated HIS, while the full line refers to the atoms of the following amino acid. The curves represent the weighted sum of the atomic ICDcorrected values. Note that there is a moving average applied on these curves.

[0215] Figure 20. The contribution plot of the side chain ICDcorrected data. The dotted line refers to the side chain atoms of the given HIS. The dashed line refers to the CB atom of the amino acid before the investigated HIS, while the full line refers to the CB atom of the following amino acid. The curves represent the weighted sum of the atomic ICDcorrected values. Note that there is a moving average applied on these curves.

[0216] Figure 21. An example descriptor of the atomic ICDcorrected values grouped together by certain conditions. The pattern represents the global electrostatic state of an amino acid. A multiplicity of these descriptors can serve as model training data.

[0217] Figure 22. The calculated weights of the ICDcorrected parameters of each atomic value and the constant value. As the plot shows there are significant parameters among them and many smaller weights. The shown linear model is predictive.

[0218] Figure 23. The calculated results of the triplet separation. As the plot shows, our model determines a significant difference between the electrostatic states of amino acids originating from different codons. The Welch p-value is indicated for the different box pairs.

[0219] Figure 24. The balanced accuracy scores of a complex XGBoost prediction aiming to classify the codons for each individual amino acid. The boxes show the distribution of the evaluation of different splits during the modelling. The darker boxes indicate the Cryo-EM measurements, while the lighter ones are from XRD datasets. All of our predictions are significant and above the reference benchmark (grey lines).

[0220] Figure 25. The results of secondary structure estimation (Helix / Sheet) on all protein building amino acid central descriptors. The plot shows the effect sizes (Cohen's d values) of the separations of the estimated target score values. The grey shaded region is an interval determined from the average effect sizes of the randomized models and their standard deviations.

[0221] Figure 26. The results of secondary structure estimation (Helix / Random coil) on all protein building amino acid central descriptors. The plot shows the effect sizes (Cohen's d values) of the separations of the estimated target score values. The grey shaded region is an interval determined from the average effect sizes of the randomized models and their standard deviations.

[0222] Figure 27. The results of secondary structure estimation (Random coil / Sheet) on all protein building amino acid central descriptors. The plot shows the effect sizes (Cohen's d values) of the separations of the estimated target score values. The grey shaded region is an interval determined from the average effect sizes of the randomized models and their standard deviations.

[0223] Figure 28. The B factor dependency of the obtained atomic fo density parameter. The vertical line indicates the chosen B factor level above which the data is dropped in the presented analysis. The chosen subset is indicated individually on the right figure.

[0224] Figure 29. The autocorrelation of fo density values on CA atoms of a selected protein (1ZK0). The vertical dashed line indicates the 99% confidence interval of the Pearson correlation coefficient.

[0225] Figure 30. The result of interpolation estimation without amino acid filter. The correlation is considered to be strong.

[0226] Figure 31. The result of a far extrapolation estimation without amino acid filter. The correlation is considered to be relatively strong.

[0227] Figure 32. The change of correlation coefficient as the distance grows from the descriptor. As the plot shows the calculations are meaningful even far from the defined imaginary box.

[0228] Figure 33. The classification report of a tree based model that classifies the calculated linear coefficients of the given amino acids on an independent test dataset. According to the results the presented effects are very strong and the pattern formation capability of a given amino acid is very unique, fingerprint-like.

[0229] Figure 34. The classification report of a tree based model that classifies the secondary structure states of given peptide segments with electrostatic descriptors. The classification is well above the level of significance.

[0230] Figure 35. The local resolution dependency of the obtained atomic CESP parameter. The shape of the distributions arises partly from the dependency on the local resolution and partly from the different scales of the individual Cryo-EM measurements.

[0231] Figure 36. The local resolution dependency of the scaled DESP parameter. As required, there is no correlation between the two parameters.

[0232] Figure 37. The local resolution dependency of the CESPcorrected parameter. The effect of the dependency is negligible. The constant local resolution values typically come from measurements of proteins with one exact state.

[0233] Figure 38. The autocorrelation curve of the corrected CESP data (7KDL).

[0234] Figure 39. The autocorrelation curve of the DESP data (7KDL).

[0235] Figure 40. The goodness of estimation of the extrapolation calculation on the CESP data.

[0236] Figure 41. The goodness of estimation of the extrapolation calculation on the DESP data.

Figure 42. The results of secondary structure estimation (Helix / Sheet) on all protein building amino acid central descriptors. The plot shows the effect sizes (Cohen's d values) of the separations of the estimated target score values. The grey shaded region is an interval determined from the average effect sizes of the randomized models and their standard deviations. The utilized parameter was CESPcorrected.

[0237] Figure 43. The results of secondary structure estimation (Helix / Sheet) on all protein building amino acid central descriptors. The plot shows the effect sizes (Cohen's d values) of the separations of the estimated target score values. The grey shaded region is an interval determined from the average effect sizes of the randomized models and their standard deviations. The utilized parameter was DESP.

[0238] Figure 44. The results of secondary structure estimation (Helix / Random coil) on all protein building amino acid central descriptors. The plot shows the effect sizes (Cohen's d values) of the separations of the estimated target score values. The grey shaded region is an interval determined from the average effect sizes of the randomized models and their standard deviations. The utilized parameter was CESPcorrected.

[0239] Figure 45. The results of secondary structure estimation (Helix / Random coil) on all protein building amino acid central descriptors. The plot shows the effect sizes (Cohen's d values) of the separations of the estimated target score values. The grey shaded region is an interval determined from the average effect sizes of the randomized models and their standard deviations. The utilized parameter was DESP.

[0240] Figure 46. The results of secondary structure estimation (Random coil / Sheet) on all protein building amino acid central descriptors. The plot shows the effect sizes (Cohen's d values) of the separations of the estimated target score values. The grey shaded region is an interval determined from the average effect sizes of the randomized models and their standard deviations. The utilized parameter was CESPcorrected.

[0241] Figure 47. The results of secondary structure estimation (Random coil / Sheet) on all protein building amino acid central descriptors. The plot shows the effect sizes (Cohen's d values) of the separations of the estimated target score values. The grey shaded region is an interval determined from the average effect sizes of the randomized models and their standard deviations. The utilized parameter was DESP.

[0242] Figure 48. The results of tertiary structure estimation on all possible aromatic protein building amino acid central descriptors. The plot shows the effect sizes (Cohen's d values) of the separations of the estimated target score values. The grey shaded region is an interval determined from the average effect sizes of the randomized models and their standard deviations. The utilized parameter was CESPcorrected.

[0243] Figure 49. The results of tertiary structure estimation on all possible aromatic protein building amino acid central descriptors. The plot shows the effect sizes (Cohen's d values) of the separations of the estimated target score values. The grey shaded region is an interval determined from the average effect sizes of the randomized models and their standard deviations. The utilized parameter was DESP.

[0244] Figure 50. Flowchart illustration of data processes and the creation of a descriptor related to a biochemical feature.

[0245] Figure 51. Flowchart illustration of the preparation of an in silico representation of a protein and obtaining a multitude of reference score values, wherein a score value is preferably made from at least one, preferably a multitude of descriptors by applying a model in accordance with the invention.

[0246] Figure 52. Flowchart illustration of the utilization of the in silico representation of the invention.

[0247] Figure 53. One of the calculated linear coefficients on a big population of XRD data-based descriptors, with multiple splits (difference in the initial population). As the plot shows, the effect of the ALA amino acid is the same for the train and test set. On average, the random sample has the same characteristics as the population; however, its standard deviation is similar to that of the stratified sample.

DETAILED DESCRIPTION OF THE INVENTION

[0248] Little is known about how regularities in the primary structure of a given biological macromolecule determine the formation of its geometrical structure. The most important step towards acquiring further information on these questions is to analyze the macromolecule as a defined set of physical parameters.

[0249] Proteins are crucial elements of biological pathways; however, the exact rules governing protein-protein interactions remain unknown.

[0250] The determination of protein structure usually begins with determining the amino acid sequence. The methods providing structural information are based on this, with the help of which the spatial structure and advisably atomic positions of the peptide chain(s) in the given protein are determined.

[0251] Methods to obtain information on secondary protein structure (e.g. circular dichroism, CD, and Fourier-transform infrared spectroscopy, FT-IR) and on the tertiary structure of folding (XRD, NMR and most recently Cryo-EM) have developed enormously since the concept of protein folding emerged. However, the important structural information is usually only approximately available even for XRD and Cryo-EM methods; specifically, XRD is applicable only to crystallizable molecules and yields a crystal structure that only approximates the native structure, while Cryo-EM is limited by the location-dependent resolution that currently characterizes the method.

[0252] When considering a protein structure, it is crucial to recognize that what we perceive and accept as the molecule's reality is always a representation. As with small chemical molecules, which can be described by chemical formulae of various design or by a model created by ab initio calculations, the actual (largely quantum mechanical) reality of their existence remains different: what is considered to be the molecule is a model.

[0253] There is certainly a difference between molecular representations (also called models) derived from physical measurements, in particular XRD or Cryo-EM, and those derived from in silico calculations. However, with the advance of biochemical knowledge, molecular representations (or models) generated using modern in silico calculations, particularly those based on AI and using a huge amount of data, like the AlphaFold approach, have outperformed models based on low-resolution measurements. What has always been common to models based on physical measurements and those based on in silico calculations is the importance of, and focus on, spatial or geometric structural parameters.

[0254] For example, using the Cryo-EM method and the XRD method, different kinds of information can be obtained on the same protein. In Cryo-EM, the electrons sense the entire electrostatic potential field, while in the case of XRD, X-rays interact with electrons, so XRD provides information on the electron distribution or electron density, i.e. further electrostatics-related quantities around atomic nuclei.

[0255] These data have so far been used predominantly to obtain a structural representation of the macromolecule, preferably of the protein structure.

[0256] However, both the electron distribution and the potential field are suitable for characterizing the charge rearrangements in a macromolecule, preferably in a protein. The present inventors have recognized that, in order to study this effect, charge values should be assigned to atoms or other submolecular local entities, and that the resulting organized data sets or collections (called descriptors herein) may provide a more sensitive characterization of charge-related molecular changes than the structure itself. The inventors present applications in which they show that the representation of a biochemical macromolecule of interest by electrostatics-related parameters alone can describe biological aspects of the molecule better than its structure.

[0257] In fact, electron distribution or density, potential field data and charges are all electrostatic quantities different from structural data. Thus, instead of charges, other charge-related or electrostatic quantities can be used. The inventors have unexpectedly recognized that the electrostatic representation of a protein molecule allows a conceptual change in assessing protein structure and function relationships, and that this representation of the protein molecule opens up novel perspectives and uses.

[0258] In the terminology used herein, an electrostatic quantity can be assigned to an atom or other submolecular local entity to obtain an electrostatic type parameter.

[0259] The biochemical features brought into connection with the descriptors can be multifold.

[0260] For example, a biochemical feature can be a change in the protein molecule due to a mutation. It is well known that some mutations have a significant effect on protein function whereas others have no effect on it.

[0261] Another exemplary biochemical feature is ligand effect. Ligand binding may be a regulator of protein activity.

[0262] Defining an electrostatic pattern on a macromolecule, such as a protein molecule, at the molecular, submolecular or atomic level is a novel way to represent the biochemical characteristics of such a macromolecule. The resulting, generally comparable sets of electrostatically derived physical parameters can be used to determine different biochemical features of the protein, such as secondary structure states, activity, the effects of mutations, ligands and regulatory modifications, and differences between protein states, and to quantify the effects on primary, secondary, tertiary and quaternary structure formation. These feature estimations can be utilized in various fields, e.g. diagnosis, drug screening and design, and macromolecular engineering such as protein engineering or the design of mutations.

[0263] More specifically, the inventive methods proved useful for predicting, with surprising certainty, the states of channel proteins, such as the active or inactive forms of the NMDAr Ca2+ channel, or of the SARS-CoV-2 spike protein (whether in the up or down state), etc. The representation of the invention also performed well in predicting helix termination.

[0264] Quite surprisingly, the inventive method proved to be effective in finding a relationship between codon variants for the same amino acid and the electrostatic environment of the amino acid encoded thereby, which provides a tool for creating “silent” mutations and harmonizing codon usage and protein folding. The tool may also be effective in diagnosing genetic diseases caused by silent mutations.

[0265] The inventive macromolecular representation also performed well when based on electrostatic and related structural data obtained from Cryo-EM measurements (potential maps) and XRD measurements (electron density maps). While presently far from optimal, the model proved applicable to protein structures and potential field data obtained by in silico model calculations, e.g. by forcefield methods. This indicates that the inventive method will be a useful supplement to artificial intelligence (AI) based protein folding methodology such as AlphaFold.

[0266] Several methods of varying accuracy are currently available for determining the electrostatic potential field. The first group is based on quantum mechanical and quantum chemical (ab initio) methods. These algorithms can accurately approximate the conformations and charge distributions of small molecules, at immense computational cost. Since they are based on determining the electronic wave function, their computational cost usually grows exponentially with the number of atoms in the given molecule. Therefore, their scope of application typically only covers dipeptides or oligopeptides made up of a few amino acids.

[0267] The second group of methods are forcefield methods based on quantum mechanical simulations of small molecules, capable of describing the interactions between atoms and approximating charge rearrangements. Since, with the help of a few parameters per atom, the interactions give an independent description of the phenomenological conformation, these methods have moderate computational cost, but are incapable of analyzing the effects of the folding of the protein. At the same time, with the knowledge of the atomic charges specific to monomer amino acids resulting from forcefield methods, it is possible to determine a zeroth-order approximation of the electrostatic potential field surrounding the protein molecule based on equation 1, since the electrostatic Coulomb potential can be derived from the charge density.

[0268] $$V(\mathbf{r}) = \frac{1}{4\pi\varepsilon_0} \iiint \frac{\rho(\mathbf{r}')}{|\mathbf{r} - \mathbf{r}'|}\, d^3r'$$

[0269] (Equation 1) where V(r) is the electrostatic Coulomb potential, ε₀ is the permittivity of free space and ρ(r′) is the charge density. The charge density can be determined from the electrostatic potential using the Poisson equation, equation 2.

[0270]-[0271] $$\nabla^2 V(\mathbf{r}) = -\frac{1}{\varepsilon_0}\,\rho(\mathbf{r})$$

[0272] (Equation 2)
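
For point charges located at the atomic positions, the integral in Equation 1 reduces to a sum over atoms. The following is a minimal sketch of that zeroth-order approximation, assuming partial charges in units of the elementary charge and distances in Angstrom; the function name `coulomb_potential` and the two-atom example are hypothetical and serve only as illustration.

```python
import numpy as np

# 1/(4*pi*eps0) expressed in V*Angstrom per elementary charge, so charges are
# given in units of e, distances in Angstrom and the potential comes out in volts.
COULOMB_CONSTANT = 14.399645


def coulomb_potential(points, atom_xyz, atom_charges):
    """Potential at each point from point charges (discrete form of Equation 1)."""
    points = np.atleast_2d(np.asarray(points, dtype=float))
    atom_xyz = np.atleast_2d(np.asarray(atom_xyz, dtype=float))
    atom_charges = np.asarray(atom_charges, dtype=float)
    # Distance |r - r_i| between every evaluation point and every atom.
    dist = np.linalg.norm(points[:, None, :] - atom_xyz[None, :, :], axis=-1)
    if np.any(dist == 0.0):
        raise ValueError("potential is singular at an atomic position")
    return COULOMB_CONSTANT * (atom_charges[None, :] / dist).sum(axis=1)


# Hypothetical two-atom example with partial charges of +0.4 e and -0.4 e.
atoms = [[0.0, 0.0, 0.0], [1.2, 0.0, 0.0]]
charges = [0.4, -0.4]
print(coulomb_potential([[0.6, 3.0, 0.0]], atoms, charges))
```

The sketch ignores the solvent and the dielectric boundary entirely, which is why it is only meaningful as a vacuum approximation.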

[0273] There are empirical, semi-empirical and the aforementioned ab initio quantum mechanical calculations for the approximate computation of atomic charges, all of which have their own advantages and disadvantages. The most widespread method is Restrained Electrostatic Potential (RESP) fitting. The principle of the method is that, for the semi-empirical or ab initio calculation of the atomic charges, the charges are fitted to the Electrostatic Potential (ESP) calculated at many grid points in the molecule's environment. The advantage of this approach is that these charges can reproduce the multipole moments calculated with quantum mechanical techniques and can optimally reproduce the intermolecular interactions with surrounding molecules, which are essential for simulations in complex phases. The approximation also includes a so-called penalty function, which refines the conformation dependence. Its essence is to restrain the fitted charges during the fitting to the ESP while affecting the quality of the fit only to a very small degree.

[0274] The partial charges characteristic of the monomer molecules, determined with the forcefield models, form the basis of the calculation of the electrostatic field. The method contains approximations in multiple places. Most notably, for every atom of the protein, a macromolecule with a complex spatial structure, it determines the electrostatic potential from the partial charge value characteristic of the monomer molecules. After translation and folding, these values will most likely be only a rough approximation, thus the potential calculation will not give correct results near the atomic nucleus either. Hence, the calculation can be interpreted correctly only far from the atomic nucleus, mainly at the protein-water border, where the dielectric constants of the two media diverge sharply. Because of this, in order to approximate the electrostatic potential field, it is necessary to take the effect of the physiological solution surrounding the protein into consideration and implement these effects into equation 1.

[0275] Two characteristics of the salt solution surrounding the protein also have an effect on the electrostatic potential field. On the one hand, the dielectric constants (polarizability) of the protein and water differ significantly; on the other hand, the salts dissolved in water contain mobile charge carriers, which have a significant effect (solvent screening) on the potential field surrounding the protein. To take these effects into consideration, the Poisson equation, equation 2, transforms into the Poisson-Boltzmann equation,
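
The idea of fitting atomic charges to an ESP grid under a penalty can be sketched as a restrained least-squares problem. The example below is an illustration only and makes several assumptions: it uses a simple harmonic penalty towards zero charge rather than the hyperbolic restraint of the actual RESP procedure, it enforces the total molecular charge with a Lagrange multiplier, and it works in units where the potential of a charge q at distance r is q/r. The function name `fit_restrained_charges` and its parameters are hypothetical.

```python
import numpy as np


def fit_restrained_charges(grid_xyz, esp, atom_xyz, total_charge=0.0, restraint=0.0):
    """Least-squares fit of atomic charges to ESP values with a harmonic penalty.

    Minimises |A q - esp|^2 + restraint * |q|^2 subject to sum(q) = total_charge,
    where A[k, i] = 1 / |grid point k - atom i|.
    """
    grid_xyz = np.asarray(grid_xyz, dtype=float)
    atom_xyz = np.asarray(atom_xyz, dtype=float)
    esp = np.asarray(esp, dtype=float)
    # Design matrix: potential at grid point k per unit charge on atom i.
    A = 1.0 / np.linalg.norm(grid_xyz[:, None, :] - atom_xyz[None, :, :], axis=-1)
    n = atom_xyz.shape[0]
    # Linear system from the stationarity conditions plus the charge constraint.
    kkt = np.zeros((n + 1, n + 1))
    kkt[:n, :n] = A.T @ A + restraint * np.eye(n)
    kkt[:n, n] = 1.0
    kkt[n, :n] = 1.0
    rhs = np.concatenate([A.T @ esp, [total_charge]])
    return np.linalg.solve(kkt, rhs)[:n]
```

With `restraint=0.0` the function performs a plain constrained ESP fit; increasing `restraint` pulls the charges towards zero at the expense of a slightly worse fit, which mirrors the trade-off described above.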

[0276] (Equation 3) where e is the electron charge, ε(r) is the local relative dielectric constant, c₀ is the average ion concentration of the solution, kB is the Boltzmann constant and T is the absolute temperature. Equation 3 is a partial differential equation that can be solved for a known charge distribution with a method that involves approximation. For large protein molecules and their environment, the solution can be calculated on a grid with linear approximation, taking the atomic behavior into account with an average microscopic model; this can be achieved, for example, with a program implemented in the DelPhi software package (Li et al., 2012). DelPhi uses the method of finite differences: it places the charges assigned to the atoms at the centers of the atoms, then distributes them by linear interpolation onto the surrounding eight grid points. It describes the environment of the protein with a location-dependent dielectric function (ε = 2 near the protein, ε = 80 in the salt solution), which can vary between grid points. To determine the ion concentration, it artificially separates the protein and its surroundings by defining an exclusion zone and takes the effect of salt into account outside the exclusion zone. It determines the border of the exclusion zone from the van der Waals radii of the atoms that make up the protein molecule and the constant radius of the salt ions. It substitutes the differential operators in equation 3 with finite differences. Since the solution of equation 3 requires boundary conditions, during the calculations it assumes an exponentially decaying field to take the shielding effect of the salt ions into account.

[0277] At the present stage of technology, more reliable results can generally be achieved by experimental methods for protein structure analysis.

[0278] Single crystal X-ray diffraction (XRD) is the method most often used to determine the spatial atomic positions. Here, a crystal structure is produced from a large number of proteins with the help of certain additives. There are protein molecules in every unit cell of the crystal lattice. Based on the intensity distribution of the diffraction image of the crystal structure, it is possible to determine the atomic positions and their occupancy. Essentially, this provides the opportunity for three-dimensional mapping of the electron cloud of the protein molecule in the lattice with atomic precision. As an additional measurement for certain smaller structural elements, nuclear magnetic resonance (NMR) can be used, and the structural information gathered from multiple sources is docked into one three-dimensional protein structure. The disadvantage of the XRD method is that the creation of a lattice containing many proteins is needed for the mapping and this likely changes the native structure of the proteins. Thus, the arrangement of atomic positions, although similar to that of the native protein, is not equivalent to it, and drawing further conclusions from it is only possible with limitations.
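
The step of distributing an atomic charge onto the eight surrounding grid points by linear interpolation can be sketched as follows. This is not DelPhi code; it is a minimal illustration that assumes a cubic grid with its origin at (0, 0, 0) and a uniform spacing, and the function name `spread_charge_trilinear` is hypothetical.

```python
import numpy as np


def spread_charge_trilinear(grid_shape, spacing, atom_xyz, atom_charges):
    """Distribute point charges onto the eight nearest grid nodes (trilinear weights)."""
    grid = np.zeros(grid_shape, dtype=float)
    shape = np.array(grid_shape)
    for xyz, q in zip(np.asarray(atom_xyz, dtype=float), atom_charges):
        g = xyz / spacing               # position in grid-index units
        i0 = np.floor(g).astype(int)    # lower corner of the enclosing cell
        if np.any(i0 < 0) or np.any(i0 + 1 >= shape):
            raise ValueError("atom lies outside the grid")
        f = g - i0                      # fractional offset inside the cell
        for dx in (0, 1):
            for dy in (0, 1):
                for dz in (0, 1):
                    # Weight is the product of the one-dimensional linear weights.
                    w = ((f[0] if dx else 1.0 - f[0])
                         * (f[1] if dy else 1.0 - f[1])
                         * (f[2] if dz else 1.0 - f[2]))
                    grid[i0[0] + dx, i0[1] + dy, i0[2] + dz] += q * w
    return grid
```

Because the eight weights always sum to one, the total charge on the grid equals the sum of the atomic charges, and an atom sitting exactly on a grid node contributes to that node only.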

[0279] With the significant development of transmission electron microscopy techniques, the Cryo-EM method also became involved in structure determination, which is capable of providing structural information about the native shape of the protein under conditions similar to the aqueous medium of the proteins.

[0280] Experimental Cryo-EM measurements aim to construct a 3D electrostatic potential (ESP) distribution into which the structural atomic model can be fitted, resulting in a final structure. Since the ESP is a relative value within each individual Cryo-EM measurement, it is hardly ever used directly, and no method is known to date that targets the utilization of the direct amplitude values of an experimental ESP map. However, in the electrostatic representation of such a macromolecule it is a crucial element, since its physical information correlates strongly with biological characteristics. Owing to these correlations, several in silico models can be built to quantify qualitative features of the protein molecule, which has the potential to enable critical breakthroughs in the field of predictive drug design.

[0281] The structural information determined by Cryo-EM usually gives a lower resolution image of the atomic groups. However, the differences in resolution between Cryo-EM and XRD methods have been significantly reduced in recent years due to developments in sample preparation techniques and in electron microscopy, and therefore within a few years it is expected that Cryo-EM will completely replace XRD methods in protein structure research.

[0282] Below, a brief summary of electron microscopic imaging and the basics of Cryo-EM, together with its comparison with XRD, is given. Cryo-EM (cryogenic electron microscopy) is a technique capable of atomic scale imaging of biological samples. The sample is placed in an aqueous solution on a transmission electron microscope grid and cooled to cryogenic temperatures in liquid ethane or a mixture of liquid ethane and propane. The cooling preserves the structure of the biological samples and they become embedded in vitreous ice. During Cryo-EM imaging, the sample is imaged with phase contrast imaging, then a three-dimensional potential map is produced by evaluating a large number of recordings (usually several copies of the same molecule in different orientations).

[0283] The method of phase contrast imaging in a transmission electron microscope is key. When a beam of electrons penetrates a low density, thin, so-called phase object, there is no significant change in the scattering of the electron, therefore the intensity of the electron wave passing through the object is essentially unchanged. However, because of the potential V(x, y, z) of the phase object, the wavelength of the electron (λ) changes:

[0284] (Equation 4)

[0285] Here, h is the Planck constant, and m, e and E are the mass of the electron, the charge of the electron and the kinetic energy of the electron, respectively. Because of the change in wavelength, a phase shift occurs in the electron wave passing through the phase object, which is determined by the sum (integral) of the phase changes taken in the direction of travel of the electrons:

[0286] (Equation 5)

[0287] Therefore, the amplitude of the plane wave of the electron changes as a function of the projected potential of the sample with the following multiplication factor, which, if a small value, means that the amplitude of the wave changes proportionally to the projected potential (weak phase-object approximation):

[0288] (Equation 6)

[0289] This change in the amplitude is imaginary. Since the “contrast transfer function” (CTF) includes imaginary terms during electron microscopic imaging, the imaginary amplitude change becomes visible in the intensity of the wave (Williams & Carter, 2008). With the combination of appropriate defocus and the spherical aberration of the transmission electron microscope, the imaging results in an almost constant CTF; this is called the Scherzer defocus, and in this case transmission electron microscope imaging is ideal for weak phase objects, such as proteins. The low signal-to-noise ratio and radiation sensitivity of biological samples usually limit the quality of Cryo-EM images. The signal can be significantly increased by raising the intensity of the electron beam, which, however, changes the structure of the sample. Therefore, in the case of imaging biological molecules, only one image is usually taken with a low electron beam intensity. A (3D) potential map of the molecule containing adequate information can be obtained by the subsequent reconstruction of images taken on copies of the same molecule to compensate for the low signal-to-noise ratio. The sorting of 2D recordings, the creation of the 3D potential map and its refinement are done using methods based on maximum likelihood.
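
For orientation, the relations referred to as Equations 4 to 6 are commonly written in the following non-relativistic textbook form. This is a sketch under stated assumptions rather than a reproduction of the original equations: the interaction constant σ, the projected potential V_t and the vacuum wavelength λ₀ are notations introduced here, and E is taken to be the kinetic energy of the electron as defined above.

```latex
% Electron wavelength inside the potential V(x, y, z) (assumed form of Equation 4):
\lambda = \frac{h}{\sqrt{2m\,\bigl(E + eV(x,y,z)\bigr)}}

% Phase shift accumulated along the beam direction z, to first order in eV/E,
% with \lambda_0 = h/\sqrt{2mE} the wavelength outside the sample
% (assumed form of Equation 5):
\varphi(x,y) = \sigma \int V(x,y,z)\,dz = \sigma V_t(x,y),
\qquad \sigma = \frac{\pi e}{\lambda_0 E}

% Weak phase-object approximation of the multiplication factor
% (assumed form of Equation 6):
e^{i\sigma V_t(x,y)} \approx 1 + i\,\sigma V_t(x,y)
```

The first-order term in the last line is purely imaginary and proportional to the projected potential, which is the property used in the discussion of the contrast transfer function.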

[0290] Interpretation of 3D electrostatic potential maps should be done in the following way. The atomic charge density can be separated into a nucleus and an electron cloud:

[0291] $$\rho(\mathbf{r}) = \rho_n(\mathbf{r}) + \rho_e(\mathbf{r}) = Z\,\delta(\mathbf{r}) + \rho_e(\mathbf{r}),$$

[0292] (Equation 7) where Z is the charge of the nucleus (the charge density of the nucleus is concentrated in one point and is therefore represented by a Dirac delta function, δ(r)), and ρn(r) and ρe(r) are the charge densities of the nucleus and the electron cloud, respectively. The electrostatic Coulomb potential can be determined from the charge density using equation 1. Conversely, the charge density can be determined from the electrostatic potential based on the Poisson equation, equation 2.

[0293] How to compare Cryo-EM and XRD methods? As mentioned above, in Cryo-EM the electrons sense the potential V(r) of the entire electrostatic field, while in the case of XRD, X-rays interact with electrons. Since the charge of the atomic nucleus is known in both cases, with the help of relations described by the previous equations in principle, both the electron distribution and the potential field are suitable for characterizing the charge rearrangements due to the bonds of different strengths created in the protein. However, this important structural information is usually only approximately available for both methods, in XRD due to the crystal structure that only approximates the native structure, and in Cryo-EM due to the location-dependent resolution that currently characterizes the method.

[0294] In their initial experiments the present inventors used experimental Cryo-EM measurements, providing a construct of a 3D electrostatic potential (ESP) distribution from the protein structure map, in which the structural atomic model can be fit. In this effort to provide an electrostatic representation of a macromolecule, it appeared to be important to handle the physical information correctly in order to describe how it represents biological characteristics. Due to these correlations in silico models were built to quantify features of the protein molecule in a form which may be sensitive to functional changes.

[0295] While the ESP is a relative value inside each individual Cryo-EM measurement, it is apparently difficult to be used directly. There is no known method besides this invention, which could appropriately handle this problem.

[0296] Since the ESP is not a local quantity, the inventors have calculated the charge density therefrom on a discrete grid. The main reason of this transformation is to obtain a locally more representative information, in other words to localize the Cryo-EM data using a charge distribution map. They have found that the distribution becomes narrower and more localized after the transformation.
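
A transformation of this kind can be sketched by applying the Poisson equation (Equation 2) to the potential map with a finite-difference Laplacian. The sketch below assumes a seven-point stencil on a uniformly spaced grid; the source does not state which discretization was used. Because experimental ESP values are relative, the permittivity defaults to 1.0 here, so the result is a charge density on a relative scale. The function name `charge_density_from_potential` is hypothetical.

```python
import numpy as np


def charge_density_from_potential(potential, spacing, eps0=1.0):
    """Charge density rho = -eps0 * laplacian(V) on the interior of a 3D grid.

    Boundary voxels are left at zero because the seven-point stencil needs
    both neighbours along every axis.
    """
    v = np.asarray(potential, dtype=float)
    laplacian = np.zeros_like(v)
    centre = v[1:-1, 1:-1, 1:-1]
    laplacian[1:-1, 1:-1, 1:-1] = (
        v[2:, 1:-1, 1:-1] + v[:-2, 1:-1, 1:-1]
        + v[1:-1, 2:, 1:-1] + v[1:-1, :-2, 1:-1]
        + v[1:-1, 1:-1, 2:] + v[1:-1, 1:-1, :-2]
        - 6.0 * centre
    ) / spacing ** 2
    return -eps0 * laplacian
```

The Laplacian responds only to local curvature of the map, which is consistent with the observation that the distribution becomes narrower and more localized after the transformation.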

[0297] In this effort it proved to be highly advantageous to consider the exact structural geometric data (coordinates) of the atoms or of other submolecular local entities in which a point can be defined, and to assign the electrostatic quantity to that point.

[0298] In this context, once such an assignment has been made, the term electrostatic parameter is used.

[0299] In a very specific example a charge-related electrostatic parameter called inverse charge density (ICD) was created. Here the values of the charge distribution map were taken at the site of the atomic nucleus. This value can be obtained by interpolating the charge quantities given in voxels (tiny volume elements whose size depends on the resolution of the measurement) to obtain an approximate, usually positive, charge value at this site, whereas a second value, a local negative minimum, was sought within a pre-defined distance from the nucleus. Typically, the surrounding electron cloud makes the space around the nucleus more negative, while the nucleus results in a high positive charge locally. The sum of these two values tends to carry a high noise ratio and is less sensitive for the determination or measurement of the partial charges. However, it has been found that a more preferred and unexpectedly effective option is to calculate the difference of the two values, which is a positive value proportional to the broadening / volume of the electron cloud, while its connection to the charge distribution map is maintained. This difference value is called the ICD.
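
The ICD calculation described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the map origin is taken at voxel (0, 0, 0) with axes aligned to the coordinate system, the value at the nucleus is obtained by trilinear interpolation, and the local minimum is searched among all voxel centers within `search_radius` of the nucleus (which should be at least one voxel size). The function names and these choices are hypothetical and are not taken from the inventors' implementation.

```python
import numpy as np


def trilinear(grid, point):
    """Trilinear interpolation of `grid` at a fractional voxel coordinate."""
    p = np.asarray(point, dtype=float)
    i0 = np.clip(np.floor(p).astype(int), 0, np.array(grid.shape) - 2)
    f = p - i0
    value = 0.0
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                weight = ((f[0] if dx else 1.0 - f[0])
                          * (f[1] if dy else 1.0 - f[1])
                          * (f[2] if dz else 1.0 - f[2]))
                value += weight * grid[i0[0] + dx, i0[1] + dy, i0[2] + dz]
    return value


def inverse_charge_density(charge_map, nucleus_xyz, voxel_size, search_radius):
    """ICD = charge value at the nucleus minus the minimum value nearby."""
    charge_map = np.asarray(charge_map, dtype=float)
    center = np.asarray(nucleus_xyz, dtype=float) / voxel_size
    at_nucleus = trilinear(charge_map, center)
    # Distance of every voxel center from the nucleus, in length units.
    voxel_index = np.indices(charge_map.shape).reshape(3, -1).T
    distance = np.linalg.norm((voxel_index - center) * voxel_size, axis=1)
    nearby = charge_map.reshape(-1)[distance <= search_radius]
    return at_nucleus - nearby.min()
```

Because the minimum is subtracted from the value at the nucleus, the result is non-negative whenever the nucleus value is not below its surroundings, in line with the statement that the difference is a positive value.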

[0300] The ICD value was found surprisingly preferred for multiple reasons. Being assigned to atoms (or other submolecular local entities), this value proved to be more specific (both local and having a higher absolute value) than the representative of the partial charge; moreover, as it turned out in later experiments, it proved to correctly reflect atomic properties.

[0301] Alternatives to the ICD value applied by the inventors are easily created by a person skilled in the art. It is plausible, based on the teaching herein, that if electrostatic data (including charges, electrostatic potential or even electron density) at a specific location (e.g. on an atom or at a spatial site defined by any other submolecular local entity) are used and combined with another value of the same quantity at an appropriate site within the surrounding area of the same atom (e.g. within a spherical shell, the outer sphere of which is at 9, 6, 5 or 4, preferably 3 Angstroms, and which is e.g. 3, 2, 1, 0.5, 0.2 or 0.1 Angstrom thick, preferably 1, 0.5 or 0.2, highly preferably 0.5 Angstrom), such locally assigned electrostatic-type parameters may be useful to create descriptors which duly reflect biochemical properties of the macromolecule.

[0302] However, the skilled person will be aware and it is proven in the examples for the invention that other electrostatic-type parameters (or electrostatically derived parameters) are also useful. Preferably such parameters may comprise various electrostatic data, including charges or electrostatic potential or a combination thereof.

[0303] These electrostatic data-derived parameters are derived from the electrostatic quantities of the biological macromolecule, each electrostatic data-derived parameter being assigned to a single submolecular local entity, such as an atom or another specific entity like a side chain, e.g. an aromatic ring (or the center thereof), or a bond, like an amide bond, etc.

[0304] Depending on the biochemical feature to be assessed, an appropriate selection of atoms or, more correctly, submolecular local entities is made, the selected subset of entities or atoms being typical of said biochemical feature. For example, when a secondary structure is assessed it is reasonable to select the main chain on which said secondary structure element is examined.

[0305] If a part of the protein is clearly more relevant than the others, the electrostatic quantity map or the electrostatic parameter map can be cropped, i.e. only a part thereof is considered. In fact, taking an oligopeptide or a functional portion like the active site, a domain like a transmembrane domain, or a binding site amounts to such cropping.

[0306] When the state of a protein is to be assessed it is reasonable to select a structurally relevant part of the protein, like a domain, or the whole protein. In certain embodiments certain atoms (or submolecular local entities) are left out from the analysis. Methods are provided herein below to discard such atoms or entities which affect the biochemical features with a lower weight and can therefore be left out from the set of electrostatic data-derived parameters. This selection or trimming is a useful tool to simplify the model calculations and thereby adapt the inventive in silico representations to the biochemical feature, i.e. the related problem to be solved.

[0307] The appropriate selection of submolecular local entities, e.g. atoms, and the set of electrostatic data-derived parameters assigned thereto form a descriptor related to the biochemical feature, to which a biochemical feature value can be assigned.

[0308] According to the invention the electrostatic data-derived parameters of a descriptor can be considered as independent or predictor variables in a mathematical model in which values relating to the biochemical feature are the dependent or target variables. By applying the model to a multitude of macromolecules, these target variables, i.e. the biochemical feature values, can be estimated by scores obtained for the descriptors. A multitude of scores estimating the target variables forms a distribution, or a range or interval derived therefrom. The range can be open on one side, corresponding to a threshold level, or can lie between two values, each of which may be characterized statistically or may be assigned a given specific value. Preferably a set of reference scores is obtained on multiple reference macromolecules like proteins. In an embodiment different proteins are included in the reference protein set to arrive at a model which is largely independent of the specificities of a single protein type.

[0309] All these minor adaptations of the model are within the skills of a person skilled in the art of statistics.

[0310] Once a macromolecule with an unknown biochemical feature is studied the same model is applied on the descriptor or multitude of descriptors set up on that macromolecule of interest and a related score is obtained. Such score is compared to the reference scores and thereby, in the knowledge of the target variables related to the biochemical features, the biochemical feature of the macromolecule of interest can be assessed.

[0311] Just to give an example, if the score of interest falls within the range of the reference scores typical of a specific state of a given protein, e.g. the up state of a SARS-CoV-2 spike protein or the active state of an NMDA receptor, it can be assessed that the protein of interest is in that specific state. In certain cases such decisions can be made only with a certain probability. For example, if the two protein states are characterized by two overlapping ranges or distributions of the reference score values, then the score of interest is to be compared by statistical methods. Again, such statistical methods are well within the skills of a person skilled in the art.
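
The comparison of a score of interest with reference scores can be illustrated by the following sketch. It is an example under stated assumptions only: the reference scores of each state are modeled as a normal distribution, the interval is taken as the mean plus or minus `k` standard deviations, and the states are given equal prior probability. The description leaves the choice of statistical method to the skilled person, so none of these choices is prescribed by the invention, and all names are hypothetical.

```python
import math
from statistics import mean, stdev


def reference_interval(scores, k=2.0):
    """Interval derived from reference scores: mean +/- k standard deviations."""
    m, s = mean(scores), stdev(scores)
    return (m - k * s, m + k * s)


def gaussian_pdf(x, m, s):
    """Density of a normal distribution with mean m and standard deviation s."""
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))


def assess_state(score, reference_scores_by_state):
    """Return the most probable state and the probability of each state.

    Assumes Gaussian reference score distributions and equal priors.
    """
    likelihood = {
        state: gaussian_pdf(score, mean(values), stdev(values))
        for state, values in reference_scores_by_state.items()
    }
    total = sum(likelihood.values())
    probability = {state: value / total for state, value in likelihood.items()}
    best = max(probability, key=probability.get)
    return best, probability
```

With two well separated reference sets, a score of interest lying inside one interval is assigned to that state with a probability close to one; with overlapping distributions the returned probabilities quantify the uncertainty of the decision.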

[0312] In a preferred embodiment the inventors created several processes to improve the applicability of electrostatic parameters like the ICD.

[0313] Certain embodiments allow the global comparison of these parameters independent of the specific Cryo-EM measurement and avoid the scaling problem and the local resolution dependency.

[0314] Specifically, one of the characteristics of the Cryo-EM imaging method is the inhomogeneity of the reconstruction quality of the ESP distribution. The reason for the change in resolution from place to place is that the groups created in the classification step included in the evaluation process come from different numbers of 2D images. The reason for the quantitative differences is the structural variability of the flexible protein parts. During cryogenic fixation, several characteristic structural forms can be found on such parts, therefore these identical segments will be found in different classes at the same time. Several algorithms and software tools are available in the literature for calculating the local resolution. As an example, the present inventors have used the MonoRes resolution calculation method (Vilas et al., 2018).

[0315] The present inventors have elaborated a correction as well as a normalization method to provide a processed electrostatic-type parameter that makes, for example, sets of electrostatic patterns comparable among proteins or among various measurements. The patterns obtained with these transformed parameters are generally comparable and have an improved potential to add information to the existing biomolecular knowledge. Such transformed parameters are particularly useful when e.g. different states of a protein or of a part thereof (like domains) are compared.

[0316] As an example, several calculations will be shown in the Examples to show regularities of the electrostatic parameters found in the primary structure, as well as changes of electrostatic patterns dependent on folding and in protein-substrate and protein-protein interactions.

[0317] These methods provided ample support to the inventive finding that the electrostatic representation of a protein molecule can be created with generally comparable sets of parameters.

[0318] In certain embodiments, these parameter sets can be imagined as unique, fingerprint like patterns - just like the presentation of the ICDcorrected values of alpha carbon (or other) atoms alongside the peptide chain.

[0319] In this embodiment, for example, the ICDcorrected values on the appropriately selected part of the peptide chain, together with the atoms they are related to, form the descriptor of the invention. In this embodiment the ICD values are considered as predictor (independent) variables, whereas a mathematical model is set up to calculate variables which are defined to be related to or to define the biochemical feature in question. This may simply be the fingerprint-like pattern (without any further knowledge) defining a given state (like the folded state) of the protein. If a target variable is added to indicate this folded state, any other significantly different fingerprint can indicate a non-folded or not sufficiently folded state, including (but not necessarily limited to) unfolded or disordered states.

[0320] The target variable can be considered as a qualitative variable to which, however, a value can be assigned as a target variable (called biochemical feature value herein), and by appropriate means it can be differentiated from every other state.

[0321] Alternatively two states can be differentiated like folded and unfolded or active and inactive states with two target values.

[0322] The target variable can be estimated by a score using the model on the predictor variables.

[0323] A set of reference scores can then be applied to assess the biochemical feature of a protein of interest.

[0324] In a further embodiment, the descriptors obtained by transforming data from the Cryo-EM measurements provide information on the native electrostatic states of a protein. In the prior art these data were typically used only to obtain a three-dimensional map of atomic coordinates. Information on these electrostatic parameters can be further processed by the method of the invention, by creating descriptors or a set thereof as described herein, to obtain a representation of the protein molecule having the potential for, or being useful for, further biological or medical uses. This creates a new perspective in drug design and research as well.

[0325] In a preferred method it has been shown that the processing steps can result in generally comparable parameters.

[0326] The appropriately processed electrostatic-type parameters (i.e. parameter values carrying information on physical electrostatic quantities of the protein molecule) are further used to create descriptors of the protein or a part thereof.

[0327] In a simple exemplary embodiment such parameters can be used to form unique patterns along protein chains or their arbitrarily or, preferably, appropriately selected small subunits. It has been shown that even after several tens of covalent bonds, remarkable correlation can be observed in the extrapolation calculation. These patterns are not random, and their formation can be quantified. These quantified patterns have made it clear that macromolecular systems like proteins are electrostatically determined. Thus, surprisingly, this electrostatics-related model of the protein provides a more sensitive detector of changes and / or differences in protein structure and function than 3D structural information in itself.

[0328] From the examples provided below it is clear that the electrostatic pattern of the protein is not randomly formed but is based on a strict and strong relationship between the atoms which can extend over several covalent bonds. These relations depend on the type of the amino acids forming the peptide chain, thus the observed pattern also depends on the primary structure.

[0329] With this information in silico models can be built for estimating the effect of point mutations, which is a largely unclear field in biochemistry. The inventors are unaware of any research projects in advanced stages that deal with the electrostatic change induced by point mutations in a protein molecule, which substantiates the novelty of the present invention.

[0330] In a further embodiment, these electrostatic patterns, defining the differences between protein states, can be used to calculate scores which are related to and are sensitive to various states of protein, and by comparing such score for proteins of interest with a multitude of reference scores (e.g. in the form of a distribution) established on a set of reference proteins, or with an interval derived therefrom, the state of the protein of interest can be described or foretold.

[0331] A protein can be in different states for several reasons. The different states may be due to its normal mechanism. Alternatively, it may be that the protein has a normal and a pathological state. The states may be converted into each other e.g. due to a regulatory effect, e.g. by a regulator substrate acting as an activator or an inhibitor, or by a “toxic” material, etc. Such states may differ due to protein regulation, mutations, substrate binding etc. In such cases forming a difference between the electrostatic representations of the protein in the multiple states is a preferred method.

[0332] As an example, in an ANOVA test the SARS-CoV-2 spike down and up states can be differentiated by the electrostatic patterns. The ICDnormal values are, in the presented model, preferred for this purpose. In order to show that active and inactive protein states behave differently and have different patterns, the Dynamic Time Warping (DTW) method, a widely used data analysis algorithm useful for finding trend differences between data series, has been applied (“Dynamic Time Warping,” 2007).

[0333] The input parameters or predictor variables for the DTW method are the electrostatic data-derived parameters, like ICDnormal data series, calculated on the atoms of the peptide chain of a protein chain of interest. In case the biological feature is whether the protein is in one state or the other, e.g. a biological feature value of 1 (first state) or 0 (second state) can be set up as target variables.

[0334] By using such difference-type models, in a highly preferred example the calculated difference value (obtained e.g. by the DTW method) can be normalized in order to allow direct comparison, and the results are scaled into an interval, e.g. an interval of 0-1. These specific Bernoulli variables of 0 and 1 are target variables (i.e. biochemical feature values), and the model thereby provides a score value to characterize or estimate the biochemical feature value for the state of the protein.
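
A difference-type model of this kind can be sketched as follows. The code shows the classic dynamic-programming form of DTW with an absolute-difference local cost, followed by min-max scaling of a set of distances into the interval 0-1. Both the local cost and the min-max scaling are assumptions made for illustration; the description does not specify which DTW variant or which normalization the inventors used, and the function names are hypothetical.

```python
import numpy as np


def dtw_distance(series_a, series_b):
    """Dynamic Time Warping distance between two 1D data series."""
    a = np.asarray(series_a, dtype=float)
    b = np.asarray(series_b, dtype=float)
    n, m = len(a), len(b)
    # cost[i, j] is the cheapest alignment of a[:i] with b[:j].
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i, j] = local + min(cost[i - 1, j],      # step in a only
                                     cost[i, j - 1],      # step in b only
                                     cost[i - 1, j - 1])  # step in both
    return float(cost[n, m])


def scale_to_unit_interval(distances):
    """Min-max scale a set of DTW distances into the interval 0-1."""
    d = np.asarray(distances, dtype=float)
    span = d.max() - d.min()
    if span == 0:
        return np.zeros_like(d)
    return (d - d.min()) / span
```

In use, each series would be the ICDnormal values along the peptide chain of one measured state, and the scaled distance to a reference state would serve as the score estimating the 0 or 1 target variable.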

[0335] This method provides, among others, a way of data analysis for ligand screening by measuring in vitro protein states and by quantifying the similarity between different electrostatic states, e.g. set by different modifier substrates, and the reference native state. This would be beneficial as a prospective new in silico screening method in drug design.

[0336] In another embodiment the method of the invention is useful for describing and testing secondary structure elements, e.g. in an example the termination of alpha helix structures.

[0337] In experiments described in the Examples, the inventors show results which suggest that the helix termination is regulated one amino acid before the actual termination point. In the examples here MLR is applied as a model on a descriptor, wherein the descriptor comprises a set of atoms of a tripeptide, i.e. atoms of the amino acid studied as a potential helix breaking amino acid (in the examples below Alanine and Asparagine) as the central amino acid, as well as atoms of the previous and following amino acids. The tripeptide is moved along the potential helix breaking amino acids of the protein chain. It may be sufficient if the main chain atoms and possibly a few other atoms (e.g. the beta carbons of the neighboring amino acids, if present) are used.

[0338] The electrostatic parameters (like ICD or other) and the data of the atoms form the descriptor. The electrostatic parameters are used as predictor (independent) variables in the model and the target variables are the values assigned to the biochemical feature (helix termination). The model is applied, using the weights estimated by minimizing the error and thereby estimates of the target variables are obtained as scores. As a multitude of proteins are used to provide a reference, two groups of scores are obtained, one for those amino acids which are part of a helix and another for those which terminate the helix.
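
The regression step described above can be sketched as follows, assuming that MLR denotes multiple linear regression fitted by ordinary least squares with an intercept term. The descriptor layout (one row per tripeptide, built from the per-atom parameter values of the previous, central and following residues) and all function names are illustrative assumptions rather than the inventors' code.

```python
import numpy as np


def tripeptide_descriptor(parameters_per_residue, center):
    """Concatenate the parameter values of residues center-1, center, center+1.

    `parameters_per_residue` has one row per residue and one column per
    selected atom (e.g. main chain atoms); this layout is an assumption.
    """
    rows = np.asarray(parameters_per_residue, dtype=float)
    return np.concatenate([rows[center + k] for k in (-1, 0, 1)])


def fit_mlr(predictors, targets):
    """Least-squares weights (intercept first) minimizing the squared error."""
    x = np.asarray(predictors, dtype=float)
    design = np.column_stack([np.ones(len(x)), x])
    weights, *_ = np.linalg.lstsq(design, np.asarray(targets, dtype=float),
                                  rcond=None)
    return weights


def mlr_scores(predictors, weights):
    """Scores, i.e. estimates of the target variable, for each descriptor."""
    x = np.asarray(predictors, dtype=float)
    return np.column_stack([np.ones(len(x)), x]) @ weights
```

Fitting on descriptors of reference proteins with targets of 1 (helix terminating) and 0 (within a helix) yields two groups of reference scores, with which the score of a residue of interest can then be compared.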

[0339] Utilizing this information, we can theorize that one amino acid is not enough to terminate with certainty a regular structure, e.g. a helix, but that the connection of several amino acids is needed. However, the electrostatic environment helps and is appropriate to assess whether the given potential helix breaking amino acid will actually break the helix or not. Furthermore, if the rule of the regulation can be detected, target points may be identified at which an external regulation can act to modify the structure of a protein in a planned manner. This can be used in various misfolding diseases in which the protein malfunctions because of irregular alpha helix formation. Similarly to the previous model, we investigated the estimation of whether the secondary structure is Helix or Random coil on given peptide segments. The descriptors were filtered to a set that contains equal amounts of Helix and Random coil secondary structural forms of the central amino acid glycine. As with the previous models, the conclusion is the same: the ICDcorrected based descriptors do carry information about the structural forms.

[0340] Another embodiment relates to the estimation of whether the secondary structure is Sheet or Random coil on given peptide segments. The descriptors were filtered to a set that contains equal amounts of Sheet and Random coil secondary structural forms of the central amino acid leucine. As with the previous models, the conclusion is the same: the ICDcorrected based descriptors carry information about the structural forms.

[0341] The electrostatic representation of a protein is not only a diagnostic descriptor for the protein state but also carries information about the primary structure itself and is diagnostic to its secondary structure.

[0342] Several properties of the mechanism of protein folding are still unknown. Misfolding can be found in most disorders; however, there are no known algorithms that can predict such abnormalities with the help of in vitro experiments. Misfolded structures cannot yet be predicted in silico because it is not known why non-native structures appear. The physiological structure can be estimated in silico with artificial intelligence algorithms such as AlphaFold; however, the dependencies between the formed structure and the amino acid sequence are complex. By the present invention the termination of alpha helix structures becomes predictable, thus a design for protein structure editing can be established.

[0343] As explained herein, the electrostatic patterns carry the essential information on how the complex electrostatic state of a protein molecule is formed. Utilizing this information, it can be established that the patterns are unique to a native protein state and can be used as diagnostic features of relative differences between states. On further analysis of the patterns, differences can be detected between the effects of the different protein building amino acids on the pattern formation. Surprisingly, even a single mutation could result in a drastic change in the electrostatics of a protein, and the representation of a protein according to the invention is much more sensitive to these changes than the geometrically defined structure.

[0344] As a further example, the effect of π-π interactions on the electrostatic parameter pattern on the main chain has been shown by the present inventors.

[0345] The π-π interactions are significant interactions between delocalized electron clouds that are situated in a given orientation, often perpendicularly. Furthermore, several of these interactions can occur in the same place, which is called π stacking. The typical distance between the electron clouds is at most ~7 Angstroms, preferably smaller.

[0346] In order to calculate these stacking interactions we need to define a biochemical target variable that contains the structural information of the stacking. In the experiments this variable was defined with regard to Histidine amino acids, and the ‘position’ of the delocalized electron cloud is defined by the weighted average of the coordinates of the nitrogen atoms of the imidazole group and the gamma carbon atom of the amino acid, which defines a submolecular local entity in the target variable. In the calculation the number of neighboring delocalized electron clouds within a given cut-off distance was considered: if there is more than one neighbor selected from HIS, PHE, TYR, TRP and the guanidine group of ARG, then a stack of at least 3 amino acids was considered highly probable and the target variable was set to 1, otherwise to 0.
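
The definition of this target variable can be illustrated by the following sketch. The weights of the weighted average and the cut-off distance are not specified in this passage, so the sketch assumes equal weights by default and a cut-off of 7 Angstroms (taken from the typical maximum distance mentioned above); both, like the function names, are assumptions for illustration only.

```python
import numpy as np


def cloud_position(atom_coordinates, weights=None):
    """Position of a delocalized electron cloud as a weighted average.

    For histidine the inputs would be the imidazole nitrogen atoms and
    the gamma carbon; equal weights are assumed when none are given.
    """
    coords = np.asarray(atom_coordinates, dtype=float)
    return np.average(coords, axis=0, weights=weights)


def stacking_target(his_position, neighbor_positions, cutoff=7.0):
    """Return 1 if more than one neighboring cloud lies within the cut-off.

    `neighbor_positions` holds the cloud positions of the other HIS, PHE,
    TYR, TRP residues and ARG guanidine groups (excluding the histidine
    itself). More than one neighbor implies a probable stack of at least
    three residues.
    """
    neighbors = np.asarray(neighbor_positions, dtype=float).reshape(-1, 3)
    distance = np.linalg.norm(neighbors - np.asarray(his_position, dtype=float),
                              axis=1)
    return int(np.count_nonzero(distance <= cutoff) > 1)
```

The resulting 0 or 1 values would serve as the target variable of a regression model whose predictor variables are the electrostatic parameters of the descriptor atoms.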

[0347] The reference scores are the estimations of the target variable (describing the biochemical feature) calculated as the weighted sum of the ICDcorrected parameters (predictor variables). Here the weights are the regression coefficients. The reference scores form two intervals derivable from their two distributions. These intervals can be used to decide the biological feature value of a score of interest. In this experiment the predictor (i.e. independent) variables were the ICDcorrected values on the atoms of a descriptor.

[0348] This shows again the unexpected sensitivity of the electrostatic model of the protein to even minor changes, whereas the traditional structure-type models are usually more conservative and tolerate even bigger changes.

[0349] It has also been found that the weights resulting from the regression suggest that the change in the HIS side chains is large and individually significant enough to produce a good estimation of the tertiary structure in this regard. This corresponds to the structural knowledge that the π stacking interaction happens on the side chain. In our understanding of the results, this deviation in the side chain ICDcorrected values should influence the values of the main chain atoms, ending up in a domino effect on the whole main chain pattern. The reason for this is that the ICD pattern of the whole protein contains long-range relations that cannot be neglected. This example provides further evidence that the pattern formation means a new concept in drug design to find sensitive targets and screen protein states relative to a native electrostatic state.

[0350] In a further example the relationships between the genetic code, including silent mutations and the electrostatic environment of certain amino acids have been investigated.

[0351] In the field of biology, one of the greatest breakthroughs, alongside the discovery of the double helix structure, was the deciphering of the DNA code dictionary, which determines the sequence of amino acids that make up proteins. Three consecutive nucleotide bases coding for one amino acid are called a codon. The code table consists of 64 codons, of which three are stop codons (signaling the termination of translation), so 61 triplets code for the 20 amino acids. The code dictionary functions in this way for almost all living organisms on Earth, with few exceptions. For example, mitochondria use a slightly modified code table, and selenocysteine is coded by one of the stop codons (UGA). In mathematical terms, the genetic code is degenerate: most amino acids are coded by two to six codons.
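
The counts given above can be checked with a short sketch that builds the standard genetic code table in its DNA alphabet (T in place of U). The compact one-letter string follows the usual T, C, A, G ordering of the standard code table; the variable and function names are illustrative.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# One-letter amino acid codes for the 64 codons in TCAG order; '*' marks stop.
AMINO_ACIDS = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")

CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

STOP_CODONS = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")

# Number of synonymous codons per amino acid (degeneracy).
DEGENERACY = Counter(aa for aa in CODON_TABLE.values() if aa != "*")


def synonymous_codons(amino_acid):
    """All codons of the standard table that code for the given amino acid."""
    return sorted(c for c, aa in CODON_TABLE.items() if aa == amino_acid)
```

In the standard table only methionine and tryptophan have a single codon, while leucine, serine and arginine have six each; a silent mutation exchanges one member of such a synonymous set for another.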

[0352] Codon usage bias, the preferential use of certain synonymous codons, is a universal feature of all examined genomes, influencing folding, translational efficiency, accuracy, mRNA stability, and transcription. The reading of triplets occurs in the ribosomes. Codons that code for the same amino acid are called synonymous codons, and a mutation that exchanges one for another is called a synonymous mutation or silent mutation. Until the last few decades, little was known about their role. The change came with the development of analytical techniques that made it possible to study folding.

[0353] In their summary works, two research groups (Komar et al., 2024) (Lyu et al., 2020) described that codon usage regulates the speed of translational elongation, which does not result in a uniform ribosome decoding speed. The folding of most proteins, especially larger ones, is co-translational, meaning that folding begins in parallel with translation in the ribosome. Deviations from the general rhythm can result in protein misfolding. It is known that the cause of many diseases is protein misfolding. Specific proteins, such as Escherichia coli Chloramphenicol Acetyltransferase (CAT), have been shown to exhibit changes in translation kinetics and misfolded structures when rare codons (meaning low intracellular concentrations of the tRNA matching that specific codon) were introduced by silent mutations. This resulted in altered protein folding when the translation progressed either slower or faster than the optimal speed.

[0354] Numerous publications describe the detectability of single synonymous mutations in various diseases, such as MDR1 (multidrug resistance), CFTR (cystic fibrosis), and breast cancer caused by the Ras gene family (KRAS). Also there are publications that suggest synonymous mutation utilization in protein optimization (Kim et al., 2015) (Liu, 2020) (O’Brien et al., 2014).

[0355] Machine learning approaches have already suggested that protein structures may carry information about their genetic coding. Such work was published (Rosenberg et al., 2022) (Ackerman-Schraier et al., 2022) showing that the protein building amino acids might have different expected backbone structural form depending on the codon. In that work numerous E. coli proteins and their structural models determined by XRD were utilized in statistical comparisons. The final result shows that there could exist a dependence between codon identity and backbone torsion of the translated amino acid.

[0356] In an embodiment of the invention electrostatic patterns are used to define the differences between codons and assess the effect of silent mutations.

[0357] In this section, we demonstrate that these patterns enable the detection of significant differences in the electrostatic properties of the same amino acids, even when they originate from different coding triplets in DNA and RNA. Research in this field is still in its rudimentary stages, though biological evidence suggests that there may be real differences between proteins with the same primary structure but different genetic origins. For instance, the speed of translation depends on mRNA structure and intracellular tRNA concentrations, and protein folding can be influenced by the rate of translation. Furthermore, some genetic diseases, such as a rare form of cystic fibrosis, may arise from silent mutations.

[0358] In nature the genetic code is degenerate, meaning that multiple codons can encode the same chemical entity, i.e. the same amino acid. In order to analyze the electrostatic patterns and detect any significant effect we need to complete our datasets with the exact genetic information. Although this seems to be a simple task, there are uncertainties between the databanks that should be taken into consideration when presenting the results. Since the protein databanks do not contain any information about the genetic code, there could be some mismatches between the amino acid and exact codon pairs that cannot be filtered out without knowing the exact gene that was used in the protein synthesis, but the ratio of the mismatches is considered to be low.

[0359] The coding triplet type can be quantified by a target variable, estimated by a score value, which we named triplet joint in our calculations. This target variable is used as a Bernoulli variable for building MLR models in the ways shown before. However, as our investigation showed, the effect of the coding codon on the electrostatic state is a fine tuning and is not as localized as the parameters already presented. That is the reason we need to gather ICDcorrected atomic values from the global environment of an amino acid. However, dramatically increasing the number of parameters could end up in finding false patterns that are related not to the triplet state itself but to something else, meaning the produced fit might not include any meaningful information. Therefore, we included additional testing phases before the evaluation.

[0360] Quite surprisingly the inventors have found differences between the electrostatic state of the amino acids originated from different codons. Since the model performs well on the independent testing dataset as well, predictions can be made for the electrostatic properties of a global environment of an amino acid.

[0361] In summary, the previous paragraphs and our methodology with Cryo-EM records show that a generally comparable set of physical parameters can represent a protein molecule of interest. These parameters can be ordered and utilized as patterns defining electrostatic states. It was shown that there is a significant autocorrelation on the main chain, meaning that even one amino acid can influence the electrostatic state of an amino acid several covalent bonds later. This proved that the defined patterns are not noise. With the help of different models, we determined differences between biochemically different states using difference-type models exemplified by DTW, and significant differences between the electrostatic effects of different amino acids using regression-type models exemplified by MLR. This means that the electrostatic state is very sensitive to the state of a protein and also to the primary structure. Also, we could determine the causes of secondary structure changes in the form of alpha helix termination. The presented results showed that the structural change is induced by the electrostatic effects originating from the amino acids preceding a termination. Furthermore, the effect of certain tertiary structures on the electrostatic properties of the peptide chain was calculated, and if these structures cannot form because of mutations, the whole electrostatic and structural system of a protein could collapse. Lastly, we established models predicting significant differences between the global electrostatic states of an amino acid depending on the coding triplet from which the given amino acid originates. This means that although the chemical diversity encoded for an amino acid may seem to be degenerate, physically the codons represent different types of influence on the electrostatic state of a protein, meaning that a silent mutation can be considered a fine impulse on the effectiveness of a protein.

[0362] The electrostatic representation of a protein by descriptors that have the potential to describe fine biological function is available not only from the experimental Cryo-EM method but also from experimental “X-ray diffraction” (XRD) techniques. As opposed to Cryo-EM data, XRD methods result in experimental “electron density maps” (EDM), which contain no information about the charge of the nucleus but about the electrons only. There is another major difference between the two methods. The Cryo-EM technique works in a physiological solution flash-frozen by a cryogen, thus maintaining the native structure. Although the structure is considered to be native in XRD, there could be some deviations and errors that result in a misfolded protein, meaning that the results of further structural analysis should be closely monitored.

[0363] The determination of the electron density distribution of the protein crystal is the key step in structure reconstruction in XRD measurements. Electron density and electrostatic potential are non-linearly proportional parameters, except for the fact that Cryo-EM measurements also contain the effect of the positive nuclei. Since the effect of the nuclei can be considered a neutral constant background bias, it can be disregarded. The electron density as a parameter can be assigned to each unique atom from the different distributions coming from the experimental measurements, similarly to the previously presented methods.

[0364] The electron density data are arranged on a discrete grid and each value belongs to a given voxel in the grid assigned to said electron density value.

[0365] In the illustrative examples shown herein initially we used 125 XRD measurements all having cube-shaped voxels; this was for sake of convenience only.

[0366] Analyzing the atomic fo density values along the peptide chain of proteins, autocorrelation can be detected, similarly to the Cryo-EM values. This effect means that the formation of the definitive, fingerprint-like patterns is not random but is influenced over longer distances on the molecule. Since XRD is known to have atomic resolution and electron density values which are clearly unique to each type of atom - meaning e.g. these values are significantly different for an alpha and a carbonyl carbon atom - this result affirms the hypothesis formed with Cryo-EM data that the electrostatic patterns of a protein molecule are not random but have an information content and follow a strict and quantitative rule. The calculation is carried out by autocorrelation calculation, the same way as for the ICD value in case of Cryo-EM measurement.
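
As an illustration only, an autocorrelation calculation of this kind can be sketched as follows. The sketch assumes that the per-residue values (e.g. one value per alpha carbon atom) are available as a plain list ordered along the peptide chain; the function name and the choice of the standard sample autocorrelation estimator are assumptions made for the illustration and are not taken from the measurements described herein.

```python
from statistics import fmean


def autocorrelation(values, max_lag):
    """Sample autocorrelation of a sequence for lags 0..max_lag.

    `values` is assumed to be ordered along the peptide chain.
    """
    n = len(values)
    mean = fmean(values)
    centered = [v - mean for v in values]
    variance = sum(c * c for c in centered)
    if variance == 0:
        raise ValueError("a constant sequence has no defined autocorrelation")
    result = []
    for lag in range(max_lag + 1):
        # Products of values that are `lag` positions apart along the chain.
        cov = sum(centered[i] * centered[i + lag] for i in range(n - lag))
        result.append(cov / variance)
    return result
```

A slowly decaying result over increasing lags would correspond to the long-distance influence described above, whereas values near zero for all lags above 0 would be expected for noise.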

[0367] As presented in the examples, the estimations are meaningful and there is a long-range influence on the formation of the electrostatic characteristics, just like with Cryo-EM data. Even in examples wherein the amino acid side chain is not included in the descriptor, since it is unique for each type of protein-building amino acid, the effect of the individual type of protein-building amino acid can be estimated by filtering the data.

[0368] In another experiment, data were obtained not from measured datasets but from in silico modelling. For example in case of model calculations with the DelPhi method, no such differentiation of the carbonyl carbons can be observed, i.e. the FO values are not particularly high on these atoms.

[0369] In other words, the XRD data “know”, i.e. provide information about, the individual atoms, and when we wish to calculate an estimation on the carbonyl carbons, these atoms take part in the descriptor with an increased weight.

[0370] An example is also shown for the estimation of the helix terminus parameter with XRD. The terminus parameter is a binary variable (the target variable) depending on whether the given amino acid is the last of the given helix or not. The XRD data also proved to be useful to estimate helix termination. In a further set of embodiments, in silico calculations were used for determining patterns based on physical properties of proteins. In the examples supporting this embodiment, DelPhi calculation results were used and compared with Cryo-EM results wherein the descriptors were adapted, in fact simplified, to those applied based on the DelPhi calculation.
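
The binary helix terminus target variable mentioned above can be illustrated with a minimal sketch. It assumes that the secondary structure is available as a per-residue string in which helical residues carry the one-letter code "H"; this encoding and the function name are assumptions for the illustration only.

```python
def helix_terminus_labels(secondary_structure, helix_code="H"):
    """Return 1 for the last residue of each helix run, otherwise 0."""
    labels = []
    n = len(secondary_structure)
    for i, code in enumerate(secondary_structure):
        is_helix = code == helix_code
        # A residue terminates a helix if it is helical and its successor is not.
        next_is_helix = i + 1 < n and secondary_structure[i + 1] == helix_code
        labels.append(1 if is_helix and not next_is_helix else 0)
    return labels
```

For the string "CCHHHHCCHHC" the labels mark the sixth and the tenth residue as helix termini.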

[0371] EXAMPLES

[0372] II. 1.) Obtaining unique electrostatic parameters on the atomic level of a protein molecule of interest

[0373] Little is known about the correlations between the regularities of the primary structure of a given protein molecule and how the formation of its geometrical structure is determined. Proteins are crucial elements of biological pathways; however, the exact rules of protein-protein interactions remain unknown.

[0374] An important step to acquire further information on these questions is to analyze the protein as a defined set of physical parameters because it is more precise than only the geometrical and chemical descriptions. The precision is demonstrated in the examples.

[0375] In order to achieve this description, the Cryo-EM measurements need to be utilized from a different and atypical perspective. The ESP distributions contain the necessary physical information, but in order to obtain generally comparable parameters several technical problems must be solved.

[0376] Experimental Cryo-EM measurements aim to construct a 3D electrostatic potential (ESP) distribution into which the structural atomic model can be fitted, resulting in a final structure. The ESP is a relative value inside each individual Cryo-EM measurement, and it is hardly ever used directly. There is no known method yet which targets the utilization of the direct amplitude values of an experimental ESP map. However, in the electrostatic representation of such a macromolecule it is a crucial element, since its physical information correlates strongly with biological characteristics. Due to these correlations, several in silico models can be built to quantify qualitative features of the protein molecule, so that it has the potential to make critical breakthroughs in the field of predictive drug design.

[0377] II. 1.1) Calculation and processing of ICD parameters

[0378] The molecule presented here is a SARS-CoV-2 spike protein (Walls et al., 2020) in the inactive state (PDB ID: 6VXX). The PDB database contains the structural model of the spike protein, while the EMDB database (Turner & The wwPDB Consortium, 2023) contains the reconstructed experimental ESP map model. The map is coded on a discrete grid with a given number of voxels of a given size (usually voxels are cubes but there can be exceptions). The coordinates of the grid and the coordinates of the structural model can be matched to each other with a simple transformation:

[0379] $x_s = x_{esp} \cdot a_x, \quad y_s = y_{esp} \cdot a_y, \quad z_s = z_{esp} \cdot a_z$

[0380] (Equation 8)

[0381] Where x, y, and z refer to the coordinate values of a point (e.g. of an atomic nucleus or other selected point), s is the index of the structural model, esp is the index of the ESP map model and a is the voxel size.

[0382] The ESP map model comprises voxels, wherein a potential value is assigned to each voxel in the ESP map. By this method we can define where the point of interest (e.g. an atomic nucleus or other selected point) is located within a voxel. For example, by dividing the coordinates of the structural model (e.g. xs) by the voxel size along that coordinate (e.g. ax), we obtain the esp values defining where the point is within the voxel. Thus, the structural model and the ESP map model can be fitted (reconciled). However, the exact values of the amplitudes are not used further in the structure evaluation.
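
The coordinate matching of Equation 8 can be illustrated with the following minimal sketch. The function names are hypothetical, and the optional `origin` argument (an offset of the map origin relative to the model coordinate system) is an assumption added for the illustration; with the default origin of zero the sketch corresponds to the plain division by the voxel size described above.

```python
import math


def to_grid_coordinates(point, voxel_size, origin=(0.0, 0.0, 0.0)):
    """Convert model coordinates (Angstrom) to fractional grid coordinates.

    `voxel_size` holds the voxel edge lengths (a_x, a_y, a_z).
    `origin` is an assumed map offset; (0, 0, 0) reproduces Equation 8.
    """
    return tuple((p - o) / a for p, o, a in zip(point, origin, voxel_size))


def split_voxel_position(grid_coords):
    """Split fractional grid coordinates into voxel index and in-voxel offset."""
    indices = tuple(math.floor(g) for g in grid_coords)
    offsets = tuple(g - i for g, i in zip(grid_coords, indices))
    return indices, offsets
```

The integer part identifies the voxel containing the point, while the fractional part gives the position of the point within that voxel, which is what an interpolation step would need.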

[0383] A very typical characteristic of experimental ESP maps is that there are regions with low atomic amplitudes. These are the regions with worse local resolution parameters, i.e. regions of low resolution. The effect of this local resolution inhomogeneity is confounding when making global comparisons inside a single experiment, since it disturbs the scaling. Due to this, every parameter derived from the ESP map should be corrected in order to make global comparison possible. The scale is also relative and individual for an experiment or experimental group, and varies accordingly. This problem also needs to be fixed for every parameter derived from Cryo-EM experiments.

[0384] Since the ESP is not local due to long-range effects, it is advisable to calculate the charge density with a Laplace operator on a discrete grid, defined in the following equation.

[0385] (Equation 9)

[0386] Where x is the voxel length, q is the reconstructed charge, V is the reconstructed electrostatic potential, ε0 is the vacuum permittivity constant and the expression after the summation is the Laplacian operator on a discrete grid with additional constants. The gradient provides a measure of the derivative of the potential in the three spatial directions.

[0387] Equation 9 describes a Laplace operator to arrive at the divergence of the gradient. Here, the Laplace operator or, more precisely, a discrete representation thereof, is applied to a grid consisting of discrete elements. While further calculations based on the ESP map would also be possible, with this transformation charge distribution is calculated from an electrostatic (potential) distribution for practical reasons and to achieve better results.
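
A discrete Laplace operator of this kind can be sketched as follows. The sketch assumes the standard second-order, six-neighbour finite-difference stencil and the sign convention of Poisson's equation (charge proportional to the negative Laplacian of the potential); the physical constants are omitted, and the exact form of Equation 9 may differ. The function name and the nested-list layout `potential[x][y][z]` are assumptions for the illustration.

```python
def charge_map_from_potential(potential, voxel_length=1.0):
    """Approximate a charge map as the negative discrete Laplacian of V.

    `potential` is a 3D nested list indexed [x][y][z]. Boundary voxels are
    left at 0.0 because the stencil needs all six neighbours.
    Constants such as the vacuum permittivity are omitted (arbitrary scale).
    """
    nx, ny, nz = len(potential), len(potential[0]), len(potential[0][0])
    charge = [[[0.0] * nz for _ in range(ny)] for _ in range(nx)]
    h2 = voxel_length ** 2
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            for k in range(1, nz - 1):
                neighbours = (
                    potential[i - 1][j][k] + potential[i + 1][j][k]
                    + potential[i][j - 1][k] + potential[i][j + 1][k]
                    + potential[i][j][k - 1] + potential[i][j][k + 1]
                )
                laplacian = (neighbours - 6.0 * potential[i][j][k]) / h2
                charge[i][j][k] = -laplacian
    return charge
```

With this sign convention a local maximum of the potential, as expected at an atomic position, yields a positive charge value.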

[0388] The main reason for this transformation is to obtain locally more representative information from the measurements, thus its usage is favorable in data analysis. In an effort to further localize the Cryo-EM data, using a charge distribution map is a useful tool. To make it illustrative, the Gaussian curves composing, in an interpretation, the maps are “narrower” in case of the charge distribution. Nevertheless, it should be noted that electrostatic potential maps and charge distribution maps can be converted into each other. In the calculations the constant values can be omitted from the equation because of the arbitrary scale of each Cryo-EM measurement.

[0389] Both the ESP map and the charge map consist of discrete values on the grid. The question arises whether we can obtain an approximate potential or charge value for any point in the map. In fact, from both the ESP map and the charge map the values can be interpolated at any arbitrary position (point of the map) with 3D linear interpolation. The formula of the linear interpolation is presented in the following equation.

[0390] (Equation 10)

[0391] Where the value of function f in position xi is yi. In 3D, after 7 consecutive linear interpolations every spatial value can be calculated.
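
The seven consecutive linear interpolations can be sketched as follows: four along the first axis, two along the second and one along the third. The sketch assumes a nested-list map `grid[x][y][z]` and a point given in fractional grid coordinates lying strictly inside the grid; the function names are hypothetical.

```python
import math


def lerp(v0, v1, t):
    """Linear interpolation between v0 (t = 0) and v1 (t = 1)."""
    return v0 + (v1 - v0) * t


def trilinear(grid, x, y, z):
    """Interpolate `grid[x][y][z]` at fractional grid coordinates (x, y, z)."""
    i, j, k = math.floor(x), math.floor(y), math.floor(z)
    tx, ty, tz = x - i, y - j, z - k
    # Four interpolations along x ...
    c00 = lerp(grid[i][j][k], grid[i + 1][j][k], tx)
    c10 = lerp(grid[i][j + 1][k], grid[i + 1][j + 1][k], tx)
    c01 = lerp(grid[i][j][k + 1], grid[i + 1][j][k + 1], tx)
    c11 = lerp(grid[i][j + 1][k + 1], grid[i + 1][j + 1][k + 1], tx)
    # ... two along y ...
    c0 = lerp(c00, c10, ty)
    c1 = lerp(c01, c11, ty)
    # ... and one along z: seven in total.
    return lerp(c0, c1, tz)
```

The same routine can be applied to an ESP map, a charge map or a local resolution map, since all of them are stored on the same kind of discrete grid.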

[0392] In an example, the interpolation takes place on the positions defined by the atomic coordinates of the structure model. It can be considered as a directed sampling of the map, where the samples represent the electrostatic feature of atomic environments. These values after the necessary corrections can be utilized in model building with noticeable result.

[0393] However, sampling need not take place only at the atomic positions. In a preferred method of the invention, in order to fine tune the parameters, sampling can take place at any point in defined spaces in the neighborhood of the atoms.

[0394] When a charge distribution map is applied, there is a broadening of the charge distribution typically ranging between 2.5 and 3.5 Angstroms around the position of each atomic nucleus. Inside this region there are not only positive values (as at the atomic positions), but negative values as well. The positive charge values originate from the positive nucleus, while the negative charge values originate from the negative electron cloud. By sampling in positions representing the features of the electron cloud the interpolated value can be enhanced with further information.

[0395] In our calculations and methods several instances of sampling happen in the defined 2.5 to 3.5 Angstrom range around all the atoms. With enough sampling the minimum value of the sampled distribution should be used to generally represent the electrostatic features of the electrons.
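
A minimal sketch of such a minimum search is given below. It assumes that the charge map can be evaluated at any point through a callable `field(x, y, z)` (for example an interpolation routine) and that the samples are drawn at random in a spherical shell around the atom; the random sampling scheme, the number of samples and the function name are assumptions for the illustration, and other sampling schemes are equally possible.

```python
import math
import random


def sample_shell_minimum(field, center, r_min=2.5, r_max=3.5,
                         n_samples=500, seed=0):
    """Sample `field` at random points in a spherical shell around `center`.

    Returns (minimum value, coordinates of that sample). The radius is drawn
    uniformly between r_min and r_max (not uniformly in volume).
    """
    rng = random.Random(seed)
    best_value, best_point = math.inf, None
    for _ in range(n_samples):
        # Random direction obtained from a normalised Gaussian vector.
        direction = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(d * d for d in direction))
        if norm == 0.0:
            continue
        radius = rng.uniform(r_min, r_max)
        point = tuple(c + radius * d / norm for c, d in zip(center, direction))
        value = field(*point)
        if value < best_value:
            best_value, best_point = value, point
    return best_value, best_point
```

The returned point plays the role of the hypothetical electron cloud position, and the returned value that of the electron-side charge value.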

[0396] For example, an atom is the alpha carbon atom of a proline amino acid in the B index protein chain in the 39th position measured from the N terminal. Its coordinates are (188.252; 185.957; 224.120) while the found electron sampling point coordinates are (187.354; 183.583; 225.688). Drawing a segment between the two points, several samplings can take place along it in order to present the typical curve in the local environment of an atom. This curve is presented in Figure 1.

[0397] Specifically, Figure 1 shows the change of the interpolated charge value along the line drawn between the atomic position and the hypothetical electron cloud position. The nucleus ideally is at the local maximum, while the electron is at the local minimum.
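
A curve of this type can be produced by sampling the map at evenly spaced points on the segment between the two positions. The following sketch assumes a callable `field(x, y, z)` that returns the interpolated charge value at a point; the function name and the number of sampling points are assumptions for the illustration.

```python
def line_profile(field, start, end, n_points=21):
    """Sample `field` at evenly spaced points on the segment start -> end.

    Returns a list of (t, value) pairs, where t runs from 0 (start, e.g. the
    atomic position) to 1 (end, e.g. the electron sampling point).
    Requires n_points >= 2.
    """
    profile = []
    for step in range(n_points):
        t = step / (n_points - 1)
        point = tuple(s + (e - s) * t for s, e in zip(start, end))
        profile.append((t, field(*point)))
    return profile
```

For the proline alpha carbon atom of the example, `start` would be (188.252, 185.957, 224.120) and `end` would be (187.354, 183.583, 225.688).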

[0398] The interpolated values can be utilized in several ways. In our calculations one option is to calculate the sum of the two values which - in our concept - represents a value proportional to the partial charge. However, this value tends to have a lot of noise, since Cryo-EM is not sensitive to the determination or measurement of the partial charges, so only an imprint is visible / detectable.

[0399] A more preferred and unexpectedly effective option is to calculate the difference of the two values, which is a positive value which - in our concept - represents a value proportional to the broadening / volume of the electron cloud. In this example this value is strongly related to the physical quantity of charges (herein calculated from the potential map) and therefore this derived value is referred to herein as inverse charge density (ICD) to express its connection to the charge distribution map and the fact that it is also related to the distribution of charges, somewhat similarly to charge density. However, the skilled person will understand that on the one hand this is a derived quantity and it can be denominated differently and on the other hand that other similar calculated values can be obtained from either the ESP map or the charge distribution map which can be utilized in the creation of the electrostatic protein (or macromolecule) model of the invention.
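
The two derived values discussed above can be written down in a few lines. The sketch assumes that `pcharge` is the charge value interpolated at the atomic position and `echarge` is the minimum charge value found in the surrounding search volume; the function name is hypothetical.

```python
def derived_charge_parameters(pcharge, echarge):
    """Return the sum and the difference of the two interpolated charge values.

    pcharge: charge value at the atomic position (typically positive).
    echarge: minimum charge value in the surrounding shell (typically negative).
    """
    partial_charge_like = pcharge + echarge  # noisy value related to the partial charge
    icd = pcharge - echarge                  # inverse charge density (ICD)
    return partial_charge_like, icd
```

For example, with a value of 0.75 at the atomic position and -0.25 at the electron sampling point (arbitrary scale), the sum is 0.5 and the ICD is 1.0.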

[0400] Figure 2 shows these derived parameters: “pcharge”: atomic position charge values; “ppotential”: potential (ESP) value interpolated at the atomic position; “echarge”: charge value interpolated at a given point of the hypothesized electron cloud in the minimum searching volume (here: a spherical shell); “pcharge” plus “echarge”.

[0401] It is interesting to note the tendency that the ICD values correlate with each of these derived quantities. This illustrates that these parameters are not independent of each other. For example, ICD is proportional not only to the widening of the electron cloud, but also to the partial charge within the space in which it is interpreted. The resolution of the starting data map (the ESP map) may not be atomic level, and we cannot fully separate the clear physical quantities. Thus, each derived quantity comprises some electrostatically related information which may be useful in the model formation in the present invention.

[0402] Nevertheless, the ICD value is a preferred one for multiple reasons, e.g. this value is sharper than the representative of the partial charge and is discrete for the atom type, thus using the ICD values is preferred later. In Figure 2 the values derived from the distributions are shown as scatter plots. Since the scales are irrelevant, they are omitted. Investigating the plots, it is noticeable that the values tend to correlate with each other. This means that the information of the independent parameters cannot be separated by these transformations, meaning that each of the parameters is a mix of the electrostatic features.

[0403] Figure 3 shows the calculated atomic ICD values of two chains of a SARS-CoV-2 spike protein plotted in the order of the amino acid index. The two curves present two different protein states. The up state and the down state show a significant difference on the receptor binding domain (300-500). This is caused by the local resolution difference between the two states.

[0404] This region is the receptor binding domain, through which the spike protein makes contact with the receptor, thus allowing the further interaction and fusion of the viral and host cell membranes.

[0405] Here the A chain is inactive and the B chain is active, and the structure of the latter is of higher motility; therefore the peak values, and thus the ICD values, are lower. Thus, the active chain results in higher σ and lower ICD values.

[0406] Figure 4 illustrates the calculated atomic σ values of two chains of a SARS-CoV-2 spike protein plotted in the order of the amino acid index. The two curves present two different protein states. The up state and the down state show a significant difference on the receptor binding domain (300-500).

[0407] The goal is to develop a new and unique representation of a protein molecule of interest which can be used to obtain further information on its regularities and to utilize them in developing a new drug design. In this effort, in a preferred embodiment, the local resolution distribution of the given ESP distribution can be calculated. Calculation of the local resolution distribution is a preferred option because of a characteristic of every Cryo-EM measurement, specifically the high scaling variability of the ESP distribution. The local scale of the ESP distribution shows a strong dependence on the local resolution. This effect results in great difficulties in the determination of comparable real electrostatic patterns on the molecular, atomic or subatomic level, even within the same Cryo-EM measurement. Reducing the effect of the inhomogeneity of the resolution distribution on the local scaling of the charge density distribution is a crucial step towards a technical improvement in data utilization.

[0408] Several algorithms and software packages for calculating the local resolution are available in the literature. As an example, the local resolution distribution can be obtained with the MonoRes method, which is known in the art.

[0409] Having acquired the reconstructed charge distribution and the local resolution distribution of a given Cryo-EM measurement, the next step is to assign physical information to each atom. For this, the structural model of the same Cryo-EM measurement is needed. Knowing the atomic spatial coordinates and the coordinate system of the discrete grid in which the distributions are presented, the values can be obtained by linear interpolation from the grid to each point representing the place of an atom. With this method an atomic local resolution value (σ) is determined.

[0410] In an example to show the effect and utilization of these atomic electrostatic properties, it is shown herein how to obtain fingerprint-like patterns for each protein state. With these patterns local and global characteristics can be identified. The ICD and σ patterns of two unique chains of a SARS-CoV-2 spike Cryo-EM measurement are presented in Figures 3 and 4, respectively.

[0411] As mentioned above and demonstrated in the figures, there is a dependency between the local scale of the ICD values and σ. Although there is no big difference between the maxima of σ in the different states, the minimum value of the ICD is significantly decreased in the active state. Because of this effect, the analysis of the ICD values or any other electrostatic properties calculated from the ESP distribution cannot be carried out globally, even within a single Cryo-EM measurement. Across several independent measurements, the difference in the arbitrarily determined scale is also a technical difficulty in data analysis. This embodiment makes the global comparison technically available and then demonstrates the novelty of the utilization of these patterns. Thus the mentioned technical disadvantages - the scaling problem and the local resolution dependency - should be solved beforehand.

[0412] II. 2.) Developing a new method in data analysis that makes general comparison and model building available - Correction

[0413] As mentioned before, the lack of standards in Cryo-EM data presentation is a huge difficulty in analysis. In the prior art there was no intention of utilizing the electrostatic data to improve drug research. Interpreting proteins as a system of different physical parameters requires complex and nontrivial sets of mathematical transformations. The ICD pattern shown in the previous section cannot be used as such because of the correlation between the ICD and σ. This relationship is presented in Figure 5.

[0414] Figure 5 shows the atomic σ and ICD values of an example SARS-CoV-2 spike protein measured by Cryo-EM. As σ increases (so the resolution becomes worse), the maximum value of the ICD decreases. Because of this correlation, global comparison between different sites of the protein is not possible. In order to solve this problem, two ways of data transformation were developed. Both of them have their own advantages and disadvantages, but overall they contribute to the processing of the measured data and provide a significant step forward towards the goal of making drug research a more efficient and predictable process. First, a correction is presented that needs the calculation of σ. Based on the correlation in Figure 5, we define the following dependency between σ and ICD values in Equation 11,

[0415] (Equation 11) where f is a typically linear scaling factor depending on the Cryo-EM measurement, a is the experimental exponent that describes the unwanted dependency and x is any other arbitrary factor that is resolution dependent and has an influence on the distribution of the ICD values. An example of this x factor can be the effect of the overlapping distribution of the ESP map. Since any x factor is a function of the resolution, Equation 11 can be approximated in the following way:

[0416] (Equation 12) where f is the scaling factor and b is the theoretical exponent that describes the resolution dependency for each x parameter. The equation can be rearranged in the following way:

[0417] (Equation 13)

[0418] With this rearrangement the next step becomes the calculation of the combined a’ exponent. The calculation can be done with mathematical iteration in which the goal is to minimize the correlation between ICD’ and σ. Although the dependency becomes minimal, the transformations need to be made for each protein subunit in the case of a protein complex like the SARS-CoV-2 spike. Since the calculated a’ values might differ in each case within a single Cryo-EM measurement, and in most cases between independent measurements, a scale transformation must be done. This step is also needed because of the mentioned arbitrary scale of the ESP distribution presentation. In our examples the scaling is done by modifying the mean and the standard deviation of each dataset in the way defined in Equation 14.

[0419] (Equation 14)

[0420] The overline means the calculation of the arithmetic mean while s is the standard deviation. After the calculations the distribution of the |ICDcorrected| data is trimmed above 5. Since big and unique macromolecules - which have almost the same primary structures and similar spatial conformations - are compared, the relative values of the ICD data should also be very similar. It is important to note that in small regions of the proteins with their own specific local structures this statement is not always valid. The relationship between the σ and ICDcorrected values is presented in Figure 6.
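
The correction steps can be illustrated with the following minimal sketch. Since the exact forms of the equations are not reproduced here, the sketch rests on stated assumptions: the transformed value is taken as ICD multiplied by a power of σ, the iteration is replaced by a simple search over candidate exponents that minimizes the absolute Pearson correlation with σ, and the scale transformation is taken as a shift to zero mean and unit standard deviation followed by trimming of absolute values above 5. All function names are hypothetical.

```python
import math
from statistics import fmean, pstdev


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # a constant series carries no linear dependency
    return sxy / math.sqrt(sxx * syy)


def fit_resolution_exponent(icd, sigma, exponents):
    """Pick the candidate exponent e minimising |corr(ICD * sigma**e, sigma)|."""
    def residual(e):
        transformed = [v * s ** e for v, s in zip(icd, sigma)]
        return abs(pearson(transformed, sigma))
    return min(exponents, key=residual)


def standardise_and_trim(values, limit=5.0):
    """Shift to zero mean and unit standard deviation, then drop |z| > limit."""
    mean, sd = fmean(values), pstdev(values)
    z = [(v - mean) / sd for v in values]
    return [v for v in z if abs(v) <= limit]
```

In this sketch the exponent would be fitted separately for each protein subunit, after which the transformed values of each subunit are standardised so that datasets with different arbitrary scales become comparable.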

[0421] Figure 6 illustrates the atomic σ and ICDcorrected values of an example SARS-CoV-2 spike protein measured by Cryo-EM. The relationship is eliminated; there is no correlation between the parameters. The cut-off threshold is indicated by dashed lines.

[0422] As the figure shows, the correlation is minimized after the calculations, so the ICDcorrected parameter is suitable for pattern comparison in the chosen protein. The pattern of the ICDcorrected parameter of two unique chains is shown in Figure 7.

[0423] Figure 7 shows the pattern of the ICDcorrected values of two unique chains of a SARS-CoV-2 spike Cryo-EM measurement. (The spike is a homotrimer of three chains, and these chains are called unique chains by the PDB database.) The difference between the values at the same positions can be used in further analysis.

[0424] As presented, the correction method converts the electrostatic patterns into generally comparable ones in two unique chains of a chosen protein. To demonstrate that the method is comprehensive, one-way ANOVA tests were calculated for the alpha carbon atoms of 9 independent SARS-CoV-2 spike protein measurements. The results for the original ICD values are presented in Table 1.

[0425] Table 1 contains the means and standard deviations of the ICD values of the α carbon atoms of unique chains of 9 independent Cryo-EM measurements. There is a very big difference between the averages because of the arbitrary scales of the original ESP distribution. The result of the one-way ANOVA test shows that the expected values differ significantly (α = 0.05). The protein subunits are in the same down state.
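
For illustration, the F statistic of a one-way ANOVA test over several groups of values (for example the alpha carbon ICD values of each unique chain) can be computed as sketched below. The function name is hypothetical and the sketch does not reproduce the tabulated results; the p-value follows from the F distribution with the returned degrees of freedom and would in practice be taken from a statistics library.

```python
from statistics import fmean


def one_way_anova_f(groups):
    """Return (F statistic, df_between, df_within) for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = fmean([v for g in groups for v in g])
    means = [fmean(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the individual values around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```

A large F statistic relative to the critical value at the chosen significance level indicates that the expected values of the groups differ significantly.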

[0426] The table shows the technical limitation of the analysis of the ICD data. Measurements published even in the same article can have significantly different average values, for example in the case of 7DKG and 7KDH. This shows not only the need for mathematical transformations but also indicates that the idea is new and that there was no effort to utilize the electrostatic data for purposes other than the evaluation of the atomic structure. After the calculation of the ICDcorrected values the ANOVA test was calculated again on the same unique chains of independent Cryo-EM measurements. The results are presented in Table 2. Table 2 shows the means and standard deviations of the ICDcorrected values of the α carbon atoms of unique chains of 9 independent Cryo-EM measurements. The difference between the averages decreased. The result of the one-way ANOVA test shows that the difference between the expected values is not significant (α = 0.05). The protein subunits are in the same down state. The standard deviations became relatively large due to the transformation.

[0427] The table shows that the average values of the unique chains are scattered in a very small range. Different Cryo-EM measurements from different teams on different microscopes can now be compared with this method.

[0428] As an example, to show one of the diagnostic values of this method, another ANOVA test was made. In this test the input data are combined with measured structure data of protein subunits which are in the active, up state. In that case the main difference is due to a large alteration in molecular structure, when the receptor binding domain (RBD) erects and changes to bind to its receptor in the membrane of the host cell. The additional values and the new p value are presented in Table 3. Note that the contents of Table 2 are also part of this calculation.

[0429] Table 3 shows the means and standard deviations of the ICDcorrected values of α carbon atoms on the protein subunits that are in the up state. By combining the content of Table 2 with these data the expected values became significantly different (α = 0.05).

[0430] The table shows that the change of the protein state results in significantly different expected values of the alpha carbon atoms. This means that the average characteristics of these atoms change in a way that can be diagnosed. This result is the first step to create a new way of in vitro and in silico diagnostics of the protein states.

[0431] In the examples further results are shown to prove that the change in the electrostatic pattern is more responsive to abnormalities than structural interpretation alone. Before that, another method is presented to create generally comparable data.

[0432] II. 3.) Developing a new method in data analysis that makes general comparison and model building available - Normalization

[0433] Calculation of the generally comparable patterns is possible with the iterative method presented in the previous section. Although the results are as good as expected, there are limitations in the calculation of the local resolution. Some algorithms require raw data that are not always published in the databanks. Also, the calculations might need a thorough insight into the steps of the 3D ESP map reconstruction, thus automation is not always possible. In order to develop a simpler data transformation method, Equation 11 needs to be reconsidered. The σ values, which are typically larger than 1.5 Angstrom, represent the local resolution. This means that this parameter changes over significantly larger volumes in space than 2-3 covalent bonds. That is why within this distance the σ coefficient can be considered a constant. The rewritten form of Equation 13 can be found in Equation 15. (Equation 15) Considering that in the nearby atoms only the ICD parameters are different, a ratio can be calculated in the following way in Equation 16.

[0434] (Equation 16)

[0435] Where n is the number of atoms used for calculating the mean, and a and b are the indices of the different atom types. Note that a can be equal to b. The atomic characteristics in the denominator are important: the calculation gives different results when the atom types are different. This modification theoretically satisfies the requirement of unifying the scales and minimizing the σ dependency of ICD values because the fraction can be simplified locally. To prove this, the relationship between the newly calculated ICDnormal values and the σ values is presented in Figure 8.
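
The normalization can be sketched as follows, under stated assumptions: the ICD value of an atom of type a is divided by the mean ICD value of the atoms of type b in the neighboring residues, where the neighborhood is taken as a window of residues around the current position along the chain. The window definition, its default size and the function name are assumptions made for the illustration and may differ from the definition in Equation 16.

```python
from statistics import fmean


def normalise_icd(icd_a, icd_b, half_window=2):
    """Divide each ICD value of atom type a by the local mean ICD of atom type b.

    Both lists are ordered by residue index; the local mean uses the residues
    within `half_window` positions of the current one (clipped at chain ends).
    """
    normalised = []
    for i, value in enumerate(icd_a):
        lo = max(0, i - half_window)
        hi = min(len(icd_b), i + half_window + 1)
        normalised.append(value / fmean(icd_b[lo:hi]))
    return normalised
```

Because numerator and denominator share the same local scale and the same local resolution, multiplying all input values by a common factor leaves the normalized values unchanged, which is the property exploited by this transformation.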

[0436] As the figure shows, the scale of ICDnOrmai values is different from the one of ICD This is not a problem until a direct comparison is needed between the two parameters, which comparison seems to be meaningless at the moment. It must be emphasized that the two parameters contain different information about the characteristics of the protein states so that needs to be taken into consideration in the applications. This is the aim of the normalization because not just the resolution dependent parameters get simplified but also the locally same dependencies of the values. This means a change of the trends of the reconstructed patterns. To demonstrate this, the ICDnOrmai values are presented in Figure 9 in the order of the amino acid index. As the figure shows the normalization meets the requirements and produces a generally comparable parameter. The ANOVAtest of Table 2 is calculated to the ICD values as well and the results are presented in Table 4.

[0437] Figure 9 illustrates the pattern of the ICDnormai values of two unique chains of a SARS-CoV-2 spike Cryo-EM measurement. The difference between the values of the same positions can be used in fiirther analysis. Comparing with Figure 7 the trends are changed, the contained information is a bit different.

[0438] Table 4 shows the means and standard deviations of the ICDnOrmai values of the alpha carbon atoms of unique chains of 9 independent Cryo-EM measurement. The difference between the averages is really small, however the standard deviation became even smaller. The result of the one-way ANOVAtest shows that the difference is significant between the expected values (a = 0.05). The protein subunits are in the same down state.

[0439] As Table 4 illustrates the ANOVA test shows that there is a significant difference between the expected values of each unique chain. However, analyzing the means and the standard deviations the dataset shows us that the distributions of each atomic ICDnOrmai value became much narrower than the distribution of the ICDcorrected values and even in the original ICD values. This means that this dataset has the potential to derive a more precise information on the characteristics of each protein state.

[0440] To summarize, both the correction and the normalization method give a transformed electrostatic-type parameter that allows, for example, a protein to be presented as unique, fingerprint-like sets of electrostatic patterns. The patterns obtained with these transformed parameters are generally comparable and have the potential to add information to existing biomolecular knowledge. An example of this is the ANOVA test shown in Table 3, which distinguishes between two protein states. Unless the problems mentioned above are solved, there is no known way to develop a data analysis method that aims to massively improve the in silico part of existing drug research. In this respect, Cryo-EM is the most advanced tool for measuring the native state of a protein molecule of interest; thus, its use is preferred over in silico protein modeling, which is not accurate enough. On the other hand, it is apparent that, with over 20,000 3D Cryo-EM protein models in the PDB database and a lack of standards in 3D ESP distribution reconstructions, there has been no attempt to utilize the electrostatic information for further purposes.

[0441] Several calculations are shown below that aim to find regularities in the primary structure, in the folding and in protein-protein interactions. Without the data transformations, these calculations could not have been carried out because of the existing technical problems.

[0442] II. 4.1) Analyzing the autocorrelation behavior of ICDcorrected parameters of unique protein chains

[0443] The previously described methods support the hypothesis that the electrostatic representation of a protein molecule is achievable with generally comparable sets of parameters. These parameter sets can be imagined as unique, fingerprint-like patterns, just like the presentation of the ICD values of alpha carbon atoms along the peptide chain. Although the correction allowed for the identification of significant effects in the mean values of functionally distinct states of spike proteins, no further evidence has yet been presented to confirm that these parameters carry biologically relevant information. In this context, an autocorrelation analysis of the ICDcorrected parameter is performed to assess whether further calculations would serve a meaningful purpose.

[0444] In general, correlation is a statistical measure that describes the direction and the strength of a relationship between two variables. In this document the correlation is always calculated linearly, using the Pearson correlation coefficient. This coefficient is defined in Equation 17.

[0445]-[0446] $$r = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i}(x_i - \bar{x})^2 \, \sum_{i}(y_i - \bar{y})^2}}$$

[0447] (Equation 17) where r denotes the measure of linear correlation between the variables x and y. The calculation of this coefficient is only meaningful for strictly monotonic and linear relations between two variables. With this knowledge the correction step can be further investigated. In Figure 5 the relationship between the atomic s and ICD values is monotonic, which is a crucial criterion. If the correction is carried out on a parameter with which the relation is not strictly monotonic, the absolute value of the given parameter should be investigated. Although the other criterion of linear correlation calculation is linearity, neglecting it tends to introduce only a small bias. With the iteration process, which minimizes the dependency on the s value, an a’ value is determined that leads to a small correlation coefficient, preferably smaller than 0.05 in absolute value, measured between the processed (modified) electrostatic-type parameter (ICD’ in this example) and the s value.
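The Pearson coefficient of Equation 17 can be illustrated with a minimal Python sketch. The function name `pearson_r` and the plain-list input are assumptions made for illustration only; they are not part of the described method.

```python
import math


def pearson_r(x, y):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Numerator: sum of products of deviations from the means.
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Denominator: square root of the product of the summed squared deviations.
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)
```

In such a sketch, a check like `abs(pearson_r(icd_modified, s_values)) < 0.05` would express the stopping criterion of the iteration described above, with `icd_modified` and `s_values` being hypothetical input lists.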

[0448] Having introduced the important features of the linear correlation coefficient and its calculation, a well-known application is carried out: the autocorrelation calculation. Autocorrelation is the correlation of a signal with a delayed copy of itself. This analysis is a mathematical tool for finding repeating patterns and periodic signals. In other words, autocorrelation measures the relationship between a variable’s current value and its past values. The size of the delay is defined as the lag. Lag 0 means that the signal is correlated with itself, thus resulting in r = 1 by definition. Lag 1 means that every value of the signal is correlated with its immediate past value. Autocorrelation is a basic tool for detecting long-term relationships in signals such as our electrostatic patterns representing unique protein chains. High values represent a strong linear relationship between points separated by the given lag. The autocorrelation analysis of the functionally active protein subunit of the SARS-CoV-2 spike (PDB ID: 7KDL) is calculated on the ICDcorrected values of the peptide chain atoms and of the alpha carbon atoms. Note that these calculations neglect missing or trimmed data points, so that a lag does not necessarily correspond to exactly one shift along the chemically periodic peptide chain structure. Firstly, the autocorrelation of the peptide chain atomic ICD values is presented in Figure 10.
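The lag-based calculation described above can be sketched as follows. This is a minimal illustration under the assumption that the signal is a plain list of parameter values in chain order with missing points already removed; the function name `autocorrelation` is hypothetical.

```python
import math


def autocorrelation(signal, max_lag):
    """Return [r_0, ..., r_max_lag]; r_k is the Pearson correlation
    between the signal and a copy of itself delayed by k points."""
    n = len(signal)
    if not 0 <= max_lag < n - 1:
        raise ValueError("max_lag must satisfy 0 <= max_lag < len(signal) - 1")

    def pearson(x, y):
        mx, my = sum(x) / len(x), sum(y) / len(y)
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        vx = sum((a - mx) ** 2 for a in x)
        vy = sum((b - my) ** 2 for b in y)
        if vx == 0 or vy == 0:
            raise ValueError("correlation is undefined for a constant series")
        return cov / math.sqrt(vx * vy)

    result = []
    for lag in range(max_lag + 1):
        current = signal[lag:]      # values at positions lag .. n-1
        past = signal[:n - lag]     # the same values shifted back by `lag`
        result.append(pearson(current, past))
    return result
```

For a signal that repeats every three points, as a backbone sequence of amino nitrogen, alpha carbon and carbonyl carbon values would in an idealized case, such a function returns peaks at lags 3, 6, 9 and so on, which is the kind of periodicity discussed for the peptide chain atoms.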

[0449] As the plot shows, even at lag 90 (which corresponds to a polypeptide segment of at least 30 amino acids) there is a significant, non-zero correlation. Although the strength of the correlation is considered moderate or weak, the result is remarkable. Since the chemical characteristics of peptide chains do not allow long-range delocalization along the main chain, due to the insulating effect of the sp3 alpha carbon atom, this result suggests that the electrostatic effects still influence each other in a way that results in the demonstrated pattern. In other words, the pattern is not randomly formed. The periodic peaks in the diagram also show a chemical relevance in the electrostatic parameter signal. The peaks mean that the ICDcorrected values of atoms of the same type (in the case of a peptide chain these are the amino nitrogen, the alpha carbon and the carbonyl carbon) tend to have higher correlations with each other. This is also strong evidence that the type of main chain atoms is recognizable by Cryo-EM.

[0450] This autocorrelation has also been carried out for a subgroup (trim) of the data of this plot, consisting only of the alpha carbon atoms. Relying on the above result, which shows autocorrelation over a surprisingly long distance along the polypeptide chain, we expected the correlation to remain statistically significant while the peaks disappear. The plot is shown in Figure 11.

[0451] The diagram meets our expectations. The long-term relations seem to be real, and the peaks were indeed due to the chemical regularities. This analysis suggests that the relations between small submolecular entities such as atoms could be defined by regression models, e.g. by multilinear models, which means that the regularities of the formation of the electrostatic pattern can be estimated. Also, this result and the desired positive result of the forthcoming multilinear regression calculations could support the inventive concept that these patterns have diagnostic value, which could be expected after the ANOVA results in Table 3.

II. 4.2) Multiple linear regression model for defining the regularities of the pattern formation

[0452] As presented in the previous paragraphs the transformation of the measured ESP distribution of a protein molecule is a crucial step in data analysis. The presentation of such macromolecules as a plurality of electrostatic parameters is a novel way to acquire further knowledge on the behavior of proteins, including working mechanism, formation of primary structure, the regularities of the folding and the interactions between proteins.

[0453] Firstly, an analysis will be shown to demonstrate that the results are not random and they contain the necessary information for further analysis. For this a regression method needs to be applied to our datasets.

[0454] With the help of the multiple linear regression (MLR) method, the connections between the chosen variables can be quantified. The aim is to determine the effect of a change in one variable on the other variables. Ideally this connection can be defined as stated in Equation 18:

[0455] $$Y = b_0 + \sum_{i=1}^{n} b_i X_i + e$$ (Equation 18)

[0456] where Y is the target (dependent) variable, bi is the coefficient corresponding to the variable Xi, the Xi being used as predictor variables in the models of the invention, b0 is the constant term and e is the regression error term. The target variable is set up as the biochemical feature value, i.e. the target variable is connected to the biochemical feature or describes it. In the regression the error term is minimized by minimizing the sum of squared differences.

[0457] Since the fitted coefficients carry the regression error, they are considered estimates (denoted by a hat); with them a score value is calculated, which is the estimate of the target variable.

[0458] $$\hat{Y} = \hat{b}_0 + \sum_{i=1}^{n} \hat{b}_i X_i$$ (Equation 19)

[0459] The score is the estimation of the target variable, i.e. of the biological feature value.

[0460] The coefficients and the constant term can be calculated in a way defined in Equation 20:

[0461] $$\hat{b} = (X^{T} X)^{-1} X^{T} Y$$ (Equation 20)

[0462] In the equation, b̂ is the vector containing the parameters of the regression line, X is the matrix containing the predictor (independent) variables (XT is its transpose and the superscript -1 denotes the matrix inverse), Y is the vector containing the dependent variable(s), T denotes transposition, and the hat denotes estimation.
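The least-squares estimate described above can be illustrated with a short, dependency-free Python sketch. It solves the normal equations (XᵀX) b = XᵀY by Gaussian elimination, which is mathematically equivalent to the closed form with the inverse matrix; the function names `fit_mlr`, `solve_linear_system` and `score` are assumptions made for illustration, and in practice a numerically more robust library solver would usually be chosen.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; predictors are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_mlr(rows, targets):
    """Return [b0, b1, ..., bn] minimizing the sum of squared differences."""
    design = [[1.0] + [float(v) for v in row] for row in rows]  # 1.0 -> constant term
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(design, targets)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def score(coeffs, row):
    """Estimated target value for one row of predictor values."""
    return coeffs[0] + sum(b * x for b, x in zip(coeffs[1:], row))
```

A possible use is `coeffs = fit_mlr(predictor_rows, target_values)` followed by `score(coeffs, new_row)`, where `predictor_rows` would hold, for example, the parameter values of neighboring atoms.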

[0463] The interpretation of the coefficients is that a unit increase in a given predictor (independent) variable, with the other predictors held fixed, causes a change equal to the coefficient in the estimated value (score) of the target (dependent) variable. In the scenario given herein, plausibly, the coefficient is the measure or weight of an electrostatic-type parameter, like ICD, in its effect on determining another parameter.

[0464] (The elements of a vector are e.g. the parameter values, considered as dependent variables, along a section of the protein chain or in a partial space dissected from the protein structure.)

[0465] The advantage of the method is that it is possible to define so-called Bernoulli variables based on distinguished properties. These variables are artificial discrete numerical values used to define parameters that may have an important effect on the dependent variable despite not being numerical in nature. Using a Bernoulli variable as the target of an MLR model can, however, introduce a small bias, so methods such as logistic regression are preferred for such classification tasks.

[0466] With the MLR method the aim is to define the interactions between the atomic electrostatic parameters. The calculation of these regularities between the variables indirectly defines the formation of the earlier mentioned patterns. If the datasets are not randomly distributed, the calculated correlation between the real and the estimated coefficients is significant, which means that the system is determined and the electrostatic values are indeed in a locally or globally extensive connection. For the calculation the earlier presented Cryo-EM measurements were used, see for example the data in Table 1. In the first presentation a so-called interpolation was performed, meaning that the information of the atoms neighboring the estimated electrostatic parameter was utilized. In the calculations the ICDcorrected values were used. The result is shown in Figure 12.

[0467] As the figure shows, the interpolation is very powerful: the ICDcorrected parameter of the alpha carbon atom is defined by the neighboring atoms. This means that the pattern formation is not random but based on mathematical regularities. The coefficients describe the average behavior of these atoms in SARS-CoV-2 spike proteins. These coefficients can be found in Table 5. Here the neighboring amino acids are denoted by 0 and 2, respectively, whereas in the indication of atoms the first letter (C, O or N) denotes the element and the second letter, if present, the position in the side chain, wherein capital Latin letters replace Greek letters, i.e. A stands for alpha and B stands for beta atoms (etc.) of the side chain.

[0468] Table 5 shows the determined coefficients used for the calculations resulting in the plot in Figure 12. The indices represent the amino acid number within a chosen tripeptide segment of the protein. The atoms relating to the descriptor are the peptide atoms (CA, C, O and N) of the central amino acid and of each neighboring amino acid (with indices like 0_CA and 2_CA). The target variable (1_CA) is indicated with weight 0 in the Table.
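How such a tripeptide descriptor might be assembled into regression inputs can be sketched as follows. This is an illustrative assumption only: the data layout (one dictionary of atom name to ICDcorrected value per amino acid, in chain order and without gaps), the function name `tripeptide_rows` and the column order are hypothetical and are not taken from Table 5.

```python
BACKBONE = ("N", "CA", "C", "O")


def tripeptide_rows(residues, target=(1, "CA")):
    """Build predictor rows and targets from sliding tripeptide windows.

    residues: list of dicts mapping atom name -> parameter value.
    target:   (position in the tripeptide, atom name) to be estimated.
    Returns (names, rows, targets); names look like '0_CA' or '2_N'.
    """
    columns = [(pos, atom) for pos in range(3) for atom in BACKBONE
               if (pos, atom) != target]
    names = [f"{pos}_{atom}" for pos, atom in columns]
    rows, targets = [], []
    for start in range(len(residues) - 2):
        window = residues[start:start + 3]
        # Skip windows in which any backbone atom has no value.
        if any(atom not in res for res in window for atom in BACKBONE):
            continue
        rows.append([window[pos][atom] for pos, atom in columns])
        targets.append(window[target[0]][target[1]])
    return names, rows, targets
```

With the default target, each row holds the 11 backbone values surrounding the central alpha carbon, which corresponds to the interpolation setting; choosing a target at position 2 and restricting the columns to positions 0 and 1 would correspond to an extrapolation setting.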

[0469] For further analysis, not only interpolation but also extrapolation was tested with the coefficients. Using the parameters shown in Table 5, the ICD values of several consecutive alpha carbon atoms were estimated with the MLR method. The local resolution of the measurements is typically 3 Angstrom and the interpolation takes place within this distance. Thus, the extrapolation test is necessary to test the method’s performance outside this distance. The estimation in this example was carried out up to the 24th alpha carbon atom, to safely exclude the possibility of ambiguous, thus meaningless calculations.

[0470] In Figure 13 the change of the correlation value between the estimation and the measurement is plotted, i.e. the change of the correlation coefficient as the estimated atomic ICDcorrected parameter gets further away from the input information. Note that only the parameters shown in Table 5 were utilized.

[0471] As Figure 13 shows, there is still a significant correlation between the measurement and the estimation, even when rough extrapolations are made. This demonstrates that the formation of the pattern is not random, that the curves in the previous figures are not noise and that the regularities can be quantified. Thus, providing and using the descriptor-based protein model of the invention has strict biological relevance.

[0472] II. 5.) Utilization of the electrostatic patterns, defining the regularities of the pattern formation based on the type of the amino acids, defining the effect of the primary structure.

[0473] The inventors have unexpectedly recognized that the electrostatic presentation of a protein molecule is a conceptual change which describes the protein molecule from a novel perspective.

[0474] In this embodiment, the Cryo-EM measurements, used before to obtain a three-dimensional map of atomic coordinates, provide information on the native electrostatic states of a protein. Information on these electrostatic parameters can be transformed or processed by the method of the invention, by creating descriptors or a set thereof as described herein, to obtain a model of the protein molecule that has the potential for, or is useful in, further biological or medical uses. This creates a new perspective in drug design and research as well.

[0475] In a preferred method it has been shown that the transformation steps can result in generally comparable parameters.

[0476] The appropriately processed electrostatic-type parameters (i.e. parameter values carrying information on physical electrostatic quantities of the protein molecule) are further used to create descriptors of the protein or a part thereof.

[0477] In a simple exemplary embodiment such parameters can be used to form unique, fingerprint-like patterns along protein chains or their arbitrarily small subunits. It has been shown that even after several tens of covalent bonds, a remarkable correlation can be observed in the extrapolation calculation. These patterns are not random, and their formation can be quantified. These quantified patterns have made it clear that macromolecular systems like proteins are electrostatically determined. The surprising discovery is that this electrostatics-related model of the protein provides a more sensitive detector of changes and/or differences in protein structure and function than 3D structural information in itself.

[0478] In order to improve the previously shown model an amino acid “filter” can be applied to the datasets. This technique assumes that there is a difference between the effects of the different amino acid types on the pattern formation, e.g. on the pattern based on the main (polypeptide) chain.

[0479] For these calculations the extrapolation MLR method was applied, and in this example the calculations were performed on the same Cryo-EM measurements that have been used previously. The estimated variable was the parameter assigned to the carbonyl carbon atoms. The predictors are the parameters of the peptide backbone of the two amino acids preceding the amino acid to which the carbonyl carbon atom of the estimated variable belongs. The amino acid filter was applied to the directly preceding amino acid, i.e. the amino acid with index 1. The results showed the quantified connection between the ICD_corrected_2_C parameter and the other parameters. The accuracy of the estimations remained significant.

[0480] The main factor determining the value of the ICDcorrected parameter of the carbonyl carbon atom is the electrostatic feature of the previous carbonyl carbon atom, here characterized as an ICDcorrected parameter. This indicates a periodicity in pattern formation. However, this is not typical in the case of Aspartic acid and Tryptophan, where the previous oxygen atom dominates as the main determining factor. This atom is also dominant in the case of Histidine and surprisingly shows an inverse behavior in the case of Tyrosine. There is also a difference in the constant terms, which allows the fit to be unbiased. What can clearly be seen is that each amino acid has a specific, mathematically quantified effect on the pattern formation in this set of proteins. No side chain data were utilized, in order to keep the equations comparable, but including them would make a further difference.

[0481] The calculations were also performed on an even larger dataset containing crystallographic electrostatic data (i.e. electron density); the coefficients are stabilized and are the same for different splits of the filtered data. The classification report of a tree-based algorithm showed an accuracy score of 0.96 when the amino acid type was predicted from the given coefficients, meaning that each amino acid has specific characteristics in the electrostatic pattern formation. The macro average recall was 0.96, the macro average precision was 0.96 and the macro average F1-score was 0.95 in determining which amino acid type served as the basis for the calculations. The whole classification result can be seen in Figure 33.
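For readers less familiar with the reported metrics, the following minimal Python sketch shows how accuracy and macro-averaged precision, recall and F1-score are conventionally computed from true and predicted class labels. The function name `macro_report` is an assumption made for illustration, and the sketch says nothing about the classifier actually used.

```python
def macro_report(y_true, y_pred):
    """Accuracy and macro-averaged precision, recall and F1-score."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1_scores = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)
    count = len(labels)
    return {
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true),
        "macro_precision": sum(precisions) / count,
        "macro_recall": sum(recalls) / count,
        "macro_f1": sum(f1_scores) / count,  # mean of the per-class F1 values
    }
```

Because the macro F1-score is the unweighted mean of the per-class F1 values rather than the harmonic mean of the macro precision and macro recall, it can be slightly lower than both of them, as in the values reported above.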

[0482] To summarize: it is clear that the pattern is not randomly formed but based on a strict and strong relationship between the atoms, which can extend over several covalent bonds. These relations depend on the type of the amino acids forming the peptide chain; thus the observed pattern also depends on the primary structure. With this information in silico models can be built for estimating the effect of point mutations, which is largely an unclear field in biochemistry. At present there are no research projects in advanced stages that deal with the electrostatic change induced by point mutations in a protein molecule, which substantiates the novelty of this patent.

[0483] II. 6.) Utilization of the electrostatic patterns, defining the differences between protein states based on the electrostatic patterns. Screening ligands in the new way of drug design.

[0484] Based on the aforementioned points, it is clear that a deeper, quantitative analysis of the differences between electrostatic (ICDnormal) patterns is accessible. As previously shown, the presented curves are not random but based mainly on the primary structure of the given protein. However, a protein can be in different states for several reasons. The different states may be due to its normal mechanism. Alternatively, the protein may have a normal and a pathological state. The states may be converted into each other e.g. due to a regulatory effect, e.g. by a regulator substrate acting as an activator or an inhibitor, or by a “toxic” material, etc. In such cases, forming a difference between the electrostatic representations of the protein in the multiple states is a preferred method.

[0485] As an example, it was previously demonstrated with an ANOVA test (Table 3) that the SARS-CoV-2 spike down and up states can be differentiated by the electrostatic patterns. In the presented model the ICDnormal values are preferred for this purpose, e.g. because a more precise test is needed when the standard deviations are relatively low. In order to show that active and inactive protein states behave differently and have different patterns, we analyzed several Cryo-EM measurements with the Dynamic Time Warping (DTW) method.

[0486] The DTW method is a widely used data analysis algorithm, primarily aimed at highlighting trend differences between data series. During calculations, the distance between two data series is determined not by a static, so called Euclidean method, but dynamically. The data series undergo temporal (sequence) optimization with the goal of minimizing the distance. In our case, the method helps reduce differences arising from measurement noise while emphasizing differences and transformations spanning larger protein segments.

[0487] The input parameters for the DTW method were ICDnormal data series calculated on the atoms of the peptide chain of unique protein chains. In the case of SARS-CoV-2 spike proteins each Cryo-EM measurement should contain the information of 3 protein subunits. Each of the 3 subunits can be in either the active or the inactive state. For the calculations we applied a relatively small window size of 5. Before the calculations the datasets were merged by the amino acid index in order to prevent larger differences resulting from poor data analysis. The calculated difference value was normalized for the number of input data points in order to allow direct comparison. The results were then scaled into the interval 0-1 and rounded to the nearest tenth for the following plot.
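The processing steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the window is interpreted as a band around the diagonal (Sakoe-Chiba style), the local cost is the absolute difference, the normalization divides by the number of data points of the longer series, and the scaling is a min-max scaling; none of these choices, nor the function names, are confirmed by the description.

```python
import math


def dtw_distance(a, b, window=5):
    """Dynamic Time Warping distance with a band constraint of `window`."""
    n, m = len(a), len(b)
    w = max(window, abs(n - m))  # the band must at least cover the length gap
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # step in a only
                                     cost[i][j - 1],      # step in b only
                                     cost[i - 1][j - 1])  # step in both
    return cost[n][m]


def normalized_dtw(a, b, window=5):
    """DTW distance divided by the number of data points (assumed rule)."""
    return dtw_distance(a, b, window) / max(len(a), len(b))


def scale_to_unit_interval(values):
    """Min-max scale to 0-1 and round to the nearest tenth."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [round((v - low) / (high - low), 1) for v in values]
```

With `window=0` the alignment is forced onto the diagonal and the result reduces to a static, point-by-point sum of differences, which illustrates the contrast between the dynamic and the static distance mentioned above.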

[0488] Two prominent results can be observed almost instantly. The first is the trend of protein 7LYK, which is the only spike protein in the 2 active - 1 inactive state. This difference correlates with the current biological understanding of SARS-CoV-2 spike protein states. The other trend is observed for proteins 6VXX, 7KDG and 7KDK. The subunits of each of these proteins typically show a lower difference compared to each other than any others. This is an artifact of the Cryo-EM ESP distribution reconstruction: C3 symmetry was applied in the elaboration of the map, which is detectable in our analysis. Three biologically different groups could be formed, containing the differences of the down-down, up-down and up-up states. The distribution of the groups is plotted in boxplot form in Figure 14.

[0489] Analysing the results, the medians are clearly separated from each other, even though the whiskers of the distributions may overlap.

[0490] Note that between the inactive states there can be a difference coming from the different mutations as well. When comparing active states it may be surprising that the average difference between them is even bigger than in any other group. This can also be the effect of different mutations and stabilizing effects. If investigated further, this analysis can provide information on the evolution of the virus.

[0491] In any case, with the analysis of SARS-CoV-2 spike protein states evidence has been found that the patterns arrived at by our newly developed transformation steps are not random and can be used to distinguish between the different electrostatic forms of proteins. This demonstration is supported by the existing biochemical knowledge based on the spatial differentiation between forms or in silico biochemical tests.

[0492] To provide another, very different example, we carried out a protein state analysis on the NMDA receptor, in this example the Rattus norvegicus NMDA receptor subunits GLuN1b and GLuN2b. The role of this glutamate-dependent Ca2+ ion channel is crucial, since both its overactivity and its underactivity can lead to severe neurological disorders. Modulation of this receptor is a key therapy, and for this reason a comparison was made between Cryo-EM measurements of native states and of modulated states of the NMDA receptor. The investigated change of states originates from modulator ligands that are competitive antagonists. One of them is the molecule SDZ 220-040 (SDZ), which can substitute the glutamate interaction with the GLuN2 protein subunit, while the other molecule is the glycine-competitive antagonist L689,560 (L689), which interacts with GLuN1. The molecules remained in the preclinical phase because of the side effects observed in vivo. The in vitro tests showed the inhibition of the receptor, as the measured Ca2+ flow decreased.

[0493] We calculated DTW differences between 5 independent Cryo-EM measurements. The input parameter was the ICDnormal values, where the mean for the normalization was calculated on the amino nitrogen, alpha carbon and carbonyl oxygen atoms of each amino acid. Chain A and chain B are the GLuN1 subunits of NMDA receptors.

[0494] The pattern showed relatively low differences between exactly identical states, which is supported by elementary biochemical knowledge. The native inactive state of measurement 6WHR clearly differs from the native active state, activated by Glycine. This observed difference defines the physiological activation of the protein. In the case of antagonist-modulated measurements the electrostatic pattern differs significantly from the pattern of the active state, which means that the antagonists do not have the same effect as the physiological activators. However, in the case of the 6WHW GLuN1 subunit there was an interaction with glycine, and the observed difference is due to the GLuN2 modulator SDZ. This means that SDZ has an effect on this subunit as well. 6WHW also differs from 6WHT, which means that the state of 6WHW is artificial and similar neither to the native inactive nor to the active state. The change induced by L689 in the case of 6WHY is even more extreme. This artifact could produce the desired inhibition in vitro; in vivo, however, the electrostatic patterns on proteins are important.

[0495] With this way of data analysis, ligand screening can be performed by measuring in vitro protein states with Cryo-EM; with the help of the developed in silico data transformation, ligands can be selected for further in vivo tests according to a factor that quantifies the similarity of the artificial electrostatic state and the physiological electrostatic state. This screening method would be beneficial for a fundamentally new approach to drug design: since in vivo experiments are significantly more expensive, tests on a given ligand molecule should only be initiated with the highest reliability in order to achieve an economic optimum.

[0496] II. 7.) Utilization of the electrostatic patterns, defining the termination of alpha helix structures.

[0497] Several properties of the mechanism of protein folding are still unknown. Misfolding can be found in most disorders; however, there are no known algorithms that can predict such abnormalities with the help of in vitro experiments. Misfolded structures cannot yet be predicted in silico because it is not known why non-native structures appear. The physiological structure can be estimated in silico with AI-based protein structure prediction algorithms; however, the dependencies between the formed structure and the amino acid sequence are complex, thus there is no intervention point in protein structure design yet. In this section a model is demonstrated that quantifies the physical properties of a protein molecule in such a way that the termination of alpha helix structures becomes predictable, so that a design for protein structure editing can be established.

[0498] As mentioned before Bernoulli variables can be defined throughout the whole protein. These are pseudo quantitative descriptions of qualitative characteristics.

[0499] If two versions of the qualitative criterion are possible, or if it can be transformed into a two-way alternative, then it can be made numerical by setting one occurrence equal to the value 0 and the other occurrence equal to the value 1: 0 if the condition is not met, 1 if the condition is met. This is particularly useful in characterising systems with multiple states, like two states in the case of biological macromolecules like proteins. In this case the Bernoulli variable is the target variable, i.e. the biochemical feature value.

[0500] In general, it can be stated that by using Bernoulli variables, effects that are known, discovered and demonstrated but not exactly quantifiable can be included in the given regression model. Once the values of the dummy variable have been defined, the regression model is defined in the usual way. Such dummy variables can be e.g. the up and down state of a protein, the inhibited or active state of a protein, or the folded-unfolded state. With the help of the MLR method the relations between the ICDcorrected values and these Bernoulli variables can be calculated. The interpretation of these results can lead to targeted drug research.

[0501] For building our models the definition of helix termination has been investigated. This parameter will be called terminus hereinafter. The only available source for this is the structural models of the proteins determined with Cryo-EM. The definition is simple: if the amino acid is part of an alpha helix structure, the terminus parameter is zero; if the amino acid is the last amino acid of an alpha helix structure, this parameter is one. As presented before, there are quite large differences between the effects of each atom with respect to the amino acid type; that is why the datasets are filtered for amino acid type in each calculation. Also, the datasets are filtered for amino acids that are part of helix structures, so random coils and beta sheets are omitted from the calculation. First, the equation of terminus determination for Alanine is presented in Table 6, and the separation of the estimated target variable, called score value, is presented in Figure 12.
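The definition of the terminus parameter can be illustrated with a short Python sketch. It assumes that the helix membership of each amino acid is already available as a list of booleans in chain order (for example taken from the secondary structure annotation of the structural model); the function name `terminus_labels` and the use of `None` for omitted amino acids are assumptions made for illustration.

```python
def terminus_labels(in_helix):
    """Assign the terminus parameter to each amino acid of a chain.

    in_helix: list of booleans, True if the amino acid is part of an
              alpha helix, in chain order.
    Returns a list with 0 for amino acids inside a helix, 1 for the last
    amino acid of a helix, and None for amino acids outside helices
    (these are omitted from the calculation).
    """
    labels = []
    last_index = len(in_helix) - 1
    for i, helical in enumerate(in_helix):
        if not helical:
            labels.append(None)
            continue
        # A helix ends here if the chain ends or the next residue is not helical.
        ends_here = i == last_index or not in_helix[i + 1]
        labels.append(1 if ends_here else 0)
    return labels
```

Note that, under this simple representation, two helices that directly follow each other without a non-helical amino acid in between would be treated as one helix; a per-helix identifier would be needed to separate them.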

[0502] Table 6 shows the determined model for helix termination prediction for Alanine amino acids in 9 independent Cryo-EM measurements. Note that the beta carbon atoms of the neighboring amino acids (0_CB and 2_CB) are also part of the calculation, so Glycine neighbors are omitted. The descriptor also comprises the C, O and N atoms of the peptide bonds of the central amino acid and of the neighboring amino acids, as well as the alpha carbon of each amino acid (and the ICDcorrected values assigned thereto).

[0503] As the plot shows, a model can be built to precisely predict the terminus parameter in the case of alanine. The impact of this can be a huge advancement in drug design, because several disorders exist that are due to an abnormal helix termination. The development of these models depends only on the size of the input datasets and on the applied statistical methods, which can be various. The modification of the electrostatic parameters can be estimated with ab initio methods; thus the models can be utilized for in silico ligand screening and design.

[0504] As the results for alanine are impressive, the analysis of the remaining amino acids is needed. The most interesting results are from asparagine. This separation box plot is also shown in Figure 15.

[0505] As the plot clearly shows, the separation is still noticeable; however, it is not as perfect as in the case of alanine amino acids. The reason was investigated and might be an error in the structure determination rather than an error of the determined score value. In most cases when the calculated score value is high for a non-terminating amino acid, the hydrogen bonds have already dissociated, despite the apparent helical structure remaining intact (e.g. 7v7n chain C, 822N). In other cases, a sudden bend can be seen in the helical structure, probably two independent helices following each other (e.g. 7v7v chain A, 910-939). On the other hand, when the score value is low but the structure shows helix termination, the hydrogen bonds still exist but the structure is deformed and the pair of torsion angles is outside the region typical of alpha helices. Altogether, this method can be used for a more precise structure determination, especially when the torsion angles are on the margins of the characteristics of a given secondary structure.

[0506] Since separation seems to be possible with the MLR method, it is necessary to investigate the main differences between the predictor variables. For this, the data should be represented in what we call contribution plots. The contribution plots show the parts of the final estimation value depending on the grouping of the input parameters - in this case the ICDcorrected values. The part of the estimation curve that originates from the ICDcorrected values of the peptide chain of the protein is shown in Figure 16.

[0507] Note that a moving average (window size = 10) is applied to these curves to reduce the noise. As the plot shows, there is a drastic change in the trends of the peptide chain data. In group 1 the data of the following amino acid clearly have a lowering effect on the final estimation value because of the negative sign. This is consistent with the biochemical interpretation that the following amino acid is still in the helical structure. However, when the helix terminates, this value - along with the ASN main chain value - increases significantly. This electrostatic change in the molecule can be used as an indication of helix termination. In the uncertainty zone the values resemble those of the terminating group. In our investigation, examples from this group also show a clear uncertainty in the structural representation, meaning that the electrostatic information should be utilized more in order to obtain a better atomic model. An important feature of the plot is that the final value depends not only on the values of the following amino acid but also on the values of the ASN. This means that the terminating trends are already forming on the last piece of the helix, giving a signal or reaching a threshold value where it should end. This information holds potential for in silico molecule design that focuses on the local regulation of these electrostatic characteristics. However, this is only the repetitive peptide chain of the protein, whose electrostatic properties depend on the side chains as well. Since the following and previous amino acids are not filtered, their side chains are represented by the CB atoms only. As previously mentioned, neighbouring GLY amino acids, which have no CB atom, are omitted from the calculations for this reason. The side chain contributions are shown in Figure 17.

[0508] Note that a moving average (window size = 10) is applied to these curves to reduce the noise. Similarly to the peptide chain, the side chain atoms also show a significant difference between groups 0 and 1. The CB atom of the following amino acid shows only a slight decrease, which does not seem to be a determining factor in the estimation of the terminus. However, the previous CB atom and the sum of the ASN side chain atoms tend to be anticorrelated. This connection can be used to identify the defining moment of helix termination. Since the peptide chain data can be heavily modified by the side chain data, this observation suggests that helix termination is regulated one amino acid before the actual termination point. Based on this, we hypothesize that one amino acid is not enough to terminate a regular structure, and that the interplay of several amino acids is needed. Furthermore, if the rule of the regulation can be detected, target points may be identified at which an external regulation can modify the structure of a protein in a planned manner. This can be used in various misfolding diseases in which the protein malfunctions because of irregular alpha helix formation.
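The contribution plots and the smoothing described above can be illustrated with a short sketch. It is not the model of Table 6: the weights, the ICDcorrected values, the group assignment and the helper names `group_contributions` and `moving_average` are hypothetical, and the moving average is assumed to be taken over full windows only, without padding.

```python
def group_contributions(weights, values, groups):
    """Split one MLR estimate into per-group partial sums.

    weights, values: dicts keyed by descriptor name (e.g. '1_CA').
    groups: dict mapping a group label to the descriptor names it contains.
    """
    return {
        label: sum(weights[name] * values[name] for name in names)
        for label, names in groups.items()
    }


def moving_average(series, window=10):
    """Mean of every full window of consecutive points (no padding)."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    total = sum(series[:window])
    out = [total / window]
    for i in range(window, len(series)):
        total += series[i] - series[i - window]  # slide the window by one point
        out.append(total / window)
    return out


# Hypothetical weights and ICDcorrected values for one residue triplet.
weights = {"0_CA": 0.5, "1_CA": -1.0, "2_CA": 2.0, "1_CB": 0.25}
values = {"0_CA": 2.0, "1_CA": 1.0, "2_CA": 0.5, "1_CB": 4.0}
groups = {"main_chain": ["0_CA", "1_CA", "2_CA"], "side_chain": ["1_CB"]}

parts = group_contributions(weights, values, groups)
# The grouped parts add up to the full linear estimate (intercept omitted).
full = sum(weights[k] * values[k] for k in weights)
```

Plotting each entry of `parts` per amino acid, smoothed with `moving_average`, would give one curve per atom group in the sense of the contribution plots.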

[0509] The results of the calculations are also provided for an even bigger dataset, where a tree-based classification algorithm was used to predict the termination of helices on independent test sets.

[0510] II. 8.) Utilization of the electrostatic patterns, estimating the effect of side chain interactions.

[0511] As previously presented, proteins can be represented as a set of patterns of parameters that are proportional to physical electrostatic features and are obtained by given mathematical transformations of a Cryo-EM measurement. It has been shown that these patterns are not noise, since there is a significant autocorrelation in the data. This also means that the patterns carry the essential information on how the complex electrostatic state of a protein molecule is formed. Utilizing this information, it can be established that the patterns are unique to a native protein state and can be used as diagnostic features to measure relative differences between states. On further analysis of the patterns, differences can be detected between the effects of the different protein-building amino acids on pattern formation. Surprisingly, even a single mutation can result in a drastic change in the electrostatics of a protein. This shows the unexpected sensitivity of the electrostatic model of the protein to even minor changes, whereas traditional structure-type models are usually more conservative and tolerate even bigger changes. This indicates that the primary structure of a protein molecule must be very strictly determined. As for secondary structure formation, the previous paragraphs showed examples in which the termination of alpha helix structures is determined by the electrostatic patterns. Thus, the structure is ordered under the effect of the electrostatic features.

[0512] In this example it will be shown that our method is able not only to detect the characteristics of local structure formation but also to determine the existence of tertiary structures. The tertiary structure of a protein largely depends on the stabilizing interactions between the side chains of the amino acids. These interactions can be salt bridges, covalent bonds in the form of disulfide bridges, hydrogen bonds or other secondary bonds.

[0513] In this section, as an example, the effect of π-π interactions will be shown. The π-π interactions are significant interactions between delocalized electron clouds that are situated in a given orientation, often perpendicular to each other. Furthermore, several of these interactions can occur in the same place, which is called π stacking.

[0514] In order to calculate these stacking interactions we need to define a variable that contains the structural information of the stacking. In our calculations this variable is called deloc_neigh_1_CA and is defined on histidine amino acids. The ‘position’ of the delocalized electron cloud is defined by the weighted average of the coordinates of the nitrogen atoms of the imidazole group and the gamma carbon atom of the amino acid. Then the number of neighboring delocalized electron clouds within a given cut-off distance is determined. Besides the most commonly considered amino acids - HIS, PHE, TYR, TRP - we also considered the guanidine group of ARG, which is capable of forming such π stacking structures; this is a chemically justifiable choice. If there is more than one neighbor within a defined distance of 6.45 Å - meaning that a stack of at least 3 amino acids is highly probable - then the deloc_neigh_1_CA variable is 1, otherwise 0. In order to reduce the structural differences, we filtered our datasets to HIS appearing in beta sheet form only. Also, in order to have a larger amount of input data, we increased the dataset to 29 independent Cryo-EM structures containing SARS-CoV-2 spikes, NMDA receptors, TRPV proteins, CFTR proteins and many more. We estimated the defined variable with the help of the atomic ICDcorrected values, similarly to the helix termination. The linear equation - representing the weights of each atomic ICDcorrected value - can be seen in Table 7. Table 7 shows the determined model for tertiary structure prediction for histidine amino acids in 29 independent Cryo-EM measurements. Note that the beta carbon atoms of the histidine (1_CB) and of the neighboring amino acids (0_CB and 2_CB) are also part of the calculation, so glycine neighbors are omitted. CD2, CE1, CG, ND1 and NE2 denote the delta, epsilon and gamma carbons as well as the delta and epsilon nitrogens of the histidine, respectively.
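The definition of the stacking variable can be sketched as follows. This is only an illustration under stated assumptions: the text does not give the weights of the weighted average, so equal weights are assumed; the coordinates are invented; and the helper names `ring_center` and `deloc_neigh_flag` are hypothetical. Only the 6.45 Å cut-off and the "more than one neighbor" rule are taken from the description.

```python
import math


def ring_center(atom_coords, weights=None):
    """Weighted average position of the atoms representing one delocalized cloud."""
    if weights is None:
        weights = [1.0] * len(atom_coords)  # assumption: equal weights
    total = sum(weights)
    return tuple(
        sum(w * xyz[axis] for w, xyz in zip(weights, atom_coords)) / total
        for axis in range(3)
    )


def deloc_neigh_flag(center, other_centers, cutoff=6.45):
    """Return 1 if more than one other cloud lies within the cut-off (in Å), else 0."""
    neighbours = sum(1 for c in other_centers if math.dist(center, c) <= cutoff)
    return 1 if neighbours > 1 else 0


# Invented coordinates (Å) for the ND1, NE2 and CG atoms of one histidine.
his_atoms = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (1.0, 3.0, 0.0)]
center = ring_center(his_atoms)

# Invented centers of other delocalized clouds (HIS, PHE, TYR, TRP or ARG).
others = [(1.0, 1.0, 4.0), (1.0, 1.0, -5.0), (20.0, 0.0, 0.0)]
flag = deloc_neigh_flag(center, others)  # two clouds within 6.45 Å
```

The resulting 0/1 flag plays the role of the Bernoulli target that the linear model of Table 7 estimates from the ICDcorrected values.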

[0515] Using the presented equation, an estimate can be made for the given variable. Grouping the estimates according to the original variable, a boxplot can be drawn that presents the existing separation. Note that this separation by tertiary structure depends on the ICDcorrected patterns only, providing evidence that a real difference is detected between the electrostatic states of the different tertiary structures. This can also indicate that abnormal - not native - structural formations can disrupt the fine-tuned and sensitive electrostatic system of a protein molecule of interest. The boxplot is shown in Figure 18.

[0516] As can clearly be seen in the figure, the ICDcorrected patterns contain information on tertiary structure formation. The overlap is minimal, meaning that the input dataset shows unique electrostatic characteristics that depend on the π stacking. For further analysis the estimation needs to be broken down into contribution plots to investigate the change in the main chains and the side chains of the considered amino acids. Firstly, the contribution plot of the main chain is shown in Figure 19.

[0517] As the plot shows, there is a drastic difference between the two groups. First of all, the given HIS amino acids show a significant increase under the effect of the tertiary structure. In view of this and the previously shown results, the change in the effect of the preceding and following amino acid main chains is also logical. Since the pattern is different because of the existing tertiary structure, this influences the electrostatic pattern formation of the whole protein as well, meaning that the whole protein state will be detectably different depending on whether the given tertiary structure is formed or not. In other words, inspecting the fundamentals of the formation of a native electrostatic pattern, which can be used as a diagnostic value, could lead to sensitive in silico definition of molecular targets. Cystic fibrosis is a well-known human disorder that results from the misfolding of the CFTR chloride channel protein. In most cases the disease is caused by a PHE deletion. This PHE is part of a π stacking tertiary structure that also involves an ARG and a TYR. Without this amino acid the tertiary structure cannot be formed, and the protein is degraded by biochemical control processes. With our method it will be possible to detect these crucial points easily, and a new way of drug design would aim to restore the native electrostatic state by pre-screened ligands or planned gene modifications. To demonstrate that the change in the preceding and following amino acid main chains is not the effect of their variable side chains, the contribution plot of the side chains is shown in Figure 20.

[0518] As the plot shows and the weights resulting from the regression suggest, the effect of the neighboring amino acid side chains is minimal, and they can be considered constant terms, since only a slight decrease is noticeable. On the other hand, the change in the HIS side chains is large and individually significant enough to produce a good estimate of the tertiary structure. This corresponds to the structural knowledge that the π stacking interaction takes place on the side chain. In our interpretation of the results, this deviation in the side chain ICD values should influence the values of the main chain atoms, resulting in a domino effect on the whole main chain pattern. The reason for this is that the pattern of the whole protein contains long-range relations that cannot be neglected. This example provides further evidence that pattern formation offers a new concept in drug design for finding sensitive targets and screening protein states relative to a native electrostatic state. The calculations were also carried out on a bigger dataset, where a tree-based classification model was applied and significantly predicted the possible existence of π stacking interactions on independent test sets.

II. 9.) Utilization of the electrostatic patterns, defining the differences between codons, the effect of silent mutations.

[0519] As has already been presented, the definition of electrostatic patterns on a protein molecule at the molecular, atomic or subatomic level is a novel way to represent the biochemical characteristics of such a macromolecule. The generally comparable sets of physical parameters produced are applicable to determine diagnostic differences between protein states and to quantify the effect of primary, secondary, tertiary and quaternary structure formation. In this section, we demonstrate that these patterns enable the detection of significant differences in the electrostatic properties of the same amino acids, even when they originate from different coding triplets in DNA and RNA. Research in this field is still in its rudimentary stages, though biological evidence suggests that there may be real differences between proteins with the same primary structure but different genetic origins. For instance, the speed of translation depends on mRNA structure and intracellular tRNA concentrations, and protein folding can be influenced by the rate of translation. Furthermore, some genetic diseases, such as a rare form of cystic fibrosis, may arise from silent mutations.

[0520] In nature the genetic code is degenerate, meaning that multiple codons can encode the same amino acid and hence the same chemical properties. In order to analyze the electrostatic patterns and detect any significant effect, we need to complement our datasets with the exact genetic information. Although this seems to be a simple task, there are uncertainties between the databanks that should be taken into consideration when presenting the results. Since the protein databanks do not contain any information about the genetic code, there could be some mismatches between the amino acid and exact codon pairs that cannot be filtered out without knowing the exact gene that was used in the protein synthesis, but the ratio of mismatches is considered to be low.

[0521] The coding triplet type can be quantified by a score value, which we named triplet_point in our calculations. This score value is used as a Bernoulli variable for building MLR models in the ways shown before. However, as our investigation showed, the effect of the coding codon on the electrostatic state is a fine tuning and is not as localized as the parameters already presented. For this reason we need to gather ICDcorrected atomic values from the global environment of an amino acid. However, dramatically increasing the number of parameters could lead to finding false patterns that are related not to the triplet state itself but to something else, meaning that the produced fit might not contain meaningful information. Therefore, we may include additional testing phases before the evaluation.

[0522] Firstly, we define the phase of parameter grouping. Every chiral amino acid has a side chain and a main chain. The side chain is unique to a certain amino acid and can be generally grouped - we name these parameters Side_chain_n, where n is a defining number for a certain type of atom. Every amino acid has the same main chain atoms (CA, C, N, O), which can be considered as another group along the main chain - we name these Main_chain_n, where n is a defining number for a certain type of atom of a certain neighbour along the main chain.

[0523] Along the main chain we can also gather side chain atoms; however, in this case the atom types are very likely to differ for each environment - we name these Aspecific_side_chain_n, where n is a number defining the atoms in order of a nominal distance from the given amino acid (meaning a numbering of the neighbors along the peptide chain in both the N-terminal and C-terminal directions).

[0524] Besides these atomic groups, a fourth group of atoms can be defined depending on the exact spatial distance from the center of the given amino acid - this group is named Spatial_n, where n is a number defining the atoms in order of the exact distance from the CA atom of the given amino acid.

[0525] To clarify this gathering, an example is shown for a serine amino acid. The side chain of SER consists of 2 non-hydrogen atoms, the beta carbon atom and the alcoholic oxygen atom. Their ICDcorrected parameters are assigned to Side_chain_0 and Side_chain_1. Along the peptide chain we define 10 neighbors in each of the N-terminal and C-terminal directions, resulting in a small protein chain of 21 amino acids: the neighbors and the SER itself. From the N-terminal to the C-terminal we number the main chain atoms - in alphabetical order - resulting in a set of parameters from Main_chain_0 to Main_chain_83 (21 amino acids with 4 main chain atoms each), where each parameter represents the ICDcorrected value of the given atom.

[0526] For the side chain data along the peptide chain, exactly 25 atoms are gathered in order of the previously mentioned nominal distance, and in alphabetical order within a given nominal distance, resulting in a set of parameters from Aspecific_side_chain_0 to Aspecific_side_chain_24. Lastly, the Spatial parameters are defined by calculating the distance between every atom and the alpha carbon atom of the given SER. In ascending order of the distances, the values are assigned to the group of parameters from Spatial_0 to Spatial_149.
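The naming scheme of the Main_chain and Spatial parameters can be sketched as follows. The function names `main_chain_names` and `spatial_names`, the atom identifiers and the coordinates are hypothetical. Whether the CA atom itself (distance 0) is included in the Spatial group is not stated in the text, so that choice is left to the caller, who decides which atoms are passed in.

```python
import math

MAIN_CHAIN_ATOMS = ["C", "CA", "N", "O"]  # main chain atoms in alphabetical order


def main_chain_names(n_neighbours=10):
    """Map Main_chain_k -> (residue offset, atom name), from N- to C-terminal."""
    names = {}
    k = 0
    for offset in range(-n_neighbours, n_neighbours + 1):
        for atom in MAIN_CHAIN_ATOMS:
            names[f"Main_chain_{k}"] = (offset, atom)
            k += 1
    return names


def spatial_names(ca_xyz, atoms, n_spatial=150):
    """Map Spatial_k -> atom id for the n_spatial atoms closest to the CA atom.

    atoms: dict mapping an atom id to its (x, y, z) coordinates.
    """
    ranked = sorted(atoms, key=lambda a: math.dist(ca_xyz, atoms[a]))
    return {f"Spatial_{k}": a for k, a in enumerate(ranked[:n_spatial])}


names = main_chain_names()  # 21 residues x 4 atoms = 84 parameters

# Invented coordinates for three atoms around a CA placed at the origin.
demo_atoms = {"a": (3.0, 0.0, 0.0), "b": (1.0, 0.0, 0.0), "c": (2.0, 0.0, 0.0)}
closest = spatial_names((0.0, 0.0, 0.0), demo_atoms, n_spatial=2)
```

With the default of 10 neighbors on each side, the last main chain parameter is Main_chain_83, in agreement with the serine example.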

[0527] After defining this pattern from the global environment of the given amino acid, parameters that are identical are dropped so that each remains in the pattern only once. This is a crucial step to maintain the non-singularity of the calculated X matrix, which contains these variables for each amino acid. Since the presented way of gathering the parameters is a simple and general method, it is likely to yield the same parameter under two different names, so dropping the duplicates is mandatory. After this the pattern is finally formed in a way that contains the information needed to distinguish between triplets. Such a pattern for a given SER amino acid in a SARS-CoV-2 spike protein is shown in Figure 21.
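The dropping of duplicated parameters can be sketched as below. The reasoning is that two columns of the X matrix that hold the value of the same atom are identical, which makes the matrix rank deficient. The atom identifiers (chain, residue number, atom name), the example pattern and the rule of keeping the first name under which an atom appears are assumptions made for illustration.

```python
def drop_duplicate_parameters(pattern):
    """Keep each atom once, under the first parameter name that refers to it.

    pattern: dict mapping a parameter name to an atom identifier,
    listed in the order in which the parameters were gathered.
    """
    seen = set()
    kept = {}
    for name, atom_id in pattern.items():
        if atom_id in seen:
            continue  # same atom already present under another name
        seen.add(atom_id)
        kept[name] = atom_id
    return kept


# Hypothetical pattern: the CA and CB atoms of residue 50 are gathered twice.
pattern = {
    "Side_chain_0": ("A", 50, "CB"),
    "Main_chain_41": ("A", 50, "CA"),
    "Spatial_0": ("A", 50, "CA"),
    "Spatial_1": ("A", 50, "CB"),
    "Spatial_2": ("A", 51, "N"),
}
unique_pattern = drop_duplicate_parameters(pattern)
```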

[0528] The figure shows a triplet-specific pattern of an amino acid. With hundreds of such patterns, model building is possible with the MLR method. In our examples we built the models on 34 independent Cryo-EM records, called the learning dataset. During model building the data need to be filtered to a given amino acid and to a given pair of triplets. Care must also be taken to introduce as little bias into the regression step as possible; thus in our calculations the two triplets of the input pair are represented by exactly the same number of amino acid neighborhoods.
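The balancing of the two codon groups can be sketched as follows. The text does not state how the omitted amino acids are selected, so random downsampling of the larger group with a fixed seed is an assumption, and `balance_pair` is a hypothetical helper name.

```python
import random


def balance_pair(samples_a, samples_b, seed=0):
    """Downsample the larger group so that both codons are equally represented."""
    rng = random.Random(seed)
    n = min(len(samples_a), len(samples_b))
    return rng.sample(samples_a, n), rng.sample(samples_b, n)


# Hypothetical pattern indices for two codons of the same amino acid.
tcc_patterns = list(range(7))
tct_patterns = list(range(100, 103))
tcc_balanced, tct_balanced = balance_pair(tcc_patterns, tct_patterns)
```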

[0529] In this step the triplet_point variable, as a biochemical feature value, takes the binary value 0 or 1 depending on which codon of the pair of triplets codes the given amino acid. After the fit the model is likely to contain information that is unique to the input proteins; thus testing is needed on an independent set of Cryo-EM records. However, the significant parameters in the defined set are not known, so an iterative model optimization is needed before the testing step.

[0530] In our calculation this is performed on a set of proteins originating from 7 Cryo-EM records (optimization dataset) that are clearly distinct from the 34 records used for the initial fit. In the iterations we test the triplet separation calculated by our model on the optimization dataset until a certain condition on the separation is reached. The optimization consists of dropping a variable from the model and calculating a new fit on the remaining variables; if the new fit performs better on the optimization dataset, the variable is permanently omitted from the calculation. These steps are repeated until the condition is reached, which in our calculations is a significant difference between the expected values of the distributions of the estimates for the two triplet_point values. We calculate this with Welch's t-test.
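The iterative optimization can be sketched in a simplified form as below. Several points are assumptions rather than the procedure actually used: the data are synthetic, the least-squares fit is solved through the normal equations, and the stopping condition is a hypothetical threshold `t_target` on Welch's t statistic instead of a p value (a p value would require the t distribution, for example via `scipy.stats.ttest_ind` with `equal_var=False`).

```python
import random
from statistics import mean, variance


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_mlr(rows, y, cols):
    """Least-squares weights (intercept first) using only the columns in cols."""
    design = [[1.0] + [row[c] for c in cols] for row in rows]
    k = len(cols) + 1
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


def welch_t(a, b):
    """Welch's t statistic for two samples with unequal variances."""
    return (mean(a) - mean(b)) / (variance(a) / len(a) + variance(b) / len(b)) ** 0.5


def separation(rows, y, cols, w):
    """Welch's t between the estimates of the two target groups."""
    est = [w[0] + sum(wi * row[c] for wi, c in zip(w[1:], cols)) for row in rows]
    ones = [e for e, t in zip(est, y) if t == 1]
    zeros = [e for e, t in zip(est, y) if t == 0]
    return welch_t(ones, zeros)


def optimize_model(train, y_train, opt, y_opt, t_target=2.0):
    """Drop variables one at a time while the optimization set separation improves."""
    cols = list(range(len(train[0])))
    best = separation(opt, y_opt, cols, fit_mlr(train, y_train, cols))
    improved = True
    while best < t_target and improved and len(cols) > 1:
        improved = False
        for c in list(cols):
            trial = [k for k in cols if k != c]
            score = separation(opt, y_opt, trial, fit_mlr(train, y_train, trial))
            if score > best:  # refit without c performs better: omit c for good
                cols, best, improved = trial, score, True
                break
    return cols, best


def make_data(n, rng):
    """Synthetic data: column 0 carries the signal, columns 1 and 2 are noise."""
    y = [i % 2 for i in range(n)]
    rows = [[t + rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)] for t in y]
    return rows, y


rng = random.Random(1)
train, y_train = make_data(60, rng)
opt, y_opt = make_data(30, rng)
kept, score = optimize_model(train, y_train, opt, y_opt)
```

A model accepted in this way would then be checked once more on a further, independent group of records, as described in the following paragraph.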

[0531] Finally, when this condition is met, we further test the model on a third group of independent Cryo-EM records, and if the condition is met on these electrostatic patterns as well, we consider our model to be predictive of the coding codon by utilizing the information of the electrostatic patterns. An example is shown in Figure 22 for a model aiming to distinguish between the codons TCC and TCT, both coding for the SER amino acid.

[0532] With the help of the weights, the difference between the two coding triplets can be calculated on an independent testing dataset containing 4 independent Cryo-EM records. The exact PDB IDs of the Cryo-EM records and their grouping are presented in Table 8.

[0533] Table 8 shows the Cryo-EM records used for model building and testing, identified by PDB IDs.

[0534] The distributions of the estimates of the TCC or TCT codon state for the 3 groups of records are shown as boxplots in Figure 23, with the p value of Welch's t-test on the distributions.

[0535] The results show that it is possible to distinguish between the electrostatic states of amino acids originating from different codons. Since the model performs well on the independent testing dataset as well, predictions can be made for the electrostatic properties of the global environment of an amino acid. Such findings can be used in molecular dynamics modelling or in synthetic antibody design. With the help of our invention, rare diseases caused by synonymous mutations can be subjected to electrostatic analysis, and the pathological differences can be established in reference to a native protein. In vitro experiments could also be carried out to determine maximal conversion rates in industrial biochemical syntheses if a synonymous mutation occurs at the active site of a protein.

[0536] In summary, in the previous paragraphs it was shown, using our methodology with Cryo-EM records, that a generally comparable set of physical parameters can represent a protein molecule of interest. These parameters can be arranged to be utilized as patterns defining electrostatic states. A summary of the uses for various biochemical features is given in the Detailed description of the invention. Specifically regarding codon usage, we established models predicting significant differences between the global electrostatic states of amino acids depending on the coding triplet from which the given amino acid originates. This means that, although the genetic code may seem degenerate with respect to the encoded chemical diversity of the amino acids, physically the codons exert different influences on the electrostatic state of a protein, meaning that a silent mutation can be considered a fine impulse on the effectiveness of a protein.

[0537] Twelve further results for predictive triplet separation on the 3 groups of records are shown in Table 9, with the p values obtained from Welch's t-test.

[0538] Table 9 shows the results of the predictive models built on the triplet separation. The triplet pairs column defines the separated codons. The Input dataset columns define the number of data points used in the calculations (note that the learning dataset in this extract includes the amino acids that were omitted from the calculations in order to guarantee an equal number of amino acids representing each codon; thus it can be an odd number). The difference of the means is significant at the significance level given by the p value of Welch's t-test. In this extract alpha = 0.1 was taken as the standard.

[0539] In another example with an even more diverse dataset, we showed that for each amino acid a significant codon classification can be developed with XGBoost model ensembles. In order to further improve the predictions, we also developed models on the structure dataset, and it was shown that combining these structure data with the electrostatic descriptors improves the accuracies, meaning that the electrostatic data carry additional information that cannot be seen in the structure. Cryo-EM data outperformed XRD data for the amino acids that are coded by two different codons. The results can be seen in Figure 24.

[0540] II. 10.) Estimation of the secondary structures based on electrostatic predictors

As has been presented, a biochemical macromolecule - in this example specifically a protein - can be represented by electrostatic parameters that are generally comparable and have diagnostic value. Utilizing this representation as sets of parameters, mechanistic machine learning models can be calculated using, for instance, the MLR method. The models in this section are intended to estimate Bernoulli variables that are quantified forms of qualitative biological features. Such a feature can be the secondary structural state of an amino acid. This is a very similar approach to the terminus estimation.

[0541] The secondary structures of proteins are based on the hydrogen bond characteristics of the main chain. The two largest groups of these structures are alpha helices and beta sheets. In this section every other structural form is grouped as random coil structures. In data manipulation and model building a pairwise method is used, meaning that the result is three different multilinear models per amino acid, estimating the secondary structure to be Helix / Sheet, Helix / Random coil or Sheet / Random coil.

[0542] It is also possible to build a mechanistic machine learning classification model with more than two target classes, thus classifying the secondary structure state as Helix / Sheet / Random coil. In the demonstrations the ICDcorrected parameter will be used, and the descriptor will consist of the peptide chain atoms of a defined central amino acid and of its preceding and following amino acids. In order to present the natural error of the utilized algorithm, the very same descriptors will be used in an independent model building in which the target is a randomized / shuffled Bernoulli variable. This is a way to present the average effect size and its standard deviation for a given set of multiple descriptors.

[0543] In the first model, we investigated the estimation of the secondary structure as Helix or Sheet on given peptide segments. The descriptors were filtered to a set that contains equal numbers of Helix and Sheet secondary structural forms of the central amino acid valine. The data were gathered from more than 30 independent Cryo-EM experimental ESP maps and structural models, covering various types of proteins. The presented example was chosen for the number of individual descriptors in the set, which is exactly 1952 valine amino acids and their direct peptide neighbors. The randomized models were calculated on the same descriptor set, with exactly the same ICDcorrected parameters but with a shuffled target score value. In total 200 randomized models were calculated, one of which was randomly chosen for the presentation of the results.

[0544] The result shows that the separation is better on the real value, and taking into consideration the interval of the average Cohen’s d value of the randomized models, this is direct evidence that the ICDcorrected parameters contain information about the secondary structural form. The results of all the amino acids are presented together in Figure 25.
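The comparison between the real target and the shuffled targets can be sketched as follows. This is a deliberately reduced illustration: a single invented descriptor and a one-variable least-squares fit stand in for the multilinear models, and the names `cohens_d`, `fitted_effect` and `shuffled_baseline` are hypothetical. Only the use of Cohen's d and of 200 randomized models follows the description.

```python
import math
import random
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d of two samples, using the pooled standard deviation."""
    pooled = ((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)) / (
        len(a) + len(b) - 2
    )
    return (mean(a) - mean(b)) / math.sqrt(pooled) if pooled > 0 else 0.0


def fitted_effect(x, labels):
    """Fit labels ~ a + b*x by least squares, then Cohen's d of the estimates."""
    mx, my = mean(x), mean(labels)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, labels)) / sxx
    est = [my + slope * (xi - mx) for xi in x]
    ones = [e for e, t in zip(est, labels) if t == 1]
    zeros = [e for e, t in zip(est, labels) if t == 0]
    return cohens_d(ones, zeros)


def shuffled_baseline(x, labels, n_models=200, seed=0):
    """Mean and standard deviation of the effect size over shuffled targets."""
    rng = random.Random(seed)
    effects = []
    for _ in range(n_models):
        shuffled = labels[:]
        rng.shuffle(shuffled)  # breaks the link between descriptor and target
        effects.append(fitted_effect(x, shuffled))
    return mean(effects), math.sqrt(variance(effects))


# Invented descriptor values: 20 samples per class, clearly separated.
x = [0.5 * math.sin(i) for i in range(1, 21)] + [
    3.0 + 0.5 * math.cos(i) for i in range(1, 21)
]
labels = [0] * 20 + [1] * 20

real_effect = fitted_effect(x, labels)
base_mean, base_sd = shuffled_baseline(x, labels)
```

Because the model is refitted for every shuffled target, the baseline effect size is not zero; it reflects how much separation the fitting procedure produces by chance, which is the quantity the real effect is compared against.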

[0545] As the plot shows, the estimation of the secondary structural form as Helix or Sheet was successful. For every amino acid the models built on the biological target score perform better than those built on its shuffled version. This means that there is a real connection between the descriptors and the biological feature, which in this case are the ICD values of a small peptide chain segment and the secondary structure - and, furthermore, the Ramachandran angles. This suggests that our invention can be utilized in structural model building and secondary structure determination. Also, analyzing the calculated multilinear models may lead to a further understanding of protein molecules and their electrostatic characteristics.

[0546] Similarly to the first model, we investigated the estimation of the secondary structure as Helix or Random coil on given peptide segments. The descriptors were filtered to a set that contains equal numbers of Helix and Random coil secondary structural forms of the central amino acid glycine. The data were gathered from more than 30 independent Cryo-EM experimental ESP maps and structural models, covering various types of proteins. The presented example was chosen for the number of individual descriptors in the set, which is exactly 2224 glycine amino acids and their direct peptide neighbors. The randomized models were calculated on the same descriptor set, with exactly the same ICD parameters but with a shuffled target score value. In total 200 randomized models were calculated, one of which was randomly chosen for the presentation of the results. The results show that the separation is better on the real value, also when the interval of the average Cohen’s d value of the randomized models is taken into consideration. The results of all the amino acids are presented together in Figure 26. As with the previous models, the conclusion is that the ICD-based descriptors do carry information about the structural forms.

[0547] Lastly, we investigated the estimation of the secondary structure as Sheet or Random coil on given peptide segments. The descriptors were filtered to a set that contains equal numbers of Sheet and Random coil secondary structural forms of the central amino acid leucine. The data were gathered from more than 30 independent Cryo-EM experimental ESP maps and structural models, covering various types of proteins. The presented example was chosen for the number of individual descriptors in the set, which is exactly 1686 leucine amino acids and their direct peptide neighbors. The randomized models were calculated on the same descriptor set, with exactly the same ICDcorrected parameters but with a shuffled target score value. In total 200 randomized models were calculated, one of which was randomly chosen for the presentation of the results.

[0548] The result shows that the separation is better on the real value, also when the interval of the average Cohen’s d value of the randomized models is taken into consideration. The results of all the amino acids are presented together in Figure 27.

[0549] As with the previous models, the conclusion is that the ICDcorrected-based descriptors do carry information about the structural forms.

[0550] To sum up, descriptors originating from Cryo-EM experimental ESP maps and atomic structure models can be utilized to determine quantified versions of qualitative protein features such as Ramachandran angle intervals. This approach is applicable not only in structure model determination but also in Ramachandran plot analysis. The electrostatic representation of a protein is not only a diagnostic descriptor for the protein state but is also diagnostic of its secondary structure. In view of these results, we can say that the presented helix termination examples - regardless of the smaller number of utilized descriptors - are meaningful and valuable.

[0551] The results of a multiclass classification model based on a bigger dataset are also presented. The setup was similar to the aforementioned methods, but we utilized a tree-based classification method. The results are significant.

III. 1.) Utilizing XRD electron density values

[0552] Data preparation

[0553] The electrostatic representation of a protein by descriptors that have the potential to describe fine biological function is available not only from the experimental Cryo-EM method but also from experimental X-ray diffraction (XRD) techniques. As opposed to the Cryo-EM data, XRD methods result in experimental electron density maps (EDM), which contain no information about the charge of the nuclei but about the electrons only. There is another big difference between the two methods. The Cryo-EM technique works in a physiological solution flash-frozen by a cryogen, thus maintaining the native structure. In the case of crystallography a regular crystal lattice needs to be formed, in which the cells are formed by protein molecules. Although the structure is considered to be native in XRD, there could be some deviations and errors that result in a misfolded protein, meaning that the results of further structural analysis should be closely monitored.

[0554] Determining the electron density distribution of the protein crystal is the key step in structure reconstruction from XRD measurements. Electron density and electrostatic potential are non-linearly related parameters, with the difference that Cryo-EM measurements also contain the effect of the positive nuclei. Since the effect of the nuclei can be considered a neutral constant background bias, it can be disregarded. As in the previously presented methods, an electron density value can be assigned to each individual atom from the distributions obtained in the experimental measurements. The electron density data are arranged on a discrete grid, and each value belongs to a given voxel of the grid. In most of the illustrative examples shown herein we initially used 125 XRD measurements, all having cube-shaped voxels; this was for the sake of convenience only. In some examples we used a bigger dataset of thousands of XRD measurements.

[0555] The EDM of most proteins recorded in the PDB is available in online databases. The XRD method yields an observed EDM, called FO, and several fitting methods are used during the structure building process. In most cases there is also a calculated EDM, called FC, which is considered to represent the electron density of the predicted preliminary fit. These are in silico electron density data obtained from fitting.

[0556] An XRD experiment (Jelsch et al., 2023) produces an EDM for the average unit cell of the protein crystal. The asymmetric unit is the smallest portion of a crystal that provides a complete solution. In our approach, when the asymmetric unit is duplicated and moved by crystal symmetry operations, the unit cell of the crystal can be generated, and from the unit cell large portions of the crystal can be built. The amino acid sequence of the crystallized polymer(s) is known in advance. The crystallographer fits the atoms of the known molecules into the EDM and refines the model and the map. As with the Cryo-EM measurements, the resolution in the case of a crystal is limited by the level of order or disorder in the crystal. In the PDB a model of the asymmetric unit of the crystal is deposited, along with the experimental diffraction data (intensities of the X-ray reflection spots, or "structure factor amplitudes"). From these combined, the EDM can be reconstructed.

[0557] The electron density values fill the 3-dimensional space of the map. These values are fitted to a discrete grid, and each value is assigned to a voxel of given size, shape and coordinates. Here the center of the grid is considered as the (0; 0; 0) coordinate.

[0558] The standard deviation of all density values, σ, is used to describe the strength of features in a map. For example, a "3σ peak" is a feature that has a density value higher than three σ over the average value, which typically is set to zero.

[0559] The "all features" map, 2FO-FC, normally considered as the EDS map, is the best way to calculate an estimate of the true electron density from diffraction data and atomic model. (It is called 2FO-FC because the calculation involves combining the observed diffraction data, FO, with the diffraction data calculated from the atomic model, FC, in a way that gives the least-biased result).

[0560] FO-FC is the "EDS difference map". It shows where the experimental density and the atomic model disagree. If the atomic model fitted the experimental density perfectly (and there was no experimental noise), the difference map would have no densities. Typically, it has negative densities (atoms in the model where there is no electron density, isomesh conventionally colored red) and positive densities (electron density where there is no atom in the model, isomesh typically colored blue or green) (Lamb et al., 2015).

[0561] The 2FO-FC and FO-FC maps are produced from the two distributions for the further steps of structure building, and these are the maps that are published in databases. Since our invention contains a step of aligning the electrostatic data to an existing structural model, the available maps are transformed back to the original observed EDM by a simple subtraction, defined in the following equation.

[0562] (2FO - FC) - (FO - FC) = 2FO - FO - FC + FC = FO

[0563] (Equation 21)
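As a minimal sketch of Equation 21, the observed map can be recovered voxel by voxel from the two published maps. The function name `recover_fo` and the flat-list representation of the maps are assumptions made for illustration only; both maps are assumed to be sampled on the same grid and to be on the same scale.

```python
def recover_fo(two_fo_fc, fo_fc):
    """Recover the observed map FO voxel-wise: (2FO - FC) - (FO - FC) = FO.

    Both inputs are flat sequences of density values sampled on the same
    grid (an assumption of this sketch).
    """
    if len(two_fo_fc) != len(fo_fc):
        raise ValueError("maps must be sampled on the same grid")
    return [a - b for a, b in zip(two_fo_fc, fo_fc)]


# Toy values: FO and FC are chosen by hand, then the two published maps
# are built from them and FO is recovered by subtraction.
fo = [1.0, 0.5, -0.2]
fc = [0.8, 0.7, 0.0]
two_fo_fc = [2 * o - c for o, c in zip(fo, fc)]
fo_fc = [o - c for o, c in zip(fo, fc)]
recovered = recover_fo(two_fo_fc, fo_fc)
```

The recovered values agree with the hand-chosen FO values up to floating-point rounding.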

[0564] The FO values can then be assigned to the structural information, preferably to atomic positions, by 3D linear interpolation, just as with the Cryo-EM data. Since the chemical additives used in the crystallization process vary widely and the whole EDM scale depends on the average electrostatic properties of these compounds, the atomic FO data should be scaled for each dataset with the following equation (similarly to Cryo-EM scaling):

[0565] $$FO' = \frac{FO - \overline{FO}}{S_{FO}}$$

[0566] (Equation 22)

[0567] Here $FO'$ is the relative electron density value, $FO$ represents the individual FO values, $\overline{FO}$ is the average of the FO values and $S_{FO}$ is their standard deviation. The background is set to zero in the case of XRD calculations. After scaling, the dataset is trimmed above 5 and below -5.
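The scaling and trimming step can be sketched as follows. This is an illustration under stated assumptions rather than the exact procedure: the name `scale_fo` is hypothetical, the sample standard deviation is used, and "trimming" is interpreted as discarding (not clipping) scaled values outside the range from -5 to 5.

```python
import statistics


def scale_fo(fo_values, limit=5.0):
    """Scale atomic FO values as in Equation 22 and trim outliers.

    Returns the relative values FO' = (FO - mean) / std, keeping only those
    within [-limit, +limit]. Using the sample standard deviation and dropping
    (rather than clipping) out-of-range values are assumptions of this sketch.
    """
    mean = statistics.fmean(fo_values)
    std = statistics.stdev(fo_values)
    if std == 0:
        raise ValueError("all FO values are identical; cannot scale")
    scaled = [(v - mean) / std for v in fo_values]
    return [s for s in scaled if -limit <= s <= limit]


# Toy atomic FO values for one dataset.
scaled = scale_fo([0.1, 0.4, 0.35, 0.2, 0.9])
```

After scaling, the values are expressed in multiples of the standard deviation around a mean of zero, which is what makes different datasets comparable.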

[0568] The resulting values are relative values: they express the electron density relative to the average electron density of the protein, in multiples (folds) of the standard deviation.

[0569] This type of scaling is analogous to that used with Cryo-EM measurements. These steps can be carried out with the 2FO-FC values as well, indicated as xrd_density in specific examples.

[0570] Another similarity to the Cryo-EM method is that, due to local errors, thermal movements and blurring effects, the amplitudes of the FO values show a connection with a resolution factor, which is called the B-factor in crystallography. The dependency of the electron density on the B-factors is analogous to the dependency of ICD on the resolution (σ or s in Cryo-EM). This connection affects global comparability, although the effect is considered to be significantly smaller than that of local resolution in Cryo-EM.

[0571] This effect can be minimized with the presented correction method, or the data can be filtered for subgroups in which the dependency is undetectable. In our experiments and calculations the effect is considered to be negligible; however, to be consistent we used only the atomic values with a low B-factor (lower than 25). Thereby the FO density values with B-factors higher than 25 are discarded. The dependency of the FO density values on the associated B-factor values is presented in Figure 28 using a sample of 125 experimental EDMs and PDB structures. The utilized subgroup is also presented in Figure 28. This is analogous to the Cryo-EM step in which only the values with the best resolution are considered.

[0572] Autocorrelation analysis

[0573] When the atomic fo_density values are analyzed along the peptide chain of proteins, autocorrelation can be detected, similarly to the Cryo-EM values. This means that the formation of the definitive, fingerprint-like patterns is not random but is influenced over longer distances along the molecule. XRD is known to have atomic resolution, and its electron density values are clearly unique to each type of atom, e.g. they are significantly different for an alpha carbon and a carbonyl carbon atom. This result therefore affirms the hypothesis made with the Cryo-EM data that the electrostatic patterns of a protein molecule are not random but have an information content and follow a strict and quantitative rule. In Figure 29 the autocorrelation of CA atomic fo_density values is shown for an alcohol dehydrogenase protein (PDB: 1ZK0, resolution: 1.55 Å, Structure of Lactobacillus brevis R-specific alcohol dehydrogenase (mutant G37D) from Lactobacillus brevis in complex with phenylethanol and NADH). It is to be noted that this protein is merely an example and that this feature is detectable for any protein type with a defined 3D structure. The calculation is carried out in the same way as for the ICD value in the case of Cryo-EM measurements, given above in the chapter describing Cryo-EM data processing. On the plot, the autocorrelation on the carbonyl carbon atoms is shown as a function of the lag values.
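To illustrate what an autocorrelation along the peptide chain means numerically, the following sketch computes the sample autocorrelation of a sequence of per-residue values for a range of lags. The function name and the choice of the common biased estimator are assumptions; the text does not specify which estimator was used.

```python
import statistics


def autocorrelation(values, max_lag):
    """Return the sample autocorrelation of a sequence for lags 0..max_lag.

    r(k) = sum_i (x_i - m)(x_{i+k} - m) / sum_i (x_i - m)^2, the usual
    biased estimator; whether the source uses this exact estimator is an
    assumption of this sketch.
    """
    n = len(values)
    if not 0 <= max_lag < n:
        raise ValueError("max_lag must be between 0 and len(values) - 1")
    mean = statistics.fmean(values)
    centred = [v - mean for v in values]
    denom = sum(c * c for c in centred)
    if denom == 0:
        raise ValueError("constant series has no defined autocorrelation")
    return [
        sum(centred[i] * centred[i + k] for i in range(n - k)) / denom
        for k in range(max_lag + 1)
    ]


# Toy per-residue values that alternate along the chain.
acf = autocorrelation([1, -1, 1, -1, 1, -1, 1, -1], max_lag=2)
```

For this alternating toy series the autocorrelation is strongly negative at lag 1 and positive at lag 2, which is the kind of non-random, lag-dependent structure the plot is meant to reveal.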

[0574] Analyzing the goodness of interpolation and extrapolation in fo_density estimations.

[0575] Since the autocorrelation analysis gives a good and expected result, multiple linear regression models for fo_density estimation can be calculated. As in the Cryo-EM analysis, the investigations concentrate on both non-filtered and filtered amino acid datasets. In the experiments below both interpolation and extrapolation methods are used.

[0576] In the models a linear regression model was used to obtain exemplary descriptors for a given partial space within the protein molecule. In an example, the target variables, which were calculated in the experiments and which are to be defined with the data of the descriptor, are typically inside the descriptor (and thus the partial space or “box”) when interpolation is applied and outside the descriptor (and the partial space or “box”) when an extrapolation experiment is carried out.

[0577] The “partial space” or “(imaginary) box” may also be a subset of data useful to obtain a descriptor, like a network or “mesh” of data, or a set of partial spaces; e.g. non-significant, unimportant or superfluous parameters may be discarded. However, the data of the set should be assumed to be inter-related or to have an effect on one another. In general the subset of data used to obtain a descriptor may relate to a “crop” or “map crop”.

[0578] As a first example, the estimation of the carbonyl carbon atomic fo_density value is presented without any filter on the amino acid types. The descriptor is made up of (i.e. formed by) the peptide chain of two consecutive amino acids. In interpolation the target carbonyl carbon atom is inside the imaginary box, meaning that the exact experimental fo_density values of its bond partners are known. During extrapolation the target is outside this box, and the goodness of estimation is considered to be proportional to the general electrostatic effects of the descriptor. In the presented calculations the furthest extrapolation distance is 51 covalent bonds. We consider a correlation coefficient above 0.5 to be meaningful and usable. The result of interpolation, the result of the furthest extrapolation and the change of the correlation coefficients with the number of bonds are shown in Figures 30, 31 and 32.

[0579] These calculations were performed on carbonyl carbon atom targets. The same results would be obtained for other atoms of the amino acid residues.

[0580] As the presented figures show, the estimations are meaningful and there is a long-range influence on the formation of electrostatic characteristics, just as with Cryo-EM data. Although in this example the amino acid side chain is not included in the descriptor, since it is unique for each type of protein-building amino acid, the effect of the individual type can be estimated by filtering the data. The peptide chain is considered to represent the side chain as well, and by filtering the descriptors the effect should be sharper and unique. To demonstrate this, extrapolations were used in which the descriptor contains the peptide atomic fo_density data of the given amino acid and of the preceding amino acid.

[0581] The target is the atomic fo_density data of the carbonyl carbon atom of the following amino acid, which is exactly 3 covalent bonds away from the imaginary box of the formed descriptor. The resulting correlation coefficients are above 0.7, ranging from 0.722 (VAL) to 0.834 (GLY), which meets the set requirements.

[0582] In order to compare the effect of the amino acids and to analyze whether they have a unique influence on the pattern formation, we present a classification model for amino acid type classification based on the calculated weights. Independent descriptors were used for the train and test data. The accuracy score of 0.96 shows that the connections between the xrd_density parameters, computed on a dataset containing thousands of PDB entries, are definitive for the amino acids. The classification report is shown in Figure 33. The classification report underscores that utilizing electrostatic properties for the determination of unique patterns introduced by an amino acid type is clearly better than utilizing the Ramachandran angles.

[0583] The calculated weights originate from multiple one-variable linear regression models. This choice was preferred since multicollinearity can destabilize the coefficients. We utilized 21-long peptide segments from a big experimental protein population, and we used the xrd_density parameters. According to the middle amino acid (11th index), the descriptors can be separated into multiple subgroups by random or stratified sampling. In each equation the predictor variable was one backbone atom of the central amino acid, and the dependent variable was one backbone atom of another amino acid. Altogether we calculated 4 (number of backbone atoms of the 11th amino acid) * 20 (number of remaining amino acids) * 4 (number of backbone atoms of the target amino acid) * 2 (one intercept and one coefficient value) different coefficients on multiple splits. We did the calculations on a big population, on a stratified sample of descriptors based on a given amino acid, and on a random sample with the same number of descriptors as the stratified sample. The big population was itself a sample of an even bigger population, so multiple splits could be calculated.
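The pairwise one-variable regressions described above can be sketched as follows. This is an illustrative reading, not the actual implementation: the data layout (a list of 21-residue segments, each residue a dictionary of backbone atom values), the backbone atom names N, CA, C and O, and the function names are all assumptions, and random numbers stand in for real xrd_density values.

```python
import random


def fit_line(x, y):
    """Ordinary least squares fit y = intercept + slope * x for one predictor."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("predictor is constant")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def pairwise_weights(segments, centre=10, atoms=("N", "CA", "C", "O")):
    """Fit one line per (central atom, other residue, other atom) combination.

    `segments` is a list of peptide segments; each segment is a list of dicts
    mapping backbone atom names to density values (a hypothetical layout).
    Returns {(central_atom, position, target_atom): (intercept, slope)}.
    """
    weights = {}
    length = len(segments[0])
    for central_atom in atoms:
        x = [seg[centre][central_atom] for seg in segments]
        for pos in range(length):
            if pos == centre:
                continue  # the central residue is the predictor, not a target
            for target_atom in atoms:
                y = [seg[pos][target_atom] for seg in segments]
                weights[(central_atom, pos, target_atom)] = fit_line(x, y)
    return weights


# Synthetic stand-in data: 50 segments of 21 residues with 4 atoms each.
rng = random.Random(0)
demo_segments = [
    [{a: rng.gauss(0.0, 1.0) for a in ("N", "CA", "C", "O")} for _ in range(21)]
    for _ in range(50)
]
demo_weights = pairwise_weights(demo_segments)
```

With 21-residue segments this yields 4 * 20 * 4 fitted lines, each contributing one intercept and one slope, matching the count given in the text.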

[0584] We present one part of the result in Figure 53 for better understanding. As can be seen, on the independent train and test sets the characteristics of the population are the same on average. The random sample behaves the same but with a bigger standard deviation (the error depends on the sample size). On the other hand, the stratified sample (in this case for ALA) shows different average characteristics, and this deviation is the same on both sets. For the 20 amino acids we calculated the 160 coefficients that served as the input of a classification model in which the target was the type of the amino acid. With this method we proved that the calculated effects depend uniquely and dominantly on the type of the amino acid, and thus directly on the primary structure.

[0585] In the examples with the fo_density parameter the extrapolation is carried out for carbonyl carbons of the peptide bonds. The coefficient values are particularly high in the case of these carbonyl carbons (see fo_density_0_C and fo_density_1_C), because the FO data physically and inherently comprise the information that these are carbonyl carbons, since these data are derived from actual measurements.

[0586] In the experiments a two amino acid “box” is applied and extrapolation is made. The carbonyl atom marked as fo_density_2_C is outside this box by a few (3) covalent bonds. The weights of the carbonyl atoms within the box (fo_density_0_C and fo_density_1_C) in the established linear equations are higher than those of other atoms.

[0587] In another experiment, data were obtained not from measured datasets but from in silico modelling. For example in case of model calculations with the DelPhi method, no such differentiation of the carbonyl carbons can be observed, i.e. the FO values are not particularly high on these atoms.

[0588] In other words, the XRD data “know”, i.e. provide information about, the individual atoms, and when we wish to calculate an estimation on the carbonyl carbons, these atoms take part in the descriptor with an increased weight. We noticed that if we wish to focus on chiral atoms, their weight will increase.

[0589] According to these calculations, and similarly for Cryo-EM, the effects of the individual amino acids on the pattern formation are different, meaning that the primary structure is very significant for the patterns, regardless of the fact that the spatial structure of a protein is considered to be more conserved than the gene itself.

[0590] Analyzing the goodness of helix termination estimations.

[0591] As was previously demonstrated with the Cryo-EM electrostatic data, the termination of an alpha helix can be defined with the electrostatic representation of proteins. Here an example is shown for the estimation of the helix terminus parameter. The terminus parameter is a binary variable that indicates whether the given amino acid is the last one of the given helix or not. The terminus parameter is calculated on XRD structures with the help of submolecular local entities belonging to given amino acid types. In the calculations a control model was set up, in which the same descriptors were used to predict randomized, shuffled terminus data, in order to validate the calculations. During the regressions the sets of descriptors are sampled to be of the same size in order to reduce bias.

[0592] The separation on the original, experimentally supported termination values is clearer, whilst the shuffling makes the separation indefinite. To quantify the separation, Cohen’s d value is computed; it is 1.351 and 0.419 for the real and the control set, respectively. This value measures the difference between the averages of the two groups, normalized by the standard deviation.

Analyzing the goodness of secondary structure estimations.

[0593] Although estimating the termination of an alpha helix with the help of the electrostatic features of sets of proteins is possible, it is not the only secondary structure related parameter that can be utilized as a classification target. The type of the secondary structure - alpha helix, beta strand or random coil (Non) - can also be quantified into Bernoulli variables.

[0594] This is advantageous because the number of helix-terminating amino acids in a protein is equal to the number of helices in the protein, so a large number of experimental models need to be used to minimize the error of the calculations. That is why the following plots show how these structures can be separated pairwise by the fo_density values. The following effect sizes show the estimation of the Non or Beta sheet structures of GLY amino acids: 0.799 and 0.277 for the real and the control set, respectively. The Non and Alpha helix structures are separated with a score of 0.465 and 0.208 for the real and the control set, respectively. Lastly, the separation of the Helix and Sheet structures results in effect sizes of 0.46 and 0.346 for the real and the control set, respectively.
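The comparison of a real and a shuffled control set can be sketched as follows, assuming that the effect sizes quoted here are Cohen's d values computed in the same way as in the helix termination example. The function names are hypothetical, the pooled standard deviation is one common convention for Cohen's d, and the control shown here only permutes the labels of fixed scores, which is a simplification of refitting a model on shuffled targets.

```python
import math
import random
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d: difference of means divided by the pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    if na < 2 or nb < 2:
        raise ValueError("each group needs at least two values")
    var_a = statistics.variance(group_a)
    var_b = statistics.variance(group_b)
    pooled = math.sqrt(((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2))
    if pooled == 0:
        raise ValueError("pooled standard deviation is zero")
    return (statistics.fmean(group_a) - statistics.fmean(group_b)) / pooled


def separation(scores, labels):
    """Absolute Cohen's d between scores labelled 1 and scores labelled 0."""
    ones = [s for s, lab in zip(scores, labels) if lab == 1]
    zeros = [s for s, lab in zip(scores, labels) if lab == 0]
    return abs(cohens_d(ones, zeros))


def shuffled_control(scores, labels, seed=0):
    """Same separation measure after randomly permuting the labels."""
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)
    return separation(scores, shuffled)


# Toy model outputs for six residues with binary structure labels.
demo_scores = [0.0, 0.2, 0.1, 1.0, 1.2, 1.1]
demo_labels = [0, 0, 0, 1, 1, 1]
real_effect = separation(demo_scores, demo_labels)
control_effect = shuffled_control(demo_scores, demo_labels)
```

A real effect size that clearly exceeds the one obtained on the shuffled control is the pattern reported above for each pair of secondary structure classes.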

[0595] On a bigger dataset, a multiclass classification model was built with the help of the xrd_density electrostatic parameter, and the significant classification was evaluated with a balanced accuracy score of X and X for the real and the control set, respectively. The classification report is shown in Figure 34.

[0596] Analyzing the goodness of tertiary structure (π-π stacking) estimations.

[0597] Similarly to the type of the secondary structure, the π-π stacking state can also be quantified with Bernoulli variables. In this case the value is 0 when a given aromatic side chain does not have π-π stacking partners and 1 when it does. The results are presented in the following Table 10.

[0598] As the Table shows, the calculated effects are considered to be moderate, although better than random.

[0599] IV. 1.) Utilization of in silico calculations as a different method for determining patterns based on physical properties of proteins.

[0600] DelPhi calculation results with regards to Cryo-EM results

[0601] Data preparation

[0602] DelPhi is a software package that solves the Poisson-Boltzmann partial differential equation and thereby calculates an in silico model electrostatic potential (ESP) distribution for a known spatial protein structure. The calculation requires the structural model and atomic partial charge values. These atomic partial charge values can be estimated for monomer amino acid molecules in silico with force field models or ab initio models. However, the calculations neglect the effect of peptide chain formation; therefore, the deviations caused by neighbouring amino acids are not taken into consideration.

[0603] Like the experimentally determined Cryo-EM ESP (CESP) distribution, the ESP distribution calculated by DelPhi (DESP) can be utilized to calculate atomic DESP values by linear interpolation. This step is sufficient for the electrostatic representation of macromolecules. Since DESP is, in its physical interpretation, a very similar quantity to CESP, this section compares the two datasets. Here, similarly to the interpolation method described for Cryo-EM above, interpolation was carried out to obtain values at the atomic positions. The parameters assigned to atoms were then transformed by scaling, a method also applied in certain Cryo-EM-based models.
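Assigning a grid value to an atomic position by linear interpolation can be sketched as below. This is a generic trilinear interpolation written for illustration; the nested-list grid layout, the cubic voxel with edge length `spacing`, and the function and argument names are assumptions and not part of the described method.

```python
import math


def trilinear(grid, origin, spacing, point):
    """Linearly interpolate a 3D grid at an arbitrary point.

    grid[i][j][k] holds the value at origin + (i, j, k) * spacing.
    The nested-list layout, a cubic voxel of edge `spacing`, and the argument
    names are assumptions of this sketch.
    """
    idx, frac = [], []
    for axis in range(3):
        t = (point[axis] - origin[axis]) / spacing
        i = math.floor(t)
        idx.append(i)
        frac.append(t - i)
    sizes = (len(grid), len(grid[0]), len(grid[0][0]))
    for axis in range(3):
        if idx[axis] < 0 or idx[axis] + 1 >= sizes[axis]:
            raise ValueError("point lies outside the interpolable region")
    value = 0.0
    # Weighted sum over the 8 grid nodes surrounding the point.
    for di in (0, 1):
        for dj in (0, 1):
            for dk in (0, 1):
                weight = (
                    (frac[0] if di else 1.0 - frac[0])
                    * (frac[1] if dj else 1.0 - frac[1])
                    * (frac[2] if dk else 1.0 - frac[2])
                )
                value += weight * grid[idx[0] + di][idx[1] + dj][idx[2] + dk]
    return value


# Toy 2x2x2 grid whose values vary linearly: value = i + 2*j + 4*k.
demo_grid = [[[i + 2 * j + 4 * k for k in range(2)] for j in range(2)] for i in range(2)]
atom_value = trilinear(demo_grid, (0.0, 0.0, 0.0), 1.0, (0.5, 0.25, 0.75))
```

Because the toy grid is itself a linear function of the indices, the interpolated value reproduces that function exactly at the query point.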

[0604] This methodology may be considered in cases wherein an electrostatic potential map is obtained which is analogous to the electrostatic potential map obtained by processing Cryo-EM data. It is to be considered, however, that force field models provide a very rough estimate of the actual potential field of the protein.

[0605] To begin with, CESP, similarly to ICD, shows a dependency on the local resolution of the Cryo-EM measurement. This dependency should be corrected with our iterative correction method. Unlike ICD, CESP can be negative, so the correction method is only applicable to the absolute value of CESP, and after correction the signs of the values need to be restored. Scaling of the values is also necessary. DESP, on the other hand, is determined in silico and should not show any connection with the local resolution of the experimental Cryo-EM measurement; however, scaling is necessary here as well. The connection with the local resolution of the raw atomic CESP, the corrected and scaled atomic CESP and the scaled atomic DESP, on the very same 34 protein molecules, is shown in Figures 35, 36 and 37, respectively.
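The handling of signed CESP values can be sketched as follows: the correction is applied to the magnitude only and the original sign is restored afterwards. The callable `correct_magnitude` is a hypothetical stand-in for the iterative resolution correction, whose details are not reproduced here; the toy correction used in the example is purely illustrative.

```python
import math


def correct_signed(values, resolutions, correct_magnitude):
    """Apply a magnitude-only correction and restore each original sign.

    `correct_magnitude(magnitude, resolution)` stands in for the resolution
    correction described in the text; it is a placeholder of this sketch and
    is expected to return a non-negative number.
    """
    return [
        math.copysign(correct_magnitude(abs(v), r), v)
        for v, r in zip(values, resolutions)
    ]


# Toy correction that rescales the magnitude by half the local resolution.
corrected = correct_signed(
    [-2.0, 0.5],
    [2.0, 4.0],
    lambda magnitude, resolution: magnitude * resolution / 2.0,
)
```

Negative inputs stay negative and positive inputs stay positive, whatever the magnitude correction does.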

[0606] After the necessary preparatory calculations the atomic CESP and DESP parameters are ready to be used in further analyses, of which the autocorrelation along the CA atoms is examined here. Just as in previous examples, the autocorrelation shows the regularities of the electrostatic parameter formation along the peptide chain, and the longer the range of significant relations, the larger the effect of each amino acid. The autocorrelation plots for CESP and DESP are shown in Figures 38 and 39, respectively, for a wild type SARS-CoV-2 spike protein in active form (PDB: 7KDL).

[0607] The obvious difference between the two plots is that the in silico DESP data rarely show significant correlations, and those that occur are considered to be very low. The experimental CESP, on the other hand, shows very long-range relations, similarly to the ICD parameter values. The main reason for the difference is that the in silico ESP representation of such macromolecules is not yet a solved problem, and this will not change until ab initio calculations can be done within an acceptable time. Although DESP shows little hope for further significant results, the analysis is continued.

[0608] Interpolation and extrapolation with CESP and DESP data

[0609] Continuing our analysis, the goodness of interpolation and extrapolation was investigated. Since there is no difference in the methodical handling of the CESP and DESP parameters, the descriptor building is the same as well. As defined previously, interpolation here means the estimation of an atomic parameter inside the imaginary box of the descriptor, while extrapolation means the estimation of an atomic parameter outside it. First, in the model building we do not filter by individual amino acid type. The calculations were carried out up to the n-th neighbour, and the change of the correlation coefficient with the extrapolation distance (in bonds) is shown in Figures 40 and 41 for DESP and CESP, respectively.

[0610] The descriptor herein consists of the electrostatic parameters of the main chain atoms of a two amino acid long peptide segment. Here electrostatic data are correlated with electrostatic data. As the plots show, experimental CESP data are better for use in extrapolation calculations, though not as good as Cryo-EM ICD values or XRD fo_density values. The extrapolation can also be investigated with the amino acid filter on, and the CESP data have the ability to recognize the type of the different atoms.

[0611] On the other hand DESP parameters do not have this feature (it suggests that the chiral carbon atom has more effect), which means that the in silico potential model carries less information than the experimental, and the missing information is important.

[0612] This means that the structural information and the characteristic electrostatic parameters modelled with force field methods and then utilized with DelPhi do not carry the information necessary to predict the long-range connections in the electrostatic field of a protein molecule. This underscores the fact that the data we utilize are clearly unique and are capable of enhancing the current biochemical views.

[0613] Secondary structure estimation comparisons

[0614] Here we present secondary structure estimations that were calculated on the CESP and DESP datasets. The calculations include only three multilinear models per amino acid, based on the pairwise grouping of the secondary structures Helix, Sheet and Random coil. The utilized descriptors are individual amino acids together with the preceding and following peptide chain parts of the given amino acid. Figures 42 and 43 present the Helix / Sheet estimation on the CESP and DESP data, with error ranges obtained from a randomized biological feature target. In this example again the biochemical feature values have been set as 0 for the helix and 1 for the sheet.

[0615] Comparing the Figures, it can be concluded that both the experimental CESP and the in silico DESP data contain the necessary information on the structural features, and the estimation is better than the typical error range of the algorithm. Note that the scales are different. The effect sizes tend to be somewhat larger in the case of CESP data, suggesting that it outperforms the descriptors of in silico DESP origin. The number of utilized descriptors might differ because of the necessary trimming of the data; however, the difference is not relevant.

[0616] The same holds for the Helix / Random coil and Sheet / Random coil estimations. These results are presented in Figures 44, 45, 46 and 47 for the CESP and DESP data, respectively.

[0617] As the Figures show, in both cases the structural parameter can be significantly estimated by both CESP and DESP data, and CESP performs better. The following conclusion can be drawn from the results: the in silico DESP parameter does not perform well in the local interpolation and extrapolation calculations, and the type of the atoms cannot be determined well, meaning that the computational approximation of the electrostatic features of the protein lacks this type of information. However, the structural information seems to be correctly reconstructed, meaning that the secondary structure estimations are better than the error range.

[0618] On the other hand, the experimentally derived CESP parameter outperforms in all fields, meaning that the utilization of experimental data is highly recommended for further purposes. This finding also leads to the conclusion that DESP data can be used for in silico determined structural information, and that the secondary structure estimation results can be interpreted on significantly bigger protein datasets in silico. In order to support this statement, the tertiary structure estimation is also computed in the following paragraph.

[0619] Tertiary structure estimation comparisons

[0620] In the presented invention the tertiary structure estimation is carried out by estimating the possible existence of π-π stacking interactions between aromatic side chains. The selected set of descriptors is based on the given aromatic amino acids and their preceding and following peptide chain partners. The target is a binary variable whose value reflects the biological feature of whether this π-π stacking is possible or not, meaning that the given aromatic side chain has at least two other aromatic side chains within a distance of 7 Å. The results for the CESP and DESP parameters, with error ranges determined by statistical calculations, are presented in Figures 48 and 49. Noting that the scales are different, the following conclusion can be made. Both CESP and DESP parameters perform better than the error range, meaning that the information exists. Surprisingly, the experimental CESP data cannot outperform the in silico DESP data, meaning that DelPhi potential estimation is a great tool when analysing this type of tertiary structure.
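The binary stacking target described above can be sketched as follows. This is only an illustration of the stated criterion (at least two other aromatic side chains within 7 Å); measuring the distance between ring centroids, treating the cutoff as inclusive, and the function name are assumptions, since the text does not state how the distance is measured.

```python
import math


def stacking_labels(ring_centres, cutoff=7.0, min_partners=2):
    """Return 1 for each aromatic ring with enough aromatic neighbours, else 0.

    ring_centres: list of (x, y, z) ring-centroid coordinates in angstrom.
    Measuring the distance between ring centroids is an assumption; the text
    only states a 7 angstrom distance and at least two other aromatic side
    chains.
    """
    labels = []
    for i, centre in enumerate(ring_centres):
        partners = sum(
            1
            for j, other in enumerate(ring_centres)
            if j != i and math.dist(centre, other) <= cutoff
        )
        labels.append(1 if partners >= min_partners else 0)
    return labels


# Three rings close together and one isolated ring (toy coordinates).
demo_labels = stacking_labels(
    [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (0.0, 5.0, 0.0), (30.0, 0.0, 0.0)]
)
```

In the toy example the three clustered rings each have two neighbours within the cutoff and receive the value 1, while the isolated ring receives 0.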

[0621] REFERENCES

[0622] Ackerman-Schraier, L., Rosenberg, A. A., Marx, A., & Bronstein, A. M. (2022). Machine learning approaches demonstrate that protein structures carry information about their genetic coding. Scientific Reports, 12(1), 21968. https://doi.org/10.1038/s41598-022-25874-z

[0624] Berman, H. M. (2000). The Protein Data Bank. Nucleic Acids Research, 28(1), 235-242. https://doi.org/10.1093/nar/28.1.235

[0625] Cieplak, P., Dupradeau, F.-Y., Duan, Y., & Wang, J. (2009). Polarization effects in molecular mechanical force fields. Journal of Physics: Condensed Matter, 21(33), 333102. https://doi.org/10.1088/0953-8984/21/33/333102

[0627] Dynamic Time Warping. (2007). In M. Müller, Information Retrieval for Music and Motion (pp. 69-84). Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-540-74048-3_4

[0629] Frank, J. (2002). Single-Particle Imaging of Macromolecules by Cryo-Electron Microscopy. Annual Review of Biophysics and Biomolecular Structure, 31(1), 303-319. https://doi.org/10.1146/annurev.biophys.31.082901.134202

[0631] Friesner, R. A. (2005). Ab initio quantum chemistry: Methodology and applications. Proceedings of the National Academy of Sciences, 102(19), 6648-6653. https://doi.org/10.1073/pnas.0408036102

[0633] Guo, G., Goldfeder, J., Lan, L., Ray, A., Yang, A. H., Chen, B., Billinge, S. J. L., & Lipson, H. (2024). Towards end-to-end structure determination from x-ray diffraction data using deep learning. Npj Computational Materials, 10(1), 209. https://doi.org/10.1038/s41524-024-01401-8

[0634] Jelsch, C., Mitev, M., & Guillot, B. (2023). Method for Evaluating the Stability of the Structure of a Molecule-Environment Complex. https://patentscope.wipo.int/search/en/WO2023104969

[0636] Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., Bridgland, A., Meyer, C., Kohl, S. A. A., Ballard, A. J., Cowie, A., Romera-Paredes, B., Nikolov, S., Jain, R., Adler, J., ... Hassabis, D. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), 583-589. https://doi.org/10.1038/s41586-021-03819-2

[0640] Kim, S. J., Yoon, J. S., Shishido, H., Yang, Z., Rooney, L. A., Barral, J. M., & Skach, W. R. (2015). Translational tuning optimizes nascent protein folding in cells. Science, 348(6233), 444-448. https://doi.org/10.1126/science.aaa3974

[0641] Komar, A. A., Samatova, E., & Rodnina, M. V. (2024). Translation Rates and Protein Folding. Journal of Molecular Biology, 436(14), 168384. https://doi.org/10.1016/j.jmb.2023.168384

[0642] Krämer, A., Green, J., Pollard, J., & Tugendreich, S. (2014). Causal analysis approaches in Ingenuity Pathway Analysis. Bioinformatics, 30(4), 523-530. https://doi.org/10.1093/bioinformatics/btt703

[0643] Lamb, A. L., Kappock, T. J., & Silvaggi, N. R. (2015). You are lost without a map: Navigating the sea of protein structures. Biochimica et Biophysica Acta (BBA) - Proteins and Proteomics, 1854(4), 258-268. https://doi.org/10.1016/j.bbapap.2014.12.021

[0644] Li, L., Li, C., Sarkar, S., Zhang, J., Witham, S., Zhang, Z., Wang, L., Smith, N., Petukh, M., & Alexov, E. (2012). DelPhi: A comprehensive suite for DelPhi software and associated resources. BMC Biophysics, 5(1), 9. https://doi.org/10.1186/2046-1682-5-9

[0645] Liu, Y. (2020). A code within the genetic code: Codon usage regulates co-translational protein folding. Cell Communication and Signaling, 18(1), 145. https://doi.org/10.1186/s12964-020-00642-6

[0646] Lyu, X., Yang, Q., Li, L., Dang, Y., Zhou, Z., Chen, S., & Liu, Y. (2020). Adaptation of codon usage to tRNA I34 modification controls translation kinetics and proteome landscape. PLOS Genetics, 16(6), e1008836. https://doi.org/10.1371/journal.pgen.1008836

[0647] O’Brien, E. P., Vendruscolo, M., & Dobson, C. M. (2014). Kinetic modelling indicates that fast-translating codons can coordinate cotranslational protein folding by avoiding misfolded intermediates. Nature Communications, 5(1), 2988. https://doi.org/10.1038/ncomms3988

[0648] Perrakis, A., & Sixma, T. K. (2021). AI revolutions in biology: The joys and perils of AlphaFold. EMBO Reports, 22(11), e54046. https://doi.org/10.15252/embr.202154046

[0649] Pfab, J., Phan, N. M., & Si, D. (2021). DeepTracer for fast de novo cryo-EM protein structure modeling and special studies on CoV-related complexes. Proceedings of the National Academy of Sciences, 118(1), e2017525118. https://doi.org/10.1073/pnas.2017525118

[0650] Rosenberg, A. A., Marx, A., & Bronstein, A. M. (2022). Codon-specific Ramachandran plots show amino acid backbone conformation depends on identity of the translated codon. Nature Communications, 13(1), 2815. https://doi.org/10.1038/s41467-022-30390-9

[0651] Senior, A. W., Kirkpatrick, J., Sifre, L., Evans, R. A., Penedones, H., Qin, C., Sun, R., Simonyan, K., & Jumper, J. (2021). Machine learning for determining protein structures (United States Patent No.

[0652] US20210304847A1). https://patents.google.com/patent/US20210304847A1/en?oq=US2021304847A1

[0653] Senior, A. W., Kirkpatrick, J., Sifre, L., Evans, R. A., Penedones, H., Qin, C., Sun, R., Simonyan, K., & Jumper, J. (2024). Iterative protein structure prediction using gradients of quality scores (European Union Patent No. EP3821433B1). https://patents.google.com/patent/EP3821433B1/en?oq=EP3821433B1

Turner, J. & The wwPDB Consortium. (2023). EMDB - the Electron Microscopy Data Bank. Bioinformatics. https://doi.org/10.1101/2023.10.03.560672

[0654] Vilas, J. L., Gomez-Blanco, J., Conesa, P., Melero, R., Miguel De La Rosa-Trevin, J., Oton, J., Cuenca, J., Marabini, R., Carazo, J. M., Vargas, J., & Sorzano, C. O. S. (2018). MonoRes: Automatic and Accurate Estimation of Local Resolution for Electron Microscopy Maps. Structure, 26(2), 337-344.e4. https://doi.org/10.1016/j.str.2017.12.018

[0655] Walls, A. C., Park, Y.-J., Tortorici, M. A., Wall, A., McGuire, A. T., & Veesler, D. (2020). Structure, Function, and Antigenicity of the SARS-CoV-2 Spike Glycoprotein. Cell, 181(2), 281-292.e6. https://doi.org/10.1016/j.cell.2020.02.058

Williams, D. B., & Carter, C. B. (2008). Transmission electron microscopy: A textbook for materials science (2nd ed.). Springer.

[0656] Zwanzig, R., Szabo, A., & Bagchi, B. (1992). Levinthal’s paradox. Proceedings of the National Academy of Sciences, 89(1), 20-22. https://doi.org/10.1073/pnas.89.1.20

Claims

Claim 1

A computer-implemented method performed by one or more data processing apparatus, for assessing (or determining) a biochemical feature of a biological macromolecule (preferably a protein molecule) by providing an electrostatic data based (preferably charge-related) in silico representation of the biological macromolecule, wherein the biochemical feature assessed by the method is suitable to be brought into relationship with the biological function of the biological macromolecule and thereby with its biological mechanism, said biological macromolecule comprising a multiplicity of submolecular local entities selected from atoms, bonds and groups of atoms, preferably consisting of e.g. at most 8, 5 or 4 atoms, and having a structure described by electrostatic quantities organized into an electrostatic quantity map, wherein said in silico representation comprises a descriptor or a multiplicity thereof, said descriptor (optionally a multiplicity of descriptors) comprising

o a set of submolecular local entity values, identifying a set of descriptor-related submolecular local entities each submolecular local entity value belonging to said descriptor,
o a set of electrostatic data-derived parameters derived from the electrostatic quantities, preferably charge-related parameters characterizing local charges, obtained from the electrostatic quantity map of the biological macromolecule, each electrostatic data-derived parameter being assigned to a single submolecular local entity value within the set of submolecular local entity values,

a set of biochemical feature values, comprising one or more value(s) associated with a biochemical feature of the macromolecule, wherein optionally the biochemical feature values can be considered as target variables (as explained in more detail in the detailed description),

a mathematical model for the relationship between the set of electrostatic data-derived parameters and the biochemical function of the biological macromolecule, said model being set up on a multitude of descriptors (preferably defining interactions (patterns) between the electrostatic data-derived parameters), the electrostatic data-derived parameters being predictor variables and the biochemical feature values being the target variables in the model, to obtain a biochemical feature score to evaluate the relationship between the multitude of descriptors and the biochemical feature values;

said computer-implemented method comprising the steps of

- obtaining a biochemical feature score of interest as the result of the model utilized (or applied) on at least one descriptor, optionally on a multiplicity of descriptors, both the biochemical feature score and the at least one descriptor being related to the biological macromolecule of interest,
- comparing the biochemical feature score of interest with at least one multitude of reference biochemical feature scores, each of said reference biochemical feature scores obtained as the result of the mathematical model set up on a multitude of reference descriptors of biological macromolecules with known biochemical features (wherein the corresponding known biochemical feature values are target variables of the model),
- assessing (or determining) a biochemical feature of the biological macromolecule of interest by comparing the biochemical feature score of interest with the at least one multitude of reference biochemical feature scores or with a reference interval derived therefrom,

wherein preferably a relationship/connection is established between the biochemical feature (assessed by comparison of the biochemical feature score of interest with the at least one multitude of reference biochemical feature scores) and the biological function of the biological macromolecule and thereby its biological mechanism (of action).

Claim 2

The method according to claim 1, wherein

o the model type is an alignment-type model and the target variables are values calculated between descriptors, preferably a difference-type model and the difference-type values are calculated between two descriptors or
o the model type is a regression model, and the target variables are the weighted combination (preferably sum) of the predictor variables
o the model type is a classification-type model, and the obtained score values are the weighted combination of the predictor variables.

Claim 3

The method according to any of claims 1 or 2, wherein the macromolecule is a protein and the biological feature comprises multiple states of the protein, preferably two different states.

Claim 4

The method according to claim 3, wherein the difference between the states is due to an effect selected from the group consisting of: mutation, biochemical modification, said modification preferably selected from a regulatory type modification, preferably phosphorylation, change in the glycosylation pattern, external effect, said external effect preferably being selected from protein-ligand interaction, protein-protein interaction, or other external effect like chemical effects, preferably a change in pH, ionic strength, polarity or hydrophobicity of the environment, presence of chaotropic agents, etc.; or physical effects, like light, temperature, pressure, etc.

Claim 5

The method according to any of claims 3 to 4, wherein the model is an alignment-type model and the target variables are values calculated between descriptors, preferably a difference-type model and the difference-type values are calculated between two descriptors, and the target variables are Bernoulli variables, preferably the descriptor comprising electrostatic data-derived parameters along the polypeptide chain, preferably the model type being dynamic time warping (DTW).

Claim 6

The method according to any of claims 1 or 2, wherein the biological feature comprises difference in the secondary structure, preferably said biological feature is selected from the group consisting of alpha helical structure, beta-sheet, termination thereof, or other secondary structure type, preferably the biochemical feature values are selected from torsion angles or any representation thereof, Bernoulli variables relating to different structures, etc.

Claim 7

The method according to claim 6, wherein the model type is a regression model, and the target variables are the weighted combination (preferably sum) of the predictor variables, preferably a multiple linear regression model, preferably the descriptor wherein preferably the elements pattern multiple states of the macromolecule, preferably the protein, preferably two different states.

Claim 8

The method according to any of claims 1 or 2, wherein the biological feature comprises a pattern based on primary structure, preferably said biological feature is selected from the group consisting of alpha helical structure, beta-sheet, termination thereof, or other secondary structure type, preferably the biochemical feature values are selected from torsion angles or any representation thereof, Bernoulli variables relating to different structures, etc.

Claim 9

The method according to any of claims 1 or 2, wherein the biological feature is a feature which indicates the folded nature of the protein.

Claim 10

The method according to claim 9, wherein the biological feature is a π-π relationship.

Claim 11

The method according to any of claims 1 or 2, wherein the biological feature is a relationship between the genetic code and the electrostatic environment of amino acids.

Claim 12

A computer-implemented method performed by one or more data processing apparatus for providing, in silico, an electrostatic data based descriptor relating to a biological feature (preferably charge-related) of a biological macromolecule (preferably a protein molecule), said biological macromolecule comprising a multiplicity of submolecular local entities and having a structure described by

o geometric structure data (e.g. backbone torsion angles and/or atomic coordinates) defining the structure of the biological macromolecule or substructure thereof and
o electrostatic quantities preferably organized into an electrostatic quantity map,

characterized in that said descriptor (including a multiplicity of descriptors) comprises

o a set of submolecular local entity values relating to a set of descriptor-related submolecular local entities each belonging to said descriptor,
o a set of electrostatic data-derived parameters derived from the electrostatic quantities, preferably from the electrostatic quantity map of the biological macromolecule, each electrostatic data-derived parameter being assigned to a single submolecular local entity within the set of submolecular local entities,

said descriptor being suitable for setting up a model thereon wherein said electrostatic data-derived parameters are used as predictor variables and values assigned to the biochemical feature are used as target variables, preferably said descriptor carrying information about the biochemical feature,

said method comprising the steps of

- providing a structure of the biological macromolecule described by
  o geometric structure data defining the structure of the biological macromolecule or substructure thereof, preferably a macromolecular structure data map,
  o electrostatic quantities related to the structure of the biological macromolecule, preferably a reconstructed electrostatic-quantity map, and preferably from experimental or computed physical data,
- if desired, transforming the electrostatic quantities, preferably quantities of the reconstructed electrostatic-quantity map, into transformed electrostatic quantities, preferably forming a transformed electrostatic-quantity map,
- defining a set of biochemical feature values, said values being related to a biochemical feature of the macromolecule,
- forming a descriptor for the biochemical feature, by
  o selecting a set of submolecular local entities using the geometric structure data,
  o selecting a set of electrostatic quantities related to the set of submolecular local entities, and assigning each electrostatic quantity to a related submolecular local entity in the set of submolecular local entities, to obtain electrostatic data-derived parameters (or electrostatic data related parameters, or electrostatic-type parameters), wherein the set of submolecular local entities and the set of electrostatic data-derived parameters form the descriptor for the biochemical feature,
- if desired, transforming the electrostatic data-derived parameters into transformed electrostatic data-derived parameters,
- if desired, processing the electrostatic-type parameters into processed electrostatic-type parameters,
- optionally defining a subset of the set of submolecular local entities thereby defining a substructure of the macromolecule and a related subset of the electrostatic-type parameters.

Claim 13

A computer-implemented method performed by one or more data processing apparatus for providing an electrostatic data based in silico representation of a biological macromolecule (preferably a protein molecule) having a biological feature, said biological macromolecule comprising a multiplicity of submolecular local entities and having a structure described by

o geometric structure data defining the structure of the biological macromolecule or substructure thereof and
o electrostatic quantities preferably organized into an electrostatic quantity map,

characterized in that said in silico representation comprises a descriptor (including a multiplicity of descriptors) comprising

o a set of submolecular local entity values relating to a set of descriptor-related submolecular local entities each belonging to said descriptor,
o a set of electrostatic data-derived parameters derived from the electrostatic quantities, preferably from the electrostatic quantity map of the biological macromolecule, each electrostatic data-derived parameter being assigned to a single submolecular local entity within the set of submolecular local entities,

a set of biochemical feature values, comprising one or more value(s) associated with a biochemical feature of the macromolecule,

a model, said model being set up on a multitude of descriptors, taking the electrostatic data-derived parameters as predictor variables and the biochemical feature values as target variables, to obtain a score being useful to evaluate the relationship between the multitude of descriptors and the biochemical feature values;

said method comprising the steps of

- providing a structure of the biological macromolecule described by
  o geometric structure data defining the structure of the biological macromolecule or substructure thereof, preferably a macromolecular structure data map,
  o electrostatic quantities related to the structure of the biological macromolecule, preferably a reconstructed electrostatic-quantity map, and preferably from experimental or computed physical data,
- if desired, transforming the electrostatic quantities, preferably quantities of the reconstructed electrostatic-quantity map, into transformed electrostatic quantities, preferably forming a transformed electrostatic-quantity map,
- defining a set of biochemical feature values, said values being related to a biochemical feature of the macromolecule,
- forming a descriptor for the biochemical feature, by
  o selecting a set of submolecular local entities using the geometric structure data,
  o selecting a set of electrostatic quantities related to the set of submolecular local entities, each quantity being assigned to a related submolecular local entity in the set of submolecular local entities, to obtain electrostatic data-derived parameters, wherein the set of submolecular local entities and the set of electrostatic data-derived parameters form the descriptor for the biochemical feature,
- if desired, transforming the electrostatic data-derived parameters into transformed electrostatic data-derived parameters,
- if desired, processing the electrostatic-type parameters into processed electrostatic-type parameters,
- optionally defining a subset of the set of submolecular local entities thereby defining a substructure of the macromolecule and a related subset of the electrostatic-type parameters,
- setting up a model on the electrostatic data-derived parameters of a multitude of said descriptors as predictor variables and the biochemical feature values as target variables, to obtain a score being useful to evaluate the relationship between the multitude of descriptors and the biochemical feature values, thereby providing the in silico representation of the macromolecule.

Claim 14

A computer-implemented method performed by one or more data processing apparatus for providing a reference multitude of score values for estimation of a value related to a known biochemical feature of a reference biological macromolecule and for assessing the biochemical feature of a biological macromolecule of interest, using an in silico representation of the reference biological macromolecule, said biological macromolecule comprising a multiplicity of submolecular local entities and having a structure described by

o geometric structure data defining the structure of the biological macromolecule or substructure thereof and
o electrostatic quantities preferably organized into an electrostatic quantity map,

characterized in that said in silico representation comprises a descriptor comprising

o a set of submolecular local entity values relating to a set of descriptor-related submolecular local entities each belonging to said descriptor,
o a set of electrostatic data-derived parameters derived from the electrostatic quantities, preferably from the electrostatic quantity map of the biological macromolecule, each electrostatic data-derived parameter being assigned to a single submolecular local entity within the set of submolecular local entities,

a set of biochemical feature values, comprising one or more value(s) associated with a biochemical feature of the macromolecule,

a model, said model being set up on a multitude of descriptors, taking the electrostatic data-derived parameters as predictor variables and the biochemical feature values as target variables, to obtain a score being useful to evaluate the relationship between the multitude of descriptors and the biochemical feature values;

said method comprising the steps of

- providing a structure of the biological macromolecule described by
  o geometric structure data defining the structure of the biological macromolecule or substructure thereof, preferably a macromolecular structure data map,
  o electrostatic quantities related to the structure of the biological macromolecule, preferably a reconstructed electrostatic-quantity map, and preferably from experimental or computed physical data,
- if desired, transforming the electrostatic quantities, preferably quantities of the reconstructed electrostatic-quantity map, into transformed electrostatic quantities, preferably forming a transformed electrostatic-quantity map,
- defining a set of biochemical feature values, each value being related to a known biochemical feature of a known macromolecule,
- forming reference descriptors for the biochemical feature of a (multitude of) biological macromolecule(s), by
  o selecting a set of submolecular local entities for each descriptor, using the geometric structure data,
  o selecting a set of electrostatic quantities related to the set of submolecular local entities for the given descriptor, and assigning each electrostatic quantity to a related submolecular local entity in the set of submolecular local entities, to obtain electrostatic data-derived parameters, wherein the set of submolecular local entities and the set of electrostatic data-derived parameters form the descriptor for the biochemical feature,
- if desired, transforming the electrostatic data-derived parameters into transformed electrostatic data-derived parameters,
- if desired, processing the electrostatic-type parameters into processed electrostatic-type parameters,
- optionally defining a subset of the set of submolecular local entities thereby defining a substructure of the macromolecule and a related subset of the electrostatic-type parameters,
- setting up a model on the electrostatic data-derived parameters of the descriptors as predictor variables and the known biochemical feature values as target variables, to obtain a multitude of scores being useful to evaluate the relationship between the multitude of descriptors and the biochemical feature values, thereby providing the in silico representation of the macromolecule,
- applying the model to calculate a reference multitude of scores.

Claim 15

The method according to any of claims 12, 13 or 14, wherein

o the model type is an alignment-type model and the target variables are values calculated between descriptors, preferably a difference-type model and the difference-type values are calculated between two descriptors or
o the model type is a regression model, and the target variables are the weighted combination (preferably sum) of the predictor variables.
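Illustrative note (not part of the claims): the alignment-type variant named in the claims, with dynamic time warping (DTW) as the preferred model, can be sketched in a few lines of Python. The sketch below is only a minimal illustration under stated assumptions. It assumes a descriptor is a list of (residue index, electrostatic data-derived parameter) pairs, takes the DTW cost between two such descriptors as the feature score, and derives the reference interval as the mean plus or minus k sample standard deviations of the reference scores. The function names, the interval rule and all numeric values are hypothetical and are not taken from the application.

```python
from statistics import mean, stdev


def dtw_distance(a, b):
    """Textbook dynamic time warping cost between two 1-D sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # advance in a only
                cost[i][j - 1],      # advance in b only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]


def reference_interval(scores, k=2.0):
    """Assumed rule: mean +/- k sample standard deviations of the scores."""
    mu, sd = mean(scores), stdev(scores)
    return mu - k * sd, mu + k * sd


def within_interval(score, interval):
    """True if the score of interest falls inside the reference interval."""
    low, high = interval
    return low <= score <= high


# Toy descriptors: (residue index, electrostatic data-derived parameter).
# All numbers are placeholders, not data from the application.
state_a = [(1, 0.10), (2, 0.35), (3, -0.20), (4, 0.05)]
state_b = [(1, 0.12), (2, 0.30), (3, 0.30), (4, -0.18), (5, 0.04)]

params_a = [p for _, p in state_a]
params_b = [p for _, p in state_b]
score_of_interest = dtw_distance(params_a, params_b)

# Placeholder reference scores, e.g. DTW costs between known state pairs.
reference_scores = [0.10, 0.14, 0.09, 0.12, 0.11]
interval = reference_interval(reference_scores)

print(score_of_interest, interval, within_interval(score_of_interest, interval))
```

Because DTW tolerates insertions and repeats along the chain, two descriptors of different length can still be compared, which is presumably why an alignment-type model suits parameters ordered along a polypeptide chain. Any other rule for deriving the reference interval could be substituted for the assumed mean and standard deviation rule.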