An artificial intelligence-based method of enterobacterial repetitive intergenic consensus PCR analysis
The AI-based method addresses the limitations of manual band detection in PCR analysis by using deep learning for isolate-based image processing and structural similarity, enabling efficient and accurate phylogenetic tree construction.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- T C ERCIYES UNIVERSITESI
- Filing Date
- 2025-12-23
- Publication Date
- 2026-07-02
Abstract
Description
[0001] AN ARTIFICIAL INTELLIGENCE-BASED METHOD OF ENTEROBACTERIAL REPETITIVE INTERGENIC CONSENSUS PCR ANALYSIS
[0002] Technical Field
[0003] The invention relates to an artificial intelligence-based method of Enterobacterial Repetitive Intergenic Consensus (ERIC) PCR (Polymerase Chain Reaction) analysis that determines the degree of affinity of bacteria.
[0004] State of the Art
[0005] The purpose of determining the degree of affinity of bacteria is to determine the level of genetic relatedness between bacteria and to reveal the relationships between species or strains, thus contributing to microbiological and epidemiological studies such as monitoring disease factors, classifying bacterial populations, and determining the source of outbreaks. Various software and methods are used in the current system. The QIAxel System and BioDoc analysis software are only used to display the gel. Features such as gel cropping, tree construction, etc. are not available. The Gel Doc EZ system finds the existing bands in the gel and turns the results into an Excel table. However, this software cannot perform analyses such as cutting the bands in the gel and constructing a tree. The Gelquest and ClusterVis software includes gel cutting, tree construction, etc., but those who use the software manually adjust these features. The software called gel scan is only used to display the gel, and there is no gel cutting, tree construction, etc. available.
[0006] The Quantity One GelQuest software is used only to display the gel and to assess the quality of the gel and bands; gel cutting, tree construction, and similar features are not available. With PyElph and Gel Analyzer, clusters cannot be created and the relevant bands cannot be placed opposite each isolate. These programs construct trees with UPGMA (Unweighted Pair Group Method with Arithmetic Mean) and WPGMA (Weighted Pair Group Method with Arithmetic Mean), whereas the software in this project constructs the tree based on UPGMA and NJ (Neighbor Joining). In addition, isolate data from previous analyses cannot be integrated into a new tree, and the similarity line cannot be generated automatically. With the software called GelJ, multiple gel images cannot be combined to obtain a tree of multiple isolates. In addition, clusters cannot be created, the relevant bands cannot be placed opposite each isolate, and the similarity line cannot be generated automatically. GelJ also uses UPGMA and WPGMA, and isolate data from previous analyses cannot be integrated into a new tree.
[0007] In the current system, there are patent / utility model applications and articles related to the subject. Patent application "CN103773887A" in the state of the art includes the method of preparing the ERIC-PCR fingerprint spectrum showing changes in mouse intestinal flora. The preparation method includes modeling of mouse gut flora changes caused by antibacterial-associated diarrhea, removal of cecum content DNA, removal of humic acid from cecum content DNA, removal of PCR inhibitor from cecum content DNA, ERIC-PCR amplification, and native polyacrylamide gel electrophoresis (Native-PAGE). The results obtained for the creation of the ERIC-PCR fingerprint spectrum of changes in mouse intestinal flora are compared to the results of traditional flora analysis.
[0008] Patent application "US5523217A" in the state of the art discloses oligonucleotide primers and methods for identifying bacterial strains with genetic fingerprints. These methods can be applied to various examples. The test procedure includes replicating the bacterial DNA in the sample to be tested by adding an outwardly oriented pair of primers to the sample. The primers are capable of hybridizing repeating DNA sequences in bacterial DNA and expanding from one hybridizable repeater sequence to another hybridizable repeater sequence. After replication, the elongation products are decomposed according to their size, and the specific bacterial strain is determined by measuring the pattern of the sized elongation products. The procedure for identifying bacterial strains with fingerprints has various uses in areas such as bacteria in infections, bacteria in agriculture and horticulture, bacteria in bioremediation processes, food monitoring, production monitoring, quality assurance, and quality control.
[0009] Currently, all applications in the literature draw dendrograms based on the detection, connection, and comparison of band images. In the current system, gaps left by the mechanism that detects the line passing through the middle of the isolates must be defined manually. In addition, the current system does not work on unclear PCR images, because the bands in each isolate cannot be detected in such images. Another disadvantage of the current system is that the PCR images must be horizontally and vertically aligned for the applications to work. In addition, some applications proceed to dendrogram drawing based on the matching of the bands. Furthermore, isolates do not always occupy the entire area of a PCR image, so the empty spaces must be cut out. This process is carried out manually in the current system.
[0010] Another disadvantage is that there are many parameters that need to be determined manually in all steps during a series of processes such as line detection, band detection, line matching. For this reason, the necessary adjustments for the analysis of a PCR image take a very long time and require experience.
[0011] As a result, due to the drawbacks described above and the inadequacy of the current solutions, there is a need for a technology that works on an isolate basis instead of a band basis, uses the structural similarity of the images, and can recognize even unclear or slanted bands with a unique model based on deep learning.
[0012] Brief Description and Objects of the Invention
[0013] The invention relates to an artificial intelligence-based method of Enterobacterial Repetitive Intergenic Consensus (ERIC) PCR (Polymerase Chain Reaction) analysis that determines the degree of affinity of bacteria, meets all the above-mentioned requirements, and eliminates the setbacks and disadvantages of the existing system.
[0014] The method developed with the invention works based on the image similarity of the isolates instead of the band images. In this way, it eliminates the manual corrections otherwise needed when determining the lines on the isolates, detecting the isolates, detecting the band regions in the isolates, and connecting the bands across more than one isolate. Working based on the structural similarity of the isolates solves the problems encountered in all of these steps.
[0015] In the method developed by the invention, the isolates are detected automatically and completely. Since the method works on structural image similarity, the algorithm does not need to detect the bands in the isolates, and there is no need for a line system passing through the middle of the isolates. It works on all PCR images. To increase the accuracy on unclear PCR images, a unique model based on deep learning was designed for this stage. Thanks to the deep learning model developed by the invention, isolates can also be recognized in images that are oblique, that is, tilted at a certain angle to the right or left.
[0016] The method developed with the invention provides a number of unique functions: automatic detection of isolate images, isolate names, isolate sources, and groups, and their automatic placement in the phylogenetic tree; searching a pool of previously analyzed isolates from different studies for use in future studies; producing phylogenetic trees from an isolate set assembled from different studies according to their similarities; and combining the isolates from more than one PCR image in a single tree to produce comprehensive phylogenetic trees containing 100 or more different isolates.
[0017] With the method, the empty areas in the PCR images can be cut automatically.
[0018] After the PCR image is given with the method, the dendrogram trees are drawn in the desired detail in a very short time without the need for any parameter adjustment. In this respect, the entire algorithm is operated with a faster and simpler process than all the applications used in the current system.
[0019] Because the method is isolate-based rather than band-based, its basic approach differs from that of all existing applications. Since the structural similarity of the images is used, the problems encountered in band-based applications are eliminated. In addition, thanks to the original model based on deep learning, identification and dendrogram drawing can be performed on images that are unclear, tilted to the right or left, or in which the bands are not clearly visible.
[0020] Descriptions of the Figures
[0021] Figure 1: Automatic cropping of the PCR image and isolate detection algorithm.
[0022] Figure 2: Isolate image comparison and automatic creation of the PCR tree.
[0023] Figure 3: Original images of PCR.
[0024] Figure 4: Gradient values calculated with pixel values collected horizontally and vertically.
Figure 5: PCR Core Area.
[0025] Figure 6: Gradient values calculated with pixel values collected vertically.
[0026] Figure 7: Image cropped by detecting the start and end points of the PCR isolates.
[0027] Figure 8: A sample image of the isolates with their starting points determined and marked.
Figure 9: Sample images of the isolates obtained separately.
Figure 10: A fully automated end-to-end artificial intelligence model that segregates and extracts isolates from PCR images.
[0028] Figure 11: Examples of augmented training images.
[0029] Figure 12: Test results of the artificial intelligence model.
[0030] Figure 13: Performance values obtained as a result of the training of the artificial intelligence model on sample isolates.
[0031] Figure 14: Inter-Isolate Difference and Threshold Analysis.
[0032] Figure 15: Phylogenetic tree construction methods.
[0033] Figure 16: A graph showing the conversion and flow required to apply UPGMA.
[0034] Detailed Description of the Invention
[0035] The invention relates to an artificial intelligence-based method of Enterobacterial Repetitive Intergenic Consensus (ERIC) PCR (Polymerase Chain Reaction) analysis that determines the degree of affinity of bacteria. This method is carried out with computer support in a desktop application.
[0036] The method developed with the invention basically has two stages. The first stage comprises the image processing steps shown in Figure 1. The second stage comprises the image similarity algorithm shown in Figure 2 and the production of the similarity tree. The aim is that, after the user uploads the PCR image to the system, all operations in the flow diagram are performed and 4 different PCR trees are produced and presented to the user.
[0037] First, the part of the PCR images that contains only the isolates is determined (1st Method). The PCR images are taken from the device as shown in Figure 3 and are loaded into the desktop application in this form.
[0038] As can be seen in Figure 3, in the original images (a, b, c, d, e) the perimeter of the image is framed in black, while in the inverted image (f) it is framed in white. To reduce the complexity of the process, an inverted image is first converted back to its original state, so that a black-framed image similar to the others is obtained. In the example, the converted form of the inverted PCR image shown in Figure 3-f is the image shown in Figure 3-d. After this conversion, all images are in the same format. As can be seen in Figure 3, the black area does not have a fixed width on the right or in the upper and lower regions. In addition, the dimensions of the images vary on a pixel basis. For example, some of these images are 641x479 pixels with a resolution of 63 dpi, while others are 775x513 pixels with a resolution of 96 dpi. For these reasons, a fixed crop cannot be used to reach the core area of the PCR image by removing the black area. To overcome this and detect the black frame automatically, gradient values were used to remove the margins of the image. These gradient values are calculated using the data processing tool. To determine the breakpoints, the pixel values of the image were summed vertically and horizontally and recorded in two separate vectors. These values were normalized by dividing the vertical totals by the number of rows and the horizontal totals by the number of columns. The margins of the image are cropped by selecting the points where the vertical gradient is maximum and minimum, the first point where the horizontal gradient is greater than 1, and the last point where it is less than -1. For example, the horizontal and vertical gradient values are as shown in Figure 4. The areas above and below the isolate image are cut (cropped) by a processor and removed from the image.
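The margin removal described above can be sketched roughly as follows. This is a minimal illustration and not the implementation used in the invention: the function name `crop_black_frame`, the use of `np.diff` as the gradient, and the reading of "vertical" as column sums and "horizontal" as row sums are all assumptions.

```python
import numpy as np

def crop_black_frame(gray, row_threshold=1.0):
    """Crop a dark frame from a 2-D grayscale image using profile gradients.

    Assumption: "vertical" totals are column sums and "horizontal" totals
    are row sums; np.diff stands in for the gradient calculation.
    """
    gray = np.asarray(gray, dtype=float)
    n_rows, n_cols = gray.shape

    # Sum pixels per column / per row and normalise as described in the text.
    col_profile = gray.sum(axis=0) / n_rows
    row_profile = gray.sum(axis=1) / n_cols

    col_grad = np.diff(col_profile)  # col_grad[i] = profile[i + 1] - profile[i]
    row_grad = np.diff(row_profile)

    # Left/right edges: strongest rise and strongest fall of the column profile.
    left = int(np.argmax(col_grad)) + 1
    right = int(np.argmin(col_grad))

    # Top/bottom edges: first rise above +threshold, last fall below -threshold.
    rising = np.flatnonzero(row_grad > row_threshold)
    falling = np.flatnonzero(row_grad < -row_threshold)
    if rising.size == 0 or falling.size == 0:
        return gray  # no clear frame found; leave the image unchanged
    top, bottom = int(rising[0]) + 1, int(falling[-1])

    if left > right or top > bottom:
        return gray
    return gray[top:bottom + 1, left:right + 1]
```

On real gel images the frame edges are less sharp than in a synthetic test, so the thresholds would likely need to be checked against the gradient plots of the kind shown in Figure 4.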
[0039] As a result of these processes, the black frame in the PCR images was removed, yielding the area containing the original PCR isolates shown in Figure 5. As the images show, the next step is to detect the sections of the image that contain isolates. This step is performed by the artificial intelligence-powered image processing unit. The entire PCR main image area is not always used. While the isolates fill the entire area in Figure 5-d, only part of the area is used in Figure 5-e. For this reason, the starting and ending pixel positions of the isolates in the main area vary from image to image. Some images contain only 3-4 isolates, and the remaining part is left blank. Therefore, to obtain the isolate images, the starting and ending points of the isolates in the main area must be determined as a second step. To achieve this, the method used to reach the core area from the original image was repeated on these images.
[0040] After the gradient values are calculated on the image with the margins removed, the beginning and end of the isolate region are selected as the first point where the gradient is greater than 2 and the last point where it is less than -2. The remaining image width was divided automatically by the number of bands. For example, the horizontal and vertical gradient values are as shown in Figure 6. The gradient values are calculated for the vertical lines by a processor or graphical processing unit, and the breakpoints are determined by analyzing the gradient values.
[0041] As a result of these processes, the images given in Figure 7 were obtained by automatically cutting at the starting and ending points of the isolates.
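As a rough sketch of this second step, the isolate region can be located with the ±2 gradient thresholds and then divided into equal-width strips. The function name `split_isolates`, the equal-width assumption, and the requirement that the number of isolates is already known are illustrative choices, not details confirmed by the description.

```python
import numpy as np

def split_isolates(core, n_isolates, threshold=2.0):
    """Locate the isolate region of a cropped gel image and split it evenly.

    Returns the starting column of each strip and the strips themselves.
    Assumes a bright isolate region on a darker background.
    """
    core = np.asarray(core, dtype=float)
    col_profile = core.mean(axis=0)          # column sums divided by row count
    grad = np.diff(col_profile)

    rising = np.flatnonzero(grad > threshold)
    falling = np.flatnonzero(grad < -threshold)
    if rising.size and falling.size and rising[0] < falling[-1]:
        start, end = int(rising[0]) + 1, int(falling[-1]) + 1
    else:
        start, end = 0, core.shape[1]        # fall back to the full width

    # Equal-width strips between start and end (an assumption of this sketch).
    edges = np.linspace(start, end, n_isolates + 1).round().astype(int)
    strips = [core[:, a:b] for a, b in zip(edges[:-1], edges[1:])]
    return edges[:-1].tolist(), strips
```

For example, a region spanning columns 10 to 69 with three isolates would be cut at columns 10, 30, and 50.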
[0042] After the images in Figure 7 are obtained, each isolate must be extracted as a separate image so that the isolates can be compared one by one and their similarities obtained. For this, the starting point of each isolate is determined, and an image as wide as the isolate is cut and saved as a separate image. This requires the number of bands to be determined because, as can be seen from Figure 7, the widths of the isolates are not constant and vary from image to image. For this reason, isolates of a different width must be cut for each image. The starting points can only be determined once the number of isolates in the image is known. The number of isolates is obtained by counting based on the gradient values or from user input. Once this value is known, the starting points of the isolates shown in Figure 8 are determined. The images are then cut vertically at each starting point, and the image of each isolate is obtained separately as shown in Figure 9. This entire process is performed automatically by the artificial intelligence-powered image processing algorithm and is not affected by the number of isolates, the image size, or the resolution. Obtaining individual images of the isolates allows them to be compared in the next stage.
In the artificial intelligence-based determination of the part of the PCR images where only the isolates are present (2nd Method), the masks (ground truth) of the isolates in 15 PCR images were created by experts. For the training and testing of the artificial intelligence model, 10 of the 15 images were assigned to training and 5 to testing. 500 augmented training images and ground truths were obtained by applying various image manipulation techniques such as contrast, brightness, random rotation, and horizontal-vertical rotation to the training images. The proposed end-to-end fully automated artificial intelligence-based isolate extraction methodology is shown in Figure 10.
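The augmentation step might look something like the sketch below. The parameter ranges, the function name `augment_pair`, and the interpretation of "horizontal-vertical rotation" as horizontal and vertical flips are assumptions; the description does not specify them. The key point illustrated is that geometric changes are applied identically to the image and its mask, while contrast and brightness changes touch only the image.

```python
import numpy as np
from scipy import ndimage

def augment_pair(image, mask, rng):
    """Return one randomly augmented (image, mask) pair.

    image: 2-D grayscale array with values in [0, 255]
    mask:  2-D binary ground-truth array of the same shape
    """
    img = np.asarray(image, dtype=float)
    msk = np.asarray(mask)

    # Photometric changes: image only (ranges are illustrative assumptions).
    contrast = rng.uniform(0.8, 1.2)
    brightness = rng.uniform(-20.0, 20.0)
    img = (img - img.mean()) * contrast + img.mean() + brightness

    # Random small rotation: same angle for both; nearest-neighbour
    # interpolation (order=0) keeps the mask binary.
    angle = rng.uniform(-10.0, 10.0)
    img = ndimage.rotate(img, angle, reshape=False, order=1, mode="nearest")
    msk = ndimage.rotate(msk, angle, reshape=False, order=0, mode="constant", cval=0)

    # Random horizontal / vertical flips, applied to both.
    if rng.random() < 0.5:
        img, msk = img[:, ::-1], msk[:, ::-1]
    if rng.random() < 0.5:
        img, msk = img[::-1, :], msk[::-1, :]

    return np.clip(img, 0.0, 255.0), msk
```

Calling such a function 50 times for each of the 10 training images would yield 500 augmented pairs, which matches the count quoted above.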
[0043] Examples of augmented training images and ground truth images are shown in Figure 11. The training and test images were resized to 565x584 due to the tensor and convolutional structure of the network and the computational limitations arising from the hardware.
[0044] The modified Unet (Ocal et al., 2023) deep learning architecture with layer-based hybrid convolution was trained for 25 epochs on the resulting training images. The Dice loss function was used to obtain the loss value of the model during training. In the test phase, the Dice Similarity Coefficient (DSC) was used to measure the similarity between the model's predicted segmentation image and the actual segmentation image. The prediction images in Figure 12 were obtained during the test phase by testing the modified Unet architecture trained on the training data set. The resulting test ground truth images are overlaid on the PCR images with the help of the overlay algorithm.
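For reference, the Dice Similarity Coefficient and the corresponding Dice loss can be written in a few lines. This is the generic definition, shown with NumPy for clarity; the smoothing term `eps` and the function names are assumptions, and the actual training code would use the tensor operations of its deep learning framework.

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    """Dice Similarity Coefficient: 2 * |A ∩ B| / (|A| + |B|)."""
    pred = np.asarray(pred, dtype=float).ravel()
    target = np.asarray(target, dtype=float).ravel()
    intersection = np.sum(pred * target)
    # eps avoids division by zero when both masks are empty.
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

def dice_loss(pred, target, eps=1e-7):
    """Dice loss: approaches 0 as the predicted mask matches the target."""
    return 1.0 - dice_coefficient(pred, target, eps)
```

A prediction that covers one of two foreground pixels and nothing else scores 2·1 / (1 + 2) ≈ 0.667.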
[0045] The images of the isolates are cut from the overlaid images and resized with the help of the pseudocode given below. cv2 (part of OpenCV) was used to read and edit the images. cv2.THRESH_BINARY_INV and cv2.THRESH_OTSU were used on the difference image. Then, contours were found with cv2.findContours and the areas containing isolates were placed in bounding boxes. The boxed isolate regions were then extracted and saved.
[0046] import cv2
[0047] # Load image, convert to grayscale, apply Otsu threshold
[0048] image = Read image
[0049] original = image.copy()
[0050] gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
[0051] thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
[0052] # Use morphology to eliminate noise
[0053] kernel = Get structuring element
[0054] # Apply morphology
[0055] # Find contours
[0056] # Place isolates in bounding boxes
[0057] # Extract the region of interest (ROI) containing each detected isolate
[0058] # Resize the isolates and save
[0059] The graph in Figure 13 shows the Dice Similarity Coefficient and Dice Loss performance values obtained by training the deep learning architecture built with layer-based hybrid convolution on the sample isolates. The Dice Score is expected to approach 1, while the Dice Loss value, which expresses the loss of the model, is expected to approach 0. As seen in Figure 13, the Dice Score was 0.986 and the Dice Loss was 0.061. The qualitative analysis in Figures 9, 10 and 11 and the quantitative analysis in Figure 13 show that the artificial intelligence architecture designed and to be fine-tuned is an extremely robust model.
[0060] An artificial intelligence-supported algorithm was used to find the similarities between the isolate images and in the pre-processing step. Similarity algorithms commonly used in the literature include histogram-based approaches, the structural similarity index, feature-based approaches, and deep learning-based approaches. In histogram-based approaches, histograms capture the distribution of pixel values in an image. By comparing the histograms of two images, their similarity can be measured using a processor or graphical processing unit. Histogram intersection and histogram correlation metrics are commonly used for this purpose. Python's OpenCV library provides tools for calculating and comparing histograms. The structural similarity index (SSIM) is a widely used measure that evaluates the structural similarity between two images. It gives a score between -1 (dissimilar) and 1 (identical), taking into account brightness, contrast, and structure. The scikit-image library in Python offers an implementation of the structural similarity index. Feature-based approaches extract distinctive features such as edges, corners, or keypoints from the images. Methods such as Scale-Invariant Feature Transform (SIFT) and Speeded Up Robust Features (SURF) identify distinctive points in images, which can then be compared between images. The opencv-python library can be used for SIFT and SURF. In deep learning-based approaches, deep features can be extracted from the images by using pre-trained convolutional neural networks (CNNs) such as ResNet, VGG, and Inception. OpenAI's CLIP (Contrastive Language-Image Pre-training) is a multimodal, zero-shot image classifier that achieves impressive results in a wide range of areas without the need for fine-tuning. It applies the latest developments in large-scale transformers such as GPT-3 to the vision area.
As a result of the tests performed with these algorithms, it is seen that the similarity values of the isolates vary depending on the algorithm used. For example, while the 1st and 2nd isolates are determined as the most similar images with the SSIM algorithm, the 1st and 4th isolates are the most similar with the SIFT algorithm. For this reason, it is necessary to choose the best algorithm, because this decision directly determines the appearance of the resulting PCR tree and the position of similar isolates in the tree.
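To make the histogram-based approach mentioned above concrete, the sketch below computes histogram intersection and histogram correlation with NumPy. The function name `histogram_similarity` and the bin count are assumptions, and NumPy is used here instead of the OpenCV histogram tools purely to keep the example dependency-free.

```python
import numpy as np

def histogram_similarity(img_a, img_b, bins=32):
    """Return (intersection, correlation) of two grayscale intensity histograms."""
    hist_a, _ = np.histogram(img_a, bins=bins, range=(0, 256))
    hist_b, _ = np.histogram(img_b, bins=bins, range=(0, 256))

    # Normalise so that images of different sizes are comparable.
    hist_a = hist_a / hist_a.sum()
    hist_b = hist_b / hist_b.sum()

    intersection = float(np.minimum(hist_a, hist_b).sum())  # 1.0 = same histogram
    correlation = float(np.corrcoef(hist_a, hist_b)[0, 1])  # Pearson correlation
    return intersection, correlation
```

Because a histogram discards pixel positions, a spatially shuffled copy of an image scores as identical to the original. This illustrates why different similarity algorithms can rank the same isolates differently.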
[0061] The tests carried out within the scope of the invention showed that the best trees were those produced with the SSIM image similarity algorithm. SSIM is a structural similarity index that measures how similar two images are. It measures the brightness, contrast, and structure of the images and compares these values between the two images. It imitates some aspects of human perception and can be used to recognize patterns. This algorithm has different variants such as Multi-scale SSIM, Multi-component SSIM, structural similarity, Complex Wavelet SSIM, SSIMPLUS, and cSSIM, all of which were tested within the scope of the invention. It was concluded that the trees produced with the structural similarity version were better than the others. Among the variants, the Complex-Wavelet Structural Similarity Index (CW-SSIM) can theoretically produce more accurate results than structural similarity. However, it was judged unsuitable for practical use in terms of performance, since producing the similarity matrix of two images may take several hours on an average computer. For this reason, the structural similarity variant was preferred among the SSIM versions. Which result is more accurate was evaluated with reference to the trees produced with the UPGMA method.
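For orientation, the commonly used form of the structural similarity index for two image windows x and y is given below. The symbols are the usual ones and are not taken from this description: μ denotes the window mean, σ² the variance, σ_xy the covariance, and L the data range; K₁ = 0.01 and K₂ = 0.03 are the customary default constants.

```latex
\mathrm{SSIM}(x, y) =
  \frac{(2\mu_x \mu_y + C_1)\,(2\sigma_{xy} + C_2)}
       {(\mu_x^2 + \mu_y^2 + C_1)\,(\sigma_x^2 + \sigma_y^2 + C_2)},
\qquad C_1 = (K_1 L)^2, \quad C_2 = (K_2 L)^2
```

The overall score for an image pair is typically the mean of this quantity over all window positions.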
[0062] There are two common ways to measure how similar one image is to another. The first is the Mean Square Error (MSE), and the second is the Structural Similarity Index (SSIM). MSE calculates the mean of the squared differences between corresponding pixels of the two compared images. In contrast, SSIM looks for structural similarity between the pixels of the two images, that is, whether the pixels are aligned or have similar intensity values. The main problem with MSE is that it can take arbitrarily high values, which makes it more difficult to standardize. In general, the higher the MSE, the less similar the images are, but if the MSE differences between sets of images seem random, it is difficult to draw conclusions. SSIM is often compared with other measures such as MSE or PSNR, and many studies have reported that the SSIM index performs better than MSE and its derivatives in terms of accuracy. Another important point of the SSIM method is that it measures the error value when determining the similarity of two images. This value is used to determine the isolates close to each other in tree construction. For this reason, it must be accurately measured and interpreted.
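The point about MSE being hard to standardize can be seen in a short sketch. The function names are illustrative, and PSNR is included only because it is mentioned above as an MSE derivative.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two images of the same shape."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.mean((a - b) ** 2))

def psnr(a, b, data_range=255.0):
    """Peak signal-to-noise ratio in dB, derived from the MSE."""
    err = mse(a, b)
    if err == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(data_range ** 2 / err))
```

The same image pair gives an MSE of 100 on a 0-255 scale but roughly 0.0015 after scaling to 0-1, so MSE values are only comparable when the intensity scale is fixed, whereas SSIM is bounded regardless of scale.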
[0063] One of the most important requirements for calculating image similarity with the structural similarity method is that the images must have the same size and resolution. In this study, since the number of isolates varies from image to image, the sizes of the isolates also vary, yet the similarities of isolates of different sizes must be calculated. To meet this requirement, the two images were resized to the same resolution using the cv2 library. Since resizing is applied by reducing the larger image to the size of the smaller one, the interpolation value is given as INTER_AREA instead of INTER_LINEAR or INTER_CUBIC. INTER_AREA resamples using the pixel area relation. In addition, the SSIM method requires gray-scale images. Since the isolate images are already gray-scale, the method is well suited to the problem addressed here.
[0064] A score and a difference image are calculated using the SSIM function in the scikit-image library. The score represents the structural similarity index between the two images and lies in the range [-1, +1]. A structural similarity index equal to zero indicates that there is no similarity between the images, and a value of 1 indicates that the images are identical. The difference image contains the actual image differences between the two images so that they can be visualized. The difference image is a floating-point array with values in the range [0,1]. Therefore, the array must be converted to 8-bit unsigned integers before it can be used with OpenCV.
[0065] CV2 (part of OpenCV) was used to read and process the images. Thresholding was performed on the difference image using cv2.THRESH_BINARY_INV and cv2.THRESH_OTSU.
[0066] During the SSIM application phase, there are hyperparameters provided by the algorithm that affect its operation. Correct selection of these parameters affects the similarity between isolates. The parameters selected as a result of the pre-tests are as follows.
[0067] multichannel=True, gaussian_weights=True, sigma=1.5, use_sample_covariance=False, win_size=None, gradient=False, data_range=1.0, channel_axis=None, gaussian_weights=False, full=True
[0068] win_size: The side length of the window used in the comparison. It must be an odd value. If 'gaussian_weights' is true, this is ignored and the window size depends on 'sigma'.
[0069] gradient: bool, optional. If true, the gradient with respect to im2 is also returned.
data_range: float, optional. The data range of the input image (the distance between the minimum and maximum possible values). By default, this is estimated from the image data type. This estimate may be incorrect for floating-point image data. For this reason, it is recommended that this value always be specified explicitly.
[0070] Data range: If 'data_range' is not specified, the range is automatically estimated from the image data type. However, for floating-point image data, this estimate is twice the desired range, since "dtype_range" in "skimage.util.dtype.py" has ranges defined between -1 and +1. This gives an estimate of 2 instead of 1, which is the value usually required when working with image data (since negative light intensities are meaningless). When working with YCbCr-like color data, it should be noted that these ranges differ per channel. Cb and Cr have twice the range of Y, so a single call to this function cannot calculate a channel-averaged SSIM, because identical ranges are assumed for each channel. The isolate images discussed for this problem are structurally similar. For this reason, it is necessary to set the data range value to a wide range. Otherwise, black and white colors will be very close to each other in the color intensity distribution, which means that there will be little difference between the two images. This is not desirable for this problem. For this reason, it is useful to set a higher difference value.
[0071] channel_axis: int or None, optional. If None, the image is assumed to be a grayscale (single-channel) image. Otherwise, this parameter indicates which axis of the array corresponds to the channels.
[0072] gaussian_weights: bool, optional. If true, the mean and variance of each patch are spatially weighted by a normalized Gaussian kernel of width sigma=1.5.
[0073] full: bool, optional. If true, the full structural similarity image is also returned.
[0074] Phylogenetics aims to reveal the evolutionary relationships between all organisms. The molecular mechanisms of organisms indicate that they share a single ancestor. From this point of view, species can be related to each other. The graphical representation of these phylogenetic relationships is the "phylogenetic tree". Phylogenetic tree methods are divided into two groups: "distance-based" and "character-based". Distance-based trees are built by taking into account the distances between all taxa. Distance-based methods are in turn grouped into clustering-based and optimality-based methods. The UPGMA and NJ methods used in the present invention are in the clustering-based group. The optimality-based methods are Fitch-Margoliash and Minimum Evolution. The character-based methods are Maximum Parsimony, Maximum Likelihood, and Bayesian Analysis. The hierarchy described is shown in Figure 15.
[0075] The graph showing the conversion and flow required to apply UPGMA in said invention is given in Figure 16. UPGMA (Unweighted Pair Group Method with Arithmetic Mean) is an artificial intelligence-assisted hierarchical clustering algorithm that creates a dendrogram using taxonomic similarities. The process starts by grouping the two closest taxa. The remaining taxa are then grouped in order of increasing distance. As the distances increase, taxa are included in new groups and lie farther from each other.
[0076] There are many libraries used in phylogenetic analysis in Python, including Biopython, DendroPy, ETE Toolkit, PhyloSuite, and SciPy. Of these, SciPy was used in the invention to create the phylogenetic tree, including the UPGMA algorithm.
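As a small illustration of UPGMA with SciPy, average linkage (`method="average"`) in `scipy.cluster.hierarchy.linkage` corresponds to UPGMA. The labels and the toy distance matrix below are made up for the example and do not come from the invention's data.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

# Hypothetical isolates and a symmetric distance matrix (lower = closer).
labels = ["A", "B", "C", "D"]
dist = np.array([
    [0.0,  2.0,  6.0, 10.0],
    [2.0,  0.0,  6.0, 10.0],
    [6.0,  6.0,  0.0, 10.0],
    [10.0, 10.0, 10.0, 0.0],
])

# linkage expects the condensed (one-dimensional) form of the matrix.
condensed = squareform(dist)

# Average linkage is UPGMA.
Z = linkage(condensed, method="average")

# no_plot=True returns the tree layout without needing matplotlib.
tree = dendrogram(Z, labels=labels, no_plot=True)
print(Z[:, 2])  # merge heights
```

Here A and B merge first at distance 2, C joins them at the average distance 6, and D joins last at 10.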
[0077] UPGMA assumes that taxa with low values in its input matrix are closer to each other. In the invention, unlike the standard use of UPGMA, the similarity matrix in the data set contains percentage values, and taxa with higher percentages should be considered closer. For this reason, the percentage values need to be converted into a score suitable for UPGMA. To achieve this, the value in every matrix cell was subtracted from 100. Only values equal to zero were left unchanged. After this conversion, elements that are close to each other have a low value, as UPGMA expects.
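The conversion described above can be sketched as follows. The function name `percent_similarity_to_distance` and the sample values are hypothetical; only the rule itself (subtract from 100, leave zeros untouched) comes from the text.

```python
import numpy as np


def percent_similarity_to_distance(similarity_percent):
    """Turn percentage similarities into UPGMA-ready values (sketch)."""
    sim = np.asarray(similarity_percent, dtype=np.float64)
    # Cells equal to zero are kept as they are; every other cell becomes 100 - value.
    return np.where(sim > 0, 100.0 - sim, sim)


# Illustrative percentages for three isolates (made-up values).
similarity = np.array([
    [0, 90, 40],
    [90, 0, 50],
    [40, 50, 0],
])
distance = percent_similarity_to_distance(similarity)
```

Under this rule an off-diagonal similarity of exactly 0 also stays 0 and would therefore read as "closest"; the text does not say how such a case is handled.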
[0078] For each PCR image, a square matrix of size (number of classes) x (number of classes) was prepared to hold the similarity values, where the number of classes is determined automatically. This process is performed by the graphics processing unit. The data obtained from the SSIM similarity algorithm fill only the part of this matrix above the diagonal; the code then fills the part below the diagonal with the same values. In other words, the relationship between A and B and the relationship between B and A have the same value. The table was filled in this format by an artificial intelligence-powered algorithm.
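A minimal sketch of this mirroring step, assuming the upper triangle has already been filled. The helper name `mirror_upper_triangle` and the sample values are hypothetical.

```python
import numpy as np


def mirror_upper_triangle(upper):
    """Copy the values above the diagonal into the cells below it (sketch)."""
    upper = np.asarray(upper, dtype=np.float64)
    n_rows, n_cols = upper.shape
    if n_rows != n_cols:
        raise ValueError("the similarity matrix must be square")
    strict_upper = np.triu(upper, k=1)  # values strictly above the diagonal
    diagonal = np.diag(np.diag(upper))  # keep the diagonal as it is
    return strict_upper + strict_upper.T + diagonal


# Upper triangle filled with illustrative similarity percentages.
upper_only = [
    [0, 80, 60],
    [0, 0, 70],
    [0, 0, 0],
]
full_matrix = mirror_upper_triangle(upper_only)
```

After mirroring, the entry for (A, B) equals the entry for (B, A), which is the symmetry a distance-based clustering step relies on.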
[0079] A dendrogram is a tree diagram that shows relationships or hierarchical clustering between similar datasets. It can also be called a block tree. When used in the field of phylogenetics, this tree structure shows the evolutionary relationship between taxa. It is generally used to bring together points with similar characteristics and to understand data clusters. A dendrogram consists of data points and the connections between them. squareform is a function in the SciPy library for Python. Here it is used to flatten the similarity matrix, that is, to transform the matrix into a one-dimensional array. Clustering algorithms perform their distance calculations faster and more efficiently on data converted to this form. The linkage function determines the merging strategy used by the clustering algorithm. With this function, the processor creates a linkage matrix by applying the artificial intelligence algorithm or the UPGMA algorithm to the similarity matrix. In hierarchical clustering algorithms, the distance used when merging two clusters measures how similar or how far apart the two clusters are. The distance measure may vary depending on the clustering algorithm. Three common distance measures are:
[0080] 1. Nearest Neighbor Linkage (single linkage): It represents the distance between the closest elements of two clusters. In other words, the distance between two clusters X and Y is calculated as the smallest distance between an element of X and an element of Y.
[0081] 2. Farthest Neighbor Linkage (complete linkage): It represents the distance between the most distant elements of two clusters. In other words, the distance between two clusters X and Y is calculated as the largest distance between an element of X and an element of Y.
[0082] 3. Average Linkage: It represents the average distance between the elements of two clusters. In other words, the distance between two clusters X and Y is calculated as the average of the distances between the elements of X and the elements of Y.
[0083] Some algorithms, such as UPGMA, use average linkage in merging operations. In this case, when two clusters are merged, the average distance between the elements of the two clusters is used.
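The three measures listed above can be compared on a small example. The sketch below assumes a precomputed distance matrix and two clusters given as lists of row indices; the function name `cluster_distances` and the numbers are hypothetical.

```python
import numpy as np


def cluster_distances(distance_matrix, cluster_x, cluster_y):
    """Nearest, farthest and average distance between two clusters (sketch)."""
    d = np.asarray(distance_matrix, dtype=np.float64)
    # All pairwise distances between an element of X and an element of Y.
    pairwise = d[np.ix_(cluster_x, cluster_y)]
    return {
        "nearest": float(pairwise.min()),   # single linkage
        "farthest": float(pairwise.max()),  # complete linkage
        "average": float(pairwise.mean()),  # average linkage, as used by UPGMA
    }


# Illustrative distances between four isolates (made-up values).
distances = [
    [0, 2, 6, 10],
    [2, 0, 5, 9],
    [6, 5, 0, 4],
    [10, 9, 4, 0],
]
result = cluster_distances(distances, cluster_x=[0, 1], cluster_y=[2, 3])
# result == {"nearest": 5.0, "farthest": 10.0, "average": 7.5}
```

Here the pairwise distances between the two clusters are 6, 10, 5 and 9, so UPGMA would use their mean, 7.5, when deciding whether to merge them.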
[0084] Pseudocode:
[0085] # Create the similarity relationships and the node set
[0086] relationships = EmptySet()
[0087] node_set = EmptySet()
[0088] for record in data:
           nodes, similarity = SplitData(record)
[0089]     node_set.Update(nodes)
[0090]     relationships.Add((nodes[0], nodes[1], similarity))
[0091] # Sort the nodes and create the indices
[0092] sorted_nodes, node_indices = SortNodesAndCreateIndices(node_set)
[0093] # Create the similarity matrix
[0094] similarity_matrix = CreateMatrixOfZeros(sorted_nodes)
[0095] for node1, node2, similarity in relationships:
[0096]     i, j = GetNodeIndices(node1, node2, node_indices)
[0097]     if similarity > 0:
[0098]         similarity_matrix[i, j] = 100 - similarity
[0099]         similarity_matrix[j, i] = 100 - similarity
[0100]     else:
[0101]         similarity_matrix[i, j] = similarity
[0102]         similarity_matrix[j, i] = similarity
[0103] # Flatten the similarity matrix
[0104] flattened_similarity_matrix = FlattenMatrix(similarity_matrix)
[0105] # Create the linkage matrix for UPGMA
[0106] linkage_matrix = CreateLinkageMatrix(flattened_similarity_matrix, 'average')
[0107] # Draw the dendrogram
[0108] DrawDendrogram(linkage_matrix, sorted_nodes)
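The pseudocode above can be sketched as runnable Python using SciPy's `squareform`, `linkage` and `dendrogram` functions. This is one possible reading, not necessarily the invention's own code. The function name `build_upgma_tree`, the record layout (node, node, similarity percentage) and the sample values are assumptions. `no_plot=True` is used so that the leaf order is computed without drawing a figure.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform


def build_upgma_tree(records):
    """records: iterable of (node_a, node_b, similarity_percent) tuples."""
    relationships = [(a, b, float(s)) for a, b, s in records]

    # Sort the nodes and create the name -> index lookup.
    sorted_nodes = sorted({n for a, b, _ in relationships for n in (a, b)})
    node_indices = {name: i for i, name in enumerate(sorted_nodes)}

    # Fill the symmetric matrix with 100 - similarity, leaving zeros untouched.
    size = len(sorted_nodes)
    matrix = np.zeros((size, size))
    for a, b, similarity in relationships:
        i, j = node_indices[a], node_indices[b]
        value = 100.0 - similarity if similarity > 0 else similarity
        matrix[i, j] = value
        matrix[j, i] = value

    # Flatten the square matrix into the condensed one-dimensional form.
    condensed = squareform(matrix)
    # 'average' linkage corresponds to UPGMA.
    linkage_matrix = linkage(condensed, method="average")
    # Compute the dendrogram layout without drawing it.
    tree = dendrogram(linkage_matrix, labels=sorted_nodes, no_plot=True)
    return linkage_matrix, tree["ivl"]


# Illustrative similarity percentages for three isolates (made-up values).
records = [("A", "B", 90), ("A", "C", 40), ("B", "C", 50)]
linkage_matrix, leaf_order = build_upgma_tree(records)
```

With these values A and B merge first at a distance of 10, and C joins them at the average of its distances to A and B, (60 + 50) / 2 = 55.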
[0109] The functions added to the software are: automatic cropping of normal or inverted images that contain different numbers of bands and come from different devices; determination of the sections containing isolates; automatic cutting of the isolates into separate images; and removal of unwanted isolates and markers. A gradient-based methodology is used for image processing, with a deep learning-based methodology as an alternative.
[0110] To determine the similarities between the automatically obtained images that are to be included in the tree, both algorithmic and deep learning-based models are tested and used; the similarity algorithm that best produces the expected tree is then selected and optimized. The process steps of said method are as follows:
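One way to picture the gradient-based cropping is the following sketch. It is an illustration under stated assumptions, not the invention's algorithm. It assumes that the gradient is taken over the mean intensity of each horizontal line and that cut points lie where that gradient is large relative to its maximum. The function name `find_row_cut_points` and the `threshold_ratio` parameter are hypothetical.

```python
import numpy as np


def find_row_cut_points(image, threshold_ratio=0.5):
    """Return (top, bottom) row indices bounding the strongest intensity edges."""
    pixels = np.asarray(image, dtype=np.float64)
    row_profile = pixels.mean(axis=1)            # mean intensity per horizontal line
    row_gradient = np.abs(np.diff(row_profile))  # change between consecutive lines
    if row_gradient.size == 0 or row_gradient.max() == 0:
        return 0, pixels.shape[0]                # flat image: nothing to cut
    strong = np.flatnonzero(row_gradient >= threshold_ratio * row_gradient.max())
    top = int(strong[0]) + 1      # first row after the first strong edge
    bottom = int(strong[-1]) + 1  # first row after the last strong edge
    return top, bottom


# Synthetic gel: dark background with a bright region in rows 10 to 29.
gel = np.zeros((40, 20))
gel[10:30, :] = 200.0
top, bottom = find_row_cut_points(gel)
cropped = gel[top:bottom]
```

The same idea applied to column means (`axis=0`) would give candidate cut points between neighbouring lanes; real gel images would likely need smoothing and a tuned threshold.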
[0111] • Automatic production of phylogenetic trees,
• Automatic detection and marking of the average similarity value in phylogenetic tree construction,
[0112] • Automatic detection of isolate images, isolate names, isolate sources and groups, and automatic placement of these in the phylogenetic tree,
[0113] • Searching the pool, in future studies, for isolates from different studies that have been analyzed before, and producing phylogenetic trees by assembling isolates from different studies according to their similarities,
• Combining isolates from more than one PCR image in a single tree to produce comprehensive phylogenetic trees containing 100 or more different isolates.
[0114] This invention provides significant advantages in the analysis of multiple isolates. Instead of high-cost analyses such as PFGE, a phylogenetic tree of the same quality can be obtained with the less costly PCR method, thus reducing costs. In addition, traditional analyses are performed manually and take a long time; with the automatic determination of the bands by the invention, these processes become fast, error-free, and safe. Since the analysis can automatically recognize the bands in high-resolution gels, the margin of error is minimized.
[0115] In addition, by recording the data, the software makes it possible to compare the results with other studies, with the permission of the researchers. Finally, previous software could not combine gel images and therefore could not analyze multiple isolates together. The invention overcomes this problem by combining gel images, so that the degrees of relatedness of more than one isolate can be determined at the same time.
Claims
CLAIMS
1. A computer-aided method for determining the degree of affinity of bacteria in a desktop application, characterized in comprising steps of;
• Uploading the PCR isolate image by the user to the desktop application,
• Calculating gradient values for horizontal lines in the PCR isolate image using the data processing tool,
• Analyzing the calculated gradient values to cut the isolate from the bottom and top,
• Determining the cut-off point in the PCR isolate image and cutting the image by a processor based on these cut-off points,
• Detecting the part containing the isolates in the cropped image by the artificial intelligence-powered image processing unit,
• Calculating the gradient values for the vertical lines by a processor or graphical processing unit and the determination of the cut-off points by analyzing the gradient values,
• Detection of the start and end points of each isolate in the cropped PCR images according to the cut-off points by the artificial intelligence-powered image processing algorithm.
2. The method according to claim 1, characterized in that, the step of “Detecting the part containing the isolates” comprises steps of;
• Comparing the isolates using the artificial intelligence-powered algorithm and creating a similarity list,
• Pre-processing the data using a processor or graphical processing unit to create an isolate similarity sequence,
• Creating a similarity matrix using the processor,
• Creating the square form of the similarity matrix by the artificial intelligence algorithm or the processor,
• Converting the square form matrix to the linkage matrix by the graphical processing unit,
• Creating a dendrogram using the artificial intelligence-powered algorithm.
3. The method according to claim 2, characterized in that the PCR tree of said dendrogram is created by an artificial intelligence-powered algorithm using the average line.
4. The method according to claim 2, characterized in that the PCR tree of said dendrogram is created by an artificial intelligence-powered algorithm.
5. The method according to claim 2, characterized in that the PCR tree of said dendrogram is created by an artificial intelligence-powered algorithm using the line and isolate images.
6. The method according to claim 2, characterized in that the PCR tree of said dendrogram is created by the artificial intelligence-powered algorithm using the average line, isolate images, and isolate groups.
7. A data processing device characterized in that it comprises means for implementing the methods defined in claim 1 or claim 2.
8. A desktop application containing instructions to implement the methods specified in claim 1 or claim 2.