Next-generation artificial intelligence bioinformatics individual geographic provenance method
A next-generation AI-based DNA individual tracing model is built on a spatial attention mechanism and a geographic distance loss function. It addresses several challenges that existing technologies face in wildlife tracing, predicts individual geographic locations accurately, and improves the applicability and accuracy of the model.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Patents(China)
- Current Assignee / Owner: BEIJING FORESTRY UNIVERSITY
- Filing Date: 2025-07-02
- Publication Date: 2026-06-19
AI Technical Summary
Existing DNA tracing technologies in the field of wildlife conservation face challenges such as the conflict between the development of cross-species universal models and the specific needs of genetic markers between species, the difficulty in balancing the high accuracy requirements of geographic tracing with the applicability of low-quality DNA samples, and the insufficient learning ability of models for small samples, resulting in unmet needs for rapid tracing.
We employ a new generation of artificial intelligence bioinformatics individual geographic tracing method, utilizing spatial attention mechanism and geographic distance loss function to construct an artificial intelligence DNA individual tracing model. By combining convolutional neural network and multilayer perceptron, and optimizing model parameters through data augmentation and feature recoding, we can achieve accurate prediction of individual geographic locations.
It improves the accuracy of geographic origin tracing and the generalization ability of the model, simplifies the dataset integration process, reduces the dependence on the amount of sample data, enhances the applicability to low-quality DNA samples, and enables accurate prediction of individual geographic locations.
Figure CN120808897B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of wildlife DNA geographic tracing technology, and in particular to a new generation of artificial intelligence bioinformatics individual geographic tracing method.
Background Technology
[0002] With the rapid development of machine learning and high-throughput sequencing technologies, DNA molecular tracing technology has made new progress. It has undergone several technological iterations, as detailed below:
[0003] First-generation DNA tracing technology (1980s-2000s): Haplotype network analysis based on mitochondrial single-gene markers (COI / Cytb / D-loop) achieved species-level identification (accuracy >90%) and large-scale geographic tracing (dependent on known population background). China achieved a breakthrough in its forensic identification system by introducing and assimilating this technology, but was constrained by access restrictions to international genetic databases. This technology met enforcement needs in the early stages of global wildlife forensic identification, but it cannot address the fine-grained tracing challenges posed by today's networked smuggling.
[0004] Second-generation DNA tracing technologies (2000s-2010s): Population genetic structure analysis, represented by SNP chips and STRUCTURE models, improved tracing resolution to the subpopulation level. This technology system supported the application of conservation genetics in ancestry tracing, but it has three major limitations: ① it relies on prior population division; ② computational complexity increases exponentially with the number of loci; ③ it cannot resolve the genetic geographical gradient of continuously distributed species.
[0005] Third-generation DNA tracing technology (2020s-): Deep learning-based whole-genome spatial decoding technology overcomes the limitations of linear dimensionality reduction, achieving a positioning accuracy of <100 km in human population tracing. However, it faces three major adaptation barriers in the field of wildlife conservation: ① 90% of the samples involved in cases contain low-quality degraded DNA (genotype missing rate >35%); ② 85% of CITES Appendix species lack reference genomes; ③ cross-border data sharing barriers lead to insufficient training samples (67% of species have n<20). These technical bottlenecks severely limit the ability to meet the need for rapid tracing of seized items.
[0006] Existing DNA tracing technologies face the following three challenges: (1) there is a conflict between the development of universal cross-species models and the specific requirements of genetic markers between species; (2) it is difficult to reconcile the high precision requirements (individual location) of geographic tracing with the applicability of low-quality DNA samples; and (3) the models' insufficient learning ability on small samples restricts their application in tracing endangered species. Therefore, it is urgent to develop a new generation of DNA tracing methods to solve these problems.
Summary of the Invention
[0007] The purpose of this invention is to provide a new generation of artificial intelligence bioinformatics individual geographic tracing method. Based on a spatial attention mechanism and a new geographic distance loss function, a new generation of artificial intelligence DNA individual tracing model is constructed to predict the precise geographic location of an individual.
[0008] To achieve the above objectives, this invention provides a new generation of artificial intelligence bioinformatics-based individual geographic tracing method, comprising the following steps:
[0009] S1. Obtain DNA from biological samples;
[0010] S2. Based on the convolutional neural network module, combined with the CBAM spatial attention module and multilayer perceptron, a new generation of artificial intelligence DNA individual tracing model is constructed.
[0011] S3. Using the genotype data and background sampling geographic data of the samples as inputs to the model, train the created artificial intelligence DNA individual tracing model, and use cross-validation to tune and optimize the model parameters.
[0012] S4. Use an artificial intelligence DNA individual tracing model to predict the geographic coordinates of individual DNA in unknown samples.
[0013] Preferably, in S1, microsatellites of the sample are obtained using species-specific genetic markers and simple, rapid PCR amplification. If the species is unclear, whole-genome resequencing data or reduced-representation genome data of the sample are obtained rapidly through high-throughput sequencing.
[0014] Preferably, in S2, the input to the CBAM spatial attention module in the AI DNA individual tracing model is the re-encoded genotype feature matrix, which is reshaped into a 32×32 matrix $X$. Specifically:
[0015] Spatial descriptor generation:
[0016] Max pooling and average pooling are performed along the channel dimension:
[0017] $F_{\max} = \mathrm{MaxPool}(X)$;
[0018] $F_{\mathrm{avg}} = \mathrm{AvgPool}(X)$;
[0019] where $F_{\max}$ and $F_{\mathrm{avg}}$ are the resulting spatial descriptors;
[0020] The pooling results are concatenated:
[0021] $F_{\mathrm{cat}} = [F_{\max}; F_{\mathrm{avg}}]$;
[0022] Spatial attention weight calculation:
[0023] Spatial dependencies are extracted using a convolutional layer:
[0024] $A_{\mathrm{spatial}} = \sigma(\mathrm{Conv}_{3\times 3}(F_{\mathrm{cat}}))$;
[0025] where $\mathrm{Conv}_{3\times 3}$ denotes a 3×3 convolution kernel and $\sigma$ is the Sigmoid activation function.
[0026] Feature enhancement:
[0027] The attention weights are applied to the original feature matrix:
[0028] $X_{\mathrm{attn}} = X \odot A_{\mathrm{spatial}}$;
[0029] Here, $\odot$ represents element-wise multiplication, which enhances spatial correlation features;
[0030] The input to the convolutional neural network module in the AI DNA individual tracing model is the feature matrix enhanced with spatial attention.
[0031] The network structure of the convolutional neural network module is as follows:
[0032] Convolutional layer 1:
[0033]
[0034] Use 64 3×3 convolution kernels with stride = 1 and ReLU activation function;
[0035] Pooling layer 1:
[0036]
[0037] Convolutional layer 2:
[0038]
[0039] Pooling layer 2:
[0040]
[0041] Feature flattening:
[0042]
[0043] The input to the multilayer perceptron in the AI DNA individual tracing model is the flattened feature vector.
[0044] The network structure of a multilayer perceptron is as follows:
[0045] Fully connected layer 1:
[0046]
[0047] in,
[0048] Fully connected layer 2:
[0049]
[0050] where the output is the predicted latitude and longitude coordinates.
[0051] Preferably, during the training process of the AI DNA individual tracing model, if the sample background information includes the geographical coordinates of the individual's place of origin, the AI DNA individual tracing model integrates geographical information as a reference; if the sample does not have geographical information of the sampling location, the AI DNA individual tracing model predicts the geographical coordinates of the sample by analyzing the population genetic characteristic values and by transforming and rotating the genetic characteristic values and the relationship with geographical information.
[0052] Preferably, step S3 specifically includes the following steps:
[0053] S31. Data augmentation based on upsampling using a random mask;
[0054] S32. Re-encode genotype data;
[0055] S33. Adaptively adjust the CBAM spatial attention module, retaining its spatial attention mechanism and omitting the channel attention part;
[0056] S34. Perform convolutional regression processing;
[0057] S35. Perform geospatial dimension mapping in the artificial intelligence DNA individual tracing model.
[0058] Preferably, in S31, given an original dataset $D$:
[0059] $D = \{(x_i, y_i)\}_{i=1}^{N}$;
[0060] where $x_i$ represents the genotype data of sample $i$, $y_i$ represents its geographic coordinates, and $N$ is the original number of samples;
[0061] New samples are generated using the following steps:
[0062] First, a sample $(x_i, y_i)$ is randomly selected from $D$, and a new coordinate $y'_i$ is generated within a 50-kilometer radius of the original coordinate $y_i$, ensuring:
[0063] $\mathrm{Haversine}(y_i, y'_i) \le 50\,\mathrm{km}$;
[0064] where Haversine represents the spherical distance function:
[0065] $\mathrm{Haversine}(y_i, y'_i) = 2r \arcsin\left(\sqrt{\sin^2\left(\frac{\Delta\varphi}{2}\right) + \cos\varphi_1 \cos\varphi_2 \sin^2\left(\frac{\Delta\lambda}{2}\right)}\right)$;
[0066] $\Delta\lambda = \lambda_2 - \lambda_1$;
[0067] $\Delta\varphi = \varphi_2 - \varphi_1$;
[0068] where $r$ is the Earth's radius, $\varphi_1$ and $\varphi_2$ are the latitudes of the original point and the perturbation point, respectively, and $\lambda_1$ and $\lambda_2$ are the longitudes of the original point and the perturbation point, respectively.
[0069] Simultaneously, a random mask is generated for the SNP data: the SNP genotype $x_i$ becomes $x'_i$ after random masking, where each SNP site is independently masked with a 20% probability:
[0070] $x'_{ij} = \begin{cases} \text{masked}, & \text{with probability } 0.2 \\ x_{ij}, & \text{with probability } 0.8 \end{cases}$;
[0071] where $x_{ij}$ represents the $j$-th SNP site in sample $i$;
[0072] The final augmented dataset is defined as follows:
[0073] $D' = D \cup \{(x'_i, y'_i)\}_{i=1}^{M}$;
[0074] where $M$ is the number of newly generated samples.
[0075] Preferably, in S32, different machine learning algorithms, including multilayer perceptrons, variational autoencoders, and generative adversarial networks, are used to compress or expand the SNP data into a fixed 1024-dimensional representation. Given an input SNP matrix $X \in \mathbb{R}^{N \times d}$, where $N$ represents the number of samples and $d$ represents the original feature dimension, a transformation function $f_\theta$ parameterized by the neural network is defined, and each sample $x_i$ is mapped to a normalized 1024-dimensional vector:
[0076] $z_i = f_\theta(x_i)$;
[0077] $z_i \in \mathbb{R}^{1024}$;
[0078] To improve the balance between computational efficiency and model generalization ability in different source tracing scenarios, the model includes a two-layer fully connected encoding neural network architecture, as shown below:
[0079] $z_i = \sigma(W x_i + b)$;
[0080] where $W$ is the weight matrix, $b$ is the bias term, and $\sigma(\cdot)$ represents the nonlinear activation function.
[0081] Preferably, in S33, given the reshaped 32×32 feature matrix $X$, max pooling and average pooling are performed along the channel dimension to obtain two spatial descriptors:
[0082] $X_{\max} = \mathrm{MaxPool}(X)$;
[0083] $X_{\mathrm{avg}} = \mathrm{AvgPool}(X)$;
[0084] The pooled feature maps are concatenated along the channel dimension to obtain:
[0085] $X_{\mathrm{cat}} = [X_{\max}, X_{\mathrm{avg}}]$;
[0086] Convolutional layers are used to extract spatial dependencies to obtain:
[0087] $X_{\mathrm{conv}} = \sigma(\mathrm{Conv}(X_{\mathrm{cat}}))$;
[0088] where $\sigma(\cdot)$ represents the sigmoid activation function, which ensures that the spatial attention weights are normalized in the range [0, 1].
[0089] Finally, the learned spatial attention weights are applied to the original feature map using element-wise multiplication:
[0090] $X' = X \odot X_{\mathrm{conv}}$.
[0091] Preferably, in S34, a custom distance-based loss function is defined. The calculation formula is:
[0092]
[0093] where the distance term represents the distance between the actual coordinates and the predicted coordinates;
[0094] Given a dataset containing $m$ samples, let $y_i$ and $\hat{y}_i$ be the true and predicted latitude and longitude coordinates of the $i$-th sample, respectively; the Euclidean distance between them is:
[0095] $d_i = \lVert y_i - \hat{y}_i \rVert_2$;
[0096] The distance-weighted MSE loss is defined as:
[0097]
[0098] where $w_i$ is the weight assigned to each sample based on the Euclidean distance $d_i$, defined as follows:
[0099]
[0100] Preferably, in S35, the coefficient of determination $R^2$ is used to measure the goodness of fit of the regression model:
[0101] $R^2 = 1 - \dfrac{\sum_{i=1}^{m} (d_i - \hat{d}_i)^2}{\sum_{i=1}^{m} (d_i - \bar{d})^2}$;
[0102] where $d_i$ and $\hat{d}_i$ are the actual and predicted geographic coordinates, and $\bar{d}$ is the average of the actual coordinates;
[0103] The mean distance error (MDE) is calculated as follows:
[0104] $\mathrm{MDE} = \dfrac{1}{m} \sum_{i=1}^{m} \mathrm{Haversine}(d_i, \hat{d}_i)$;
[0105] Here, Haversine refers to the spherical distance between the actual coordinates and the predicted coordinates, and $m$ is the number of samples;
[0106] A spatial confidence criterion is introduced: when the predicted coordinates fall within the confidence radius $r'$ of the true location, the prediction is considered accurate. For a given confidence radius $r'$, accuracy is defined as:
[0107] $\mathrm{Acc}(r') = \dfrac{1}{m} \sum_{i=1}^{m} \mathbb{I}\left(\mathrm{Haversine}(d_i, \hat{d}_i) \le r'\right)$;
[0108] where $\mathbb{I}(\cdot)$ is an indicator function; it is 1 if the predicted coordinates fall within the confidence radius $r'$, and 0 otherwise.
[0109] Therefore, this invention adopts the above-mentioned next-generation artificial intelligence bioinformatics individual geographic tracing method, and constructs a next-generation artificial intelligence DNA individual tracing model based on spatial attention mechanism and new geographic distance loss function, thereby enabling the prediction of the precise geographic location of an individual.
[0110] The technical solution of the present invention will be further described in detail below with reference to the accompanying drawings and embodiments.
Description of the Drawings
[0111] Figure 1 is a flowchart of an embodiment of the new generation of artificial intelligence bioinformatics individual geographic tracing method of the present invention;
[0112] Figure 2 is a schematic diagram of the AI DNA individual tracing model framework and training process of an embodiment of the new generation AI bioinformatics individual geographic tracing method of the present invention;
[0113] Figure 3 is a schematic diagram of the test results according to an embodiment of the present invention, wherein a is the accuracy of predictions falling within 200 km of the true location, and b is the accuracy of predictions falling within 50 km of the true location.
Detailed Implementation
[0114] The technical solution of the present invention will be further described below with reference to the accompanying drawings and embodiments.
[0115] Example 1
[0116] As shown in Figure 1, this invention provides a new generation of artificial intelligence bioinformatics individual geographic tracing method, including the following steps:
[0117] S1. Obtain DNA from biological samples.
[0118] Microsatellites of a sample can be obtained using species-specific genetic markers and simple, rapid PCR amplification. If the species is unclear, whole-genome resequencing data or reduced-representation genome data of the sample can be obtained rapidly through high-throughput sequencing.
[0119] S2. Based on the convolutional neural network module, combined with the CBAM spatial attention module and multilayer perceptron, a new generation of artificial intelligence DNA individual tracing model is constructed.
[0120] The core of the AI DNA individual tracing model consists of a CBAM spatial attention module, a convolutional neural network module (2 convolutional and 2 pooling layers), and a multilayer perceptron (Figure 2a). In Figure 2, a shows the model framework; b shows the spatial attention mechanism; and c-e show the model's workflow, where c indicates the integration of heterogeneous datasets from different species and geographical locations, d indicates that the data is standardized separately before training and restored after training, and e indicates the pre-training loss rate.
[0121] The input to the CBAM spatial attention module in the AI DNA individual tracing model is the re-encoded genotype feature matrix, which is reshaped into a 32×32 matrix $X$. Specifically:
[0122] Spatial descriptor generation:
[0123] Max pooling and average pooling are performed along the channel dimension:
[0124] $F_{\max} = \mathrm{MaxPool}(X)$
[0125] $F_{\mathrm{avg}} = \mathrm{AvgPool}(X)$
[0126] where $F_{\max}$ and $F_{\mathrm{avg}}$ are the resulting spatial descriptors.
[0127] The pooling results are concatenated:
[0128] $F_{\mathrm{cat}} = [F_{\max}; F_{\mathrm{avg}}]$
[0129] Spatial attention weight calculation:
[0130] Spatial dependencies are extracted using a convolutional layer:
[0131] $A_{\mathrm{spatial}} = \sigma(\mathrm{Conv}_{3\times 3}(F_{\mathrm{cat}}))$
[0132] where $\mathrm{Conv}_{3\times 3}$ denotes a 3×3 convolution kernel and $\sigma$ is the Sigmoid activation function.
[0133] Feature enhancement:
[0134] The attention weights are applied to the original feature matrix:
[0135] $X_{\mathrm{attn}} = X \odot A_{\mathrm{spatial}}$
[0136] Here, $\odot$ represents element-wise multiplication, which enhances spatial correlation features.
[0137] The input to the convolutional neural network module in the AI DNA individual tracing model is the feature matrix enhanced with spatial attention.
[0138] The network structure of the convolutional neural network module is as follows:
[0139] Convolutional layer 1:
[0140]
[0141] Use 64 3×3 convolution kernels with a stride of 1 and the activation function is ReLU.
[0142] Pooling layer 1:
[0143]
[0144] Convolutional layer 2:
[0145]
[0146] Pooling layer 2:
[0147]
[0148] Feature flattening:
[0149]
[0150] The input to the multilayer perceptron in the AI DNA individual tracing model is the flattened feature vector.
[0151] The network structure of a multilayer perceptron is as follows:
[0152] Fully connected layer 1:
[0153]
[0154] in,
[0155] Fully connected layer 2:
[0156]
[0157] where the output is the predicted latitude and longitude coordinates.
[0158] The AI-powered DNA tracing model dynamically weights the spatial signal contribution of multimodal data through an attention mechanism, enabling the model to integrate data from different species and batches. This significantly simplifies the dataset integration process and achieves geographic information fusion of multimodal genetic data. It significantly reduces complexity and improves reproducibility, while data augmentation reduces the impact of sample size on model performance.
[0159] S3. Train the constructed AI DNA individual tracing model.
[0160] During the training of the AI-powered DNA individual tracing model, if the sample background information includes the geographical coordinates of the individual's origin, the AI-powered DNA individual tracing model integrates geographical information as a reference to optimize and improve model performance. If the sample does not have geographical information about the sampling location, the AI-powered DNA individual tracing model predicts the geographical coordinates of the sample by analyzing the population's genetic characteristic values and by transforming and rotating the genetic characteristic values and the relationship with geographical information.
[0161] The training process of the AI-based DNA tracing model specifically includes the following steps:
[0162] S31. Data augmentation based on upsampling with a random mask.
[0163] The number and quality of samples of endangered protected wild animals are extremely limited. In order to improve the accuracy and reliability of tracing, this embodiment adopts an upsampling strategy based on random masking, which intelligently increases the size of the dataset while maintaining the integrity of its genetic data and background information.
[0164] Given an original dataset $D$:
[0165] $D = \{(x_i, y_i)\}_{i=1}^{N}$
[0166] where $x_i$ represents the genotype data of sample $i$ (using SNPs as an example), $y_i$ represents its geographic coordinates (latitude and longitude), and $N$ is the number of original samples.
[0167] New samples are generated using the following steps:
[0168] First, a sample $(x_i, y_i)$ is randomly selected from $D$, and a new coordinate $y'_i$ is generated within a 50-kilometer radius of the original coordinate $y_i$, ensuring:
[0169] $\mathrm{Haversine}(y_i, y'_i) \le 50\,\mathrm{km}$
[0170] Here, Haversine represents the spherical distance function:
[0171] $\mathrm{Haversine}(y_i, y'_i) = 2r \arcsin\left(\sqrt{\sin^2\left(\frac{\Delta\varphi}{2}\right) + \cos\varphi_1 \cos\varphi_2 \sin^2\left(\frac{\Delta\lambda}{2}\right)}\right)$
[0172] $\Delta\lambda = \lambda_2 - \lambda_1$
[0173] $\Delta\varphi = \varphi_2 - \varphi_1$
[0174] where $r$ is the Earth's radius, $\varphi_1$ and $\varphi_2$ are the latitudes (in radians) of the original point and the perturbation point, respectively, and $\lambda_1$ and $\lambda_2$ are the longitudes of the original point and the perturbation point, respectively.
[0175] Simultaneously, a random mask is generated for the SNP data: the SNP genotype $x_i$ becomes $x'_i$ after random masking, where each SNP site is independently masked with a 20% probability:
[0176] $x'_{ij} = \begin{cases} \text{masked}, & \text{with probability } 0.2 \\ x_{ij}, & \text{with probability } 0.8 \end{cases}$
[0177] where $x_{ij}$ represents the $j$-th SNP site in sample $i$.
[0178] The final augmented dataset is defined as follows:
[0179] $D' = D \cup \{(x'_i, y'_i)\}_{i=1}^{M}$
[0180] where $M$ is the number of newly generated samples.
[0181] To ensure the validity of the sample's geographical information, the perturbation follows a uniform distribution within a specified 50-kilometer range, avoiding areas where real samples are known to be absent. The random masking process maintains the integrity of the genetic structure while simulating patterns of missing data in the real world.
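The augmentation in S31 can be illustrated with a minimal sketch, which is not the patented implementation. It assumes coordinates are (latitude, longitude) pairs in degrees, a mean Earth radius of 6371 km, and a hypothetical `MISSING = -1` code for a masked SNP site; the function names are illustrative only. The perturbed point is drawn approximately uniformly by area inside the 50 km disc, and the step that avoids areas where real samples are known to be absent is not modelled here.

```python
import math
import random

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (the text only names it r)
MISSING = -1              # hypothetical code for a masked SNP site


def haversine_km(lat1, lon1, lat2, lon2):
    """Spherical (great-circle) distance in km; inputs in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def perturb_coordinate(lat, lon, max_km=50.0, rng=random):
    """Draw a point within max_km of (lat, lon), roughly uniform by area."""
    dist = max_km * math.sqrt(rng.random())  # sqrt avoids clustering at the centre
    bearing = rng.uniform(0.0, 2.0 * math.pi)
    delta = dist / EARTH_RADIUS_KM           # angular distance in radians
    phi1, lam1 = math.radians(lat), math.radians(lon)
    phi2 = math.asin(math.sin(phi1) * math.cos(delta)
                     + math.cos(phi1) * math.sin(delta) * math.cos(bearing))
    lam2 = lam1 + math.atan2(math.sin(bearing) * math.sin(delta) * math.cos(phi1),
                             math.cos(delta) - math.sin(phi1) * math.sin(phi2))
    return math.degrees(phi2), math.degrees(lam2)


def mask_snps(genotype, mask_prob=0.2, rng=random):
    """Mask each SNP site independently with probability mask_prob."""
    return [MISSING if rng.random() < mask_prob else g for g in genotype]


def augment(dataset, n_new, rng=random):
    """Return the original samples plus n_new masked, perturbed copies."""
    new_samples = []
    for _ in range(n_new):
        genotype, (lat, lon) = rng.choice(dataset)
        new_samples.append((mask_snps(genotype, rng=rng),
                            perturb_coordinate(lat, lon, rng=rng)))
    return list(dataset) + new_samples


if __name__ == "__main__":
    data = [([0, 1, 2, 1, 0, 2], (30.0, 110.0)), ([2, 2, 0, 1, 0, 0], (25.0, 102.0))]
    for genotype, coord in augment(data, 3, rng=random.Random(0)):
        print(genotype, coord)
```

Passing a seeded `random.Random` instance as `rng` makes the augmentation reproducible, which may help when comparing cross-validation runs.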
[0182] S32. Re-encode the genotype data.
[0183] Due to limitations in sequencing methods, datasets obtained from different batches of the same species often exhibit different dimensionalities. Even when using the same sequencing technology, differences in genome size between different species can lead to inconsistencies in dataset dimensionality. These differences hinder the cross-species analysis of geographic genetic information from DNA markers. To address this issue, this embodiment employs a re-encoding strategy to standardize genotype data to a uniform dimensional space.
[0184] Several different machine learning algorithms, including multilayer perceptron (MLP), variational autoencoder (VAE), and generative adversarial network (GAN), are used to compress or expand SNP data into a fixed 1024-dimensional representation.
[0185] Given an input SNP matrix $X \in \mathbb{R}^{N \times d}$, where $N$ represents the number of samples and $d$ represents the original feature dimension (which varies in different datasets), a transformation function $f_\theta$ parameterized by the neural network is defined, and each sample $x_i$ is mapped to a normalized 1024-dimensional vector:
[0186] $z_i = f_\theta(x_i)$
[0187] $z_i \in \mathbb{R}^{1024}$
[0188] To improve the balance between computational efficiency and model generalization ability in different source tracing scenarios, the model includes a two-layer fully connected encoding neural network architecture, as shown below:
[0189] $z_i = \sigma(W x_i + b)$
[0190] where $W$ is the weight matrix, $b$ is the bias term, and $\sigma(\cdot)$ denotes the nonlinear activation function. This re-encoding framework achieves dimensionality normalization while preserving genetic variation, which is crucial for source tracing analysis.
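As a rough illustration of the re-encoding formula $z_i = \sigma(W x_i + b)$, the sketch below maps SNP vectors of different lengths to the same 1024-dimensional space. It makes several assumptions not stated in the text: the weights are random and untrained (a real encoder would be a trained MLP, VAE, or GAN), `tanh` stands in for the unspecified activation, "normalized" is read as L2 normalization, and only the single layer shown in the formula is implemented. The names `make_encoder` and `encode` are hypothetical.

```python
import numpy as np


def make_encoder(d_in, d_out=1024, seed=0):
    """Create untrained weights W (d_out x d_in) and bias b; illustrative only."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 1.0 / np.sqrt(d_in), size=(d_out, d_in))
    b = np.zeros(d_out)
    return W, b


def encode(x, W, b):
    """Compute z = sigma(W x + b), then L2-normalise (assumed normalisation)."""
    z = np.tanh(W @ x + b)  # tanh is an assumed choice of sigma
    norm = np.linalg.norm(z)
    return z / norm if norm > 0 else z


if __name__ == "__main__":
    # Two datasets with different numbers of SNP sites end up in the same space.
    for d in (300, 2000):
        W, b = make_encoder(d)
        x = np.random.default_rng(1).integers(0, 3, size=d).astype(float)
        print(d, "->", encode(x, W, b).shape)
```

The point of the sketch is only the shape contract: whatever the original dimension `d`, every sample leaves the encoder as a length-1024 vector that can later be reshaped to 32×32.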
[0191] S33. Adaptively adjust the CBAM spatial attention module, retaining its spatial attention mechanism and omitting the channel attention part.
[0192] To enhance the genetic feature representation of the re-encoded genotype data, the CBAM spatial attention module was adaptively adjusted, retaining its spatial attention mechanism while omitting the channel attention component. The re-encoded 1×1024 vector was reshaped into a 32×32 matrix, and spatial dependencies were extracted through convolution operations.
[0193] To enhance spatial feature representation, a spatial attention mechanism is introduced during feature extraction. The spatial attention module generates an attention weight matrix through max pooling, average pooling, and convolution operations, thereby optimizing the feature map.
[0194] Given the reshaped 32×32 feature matrix $X$, max pooling and average pooling are performed along the channel dimension to obtain two spatial descriptors:
[0195] $X_{\max} = \mathrm{MaxPool}(X)$
[0196] $X_{\mathrm{avg}} = \mathrm{AvgPool}(X)$
[0197] The pooled feature maps are concatenated along the channel dimension to obtain:
[0198] $X_{\mathrm{cat}} = [X_{\max}, X_{\mathrm{avg}}]$
[0199] Convolutional layers are used to extract spatial dependencies to obtain:
[0200] $X_{\mathrm{conv}} = \sigma(\mathrm{Conv}(X_{\mathrm{cat}}))$
[0201] where $\sigma(\cdot)$ represents the sigmoid activation function, which ensures that the spatial attention weights are normalized in the range [0, 1].
[0202] Finally, the learned spatial attention weights are applied to the original feature map using element-wise multiplication:
[0203] $X' = X \odot X_{\mathrm{conv}}$
[0204] This spatial attention mechanism dynamically enhances spatially relevant features while suppressing regions with less information, thereby improving the feature extraction capabilities of downstream tasks.
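The spatial attention steps above can be sketched in NumPy as follows. This is an illustrative forward pass under stated assumptions, not the patented model: the 3×3 kernel is supplied by the caller rather than learned, zero padding is assumed so the attention map keeps the 32×32 size, and the function name `spatial_attention` is hypothetical. With a single-channel input, max and average pooling along the channel axis give the same map, so the function is written for a general channel count.

```python
import numpy as np


def spatial_attention(x, kernel, bias=0.0):
    """Apply CBAM-style spatial attention.

    x:      feature map of shape (C, H, W)
    kernel: convolution weights of shape (2, 3, 3), one 3x3 slice per descriptor
    Returns the re-weighted feature map and the (H, W) attention weights.
    """
    x_max = x.max(axis=0)                      # max pooling along channels
    x_avg = x.mean(axis=0)                     # average pooling along channels
    x_cat = np.stack([x_max, x_avg])           # concatenated descriptors, (2, H, W)
    padded = np.pad(x_cat, ((0, 0), (1, 1), (1, 1)))  # assumed zero padding
    h, w = x_max.shape
    logits = np.full((h, w), bias, dtype=float)
    for di in range(3):
        for dj in range(3):
            # cross-correlation over both descriptor channels
            window = padded[:, di:di + h, dj:dj + w]
            logits += (kernel[:, di, dj, None, None] * window).sum(axis=0)
    weights = 1.0 / (1.0 + np.exp(-logits))    # sigmoid keeps weights in (0, 1)
    return x * weights, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z = rng.normal(size=1024)                  # a re-encoded 1x1024 vector
    x = z.reshape(1, 32, 32)                   # reshaped to a 32x32 map, 1 channel
    kernel = rng.normal(scale=0.1, size=(2, 3, 3))
    x_attn, weights = spatial_attention(x, kernel)
    print(x_attn.shape, weights.shape)
```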
[0205] S34. Perform convolutional regression processing.
[0206] The input to convolutional regression is a 32×32 feature matrix with spatial attention weights, which is then processed through a series of convolutional and fully connected layers.
[0207] In regression tasks, mean squared error (MSE) is one of the most widely used loss functions. While MSE provides a global measure of loss, it ignores local variations. This often leads the model to predict clusters of points around a few central locations to minimize the overall loss, but reduces the accuracy of predictions for individual samples.
[0208] To address this issue, this embodiment defines a custom distance-based loss function. The loss function does not treat all samples equally; instead, it introduces weights based on the Euclidean distance between the predicted and actual locations. This adjustment prompts the model to minimize the error at distant points, thereby improving the accuracy of geographic predictions.
[0209] The distance-based loss function is calculated as follows:
[0210]
[0211] where the distance term represents the distance between the true coordinates and the predicted coordinates. The parameters of this model are optimized using the Adam optimizer via backpropagation.
[0212] Given a dataset containing $m$ samples, let $y_i$ and $\hat{y}_i$ be the true and predicted latitude and longitude coordinates of the $i$-th sample, respectively; the Euclidean distance between them is:
[0213] $d_i = \lVert y_i - \hat{y}_i \rVert_2$
[0214] The distance-weighted MSE loss is defined as:
[0215]
[0216] where $w_i$ is the weight assigned to each sample based on the Euclidean distance $d_i$, defined as follows:
[0217]
[0218] This weighting scheme ensures that samples with larger errors contribute more to the total loss, thus improving the model's performance in predicting specific samples.
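The idea of a distance-weighted MSE can be sketched as below. The exact weight formula is not available in this text, so the sketch assumes $w_i = d_i / \sum_j d_j$ and a loss of $\sum_i w_i \lVert y_i - \hat{y}_i \rVert^2$; both are placeholders chosen only so that samples with larger errors contribute more, and should not be read as the patented definition.

```python
import math


def distance_weighted_mse(y_true, y_pred, eps=1e-8):
    """Weighted squared error where larger Euclidean errors get larger weights.

    ASSUMPTION: w_i = d_i / sum_j d_j (the source formula is not available).
    y_true, y_pred: sequences of (lat, lon) pairs.
    """
    sq_errs = [(t[0] - p[0]) ** 2 + (t[1] - p[1]) ** 2
               for t, p in zip(y_true, y_pred)]
    dists = [math.sqrt(s) for s in sq_errs]      # d_i, Euclidean distance
    total = sum(dists) + eps                     # eps guards against all-zero errors
    weights = [d / total for d in dists]         # assumed weighting scheme
    return sum(w * s for w, s in zip(weights, sq_errs))


def plain_mse(y_true, y_pred):
    """Unweighted mean of the per-sample squared Euclidean errors."""
    sq_errs = [(t[0] - p[0]) ** 2 + (t[1] - p[1]) ** 2
               for t, p in zip(y_true, y_pred)]
    return sum(sq_errs) / len(sq_errs)


if __name__ == "__main__":
    truth = [(0.0, 0.0), (0.0, 0.0)]
    pred = [(1.0, 0.0), (3.0, 0.0)]
    print(plain_mse(truth, pred), distance_weighted_mse(truth, pred))
```

Under this assumed weighting, the sample with the larger error dominates the loss, whereas plain MSE averages the two errors equally.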
[0219] S35. Perform geospatial dimension mapping in the artificial intelligence DNA individual tracing model.
[0220] The relatively sparse latitude and longitude values can hinder model convergence if used directly as regression targets. Previous studies trained neural networks directly on the raw geographic coordinates, which yielded poor results in practical applications. In contrast, this model introduces an optimization strategy that normalizes the coordinates during the feature encoding stage while preserving the basic statistical properties of each dataset. By normalizing the geographic data during training and applying an inverse transform after inference, the model significantly accelerates convergence and improves regression accuracy.
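The normalize-then-invert strategy described above can be sketched as follows. The text does not specify the normalization, so per-dataset z-score standardization is assumed here; the class name `CoordinateScaler` and its methods are hypothetical.

```python
from statistics import mean, pstdev


class CoordinateScaler:
    """Per-dataset z-score scaling of (lat, lon) pairs; an assumed scheme."""

    def fit(self, coords):
        lats = [c[0] for c in coords]
        lons = [c[1] for c in coords]
        self.mean = (mean(lats), mean(lons))
        # fall back to 1.0 when a coordinate is constant, to avoid dividing by zero
        self.std = (pstdev(lats) or 1.0, pstdev(lons) or 1.0)
        return self

    def transform(self, coords):
        """Scale raw coordinates before they are used as regression targets."""
        return [((lat - self.mean[0]) / self.std[0],
                 (lon - self.mean[1]) / self.std[1]) for lat, lon in coords]

    def inverse_transform(self, coords):
        """Map model outputs back to degrees after inference."""
        return [(lat * self.std[0] + self.mean[0],
                 lon * self.std[1] + self.mean[1]) for lat, lon in coords]


if __name__ == "__main__":
    coords = [(30.0, 110.0), (25.0, 102.0), (35.0, 104.0)]
    scaler = CoordinateScaler().fit(coords)
    scaled = scaler.transform(coords)
    print(scaled)
    print(scaler.inverse_transform(scaled))
```

Fitting one scaler per dataset keeps each dataset's own mean and spread, which is one way to read "standardized separately before training and restored after training".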
[0221] To comprehensively evaluate the performance of different source tracing models, this embodiment introduces a new evaluation metric in addition to the standard regression $R^2$ score. Previous studies have mainly relied on the mean distance error (MDE) for evaluation. This embodiment further incorporates a prediction accuracy metric based on spatial proximity.
[0222] The coefficient of determination $R^2$ is used to measure the goodness of fit of the regression model:
[0223] $R^2 = 1 - \dfrac{\sum_{i=1}^{m} (d_i - \hat{d}_i)^2}{\sum_{i=1}^{m} (d_i - \bar{d})^2}$
[0224] where $d_i$ and $\hat{d}_i$ are the actual and predicted geographic coordinates, and $\bar{d}$ is the average of the actual coordinates;
[0225] The mean distance error is calculated as follows:
[0226] $\mathrm{MDE} = \dfrac{1}{m} \sum_{i=1}^{m} \mathrm{Haversine}(d_i, \hat{d}_i)$
[0227] Here, Haversine refers to the spherical distance between the actual coordinates and the predicted coordinates, and $m$ is the number of samples;
[0228] A spatial confidence criterion is introduced: when the predicted coordinates fall within the confidence radius $r'$ of the true location, the prediction is considered accurate. For a given confidence radius $r'$, accuracy is defined as:
[0229] $\mathrm{Acc}(r') = \dfrac{1}{m} \sum_{i=1}^{m} \mathbb{I}\left(\mathrm{Haversine}(d_i, \hat{d}_i) \le r'\right)$
[0230] where $\mathbb{I}(\cdot)$ is an indicator function; it is 1 if the predicted coordinates fall within the confidence radius $r'$, and 0 otherwise.
[0231] In this embodiment, three confidence radii are set to evaluate the model's predictive accuracy: 100 km, 200 km, and 500 km. By incorporating this additional metric, the model's performance in geographic attribution becomes more interpretable and practical.
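The three evaluation metrics can be computed with a short sketch like the one below. It assumes coordinates are (latitude, longitude) pairs in degrees, an Earth radius of 6371 km, and an $R^2$ pooled over both coordinate axes (the text does not say how the two axes are combined); the function names are illustrative, and the coordinates in the usage example are made up.

```python
import math


def haversine_km(a, b, r=6371.0):
    """Spherical distance in km between two (lat, lon) pairs in degrees."""
    phi1, phi2 = math.radians(a[0]), math.radians(b[0])
    dphi = phi2 - phi1
    dlam = math.radians(b[1] - a[1])
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(min(1.0, math.sqrt(h)))


def evaluate(y_true, y_pred, radii_km=(100.0, 200.0, 500.0)):
    """Return R^2, mean distance error (km), and accuracy within each radius."""
    n = len(y_true)
    mean_lat = sum(t[0] for t in y_true) / n
    mean_lon = sum(t[1] for t in y_true) / n
    # R^2 pooled over latitude and longitude (an assumed convention);
    # ss_tot is zero if all true coordinates coincide, which is not handled here
    ss_res = sum((t[0] - p[0]) ** 2 + (t[1] - p[1]) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t[0] - mean_lat) ** 2 + (t[1] - mean_lon) ** 2 for t in y_true)
    dists = [haversine_km(t, p) for t, p in zip(y_true, y_pred)]
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mde_km": sum(dists) / n,
        "accuracy": {r: sum(d <= r for d in dists) / n for r in radii_km},
    }


if __name__ == "__main__":
    truth = [(30.0, 110.0), (25.0, 102.0), (35.0, 104.0)]   # made-up coordinates
    pred = [(30.0, 110.0), (25.0, 103.0), (35.0, 110.0)]
    print(evaluate(truth, pred))
```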
[0232] S4. Use an artificial intelligence DNA individual tracing model to predict the geographic coordinates of individual DNA in unknown samples.
[0233] When the genotype data of a sample corresponds to geographic coordinates, these geographic coordinates are also used as input to the AI-powered DNA individual tracing model. Genotype data are obtained by screening biological samples for genetic variation through resequencing, reduced-representation genome sequencing, and microsatellite sampling, while geographic coordinates correspond to the latitude and longitude at the time of sample collection.
[0234] In practical applications, the model can be fine-tuned based on existing sample data to adapt it to different application scenarios. The model can seamlessly combine non-parametric (predicting coordinates based solely on genotype data) and parametric (fine-tuning using genotype data and partial geographic coordinates before prediction) operating modes, allowing users to choose the appropriate mode according to their needs.
[0235] In this embodiment, tests were conducted on the Chinese pangolin, the Asian grasshopper, and the forest musk deer. The test data are shown in Table 1.
[0236] Table 1 Test Dataset
[0237]
[0238] The final test results are shown in Figure 3. The prediction results of this embodiment are relatively accurate and reliable, and have reference value.
[0239] Therefore, this invention adopts the above-mentioned next-generation artificial intelligence bioinformatics individual geographic tracing method, and constructs a next-generation artificial intelligence DNA individual tracing model based on spatial attention mechanism and new geographic distance loss function, thereby enabling the prediction of the precise geographic location of an individual.
[0240] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can still be made to the technical solutions of the present invention, and these modifications or equivalent substitutions cannot cause the modified technical solutions to deviate from the spirit and scope of the technical solutions of the present invention.
Claims
1. An artificial intelligence bioinformatics individual geographic provenance method, characterized in that, Includes the following steps: S1. Obtain DNA from biological samples; S2. Based on the convolutional neural network module, combined with the CBAM spatial attention module and multilayer perceptron, a new generation of artificial intelligence DNA individual tracing model is constructed. In S2, the input of the CBAM space attention module in the artificial intelligence DNA individual tracing model is the re-encoded genotype feature matrix , and is reshaped as , specifically: Spatial descriptor generation: Max pooling and average pooling are performed along the channel dimension: ; ; wherein , ; splicing pooling results: ; Spatial attention weight calculation: Extracting spatial dependencies using convolutional layers: ; wherein, denotes a 3 x 3 convolution kernel, is a Sigmoid activation function; Feature enhancement: Apply attention weights to the original feature matrix: ; Here, ⊙ represents element-wise multiplication, which enhances spatial correlation features; The input of the convolutional neural network module in the artificial intelligence DNA individual tracing model is a feature matrix after spatial attention enhancement ; The network structure of the convolutional neural network module is as follows: Convolutional layer 1: ; Using 64 3×3 convolutional kernels, stride = 1, activation function is: ReLU ; Pooling layer 1: ; Convolutional layer 2: ; Pooling layer 2: ; Feature flattening: ; The input to the multilayer perceptron in the AI DNA individual tracing model is the flattened feature vector. ; The network structure of a multilayer perceptron is as follows: Fully connected layer 1: ; wherein ; Fully connected layer 2: ; wherein, , the output is the predicted latitude and longitude coordinates ; S3. 
Using the genotype data and background sampling geographic data of the samples as inputs to the model, train the constructed artificial intelligence DNA individual tracing model, and use cross-validation to tune and optimize the model parameters. In S3, during the training of the artificial intelligence DNA individual tracing model, if the sample background information includes the geographic coordinates of the individual's origin, the model integrates this geographic information as a reference; if the sample has no geographic information for the sampling location, the model predicts the geographic coordinates of the sample by analyzing the population genetic characteristic values and by transforming and rotating the genetic characteristic values and their relationship with geographic information. S3 specifically includes the following steps:
S31. Data augmentation based on upsampling using a random mask. In S31, a raw dataset $D$ is given: $D = \{(x_i, y_i)\}_{i=1}^{N}$; where $x_i$ denotes the genotype data of sample $i$, $y_i$ denotes its geographic coordinates, and $N$ is the original sample size. New samples are generated using the following steps: first, a sample $(x_i, y_i)$ is randomly selected from $D$, and a new coordinate $y_i'$ is generated that lies within a 50-kilometer radius of the original coordinate $y_i$, ensuring: $\mathrm{Haversine}(y_i, y_i') \le 50\ \mathrm{km}$; wherein Haversine denotes the spherical distance function: $a = \sin^2\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \cos\varphi_2 \sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right)$; $c = 2\,\mathrm{atan2}\left(\sqrt{a}, \sqrt{1-a}\right)$; $d = r \cdot c$; where $r$ is the Earth's radius, $\varphi_1$ and $\varphi_2$ are the latitudes of the original point and the perturbed point, respectively, and $\lambda_1$ and $\lambda_2$ are the longitudes of the original point and the perturbed point, respectively. Simultaneously, a random mask is generated for the SNP data: the SNP genotype $x_i$ becomes $x_i'$ after random masking, where each SNP site is independently masked with a 20% probability: $x_{ij}' = \text{masked}$ with probability 0.2 and $x_{ij}' = x_{ij}$ with probability 0.8; wherein $x_{ij}$ represents the $j$-th SNP site of sample $i$. The final augmented dataset is defined as follows: $D_{\mathrm{aug}} = D \cup \{(x_k', y_k')\}_{k=1}^{M}$; wherein $M$ is the number of newly generated samples;
S32. Re-encode the genotype data. In S32, a machine learning algorithm is used to organize the SNP data into a fixed 1024-dimensional representation; given an input SNP matrix $X \in \mathbb{R}^{N \times d}$, where $N$ is the number of samples and $d$ is the original feature dimension, a transformation function parameterized by a neural network is defined, and each sample is mapped to a normalized 1024-dimensional vector: […]; […]. To improve the balance between computational efficiency and model generalization ability in different source tracing scenarios, the model includes a two-layer fully connected encoding neural network architecture, as shown below: […]; where the terms of the formula are the weight matrices, the bias terms, and a non-linear activation function;
S33. Adaptively adjust the CBAM spatial attention module, retaining its spatial attention mechanism and omitting the channel attention part;
S34. Perform convolutional regression processing;
S35. Map geospatial dimensions in the artificial intelligence DNA individual tracing model;
S4. Use the artificial intelligence DNA individual tracing model to predict the geographic coordinates of individual DNA in unknown samples.
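As an illustration of the augmentation in step S31, the following is a minimal Python sketch, not the patented implementation. It assumes genotypes are coded as integer lists, that a masked site is replaced by a sentinel value of -1, that the Earth's radius is 6371 km, and that the perturbed coordinate is drawn by rejection sampling; none of these details are stated in the claim, and the function names (`haversine_km`, `perturb_coordinate`, `mask_genotype`, `augment`) are hypothetical.

```python
import math
import random

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius; the claim only names r


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def perturb_coordinate(lat, lon, max_km, rng):
    """Draw a point within max_km of (lat, lon) by rejection sampling.

    Assumption: the claim does not say how the new coordinate is drawn.
    Candidates come from a bounding box and are kept only if the
    Haversine constraint holds.
    """
    dlat = math.degrees(max_km / EARTH_RADIUS_KM)
    cos_lat = max(math.cos(math.radians(lat)), 1e-6)
    dlon = min(180.0, dlat / cos_lat)
    while True:
        new_lat = lat + rng.uniform(-dlat, dlat)
        new_lon = lon + rng.uniform(-dlon, dlon)
        if abs(new_lat) > 90.0:
            continue
        new_lon = (new_lon + 180.0) % 360.0 - 180.0  # wrap to [-180, 180)
        if haversine_km(lat, lon, new_lat, new_lon) <= max_km:
            return new_lat, new_lon


def mask_genotype(x, p_mask, rng, missing=-1):
    """Mask each SNP site independently with probability p_mask.

    The sentinel `missing` is an assumption; the claim does not state
    what value a masked site takes.
    """
    return [missing if rng.random() < p_mask else g for g in x]


def augment(dataset, n_new, rng, max_km=50.0, p_mask=0.2):
    """Return the original samples plus n_new masked, perturbed copies."""
    new_samples = []
    for _ in range(n_new):
        x, (lat, lon) = rng.choice(dataset)
        new_samples.append((mask_genotype(x, p_mask, rng),
                            perturb_coordinate(lat, lon, max_km, rng)))
    return dataset + new_samples


if __name__ == "__main__":
    rng = random.Random(0)
    data = [([0, 1, 2, 1] * 25, (39.9, 116.4)),
            ([2, 2, 0, 1] * 25, (30.6, 104.1))]
    print(len(augment(data, 20, rng)))
```

Rejection sampling keeps the 50 km constraint exact at the cost of discarding some candidates; it becomes inefficient very close to the poles, where a different sampling scheme would be preferable.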
2. The artificial intelligence bioinformatics individual geographic provenance method according to claim 1, characterized in that: in S1, microsatellites of the sample are obtained through species-specific genetic markers and simple, rapid PCR amplification technology; if the species is unclear, whole-genome resequencing data or reduced-representation genome data of the sample are obtained rapidly through high-throughput sequencing.
3. The artificial intelligence bioinformatics individual geographic provenance method according to claim 1, characterized in that: in S33, based on the given reshaped 32×32 feature matrix $F$, max pooling and average pooling are performed along the channel dimension to obtain two spatial descriptors: $F_{\max} = \mathrm{MaxPool}(F)$; $F_{\mathrm{avg}} = \mathrm{AvgPool}(F)$; the pooled feature maps are concatenated along the channel dimension to obtain: $F_{\mathrm{cat}} = [F_{\max}; F_{\mathrm{avg}}]$; a convolutional layer is used to extract spatial dependencies to obtain: $M_s = \sigma\left(\mathrm{Conv}(F_{\mathrm{cat}})\right)$; wherein $\sigma$ denotes a sigmoid activation function that ensures the spatial attention weights are normalized to the range [0, 1]; finally, the learned spatial attention weights are applied to the original feature map using element-wise multiplication: $F' = M_s \odot F$.
4. The artificial intelligence bioinformatics individual geographic provenance method according to claim 1, characterized in that: in S34, a distance-based loss function is defined, and its calculation formula is: […]; wherein the distance term represents the distance between the real and predicted coordinates; given a dataset containing $m$ samples, let $y_i$ and $\hat{y}_i$ be the true and predicted latitude and longitude coordinates of the $i$-th sample; the Euclidean distance between them is: $d_i = \lVert y_i - \hat{y}_i \rVert_2$; the distance-weighted MSE loss is defined as: […]; where $w_i$ is the weight assigned to each sample based on the Euclidean distance $d_i$, defined as follows: […].
5. The artificial intelligence bioinformatics individual geographic provenance method according to claim 1, characterized in that: in S35, the coefficient of determination $R^2$ is calculated to measure the goodness of fit of the regression model: $R^2 = 1 - \frac{\sum_{i=1}^{n} \lVert y_i - \hat{y}_i \rVert^2}{\sum_{i=1}^{n} \lVert y_i - \bar{y} \rVert^2}$; where $y_i$ and $\hat{y}_i$ are the actual and predicted geographic coordinates, $\bar{y}$ is the average of the actual coordinates, and $n$ is the number of samples; the average distance error is calculated as follows: $\frac{1}{n} \sum_{i=1}^{n} \mathrm{Haversine}(y_i, \hat{y}_i)$; wherein Haversine refers to the spherical distance between the actual and predicted coordinates; a spatial-confidence-based criterion is introduced: when the predicted coordinates fall within the confidence radius $r'$ of the true position, the prediction is considered accurate; for a given confidence radius $r'$, accuracy is defined as: $\mathrm{Accuracy}(r') = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\left(\mathrm{Haversine}(y_i, \hat{y}_i) \le r'\right)$; wherein $\mathbb{1}(\cdot)$ is an indicator function that equals 1 if the predicted coordinate falls within the confidence radius $r'$ and 0 otherwise.
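The three evaluation quantities named in claim 5 can be sketched in Python as follows. This is an illustrative sketch under assumptions: coordinates are (latitude, longitude) pairs in degrees, $R^2$ is pooled over both coordinate components, the Earth's radius is taken as 6371 km, and all function names are hypothetical rather than taken from the patent.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    phi1, phi2 = math.radians(p[0]), math.radians(q[0])
    dphi = phi2 - phi1
    dlam = math.radians(q[1] - p[1])
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def r_squared(y_true, y_pred):
    """Coefficient of determination pooled over latitude and longitude.

    Assumption: residual and total sums of squares are summed over both
    coordinate components; the claim does not spell this out.
    """
    n = len(y_true)
    mean = [sum(y[k] for y in y_true) / n for k in range(2)]
    ss_res = sum((t[k] - p[k]) ** 2
                 for t, p in zip(y_true, y_pred) for k in range(2))
    ss_tot = sum((t[k] - mean[k]) ** 2 for t in y_true for k in range(2))
    return 1.0 - ss_res / ss_tot


def mean_distance_error_km(y_true, y_pred):
    """Average Haversine distance between actual and predicted points."""
    return sum(haversine_km(t, p) for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy_within_radius(y_true, y_pred, radius_km):
    """Fraction of predictions within radius_km of the true position."""
    hits = sum(1 for t, p in zip(y_true, y_pred)
               if haversine_km(t, p) <= radius_km)
    return hits / len(y_true)


if __name__ == "__main__":
    actual = [(0.0, 0.0), (0.0, 10.0), (10.0, 0.0)]
    predicted = [(0.0, 0.0), (0.0, 11.0), (10.0, 0.0)]
    print(r_squared(actual, predicted))
    print(mean_distance_error_km(actual, predicted))
    print(accuracy_within_radius(actual, predicted, 50.0))
```

Reporting accuracy at several confidence radii (for example 50 km and 200 km) shows how quickly predictions concentrate around the true positions, which a single mean error can hide.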
Citation Information
Patent Citations
Pedigree tracing method based on whole genome re-sequencing SNP big data and deep learning
CN118248210A