Mineral prediction method and system based on multi-modal transformer architecture

By integrating and preprocessing mineral data using a multimodal Transformer architecture, a dynamic attention graph is constructed to filter core features, solving the problems of multimodal data integration and feature extraction in mineral prediction, and achieving more accurate and adaptable mineral prediction.

CN121456689B · Active · Publication Date: 2026-06-23 · BEIJING CHENGFENG INTELLIGENT TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: BEIJING CHENGFENG INTELLIGENT TECHNOLOGY CO LTD
Filing Date: 2025-11-07
Publication Date: 2026-06-23

AI Technical Summary

Technical Problem

Existing mineral prediction methods rely on single-type data and have difficulty integrating multimodal data, resulting in missing information and inaccurate feature extraction. Furthermore, traditional models struggle to capture cross-modal correlation features and lack adaptability.

Method used

A multimodal Transformer architecture is adopted. Through multimodal data acquisition, integration and preprocessing, a dynamic attention relationship graph is constructed using the Transformer's self-attention mechanism to filter core features and generate mineral prediction results.

Benefits of technology

It achieves efficient integration of multimodal data and deep feature extraction, improving the accuracy and adaptability of mineral prediction, reducing redundant features, and increasing the reliability of prediction results.

✦ Generated by Eureka AI based on patent content.

Abstract

The present application relates to the technical field of mineral prediction, and discloses a mineral prediction method and system based on a multi-modal Transformer architecture. The method comprises collecting multi-modal data related to mineral prediction through multiple heterogeneous data sources; cleaning, aligning and standardizing the collected data to generate a multi-modal data set in a unified format; deeply encoding the data set using a multi-modal Transformer encoder to extract multi-modal feature representations; based on the feature representations, calculating the correlation weights between the features through the Transformer self-attention mechanism to construct a dynamic attention relationship graph; evaluating the importance of the features according to the correlation weights in the graph to filter key features to form a core feature set; and inputting the core feature set into a prediction model to generate a mineral prediction result. The method can mine the correlation information of multi-modal data, filter core features, and adapt to the demand for mineral prediction under complex geological conditions.

Description

Technical Field

[0001] This invention relates to the field of mineral prediction technology, specifically to a mineral prediction method and system based on a multimodal Transformer architecture.

Background Technology

[0002] Mineral resources are a crucial material foundation for industrial production and social development. With the increasing demand for mineral exploration, accurate mineral prediction has become key to reducing exploration costs and improving efficiency. Current mineral prediction methods largely rely on single-type data analysis. Common data sources include geological mapping data, geophysical exploration data, geochemical analysis data, and remote sensing imagery. Different types of data reflect mineral distribution information from dimensions such as geological structure, physical properties, chemical composition, and surface morphology. However, single-modal data has significant limitations. For example, geological mapping data can only present the geological structural characteristics of a local area and cannot cover the overall geological background of a large area; geophysical exploration data is easily affected by topography, has a strong ability to characterize shallow minerals, but has limited accuracy in detecting deep minerals; remote sensing imagery can achieve large-scale data acquisition, but its effectiveness is easily constrained by factors such as cloud cover and vegetation cover.

[0003] In the data processing stage, existing methods often face high integration difficulties when dealing with multimodal data due to issues such as heterogeneous data formats and large scale differences. Some methods use simple concatenation or weighted fusion to process multimodal data, failing to fully consider the inherent correlations between different modalities, making it difficult for the effective information in the data to work synergistically. Meanwhile, in the data cleaning and standardization process, existing methods often use fixed thresholds or general rules to handle outliers and missing values, without adjusting processing strategies according to the specificities of mineral prediction scenarios. This can easily lead to the loss of key geological information or residual noise, affecting the accuracy of subsequent feature extraction.

[0004] In terms of feature extraction and modeling, traditional mineral prediction methods often employ models such as convolutional neural networks and support vector machines. These models are strong at extracting features from single-modal data, but struggle to capture cross-modal correlation features between multimodal data. Some methods attempt to introduce attention mechanisms to improve feature modeling capabilities, but they mostly use static attention weight allocation, failing to dynamically adjust the focus based on the actual characteristics of the data. This results in insufficient adaptability of the model to feature changes under complex geological conditions. Furthermore, existing methods often rely on manual experience or simple statistical indicators to select features, lacking consideration of the correlations between features. This easily leads to the inclusion of redundant features or the omission of key features, increasing the computational burden on the model and reducing the reliability of the prediction results.

Summary of the Invention

[0005] The purpose of this invention is to provide a mineral prediction method and system based on a multimodal Transformer architecture to solve the problems mentioned in the background art.

[0006] To achieve the above objectives, this invention provides a mineral prediction method based on a multimodal Transformer architecture, the method comprising:

[0007] The multimodal data acquisition process involves collecting multimodal data related to mineral prediction from multiple heterogeneous data sources.

[0008] The data integration and preprocessing steps involve cleaning, aligning, and standardizing the collected multimodal data to generate a multimodal dataset in a unified format.

[0009] The multimodal feature encoding step involves using a multimodal Transformer encoder to perform deep encoding on the multimodal dataset and extract multimodal feature representations.

[0010] The attention relationship modeling step involves calculating the association weights between features based on the multimodal feature representation and constructing a dynamic attention relationship graph through the self-attention mechanism of the Transformer.

[0011] The core feature selection step involves evaluating the importance of each feature based on the association weights in the dynamic attention graph and selecting key features to form a core feature set.

[0012] The mineral prediction output step involves inputting the core feature set into the prediction model to generate mineral prediction results.

[0013] Preferably, the multimodal data includes geological core record data, geophysical field measurement data, multispectral remote sensing image data, hydrogeological observation data, and mineral chemical analysis data; wherein the geological core record data are obtained by drilling and provide lithological descriptions and mineral composition, the geophysical field measurement data include gravity anomaly and magnetic gradient information, the multispectral remote sensing image data cover the visible and infrared bands, the hydrogeological observation data include groundwater flow and water quality parameters, and the mineral chemical analysis data provide elemental concentrations and isotope ratios.

[0014] Preferably, the data integration and preprocessing steps specifically include:

[0015] Outlier detection is performed on the multimodal data, and a density clustering algorithm is used to identify and remove noise points;

[0016] Data from different modalities is projected onto a unified geographic grid system, and data alignment is achieved through spatial interpolation methods.

[0017] Robust normalization techniques are applied to adjust the data distribution, eliminate dimensional differences, and generate the multimodal dataset.

[0018] Preferably, in the multimodal feature encoding step, the multimodal Transformer encoder uses a deep stacked encoding layer, each encoding layer containing a feature transformation sublayer and an interactive attention sublayer; the feature transformation sublayer improves feature representation through nonlinear mapping, and the interactive attention sublayer calculates the similarity weights between cross-modal features, fusing multi-source information to generate the multimodal feature representation.

[0019] Preferably, the attention relationship modeling step specifically includes:

[0020] Extract the context vector of each feature in the multimodal feature representation, and calculate the cosine similarity between the vectors as the initial correlation degree;

[0021] Construct a dynamic graph structure where nodes represent features and edge weights are dynamically adjusted by the degree of association.

[0022] Spectral clustering analysis is performed on the dynamic graph to strengthen key connections and generate the dynamic attention graph.
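The first two steps of attention relationship modeling ([0020] and [0021]) can be illustrated with a minimal sketch. It assumes each feature already has a context vector, treats cosine similarity as the edge weight, and drops edges at or below a hypothetical `min_weight` cutoff that the patent does not specify. The spectral clustering step of [0022] is not shown, and the function names are illustrative only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector is treated as unrelated
    return dot / (norm_u * norm_v)


def build_relationship_graph(context_vectors, min_weight=0.0):
    """Return a symmetric adjacency dict {node: {neighbour: weight}}.

    Nodes are feature indices. An edge is kept only when the cosine
    similarity exceeds min_weight (an assumed pruning rule).
    """
    n = len(context_vectors)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            w = cosine_similarity(context_vectors[i], context_vectors[j])
            if w > min_weight:
                graph[i][j] = w
                graph[j][i] = w
    return graph


# Three toy context vectors: features 0 and 2 are orthogonal, so no edge joins them.
vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
graph = build_relationship_graph(vectors)
```

Recomputing the graph whenever the encoder output changes is one way to read "dynamically adjusted" edge weights; the patent does not state the update rule.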

[0023] Preferably, the core feature selection step includes:

[0024] Traverse all feature nodes in the dynamic attention graph and calculate, for each node, the product of its degree centrality and its feature weight;

[0025] Based on the product results, feature importance is ranked, and an adaptive threshold is set to filter high-importance features;

[0026] The selected high-importance features are grouped by modality, and similar features are merged to form the core feature set.
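The scoring and filtering in [0024] and [0025] can be sketched as follows. The patent does not define the feature weight or the adaptive threshold, so this sketch assumes the weight is supplied per node and uses the mean score as the threshold. Both choices, the node names, and the function names are hypothetical; the modality grouping and merging of [0026] is not shown.

```python
def degree_centrality(graph):
    """Unweighted degree centrality: number of neighbours divided by (n - 1)."""
    n = len(graph)
    if n < 2:
        return {node: 0.0 for node in graph}
    return {node: len(neighbours) / (n - 1) for node, neighbours in graph.items()}


def select_core_features(graph, feature_weight):
    """Rank nodes by centrality * weight and keep those at or above the mean score."""
    centrality = degree_centrality(graph)
    score = {node: centrality[node] * feature_weight[node] for node in graph}
    if not score:
        return [], score
    threshold = sum(score.values()) / len(score)  # assumed adaptive rule: mean score
    ranked = sorted(score, key=score.get, reverse=True)
    return [node for node in ranked if score[node] >= threshold], score


# Toy attention graph as an adjacency dict {node: {neighbour: edge_weight}}.
graph = {
    "hardness": {"gravity": 0.8, "resistivity": 0.6},
    "gravity": {"hardness": 0.8},
    "resistivity": {"hardness": 0.6},
    "ndvi": {},
}
weights = {"hardness": 0.9, "gravity": 0.7, "resistivity": 0.4, "ndvi": 0.5}
core, score = select_core_features(graph, weights)
```

A threshold derived from the score distribution, rather than a fixed constant, is what makes the cutoff adapt to each dataset; other statistics such as a quantile would serve equally well.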

[0027] Preferably, in the mineral prediction output step, the prediction model adopts a gradient boosting decision tree architecture, performs multiple rounds of iterative training on the core feature set, adds a weak learner to correct the error in each round of iteration, and finally outputs a mineral distribution probability map by weighted combination.

[0028] The gradient boosting decision tree architecture includes a sequence of base learners, each of which fits the residual from the previous round. The tree structure is optimized using a log loss function, and the output is finally converted into a probability value through a sigmoid transformation.
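The boosting loop described in [0027] and [0028] can be sketched in a few lines. This is a deliberately reduced illustration, not the patented model: it assumes a single scalar feature, uses one-split stumps as the weak learners, fits each stump by least squares to the log-loss residual `y - sigmoid(score)`, and uses a fixed learning rate as the combination weight. All names and hyperparameters are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(xs, residuals):
    """Least-squares regression stump; needs at least two distinct x values."""
    best = None
    for t in sorted(set(xs))[:-1]:  # samples with x <= t go to the left leaf
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return t, lm, rm


def fit_gbdt(xs, ys, rounds=50, lr=0.5):
    """Each round fits a stump to the residual left by the previous rounds."""
    stumps = []
    scores = [0.0] * len(xs)
    for _ in range(rounds):
        # Negative gradient of the log loss with respect to the raw score.
        residuals = [y - sigmoid(s) for y, s in zip(ys, scores)]
        t, lm, rm = fit_stump(xs, residuals)
        stumps.append((t, lm, rm))
        scores = [s + lr * (lm if x <= t else rm) for x, s in zip(xs, scores)]
    return stumps


def predict_proba(stumps, x, lr=0.5):
    """Weighted sum of stump outputs, mapped to a probability by the sigmoid."""
    score = sum(lr * (lm if x <= t else rm) for t, lm, rm in stumps)
    return sigmoid(score)


xs = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]  # toy feature values
ys = [0, 0, 0, 1, 1, 1]              # 1 = mineralized, 0 = barren
model = fit_gbdt(xs, ys)
```

Applying `predict_proba` to every grid cell would give a probability map of the kind the output step describes.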

[0029] Preferably, the geological core record data includes rock hardness and fracture density parameters, the geophysical field measurement data includes resistivity tomography results, the multispectral remote sensing image data includes vegetation index and surface temperature inversion values, the hydrogeological observation data involves permeability coefficient and aquifer thickness, and the mineral chemical analysis data includes rare earth element distribution patterns.

[0030] Preferably, the feature transformation sublayer uses kernel function mapping to extend the original features to a high-dimensional space, enhancing feature separability; the interactive attention sublayer uses a multi-head mechanism to process different feature subsets in parallel, and fuses the multi-head outputs through concatenation operations to improve the robustness of relationship modeling.

[0031] Preferably, the present invention also includes a mineral prediction system based on a multimodal Transformer architecture, the system including a memory, a processor, and a computer program stored in the memory and running on the processor, wherein when the processor executes the computer program, it implements the steps of the mineral prediction method based on the multimodal Transformer architecture as described above.

[0032] Compared with the prior art, the beneficial effects of the present invention are:

[0033] Through multimodal data acquisition steps, mineral prediction-related data are obtained from multiple heterogeneous data sources, covering information from multiple dimensions such as geology, geophysics, geochemistry, and remote sensing. This fully integrates the advantages of different types of data, avoids information loss due to the limitations of a single data dimension, and allows the model to perceive mineral distribution-related characteristics from a more comprehensive perspective. It is suitable for prediction scenarios of different exploration areas and different mineral types.

[0034] In the data integration and preprocessing steps, the collected multimodal data is cleaned, aligned, and standardized to generate a multimodal dataset in a unified format. This effectively solves the problems of heterogeneous multimodal data formats and large scale differences, reduces the interference of data noise on subsequent processing, and ensures the consistency of different modal data in time and space dimensions through data alignment. This provides a high-quality data foundation for subsequent deep feature encoding, enabling different modal data to work together under a unified framework and avoiding feature deviations caused by data inconsistency.

[0035] The multimodal feature encoding step employs a multimodal Transformer encoder to perform deep encoding on the multimodal dataset. This encoder can capture the intrinsic correlation between different modal data through a cross-modal attention mechanism, breaking through the limitation of traditional models that can only process single-modal features. It achieves deep fusion and abstract representation of multimodal features, and mines hidden deep features related to mineral distribution in the data. These features can more accurately reflect the intrinsic laws of mineral formation and distribution, providing a more valuable foundation for subsequent attention relationship modeling and core feature selection.

[0036] The attention relationship modeling step is based on multimodal feature representation. It calculates the association weights between features through the self-attention mechanism of Transformer and constructs a dynamic attention relationship graph. This process can dynamically adjust the degree of attention to different features according to the importance of the features themselves and the strength of the association between features, rather than using a fixed weight allocation method. It can effectively capture the dynamic relationship between features. Especially under complex geological conditions, it can focus on the feature associations that play a key role in mineral prediction and ignore the interaction of irrelevant or weakly associated features, making the model more adaptable to changes in the geological environment.

[0037] The core feature selection step evaluates the importance of each feature based on the association weights in the dynamic attention graph, and selects key features to form a core feature set. This process does not rely on human experience or simple statistical indicators, but rather on the objective screening based on the feature association rules of the data itself. It can effectively remove redundant and noisy features, reduce the computational burden of the model, and ensure that the core feature set can reflect the key information related to mineral prediction. It avoids prediction bias caused by too many features or missing key features, making the model's prediction process more efficient and the prediction results more reliable.

Attached Figure Description

[0038] Figure 1 is a schematic diagram illustrating the working principle of the mineral prediction method based on the multimodal Transformer architecture described in this invention.

[0039] Figure 2 is a flowchart defining the multimodal data types.

[0040] Figure 3 is a flowchart of the multimodal feature encoding steps.

Detailed Implementation

[0041] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0042] Please see Figure 1. This invention provides a mineral prediction method based on a multimodal Transformer architecture. The method includes: a multimodal data acquisition step involving acquiring mineral prediction-related data from multiple independent data sources, including geological exploration equipment, physical field measurement instruments, remote sensing satellite platforms, hydrological monitoring stations, and chemical analysis laboratories; a data integration and preprocessing step performing quality control and format standardization on the raw multimodal data, using algorithms to remove noise and achieve spatial alignment, generating a standardized multimodal dataset; a multimodal feature encoding step employing a multimodal Transformer encoder to extract deep features from the multimodal dataset, with the encoder structure designed to capture cross-modal interaction information; an attention relationship modeling step calculating internal correlation weights based on feature representations, constructing a dynamic graph structure reflecting the dependencies between features; a core feature selection step evaluating the importance metric of each feature in the dynamic graph, identifying key feature subsets through a threshold filtering mechanism; and a mineral prediction output step inputting the filtered features into the prediction model to generate a spatial distribution map of the probability of mineral existence.

[0043] Example 1: See Figure 2. Geological core data were obtained through drilling operations, using diamond drill bits to penetrate the rock strata. The extracted core samples were immediately numbered and sealed. Geological loggers used magnifying glasses and microscopes to observe the surface characteristics of the cores, recording rock color, mineral crystallinity, and alteration type. Rock hardness parameters were measured using a Schmidt hammer tester, with the number of blows converted to the Protodyakonov hardness coefficient. Fracture density was measured using a linear survey method that counts the number of fractures per unit length of core, and fracture width was measured using a feeler gauge. Geophysical field measurements included gravity and magnetic measurements. Gravity measurements were performed using a LaCoste gravimeter on a grid of measuring points, and the values were corrected for solid Earth tides and elevation to obtain Bouguer gravity anomaly values. Magnetic measurements were performed using a proton precession magnetometer scanning along the survey line, and the data were corrected for diurnal variation and the normal field to obtain magnetic anomaly gradient values. Multispectral remote sensing image data were obtained from the Landsat platform, with satellite sensors acquiring radiant energy values in the visible and infrared bands. The data receiving station performed radiometric calibration and atmospheric correction on the raw images to generate surface reflectance products. The vegetation index was calculated from reflectance data in the near-infrared and red bands to obtain the normalized difference vegetation index (NDVI). Surface temperature was retrieved from thermal infrared radiance values, using a split-window algorithm to eliminate atmospheric effects and obtain pixel temperature values. Hydrogeological observation data came from a network of groundwater monitoring wells, with pressure sensors installed in the wells recording water level changes and flow velocity calculated using the tracer dilution method. Water quality parameters were measured in situ using a multi-parameter water quality analyzer to determine pH, conductivity, and dissolved oxygen content. Permeability coefficients were determined through in-situ pumping tests that recorded the drawdown versus time curves, with parameter values estimated using the Theis formula. Aquifer thickness data came from ground-penetrating radar surveys: the radar antenna moved along the survey line emitting electromagnetic waves, and the depth of each stratigraphic interface was calculated from the travel time of the reflected waves. Mineral chemical analyses were completed in the laboratory: after the core samples were crushed, X-ray fluorescence spectrometry was used to measure major element concentrations, and inductively coupled plasma mass spectrometry was used for trace element analysis. Isotope ratio measurements were performed using thermal ionization mass spectrometry to analyze lead and strontium isotope composition, and rare earth element distribution patterns were calculated from the mass spectrometry data as chondrite-normalized values.

[0044] The data integration and preprocessing step first runs the outlier detection module, which uses the density-based spatial clustering of applications with noise (DBSCAN) algorithm. The neighborhood radius parameter is set to 1.5 times the standard deviation of the data distribution and the minimum number of points to 20. The algorithm computes the k-distance graph of the data points to determine density reachability and marks data points whose density falls below the threshold as noise points. Removing the noise points yields a clean dataset and improves the continuity of the data distribution. Spatial projection unifies the multimodal data in the UTM coordinate system, with the grid resolution set to 30 m × 30 m. Borehole coordinates for the geological core records are obtained by GPS measurement, and kriging is used to interpolate the point data onto the grid. Geophysical field measurement data are registered to the grid system, and the magnetic gradient data are resampled using bicubic convolution interpolation. Multispectral remote sensing image data undergo geometric correction to eliminate topographic displacement, and nearest neighbor resampling is used to preserve pixel values. Monitoring well coordinates for the hydrogeological observation data are transformed to the same coordinate system, and a continuous groundwater velocity field is generated using inverse distance weighted interpolation. The sampling point locations of the mineral chemical analysis data are matched to those of the geological core record data, and distribution maps of elemental concentration are generated using radial basis function interpolation.
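The noise-marking rule in the paragraph above can be sketched with a brute-force version of the DBSCAN noise test: a point is noise when it is not a core point and has no core point within the neighborhood radius. The sketch assumes one scalar measurement per point, uses the population standard deviation for the 1.5× rule, and counts a point as its own neighbour. The sample values are invented for illustration.

```python
import math
import statistics


def flag_noise(points, eps, min_pts):
    """Return one boolean per point: True where DBSCAN would label it noise."""
    n = len(points)
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    # A core point has at least min_pts points (itself included) within eps.
    is_core = [len(nb) >= min_pts for nb in neighbours]
    # Noise: not core, and not within eps of any core point.
    return [
        not is_core[i] and not any(is_core[j] for j in neighbours[i])
        for i in range(n)
    ]


# Twenty gravity-anomaly-like readings near -12 mGal plus one spurious reading.
values = [-12.0 + 0.1 * k for k in range(-10, 10)] + [40.0]
eps = 1.5 * statistics.pstdev(values)  # radius rule from the text
noise = flag_noise([(v,) for v in values], eps=eps, min_pts=20)
clean = [v for v, is_noise in zip(values, noise) if not is_noise]
```

The pairwise distance scan is quadratic in the number of points, so a spatial index would be needed at the grid sizes the text implies.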

[0045] The data alignment process checks the spatial overlap of the different modalities and interpolates to fill in missing areas. Geological core record data and geophysical field measurement data are aligned by matching grid coordinates, and a correspondence is established between gravity anomaly data and rock hardness data in three-dimensional space. Temporal consistency between multispectral remote sensing image data and hydrogeological observation data is ensured by a synchronous acquisition strategy, and vegetation index data and groundwater quality data are interpolated and aligned in the temporal dimension. Mineral chemical analysis data and geological core record data are integrated according to the spatial relationship of the sampling points, and rare earth element data and fracture density data are associated through spatial join operations. Robust normalization scales the data using the median and the interquartile range: the median is subtracted from each value and the result is divided by the interquartile range. For the rock hardness parameter of the geological core record data, the median is 5.5 and the interquartile range is 2.3. The median gravity anomaly value of the geophysical field measurement data is -12.3 mGal and the interquartile range is 8.7 mGal; the normalized values show a more uniform distribution. The median vegetation index value of the multispectral remote sensing image data is 0.35 and the interquartile range is 0.22; standardization enhances the distinction between different land cover types. The median permeability coefficient of the hydrogeological observation data is 0.15 m/d and the interquartile range is 0.08 m/d; standardization eliminates the influence of dimensions. The elemental concentration values of the mineral chemical analysis data are logarithmically transformed before the median and interquartile range are calculated. The rare earth element data are standardized so as to retain the relative distribution pattern.
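The robust normalization above amounts to `(x - median) / IQR`. A minimal sketch follows, using the rock hardness statistics quoted in the text (median 5.5, interquartile range 2.3). The sample values, the function name, and the quartile convention of `statistics.quantiles` are assumptions; the patent does not say how quartiles are computed.

```python
import math
import statistics


def robust_scale(values, median=None, iqr=None):
    """Scale values as (x - median) / IQR, estimating both when not supplied."""
    if median is None:
        median = statistics.median(values)
    if iqr is None:
        q1, _, q3 = statistics.quantiles(values, n=4)  # default 'exclusive' method
        iqr = q3 - q1
    if iqr == 0:
        raise ValueError("interquartile range is zero; cannot scale")
    return [(v - median) / iqr for v in values]


# Rock hardness with the statistics quoted in the text.
hardness = [3.2, 5.5, 7.8]
scaled_hardness = robust_scale(hardness, median=5.5, iqr=2.3)

# Elemental concentrations are log-transformed first, as the text describes.
concentrations_ppm = [1.0, 10.0, 100.0, 1000.0, 10000.0]
scaled_log_conc = robust_scale([math.log10(c) for c in concentrations_ppm])
```

Because the median and interquartile range ignore the tails of the distribution, a few extreme readings do not compress the rest of the scaled values, which is the usual reason to prefer this over mean and standard deviation scaling.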

[0046] The generation process of the multimodal dataset includes a data quality verification step, with verification indicators including integrity checks and consistency checks. The integrity check calculates the data coverage of each grid point, requiring data from at least three modalities to be present before retaining that grid point. The consistency check compares the spatial correlation patterns of different modal data; the rock hardness of the geological core record data and the resistivity values of the geophysical field measurement data should show a positive correlation trend. The storage structure of the multimodal dataset is designed as a hierarchical data format. The top-level directory is divided by geographical region, and each region folder contains five modal data subfolders. The geological core record data subfolder stores the lithology coding matrix and physical property parameter tables; the geophysical field measurement data subfolder contains gravity field grids and magnetic field grid files; the multispectral remote sensing image data subfolder stores multiband image data; the hydrogeological observation data subfolder records water level contour maps and flow velocity vector maps; and the mineral chemical analysis data subfolder stores elemental concentration distribution maps and isotope ratio tables. The data access interface provides a query function by coordinate range and supports synchronous extraction of multimodal data.
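The integrity check above reduces to counting the modalities present at each grid point. A minimal sketch, assuming each grid cell is a dictionary keyed by modality with `None` marking a missing value; the key names and sample values are hypothetical.

```python
def keep_cell(cell, min_modalities=3):
    """True when the cell has data from at least min_modalities modalities."""
    present = sum(1 for value in cell.values() if value is not None)
    return present >= min_modalities


grid = {
    (0, 0): {"core": 5.1, "geophysics": -12.0, "remote_sensing": 0.31,
             "hydro": None, "chemistry": None},
    (0, 1): {"core": None, "geophysics": -10.5, "remote_sensing": 0.40,
             "hydro": None, "chemistry": None},
}

# Only cells that pass the three-modality rule are carried forward.
retained = {xy: cell for xy, cell in grid.items() if keep_cell(cell)}
```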

[0047] The preprocessing workflow includes an anomaly handling mechanism, employing mirror continuation to address edge effects during interpolation. Parameter optimization of the spatial interpolation algorithm is performed through cross-validation; interpolation errors in geological core records are verified using leave-one-out method, with the mean absolute error controlled within 0.15 units. The smoothness of interpolation for geophysical field measurement data is adjusted through variogram analysis, with the nugget effect value set at 0.05. Resampling methods for multispectral remote sensing image data are compared between bilinear interpolation and nearest neighbor interpolation, selecting the optimal scheme based on pixel value fidelity. Interpolation accuracy for hydrogeological observation data is verified using known points, with the relative error of velocity interpolation not exceeding 8%. Interpolation results for mineral chemical analysis data are matched with geostatistical characteristics, and the spatial autocorrelation of elemental concentrations is tested using the Moran index. The computational efficiency of data integration and preprocessing steps is improved through parallel computing, and grid partitioning processing uses the MPI communication protocol to synchronize results. Memory management utilizes a data block loading mechanism, supporting large-scale grid data processing. The preprocessing log records parameter settings and runtime status for each step, facilitating result traceability and workflow reproducibility. The quality assessment report generation module automatically outputs data quality indicators, including the integrity percentage and consistency coefficient for each modality. Geological core record data requires an integrity rate of over 95% and a consistency coefficient with geophysical field measurement data greater than 0.7. 
Cloud cover detection in multispectral remote sensing image data marks invalid pixels, and the effective data coverage rate must exceed 90%. Temporal continuity checks on hydrogeological observation data exclude records with abnormal fluctuations, and laboratory quality control indicators for mineral chemical analysis data are attached to the metadata. The final version of the multimodal dataset undergoes three rounds of quality checks before proceeding to the feature encoding step. A data backup mechanism creates multiple copies stored on different storage nodes to prevent data loss due to single points of failure. The preprocessing system's user interface provides a parameter configuration panel, allowing experts to adjust algorithm parameters according to regional characteristics. A historical database saves operation traces for each preprocessing step, supporting parameter optimization and process improvement.

[0048] Example 2: See Figure 3. The multimodal feature encoding step employs a multimodal Transformer encoder consisting of six identical stacked encoding layers. Each encoding layer contains two core sub-components: a feature transformation sub-layer and an interactive attention sub-layer. The feature transformation sub-layer receives feature vectors from the previous layer. These vectors are first normalized to stabilize the data distribution. The normalized features are then input into a feedforward neural network for nonlinear transformation. The feedforward neural network uses a two-layer fully connected structure. The first layer uses a linear transformation to map the 512-dimensional features to 2048 dimensions, followed by the ReLU activation function to introduce nonlinearity. The second layer uses a linear transformation to restore the features to 512 dimensions. The output of the feature transformation sub-layer is residually connected to the original input to alleviate the vanishing gradient problem in deep networks. The interactive attention sub-layer applies a multi-head self-attention mechanism to the normalized features, evenly dividing the 512-dimensional features into eight 64-dimensional feature subspaces. Each attention head independently computes its query, key, and value matrices. The query matrix is multiplied by the transpose of the key matrix, and the result is divided by a scaling factor of 8 (the square root of the 64-dimensional head size) to produce the attention scores. The scores are normalized with the softmax function to give the attention weight matrix, which is then multiplied by the value matrix to obtain the output vector of each attention head. The output vectors of the eight attention heads are concatenated along the feature dimension to form a 512-dimensional composite vector. This composite vector is passed through a linear transformation layer and residually connected to the input of the interactive attention sub-layer. The result is normalized again and passed to the next encoding layer as the output of the current encoding layer.
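The attention arithmetic in this paragraph (eight heads of 64 dimensions, scores divided by 8, softmax, concatenation) can be sketched as follows. To stay short, the sketch makes a strong simplification: the learned query, key, value, and output projections are replaced by the identity, and the residual connection and normalization are omitted. It therefore shows only the split, scale, softmax, and concatenate pattern, not a trained layer.

```python
import math


def softmax(row):
    m = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(q, k, v):
    """Scaled dot-product attention for one head; q, k, v are lists of vectors."""
    scale = math.sqrt(len(q[0]))  # sqrt(64) = 8 for the sizes in the text
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / scale for kj in k]
        weights = softmax(scores)
        out.append([sum(w * vj[d] for w, vj in zip(weights, v))
                    for d in range(len(v[0]))])
    return out


def multi_head_self_attention(tokens, num_heads=8):
    d_model = len(tokens[0])
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    heads = []
    for h in range(num_heads):
        # Slice out this head's subspace; identity projections are an assumption.
        part = [t[h * d_head:(h + 1) * d_head] for t in tokens]
        heads.append(attention_head(part, part, part))
    # Concatenate the head outputs along the feature dimension.
    return [sum((heads[h][i] for h in range(num_heads)), [])
            for i in range(len(tokens))]


# Three toy 512-dimensional feature tokens.
tokens = [[0.01 * (i + 1) * (d % 7) for d in range(512)] for i in range(3)]
output = multi_head_self_attention(tokens)
```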

[0049] The input data for the multimodal Transformer encoder comes from the preprocessed multimodal dataset. Geological core records are encoded into 128-dimensional feature vectors, geophysical field measurement data is converted into 64-dimensional vector representations, multispectral remote sensing imagery yields 256-dimensional deep features, hydrogeological observation data is mapped to 32-dimensional vectors, and mineral chemical analysis data is encoded into 64-dimensional features. These heterogeneous feature vectors are projected into a unified 512-dimensional semantic space through learnable modality embedding matrices, and modality-type and positional encoding vectors are added to them. The modality-type encoding vector is generated from a trainable parameter matrix and distinguishes the five source data modalities. The positional encoding vector is generated using sine and cosine functions and injects order information into the feature sequence. The embedded feature sequence is input into the first encoding layer. The processing time for each encoding layer is approximately 3.2 milliseconds, so the total processing time for the six layers (about 19.2 milliseconds) is kept within 20 milliseconds.
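The embedding step described above can be sketched as follows. The text only states that sine and cosine functions are used; the base of 10000 follows the common Transformer convention and is an assumption here, as are the function names and the dictionary keys.

```python
import math

D_MODEL = 512
MODALITY_DIMS = {             # input widths stated in the text
    "core": 128,
    "geophysics": 64,
    "remote_sensing": 256,
    "hydrogeology": 32,
    "chemistry": 64,
}


def positional_encoding(pos, d_model=D_MODEL, base=10000.0):
    """Sinusoidal encoding for one sequence position (base is assumed)."""
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (base ** (i / d_model))
        pe.append(math.sin(angle))   # even index
        pe.append(math.cos(angle))   # odd index
    return pe


def project(x, W):
    """Project x (length d_in) with W (d_in rows by 512 columns)."""
    return [sum(x[r] * W[r][c] for r in range(len(x)))
            for c in range(len(W[0]))]


def embed(x, W, type_vec, pos):
    """Projected feature + modality-type vector + positional encoding."""
    p = project(x, W)
    pe = positional_encoding(pos, len(p))
    return [a + b + c for a, b, c in zip(p, type_vec, pe)]
```

Each modality would use its own projection matrix `W` sized to its input width, so that all five land in the same 512-dimensional space.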

[0050] The nonlinear mapping of the feature transformation sublayer employs kernel function enhancement techniques, introducing a Gaussian kernel mapping before ReLU activation. The bandwidth parameter of the Gaussian kernel function is optimized using gradient descent, and the kernel mapping projects the original features onto a high-dimensional regenerating kernel Hilbert space. The feature dot product in the high-dimensional space is calculated using kernel tricks, avoiding the computational overhead of explicit high-dimensional mapping. The multi-head mechanism of the interactive attention sublayer uses grouped convolution optimization, decomposing the matrix multiplication of the eight attention heads into parallel computational flows. The query matrix generation for each attention head uses an independent linear transformation layer, with the key and value matrices sharing weights to reduce the number of parameters. The calculation of attention weights introduces a causal masking mechanism, restricting each position to only focusing on features at the current position and those preceding it. The input to the softmax function is temperature-scaled, with the temperature coefficient initialized to 0.1 and dynamically adjusted during training. The training process of the multimodal Transformer encoder adopts a phased strategy. The first phase involves pre-training a masked language model, randomly masking 15% of the input features, and training the model to reconstruct the masked feature values. The second phase involves multi-task fine-tuning, jointly optimizing the feature reconstruction loss and modality matching loss. The optimizer uses the AdamW algorithm with an initial learning rate of 1e-4 and a weight decay coefficient of 0.01. Training data is processed in batches of 32 samples, each containing 2048 feature tokens. A gradient clipping threshold of 1.0 is used to prevent gradient explosion. 
The learning rate is scheduled using a cosine annealing strategy, decaying from its maximum value to zero over 100 training epochs.
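The learning-rate schedule and gradient clipping described above can be sketched in a few lines. This is a minimal illustration that assumes the usual half-cosine form of cosine annealing and clipping by global L2 norm; the function names are hypothetical.

```python
import math


def cosine_annealed_lr(epoch, lr_max=1e-4, total_epochs=100, lr_min=0.0):
    """Half-cosine decay from lr_max at epoch 0 to lr_min at total_epochs."""
    t = min(max(epoch, 0), total_epochs)
    cosine = math.cos(math.pi * t / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cosine)


def clip_by_norm(grads, max_norm=1.0):
    """Rescale a flat gradient list so its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    return [g * max_norm / norm for g in grads]
```

With these defaults the rate starts at 1e-4, passes through roughly 5e-5 at epoch 50, and reaches zero at epoch 100.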

[0051] The encoder parameters are initialized using a Xavier uniform distribution, and the weight matrices of the linear layers are sampled from a uniform distribution in the range [-0.02, 0.02]. The weight matrices of the attention layers are orthogonally initialized to maintain the orthogonality of the matrices. The positional encoding vectors remain fixed during training, while the modality embedding matrices are updated as training progresses. Mixed-precision computation is enabled during the inference phase, and feature tensors are converted to bfloat16 format to reduce memory usage. Computation graph optimization uses operator fusion, fusing layer normalization and linear transformations into a single computation kernel. Memory management during training employs gradient checkpointing, recomputing intermediate forward-pass results during backpropagation instead of storing them. The output feature representation of the multimodal Transformer encoder contains rich cross-modal semantic information. Geological core record features and geophysical field measurement features are correlated under the attention mechanism, and multispectral remote sensing image features and hydrogeological observation features are fused through the interactive attention sublayers. Visualization of the feature representation shows that features of different modalities form clustered distributions in the embedding space, and heterogeneous features from the same geographic region lie close to each other in the transformed space. Analysis of attention weights reveals a strong correlation between geological core records and mineral chemical analysis data, with a clear diagonal pattern in the attention weight matrix. Activation statistics of the encoder's intermediate layers show that deeper encoding layers capture more abstract feature combination patterns. The deployment version supports dynamic batch processing, automatically optimizing computational efficiency for different batch sizes. The inference engine integrates the TensorRT acceleration framework, transforming encoding layer computations into an optimized computational graph. The concurrent processing module uses an asynchronous execution pipeline, overlapping the feature encoding and attention relationship modeling steps. Performance monitoring records the computational latency of each encoding layer, adjusting computational resource allocation in real time. An error handling mechanism detects abnormal input features, automatically skipping invalid feature sequences to ensure processing continuity. Encoder version management records each parameter update, supporting model rollback and A/B testing.

[0052] Example 3: Attention Relationship Modeling Steps. The modeling process handles feature representations from a multimodal Transformer encoder. Each feature representation is a sequence of 512-dimensional floating-point vectors. Each feature vector corresponds to the fusion information of multimodal data within a geographic grid cell. The construction of the dynamic attention relationship graph begins with the calculation of similarity between feature vectors, using cosine similarity to measure the proximity of two vector directions. Before calculating cosine similarity, each feature vector undergoes L2 normalization, adjusting the vector magnitude to 1 to eliminate the influence of amplitude on similarity calculation. The result of the normalized vector dot product is directly equal to the cosine value, ranging from -1 to 1. The cosine similarity calculation between all pairwise feature vectors generates a symmetric similarity matrix, with the matrix dimension being the number of features multiplied by the number of features. The diagonal elements of the similarity matrix are set to zero to exclude comparisons between features and themselves. The node set of the dynamic attention relationship graph corresponds to all feature vectors, and the edge set is defined by the non-zero elements of the similarity matrix. The edge weights are initialized using the corresponding cosine similarity values, with negative similarity weights forced to zero, retaining only positively correlated connections. The graph structure is stored using an adjacency list format, where each node records the indices of its neighboring nodes and the weights of its edges. A dynamic adjustment mechanism periodically updates the edge weights, with the update frequency set to trigger after every 10 batches of data processed. 
Weight updates are based on the rate of change of the feature vector within a sliding window, calculating the cosine similarity between the current feature vector and its historical mean. Edges with a rate of change below a threshold have their weights increased, while edges with a rate of change above the threshold have their weights decreased. A momentum factor is introduced during the dynamic adjustment process to maintain the spatiotemporal continuity of the graph structure.
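The initial graph construction described in this example can be sketched as follows: L2-normalize each vector, take pairwise dot products as cosine similarities, skip self-comparisons, and keep only positive weights in an adjacency list. The function names are illustrative assumptions, and the periodic weight update with momentum is not shown.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def build_attention_graph(features):
    """Return an adjacency list {node: {neighbor: weight}}.

    Edge weights are cosine similarities; self-loops and non-positive
    similarities are left out, as described in the text.
    """
    unit = [l2_normalize(v) for v in features]
    graph = {i: {} for i in range(len(unit))}
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):   # j > i skips the diagonal
            sim = sum(a * b for a, b in zip(unit[i], unit[j]))
            if sim > 0.0:
                graph[i][j] = sim           # symmetric similarity matrix
                graph[j][i] = sim
    return graph
```

The pairwise loop is quadratic in the number of features, which is the cost that the graph sampling mentioned later in the description is meant to reduce.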

[0053] The core feature selection step analyzes the topological properties of the dynamic attention graph and calculates the degree centrality index of each node. Degree centrality is defined as the sum of the weights of the edges connecting a node, reflecting the node's influence in the graph. Feature weights are derived from the attention probability distribution of the final layer of the multimodal Transformer encoder, obtained by counting the total number of times each feature is attended. Node importance scores are calculated by multiplying the degree centrality value by the feature weight; this product operation amplifies the importance of high-frequency connections and high-attention features. Importance scores are then Min-Max normalized, mapping the numerical range to 0 to 1. Node importance is ranked using a quicksort algorithm in descending order, generating a feature importance ranking. An adaptive threshold determination method based on histograms is used, dividing the importance scores into 100 equally wide intervals. The frequency of each interval is calculated using histogram distribution, and the boundary of the interval with the lowest frequency is selected as a candidate threshold value. The final threshold is determined as the first boundary point that satisfies an importance score greater than 0.7 and a sudden change in frequency. A filtering operation retains feature nodes with importance scores higher than the threshold, generating a candidate feature set.
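The scoring and ranking logic of this step can be sketched as below. It is a simplified illustration: the histogram-based adaptive threshold is replaced by a fixed cut at 0.7, Python's built-in sort stands in for quicksort, and the function names and input layout are assumptions.

```python
def degree_centrality(graph):
    """Sum of incident edge weights for every node of an adjacency list."""
    return {node: sum(nbrs.values()) for node, nbrs in graph.items()}


def importance_scores(graph, feature_weight):
    """Degree centrality times feature weight, Min-Max normalized to [0, 1]."""
    dc = degree_centrality(graph)
    raw = {node: dc[node] * feature_weight[node] for node in graph}
    lo, hi = min(raw.values()), max(raw.values())
    span = hi - lo
    return {node: (s - lo) / span if span > 0 else 0.0
            for node, s in raw.items()}


def select_candidates(scores, threshold=0.7):
    """Rank nodes by score (descending) and keep those above the threshold."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [node for node, s in ranked if s > threshold]
```

For example, a hub node with two edges of weight 0.5 and a feature weight of 4.0 gets a raw score of 4.0 and, being the maximum, a normalized score of 1.0.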

[0054] Candidate feature sets are grouped by data modality: geological core record features, geophysical field measurement features, multispectral remote sensing image features, hydrogeological observation features, and mineral chemical analysis features are grouped separately. Intra-group feature similarity is calculated using Euclidean distance, based on the original numerical representation of the feature vectors. A similarity merging operation performs a weighted average on feature pairs with a distance less than a threshold of 0.15, with the weights determined by the feature importance score. The merged feature vectors are then recalculated to determine their connection weights with other features in the graph, updating the dynamic attention graph. The final version of the core feature set records the unique identifier, modality type, and importance score of each feature, serving as input for the mineral prediction output step. Visualization of the dynamic attention graph employs a force-directed layout algorithm, with node positions iteratively adjusted based on edge weights. High-frequency connected nodes cluster towards the graph center, while edge nodes represent features with lower importance. A repulsive force constant is set during graph layout to prevent node overlap, while the attractive force constant is proportional to the edge weight. The visualization output uses color coding to distinguish feature nodes of different modalities: geological core record features are displayed in red, geophysical field measurement features in blue, multispectral remote sensing image features in green, hydrogeological observation features in yellow, and mineral chemical analysis features in purple. Edge transparency is positively correlated with the weight value; strongly connected edges are displayed as opaque solid lines, while weakly connected edges are displayed as semi-transparent dashed lines.
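The similarity merging within one modality group can be sketched as follows. The 0.15 distance threshold and the importance-weighted average come from the text; the greedy merge order and the choice to carry the combined weight forward after a merge are assumptions of this sketch.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def merge_group(vectors, scores, max_dist=0.15):
    """Greedily merge features of one modality closer than max_dist.

    Each merge is an average weighted by the (positive) importance scores.
    """
    items = [(list(v), s) for v, s in zip(vectors, scores)]
    merged = []
    while items:
        v, s = items.pop(0)
        rest = []
        for w, t in items:
            if euclidean(v, w) < max_dist:
                v = [(s * x + t * y) / (s + t) for x, y in zip(v, w)]
                s = s + t   # assumption: combined weight carries forward
            else:
                rest.append((w, t))
        items = rest
        merged.append(v)
    return merged
```

After merging, the edge weights of the merged nodes would be recomputed against the rest of the graph, as the paragraph describes.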

[0055] The computational complexity of the core feature selection step is optimized using graph sampling techniques, randomly selecting a subset of features to construct an approximate graph structure. A sampling rate of 0.8 is set to ensure the statistical significance of the graph's topological properties. Degree centrality calculation uses an approximate algorithm, estimating node importance through random walks. Adaptive threshold adjustment incorporates a smoothing filter to avoid drastic threshold fluctuations across different batches of data. Feature merging employs an incremental update strategy, gradually integrating new features with the existing feature set. The core feature set is persistently stored in binary format, including a feature index mapping table and metadata descriptions. The dynamic attention graph is maintained using a version control mechanism, saving historical snapshots after each significant topological change. A version comparison tool analyzes the graph structure evolution, identifying stable and transient features. The monitoring panel for the core feature selection step displays real-time screening statistics, including trends in feature quantity and modal distribution ratios. An anomaly detection module monitors the distribution of importance scores, triggering alerts when deviations from normality are detected. The quality assessment of the core feature set uses a reconstruction error metric to measure the explanatory power of the selected features on the original data.

[0056] The importance of nodes in the entire process is quantified using the following mathematical formula:

[0057]

[0058] In the formula, the terms denote, in order: the importance score of the i-th feature node; the degree centrality of the i-th feature node in the dynamic attention graph; the attention weight between the j-th feature node and the i-th feature node in the dynamic attention graph; and the total number of feature nodes in the dynamic attention graph.

[0059] The computation of the degree centrality depends on the real-time state of the dynamic attention graph, and the attention weights are output by the multimodal Transformer encoder. The total number of feature nodes corresponds to the total number of geographic grid cells and remains constant during data processing. The distribution of the importance scores reflects the overall quality of the feature set; a right-skewed distribution indicates the existence of a subset of significantly important features. The score calculation is parallelized: the score of each feature node is computed independently, and the results are aggregated into a score vector. This score vector drives the decision logic of the core feature selection step, completing the transformation from feature representations to core features.
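Read together with paragraph [0053], the quantities defined above are consistent with a score of the following shape. The symbols $S_i$, $D_i$, $A_{ji}$, $w_{ij}$ and $N$ are labels chosen here for illustration, and any normalization constant cannot be inferred from the surrounding text, so this is a sketch under those assumptions rather than the claimed expression.

```latex
S_i = D_i \cdot \sum_{j=1}^{N} A_{ji},
\qquad
D_i = \sum_{j \neq i} w_{ij}
```

Here $S_i$ is the importance score of the $i$-th feature node, $D_i$ its degree centrality (the sum of the weights $w_{ij}$ of its incident edges), $A_{ji}$ the attention weight between the $j$-th and $i$-th feature nodes, and $N$ the total number of feature nodes.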

[0060] Example 4: The gradient boosting decision tree architecture used in the mineral prediction output step comprises a series of sequentially trained decision tree models. Each decision tree model acts as a base learner, fitting the residuals of the previous prediction. The training process of the gradient boosting decision tree architecture uses a logarithmic loss function as the optimization objective, which measures the difference between the model's predicted probability and the true label. The base learner is generated using a recursive binary splitting algorithm, continuously selecting the optimal features and split points from the root node to divide the data into subsets with higher purity. The maximum depth of the decision tree is limited to 6 layers, and the minimum number of samples in the leaf nodes is set to 10 to prevent overfitting of the model to the training data. The learning rate parameter controls the contribution of each decision tree to the final model, with a value of 0.1 to balance training speed and generalization ability. The number of training epochs is dynamically determined using early stopping; the training process terminates when the loss function on the validation set no longer decreases for 10 consecutive epochs.
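The optimization objective and the early-stopping rule described above can be sketched as follows. The function names are hypothetical, and the per-round validation losses are assumed to be supplied by whatever training loop is in use.

```python
import math


def log_loss(y_true, p_pred, eps=1e-12):
    """Mean binary logarithmic loss between labels (0/1) and probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)   # keep log() finite
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def best_round_with_early_stopping(val_losses, patience=10):
    """Return the round with the lowest validation loss.

    Scanning stops once the loss has failed to improve for `patience`
    consecutive rounds, mirroring the 10-round rule in the text.
    """
    best = float("inf")
    best_round = 0
    for rnd, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_round = loss, rnd
        elif rnd - best_round >= patience:
            break
    return best_round
```

The trees added after the returned round would be discarded, so the final model keeps only the rounds up to the validation minimum.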

[0061] The input data for the gradient boosting decision tree architecture comes from the core feature set generated in the core feature selection step. The feature vectors contain key information from five modalities: geology, geophysics, remote sensing, hydrology, and chemistry. The dimension of the feature vectors ranges from 50 to 200, with the specific number depending on the selection threshold. The training data is divided into three independent sets: 70% of the samples are used for model training, 15% for the validation set to monitor the training process, and 15% for the test set to evaluate the final performance. Stratified sampling is used to maintain the positive-to-negative sample ratio in each set consistent with the original data. Feature scaling normalizes numerical features to the [0,1] interval, and categorical features are one-hot encoded. In the prediction phase of the gradient boosting decision tree architecture, the outputs of each base learner are weighted and summed, with the weights proportional to the base learner's performance on the validation set. The summation result is converted into a probability value using the sigmoid function, representing the likelihood of mineral deposits. The probability output is mapped to a geographic grid system to generate a mineral distribution probability map, with the color depth of each grid cell corresponding to the probability of mineral exploration. The spatial resolution of the probabilistic map is consistent with the grid size of the multimodal dataset, maintaining a grid accuracy of 30m × 30m. Post-processing performs median filtering on the probabilistic map to eliminate isolated noisy prediction points and enhance the spatial continuity of the map. Hyperparameter optimization of the gradient boosting decision tree architecture uses a Bayesian search algorithm to find the optimal combination within a specified parameter space. 
The candidate values for the maximum depth of the decision tree are [3, 4, 5, 6, 7], the candidate values for the learning rate are [0.01, 0.05, 0.1, 0.15, 0.2], and the candidate values for the subsampling ratio are [0.6, 0.7, 0.8, 0.9, 1.0]. The optimization process runs for 100 rounds, with each round selecting a set of parameters, training the model, and evaluating its performance on the validation set. The optimal parameter combination is selected based on the average accuracy on the validation set rather than on the training set, to avoid overfitting the training data. A 5-fold cross-validation strategy is used to ensure the stability of parameter selection.
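The 70% / 15% / 15% stratified split described in this example can be sketched as below. It is a minimal illustration using only the standard library; the function name, the fixed seed and the per-class rounding rule are assumptions.

```python
import random


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=0):
    """Split sample indices into train/validation/test sets per class.

    Splitting each class separately keeps the positive-to-negative ratio
    of every subset close to that of the full data.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)

    train, val, test = [], [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_train = round(fractions[0] * len(idxs))
        n_val = round(fractions[1] * len(idxs))
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]   # remainder goes to the test set
    return train, val, test
```

With 20 positive and 80 negative samples, each subset keeps the 1:4 ratio: 14 of 70 training samples, 3 of 15 validation samples and 3 of 15 test samples are positive.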

[0062] The gradient boosting decision tree architecture is implemented based on an open-source machine learning library. Histogram optimization uses a histogram algorithm to accelerate feature split point finding. The histogram construction process discretizes continuous feature values ​​into 256 buckets, reducing the number of candidate split points. A parallel computing strategy is employed, with feature-level parallelism, simultaneously calculating the optimal split point for all features. Memory management uses block compression technology to reduce the space overhead of storing feature values. Cache-aware optimization rearranges data access patterns to improve CPU cache hit rate. Monitoring metrics during training include training loss, validation loss, feature importance ranking, and convergence curve. Training loss reflects the model's fit to the training data, while validation loss indicates the model's generalization ability. Feature importance calculation is based on the number of times a feature is used in the decision tree and the information gain it provides; features with high importance rankings are usually closely related to mineral formation. The convergence curve shows the trend of the loss function changing with training epochs; a healthy training process shows the loss steadily decreasing to a stable plateau. The deployed version of the gradient boosting decision tree architecture supports online learning capabilities, allowing incremental updates of model parameters to adapt to new data. The incremental learning process fixes the existing decision tree structure, only adjusting leaf node weights and adding new decision trees. Model version management saves parameter snapshots for each training cycle, supporting model rollback and performance comparison. The inference interface provides both batch processing and real-time querying. 
Batch processing is suitable for full-area prediction, while real-time querying quickly returns probability values for individual grid cells.

[0063] Referring to Table 1, the quality control of the mineral prediction output step includes probability calibration and uncertainty quantification. Probability calibration uses isotonic (order-preserving) regression to adjust the predicted probability distribution so that the predicted probabilities align with the observed frequencies. Uncertainty quantification trains multiple models on Bootstrap resamples and takes the standard deviation of their predictions as the uncertainty estimate. The output is visualized on a geographic information system platform, overlaying the mineral distribution probability map on basic geographic data such as geological maps and topographic maps for analysis. The user interface provides a probability threshold slider for interactive adjustment of display sensitivity.
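The two quality-control computations can be sketched as follows. The calibration part uses the pool-adjacent-violators procedure, a standard way to fit an order-preserving (isotonic) regression; whether the patent uses this exact procedure is not stated. The uncertainty part uses the population standard deviation, which is likewise an assumption, as are the function names.

```python
import statistics


def isotonic_fit(values):
    """Least-squares non-decreasing fit (pool adjacent violators).

    `values` are observed outcomes (e.g. 0/1 labels) ordered by
    increasing predicted probability.
    """
    blocks = []   # each block is [sum, count]
    for v in values:
        blocks.append([v, 1])
        # merge while the previous block's mean exceeds the last one's
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted


def prediction_uncertainty(per_model_probs):
    """Standard deviation across bootstrap models for each grid cell.

    `per_model_probs` holds one list of per-cell probabilities per model.
    """
    return [statistics.pstdev(cell) for cell in zip(*per_model_probs)]
```

For instance, outcomes `[0, 1, 0, 1, 1]` ordered by predicted probability are calibrated to `[0.0, 0.5, 0.5, 1.0, 1.0]`, since the out-of-order pair in the middle is pooled to its mean.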

[0064] Table 1: Hyperparameter settings for gradient boosting decision tree architecture

[0065]

[0066] The gradient boosting decision tree architecture employs a fault recovery mechanism that periodically saves training checkpoints, allowing recovery from the most recent checkpoint after training interruptions. Performance monitoring records inference latency and memory usage, triggering alarms when resource consumption exceeds thresholds. The model interpretation tool provides SHAP value analysis, demonstrating the contribution of each feature to the prediction result. Feature contribution visualization helps geological experts understand the model's decision-making logic, enhancing the credibility of the prediction results. An ensemble learning strategy combines multiple gradient boosting decision tree models, further improving prediction stability through a voting mechanism.

[0067] Example 5: Multimodal Data Acquisition Steps A data collection network is systematically deployed within the specific exploration area. Geological core data acquisition involves selecting representative borehole locations and setting up drilling points. Taking a sedimentary iron ore exploration area as an example, the drilling project uses an XY-44 core drilling rig with a designed drilling depth of 300 meters, maintaining an average core recovery rate of over 85%. On-site logging personnel set up a temporary core logging station on the drilling platform, immediately cleaning, boxing, and numbering the extracted core samples. Rock hardness parameters are measured using a Schmidt hammer, continuously striking 10 points on the core surface. The impact rebound values ​​are recorded, and the average value is converted into the Protodyakonov hardness coefficient. Fracture density is measured using a line survey method, marking a baseline per meter of the core surface and counting the number of fractures intersecting the baseline. Typical granite cores have a hardness coefficient between 6 and 7, and a fracture density range of 3-8 fractures / meter. Geophysical field measurement data acquisition uses a regular survey network, with the survey lines perpendicular to the regional tectonic strike. Gravity measurements were performed using a CG-5 automatic gravimeter with measuring points arranged in a 500m × 500m grid. Each measuring point was read three times, and the average value was taken. The measured values ​​were then corrected for solid tides and topography to obtain the Bouguer gravity anomaly values. Magnetic measurements were performed using a G-858 cesium optically pumped magnetometer, with continuous recording along the survey line at a point spacing of 20 meters. The data were corrected for diurnal variation and normal field to generate a magnetic anomaly contour map. 
Resistivity tomography employed a multi-electrode measurement system with an electrode spacing of 10 meters. The measurement mode was set to the Wenner array, and the inversion software RES2DINV was used to generate the subsurface resistivity profile. Above known ore bodies, a combined response of high gravity anomalies, high magnetic anomalies, and low resistivity anomalies is typically observed.

[0068] Multispectral remote sensing imagery was acquired during satellite transits when cloud cover was below 10%, with Landsat 8 OLI sensor data being the preferred data source. Image preprocessing included radiometric calibration, atmospheric correction, and geometric correction to generate surface reflectance products. Vegetation index calculation employed the Normalized Difference Vegetation Index (NDVI) formula, utilizing reflectance values ​​from the near-infrared and red bands. Surface temperature retrieval was based on thermal infrared data, using a single-window algorithm to eliminate atmospheric effects. In mineral alteration zones, a characteristic combination of abnormally low vegetation index values ​​and abnormally high surface temperature values ​​was typically observed. Hydrogeological observation data relied on existing monitoring well networks supplemented by drilling observation wells. Groundwater flow was measured using electromagnetic flowmeters to record instantaneous flow rates. Water quality parameters were tested in-situ using a multi-parameter water quality analyzer, measuring pH, conductivity, dissolved oxygen, and redox potential. Permeability coefficients were determined through multi-well steady-flow pumping tests, recording the drawdown versus time curves. Aquifer thickness data, combined with ground-penetrating radar detection and borehole data interpretation, was used to create contour maps of the aquifer's top and bottom plates. Mineral chemical analysis data were completed in a CMA-certified laboratory. After core samples were fragmented, X-ray fluorescence spectrometry was used to measure the oxide content of major elements. Trace element analysis was performed using inductively coupled plasma mass spectrometry (ICP-MS), with detection limits reaching the ppb level. Isotope ratio measurements were conducted using thermal ionization mass spectrometry to analyze lead isotope composition and calculate model ages. 
Rare earth element (REE) distribution pattern data were obtained through mass spectrometry analysis to construct chondrite-normalized patterns. Abnormally high levels of copper, lead, and zinc were typically observed in mineralization anomaly areas, and the REE distribution pattern exhibited a right-inclined shape.
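The vegetation index mentioned in this example follows the standard NDVI definition, (NIR - Red) / (NIR + Red). A minimal sketch is given below; the function name and the zero-denominator convention are assumptions, and the inputs are taken to be surface reflectance values after the preprocessing described above.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel.

    For Landsat 8 OLI, band 5 is the near-infrared band and band 4 is
    the red band. Returns 0.0 when both reflectances are zero
    (a convention assumed here to avoid division by zero).
    """
    denom = nir + red
    if denom == 0:
        return 0.0
    return (nir - red) / denom


def ndvi_grid(nir_band, red_band):
    """Apply ndvi() cell by cell to two equally sized 2-D reflectance grids."""
    return [[ndvi(n, r) for n, r in zip(nir_row, red_row)]
            for nir_row, red_row in zip(nir_band, red_band)]
```

Values lie between -1 and 1; dense vegetation gives values well above zero, while sparse or stressed vegetation gives low values.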

[0069] Rock hardness parameters from geological core data were recorded in real time during drilling, with each core number strictly corresponding to its depth coordinates. Fracture density measurements were performed after the core boxes were neatly arranged, with fracture locations marked using a red marker. Gravity point coordinates for geophysical field measurements were measured using a sub-meter GPS receiver, and elevation measurements were performed using an electronic level. Magnetic measurements avoided thunderstorms and magnetically disturbed days; diurnal correction data were obtained from regional geomagnetic observatories. Stainless steel electrodes were used for resistivity tomography, with grounding resistance kept below 1000 ohms. Radiometric calibration coefficients for the multispectral remote sensing image data were extracted from satellite metadata, and atmospheric correction used the 6S radiative transfer model. Geometric correction control points were selected from 1:50000 topographic maps, with a positioning error of less than 0.5 pixels. Groundwater level measurements for hydrogeological observation data were performed using pressure-type water level gauges, recording data hourly. Sensors were calibrated on-site before water quality parameter measurements, and measurement accuracy was verified using standard solutions. The Theis formula was used to fit the pumping test data for permeability coefficient calculation, requiring a correlation coefficient greater than 0.95. Aquifer thickness interpretation incorporated seismic exploration results, with stratigraphic tracing performed on ground-penetrating radar images. Sample pretreatment for mineral chemical analysis data was conducted in a cleanroom to prevent external contamination. X-ray fluorescence spectrometry analysis used national first-class standard materials for quality control, and a duplicate sample was inserted for every 10 samples in trace element analysis. 
Isotope ratio measurements were calibrated using the international standard material NBS981, and rare earth element analysis used the internal standard method to correct for matrix effects.

[0070] Quality control for multimodal data acquisition includes on-site repeat measurements and internal laboratory checks. For hardness measurements of geological core records, a repeat measurement point is inserted every 10 points, with an allowable error range of ±0.5 units. For gravity measurements of geophysical field data, 5% of the measurement points are repeated, and the sensor sensitivity of magnetic measurements is checked daily. Multispectral remote sensing image data is cross-validated with other satellite imagery, comparing reflectance values ​​for the same ground features. Water level measurements of hydrogeological observation data use dual-instrument comparison, and flow rate measurements use a comparison of the current meter method and the tracer method. Maintenance of data acquisition equipment is performed according to a predetermined plan; the drill bit is replaced every 100 meters of drilling for geological drilling rigs, and core logging tools are cleaned daily. Geophysical instruments are periodically sent to the metrology department for verification, and gravimeters undergo baseline comparison monthly. The connection status of multispectral remote sensing data receiving antennas is checked daily, and hydrological monitoring sensors undergo zero-point calibration weekly. Chemical analysis instruments undergo performance verification before each use, and mass spectrometers undergo major maintenance every six months. Training for acquisition personnel includes learning operating procedures and on-site practical assessments; those who pass are issued a certificate of competency. Documentation for multimodal data acquisition steps follows a standardized format. Geological logging utilizes a dedicated core logging form, and geophysical measurements are recorded in field logbooks. Remote sensing data is downloaded and satellite parameters are recorded, while hydrological observations are compiled into daily reports. 
Chemical analysis yields formal test reports, and all documents are scanned and stored in a database. Data acquisition progress is monitored using project management software, with daily reports on completed work. Emergency plans are established for handling anomalies, and procedures are activated in the event of drilling accidents or instrument malfunctions.

[0071] Although embodiments of the invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.

Claims

1. A mineral prediction method based on a multimodal Transformer architecture, characterized in that the method includes:
A multimodal data acquisition step: collecting multimodal data related to mineral prediction from multiple heterogeneous data sources.
A data integration and preprocessing step: cleaning, aligning, and standardizing the collected multimodal data to generate a multimodal dataset in a unified format.
A multimodal feature encoding step: using a multimodal Transformer encoder to perform deep encoding on the multimodal dataset and extract multimodal feature representations.
An attention relationship modeling step: calculating the association weights between features based on the multimodal feature representations, and constructing a dynamic attention relationship graph through the Transformer's self-attention mechanism. The attention relationship modeling step specifically includes: extracting the context vector of each feature in the multimodal feature representations, and calculating the cosine similarity between the vectors as the initial degree of association; constructing a dynamic graph structure in which nodes represent features and edge weights are dynamically adjusted according to the degree of association; and performing spectral clustering analysis on the dynamic graph to strengthen key connections and generate the dynamic attention relationship graph.
A core feature selection step: evaluating the importance of each feature based on the association weights in the dynamic attention relationship graph, and selecting key features to form a core feature set. The core feature selection step specifically includes: traversing all feature nodes in the dynamic attention relationship graph and calculating, for each node, the product of its degree centrality and its feature weight; ranking the features by importance according to the product results, and setting an adaptive threshold to filter out high-importance features; and grouping the selected high-importance features by modality and merging similar features to form the core feature set.
A mineral prediction output step: inputting the core feature set into a prediction model to generate a mineral prediction result.
The multimodal data includes geological core record data, geophysical field measurement data, multispectral remote sensing image data, hydrogeological observation data, and mineral chemical analysis data, wherein the geological core record data is obtained through drilling and provides lithological descriptions and mineral composition; the geophysical field measurement data includes gravity anomaly and magnetic gradient information; the multispectral remote sensing image data covers the visible and infrared bands; the hydrogeological observation data includes groundwater flow and water quality parameters; and the mineral chemical analysis data provides elemental concentrations and isotope ratios.
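As a non-limiting illustration of the attention relationship modeling and core feature selection steps of claim 1, the following minimal sketch uses NumPy only. Several choices here are assumptions rather than anything the claim specifies: negative cosine similarities are treated as "no edge", the spectral analysis is reduced to a two-way split by the sign of the Fiedler vector, intra-cluster edges are strengthened by a hypothetical `boost` factor, degree centrality is taken as the normalized weighted degree, and the adaptive threshold is the mean score plus `k` standard deviations. The function names are illustrative, and the final grouping and merging by modality is not shown.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between rows of X (one context vector per feature)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / np.clip(norms, 1e-12, None)
    return unit @ unit.T

def build_attention_graph(X, boost=1.5):
    """Return a weighted adjacency matrix and a two-way cluster label per node."""
    sim = cosine_similarity_matrix(X)
    adj = np.clip(sim, 0.0, None)   # assumption: negative similarity means no edge
    np.fill_diagonal(adj, 0.0)      # no self-loops
    # Spectral bipartition (assumed simplification of "spectral clustering"):
    # sign of the eigenvector for the second-smallest Laplacian eigenvalue.
    lap = np.diag(adj.sum(axis=1)) - adj
    _, eigvecs = np.linalg.eigh(lap)
    labels = (eigvecs[:, 1] > 0).astype(int)
    # Strengthen edges whose endpoints fall in the same cluster.
    same = labels[:, None] == labels[None, :]
    adj = np.where(same, adj * boost, adj)
    np.fill_diagonal(adj, 0.0)
    return adj, labels

def select_core_features(adj, feature_weights, k=0.0):
    """Score = degree centrality * feature weight; keep scores above an adaptive threshold."""
    degree = adj.sum(axis=1)
    centrality = degree / max(degree.max(), 1e-12)  # normalized weighted degree
    score = centrality * feature_weights
    threshold = score.mean() + k * score.std()      # assumed form of the adaptive threshold
    keep = np.flatnonzero(score >= threshold)
    return keep[np.argsort(-score[keep])], score    # indices sorted by descending score
```

Calling `build_attention_graph` on a matrix of context vectors and passing the resulting adjacency matrix to `select_core_features` yields the indices of the retained features in descending order of importance.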

2. The mineral prediction method based on a multimodal Transformer architecture as described in claim 1, characterized in that the data integration and preprocessing step specifically includes: performing outlier detection on the multimodal data, and using a density clustering algorithm to identify and remove noise points; projecting the data from different modalities onto a unified geographic grid system, and achieving data alignment through spatial interpolation; and applying robust normalization to adjust the data distribution and eliminate dimensional differences, thereby generating the multimodal dataset.
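The three preprocessing operations of claim 2 can be sketched as follows, again with NumPy only. The claim does not name specific algorithms, so this sketch assumes the DBSCAN definition of a noise point for the density clustering, inverse distance weighting for the spatial interpolation, and median and interquartile-range scaling for the robust normalization. The parameters `eps`, `min_samples`, and `power` are illustrative.

```python
import numpy as np

def dbscan_noise_mask(points, eps, min_samples):
    """True for points that DBSCAN would label as noise.

    A point is a core point if at least min_samples points (itself included)
    lie within eps. Noise points are neither core nor within eps of a core point.
    """
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    within = dist <= eps
    core = within.sum(axis=1) >= min_samples
    reachable = (within & core[None, :]).any(axis=1)  # core points reach themselves
    return ~reachable

def idw_to_grid(coords, values, grid_xy, power=2.0):
    """Inverse-distance-weighted interpolation of scattered samples onto grid nodes."""
    dist = np.linalg.norm(grid_xy[:, None, :] - coords[None, :, :], axis=-1)
    weights = 1.0 / np.maximum(dist, 1e-12) ** power  # guard against zero distance
    return (weights @ values) / weights.sum(axis=1)

def robust_scale(x):
    """Center each column on its median and divide by its interquartile range."""
    median = np.median(x, axis=0)
    q1, q3 = np.percentile(x, [25, 75], axis=0)
    iqr = np.where(q3 - q1 == 0, 1.0, q3 - q1)        # avoid division by zero
    return (x - median) / iqr
```

Median and interquartile-range scaling is chosen here because, unlike mean and standard-deviation scaling, it is not dominated by the extreme values that geochemical and geophysical measurements often contain.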

3. The mineral prediction method based on a multimodal Transformer architecture as described in claim 2, characterized in that, in the multimodal feature encoding step, the multimodal Transformer encoder uses deeply stacked encoding layers, each encoding layer containing a feature transformation sublayer and an interactive attention sublayer; the feature transformation sublayer enhances the feature representation through nonlinear mapping, and the interactive attention sublayer calculates the similarity weights between cross-modal features and fuses the multi-source information to generate the multimodal feature representations.

4. The mineral prediction method based on a multimodal Transformer architecture as described in claim 3, characterized in that, in the mineral prediction output step, the prediction model adopts a gradient boosting decision tree architecture, performs multiple rounds of iterative training on the core feature set, adds a weak learner in each round to correct the remaining error, and finally outputs a mineral distribution probability map through weighted combination. The gradient boosting decision tree architecture includes a sequence of base learners, each of which fits the residual from the previous round. The tree structure is optimized using a log loss function, and the output is converted into a probability value through a sigmoid transformation.
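The boosting scheme of claim 4 can be illustrated with a deliberately small sketch. It assumes binary labels (mineralized or not), uses depth-1 regression trees (stumps) as the weak learners, and takes a plain gradient step with a fixed learning rate `lr` instead of the per-leaf Newton updates that production libraries typically use. For log loss, the residual fitted in each round is the label minus the current sigmoid probability. All function names and hyperparameters are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_stump(X, residual):
    """Least-squares depth-1 regression tree fitted to the residuals."""
    # Fallback for constant features: a single leaf predicting the mean residual.
    best = (np.inf, 0, np.inf, residual.mean(), residual.mean())
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:   # exclude the max so both sides are non-empty
            left = X[:, j] <= t
            left_val, right_val = residual[left].mean(), residual[~left].mean()
            err = ((residual - np.where(left, left_val, right_val)) ** 2).sum()
            if err < best[0]:
                best = (err, j, t, left_val, right_val)
    return best[1:]

def predict_stump(stump, X):
    j, t, left_val, right_val = stump
    return np.where(X[:, j] <= t, left_val, right_val)

def fit_gbdt(X, y, n_rounds=50, lr=0.3):
    """Each round fits a stump to the negative gradient of the log loss."""
    p0 = np.clip(y.mean(), 1e-6, 1 - 1e-6)
    f0 = np.log(p0 / (1 - p0))              # initial raw score: log-odds of the base rate
    raw = np.full(len(y), f0)
    stumps = []
    for _ in range(n_rounds):
        residual = y - sigmoid(raw)         # negative gradient of log loss w.r.t. raw score
        stump = fit_stump(X, residual)
        raw = raw + lr * predict_stump(stump, X)
        stumps.append(stump)
    return f0, lr, stumps

def predict_proba(model, X):
    """Weighted combination of all stumps, mapped to a probability by the sigmoid."""
    f0, lr, stumps = model
    raw = f0 + lr * sum(predict_stump(s, X) for s in stumps)
    return sigmoid(raw)
```

Applying `predict_proba` to the core feature vector of every grid cell and arranging the results by cell position would give a probability map of the kind the claim describes.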

5. The mineral prediction method based on a multimodal Transformer architecture as described in claim 4, characterized in that the geological core record data includes rock hardness and fracture density parameters; the geophysical field measurement data includes resistivity tomography results; the multispectral remote sensing image data includes vegetation index and surface temperature inversion values; the hydrogeological observation data includes permeability coefficient and aquifer thickness; and the mineral chemical analysis data includes rare earth element distribution patterns.

6. The mineral prediction method based on a multimodal Transformer architecture as described in claim 5, characterized in that the feature transformation sublayer uses kernel function mapping to extend the original features into a high-dimensional space, enhancing feature separability; and the interactive attention sublayer uses a multi-head mechanism to process different feature subsets in parallel and fuses the multi-head outputs through concatenation, improving the robustness of relationship modeling.
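The two sublayers of claim 6 can be sketched under explicit assumptions. The claim does not name a kernel, so this sketch assumes an RBF kernel approximated by random Fourier features as the kernel function mapping. The attention part omits the learned query, key, and value projections of a full Transformer layer: each head applies scaled dot-product attention to its own slice of the feature dimension, and the head outputs are concatenated. The names `rff_map` and `multi_head_attention` and the parameters `gamma` and `seed` are illustrative.

```python
import numpy as np

def rff_map(X, out_dim, gamma=1.0, seed=0):
    """Random Fourier features approximating the RBF kernel exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(X.shape[1], out_dim))
    b = rng.uniform(0.0, 2.0 * np.pi, size=out_dim)
    return np.sqrt(2.0 / out_dim) * np.cos(X @ W + b)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(H, n_heads):
    """Projection-free multi-head self-attention over the rows of H.

    H has shape (n_tokens, d). Each head attends over its own d // n_heads
    slice of the features, and the head outputs are concatenated back to d.
    """
    n, d = H.shape
    assert d % n_heads == 0, "feature dimension must be divisible by n_heads"
    d_head = d // n_heads
    heads = H.reshape(n, n_heads, d_head).transpose(1, 0, 2)    # (heads, n, d_head)
    scores = heads @ heads.transpose(0, 2, 1) / np.sqrt(d_head) # (heads, n, n)
    weights = softmax(scores, axis=-1)
    out = weights @ heads                                       # (heads, n, d_head)
    return out.transpose(1, 0, 2).reshape(n, d), weights        # concatenate heads
```

In this sketch each row of `H` would be one modality-level feature token, so the returned `weights` give, per head, the similarity weights between cross-modal features.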

7. A mineral prediction system based on a multimodal Transformer architecture, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the steps of the mineral prediction method based on the multimodal Transformer architecture as described in any one of claims 1 to 6.