A method and system for identifying a provenance based on sediment geochemical data
By preprocessing geochemical data of sediments, calculating characteristic parameters, and establishing machine learning models, the problems of low efficiency and insufficient standardization in existing geochemical data processing technologies have been solved, and automated, quantitative, and reliable provenance identification has been achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHENGDU UNIVERSITY OF TECHNOLOGY
- Filing Date
- 2026-03-19
- Publication Date
- 2026-06-19
AI Technical Summary
Existing geochemical data processing methods rely on manual processing, which is inefficient, prone to errors, and lacks standardized practices. Furthermore, machine learning methods suffer from deficiencies in analytical completeness and standardization.
A provenance identification method based on sediment geochemical data is adopted, including preprocessing, calculation of characteristic parameters of major elements and rare earth elements, projection analysis, cluster analysis and machine learning model establishment, to achieve automated and standardized provenance identification.
It has achieved automation, quantification, and reliability in sediment geochemical data processing, improved the repeatability and comparability of analysis results, reduced the influence of subjective judgment, and can adapt to datasets of different sizes and characteristics.
Figure CN122245484A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of geological analysis technology, specifically to a method and system for determining the provenance of sediments based on geochemical data.
Background Technology
[0002] Sediment geochemical analysis is an important tool for studying the source, weathering degree, and depositional environment of sediments. By analyzing the content and ratios of major elements, trace elements, and rare earth elements in sediments, information such as source rock type, chemical weathering intensity, paleoclimate conditions, and depositional environment can be revealed. The Chemical Index of Alteration (CIA) is one of the most commonly used indicators for evaluating the degree of chemical weathering. CIA quantitatively evaluates the weathering intensity based on the degree of loss of active cations (Ca, Na, K) relative to Al during weathering.
[0003] Existing geochemical data processing methods primarily rely on manual data calculation, discrimination analysis, and analysis. However, manual processing is prone to errors, inefficient, and lacks standardized procedures. Some studies have also used machine learning methods for geochemical data processing, but these methods still suffer from deficiencies in analytical completeness and standardization.
[0004] Therefore, there is an urgent need to establish a complete, standardized, and automated method for processing sediment geochemical data and determining provenance.
Summary of the Invention
[0005] In view of this, the present disclosure provides a method and system for determining the provenance of sediments based on geochemical data, which at least partially solves the problems existing in the prior art.
[0006] In a first aspect, embodiments of this disclosure provide a method for provenance determination based on geochemical data of sediments, which includes the following steps:
[0007] S1: Preprocess the geochemical data of the input sediment samples;
[0008] S2: Based on the preprocessed geochemical data, convert the mass percentages of the major element oxides into mole numbers;
[0009] S3: Based on the mole numbers of the oxides, project the sediment samples onto the A-CN-K ternary composition space to obtain their distribution characteristics in that space;
[0010] S4: Process the rare earth element data of the sediment samples using chondrite normalization to obtain rare earth element characteristic parameters;
[0011] S5: Apply principal component analysis to reduce the dimensionality of the standardized geochemical data, and perform cluster analysis on the reduced data to obtain clustering results for identifying provenance end members;
[0012] S6: Using the clustering results or known provenance labels as the target variable, and the mole numbers, characteristic parameters, and distribution characteristics as inputs, establish a provenance discrimination model with a machine learning classification algorithm, and determine the provenance of the sediments with that model.
[0013] According to a specific implementation manner of an embodiment of the present disclosure, in step S1, the preprocessing includes:
[0014] (1) Processing values below the detection limit: Identify special marks of "<X" format values, "ND", and "BDL" in the geochemical data, and replace them with half of the detection limit value;
[0015] (2) Filling missing values: Fill variables with a missing rate lower than 20% with the median of the variable, and mark variables with a missing rate higher than 20%;
[0016] (3) Negative value correction: Replace negative values in the geochemical data with zero or mark them as abnormal;
[0017] (4) Normalization of major elements: Normalize the contents of oxides of major elements proportionally to a total of 100%.
[0018] According to a specific implementation manner of an embodiment of the present disclosure, in step S2, the conversion formula for converting the mass percentage content of oxides of major elements into mole numbers is:
[0019] $$n_i = \frac{w_i}{M_i}$$
[0020] where $n_i$ is the mole number of the $i$-th major element oxide, $w_i$ is its mass percentage content, and $M_i$ is its molecular weight.
[0021] According to a specific implementation manner of an embodiment of the present disclosure, it further includes:
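[0022] According to the mole numbers of the major element oxides Al₂O₃, CaO, Na₂O, and K₂O, calculate:
[0023] Chemical Index of Alteration (CIA):
[0024] $$\mathrm{CIA} = \frac{\mathrm{Al_2O_3}}{\mathrm{Al_2O_3} + \mathrm{CaO^*} + \mathrm{Na_2O} + \mathrm{K_2O}} \times 100$$
[0025] Chemical Index of Weathering (CIW):
[0026] $$\mathrm{CIW} = \frac{\mathrm{Al_2O_3}}{\mathrm{Al_2O_3} + \mathrm{CaO^*} + \mathrm{Na_2O}} \times 100$$
[0027] Plagioclase Index of Alteration (PIA):
[0028] $$\mathrm{PIA} = \frac{\mathrm{Al_2O_3} - \mathrm{K_2O}}{\mathrm{Al_2O_3} + \mathrm{CaO^*} + \mathrm{Na_2O} - \mathrm{K_2O}} \times 100$$
[0029] where CaO* is the mole number of CaO in the silicate fraction only.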
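[0030] According to a specific implementation of an embodiment of this disclosure, in step S3, A, CN, and K in the A-CN-K ternary composition space represent the molar proportions of Al₂O₃, (CaO* + Na₂O), and K₂O, respectively:
[0031] $$A = \frac{\mathrm{Al_2O_3}}{\mathrm{Al_2O_3} + \mathrm{CaO^*} + \mathrm{Na_2O} + \mathrm{K_2O}} \times 100$$
[0032] $$CN = \frac{\mathrm{CaO^*} + \mathrm{Na_2O}}{\mathrm{Al_2O_3} + \mathrm{CaO^*} + \mathrm{Na_2O} + \mathrm{K_2O}} \times 100$$
[0033] $$K = \frac{\mathrm{K_2O}}{\mathrm{Al_2O_3} + \mathrm{CaO^*} + \mathrm{Na_2O} + \mathrm{K_2O}} \times 100$$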
[0034] Step S3 further includes:
[0035] Drawing the CIA contour lines, which are straight lines parallel to the CN-K base of the diagram;
[0036] Determining weathering trends based on the point distribution of sediment samples in the ternary composition space:
[0037] If the sediment sample points are distributed along a trend line parallel to the A-CN side, this indicates a normal weathering trend.
[0038] If the sediment sample points are shifted toward the K vertex, this indicates the presence of potassium metasomatism.
[0039] According to a specific implementation of an embodiment of this disclosure, step S4 includes:
[0040] Using chondrite normalization, the rare earth element content is divided by the content of the corresponding element in the chondrite:
[0041] $$\mathrm{REE}_N = \frac{\mathrm{REE}_{\text{sample}}}{\mathrm{REE}_{\text{chondrite}}}$$
[0042] The standard values for chondrites are adopted from the recommendations of Sun and McDonough (1989);
[0043] Calculate the degree of fractionation between light rare earth elements and heavy rare earth elements:
[0044] $$(\mathrm{La}/\mathrm{Yb})_N = \frac{\mathrm{La}_N}{\mathrm{Yb}_N}$$
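[0045] $$(\mathrm{La}/\mathrm{Sm})_N = \frac{\mathrm{La}_N}{\mathrm{Sm}_N}$$
[0046] $$(\mathrm{Gd}/\mathrm{Yb})_N = \frac{\mathrm{Gd}_N}{\mathrm{Yb}_N}$$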
[0047] Calculate Eu anomalies:
[0048] $$\delta\mathrm{Eu} = \frac{\mathrm{Eu}_N}{\sqrt{\mathrm{Sm}_N \times \mathrm{Gd}_N}}$$
[0049] Where δEu<1 indicates a negative anomaly, and δEu>1 indicates a positive anomaly;
[0050] Calculate Ce anomalies:
[0051] $$\delta\mathrm{Ce} = \frac{\mathrm{Ce}_N}{\sqrt{\mathrm{La}_N \times \mathrm{Pr}_N}}$$
[0052] Calculate the total rare earth element content ΣREE and the light-to-heavy rare earth element ratio LREE / HREE.
[0053] According to a specific implementation of an embodiment of this disclosure, step S5 includes:
[0054] Z-score standardization was performed on the relevant variables of the geochemical data involved in the analysis to obtain standardized data;
[0055] Principal component analysis (PCA) is used to reduce the dimensionality of standardized data. This includes: calculating the covariance matrix of the standardized data; solving for the eigenvalues and eigenvectors of the covariance matrix; sorting the eigenvalues by size and selecting the principal components with a cumulative variance contribution rate of 85% or higher; and calculating the score of the sediment sample in the principal component space based on the principal components.
[0056] Cluster analysis is performed based on the selected principal components, including: determining the optimal number of clusters K using the silhouette coefficient method, with a search range of K∈[2, min(10, n / 2)]; performing K-Means clustering for each K value and calculating the silhouette coefficient; selecting the K value with the largest silhouette coefficient as the optimal number of clusters; and performing final clustering using the optimal K value to obtain the clustering results of the sediment samples.
[0057] According to a specific implementation of the embodiments of this disclosure, it further includes:
[0058] The reliability of the data processing procedure and of the provenance determination is evaluated by using multi-indicator cross-validation to detect outliers and inconsistencies in the data, including:
[0059] Outlier detection: outliers are detected for each variable in the geochemical data using the interquartile range method. A value $x$ is treated as an outlier if:
[0060] $$x < Q_1 - 1.5\,\mathrm{IQR} \quad \text{or} \quad x > Q_3 + 1.5\,\mathrm{IQR}$$
[0061] where $Q_1$ and $Q_3$ are the first and third quartiles, and $\mathrm{IQR} = Q_3 - Q_1$;
[0062] Major element sum test: Check whether the sum of major element oxides is within a reasonable range;
[0063] Detection of contradictory weathering indicators:
[0064] CIA-WIP contradiction: a contradiction is identified when CIA > 75 and WIP > 60;
[0065] CIA-ICV contradiction: a contradiction is identified when CIA > 80 and ICV > 1, in which case data quality needs to be checked;
[0066] Rare earth element distribution pattern verification:
[0067] Check if the rare earth element distribution curve is smooth;
[0068] Check the consistency between the Eu anomaly and CIA: issue a warning when a strong positive Eu anomaly and a high CIA value occur simultaneously;
[0069] Generate a data quality report that lists detected anomalies and inconsistencies and their possible causes.
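The outlier and contradiction checks above can be sketched in Python as follows. This is an illustration only: the function names, the 1.5 multiplier on the interquartile range, and the choice of the "inclusive" quartile method are assumptions rather than details stated in the disclosure, while the CIA, WIP, and ICV thresholds are the ones given above.

```python
import statistics


def iqr_outliers(values, factor=1.5):
    """Return the indices of values lying outside [Q1 - f*IQR, Q3 + f*IQR]."""
    # The quartile method is an assumption; other conventions shift Q1 and Q3 slightly.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


def weathering_contradictions(cia, wip=None, icv=None):
    """List the contradictory indicator pairs for one sample."""
    issues = []
    # High CIA and high WIP disagree about the weathering intensity.
    if wip is not None and cia > 75 and wip > 60:
        issues.append("CIA-WIP")
    if icv is not None and cia > 80 and icv > 1:
        issues.append("CIA-ICV")
    return issues


print(iqr_outliers([10, 11, 12, 13, 14, 15, 100]))
print(weathering_contradictions(82, wip=65, icv=1.2))
```

A quality report could then be assembled by collecting these results per variable and per sample.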
[0070] In a second aspect, embodiments of this disclosure provide a provenance determination system based on geochemical data of sediments, comprising:
[0071] The preprocessing unit preprocesses the geochemical data of the input sediment samples;
[0072] The mole number calculation unit is used to convert the mass percentage of oxides of major elements into moles based on preprocessed geochemical data, and obtain the mole number of oxides of major elements.
[0073] The projection unit is used to project the sediment sample onto the A-CN-K ternary component space according to the number of moles of the oxide, so as to obtain the distribution characteristics of the sediment sample in the ternary component space.
[0074] The feature calculation unit is used to process the rare earth element data of sediment samples using the chondrite normalization method to obtain rare earth-related feature parameters.
[0075] The clustering unit is used to perform principal component analysis to reduce the dimensionality of the standardized geochemical data and, based on the dimensionality reduction results, to perform cluster analysis to obtain clustering results for identifying provenance end members.
[0076] The model building unit is used to establish a provenance discrimination model using clustering results or known provenance labels as target variables, and mole count, feature parameters, and distribution characteristics as input quantities, and to make provenance judgments on sediments based on the provenance discrimination model.
[0077] According to a specific implementation of this disclosure, a visualization unit is further included, which supports generating the following types of graphics:
[0078] Scatter plot: Supports scatter plots of any two variables in the geochemical data, and allows conditions to be set to highlight specific samples;
[0079] A-CN-K ternary map: automatically labels reference mineral points and CIA contour lines, showing weathering trends;
[0080] Rare earth element distribution chart: Displays the chondrite-normalized rare earth element distribution pattern using logarithmic coordinates;
[0081] PCA biplot: Displays sample scores and variable loadings simultaneously, and supports coloring by clustering results;
[0082] Box plot: Displays the distribution characteristics and outliers of each variable;
[0083] Correlation heatmap: Displays the correlation coefficient matrix between variables;
[0084] Confusion matrix: Displays the comparison between the prediction results of the source discrimination model and the true labels.
[0085] Compared with the prior art, the above embodiment has at least the following beneficial effects:
[0086] (1) Process standardization: A complete sediment geochemical data processing process has been established. From data preprocessing to provenance identification, each step has clear methods and parameters, which improves the repeatability and comparability of the analysis results;
[0087] (2) Automated calculation: The automatic calculation of indicators such as chemical weathering index and rare earth element parameters has been realized, avoiding errors in manual calculation and improving analysis efficiency;
[0088] (3) Quantitative discrimination: Machine learning algorithms are used to establish a source discrimination model, providing quantitative discrimination results and confidence levels, reducing the influence of subjective judgment;
[0089] (4) Quality controllability: A data quality inspection method with multi-index cross-validation was established, which can automatically detect outliers and inconsistencies in the indicators, thereby improving the reliability of the analysis results;
[0090] (5) Parameter adaptation: It can automatically adjust the analysis parameters, such as the number of clusters, according to the data characteristics to adapt to datasets of different sizes and characteristics.
Attached Figure Description
[0091] To more clearly illustrate the technical solutions of the embodiments of this disclosure, the drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this disclosure. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0092] Figure 1 is a schematic flowchart of the provenance determination method based on sediment geochemical data provided in the first embodiment of the present invention;
[0093] Figure 2 is a schematic diagram of the provenance determination system based on sediment geochemical data provided in the second embodiment of the present invention.
Detailed Implementation
[0094] The embodiments of this disclosure will now be described in detail with reference to the accompanying drawings.
[0095] The following specific examples illustrate the implementation of this disclosure. Those skilled in the art can easily understand other advantages and effects of this disclosure from the content disclosed in this specification. Obviously, the described embodiments are only a part of the embodiments of this disclosure, and not all of them. This disclosure can also be implemented or applied through other different specific embodiments, and the details in this specification can also be modified or changed based on different viewpoints and applications without departing from the spirit of this disclosure. It should be noted that, in the absence of conflict, the following embodiments and features in the embodiments can be combined with each other. Based on the embodiments in this disclosure, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this disclosure.
[0096] It should be noted that various aspects of embodiments within the scope of the appended claims are described below. It will be apparent that the aspects described herein can be embodied in a wide variety of forms, and any particular structure and / or function described herein is merely illustrative. Based on this disclosure, those skilled in the art will understand that one aspect described herein can be implemented independently of any other aspect, and two or more of these aspects can be combined in various ways. For example, any number of aspects set forth herein can be used to implement the device and / or practice the method. Additionally, this device and / or method can be implemented using structures and / or functionalities other than one or more of the aspects set forth herein.
[0097] It should also be noted that the illustrations provided in the following embodiments are only schematic representations of the basic concept of this disclosure. The drawings only show the components related to this disclosure and are not drawn according to the number, shape and size of the components in actual implementation. In actual implementation, the form, quantity and proportion of each component can be arbitrarily changed, and the layout of the components may also be more complex.
[0098] Furthermore, specific details are provided in the following description to facilitate a thorough understanding of the examples. However, those skilled in the art will understand that the described aspects can be practiced without these specific details.
[0099] Referring to Figure 1, the first embodiment of the present invention provides a provenance determination method based on sediment geochemical data. The method can be executed by a provenance discrimination device based on sediment geochemical data (hereinafter referred to as the discrimination device), specifically by one or more processors within the discrimination device, to implement the following steps:
[0100] S1: Preprocess the geochemical data of the input sediment samples.
[0101] In this embodiment, the discrimination device can be a computing device with data processing capabilities, such as a mobile terminal, desktop computer, laptop computer, workstation, server, etc., and the present invention does not impose a specific limitation. The discrimination device may have an operating system and an application program (such as a graphical processing program written in Python) installed within it, and the steps of the present invention are implemented by executing this application.
[0102] In this embodiment, the discrimination device can obtain geochemical data of sediment samples by reading raw data files. The discrimination device supports data files in Excel (.xlsx, .xls) and CSV formats. Typically, the first column of the data file is the sediment sample number, and the remaining columns are the content of geochemical elements or oxides.
[0103] In this embodiment, after reading the original data file, the discrimination device can also perform intelligent column name recognition. Specifically, the discrimination device has a built-in element name mapping table that automatically converts non-standard column names into standard symbols. For example, different representations such as "silicon dioxide", "SiO2", and "silicon" are uniformly converted to "SiO2". The mapping rules are executed according to the following priority: (1) direct matching of standard symbols; (2) Chinese name mapping; (3) fuzzy matching.
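The three-level priority described above can be sketched in Python as follows. This is an illustration only: the symbol list, the alias table, the function name `standardize_column`, and the similarity cutoff are hypothetical, and fuzzy matching is done here with the standard library's `difflib`, which the disclosure does not name.

```python
import difflib
from typing import Optional

# Hypothetical, deliberately short tables for illustration only.
STANDARD_SYMBOLS = ["SiO2", "Al2O3", "Fe2O3", "MgO", "CaO", "Na2O", "K2O", "TiO2", "P2O5"]
NAME_ALIASES = {
    "silicon dioxide": "SiO2",
    "silicon": "SiO2",
    "二氧化硅": "SiO2",
    "aluminium oxide": "Al2O3",
    "氧化铝": "Al2O3",
}
_BY_LOWER = {symbol.lower(): symbol for symbol in STANDARD_SYMBOLS}


def standardize_column(name: str, cutoff: float = 0.75) -> Optional[str]:
    """Return the standard symbol for a column name, or None if unrecognized."""
    key = name.strip().lower()
    # Priority 1: direct (case-insensitive) match of a standard symbol.
    if key in _BY_LOWER:
        return _BY_LOWER[key]
    # Priority 2: lookup in the name mapping table.
    if key in NAME_ALIASES:
        return NAME_ALIASES[key]
    # Priority 3: fuzzy match against the standard symbols.
    close = difflib.get_close_matches(key, list(_BY_LOWER), n=1, cutoff=cutoff)
    return _BY_LOWER[close[0]] if close else None


for raw in ["SiO2", "silicon dioxide", "二氧化硅", "Al203", "Depth"]:
    print(raw, "->", standardize_column(raw))
```

Returning `None` for unrecognized names lets the caller decide whether to keep, drop, or report such columns.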
[0104] In this embodiment, after the transformation is completed, the discrimination device further preprocesses the geochemical data, including:
[0105] 1. Special value identification and processing. The discrimination device automatically identifies the following types of special values:
[0106] Values below the detection limit: Matching the format "<X", such as "<0.01", "<0.5", etc.;
[0107] Non-detection markers: Matching "ND", "N.D.", "BDL", "-", etc.;
[0108] Null values: Matching empty cells or "NA", "NULL", etc.
[0109] For values below the detection limit, extract the numerical part and divide it by 2 as the replacement value. For example, "<0.1" is replaced by 0.05. For non-detection markers, replace them with 0.001 or the default value specified by the user.
[0110] 2. Data type conversion. Convert all data columns to numerical types, and mark the values that cannot be converted as missing values.
[0111] 3. Missing value processing. Calculate the missing rate of each variable. For variables with a missing rate lower than 20%, fill them with the median. For variables with a missing rate higher than 20%, retain the missing status and mark it in the report.
[0112] 4. Negative value processing. Geochemical data should theoretically not have negative values. The system replaces negative values with 0 and records the anomaly.
[0113] 5. Major element normalization. For major element data, calculate the total sum of oxides of each sample and normalize it to 100% proportionally:
[0114] $$w_i' = \frac{w_i}{\sum_{j} w_j} \times 100\%$$
[0115] where $w_i$ is the original content of the $i$-th oxide and $w_i'$ is its normalized content. This eliminates the influence of factors such as loss on ignition.
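The preprocessing rules above can be sketched in plain Python as follows. This is an illustration only: the function names and the exact marker sets are assumptions, missing values are represented as `None`, and the anomaly logging mentioned for negative values is omitted for brevity.

```python
import re
import statistics
from typing import Optional

NON_DETECT = {"ND", "N.D.", "BDL", "-"}
NULLS = {"", "NA", "NULL"}
BELOW_LIMIT = re.compile(r"^<\s*(\d+(?:\.\d+)?)$")


def parse_cell(raw, non_detect_value: float = 0.001) -> Optional[float]:
    """Convert one raw cell to a float, or None if it counts as missing."""
    text = "" if raw is None else str(raw).strip()
    if text.upper() in NULLS:
        return None
    if text.upper() in NON_DETECT:
        return non_detect_value
    match = BELOW_LIMIT.match(text)
    if match:
        return float(match.group(1)) / 2.0  # half of the detection limit
    try:
        value = float(text)
    except ValueError:
        return None  # values that cannot be converted are treated as missing
    if value != value:  # NaN
        return None
    return max(value, 0.0)  # negative values are replaced with 0


def fill_missing(column, max_missing_rate: float = 0.20):
    """Median-fill a column with a low missing rate; otherwise flag it."""
    missing = sum(v is None for v in column)
    present = [v for v in column if v is not None]
    if missing == 0:
        return list(column), False
    if present and missing / len(column) < max_missing_rate:
        median = statistics.median(present)
        return [median if v is None else v for v in column], False
    return list(column), True  # left as-is and flagged for the report


def normalize_major(oxides: dict) -> dict:
    """Rescale major element oxide contents so that they sum to 100%."""
    total = sum(oxides.values())
    if total <= 0:
        raise ValueError("oxide total must be positive")
    return {name: value / total * 100.0 for name, value in oxides.items()}


print(parse_cell("<0.1"), parse_cell("ND"), parse_cell("-3.2"), parse_cell("NA"))
print(normalize_major({"SiO2": 60.0, "Al2O3": 15.0, "CaO": 5.0}))
```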
[0116] S2. According to the preprocessed geochemical data, convert the mass percentage content of the oxides of the major elements into moles to obtain the moles of the oxides of the major elements.
[0117] In this embodiment, the conversion formula for converting the mass percentage content of the oxides of the major elements into moles is:
[0118] $$n_i = \frac{w_i}{M_i}$$
[0119] where $n_i$ is the mole number of the $i$-th major element oxide, $w_i$ is its mass percentage content, and $M_i$ is its molecular weight.
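As a small illustration of this conversion, the following Python sketch divides each oxide's mass percentage by its molecular weight. The function name `to_moles` and the rounded molecular weights (g/mol) are assumptions introduced here, not values taken from the disclosure.

```python
# Rounded molecular weights in g/mol (assumed precision; adjust as required).
MOLECULAR_WEIGHT = {
    "SiO2": 60.08, "Al2O3": 101.96, "Fe2O3": 159.69, "MgO": 40.30,
    "CaO": 56.08, "Na2O": 61.98, "K2O": 94.20, "TiO2": 79.87, "P2O5": 141.94,
}


def to_moles(wt_percent: dict) -> dict:
    """Convert oxide mass percentages to mole numbers (per 100 g of sample)."""
    return {
        oxide: content / MOLECULAR_WEIGHT[oxide]
        for oxide, content in wt_percent.items()
        if oxide in MOLECULAR_WEIGHT  # oxides without a known weight are skipped
    }


print(to_moles({"Al2O3": 15.3, "CaO": 2.8, "Na2O": 1.9, "K2O": 2.8}))
```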
[0120] In this embodiment, several important indices can be calculated from the mole numbers of the oxides. For example, based on the mole numbers of the major element oxides Al₂O₃, CaO, Na₂O, and K₂O, the following can be calculated:
[0121] Chemical Index of Alteration (CIA):
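[0122] $$\mathrm{CIA} = \frac{\mathrm{Al_2O_3}}{\mathrm{Al_2O_3} + \mathrm{CaO^*} + \mathrm{Na_2O} + \mathrm{K_2O}} \times 100$$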
[0123] CIA reflects the degree of loss of Ca, Na, and K relative to Al during the weathering process of feldspar minerals. The higher the CIA value, the stronger the weathering. Generally, it is considered that: CIA=50 corresponds to unweathered feldspar; CIA=50-65 is weakly weathered; CIA=65-85 is moderately weathered; and CIA>85 is strongly weathered.
[0124] Chemical Index of Weathering (CIW):
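[0125] $$\mathrm{CIW} = \frac{\mathrm{Al_2O_3}}{\mathrm{Al_2O_3} + \mathrm{CaO^*} + \mathrm{Na_2O}} \times 100$$
[0126] CIW does not consider the influence of potassium and is suitable for evaluating the weathering degree of plagioclase or of samples affected by potassium metasomatism.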
[0127] Plagioclase alteration index (PIA):
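[0128] $$\mathrm{PIA} = \frac{\mathrm{Al_2O_3} - \mathrm{K_2O}}{\mathrm{Al_2O_3} + \mathrm{CaO^*} + \mathrm{Na_2O} - \mathrm{K_2O}} \times 100$$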
[0129] PIA is specifically used to evaluate the degree of alteration of plagioclase, and it excludes the contribution of potassium feldspar.
[0130] Parker Weathering Index (WIP):
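[0131] $$\mathrm{WIP} = \left(\frac{2\,\mathrm{Na_2O}}{0.35} + \frac{\mathrm{MgO}}{0.9} + \frac{2\,\mathrm{K_2O}}{0.25} + \frac{\mathrm{CaO^*}}{0.7}\right) \times 100$$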
[0132] WIP is calculated based on the bond strength of alkali metals and alkaline earth metals. The lower the WIP value, the stronger the weathering.
[0133] In the above formulas, CaO* is the number of moles of CaO in the silicate fraction, calculated using an improved silicate CaO correction method.
[0134] Specifically, CaO in sediments may originate from three types of minerals: (1) silicate minerals (plagioclase, pyroxene, etc.); (2) carbonate minerals (calcite, dolomite, etc.); and (3) phosphate minerals (apatite, etc.). When calculating the chemical weathering index, CaO from non-silicate sources needs to be excluded.
[0135] This embodiment employs an improved two-step correction method:
[0136] The first step is to subtract the Ca contributed by phosphate. Assuming the phosphate mineral is apatite, Ca5(PO4)3(F,Cl,OH), its Ca / P molar ratio is 5:3, so 1 mol of P2O5 corresponds to 10 / 3 mol of Ca:
[0137] $$\mathrm{CaO_{corr}} = \mathrm{CaO} - \frac{10}{3}\,\mathrm{P_2O_5}$$
[0138] If the calculation result is negative, it is set to 0.
[0139] The second step is to estimate the Ca content in the silicates. For samples that have not undergone carbonate separation, it is assumed that the Ca content in the silicates does not exceed the Na content (based on the stoichiometry of plagioclase), and the smaller of the phosphate-corrected CaO and Na₂O is taken:
[0140] $$\mathrm{CaO^*} = \min\left(\mathrm{CaO_{corr}},\ \mathrm{Na_2O}\right)$$
[0141] That is, the final number of moles of CaO is: $$\mathrm{CaO^*} = \min\left(\max\left(\mathrm{CaO} - \tfrac{10}{3}\,\mathrm{P_2O_5},\ 0\right),\ \mathrm{Na_2O}\right)$$
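The two-step correction and the weathering indices can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, all inputs are assumed to be mole numbers, and the WIP coefficients follow the commonly cited form of Parker's index, which should be checked against the formula actually used in the embodiment.

```python
def silicate_cao(cao: float, na2o: float, p2o5: float = 0.0) -> float:
    """Two-step CaO* correction on mole numbers."""
    # Step 1: remove apatite-bound Ca (10/3 mol Ca per mol P2O5), floored at 0.
    cao_corr = max(cao - (10.0 / 3.0) * p2o5, 0.0)
    # Step 2: silicate Ca is assumed not to exceed Na.
    return min(cao_corr, na2o)


def weathering_indices(moles: dict) -> dict:
    """Compute CIA, CIW, PIA and WIP from oxide mole numbers."""
    al, na, k = moles["Al2O3"], moles["Na2O"], moles["K2O"]
    mg = moles.get("MgO", 0.0)
    ca = silicate_cao(moles["CaO"], na, moles.get("P2O5", 0.0))
    return {
        "CIA": al / (al + ca + na + k) * 100.0,
        "CIW": al / (al + ca + na) * 100.0,
        "PIA": (al - k) / (al + ca + na - k) * 100.0,
        # Assumed coefficients (commonly cited form of the Parker index).
        "WIP": (2 * na / 0.35 + mg / 0.9 + 2 * k / 0.25 + ca / 0.7) * 100.0,
    }


sample = {"Al2O3": 0.15, "CaO": 0.05, "Na2O": 0.03, "K2O": 0.02, "P2O5": 0.003}
print(weathering_indices(sample))
```

In this example the phosphate correction lowers CaO from 0.05 to 0.04 mol, and the Na₂O cap then limits CaO* to 0.03 mol.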
[0142] S3: Based on the mole numbers of the oxides, project the sediment samples onto the A-CN-K ternary composition space to obtain their distribution characteristics in that space.
[0143] In this embodiment, the A-CN-K ternary diagram is an important tool for analyzing chemical weathering trends. Specifically, step S3 includes:
[0144] S31: Calculate the ternary coordinates. A, CN, and K represent the molar ratios of Al₂O₃, (CaO*+Na₂O), and K₂O, respectively.
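[0145] $$A = \frac{\mathrm{Al_2O_3}}{\mathrm{Al_2O_3} + \mathrm{CaO^*} + \mathrm{Na_2O} + \mathrm{K_2O}} \times 100$$
[0146] $$CN = \frac{\mathrm{CaO^*} + \mathrm{Na_2O}}{\mathrm{Al_2O_3} + \mathrm{CaO^*} + \mathrm{Na_2O} + \mathrm{K_2O}} \times 100$$
[0147] $$K = \frac{\mathrm{K_2O}}{\mathrm{Al_2O_3} + \mathrm{CaO^*} + \mathrm{Na_2O} + \mathrm{K_2O}} \times 100$$
[0148] These coordinates satisfy A + CN + K = 100.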
[0149] S32: Mark the reference mineral point. Mark the theoretical location of the reference mineral in the A-CN-K ternary diagram:
[0150] S33: Draw the CIA contour lines. The CIA contour lines are straight lines parallel to the CN-K base of the diagram, and the CIA value is equal to the coordinate value of A. Draw the CIA = 50, 60, 70, 80, and 90 contour lines in the A-CN-K ternary diagram.
[0151] S34: Weathering trend assessment.
[0152] Among them, for the normal weathering trend: the sample points are distributed along the trend line parallel to the A-CN side, evolving from plagioclase to kaolinite, reflecting the normal weathering process in which Ca and Na are lost preferentially over K.
[0153] For potassium metasomatism: if the sample points are biased towards the K-vertex direction, or the trend line points towards illite rather than kaolinite, it indicates the presence of potassium metasomatism during diagenesis. In this case, the CIA value may underestimate the actual degree of weathering.
[0154] For multi-source mixtures: the sample points are dispersed and do not show a single trend line, which may indicate the mixture of multiple sources.
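The ternary coordinates of S31 and their placement on a plot can be sketched in Python as follows. This is an illustration only: the function names and the Cartesian layout (CN at the lower left, K at the lower right, A at the apex) are assumptions made for plotting, not details given in the disclosure.

```python
import math


def acnk_coordinates(al2o3: float, cao_star: float, na2o: float, k2o: float):
    """Return (A, CN, K) in percent from mole numbers; they sum to 100."""
    total = al2o3 + cao_star + na2o + k2o
    if total <= 0:
        raise ValueError("mole total must be positive")
    a = al2o3 / total * 100.0
    cn = (cao_star + na2o) / total * 100.0
    k = k2o / total * 100.0
    return a, cn, k


def ternary_to_xy(a: float, cn: float, k: float):
    """Map (A, CN, K) to x, y with CN at (0, 0), K at (1, 0), A at the apex."""
    total = a + cn + k
    x = (k + 0.5 * a) / total
    y = (math.sqrt(3) / 2.0) * a / total
    return x, y


a, cn, k = acnk_coordinates(0.15, 0.03, 0.03, 0.02)
print(round(a, 2), round(cn, 2), round(k, 2))
print(ternary_to_xy(a, cn, k))
```

Because `y` depends only on A, points with equal A (and therefore equal CIA) fall on horizontal lines, which is why the CIA contour lines run parallel to the CN-K base.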
[0155] S4: Process the rare earth element data of the sediment samples using chondrite normalization to obtain rare earth element characteristic parameters.
[0156] In this embodiment, the rare earth element distribution map is an important tool for analyzing the source characteristics. Specifically, step S4 includes:
[0157] S41: Chondrite normalization. The rare earth element content in the sample is divided by the content of the corresponding element in chondrite to eliminate the odd-even (Oddo-Harkins) effect in rare earth element abundances.
[0158] In particular, this embodiment adopts the standard values for chondrites recommended by Sun and McDonough (1989).
[0159] S42: Calculate the characteristic parameters, including:
[0160] Degree of fractionation of light and heavy rare earth elements:
[0161] $$(\mathrm{La}/\mathrm{Yb})_N = \frac{\mathrm{La}_N}{\mathrm{Yb}_N}$$
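[0162] $$(\mathrm{La}/\mathrm{Sm})_N = \frac{\mathrm{La}_N}{\mathrm{Sm}_N}$$
[0163] $$(\mathrm{Gd}/\mathrm{Yb})_N = \frac{\mathrm{Gd}_N}{\mathrm{Yb}_N}$$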
[0164] (La / Yb)N>1 indicates enrichment in light rare earth elements, while (La / Yb)N<1 indicates enrichment in heavy rare earth elements.
[0165] Eu anomaly:
[0166] $$\delta\mathrm{Eu} = \frac{\mathrm{Eu}_N}{\sqrt{\mathrm{Sm}_N \times \mathrm{Gd}_N}}$$
[0167] δEu<1 indicates a negative anomaly, suggesting fractional crystallization of plagioclase or residual plagioclase in the source region; δEu>1 indicates a positive anomaly, suggesting plagioclase accumulation or a reducing environment in the source region.
[0168] Ce anomaly:
[0169] $$\delta\mathrm{Ce} = \frac{\mathrm{Ce}_N}{\sqrt{\mathrm{La}_N \times \mathrm{Pr}_N}}$$
[0170] δCe<1 is a negative anomaly, indicating that Ce⁴⁺ precipitated preferentially under oxidizing conditions; δCe>1 is a positive anomaly, which is rare.
[0171] Total amount of rare earth elements:
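[0172] $$\Sigma\mathrm{REE} = \mathrm{La} + \mathrm{Ce} + \mathrm{Pr} + \mathrm{Nd} + \mathrm{Sm} + \mathrm{Eu} + \mathrm{Gd} + \mathrm{Tb} + \mathrm{Dy} + \mathrm{Ho} + \mathrm{Er} + \mathrm{Tm} + \mathrm{Yb} + \mathrm{Lu}$$
[0173] Light-to-heavy rare earth ratio:
[0174] $$\frac{\mathrm{LREE}}{\mathrm{HREE}} = \frac{\mathrm{La} + \mathrm{Ce} + \mathrm{Pr} + \mathrm{Nd} + \mathrm{Sm} + \mathrm{Eu}}{\mathrm{Gd} + \mathrm{Tb} + \mathrm{Dy} + \mathrm{Ho} + \mathrm{Er} + \mathrm{Tm} + \mathrm{Yb} + \mathrm{Lu}}$$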
[0175] S43, Rare Earth Element Distribution Chart. A rare earth element distribution curve is plotted with the atomic number of the rare earth elements on the x-axis and the logarithm of the chondrite-normalized value on the y-axis. The shape of the distribution curve reflects the provenance characteristics and includes the following types:
[0176] Right-leaning type (enriched with light rare earth elements): a typical upper crustal feature;
[0177] Flat type: typical mid-ocean ridge basalt characteristics;
[0178] Left-leaning type (heavy rare earth element enrichment): less common, may indicate a special source region or differentiation process.
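Chondrite normalization and the characteristic parameters of S42 can be sketched in Python as follows. This is an illustration only. The chondrite values (ppm) are the ones commonly quoted from Sun and McDonough (1989) and should be verified against the original table. The geometric-mean form of δEu and δCe and the La to Eu / Gd to Lu split of LREE and HREE are assumed conventions, and the function names are hypothetical.

```python
import math

# C1 chondrite values in ppm, as commonly quoted from Sun and McDonough (1989).
CHONDRITE = {
    "La": 0.237, "Ce": 0.612, "Pr": 0.095, "Nd": 0.467, "Sm": 0.153,
    "Eu": 0.058, "Gd": 0.2055, "Tb": 0.0374, "Dy": 0.254, "Ho": 0.0566,
    "Er": 0.1655, "Tm": 0.0255, "Yb": 0.170, "Lu": 0.0254,
}
LREE = ("La", "Ce", "Pr", "Nd", "Sm", "Eu")
HREE = ("Gd", "Tb", "Dy", "Ho", "Er", "Tm", "Yb", "Lu")


def chondrite_normalize(sample: dict) -> dict:
    """Divide each measured REE content (ppm) by its chondrite value."""
    return {el: sample[el] / CHONDRITE[el] for el in CHONDRITE if el in sample}


def ree_parameters(sample: dict) -> dict:
    """Compute REE parameters for a sample that reports all 14 elements."""
    n = chondrite_normalize(sample)
    return {
        "(La/Yb)N": n["La"] / n["Yb"],
        "dEu": n["Eu"] / math.sqrt(n["Sm"] * n["Gd"]),
        "dCe": n["Ce"] / math.sqrt(n["La"] * n["Pr"]),
        "sum_REE": sum(sample[el] for el in CHONDRITE),
        "LREE/HREE": sum(sample[el] for el in LREE) / sum(sample[el] for el in HREE),
    }


# A synthetic sample with a flat pattern at 10 times chondrite, then Eu halved.
flat = {el: 10.0 * value for el, value in CHONDRITE.items()}
depleted = dict(flat, Eu=flat["Eu"] / 2.0)
print(ree_parameters(flat)["dEu"], ree_parameters(depleted)["dEu"])
```

The synthetic samples are only a sanity check: a flat pattern should give ratios of 1, and halving Eu should give a negative Eu anomaly of 0.5.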
[0179] S5: Perform principal component analysis to reduce the dimensionality of the standardized geochemical data, then perform cluster analysis on the reduced data to obtain clustering results for identifying provenance end members.
[0180] In this embodiment, step S5 mainly includes two parts: principal component analysis and cluster analysis. Specifically, it includes:
[0181] S51: Variable Selection. Select variables from the geochemical data used for multivariate statistical analysis. Generally, major elements (SiO2, Al2O3, Fe2O3, MgO, CaO, Na2O, K2O, TiO2) and / or trace elements (such as Th, Sc, Zr, Cr, Co, Ni, etc.) are selected. Derived indices (such as CIA, CIW, etc.) are excluded to avoid information redundancy.
[0182] S52: Data Standardization. Perform Z-score standardization on the selected variables:
[0183] $$z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}$$
[0184] where $x_{ij}$ is the original value of the $j$-th variable for the $i$-th sample, $\bar{x}_j$ is the mean of the $j$-th variable, and $s_j$ is the standard deviation of the $j$-th variable. After standardization, the mean of each variable is 0, and the standard deviation is 1.
[0185] S53: Principal component analysis.
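[0186] (1) Calculate the covariance matrix of the standardized data (which equals the correlation coefficient matrix of the original variables):
[0187] $$R = \frac{1}{n-1} Z^{\mathsf{T}} Z$$
where $Z$ is the $n \times p$ matrix of standardized data for $n$ samples and $p$ variables;
[0188] (2) Solve for the eigenvalues $\lambda_k$ and eigenvectors $v_k$ of the covariance matrix, such that $R v_k = \lambda_k v_k$;
[0189] (3) Sort the components in descending order of their eigenvalues and calculate the variance contribution rate of each principal component:
[0190] $$\eta_k = \frac{\lambda_k}{\sum_{j=1}^{p} \lambda_j}$$
[0191] (4) Select the top m principal components whose cumulative variance contribution rate reaches 85% or more;
[0192] (5) Calculate the score of the sample in the principal component space:
[0193] $$T = Z V_m, \quad V_m = [v_1, v_2, \dots, v_m]$$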
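The standardization of S52 and the component selection rule of S53 can be sketched in plain Python as follows. This is a partial illustration only: the function names are hypothetical, the sample standard deviation (n - 1 denominator) is an assumption, and the eigenvalue decomposition itself is left to a numerical library (for instance NumPy's `numpy.linalg.eigh`), so `select_components` takes the eigenvalues as input.

```python
import statistics


def zscore_columns(rows):
    """Standardize each column to mean 0 and sample standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.stdev(col) for col in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def correlation_matrix(z):
    """R = Z^T Z / (n - 1) for standardized data Z given as a list of rows."""
    n, p = len(z), len(z[0])
    return [
        [sum(z[i][a] * z[i][b] for i in range(n)) / (n - 1) for b in range(p)]
        for a in range(p)
    ]


def select_components(eigenvalues, threshold=0.85):
    """Return (m, contribution rates) for the smallest m reaching the threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    rates = [value / total for value in ordered]
    cumulative = 0.0
    for m, rate in enumerate(rates, start=1):
        cumulative += rate
        if cumulative >= threshold:
            return m, rates
    return len(ordered), rates


z = zscore_columns([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
print(correlation_matrix(z))
print(select_components([4.0, 3.0, 2.0, 1.0]))
```

With eigenvalues 4, 3, 2, 1 the cumulative contribution rates are 40%, 70%, and 90%, so three components are retained.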
[0194] S54: Cluster analysis.
[0195] (1) Determine the number of clusters K. The optimal number of clusters is automatically determined using the silhouette coefficient method. The silhouette coefficient is defined as:
[0196] $$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\ b(i)\}}$$
[0197] where $a(i)$ is the average distance between sample $i$ and the other samples in the same cluster, and $b(i)$ is the average distance between sample $i$ and the samples of its nearest neighboring cluster. The silhouette coefficient ranges over [-1, 1], with a larger value indicating better clustering performance.
[0198] (2) Iterate through K=2 to K=min(10, n / 2), perform K-Means clustering for each K value, and calculate the average silhouette coefficient;
[0199] (3) Select the K value with the largest average silhouette coefficient as the optimal number of clusters;
[0200] (4) Perform the final clustering using the optimal K value to obtain the sample grouping results, i.e. the clustering results.
[0201] In this implementation, the clustering results can also be evaluated to determine the clustering quality. Specifically, the following indicators are calculated to evaluate clustering quality:
[0202] Silhouette Score: Values range from -1 to 1, with >0.5 indicating good performance and >0.7 indicating excellent performance.
[0203] Calinski-Harabasz index: The higher the value, the better, reflecting the ratio of inter-cluster separation to intra-cluster compactness;
[0204] The Davies-Bouldin index: the smaller the value, the better, reflecting the ratio of intra-cluster dispersion to inter-cluster distance.
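The search for the optimal K in S54 can be sketched in plain Python as follows. This is an illustration only: it uses a basic Lloyd-style K-Means with a fixed random seed and a hand-written silhouette coefficient, and the function names, the seed, and the handling of single-sample clusters (score 0) are assumptions. A production implementation would more likely rely on an established clustering library.

```python
import math
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k distinct samples as initial centers
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels


def mean_silhouette(points, labels):
    """Average of s(i) = (b - a) / max(a, b) over all samples."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return -1.0
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for single-sample clusters
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items() if other_lab != lab
        )
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(scores) / len(scores)


def choose_k(points, k_max=10):
    """Try K = 2 .. min(k_max, n // 2) and keep the best average silhouette."""
    best = (None, -2.0, None)
    for k in range(2, min(k_max, len(points) // 2) + 1):
        labels = kmeans(points, k)
        score = mean_silhouette(points, labels)
        if score > best[1]:
            best = (k, score, labels)
    return best


pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
k, score, labels = choose_k(pts)
print(k, round(score, 3), labels)
```

The example uses two well-separated synthetic groups of four points, for which K = 2 gives the highest average silhouette coefficient.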
[0205] S6. Using clustering results or known provenance labels as target variables, and mole count, feature parameters, and distribution characteristics as inputs, a machine learning classification algorithm is used to establish a provenance discrimination model, and the provenance of sediments is determined based on the provenance discrimination model.
[0206] In this embodiment, the source discrimination model construction includes the following steps:
[0207] S61: Determine the target variable. The target variable can be: grouping labels obtained from cluster analysis, known provenance types provided, or sedimentary facies or stratigraphic levels determined based on geological background.
[0208] S62: Feature Selection. Geochemical indicators that contribute significantly to provenance determination are selected as feature variables. These include the aforementioned mole count, feature parameters, and distribution characteristics. Specifically, this embodiment recommends using the following combination of indicators for feature selection:
[0209] Combination 1 (major elements): SiO2, Al2O3, Fe2O3, MgO, CaO, Na2O, K2O, TiO2;
[0210] Combination 2 (trace elements): Th, Sc, Zr, Cr, Co, Ni, V, Rb, Sr, Ba;
[0211] Combination 3 (rare earth elements): La, Ce, Nd, Sm, Eu, Gd, Yb, Lu or (La / Yb)N, δEu, ΣREE;
[0212] Combination 4 (element ratios): Th / Sc, Zr / Sc, La / Sc, Cr / Th, K2O / Na2O.
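As an illustration of Combination 4, the element ratios can be derived from a sample's measured values as sketched below. The function name `ratio_features`, the dictionary layout, and the choice to return `None` when a value is missing or the denominator is zero are assumptions of this sketch.

```python
# Ratio pairs listed in Combination 4 (numerator, denominator).
RATIOS = [("Th", "Sc"), ("Zr", "Sc"), ("La", "Sc"), ("Cr", "Th"), ("K2O", "Na2O")]

def ratio_features(sample):
    """Return the element ratios for one sample given as {variable: value}.

    A ratio is None when either value is missing or the denominator is zero
    (an assumption of this sketch; the text does not specify this case).
    """
    features = {}
    for num, den in RATIOS:
        n, d = sample.get(num), sample.get(den)
        features[f"{num}/{den}"] = n / d if n is not None and d else None
    return features

sample = {"Th": 10.0, "Sc": 5.0, "Zr": 200.0, "La": 30.0,
          "Cr": 50.0, "K2O": 3.0, "Na2O": 1.5}
features = ratio_features(sample)
```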
[0213] S63: Dataset partitioning.
[0214] (1) Check the number of samples in each category and statistically analyze the category distribution;
[0215] (2) Remove categories with fewer than 2 samples and record a warning message;
[0216] (3) Dynamically adjust the proportion of the test set based on the minimum number of samples in the smallest category:
[0217] Minimum class sample size ≥ 10: Test set proportion 25%;
[0218] Minimum class size 5-9: Test set proportion 20%;
[0219] Minimum class sample size 3-4: Test set proportion 15%;
[0220] (4) Use stratified sampling to divide the training set and the test set to ensure that the proportion of each category in the training set and the test set is consistent;
[0221] (5) If stratified sampling fails (too few samples in a certain category), ordinary random sampling shall be used instead.
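The partitioning rules of S63 can be sketched as follows. All function names are hypothetical. Two points are assumptions rather than part of this embodiment: the text gives no test proportion for a smallest category of exactly 2 samples, so 15% is reused here, and each category contributes at least one test sample and keeps at least one training sample. The random-sampling fallback of step (5) is omitted, because this simple per-category split does not fail.

```python
import random
from collections import Counter

def choose_test_fraction(min_class_size):
    """Test-set proportion as a function of the smallest category size."""
    if min_class_size >= 10:
        return 0.25
    if min_class_size >= 5:
        return 0.20
    # Sizes 3-4 use 15%; size 2 is not specified in the text, 15% is assumed.
    return 0.15

def drop_rare_classes(samples, labels, min_count=2):
    """Remove categories with fewer than min_count samples; report which ones."""
    counts = Counter(labels)
    dropped = sorted(c for c, n in counts.items() if n < min_count)
    kept = [(x, y) for x, y in zip(samples, labels) if counts[y] >= min_count]
    return kept, dropped

def stratified_split(pairs, test_fraction, seed=0):
    """Split each category separately so both sets keep similar proportions."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in pairs:
        by_class.setdefault(y, []).append((x, y))
    train, test = [], []
    for items in by_class.values():
        rng.shuffle(items)
        n_test = max(1, round(len(items) * test_fraction))
        n_test = min(n_test, len(items) - 1)  # keep at least one training sample
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test

samples = list(range(17))
labels = ["A"] * 10 + ["B"] * 6 + ["C"]
pairs, dropped = drop_rare_classes(samples, labels)
if dropped:
    print("warning: removed categories with fewer than 2 samples:", dropped)
min_size = min(Counter(y for _, y in pairs).values())
fraction = choose_test_fraction(min_size)
train, test = stratified_split(pairs, fraction)
```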
[0222] S64: Model training. This embodiment supports the following classification algorithms:
[0223] Random Forest: Integrates multiple decision trees, has strong generalization ability and anti-overfitting ability, and can output feature importance;
[0224] Gradient Boosting: Achieves high prediction accuracy by iteratively optimizing the residuals;
[0225] Support Vector Machine (SVM): Finds the optimal classification hyperplane in a high-dimensional space, suitable for small sample data;
[0226] K-Nearest Neighbors (KNN): Classifies samples based on their distances, which is simple and intuitive.
[0227] S65: Model Evaluation.
[0228] (1) Calculate the training-set accuracy and the test-set accuracy;
[0229] (2) Perform K-fold cross-validation, with K value set to min(5, minimum number of samples in the smallest class), and calculate the mean and standard deviation of the cross-validation accuracy.
[0230] (3) Calculate the confusion matrix and analyze the classification effect of each category;
[0231] (4) Calculate precision, recall, and F1 score;
[0232] (5) Overfitting detection: if the training-set accuracy exceeds the test-set accuracy by more than the set threshold, the model is deemed to be at risk of overfitting, and it is recommended to increase the sample size or simplify the model.
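The evaluation quantities of S65 can be sketched with the standard library alone. The function names are hypothetical, and the `max_gap` value used for the overfitting check is purely an illustrative assumption, since the threshold is not given in the text above.

```python
from collections import Counter

def cv_folds(labels):
    """K for K-fold cross-validation: min(5, size of the smallest category)."""
    return min(5, min(Counter(labels).values()))

def confusion_matrix(y_true, y_pred, classes):
    """Rows are true categories, columns are predicted categories."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def per_class_metrics(matrix, classes):
    """Precision, recall and F1 score for each category."""
    metrics = {}
    for i, c in enumerate(classes):
        tp = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column total
        actual = sum(matrix[i])                    # row total
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[c] = (precision, recall, f1)
    return metrics

def overfitting_risk(train_acc, test_acc, max_gap=0.15):
    """max_gap is a hypothetical threshold chosen only for this sketch."""
    return train_acc - test_acc > max_gap

classes = ["A", "B", "C"]
y_true = ["A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "B", "B", "B", "C"]
matrix = confusion_matrix(y_true, y_pred, classes)
metrics = per_class_metrics(matrix, classes)
```

Reading the matrix row by row shows which provenance categories are confused with each other, which the overall accuracy alone does not reveal.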
[0233] S66: Feature Importance Analysis. For random forest and gradient boosting algorithms, output the importance score of each feature variable, sorted in descending order of importance. The importance score reflects the contribution of the feature to the classification result and can be used to identify key discriminant indicators.
[0234] In this embodiment, after the provenance discrimination model is obtained through training, the features of the corresponding sediments can be input into the provenance discrimination model to determine the provenance of the sediments in the subsequent identification process.
[0235] Furthermore, this embodiment can also perform data quality verification based on the obtained indices or features. Specifically, data quality verification includes the following detection items:
[0236] 1. Outlier detection.
[0237] For each geochemical variable, outliers were detected using the interquartile range (IQR) method:
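[0238] $\mathrm{IQR} = Q_3 - Q_1$, where $Q_1$ and $Q_3$ are the first and third quartiles of the variable.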
[0239] Outlier identification criteria:
[0240]
[0241] Output the number of outliers and the specific sample number for each variable.
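A minimal sketch of IQR-based outlier detection for a single variable follows. The fence multiplier `k = 1.5` (the conventional Tukey fences), the "inclusive" quartile method, and the function name `iqr_outliers` are assumptions of this sketch; the exact criterion used by this embodiment is not reproduced in the text above.

```python
import statistics

def iqr_outliers(sample_ids, values, k=1.5):
    """Return the sample numbers whose value lies outside the IQR fences.

    k = 1.5 is the conventional multiplier and is an assumption here.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [sid for sid, v in zip(sample_ids, values) if v < lower or v > upper]

sample_ids = ["S1", "S2", "S3", "S4", "S5", "S6"]
zr_ppm = [2.0, 1.0, 100.0, 3.0, 5.0, 4.0]  # made-up values for illustration
flagged = iqr_outliers(sample_ids, zr_ppm)
print(len(flagged), flagged)
```

Running the function once per geochemical variable yields the per-variable outlier counts and sample numbers described above.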
[0242] 2. Test for the sum of major elements.
[0243] Calculate the sum of major element oxides in each sample (before normalization) and check if it is within a reasonable range:
[0244] Normal range: 95%-105%;
[0245] Warning range: 90%-95% or 105%-110%;
[0246] Abnormal range: <90% or >110%.
[0247] Possible causes of total anomalies include: analytical errors, lack of certain components (such as H2O, CO2), and dehydration of hydrous minerals.
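The three ranges above map directly onto a small classification function. The function name `classify_total` and the oxide values in the example are illustrative assumptions.

```python
def classify_total(total):
    """Classify the sum of major element oxides (wt%, before normalization)."""
    if 95.0 <= total <= 105.0:
        return "normal"
    if 90.0 <= total <= 110.0:
        return "warning"   # 90-95% or 105-110%
    return "abnormal"      # <90% or >110%

# Made-up oxide contents (wt%) for one sample.
oxides = {"SiO2": 65.2, "Al2O3": 15.1, "Fe2O3": 5.3, "MgO": 2.1,
          "CaO": 3.0, "Na2O": 2.8, "K2O": 3.1, "TiO2": 0.7}
status = classify_total(sum(oxides.values()))
```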
[0248] 3. Detection of contradictions in weathering indicators.
[0249] Detect the following types of contradictory indicators:
[0250] (1) CIA-WIP contradiction: CIA > 75 (strong weathering) but WIP > 60 (weak weathering). CIA and WIP should be negatively correlated. If a high CIA and a high WIP occur together, possible reasons include:
[0251] Potassium metasomatism, which leads to an underestimation of CIA;
[0252] Incomplete carbonate correction, which results in an overestimated WIP;
[0253] Analytical error in the data.
[0254] (2) CIA-ICV contradiction: CIA > 80 (strong weathering) but ICV > 1 (compositionally immature). Strong weathering should increase compositional maturity (lower ICV). If a high CIA and a high ICV occur together, the data quality needs to be checked.
[0255] (3) CIA-A / CNK contradiction: CIA > 70 but A / CNK < 1. Samples with a high CIA should be peraluminous (A / CNK > 1). If they show metaluminous characteristics instead, there may be a data problem.
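The three contradiction rules can be expressed as simple threshold checks on the indices of one sample. The function name and the returned flag strings are assumptions of this sketch; the thresholds are those stated above.

```python
def weathering_contradictions(cia, wip, icv, a_cnk):
    """Return the names of the contradiction rules triggered by one sample."""
    flags = []
    if cia > 75 and wip > 60:
        flags.append("CIA-WIP")
    if cia > 80 and icv > 1:
        flags.append("CIA-ICV")
    if cia > 70 and a_cnk < 1:
        flags.append("CIA-A/CNK")
    return flags

# Made-up index values for two samples.
suspect = weathering_contradictions(cia=82, wip=65, icv=1.2, a_cnk=0.9)
consistent = weathering_contradictions(cia=78, wip=40, icv=0.8, a_cnk=1.4)
```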
[0256] 4. Verification of rare earth element distribution patterns.
[0257] (1) Smoothness test of the distribution curve: the chondrite-normalized values of adjacent rare earth elements should not fluctuate sharply (except for Eu). A sawtooth-shaped distribution curve suggests possible analytical error.
[0258] (2) Eu anomaly-CIA consistency test: A strong positive Eu anomaly (δEu>1.2) usually indicates weak weathering or plagioclase cumulates. If a high CIA value (>80) is also present, there is a contradiction.
[0259] (3) Verification of the rationality of Ce anomalies: Marine sediments may show negative Ce anomalies, while terrestrial sediments generally do not show obvious Ce anomalies. If a strong Ce anomaly is found in a terrestrial sample, it needs to be verified.
[0260] 5. Correlation test. Check whether the expected positively correlated element pairs meet the criteria:
[0261] Al2O3 and K2O should be positively correlated (controlled by clay minerals);
[0262] TiO2 and Al2O3 should be positively correlated (enriched in fine-grained sediments);
[0263] Zr and Hf should show a strong positive correlation (controlled by zircon).
[0264] Th and U should be positively correlated (controlled by heavy minerals).
[0265] If the correlation is abnormal, there may be data problems or special geological processes.
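The correlation test can be sketched as a Pearson correlation check over the expected element pairs. The function names, the column-oriented `table` layout, and the criterion that a pair passes when r is greater than `min_r = 0` are assumptions of this sketch; the text does not state a numerical threshold.

```python
import math

# Element pairs that are expected to be positively correlated.
EXPECTED_POSITIVE = [("Al2O3", "K2O"), ("TiO2", "Al2O3"), ("Zr", "Hf"), ("Th", "U")]

def pearson(x, y):
    """Pearson correlation coefficient; NaN if either column is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else math.nan

def check_expected_correlations(table, min_r=0.0):
    """Map each available pair to (r, passed); pairs with missing columns are skipped."""
    report = {}
    for a, b in EXPECTED_POSITIVE:
        if a in table and b in table:
            r = pearson(table[a], table[b])
            report[(a, b)] = (r, r > min_r)
    return report

# Made-up columns for four samples.
table = {"Al2O3": [10, 12, 14, 16], "K2O": [2.0, 2.5, 2.9, 3.4],
         "Th": [5, 6, 7, 8], "U": [3.0, 2.5, 2.0, 1.0]}
report = check_expected_correlations(table)
```

Pairs whose second entry is `False` are candidates for the data check or geological interpretation mentioned above.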
[0266] Compared with the prior art, the above embodiment has at least the following beneficial effects:
[0267] (1) Process standardization: A complete sediment geochemical data processing process has been established. From data preprocessing to provenance identification, each step has clear methods and parameters, which improves the repeatability and comparability of the analysis results;
[0268] (2) Automated calculation: The automatic calculation of indicators such as chemical weathering index and rare earth element parameters has been realized, avoiding errors in manual calculation and improving analysis efficiency;
[0269] (3) Quantitative discrimination: Machine learning algorithms are used to establish a provenance discrimination model, providing quantitative discrimination results and confidence levels, reducing the influence of subjective judgment;
[0270] (4) Quality controllability: A data quality inspection method with multi-index cross-validation was established, which can automatically detect outliers and inconsistencies in the indicators, thereby improving the reliability of the analysis results;
[0271] (5) Parameter adaptation: It can automatically adjust the analysis parameters, such as the number of clusters, according to the data characteristics to adapt to datasets of different sizes and characteristics.
[0272] Referring to Figure 2, the second embodiment of the present invention provides a provenance determination system based on geochemical data of sediments, comprising:
[0273] A preprocessing unit 210, used to preprocess the geochemical data of the input sediment sample;
[0274] The mole number calculation unit 220 is used to convert the mass percentage content of the oxides of major elements into moles based on the preprocessed geochemical data, so as to obtain the mole number of the oxides of major elements.
[0275] Projection unit 230 is used to project the sediment sample onto the A-CN-K ternary component space according to the number of moles of the oxide, so as to obtain the distribution characteristics of the sediment sample in the ternary component space;
[0276] The feature calculation unit 240 is used to process the rare earth element data of sediment samples using the chondrite normalization method to obtain rare earth-related feature parameters.
[0277] Clustering unit 250 is used to perform principal component analysis to reduce the dimensionality of the standardized geochemical data. Based on the dimensionality reduction results, cluster analysis is performed to obtain clustering results for identifying provenance end members.
[0278] The model building unit 260 is used to establish a provenance discrimination model using clustering results or known provenance labels as target variables and mole count, feature parameters and distribution characteristics as input quantities, and to determine the provenance of sediments based on the provenance discrimination model.
[0279] Specifically, it also includes a visualization unit 270, which supports generating the following types of graphics:
[0280] Scatter plot: Supports scatter plots of any two geochemical variables, and allows setting conditions to highlight specific samples;
[0281] A-CN-K ternary diagram: automatically labels reference mineral points and CIA contour lines, showing weathering trends;
[0282] Rare earth element distribution chart: displays the chondrite-normalized rare earth element distribution patterns using logarithmic coordinates;
[0283] PCA biplot: Displays both sample scores and variable loadings simultaneously, and supports coloring by clustering results;
[0284] Box plot: Displays the distribution characteristics and outliers of each variable;
[0285] Correlation heatmap: Displays the correlation coefficient matrix between variables;
[0286] Confusion matrix: Displays the comparison between the prediction results of the provenance discrimination model and the true labels.
[0287] The third embodiment of the present invention also provides an electronic device, the electronic device comprising:
[0288] At least one processor; and,
[0289] A memory communicatively connected to the at least one processor; wherein,
[0290] The memory stores instructions that can be executed by the at least one processor to enable the at least one processor to perform the provenance determination method based on sediment geochemical data of the foregoing embodiments.
[0291] The fourth embodiment of the present invention also provides a non-transitory computer-readable storage medium storing computer instructions for causing the computer to execute the provenance determination method based on sediment geochemical data in the foregoing embodiments.
[0292] The fifth embodiment of the present invention also provides a computer program product, which includes a computer program stored on a non-transitory computer-readable storage medium. The computer program includes program instructions that, when executed by a computer, cause the computer to perform the provenance determination method based on sediment geochemical data in the foregoing embodiments.
[0293] The above description is merely a specific embodiment of this disclosure, but the scope of protection of this disclosure is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this disclosure should be included within the scope of protection of this disclosure. Therefore, the scope of protection of this disclosure should be determined by the scope of the claims.
Claims
1. A method for provenance determination based on geochemical data of sediments, characterized in that it includes the following steps: S1. Preprocess the geochemical data of the input sediment sample; S2. Based on the preprocessed geochemical data, convert the mass percentage content of the oxides of the major elements into moles to obtain the number of moles of the oxides of the major elements; S3. Based on the number of moles of the oxides, project the sediment sample into the A-CN-K ternary component space to obtain the distribution characteristics of the sediment sample in the ternary component space; S4. Process the rare earth element data of the sediment sample using the chondrite normalization method to obtain rare earth-related characteristic parameters; S5. Perform principal component analysis to reduce the dimensionality of the standardized geochemical data, and perform cluster analysis on the dimensionality reduction results to obtain clustering results for identifying the provenance end members; S6. Taking the clustering results or known provenance labels as target variables, and the number of moles, the characteristic parameters, and the distribution characteristics as input variables, use a machine learning classification algorithm to establish a provenance discrimination model, and determine the provenance of the sediment based on the provenance discrimination model.
2. The method for provenance determination based on geochemical data of sediments according to claim 1, characterized in that, In step S1, the preprocessing includes: (1) Processing of values below the detection limit: Identify the special marks of "<X" format values, "ND", and "BDL" in the geochemical data, and replace them with half of the detection limit value; (2) Filling of missing values: Fill the variables in the geochemical data with a missing rate lower than 20% with the median of the variable, and mark the variables with a missing rate higher than 20%; (3) Negative value correction: Replace the negative values in the geochemical data with zero or mark them as abnormal; (4) Normalization of major elements: Normalize the content of the oxides of the major elements proportionally to a total of 100%.
3. The method for provenance determination based on geochemical data of sediments according to claim 1, characterized in that, in step S2, the formula for converting the mass percentage content of the oxides of the major elements into moles is: $n_i = w_i / M_i$, where $n_i$ is the number of moles of the oxide of major element $i$, $w_i$ is the mass percentage of that oxide, and $M_i$ is the molecular weight of that oxide.
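4. The method for provenance determination based on geochemical data of sediments according to claim 3, characterized in that it also includes: based on the numbers of moles of the major element oxides $\mathrm{Al_2O_3}$, $\mathrm{CaO}$, $\mathrm{Na_2O}$, and $\mathrm{K_2O}$, calculating: the Chemical Index of Alteration, $\mathrm{CIA} = \dfrac{\mathrm{Al_2O_3}}{\mathrm{Al_2O_3} + \mathrm{CaO}^* + \mathrm{Na_2O} + \mathrm{K_2O}} \times 100$; the Chemical Index of Weathering, $\mathrm{CIW} = \dfrac{\mathrm{Al_2O_3}}{\mathrm{Al_2O_3} + \mathrm{CaO}^* + \mathrm{Na_2O}} \times 100$; and the Plagioclase Index of Alteration, $\mathrm{PIA} = \dfrac{\mathrm{Al_2O_3} - \mathrm{K_2O}}{\mathrm{Al_2O_3} + \mathrm{CaO}^* + \mathrm{Na_2O} - \mathrm{K_2O}} \times 100$; where $\mathrm{CaO}^*$ is the CaO in the silicate fraction.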
5. The method for provenance determination based on geochemical data of sediments according to claim 4, characterized in that, in step S3, A, CN, and K in the A-CN-K ternary component space represent the molar proportions of $\mathrm{Al_2O_3}$, $(\mathrm{CaO}^* + \mathrm{Na_2O})$, and $\mathrm{K_2O}$, respectively, each normalized to the sum of the three; the method then also includes: drawing CIA contour lines, the CIA contour lines being straight lines parallel to the CN-K base; and judging the weathering trend from the distribution of the sediment sample points in the ternary component space: if the sediment sample points are distributed along a trend line parallel to the A-CN side, a normal weathering trend is indicated; if the sediment sample points deviate toward the K vertex, the existence of potassium metasomatism is indicated.
6. The method for provenance determination based on geochemical data of sediments according to claim 4, characterized in that step S4 includes: using chondrite normalization to divide each rare earth element content by the content of the corresponding element in chondrite, $\mathrm{REE}_N = \mathrm{REE}_{\text{sample}} / \mathrm{REE}_{\text{chondrite}}$, where the chondrite standard values adopt the recommended values of Sun and McDonough (1989); calculating the degree of fractionation between light and heavy rare earths, $(\mathrm{La}/\mathrm{Yb})_N = \mathrm{La}_N / \mathrm{Yb}_N$; calculating the Eu anomaly, $\delta\mathrm{Eu} = \mathrm{Eu}_N / \sqrt{\mathrm{Sm}_N \times \mathrm{Gd}_N}$, where δEu < 1 is a negative anomaly and δEu > 1 is a positive anomaly; calculating the Ce anomaly; and calculating the total rare earth element content ΣREE and the ratio of light rare earths to heavy rare earths LREE / HREE.
7. The method for provenance determination based on geochemical data of sediments according to claim 1, characterized in that, Step S5 includes: Perform Z-score normalization on the relevant variables of the geochemical data participating in the analysis to obtain the standardized data; Principal component analysis (PCA) is used to reduce the dimensionality of standardized data. This includes: calculating the covariance matrix of the standardized data; solving for the eigenvalues and eigenvectors of the covariance matrix; sorting the eigenvalues by size and selecting the principal components with a cumulative variance contribution rate of 85% or higher; and calculating the score of the sediment sample in the principal component space based on the principal components. Cluster analysis is performed based on the selected principal components, including: determining the optimal number of clusters K using the silhouette coefficient method, with a search range of K∈[2, min(10, n / 2)]; performing K-Means clustering for each K value and calculating the silhouette coefficient; selecting the K value with the largest silhouette coefficient as the optimal number of clusters; and performing final clustering using the optimal K value to obtain the clustering results of the sediment samples.
8. The method for provenance determination based on geochemical data of sediments according to claim 6, characterized in that it also includes: using multi-indicator cross-validation to detect outliers and inconsistencies in the data, so as to evaluate the reliability of the data processing procedure and the provenance determination, including: outlier detection: detecting outliers for each variable in the geochemical data using the interquartile range method, where $Q_1$ and $Q_3$ are the first and third quartiles and $\mathrm{IQR} = Q_3 - Q_1$; major element sum test: checking whether the sum of major element oxides is within a reasonable range; contradictory weathering indicator detection: CIA-WIP contradiction: a contradiction is identified when CIA > 75 and WIP > 60; CIA-ICV contradiction: a contradiction is identified when CIA > 80 and ICV > 1, and the data quality needs to be checked; rare earth element distribution pattern verification: checking whether the rare earth element distribution curve is smooth, and checking the consistency between the Eu anomaly and CIA, issuing a warning if a strong positive Eu anomaly and a high CIA value occur simultaneously; and generating a data quality report that lists the detected anomalies and inconsistencies and their possible causes.
9. A provenance determination system based on geochemical data of sediments, characterized in that it includes: a preprocessing unit, used to preprocess the geochemical data of the input sediment sample; a mole number calculation unit, used to convert the mass percentage content of the oxides of the major elements into moles based on the preprocessed geochemical data, so as to obtain the number of moles of the oxides of the major elements; a projection unit, used to project the sediment sample onto the A-CN-K ternary component space according to the number of moles of the oxides, so as to obtain the distribution characteristics of the sediment sample in the ternary component space; a feature calculation unit, used to process the rare earth element data of the sediment sample using the chondrite normalization method to obtain rare earth-related characteristic parameters; a clustering unit, used to perform principal component analysis to reduce the dimensionality of the standardized geochemical data and, based on the dimensionality reduction results, to perform cluster analysis to obtain clustering results for identifying provenance end members; and a model building unit, used to establish a provenance discrimination model using the clustering results or known provenance labels as target variables and the number of moles, the characteristic parameters, and the distribution characteristics as input variables, and to determine the provenance of the sediment based on the provenance discrimination model.
10. The provenance determination system based on sediment geochemical data according to claim 9, characterized in that it also includes a visualization unit that supports generating the following types of graphics: scatter plot: supports scatter plots of any two geochemical variables, and allows setting conditions to highlight specific samples; A-CN-K ternary diagram: automatically labels reference mineral points and CIA contour lines, showing weathering trends; rare earth element distribution chart: displays the chondrite-normalized rare earth element distribution patterns using logarithmic coordinates; PCA biplot: displays both sample scores and variable loadings simultaneously, and supports coloring by clustering results; box plot: displays the distribution characteristics and outliers of each variable; correlation heatmap: displays the correlation coefficient matrix between variables; confusion matrix: displays the comparison between the prediction results of the provenance discrimination model and the true labels.