A method for tracing the origin of dairy products based on feature differences
By clustering mass spectrometry feature data and correcting each sample's boundary degree with its local density, a weight matrix and a class relationship matrix are constructed. This addresses the insufficient class discrimination of existing dairy product origin traceability methods and yields accurate and stable origin traceability.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- GANSU AGRI UNIV
- Filing Date
- 2026-05-15
- Publication Date
- 2026-06-12
AI Technical Summary
Existing methods for tracing the origin of dairy products rely on single physicochemical indicators or supervised learning, which are insufficient to fully reflect the intrinsic differences in metabolic composition. The dimensionality reduction results lack the ability to distinguish between categories, affecting the accuracy and stability of the traceability results.
The method clusters mass spectrometry feature data, calculates a boundary degree from intra-class and inter-class distances, corrects it with local density, constructs a weight matrix and a class relationship matrix to form a structure enhancement matrix, performs principal component analysis for dimensionality reduction, and establishes a traceability origin database.
It significantly improves the distinguishability of dairy product samples from different origins in a low-dimensional space, enhances intra-class compactness, improves the accuracy and stability of traceability, reduces dependence on labeled data, and improves the practicality and generalization ability of the method.
Figure CN122193364A_ABST
Description
Technical Field
[0001] This invention relates to the field of origin traceability technology, specifically to a method for tracing the origin of dairy products based on characteristic differences.
Background Technology
[0002] With the widespread production and distribution of products in different plateau regions, the quality differences caused by variations in origin have gradually become an important research topic in the field of food traceability and quality control. Current methods for origin traceability mostly rely on single physicochemical index analysis or supervised learning-based pattern recognition methods, such as using the content of certain characteristic compounds for discrimination, or training classification models using labeled dairy product samples. However, due to the high dimensionality, complexity, and significant regional environmental influences of metabolic composition, single or low-dimensional features are insufficient to fully reflect its inherent differences. Furthermore, supervised methods rely on large amounts of high-quality labeled data, which are costly to obtain and have limited generalization capabilities in practical applications.
[0003] In untargeted metabolomics analysis, although mass spectrometry can acquire high-dimensional metabolic feature data, existing data processing methods typically employ unsupervised dimensionality reduction techniques such as traditional principal component analysis. These methods focus only on the overall variance structure of the data and neglect the potential class relationships and boundary characteristics between dairy product samples. As a result, the dimensionality reduction results have insufficient class discrimination, making it difficult to effectively separate dairy product samples from different origins. Furthermore, existing methods do not characterize the boundary features and local density differences of dairy product samples and therefore fail to highlight the boundary samples that play a crucial role in classification, which affects the accuracy and stability of traceability results.
[0004] The information disclosed in the background section is only intended to enhance the understanding of the background of this disclosure, and therefore may include information that does not constitute prior art known to those skilled in the art.
Summary of the Invention
[0005] The purpose of this invention is to provide a method for tracing the origin of dairy products based on characteristic differences, so as to solve the problems mentioned in the background art.
[0006] To achieve the above objectives, the present invention provides the following technical solution: A method for tracing the origin of dairy products based on characteristic differences, comprising the following steps:
S1: Collect mass spectrometry feature data of each dairy product sample, and cluster the dairy product samples based on the mass spectrometry data to obtain the first analysis result;
S2: For any dairy product sample in the first analysis result, calculate its first boundary degree based on its intra-class distance and inter-class distance, calculate its local density from its neighboring dairy product samples, correct the first boundary degree based on the local density to obtain the second boundary degree, determine the boundary weight based on the second boundary degree, and set the weight matrix based on the boundary weights;
S3: For any two dairy product samples, determine their clustering relationship based on the clustering results, traverse the clustering relationships of all dairy product samples to construct a class relationship matrix, construct a boundary direction matrix based on the clustering results and boundary weights, and determine the structure enhancement matrix by combining the weight matrix, the class relationship matrix and the mass spectrometry feature data of each dairy product sample;
S4: Analyze the structure enhancement matrix by principal component analysis, and reduce the dimensionality of each dairy product sample based on the analysis results to obtain the final analytical data;
S5: Establish a traceability origin database and compare the final analytical data with the traceability origin database to trace the origin of each dairy product sample.
[0007] Furthermore, the mass spectrometry data includes the mass-to-charge ratio of each metabolomics product. The logic for clustering dairy product samples based on mass spectrometry data is as follows: construct a mass spectrometry feature row vector from the mass spectrometry feature data of each dairy product sample; determine the number of clusters K based on the elbow rule; randomly select the mass spectrometry feature row vectors of K dairy product samples as initial cluster centers; for any dairy product sample, calculate the Euclidean distance between its mass spectrometry feature row vector and each cluster center, and assign it to the cluster whose center has the smallest Euclidean distance; for any cluster, calculate the average value of all its dairy product samples in each mass spectrometry feature dimension, and update the cluster center of that cluster to the row vector formed by these average values; iterate until no dairy product sample is reassigned to a different cluster or no cluster center changes, and obtain the first analysis result. The first analysis result includes the cluster center of each cluster and the mass spectrometry feature row vector of each dairy product sample within each cluster, and for each dairy product sample, the cluster center of its own cluster is used as its first cluster center.
[0008] Furthermore, the logic for calculating the first boundary degree is as follows: For any dairy product sample in the first analysis result, the Euclidean distance between its mass spectrometry feature row vector and the cluster center of its own cluster is called its intra-cluster distance. The Euclidean distances between its mass spectrometry feature row vector and the cluster centers of all other clusters are calculated, and the minimum value is taken as its inter-cluster distance. The cluster center of the other cluster that attains this minimum distance is taken as its second cluster center. The intra-cluster distance is divided by the inter-cluster distance to obtain its first boundary degree.
[0009] Furthermore, the logic for determining the second boundary degree is as follows: a preset Euclidean distance threshold is set. For any dairy product sample, the other dairy product samples whose Euclidean distance to its mass spectrometry feature row vector is less than the Euclidean distance threshold are extracted as its neighboring dairy product samples. The number of neighboring dairy product samples of each dairy product sample is counted and normalized to serve as its local density. The density correction factor of the dairy product sample is calculated as the natural constant e raised to the power of the local density. The product of the density correction factor and the first boundary degree of the dairy product sample is used as its second boundary degree.
[0010] Furthermore, for any dairy product sample, its second boundary degree is analyzed using the sigmoid function to obtain the boundary weight of the dairy product sample. The boundary weight of each dairy product sample is used as the diagonal element of the weight matrix, and the remaining elements are padded with 0 to obtain the weight matrix.
[0011] Furthermore, if two dairy product samples belong to the same cluster, they are considered to be of the same class, and their class magnitude is set to 1. If two dairy product samples belong to different clusters, they are considered to be of different classes, and their class magnitude is set to -1. The class magnitude is used as an element to construct a class relationship matrix.
[0012] Furthermore, the logic for determining the structure enhancement matrix is as follows: For any dairy product sample, calculate the difference between its second cluster center and its first cluster center to obtain a category separation direction row vector. Each element in the category separation direction row vector is the category separation direction of the corresponding mass spectrometry feature for that dairy product sample. For any mass spectrometry feature, calculate the sum of the category separation directions of that feature over all dairy product samples to obtain the comprehensive category separation direction of that feature. Construct a row vector from the comprehensive category separation directions of all mass spectrometry features. Multiply this row vector by its transpose and then by the boundary weights to obtain the boundary direction matrix. The mass spectrometry feature row vectors of all dairy product samples are stacked to form the dairy product sample feature matrix. The transpose of the dairy product sample feature matrix is multiplied by the weight matrix, then by the class relationship matrix, then by the weight matrix again, and then by the dairy product sample feature matrix to obtain the class coordination relation matrix. The product of the preset scaling factor and the boundary direction matrix is added to the class coordination relation matrix to obtain the structure enhancement matrix.
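The matrix construction in the preceding paragraph can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than the applicant's implementation: the function name, the argument names, the default `alpha`, and the toy data are hypothetical. In particular, the text does not specify how the per-sample boundary weights multiply a feature-by-feature matrix, so the sketch assumes the simplest dimension-compatible reading, a scalar scaling by the sum of the boundary weights.

```python
import numpy as np

def structure_enhancement_matrix(X, labels, centers, second, w, alpha=1.0):
    """X: n x p feature matrix; labels/second: own and nearest-other cluster index
    per sample; centers: K x p; w: boundary weights; alpha: preset scaling factor."""
    labels = np.asarray(labels)
    second = np.asarray(second)
    w = np.asarray(w, dtype=float)
    W = np.diag(w)                                   # diagonal weight matrix
    R = np.where(labels[:, None] == labels[None, :], 1.0, -1.0)  # class relationship
    # Per-sample separation direction: second cluster center minus first.
    directions = centers[second] - centers[labels]   # n x p
    g = directions.sum(axis=0)                       # comprehensive direction, length p
    # ASSUMPTION: outer product of g with itself, scaled by the summed weights.
    D = np.outer(g, g) * w.sum()                     # boundary direction matrix, p x p
    M = X.T @ W @ R @ W @ X                          # class coordination relation matrix
    return M + alpha * D

# Toy example with hypothetical values: three samples, two features, two clusters.
X = np.array([[1.0, 0.0], [4.0, 0.0], [9.0, 0.0]])
centers = np.array([[2.5, 0.0], [9.0, 0.0]])
labels = np.array([0, 0, 1])
second = np.array([1, 1, 0])
w = np.array([0.6, 0.9, 0.7])
S = structure_enhancement_matrix(X, labels, centers, second, w)
```

Under these assumptions the result is a symmetric p x p matrix, which is what the later eigen-decomposition step needs.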
[0013] Furthermore, principal component analysis (PCA) is applied to the structure enhancement matrix to obtain a projection matrix. The projection matrix is then used to project the dairy product sample feature matrix to obtain a reduced-dimensional sample matrix. The row of the reduced-dimensional sample matrix corresponding to each dairy product sample is extracted as the main distinguishing feature vector of that dairy product sample.
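One plausible reading of this projection step is an eigen-decomposition of the structure enhancement matrix, keeping the leading eigenvectors as the projection matrix. The sketch below assumes that reading; the function name `project`, the parameter `n_components`, and the toy matrices are hypothetical. Because the class relationship matrix contains negative entries, the structure enhancement matrix may have negative eigenvalues, and the sketch assumes that the algebraically largest eigenvalues are the ones to keep.

```python
import numpy as np

def project(X, S, n_components=2):
    """Project the n x p feature matrix X onto the top eigenvectors of S (p x p)."""
    S = (S + S.T) / 2.0                    # guard against round-off asymmetry
    vals, vecs = np.linalg.eigh(S)         # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:n_components]
    P = vecs[:, order]                     # projection matrix, p x n_components
    return X @ P, P

# Toy example with hypothetical values: a diagonal S makes the result easy to check.
S = np.diag([3.0, 1.0, 2.0])
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
Y, P = project(X, S, n_components=2)
# Each row of Y is the main distinguishing feature vector of one sample.
```

Eigenvectors are only defined up to sign, so individual columns of `Y` may come out negated without changing the distances between samples.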
[0014] Furthermore, the traceability origin database consists of standard main distinguishing feature vectors for each origin. For any dairy product sample, the Euclidean distance between its main distinguishing feature vector and each standard main distinguishing feature vector is calculated. The standard main distinguishing feature vector with the minimum Euclidean distance is taken as the target vector, and the origin corresponding to the target vector is taken as the origin of the dairy product sample.
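The database lookup described here is a nearest-neighbor match on Euclidean distance, which can be sketched as below. The function name `trace_origin`, the dictionary layout, and the origin names and vectors are hypothetical placeholders; the text does not state how the standard vector of each origin is obtained, so the sketch simply takes them as given.

```python
import numpy as np

def trace_origin(y, reference):
    """Return the origin whose standard vector is closest to y, and that distance.
    reference: dict mapping origin name -> standard main distinguishing feature vector."""
    names = list(reference)
    ref = np.array([reference[name] for name in names], dtype=float)
    dists = np.linalg.norm(ref - np.asarray(y, dtype=float), axis=1)
    best = int(dists.argmin())
    return names[best], float(dists[best])

# Hypothetical database with two origins in a 2-D reduced space.
reference = {"origin_A": [1.0, 1.0], "origin_B": [5.0, 4.0]}
origin, distance = trace_origin([0.9, 1.1], reference)
```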
[0015] Compared with the prior art, the beneficial effects of the present invention are as follows: This invention constructs a dairy product sample category structure without prior labels by performing cluster analysis on mass spectrometry feature data. Furthermore, it builds a boundary degree index based on intra-class and inter-class distances, and corrects the boundary degree by incorporating local density, thereby achieving effective identification and weighted representation of key boundary dairy product samples. Based on this, by constructing a dairy product sample weight matrix and a class relationship matrix, the same-class and different-class relationships between dairy product samples are explicitly introduced into the feature modeling process. This ensures that the data processing reflects not only the overall distribution characteristics but also the category structure relationships between dairy product samples. This invention also constructs a category separation direction and forms a boundary direction matrix, which is then fused with the class coordination relation matrix to obtain a structure enhancement matrix. Based on this matrix, feature decomposition and dimensionality reduction are performed, thereby transforming the dimensionality reduction result from the traditional "maximum variance expression" to the "maximum category separation expression". As a result, the distinguishability between dairy product samples from different origins can be significantly improved in low-dimensional space, while enhancing intra-class compactness, achieving high accuracy and stability in origin traceability, reducing dependence on labeled data, and improving the practicality and generalization ability of the method.
Attached Figure Description
[0016] Figure 1 is a schematic diagram of the overall method flow of the present invention.
[0017] Figure 2 is a schematic diagram of the clustering results of the present invention.
Detailed Implementation
[0018] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to specific embodiments.
[0019] It should be noted that, unless otherwise defined, the technical or scientific terms used in this invention should have the ordinary meaning understood by one of ordinary skill in the art to which this invention pertains. The terms "first," "second," and similar terms used in this invention do not indicate any order, quantity, or importance, but are merely used to distinguish different components. Terms such as "comprising" or "including" mean that the element or object preceding the word encompasses the elements or objects listed following the word and their equivalents, without excluding other elements or objects. Terms such as "connected" or "linked" are not limited to physical or mechanical connections, but can include electrical connections, whether direct or indirect. Terms such as "upper," "lower," "left," and "right" are used only to indicate relative positional relationships; when the absolute position of the described object changes, the relative positional relationship may also change accordingly.
[0020] Example: Referring to Figures 1-2, the present invention provides a technical solution: A method for tracing the origin of dairy products based on characteristic differences, comprising the following steps:
S1: Collect mass spectrometry feature data of each dairy product sample, and cluster the dairy product samples based on the mass spectrometry data to obtain the first analysis result.
Furthermore, the mass spectrometry data includes the mass-to-charge ratio of each metabolomics product. Metabolomics products include: 2-linoleoylglycerol, L-carnitine, choline, daunorubicin, 5-aminovaleric acid betaine, 2-methylbutyryl L-carnitine, α-D-(+)-talose, melibiose, lysophosphatidylcholine 18:2, 1-oleoyl-sn-glycerol-3-phosphocholine, 1-palmitoyl-sn-glycerol-3-phosphocholine, acetylcarnitine, betaine, creatinine, lactose, 4-α-mannobiose, N-acetylmannosamine, 3-hydroxyacetaminophen, phosphocholine, adenine, 3,5-dimethoxy-α-methyl-4-propoxyphenylethylamine, tea saponin, oleamide, proline-tryptophan, nicotinamide, S-methyl-5'-thioadenosine, O-phosphotyrosine, histidine-proline, DL-proline, 2-oleoyl-1-palmitoyl-sn-glycerol-3-phosphocholine, etc.
[0021] The logic for clustering dairy product samples based on mass spectrometry data is as follows: construct a mass spectrometry feature row vector from the mass spectrometry feature data of each dairy product sample; determine the number of clusters K based on the elbow rule; randomly select the mass spectrometry feature row vectors of K dairy product samples as initial cluster centers; for any dairy product sample, calculate the Euclidean distance between its mass spectrometry feature row vector and each cluster center, and assign it to the cluster whose center has the smallest Euclidean distance; for any cluster, calculate the average value of all its dairy product samples in each mass spectrometry feature dimension, and update the cluster center of that cluster to the row vector formed by these average values; iterate until no dairy product sample is reassigned to a different cluster or no cluster center changes, and obtain the first analysis result. The first analysis result includes the cluster center of each cluster and the mass spectrometry feature row vector of each dairy product sample within each cluster, and for each dairy product sample, the cluster center of its own cluster is used as its first cluster center.
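The clustering procedure above is the standard K-Means iteration and can be sketched in a few lines of NumPy. This is a minimal illustration, not the applicant's implementation: the function name `kmeans`, the fixed random seed, the iteration cap `max_iter`, and the toy data are assumptions, and K is passed in directly rather than chosen by the elbow rule.

```python
import numpy as np

def kmeans(X, k, seed=0, max_iter=100):
    """Plain K-Means on the rows of X (n_samples x n_features)."""
    rng = np.random.default_rng(seed)
    # Randomly pick k distinct samples as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Euclidean distance from every sample to every cluster center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # no sample was reassigned, so the clustering has converged
        labels = new_labels
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old center if a cluster becomes empty
                centers[j] = members.mean(axis=0)
    return labels, centers

# Toy data (hypothetical): two well-separated groups of three samples each.
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
              [5.0, 5.1], [5.2, 5.0], [5.1, 5.2]])
labels, centers = kmeans(X, k=2)
first_centers = centers[labels]  # first cluster center of every sample
```

Here `labels` and `centers` together play the role of the first analysis result, and `first_centers` holds each sample's first cluster center.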
[0022] Determining the number of clusters K based on the elbow rule is an existing technique. First, a possible range of K values is set (e.g., from 1 to 10, or a wider range based on experience). For each candidate K value within the range, the K-Means clustering algorithm is run to calculate the sum of squared distances (SSE, also known as the sum of squared intra-cluster deviations) of all sample points to their respective cluster centers. As K gradually increases from 1, the SSE will continue to decrease due to the increase in the number of clusters and each sample being closer to its cluster center, but the rate of decrease will gradually flatten out from the initial sharp decrease. All candidate K values and their corresponding SSE values are plotted as a line graph, and the turning point where the downward trend changes from "steep" to "flat" is observed. This point, which resembles an elbow, is the optimal number of clusters K. This is because increasing K before this point can significantly reduce SSE (explaining a large amount of variation), while continuing to increase K after this point will only bring a small improvement in SSE (diminishing marginal returns), thus achieving a balance between model complexity and clustering effect.
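The SSE curve described above can be computed as in the following sketch. The text locates the elbow by inspecting the plotted curve; the `pick_elbow` helper is a swapped-in heuristic (largest second difference of the SSE values) used only to make the example self-contained, and all function names, the restart count, and the toy data are assumptions.

```python
import numpy as np

def kmeans_sse(X, k, seed=0, max_iter=100):
    """Run a basic K-Means and return the SSE (sum of squared distances to centers)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        updated = centers.copy()
        for j in range(k):
            if np.any(labels == j):
                updated[j] = X[labels == j].mean(axis=0)
        if np.allclose(updated, centers):
            break
        centers = updated
    return float(((X - centers[labels]) ** 2).sum())

def sse_curve(X, k_values, restarts=5):
    # Keep the lowest SSE over a few random restarts for each candidate K.
    return [min(kmeans_sse(X, k, seed=s) for s in range(restarts)) for k in k_values]

def pick_elbow(k_values, sse):
    """Heuristic elbow: the K where the SSE curve bends most sharply."""
    bend = np.diff(sse, n=2)  # bend[i] is the second difference at k_values[i + 1]
    return k_values[int(np.argmax(bend)) + 1]

# Toy data (hypothetical): three groups of three samples each.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [6.0, 0.0], [6.1, 0.0], [6.0, 0.1],
              [3.0, 5.0], [3.1, 5.0], [3.0, 5.1]])
k_values = [1, 2, 3, 4, 5]
sse = sse_curve(X, k_values)
k_elbow = pick_elbow(k_values, sse)
```

In practice the curve would still be plotted and inspected, since an automatic rule can disagree with a visual reading when the bend is gradual.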
[0023] Refer to Figure 2, which is a schematic diagram of the clustering results of the present invention.
[0024] It should be noted that, in Figure 2, MQ2, MQ1, MQ4, MQ3, LQ3, MQ6, LQ2, LQ4, HZ6, LQ5, MQ5, LQ1, LQ6, HZ1, HZ2, HZ3, HZ4, and HZ5 are the sample numbers of the dairy product samples.
[0025] S2: For any dairy product sample in the first analysis result, calculate its first boundary degree based on its intra-class distance and inter-class distance, calculate its local density from its neighboring dairy product samples, correct the first boundary degree based on the local density to obtain the second boundary degree, determine the boundary weight based on the second boundary degree, and set the weight matrix based on the boundary weights.
In the field of dairy product origin traceability, samples are typically characterized by high dimensionality, small sample size, and complex category distribution. Metabolic characteristics often overlap and transition between different origins. Principal component analysis (PCA), a commonly used dimensionality reduction method, usually achieves feature compression by extracting the directions with the largest overall variance. However, this method relies solely on the global statistical distribution of the samples and cannot perceive the potential category structure between samples. Without clustering or category division, PCA cannot distinguish the inherent grouping relationships between samples from different origins. The principal components it extracts reflect the directions of largest variance in the data, rather than the directions most conducive to category differentiation. This results in overlap between samples of different categories in the low-dimensional space, causing key discriminative information to be averaged out during dimensionality reduction or classification, thereby reducing the ability to distinguish between different origins. Therefore, it is necessary to cluster the dairy product samples and quantitatively characterize boundary positions in the category structure to highlight the key samples that contribute most to the classification.
[0026] Furthermore, the logic for calculating the first boundary degree is as follows: For any dairy product sample in the first analysis result, the Euclidean distance between its mass spectrometry feature row vector and the cluster center of its own cluster is called its intra-cluster distance. The Euclidean distances between its mass spectrometry feature row vector and the cluster centers of all other clusters are calculated, and the minimum value is taken as its inter-cluster distance. The cluster center of the other cluster that attains this minimum distance is taken as its second cluster center. The intra-cluster distance is divided by the inter-cluster distance to obtain its first boundary degree.
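The first boundary degree defined above can be sketched as follows, assuming the cluster labels and cluster centers are already available from the clustering step. The function name and the toy values are hypothetical, and the sketch does not handle the degenerate case of a zero inter-cluster distance.

```python
import numpy as np

def first_boundary_degree(X, labels, centers):
    """Return (first boundary degree, index of the second cluster center) per sample."""
    labels = np.asarray(labels)
    # Distance from every sample to every cluster center (n x K).
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    idx = np.arange(len(X))
    intra = dists[idx, labels]            # distance to the sample's own center
    others = dists.copy()
    others[idx, labels] = np.inf          # mask out the sample's own cluster
    second = others.argmin(axis=1)        # nearest other cluster
    inter = others[idx, second]           # distance to that cluster's center
    return intra / inter, second

# Toy example (hypothetical): two clusters on a line.
centers = np.array([[0.0, 0.0], [10.0, 0.0]])
X = np.array([[1.0, 0.0], [4.0, 0.0], [9.0, 0.0]])
labels = np.array([0, 0, 1])
b1, second = first_boundary_degree(X, labels, centers)
# b1 is [1/9, 4/6, 1/9]: the middle sample lies closest to the class boundary.
```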
[0027] Based on the above problems, this application constructs a first boundary degree defined as the ratio of the intra-class distance to the inter-class distance. The intra-class distance reflects the proximity of a sample to the center of its own class, while the inter-class distance describes the proximity of a sample to its nearest neighboring class, calculated as the minimum distance between the sample and the centers of the other classes. Taking the ratio of the intra-class distance to the inter-class distance removes the influence of differing scales, so that the index uniformly reflects the relative position of each sample within the class structure.
[0028] Specifically, a larger boundary degree indicates that the sample is far from the center of its own category, but relatively close to the centers of other categories, suggesting that it is located in the category boundary region and has strong category uncertainty and discriminative value. A smaller boundary degree indicates that the sample is closer to the internal region of its own category and its contribution to category division is relatively limited. By introducing this boundary degree index, we can effectively identify and enhance the expression of key boundary samples, thereby solving the technical problem that traditional methods cannot highlight boundary samples and lead to insufficient category separation ability. It also provides more discriminative weight information for subsequent structural modeling and dimensionality reduction analysis, ultimately achieving the technical effect of improving the accuracy and stability of distinguishing dairy product samples from different origins.
[0029] When using only the first boundary degree to characterize samples, this index essentially only reflects the relative distance between samples and different class centers. It can characterize whether a sample is closer to its own class or another class, but its analysis is based solely on geometric distance and does not consider the data distribution of the sample's location. In actual high-dimensional mass spectrometry data, sample distribution is often uneven, and some samples may be in sparse regions or even outliers. These samples, because they are far from their own class center and relatively close to a certain other class center, are prone to obtaining a large first boundary degree and thus being misclassified as boundary samples. However, such samples do not have stable class boundary properties, and their location lacks sufficient data support. If modeling is directly based on the first boundary degree, noise information is easily introduced into the subsequent analysis process, reducing the stability and reliability of the model.
[0030] Furthermore, the logic for determining the second boundary degree is as follows: a preset Euclidean distance threshold is set. For any dairy product sample, the other dairy product samples whose Euclidean distance to its mass spectrometry feature row vector is less than the Euclidean distance threshold are extracted as its neighboring dairy product samples. The number of neighboring dairy product samples of each dairy product sample is counted and normalized to serve as its local density. The density correction factor of the dairy product sample is calculated as the natural constant e raised to the power of the local density. The product of the density correction factor and the first boundary degree of the dairy product sample is used as its second boundary degree.
[0031] The normalization process can employ max-min normalization, which is an existing technology. Specifically, for each dairy product sample, the number of its neighboring dairy product samples is obtained. The maximum and minimum values of this number over all dairy product samples are obtained. The number of neighboring dairy product samples of each dairy product sample is then substituted into the max-min normalization formula to complete the normalization operation. The max-min normalization formula is a conventional technique in this field and will not be elaborated here.
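The neighbor counting, max-min normalization, and exponential correction described above can be sketched as below. The function name and toy data are hypothetical, and returning a density of zero when every sample has the same neighbor count is an assumption made to avoid dividing by zero; the text does not address that case.

```python
import numpy as np

def second_boundary_degree(X, b1, threshold):
    """Correct the first boundary degree b1 by the local density of each sample."""
    pair = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Count samples strictly closer than the threshold, excluding the sample itself.
    counts = (pair < threshold).sum(axis=1) - 1
    span = counts.max() - counts.min()
    # Max-min normalization; ASSUMPTION: density is 0 if all counts are equal.
    density = (counts - counts.min()) / span if span > 0 else np.zeros(len(X))
    factor = np.exp(density)              # density correction factor e**density
    return factor * np.asarray(b1, dtype=float), density

# Toy example (hypothetical): three nearby samples and one isolated sample.
X = np.array([[0.0], [1.0], [2.0], [10.0]])
b1 = np.full(4, 0.2)
b2, density = second_boundary_degree(X, b1, threshold=1.5)
# Neighbor counts are [1, 2, 1, 0], so density is [0.5, 1.0, 0.5, 0.0].
```

Since the correction factor lies between 1 and e, the isolated sample keeps its original boundary degree while well-supported samples are amplified.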
[0032] The core of introducing local density to correct the first boundary degree lies in further introducing distribution constraints on the basis of "distance relationship". By characterizing the degree of data aggregation in the neighborhood of the sample, the boundary degree is modulated so that the boundary degree not only reflects the relative position of the sample, but also reflects the reliability of that position.
[0033] The preset Euclidean distance threshold can be determined using existing techniques, specifically the upper quartile of the pairwise distances. All dairy product samples in the dataset are paired, and the Euclidean distance of each pair is calculated to obtain a set of distances reflecting the overall difference between samples. The distances in this set are sorted in ascending order, and the upper quartile (the 75th percentile) of the sorted result is used as the Euclidean distance threshold. Using the upper quartile as the threshold reflects the global difference scale of the dataset. Taking the 75th percentile of the distances of all sample pairs ensures that the vast majority of samples have a sufficient number of neighboring samples, thus stabilizing the calculation of local density. This avoids statistical failure caused by excessively small or empty neighborhoods, and also allows the subsequent correction factor based on local density to smoothly characterize the relative sparsity of each sample in the global context, thereby reasonably adjusting the second boundary degree. Specifically, samples located in high-density regions with large initial boundary degrees are typically situated at the true boundaries between different categories and possess strong discriminative value; their boundary degrees are preserved or enhanced after correction. Conversely, samples in low-density regions, even with large initial boundary degrees, are relatively suppressed due to a lack of neighborhood support, thus preventing outliers or noisy samples from being misidentified as key boundary samples. This correction mechanism effectively improves the accuracy and stability of boundary sample identification, reduces noise interference, and makes subsequent feature modeling and dimensionality reduction based on boundary information more consistent with the true category structure, thereby achieving the technical effect of improving the separation capability of samples from different categories and the overall analysis accuracy.
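The quartile-based threshold can be computed directly from the pairwise distances, as in this sketch. The function name and toy data are hypothetical; `np.percentile` uses linear interpolation by default, and other quantile conventions may give a slightly different value.

```python
import numpy as np

def distance_threshold(X, q=75):
    """Upper-quartile (75th percentile) of all pairwise Euclidean distances."""
    pair = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    upper = np.triu_indices(len(X), k=1)   # count each unordered pair once
    return float(np.percentile(pair[upper], q))

# Toy example (hypothetical): four samples on a line.
X = np.array([[0.0], [1.0], [3.0], [6.0]])
threshold = distance_threshold(X)
# Sorted pairwise distances are [1, 2, 3, 3, 5, 6]; the 75th percentile is 4.5.
```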
[0034] S3: For any two dairy samples, determine their clustering relationship based on the clustering results, traverse the clustering relationships of all dairy samples to construct a class relationship matrix, construct a boundary direction matrix based on the clustering results and boundary weights, and determine the structure enhancement matrix by combining the weight matrix, class relationship matrix and mass spectrometry feature data of each dairy sample. Furthermore, for any dairy product sample, its second boundary degree is analyzed using the sigmoid function to obtain the boundary weight of the dairy product sample. The boundary weight of each dairy product sample is used as the diagonal element of the weight matrix, and the remaining elements are padded with 0 to obtain the weight matrix.
[0035] After obtaining the second boundary degree after local density correction, this application maps it using the Sigmoid function to compress the unbounded or scale-inconsistent boundary degree into a finite interval, thereby obtaining stable and comparable boundary weights. Furthermore, the boundary weights of each sample are used as diagonal elements to construct a weight matrix, and the remaining elements are set to zero, thereby forming a point-by-point weighted structure for the samples. The purpose of this construction method is to transform the boundary information that originally belonged to the sample attributes into "structural weights" that can participate in matrix operations, so that different samples can be subjected to differentiated influence in a unified linear algebraic form during the subsequent feature modeling process.
[0036] Specifically, by embedding boundary weights into a diagonal weight matrix, the contribution of samples can be dynamically adjusted when combined with the original feature matrix and sample relationship matrix. This allows samples with larger boundary weights to occupy a higher proportion in the structural modeling process, thereby strengthening the information expression of the class boundary region. Meanwhile, the influence of samples with smaller boundary weights on the overall structure is weakened accordingly, reducing the interference of redundant intra-class samples on the model. Thus, this weight matrix not only achieves a prominent expression of key samples, but also allows boundary information to be naturally integrated into the subsequent calculation framework in matrix form, avoiding the inconsistency problem caused by directly performing unstructured superposition of scalar indicators.
[0037] The above method can effectively transfer the boundary recognition results to the subsequent feature space construction and dimensionality reduction process, so that the dimensionality reduction results are transformed from traditional overall statistical driving to structure driving dominated by key samples, thereby significantly improving the separation ability and boundary clarity between different categories, while enhancing the model's adaptability and stability to complex distributed data, and achieving a higher accuracy in dairy product origin discrimination.
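The sigmoid mapping and the diagonal weight matrix can be sketched as follows. The function name and the input values are hypothetical.

```python
import numpy as np

def boundary_weight_matrix(b2):
    """Map second boundary degrees to weights with a sigmoid and place them on a diagonal."""
    b2 = np.asarray(b2, dtype=float)
    w = 1.0 / (1.0 + np.exp(-b2))   # boundary weight of each sample
    return np.diag(w), w            # off-diagonal elements are zero

# Hypothetical second boundary degrees for three samples.
b2 = np.array([0.0, 1.0, 3.0])
W, w = boundary_weight_matrix(b2)
```

Because the second boundary degree is a product of non-negative quantities, the resulting weights fall in the interval [0.5, 1), so interior samples are down-weighted relative to boundary samples but never removed.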
[0038] After obtaining the clustering results of dairy product samples, if the clustering labels are used only as discrete identifiers, their information remains at the level of grouping results and is difficult to directly participate in subsequent feature modeling and dimensionality reduction analysis based on matrix operations. In existing technologies, dimensionality reduction methods usually rely on the statistical distribution of sample features and lack explicit characterization of the class relationships between samples. This results in the inherent consistency between samples of the same class not being strengthened, and the differences between samples of different classes not being effectively amplified, making it difficult to achieve adaptive modeling for specific clustering structures. Therefore, it is necessary to transform the clustering results from label information into a relational structure so as to explicitly introduce the constraints of the same class and different classes between samples in subsequent calculations.
[0039] Furthermore, if two dairy product samples belong to the same cluster, they are considered to be of the same class, and their class magnitude is set to 1. If two dairy product samples belong to different clusters, they are considered to be of different classes, and their class magnitude is set to -1. The class magnitude is used as an element to construct a class relationship matrix.
[0040] Based on this, this application constructs a class relation matrix to quantitatively represent the class relationship between any two samples: when two samples belong to the same cluster, their relation amplitude is set to 1, indicating that they should be close in the feature space; when two samples belong to different clusters, their relation amplitude is set to -1, indicating that they should be separated in the feature space. In this way, the originally discrete class labels are transformed into a relation matrix with positive and negative constraints, so that the similarity and difference between samples can participate in subsequent matrix operations in a unified numerical form. Mathematically, this construction method is equivalent to introducing a symmetrical similarity-opposition relation structure, which makes samples of the same class have a pulling effect on the feature space and samples of different classes have a pushing effect on the distance, thereby forming an optimization orientation that takes into account both intra-class compactness and inter-class separability in the overall modeling process.
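The construction of the class relation matrix from cluster labels can be sketched as follows. The function name `class_relation_matrix` and the example labels are hypothetical; the diagonal entries come out as 1 simply because a sample shares a cluster with itself, which the text does not address separately.

```python
import numpy as np

def class_relation_matrix(labels):
    """Return an n x n matrix with 1 for same-cluster pairs and -1 for different-cluster pairs."""
    labels = np.asarray(labels)
    same_cluster = labels[:, None] == labels[None, :]   # pairwise comparison via broadcasting
    return np.where(same_cluster, 1, -1)

# Hypothetical cluster labels for five samples.
R = class_relation_matrix([0, 0, 1, 2, 1])
```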
[0041] By constructing the aforementioned class relationship matrix, clustering results can be effectively embedded into the feature space modeling and dimensionality reduction process. This allows dimensionality reduction analysis to no longer rely solely on global statistical characteristics but to explicitly consider the class relationship constraints between samples, thus solving the technical problem of "disconnect between dimensionality reduction and clustering" in traditional methods. Furthermore, under the influence of this relationship matrix, the dimensionality-reduced feature space can achieve a distribution effect where similar samples are more concentrated and dissimilar samples are more dispersed, significantly improving the clarity of class boundaries and overall discriminative ability, providing reliable support for high-precision identification of dairy product origins.
[0042] After constructing a class relationship matrix based on clustering results, although it can characterize the relationship constraints between sample pairs, such constraints mainly remain at the sample relationship level and lack a clear characterization of which feature directions achieve separation. In existing technologies, traditional dimensionality reduction methods usually rely on overall statistical characteristics or simple inter-class difference measures, failing to model the directional competition relationship between specific categories. This results in a lack of clear separation paths even though there is a separation trend in the feature space, thus affecting the discrimination efficiency and stability of the dimensionality reduction results. Therefore, it is necessary to start from the sample level and construct a structure that can represent the direction information of category separation to make up for the problem of having only relationship constraints but lacking directional guidance.
[0043] Furthermore, the logic for determining the structure enhancement matrix is as follows: For any dairy product sample, calculate the difference between its second cluster center and its first cluster center to obtain a category separation direction row vector. Each element in the category separation direction row vector is the category separation direction of the corresponding mass spectrometry feature for that dairy product sample. For any mass spectrometry feature, calculate the sum of the category separation directions of that mass spectrometry feature over all dairy product samples to obtain the comprehensive category separation direction of that mass spectrometry feature. Construct a row vector from the comprehensive category separation directions of all mass spectrometry features. Multiply the transpose of this row vector by the row vector itself (an outer product, giving a square matrix whose order is the number of mass spectrometry features), and then weight the result by the boundary weights to obtain the boundary direction matrix.
In other words, this application constructs the category separation direction row vector of each dairy product sample as the difference between its nearest second cluster center and its own first cluster center. Each element in this vector corresponds to the directional contribution of one mass spectrometry feature to class differentiation, and characterizes which offset of the sample in the feature space is more conducive to differentiation. By accumulating the category separation directions of all samples on the same mass spectrometry feature, the comprehensive category separation direction of that feature is obtained, thereby aggregating the directional information at the sample level into a global directional expression at the feature level. On this basis, the comprehensive category separation direction vector is multiplied by its transpose to construct a directional collaborative relationship structure between features, and weighted by the boundary weights to obtain the boundary direction matrix, so that samples in the class boundary region carry a higher weight in directional modeling, thereby strengthening the characterization of the true separation direction.
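The boundary direction matrix can be sketched as follows. The text does not state exactly how per-sample boundary weights enter a feature-by-feature matrix, so this sketch assumes one possible reading: each sample's separation direction is scaled by its boundary weight before the directions are summed. The function name and all numeric values are hypothetical.

```python
import numpy as np

def boundary_direction_matrix(first_centers, second_centers, weights):
    """first_centers, second_centers: (n, p) arrays holding, per sample, its own
    cluster center and its nearest other cluster center; weights: (n,) boundary weights."""
    c1 = np.asarray(first_centers, dtype=float)
    c2 = np.asarray(second_centers, dtype=float)
    w = np.asarray(weights, dtype=float)
    directions = c2 - c1                        # per-sample category separation direction rows
    # Assumption: boundary weights are applied per sample before summing over samples.
    d = (w[:, None] * directions).sum(axis=0)   # comprehensive separation direction, length p
    return np.outer(d, d)                       # p x p outer product

# Three hypothetical samples with two features each.
D = boundary_direction_matrix(
    first_centers=[[0.0, 0.0], [0.0, 0.0], [4.0, 0.0]],
    second_centers=[[4.0, 0.0], [4.0, 0.0], [0.0, 0.0]],
    weights=[0.6, 0.9, 0.7],
)
```

Under a different reading, for example scaling the unweighted outer product by a single aggregate of the boundary weights, only the line computing `d` and the return value would change.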
[0044] The mass spectrometry feature row vectors of each dairy product sample are stacked to form a dairy product sample feature matrix. The transpose of the dairy product sample feature matrix is multiplied by the weight matrix, then multiplied by the class relation matrix, then multiplied by the weight matrix again, and then multiplied by the dairy product sample feature matrix again to obtain the class coordination relation matrix. The product of the preset scaling factor and the boundary direction matrix is added to the coordination relation matrix to obtain the structure enhancement matrix.
[0045] The structure enhancement matrix is specifically represented as follows:
$$M = X^{T} W R W X + \lambda D$$
where $M$ is the structure enhancement matrix; $X$ is the dairy product sample feature matrix, an $n \times p$ matrix, where $n$ is the number of dairy product samples and $p$ is the number of mass spectrometry features; the superscript $T$ denotes the matrix transpose; $W$ is the weight matrix, an $n \times n$ matrix; $R$ is the class relation matrix, an $n \times n$ matrix; $D$ is the boundary direction matrix, a $p \times p$ matrix; and $\lambda$ is the scaling factor. The parameter $\lambda$ adjusts the relative strength of the structural relation term and the directional constraint term in the overall matrix. It is essentially a balance coefficient, a constant, and can be set using existing techniques. This application employs the Frobenius norm method: the Frobenius norms of $X^{T} W R W X$ and $D$ are calculated, and the ratio of the two Frobenius norms is set as $\lambda$.
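The assembly of the structure enhancement matrix can be sketched as follows, writing X for the sample feature matrix, W for the diagonal weight matrix, R for the class relation matrix and D for the boundary direction matrix. The sketch assumes the scaling factor is the Frobenius norm of the coordination term divided by that of D, so that the two terms end up with equal norm; the direction of the ratio is an assumption, as are the function name and the numeric values.

```python
import numpy as np

def structure_enhancement_matrix(X, W, R, D):
    """Combine the class coordination relation matrix with the scaled boundary direction matrix."""
    X = np.asarray(X, dtype=float)
    D = np.asarray(D, dtype=float)
    A = X.T @ W @ R @ W @ X                     # class coordination relation matrix, p x p
    norm_d = np.linalg.norm(D, "fro")
    # Assumption: scaling factor = ||A||_F / ||D||_F (falls back to 0 if D is all zeros).
    lam = np.linalg.norm(A, "fro") / norm_d if norm_d > 0 else 0.0
    return A + lam * D, lam

# Hypothetical data: four samples, two features, two clusters.
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
labels = np.array([0, 0, 1, 1])
W = np.diag([0.60, 0.70, 0.80, 0.65])
R = np.where(labels[:, None] == labels[None, :], 1, -1)
D = np.outer([1.0, 0.5], [1.0, 0.5])
M, lam = structure_enhancement_matrix(X, W, R, D)
```

Since W is diagonal and both R and D are symmetric, the resulting matrix is symmetric, which is what a subsequent eigen-decomposition would rely on.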
[0046] It should be noted that a standard PCA analysis first calculates the sample covariance matrix. For a column-centered feature matrix $X$ with $n$ samples, the usual calculation is $Z = \frac{1}{n-1} X^{T} X$, where $Z$ is the covariance matrix. In the subsequent PCA analysis, this application uses the structure enhancement matrix instead of the covariance matrix; one important function of the structure enhancement matrix is to take the place of the covariance matrix while also carrying the boundary weights.
S4: The structure enhancement matrix is analyzed based on principal component analysis, and the dimensionality of each dairy product sample is reduced based on the analysis results to obtain the final analytical data.
In traditional principal component analysis, the covariance matrix is usually constructed from the sample feature matrix to characterize the statistical correlation between different features. Its essence is to reflect the variance structure of the data in the overall distribution. However, this construction relies only on the numerical fluctuation of the samples and does not introduce any category-related structural constraints. The purpose of constructing a structure enhancement matrix to replace the covariance matrix in PCA analysis is to map structural information at the sample level to relational structures at the feature level. Within it, the weight matrix $W$ introduces the boundary weights, so that key samples at the class boundaries have a higher proportion in the overall modeling; the class relation matrix $R$ characterizes the similarity and dissimilarity relationships between samples, forming a mechanism of "aggregation of similar samples and separation of dissimilar samples" during the calculation; and left- and right-multiplying by $X^{T}$ and $X$ maps from the sample space to the feature space, so that the final matrix reflects which feature combinations contribute to category differentiation, facilitating further source tracing. The direction term $\lambda D$, that is, the boundary direction matrix $D$ scaled by the factor $\lambda$, is then superimposed on this matrix to introduce category separation direction information, so that the matrix not only contains relational constraints but also has a clear separation orientation.
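The "aggregation and separation" mechanism can be made explicit by expanding the coordination term element by element. In the sketch below, X is the n x p sample feature matrix with rows x_i, W is the diagonal weight matrix with entries w_i, and R_ij is 1 for same-cluster pairs and -1 otherwise; the p x 1 projection direction v is introduced here purely for illustration.

```latex
\[
X^{T} W R W X \;=\; \sum_{i=1}^{n}\sum_{j=1}^{n} w_i\, w_j\, R_{ij}\; x_i^{T} x_j
\]
\[
v^{T}\!\left(X^{T} W R W X\right) v \;=\; \sum_{i=1}^{n}\sum_{j=1}^{n} w_i\, w_j\, R_{ij}\,(x_i v)(x_j v)
\]
```

Read this way, a same-cluster pair raises the quadratic form when its two projections share a sign, a different-cluster pair raises it when its projections have opposite signs, and each pair's contribution is scaled by the product of the two boundary weights.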
[0047] Compared to the traditional covariance matrix, this construction method has significant advantages. First, it no longer only describes the linear correlation between features, but also reflects the contribution of features to class distinction, making the dimensionality reduction process more consistent with the actual classification goal. Second, by introducing boundary weights, it can highlight key samples and suppress redundant samples, thereby avoiding the averaging of important discriminative information. Third, through the role of the class relation matrix, it achieves the shrinkage of similar samples in the feature space and the separation between dissimilar samples, enhancing the expression of class structure in the low-dimensional space. Finally, by supplementing the direction matrix, the dimensionality reduction result has a clear separation direction, rather than relying solely on the principle of maximizing variance, thereby improving the stability and discriminative ability of the model.
[0048] Furthermore, PCA is applied to the structure enhancement matrix to obtain a projection matrix. The projection matrix is then used to project the dairy product sample feature matrix to obtain a reduced-dimensional sample matrix. The row of the reduced-dimensional sample matrix corresponding to each dairy product sample is taken as the main distinguishing feature vector of that dairy product sample.
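The projection step can be sketched as an eigen-decomposition of the structure enhancement matrix followed by a matrix product. The sketch assumes the projection matrix is formed from the eigenvectors with the k largest eigenvalues, as in ordinary PCA; the text does not state how k is chosen or how negative eigenvalues are treated. Function and variable names and all values are hypothetical.

```python
import numpy as np

def project_with_structure_matrix(X, M, k):
    """Project the n x p feature matrix X onto k eigenvectors of the symmetric p x p matrix M."""
    X = np.asarray(X, dtype=float)
    M = np.asarray(M, dtype=float)
    M = (M + M.T) / 2.0                        # remove round-off asymmetry before eigh
    eigvals, eigvecs = np.linalg.eigh(M)       # eigenvalues returned in ascending order
    # Assumption: keep the k algebraically largest eigenvalues.
    order = np.argsort(eigvals)[::-1][:k]
    P = eigvecs[:, order]                      # p x k projection matrix
    return X @ P, P                            # rows of X @ P are the reduced feature vectors

# Hypothetical inputs: four samples, three features, a symmetric 3 x 3 matrix.
X_demo = np.array([[1.0, 2.0, 0.5], [1.2, 1.9, 0.4], [5.0, 8.0, 0.6], [5.5, 8.5, 0.7]])
M_demo = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 0.0], [0.0, 0.0, -1.0]])
Y, P = project_with_structure_matrix(X_demo, M_demo, k=2)
```

Unlike a covariance matrix, a matrix built from a relation matrix with negative entries need not be positive semi-definite, so some eigenvalues may be negative; ranking by algebraic value is one reasonable choice, not something the text prescribes.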
[0049] Analyzing the structure enhancement matrix based on principal component analysis and reducing the dimensionality of each dairy product sample based on the analysis results is existing technology. PCA is a mature existing technology; for the specific implementation process, refer to the article "PCA-based OPGW optical cable health assessment method" published in the journal "Optical Communication Research", 2022, No. 2, which is not elaborated here.
[0050] S5: Establish a traceability origin database and compare the final analytical data with the traceability origin database to trace the origin of each dairy product sample.
[0051] Furthermore, the traceability origin database consists of standard main distinguishing feature vectors for each origin. For any dairy product sample, the Euclidean distance between the main distinguishing feature vector of the dairy product sample and each standard main distinguishing feature vector is calculated. The standard main distinguishing feature vector corresponding to the minimum Euclidean distance is taken as the target vector, and the origin corresponding to the target vector is taken as the traced origin of the dairy product sample.
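The nearest-vector comparison can be sketched as follows. The database is represented here as a dictionary from origin name to standard vector, which is an assumption about storage only; the origin names, function name and values are hypothetical.

```python
import numpy as np

def trace_origin(sample_vector, database):
    """Return the origin whose standard vector is closest in Euclidean distance, and that distance."""
    origins = list(database)
    standards = np.array([database[o] for o in origins], dtype=float)
    sample = np.asarray(sample_vector, dtype=float)
    distances = np.linalg.norm(standards - sample, axis=1)   # one distance per origin
    best = int(np.argmin(distances))
    return origins[best], float(distances[best])

# Hypothetical two-origin database in a two-dimensional reduced space.
database = {"origin_A": [0.0, 0.0], "origin_B": [3.0, 4.0]}
origin, distance = trace_origin([2.5, 3.5], database)
```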
[0052] Establishing a traceability origin database is an existing technology. In each origin to be traced, an equal number of dairy product samples can be taken. The main distinguishing feature vector of each dairy product sample can be calculated by the method in steps S1 to S4. For the main distinguishing feature vectors of dairy product samples from the same origin, their average value can be calculated as the standard main distinguishing feature vector of that origin. The above formulas are all dimensionless calculations. The formulas are obtained by software simulation on a large amount of collected data so as to best approximate real conditions. The preset parameters in the formulas are set by those skilled in the art according to the actual situation.
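Building the database by per-origin averaging can be sketched as follows. The function name, origin labels and vectors are hypothetical, and the dictionary layout is only one possible way to store the standard vectors.

```python
import numpy as np

def build_origin_database(vectors, origins):
    """Average the main distinguishing feature vectors of the samples from each origin."""
    vectors = np.asarray(vectors, dtype=float)
    origins = np.asarray(origins)
    return {str(o): vectors[origins == o].mean(axis=0) for o in np.unique(origins)}

# Two hypothetical origins with two reference samples each.
reference_db = build_origin_database(
    vectors=[[1.0, 2.0], [3.0, 4.0], [10.0, 10.0], [12.0, 14.0]],
    origins=["A", "A", "B", "B"],
)
```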
[0053] The above embodiments can be implemented, in whole or in part, by software, hardware, firmware, or any other combination thereof. When implemented in software, the above embodiments can be implemented, in whole or in part, as a computer program product. Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented by electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution.
[0054] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment, depending on actual needs.
[0055] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any changes or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application.
Claims
1. A method for tracing the origin of dairy products based on characteristic differences, characterized in that the specific steps include:
S1: Collect mass spectrometry feature data of each dairy product sample, and cluster the dairy product samples based on the mass spectrometry data to obtain a first analysis result;
S2: For any dairy product sample in the first analysis result, calculate its first boundary degree based on its intra-class distance and inter-class distance, calculate its local density in combination with its neighboring dairy product samples, correct the first boundary degree based on the local density to obtain a second boundary degree, determine a boundary weight based on the second boundary degree, and set a weight matrix based on the boundary weights;
S3: For any two dairy product samples, determine their clustering relationship based on the clustering results, traverse the clustering relationships of all dairy product samples to construct a class relationship matrix, construct a boundary direction matrix based on the clustering results and boundary weights, and determine a structure enhancement matrix by combining the weight matrix, the class relationship matrix and the mass spectrometry feature data of each dairy product sample;
S4: Analyze the structure enhancement matrix based on principal component analysis, and reduce the dimensionality of each dairy product sample based on the analysis results to obtain final analytical data;
S5: Establish a traceability origin database and compare the final analytical data with the traceability origin database to trace the origin of each dairy product sample.
2. The method according to claim 1, characterized in that the mass spectrometry data includes the mass-to-charge ratio of each metabolite; the logic for clustering dairy product samples based on mass spectrometry data is as follows: construct a mass spectrometry feature row vector from the mass spectrometry feature data of each dairy product sample; determine the number of clusters K based on the elbow method; randomly select the mass spectrometry feature row vectors of K dairy product samples as initial cluster centers; for any dairy product sample, calculate the Euclidean distance between its mass spectrometry feature row vector and each cluster center, and assign it to the cluster whose cluster center has the smallest Euclidean distance; for any cluster, calculate the average value of all its dairy product samples in each mass spectrometry feature dimension, and take the row vector formed by these average values as the updated cluster center of that cluster; iterate until no dairy product sample is reassigned to a different cluster or no cluster center changes, to obtain the first analysis result; the first analysis result includes the cluster center of each cluster and the mass spectrometry feature row vector of each dairy product sample within each cluster, and for each dairy product sample, the cluster center of its own cluster is used as its first cluster center.
3. The method according to claim 2, characterized in that the logic for calculating the first boundary degree is as follows: For any dairy product sample in the first analysis result, the Euclidean distance between the mass spectrometry feature row vector of the dairy product sample and the cluster center of its own cluster is called its intra-cluster distance. The Euclidean distances between the mass spectrometry feature row vector of the dairy product sample and the cluster centers of the other clusters are calculated, and the minimum value is taken as its inter-cluster distance. Among the other clusters, the cluster center with the smallest Euclidean distance to its mass spectrometry feature row vector is extracted as its second cluster center. The intra-cluster distance is divided by the inter-cluster distance to obtain its first boundary degree.
4. The method according to claim 3, characterized in that the logic for determining the second boundary degree is as follows: a Euclidean distance threshold is preset. For any dairy product sample, other dairy product samples whose Euclidean distance to its mass spectrometry feature row vector is less than the Euclidean distance threshold are extracted as its neighboring dairy product samples. The number of neighboring dairy product samples of each dairy product sample is counted and normalized to serve as its local density. The density correction factor of the dairy product sample is calculated as the natural constant e raised to the power of its local density. The product of the density correction factor of the dairy product sample and its first boundary degree is used as its second boundary degree.
5. The method according to claim 4, characterized in that, For any dairy product sample, its second boundary degree is analyzed using the sigmoid function to obtain the boundary weight of the dairy product sample. The boundary weight of each dairy product sample is used as the diagonal element of the weight matrix, and the remaining elements are padded with 0 to obtain the weight matrix.
6. The method according to claim 2, characterized in that, If two dairy product samples belong to the same cluster, they are considered to be of the same class, and their class magnitude is set to 1. If two dairy product samples belong to different clusters, they are considered to be of different classes, and their class magnitude is set to -1. The class magnitude is used as an element to construct a class relationship matrix.
7. The method according to claim 3, characterized in that, The logic for determining the structure enhancement matrix is as follows: For any dairy product sample, calculate the difference between its second cluster center and its first cluster center to obtain a category separation direction row vector. Each element in the category separation direction row vector is the category separation direction of each mass spectrometry feature in the dairy product sample. For any mass spectrometry feature, calculate the sum of the category separation directions of that mass spectrometry feature in all dairy product samples to obtain the comprehensive category separation direction of that mass spectrometry feature. Construct a row vector from the comprehensive category separation directions of all mass spectrometry features. Multiply this row vector by its transpose and then by the boundary weights to obtain the boundary direction matrix. The mass spectrometry feature row vectors of each dairy product sample are stacked to form a dairy product sample feature matrix. The transpose of the dairy product sample feature matrix is multiplied by the weight matrix, then multiplied by the class relation matrix, then multiplied by the weight matrix again, and then multiplied by the dairy product sample feature matrix again to obtain the class coordination relation matrix. The product of the preset scaling factor and the boundary direction matrix is added to the coordination relation matrix to obtain the structure enhancement matrix.
8. The method according to claim 7, characterized in that, PCA is used to reduce the dimensionality of the structure enhancement matrix to obtain a projection matrix. The projection matrix is then used to project the feature matrix of the dairy product samples to obtain a reduced-dimensional sample matrix. The feature vector of each dairy product sample in the reduced-dimensional sample matrix is then extracted from the reduced-dimensional sample matrix as the main distinguishing feature vector of the corresponding dairy product sample.
9. The method according to claim 8, characterized in that the traceability origin database consists of standard main distinguishing feature vectors for each origin. For any dairy product sample, the Euclidean distance between the main distinguishing feature vector of the dairy product sample and each standard main distinguishing feature vector is calculated. The standard main distinguishing feature vector corresponding to the minimum Euclidean distance is taken as the target vector, and the origin corresponding to the target vector is taken as the traced origin of the dairy product sample.
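For illustration only, the boundary degree calculations recited in claims 3 and 4 can be sketched as follows, given cluster labels and cluster centers from a prior clustering step. The normalization of the neighbor count is not specified in the claims, so division by the largest count is assumed here; the function name, the threshold value and the data are hypothetical.

```python
import numpy as np

def boundary_degrees(X, labels, centers, threshold):
    """Return first boundary degree, second boundary degree and second-cluster index per sample."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    centers = np.asarray(centers, dtype=float)
    n = X.shape[0]
    rows = np.arange(n)
    # Distance from every sample to every cluster center, shape (n, K).
    to_centers = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    intra = to_centers[rows, labels]             # distance to own (first) cluster center
    others = to_centers.copy()
    others[rows, labels] = np.inf                # mask each sample's own cluster
    second_idx = others.argmin(axis=1)           # index of the nearest other cluster
    inter = others.min(axis=1)
    first_degree = intra / inter
    # Neighbors: other samples strictly closer than the preset threshold.
    pairwise = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    counts = (pairwise < threshold).sum(axis=1) - 1   # subtract the sample itself
    # Assumption: normalize counts by the maximum count.
    density = counts / counts.max() if counts.max() > 0 else np.zeros(n)
    second_degree = np.exp(density) * first_degree    # e ** density times first degree
    return first_degree, second_degree, second_idx

# Four hypothetical samples on a line, two clusters.
first, second, second_idx = boundary_degrees(
    X=[[0.0, 0.0], [1.0, 0.0], [4.0, 0.0], [5.0, 0.0]],
    labels=[0, 0, 1, 1],
    centers=[[0.5, 0.0], [4.5, 0.0]],
    threshold=3.5,
)
```

In this toy data the two inner samples lie closer to the opposite cluster and have more neighbors within the threshold, so both their first and second boundary degrees exceed those of the outer samples.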