A deep learning-based food ingredient intelligent detection method and system

By combining the improved ConvNeXt V2 network with joint modeling of hyperspectral and RGB images, the problems of long detection cycle, high cost and insufficient robustness of existing food component detection methods are solved, realizing intelligent, accurate detection and stable identification of food components.

CN122200631A · Pending · Publication Date: 2026-06-12 · HAINAN SHENGJUE INTELLIGENT TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: HAINAN SHENGJUE INTELLIGENT TECHNOLOGY CO LTD
Filing Date: 2026-03-20
Publication Date: 2026-06-12

Application Information

IPC: G06V20/68; G06V10/82; G06N3/0455; G06V10/44; G06V10/77; G06V10/80; G06N3/045; G06N3/0464

AI Tagging

Application Domain

Character and pattern recognition; Biological models

Technical Efficacy Phrases

Improve accuracy; Improve detection accuracy

AI Technical Summary

Technical Problem

Existing food component detection methods suffer from long detection cycles, high costs, sensitivity to light conditions, and insufficient robustness, making it difficult to meet the real-time and batch detection needs of production lines. Furthermore, component identification confusion or misjudgment frequently occurs in complex environments.

Method used

An improved ConvNeXt V2 network was used to construct a food component detection model. Hyperspectral imaging and RGB visible light images were combined for joint modeling and prediction. Intelligent detection of food components was achieved through a hyperspectral backbone encoder, an RGB auxiliary encoder, and a spectral-appearance feature fusion module.

Benefits of technology

It achieves non-destructive testing of food components, high detection accuracy, and strong robustness. It can stably and accurately identify food components in complex environments, has the ability to optimize the closed-loop testing process, and reduces the cost of manual testing.

✦ Generated by Eureka AI based on patent content.

Abstract

The application discloses a deep learning-based intelligent detection method and system for food components, including: step one, collecting food image sample pairs of the food to be tested; step two, performing image preprocessing; step three, constructing spectral and appearance features to generate an initial spectral feature tensor and an initial appearance feature tensor; step four, constructing a food component detection model based on an improved ConvNeXt V2 network, and inputting the initial spectral feature tensor and the initial appearance feature tensor into the model for joint feature modeling and prediction inference to obtain a component prediction feature vector; step five, performing a decoding operation on the component prediction feature vector; step six, judging the compliance and consistency of the food components and determining the risk level; step seven, updating the version of the food component detection model. By using the improved ConvNeXt V2 network, the application improves the intelligence and accuracy of food component detection.

Description

Technical Field

[0001] This invention relates to the field of food safety testing and intelligent sensing technology, and in particular to a method and system for intelligent detection of food components based on deep learning.

Background Technology

[0002] With the increasing scale of food production and stricter food safety regulations, rapid, non-destructive, and intelligent detection technologies for food components have received widespread attention. Existing methods for food component detection mainly rely on chemical analysis, physicochemical experiments, or single optical imaging for qualitative or quantitative analysis. However, these methods commonly suffer from the following problems in practical applications: Traditional chemical detection methods typically require complex pretreatment processes and specialized experimental equipment, resulting in long detection cycles, high costs, and difficulty in meeting the real-time and batch testing needs of production lines. Detection methods based on single visible light images or single spectral information are highly sensitive to food surface morphology, lighting conditions, and noise interference. In complex processing environments, they are prone to feature instability and insufficient robustness, making it difficult to guarantee the accuracy and reliability of component content estimation. Furthermore, most existing methods fail to fully integrate the spectral and appearance structure information of food during the modeling process, resulting in insufficient characterization of differences in spectral response and spatial texture distribution among different components. This can easily lead to component identification confusion or misjudgment in the presence of formulation differences, batch fluctuations, or adulteration interference.

[0003] Therefore, how to provide a method and system for intelligent detection of food components based on deep learning is a problem that urgently needs to be solved by those skilled in the art.

Summary of the Invention

[0004] One objective of this invention is to propose a deep learning-based intelligent detection method and system for food components. This invention details the construction of a food component detection model based on an improved ConvNeXt V2 network, and the joint modeling and predictive inference of the spectral and appearance characteristics of the food to be detected. This enables intelligent detection and risk assessment of the content and distribution of food components, possessing advantages such as non-destructive detection process, high detection accuracy, and strong robustness.

[0005] A method for intelligent detection of food components based on deep learning according to an embodiment of the present invention includes the following steps:
Step 1: Collect food image sample pairs of the food to be tested;
Step 2: Perform image preprocessing on the food image sample pairs to generate standard image sample pairs;
Step 3: Construct spectral and appearance features from the standard image sample pairs to generate an initial spectral feature tensor and an initial appearance feature tensor;
Step 4: Construct a food component detection model based on the improved ConvNeXt V2 network, and input the initial spectral feature tensor and the initial appearance feature tensor into the food component detection model for joint feature modeling and prediction inference to obtain the component prediction feature vector. The food component detection model includes a hyperspectral backbone encoder, an RGB auxiliary encoder, a spectral-appearance feature fusion module, and a component prediction module;
Step 5: Perform a decoding operation on the component prediction feature vector to obtain the component content prediction vector and the component composition distribution vector of the food to be tested;
Step 6: Based on the component content prediction vector and the component composition distribution vector, perform a food component compliance and consistency determination, determine the risk level, and generate a risk level label for the food to be tested;
Step 7: Collect the re-tested component labeling data of the food to be tested, and update the food component detection model using the backpropagation method.

[0006] Optionally, step one specifically includes: Obtain the sample identification information of the food to be tested; Hyperspectral imaging images of the food to be tested are acquired using a hyperspectral imaging acquisition device, and RGB visible light images of the food to be tested are acquired using a visible light image acquisition device. Based on the sample identification information, the hyperspectral imaging image of the food to be tested is matched with the RGB visible light image to form a food image sample pair.

[0007] Optionally, step two specifically includes: The image preprocessing includes noise suppression, resolution resampling, intensity normalization, and spatial registration. Noise suppression is performed on the hyperspectral imaging image and the RGB visible light image using median filtering to obtain a denoised hyperspectral image and a denoised RGB image. The denoised hyperspectral image and the denoised RGB image are resampled to the target resolution using bilinear interpolation to obtain a resampled hyperspectral image and a resampled RGB image. Intensity normalization applies minimum-maximum normalization to the intensity value of each pixel in the resampled hyperspectral image and the resampled RGB image to obtain the standard hyperspectral image and the normalized RGB image. Spatial registration uses the standard hyperspectral image as the reference image and maps the normalized RGB image into the spatial coordinate system of the standard hyperspectral image through a geometric transformation to obtain the standard RGB image. The standard hyperspectral image and the standard RGB image are combined to form a standard image sample pair.

[0008] Optionally, step three specifically includes: Based on the standard hyperspectral image, the pixel intensity values of each hyperspectral band are extracted at each pixel coordinate and arranged in the order of the spectral band index to obtain the spectral band vector. The spectral band vector is compressed and mapped by a linear mapping and the GELU activation function to obtain the spectral embedding vector. The spectral embedding vectors are stacked according to their spatial positions to obtain the initial spectral feature tensor; The standard RGB image is convolved position by position in the spatial dimension using a sliding window method through 3×3 convolution to generate the initial appearance feature tensor.

[0009] Optionally, step four specifically includes: The initial spectral feature tensor is input into the hyperspectral backbone encoder to generate the hyperspectral feature tensor. The initial appearance feature tensor is input into the RGB auxiliary encoder to generate the RGB visual feature tensor. The hyperspectral feature tensor and the RGB visual feature tensor are input into the spectral-appearance feature fusion module for attention operations to obtain the food component fusion feature tensor, specifically: Define a feature mapping space, and map the hyperspectral feature tensor to the feature mapping space through a trainable query mapping matrix to obtain the hyperspectral query tensor; The RGB visual feature tensor is mapped to the feature mapping space through trainable key mapping matrix and value mapping matrix respectively, to obtain the RGB key tensor and RGB value tensor; Flatten the hyperspectral query tensor, RGB key tensor, and RGB value tensor into matrix form in the spatial dimension to obtain the hyperspectral query matrix, RGB key matrix, and RGB value matrix. Multiply the hyperspectral query matrix with the transpose of the RGB key matrix, divide by the square root of the dimension of the feature mapping space, and then normalize using the Softmax function to generate the food component attention weight matrix. Based on the food component attention weight matrix, the RGB value matrix is weighted and summed to obtain the food component fusion feature matrix. The food component fusion feature matrix is then reshaped into a tensor form to obtain the food component fusion feature tensor. In the component prediction module, the food component fusion feature tensor is subjected to global average pooling to obtain the food component fusion feature vector; the food component fusion feature vector is then subjected to linear mapping and GELU activation to obtain the component prediction feature vector.
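The attention operation described in [0009] can be sketched as follows. This is a minimal illustration in plain Python, not the patent's implementation: the function names `matmul`, `softmax`, and `cross_attention` are hypothetical, the feature tensors are assumed to be already flattened into token matrices (one row per spatial position), and identity matrices stand in for the trainable query, key, and value mapping matrices.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable Softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(hsi_tokens, rgb_tokens, w_q, w_k, w_v):
    """Hyperspectral tokens supply the queries; RGB tokens supply keys and values."""
    query = matmul(hsi_tokens, w_q)
    key = matmul(rgb_tokens, w_k)
    value = matmul(rgb_tokens, w_v)
    dim = len(query[0])  # dimension of the feature mapping space
    key_t = [list(col) for col in zip(*key)]
    scores = matmul(query, key_t)
    # Scale by sqrt(dim), then normalize each query row with Softmax.
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    fused = matmul(weights, value)  # weighted sum of the RGB value rows
    return fused, weights

# Toy inputs (assumed values): 2 hyperspectral tokens, 3 RGB tokens, 2 channels.
identity = [[1.0, 0.0], [0.0, 1.0]]
hsi = [[0.2, 0.8], [0.9, 0.1]]
rgb = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
fused, weights = cross_attention(hsi, rgb, identity, identity, identity)
```

Each row of `weights` sums to 1, so every fused row is a convex combination of the RGB value rows. In the described method the fused matrix would then be reshaped to a tensor, globally average-pooled, and passed through a linear mapping with GELU.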

[0010] Optionally, the hyperspectral backbone encoder includes a spectral band perception mapping module, an input mapping downsampling module, a multi-stage feature extraction backbone, and a stage-level spectral band weight modulation module, specifically including:
In the spectral band perception mapping module, the spectral band embedding channels of the initial spectral feature tensor are mapped to the backbone channel dimension by 1×1 convolution to obtain the channel mapping feature tensor. The channel mapping feature tensor is then modeled along the channel dimension by one-dimensional convolution to obtain the spectral band perception feature tensor.
In the input mapping downsampling module, the spectral band perception feature tensor is downsampled and channel mapped by a 4×4 convolution with a stride of 4 to obtain the stage mapping feature tensor.
The multi-stage feature extraction backbone includes four feature extraction stages. The input feature tensor of the first feature extraction stage is the stage mapping feature tensor; the input feature tensor of each subsequent feature extraction stage is the modulated output feature tensor of the preceding stage-level spectral band weight modulation module. Each feature extraction stage consists of several spectral band hybrid enhancement blocks, and each spectral band hybrid enhancement block includes a spatial feature branch and a spectral band feature branch.
In the spatial feature branch, the stage input feature tensor is processed by 7×7 convolution and layer normalization to extract and normalize spatial features, thereby obtaining the intermediate spatial feature tensor. The intermediate spatial feature tensor is subjected to channel expansion and nonlinear mapping by 1×1 convolution and the GELU activation function to obtain the expanded spatial feature tensor. The expanded spatial feature tensor is then mapped back to the original channel dimension by 1×1 convolution to generate the spatial feature tensor.
In the spectral band feature branch, the stage input feature tensor is processed by spectral dimension mixing mapping through 1×1 group convolution along the channel dimension, nonlinear mapping through the GELU activation function, and channel integration mapping through 1×1 convolution to obtain the spectral band feature tensor.
The spatial feature tensor and the spectral band feature tensor are concatenated along the channel dimension and fused by a trainable fusion mapping matrix to obtain the fused feature tensor. The stage input feature tensor and the fused feature tensor are then residually connected to obtain the stage output feature tensor.
In the stage-level spectral band weight modulation module, the stage output feature tensor of each feature extraction stage is globally average-pooled to obtain the stage channel statistical vector. The stage channel statistical vector is then passed through the Sigmoid function to generate the stage spectral band modulation vector. Based on the stage spectral band modulation vector, channel-by-channel modulation is performed on the stage output feature tensor to obtain the modulated output feature tensor. The modulated output feature tensor generated by the last stage-level spectral band weight modulation module is used as the hyperspectral feature tensor.
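The stage-level modulation at the end of [0010] amounts to a per-channel gate. The sketch below illustrates it in plain Python under stated assumptions: `modulate_channels` is a hypothetical name, the feature tensor is a nested list in channel × height × width order, and the Sigmoid is applied directly to the pooled statistics because the text does not specify any learned mapping in between.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def modulate_channels(features):
    """Gate each channel of a C x H x W feature tensor by its own pooled statistic."""
    # Global average pooling: one statistic per channel.
    stats = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in features]
    # Sigmoid turns each statistic into a modulation weight in (0, 1).
    gates = [sigmoid(s) for s in stats]
    # Channel-by-channel modulation of the stage output.
    modulated = [[[v * g for v in row] for row in ch] for ch, g in zip(features, gates)]
    return modulated, gates

# Toy stage output (assumed values): 2 channels of size 2 x 2.
features = [[[0.0, 0.0], [0.0, 0.0]],
            [[2.0, 2.0], [2.0, 2.0]]]
modulated, gates = modulate_channels(features)
```

A channel whose mean response is 0 receives a gate of 0.5, while channels with larger mean responses receive gates closer to 1.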

[0011] Optionally, the RGB auxiliary encoder includes a texture enhancement mapping module, an input mapping downsampling module, a four-stage feature extraction module, and a spatial weight guidance module, specifically including: In the texture enhancement mapping module, the initial appearance feature tensor is subjected to two 3×3 convolution operations to generate the texture enhancement feature tensor; In the input mapping downsampling module, the texture enhancement feature tensor is downsampled and channel mapped by a 4×4 convolution with a stride of 4 to obtain the RGB mapping feature tensor. The four-stage feature extraction module includes four RGB feature extraction stages. The RGB input feature tensor of the first RGB feature extraction stage is the RGB mapping feature tensor, and the RGB input feature tensor of each of the second to fourth RGB feature extraction stages is the RGB output feature tensor of the preceding RGB feature extraction stage. The RGB output feature tensor of the fourth RGB feature extraction stage is input into the spatial weight guidance module, and a spatial weight logit map is generated through 1×1 convolution. The spatial weight logit map is then normalized by Softmax to obtain the spatial weight map. Based on the spatial weight map, position-wise weighted modulation is performed on the RGB output feature tensor to generate the RGB visual feature tensor.
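The spatial weight guidance step at the end of [0011] can be illustrated with a short plain-Python sketch. The name `spatial_weighting` and the toy values are hypothetical, the tensor layout is assumed to be channel × height × width, and the Softmax is assumed to run over all spatial positions, which the text does not state explicitly.

```python
import math

def spatial_weighting(features, conv_weight, conv_bias=0.0):
    """1x1 convolution -> Softmax over positions -> position-wise modulation."""
    channels, height, width = len(features), len(features[0]), len(features[0][0])
    # A 1x1 convolution with one output channel is a per-position weighted
    # sum over the input channels.
    logits = [[sum(conv_weight[ch] * features[ch][r][c] for ch in range(channels)) + conv_bias
               for c in range(width)] for r in range(height)]
    # Softmax over all H x W positions (assumed normalization axis).
    peak = max(max(row) for row in logits)
    exps = [[math.exp(v - peak) for v in row] for row in logits]
    total = sum(sum(row) for row in exps)
    weight_map = [[e / total for e in row] for row in exps]
    # Every channel is scaled by the same spatial weight at each position.
    weighted = [[[features[ch][r][c] * weight_map[r][c] for c in range(width)]
                 for r in range(height)] for ch in range(channels)]
    return weighted, weight_map

# Toy RGB output features (assumed values): 2 channels of size 2 x 2.
features = [[[1.0, 1.0], [1.0, 1.0]],
            [[2.0, 2.0], [2.0, 2.0]]]
weighted, weight_map = spatial_weighting(features, conv_weight=[0.5, 0.5])
```

With spatially uniform features every position receives the same weight, 1/(H·W); non-uniform features would shift weight toward positions with larger logits.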

[0012] Optionally, step five specifically includes: Obtain the number of target ingredient types in the food to be tested; Based on the number of target component types, a component content decoding parameter matrix and a component content bias vector are constructed through parameter initialization methods. Based on the component content decoding parameter matrix and the component content bias vector, a linear mapping is performed on the component prediction feature vector to obtain the component content prediction vector; The component distribution decoding parameter matrix and component distribution bias vector are constructed using parameter initialization methods. Based on the component distribution decoding parameter matrix and the component distribution bias vector, a linear mapping is performed on the component prediction feature vector to generate the component distribution logit vector; The component composition distribution vector is obtained by performing Softmax normalization on the component distribution logit vector.
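The two decoding heads in [0012] can be sketched as two linear mappings of the same feature vector, one of which is followed by Softmax. This is an illustrative plain-Python sketch; the names `linear`, `softmax`, and `decode`, and all parameter values, are assumptions rather than values from the patent.

```python
import math

def linear(vector, weight, bias):
    """Apply y = W x + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weight, bias)]

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def decode(feature, w_content, b_content, w_dist, b_dist):
    """Map one component prediction feature vector to content and distribution vectors."""
    content = linear(feature, w_content, b_content)            # component content prediction
    distribution = softmax(linear(feature, w_dist, b_dist))    # component composition distribution
    return content, distribution

# Toy parameters (assumed): 2-dimensional feature, 3 target component types.
feature = [1.0, 2.0]
w_content = [[0.1, 0.2], [0.3, 0.1], [0.0, 0.5]]
b_content = [0.0, 0.1, 0.0]
w_dist = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
b_dist = [0.0, 0.0, 0.0]
content, distribution = decode(feature, w_content, b_content, w_dist, b_dist)
```

Both parameter matrices have one row per target component type, so the number of target component types fixes the output length of both heads.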

[0013] Optionally, step six specifically includes:
Set a lower and an upper content threshold for each target component, and perform a compliance check on each dimension of the component content prediction vector: if the current dimension of the component content prediction vector is greater than or equal to the lower content threshold of the target component and less than or equal to the upper content threshold of the target component, a compliance label of 1 is generated; otherwise, a compliance label of 0 is generated. All compliance labels are summed and recorded as the compliance score. If the compliance score equals the number of target component types, the compliance judgment result of the food to be tested is qualified; otherwise, the compliance judgment result is abnormal.
Set the formula distribution vector and the formula consistency judgment threshold, and perform a consistency judgment on the component composition distribution vector: calculate the L1 norm of the difference between the component composition distribution vector and the formula distribution vector to obtain the consistency deviation value. If the consistency deviation value is less than or equal to the formula consistency judgment threshold, the consistency judgment result of the food to be tested is that the formula is consistent; otherwise, the consistency judgment result is that the formula is inconsistent.
Based on the compliance judgment result and the consistency judgment result, a risk level label is generated for the food to be tested: set the number of risk levels; for each target component, subtract its compliance label from 1, and sum these differences to obtain the complementary compliance score. Based on the consistency judgment result, a formula indication score is generated: if the formula is consistent, the formula indication score is 0; if the formula is inconsistent, the formula indication score is 1. Calculate the sum of the complementary compliance score, the formula indication score, and 1 to obtain the raw risk level value. The smaller of the number of risk levels and the raw risk level value is used as the risk level label of the food to be tested.
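The judgment logic of [0013] is simple arithmetic and can be followed step by step in the sketch below. The function name `assess_risk`, the thresholds, and the sample vectors are illustrative assumptions; only the rules themselves come from the text.

```python
def assess_risk(content, lower, upper, distribution, formula,
                consistency_threshold, num_levels):
    """Return (compliant, consistent, risk_level) for one food sample."""
    # Compliance label per component: 1 if within [lower, upper], else 0.
    labels = [1 if lo <= c <= hi else 0 for c, lo, hi in zip(content, lower, upper)]
    compliant = sum(labels) == len(labels)
    # Consistency: L1 norm of the difference from the formula distribution.
    deviation = sum(abs(p - q) for p, q in zip(distribution, formula))
    consistent = deviation <= consistency_threshold
    # Risk level: number of non-compliant components + formula score + 1,
    # capped at the number of risk levels.
    complementary_score = sum(1 - label for label in labels)
    formula_score = 0 if consistent else 1
    risk_level = min(num_levels, complementary_score + formula_score + 1)
    return compliant, consistent, risk_level

# Hypothetical thresholds and formula for three target components.
lower, upper = [10.0, 5.0, 1.0], [20.0, 10.0, 3.0]
formula = [0.5, 0.3, 0.2]

ok = assess_risk([15.0, 7.0, 2.0], lower, upper, [0.5, 0.3, 0.2], formula, 0.1, 4)
bad = assess_risk([25.0, 7.0, 2.0], lower, upper, [0.8, 0.1, 0.1], formula, 0.1, 4)
capped = assess_risk([25.0, 7.0, 2.0], lower, upper, [0.8, 0.1, 0.1], formula, 0.1, 2)
print(ok, bad, capped)
```

Under these rules a fully compliant, formula-consistent sample always receives risk level 1, and each non-compliant component or a formula mismatch raises the level by one until the cap is reached.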

[0014] According to an embodiment of the present invention, a deep learning-based intelligent food component detection system includes: The image acquisition module is used to acquire food image sample pairs of the food to be tested; The image preprocessing module is used to preprocess food image sample pairs to generate standard image sample pairs; The feature construction module is used to construct spectral and appearance features from standard image sample pairs, generating initial spectral feature tensors and initial appearance feature tensors. The component feature modeling module is used to build a food component detection model based on the improved ConvNeXt V2 network. It inputs the initial spectral feature tensor and the initial appearance feature tensor into the food component detection model to perform joint feature modeling and prediction inference, and obtains the component prediction feature vector. The component decoding module is used to perform decoding operations on the component prediction feature vector to obtain the component content prediction vector and the component composition distribution vector. The component determination and risk assessment module is used to perform compliance and consistency determination of food components based on the component content prediction vector and component composition distribution vector, and generate the risk level label of the food to be tested. The model update module is used to update the version of the food component detection model.

[0015] The beneficial effects of this invention are: This invention introduces an improved ConvNeXt V2 network to construct a food component detection model comprising a hyperspectral backbone encoder, an RGB auxiliary encoder, a spectral-appearance feature fusion module, and a component prediction module. This enables collaborative modeling and joint representation of food spectral and appearance structural information, thus simultaneously considering differences in component spectral response and spatial texture distribution characteristics during the feature modeling stage, reducing the sensitivity of a single modality to illumination changes, noise interference, and surface morphology differences. In the decoding stage, the component prediction feature vector is mapped to a component content prediction vector and a component composition distribution vector, respectively, giving the detection results both quantitative numerical representation and formulation structure representation capabilities. In the judgment stage, a risk level label is generated through a joint judgment mechanism of component content threshold constraints and formulation consistency constraints, avoiding misjudgments and omissions caused by relying on a single indicator. In the model operation stage, a backpropagation update mechanism based on re-tested component labeling data is further introduced to construct a closed-loop optimization process of detection, feedback, and update. This allows the food component detection model to continuously correct parameters and maintain stable detection accuracy during long-term operation, thereby improving the accuracy, robustness, and adaptability of component detection without increasing manual detection costs.

Attached Figure Description

[0016] The accompanying drawings are provided to further illustrate the invention and form part of the specification. They are used in conjunction with embodiments of the invention to explain the invention and do not constitute a limitation thereof. In the drawings:
Figure 1 is a schematic diagram of the deep learning-based intelligent detection method and system for food components proposed in this invention.
Figure 2 is a flowchart of the food component detection model structure in the deep learning-based intelligent detection method and system for food components proposed in this invention.
Figure 3 is a flowchart of the decoding and judgment process in the deep learning-based intelligent detection method and system for food components proposed in this invention.

Detailed Implementation

[0017] The present invention will now be described in further detail with reference to the accompanying drawings. These drawings are simplified schematic diagrams, illustrating only the basic structure of the invention, and therefore only show the components relevant to the invention.

[0018] Referring to Figures 1-3, a deep learning-based intelligent detection method for food components includes the following steps:
Step 1: Collect food image sample pairs of the food to be tested;
Step 2: Perform image preprocessing on the food image sample pairs to generate standard image sample pairs;
Step 3: Construct spectral and appearance features from the standard image sample pairs to generate an initial spectral feature tensor and an initial appearance feature tensor;
Step 4: Construct a food component detection model based on the improved ConvNeXt V2 network, and input the initial spectral feature tensor and the initial appearance feature tensor into the food component detection model for joint feature modeling and prediction inference to obtain the component prediction feature vector. The food component detection model includes a hyperspectral backbone encoder, an RGB auxiliary encoder, a spectral-appearance feature fusion module, and a component prediction module;
Step 5: Perform a decoding operation on the component prediction feature vector to obtain the component content prediction vector and the component composition distribution vector of the food to be tested;
Step 6: Based on the component content prediction vector and the component composition distribution vector, perform a food component compliance and consistency determination, determine the risk level, and generate a risk level label for the food to be tested;
Step 7: Collect the re-tested component labeling data of the food to be tested, and update the food component detection model using the backpropagation method.

[0019] In this invention, the re-tested component labeling data include the true component content labeling vector and the true component composition distribution vector. The mean squared error loss between the component content prediction vector and the true component content labeling vector is calculated, and the cross-entropy loss between the component composition distribution vector and the true component composition distribution vector is calculated. The mean squared error loss and the cross-entropy loss are combined as a weighted sum to form a joint supervised loss function. Based on the joint supervised loss function, the gradient of each network parameter in the food component detection model is calculated using the backpropagation method, and the network parameters are updated using gradient descent to generate an updated version of the food component detection model.
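The joint supervised loss in [0019] can be written out in a few lines. The sketch below is illustrative only: the function names, the loss weights `w_mse` and `w_ce` (the text says the two losses are weighted but gives no values), and the small `eps` guard inside the logarithm are all assumptions.

```python
import math

def mse_loss(pred, target):
    """Mean squared error between predicted and true component contents."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def cross_entropy_loss(pred_dist, true_dist, eps=1e-12):
    """Cross-entropy between predicted and true composition distributions."""
    return -sum(t * math.log(p + eps) for p, t in zip(pred_dist, true_dist))

def joint_loss(content_pred, content_true, dist_pred, dist_true,
               w_mse=1.0, w_ce=1.0):
    """Weighted sum of the two losses; the weights are hypothetical."""
    return (w_mse * mse_loss(content_pred, content_true)
            + w_ce * cross_entropy_loss(dist_pred, dist_true))

# Toy re-test labels (assumed values).
loss = joint_loss(content_pred=[12.0, 8.0], content_true=[10.0, 8.0],
                  dist_pred=[0.5, 0.5], dist_true=[1.0, 0.0])
```

Here the mean squared error term is 2.0 and the cross-entropy term is ln 2, so the joint loss is about 2.693. In training, the gradient of this scalar with respect to each network parameter would be obtained by backpropagation and applied with a gradient descent step of the form `param = param - learning_rate * grad`.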

[0020] In this embodiment, step one specifically includes: Obtain the sample identification information of the food to be tested; the sample identification information includes the food number, batch number, collection time, and collection station identification. Hyperspectral imaging images of the food to be tested are acquired using a hyperspectral imaging acquisition device, and RGB visible light images of the food to be tested are acquired using a visible light image acquisition device. Based on the sample identification information, the hyperspectral imaging image of the food to be tested is matched with the RGB visible light image to form a food image sample pair.

[0021] In this invention, the hyperspectral imaging image includes the reflection intensity distribution information of the food to be detected under multiple continuous spectral bands and the corresponding spatial pixel position information. The reflection intensity distribution information under multiple continuous spectral bands can characterize the differences in absorption and reflection of different components in the food to different wavelengths of light, and is used to characterize the spectral characteristics of components such as water, fat, protein, sugar, and additives in the food, thereby providing a direct basis for component identification, such as food component content estimation, component category identification, and adulteration identification. The RGB visible light image includes the appearance color distribution information, surface texture information, and structural morphology information of the food to be detected, and is used to characterize the appearance structural features, particle distribution state, surface uniformity, and location of foreign objects or abnormal areas of the food. By providing spatial region constraints and appearance structure reference information for the hyperspectral imaging image, it helps to locate key areas related to components and complements the hyperspectral imaging image, thereby improving the stability and reliability of food component detection.

[0022] In this embodiment, step two specifically includes: Image preprocessing includes noise suppression, resolution resampling, intensity normalization, and spatial registration. Noise suppression is performed on the hyperspectral imaging image and the RGB visible light image using median filtering to obtain a denoised hyperspectral image and a denoised RGB image. In this invention, median filtering is an existing technique, and the specific implementation process is as follows: Assume that in a hyperspectral imaging image, the 3×3 neighborhood centered on a pixel with intensity 255 has the intensity values [52, 50, 51; 49, 255, 50; 48, 51, 49]. The value 255 is clearly an abnormal noise point caused by sensor jitter or impulse noise. First, sort the 9 values in the neighborhood as [48, 49, 49, 50, 50, 51, 51, 52, 255]. Then take the value in the middle position, 50, as the new intensity value of the center pixel, that is, replace the original 255 with 50. This effectively removes prominent noise points while preserving the texture and edge structure of the surrounding food image without obvious smoothing or blurring, thereby achieving noise suppression for hyperspectral imaging images and RGB visible light images.
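The worked median filtering example can be reproduced with a few lines of plain Python. This is a minimal sketch: `median_filter_3x3` is a hypothetical helper, and leaving border pixels unchanged is an assumption, since the text does not say how image borders are handled.

```python
from statistics import median

def median_filter_3x3(image):
    """Apply a 3x3 median filter to a 2-D list of pixel intensities.

    Border pixels are copied unchanged (an assumption for this sketch).
    """
    rows, cols = len(image), len(image[0])
    out = [row[:] for row in image]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            window = [image[r + dr][c + dc]
                      for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
            out[r][c] = median(window)  # middle of the 9 sorted values
    return out

# The neighborhood from the example, with an impulse-noise value of 255.
patch = [[52, 50, 51],
         [49, 255, 50],
         [48, 51, 49]]
filtered = median_filter_3x3(patch)
print(filtered[1][1])  # 50
```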

[0023] The denoised hyperspectral image and the denoised RGB image are resampled to the target resolution using bilinear interpolation to obtain the resampled hyperspectral image and the resampled RGB image. In this invention, bilinear interpolation is an existing technique. The specific implementation process is as follows: Assume that when resampling a denoised hyperspectral image, the pixel intensity value of a target pixel (x, y) at the target resolution must be calculated. In the original image, the target pixel (x, y) falls between four adjacent pixels (x1, y1), (x2, y1), (x1, y2), and (x2, y2), with pixel intensity values of 40, 60, 80, and 100 respectively. If the relative position weights of the target pixel (x, y) in the horizontal and vertical directions are 0.25 and 0.5 respectively, then in the horizontal direction the interpolation of the two upper adjacent pixels is 40×(1−0.25)+60×0.25=45, and the interpolation of the two lower adjacent pixels is 80×(1−0.25)+100×0.25=85; in the vertical direction, the interpolation of these two results is 45×(1−0.5)+85×0.5=65. Therefore, the new pixel intensity value of the target pixel is 65. By repeating this calculation for each target pixel in the entire image, the original image can be smoothly resampled to a new spatial resolution while maintaining the continuity of the food image in terms of spatial structure and grayscale variation.
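The bilinear interpolation arithmetic in the example can be checked with the short sketch below. The function name `bilinear` and its parameter names are illustrative; `wx` and `wy` denote the horizontal and vertical relative position weights.

```python
def bilinear(top_left, top_right, bottom_left, bottom_right, wx, wy):
    """Interpolate one target pixel from its four neighbors."""
    top = top_left * (1 - wx) + top_right * wx           # upper pair, horizontal
    bottom = bottom_left * (1 - wx) + bottom_right * wx  # lower pair, horizontal
    return top * (1 - wy) + bottom * wy                  # vertical blend

# Intensities 40, 60, 80, 100 with weights 0.25 (horizontal) and 0.5 (vertical).
value = bilinear(40, 60, 80, 100, wx=0.25, wy=0.5)
print(value)  # 65.0
```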

[0024] Intensity normalization specifically involves performing minimum-maximum normalization on the intensity value of each pixel in the resampled hyperspectral image and the resampled RGB image to obtain the standard hyperspectral image and the normalized RGB image. In this invention, minimum-maximum normalization is a prior art technique, and the specific implementation process is as follows: Assuming that the minimum pixel intensity of a certain spectral band in a resampled hyperspectral image is 20 and the maximum is 220, when performing minimum-maximum normalization on a pixel with an original pixel intensity of 120, the normalization result is calculated according to the formula (120−20) / (220−20)=100 / 200=0.5; Similarly, a pixel with an original pixel intensity of 20 becomes 0 after minimum-maximum normalization, and a pixel with an original pixel intensity of 220 becomes 1 after minimum-maximum normalization.
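A minimal sketch of the minimum-maximum normalization for a single spectral band is given below. The function name `min_max_normalize` and the small constant `eps`, which guards against division by zero for a band with constant intensity, are assumptions added for illustration and are not described in the specification.

```python
import numpy as np

def min_max_normalize(band: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale the intensities of one band (or channel) to the range [0, 1]."""
    lo, hi = band.min(), band.max()
    # Assumption: eps avoids division by zero when all pixels share one value.
    return (band - lo) / max(hi - lo, eps)

# Intensities from the worked example: minimum 20, maximum 220, sample value 120.
band = np.array([20.0, 120.0, 220.0])
normalized = min_max_normalize(band)
```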

[0025] Spatial registration specifically involves: using a standard hyperspectral image as a reference image, mapping the normalized RGB image to the spatial coordinate system of the standard hyperspectral image through geometric transformation to obtain the standard RGB image; Standard image sample pairs are constructed by combining standard hyperspectral images with standard RGB images.

[0026] In this embodiment, step three specifically includes: Based on the standard hyperspectral image, the pixel intensity values of each hyperspectral band are extracted at each pixel coordinate and arranged in the order of the spectral band index to obtain the spectral band vector. The spectral band vector is compressed and mapped by a linear mapping and the GELU activation function to obtain the spectral embedding vector. The spectral embedding vectors are stacked according to their spatial positions to obtain the initial spectral feature tensor. The standard RGB image is convolved position by position in the spatial dimension using a sliding window method through 3×3 convolution to generate the initial appearance feature tensor.
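The spectral path of step three could be sketched as follows, under stated assumptions: the cube is stored as height × width × bands, the tensor sizes are hypothetical, the weights are random placeholders rather than trained parameters, and GELU is computed with its common tanh approximation. The 3×3 convolution of the RGB path is not shown.

```python
import math
import numpy as np

def gelu(x):
    """GELU activation (tanh approximation)."""
    return 0.5 * x * (1.0 + np.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def spectral_embedding(cube, weight, bias):
    """Map each pixel's spectral band vector to a spectral embedding vector.

    cube:   (H, W, B) standard hyperspectral image, bands ordered by band index
    weight: (B, D) linear mapping, bias: (D,)
    returns (H, W, D): embeddings stacked by spatial position
    """
    return gelu(cube @ weight + bias)

rng = np.random.default_rng(0)
H, W, B, D = 4, 5, 30, 8            # hypothetical sizes
cube = rng.random((H, W, B))        # placeholder for normalized intensities
weight = rng.normal(scale=0.1, size=(B, D))
bias = np.zeros(D)
initial_spectral_features = spectral_embedding(cube, weight, bias)
```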

[0027] In this embodiment, step four specifically includes: The initial spectral feature tensor is input into the hyperspectral backbone encoder to generate the hyperspectral feature tensor. The initial appearance feature tensor is input into the RGB auxiliary encoder to generate the RGB visual feature tensor. The hyperspectral feature tensor and the RGB visual feature tensor are input into the spectral-appearance feature fusion module for attention operations to obtain the food component fusion feature tensor, specifically: Define a feature mapping space, and map the hyperspectral feature tensor to the feature mapping space through a trainable query mapping matrix to obtain the hyperspectral query tensor; The RGB visual feature tensor is mapped to the feature mapping space through trainable key mapping matrix and value mapping matrix respectively, to obtain the RGB key tensor and RGB value tensor; Flatten the hyperspectral query tensor, RGB key tensor, and RGB value tensor into matrix form in the spatial dimension to obtain the hyperspectral query matrix, RGB key matrix, and RGB value matrix. Multiply the hyperspectral query matrix with the transpose of the RGB key matrix, divide by the square root of the dimension of the feature mapping space, and then normalize using the Softmax function to generate the food component attention weight matrix. Based on the food component attention weight matrix, the RGB value matrix is weighted and summed to obtain the food component fusion feature matrix. The food component fusion feature matrix is then reshaped into a tensor form to obtain the food component fusion feature tensor. In the component prediction module, the food component fusion feature tensor is subjected to global average pooling to obtain the food component fusion feature vector; the food component fusion feature vector is then subjected to linear mapping and GELU activation to obtain the component prediction feature vector.
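The attention operation of the spectral-appearance feature fusion module can be illustrated with the following NumPy sketch. It assumes channel-last feature tensors, a single attention head, and random placeholder matrices `w_q`, `w_k`, and `w_v` in place of the trained query, key, and value mapping matrices; all sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)   # for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fusion(hsi_feat, rgb_feat, w_q, w_k, w_v):
    """Queries come from hyperspectral features; keys and values from RGB features."""
    h, w, _ = hsi_feat.shape
    d = w_q.shape[1]                                # dimension of the feature mapping space
    q = (hsi_feat @ w_q).reshape(-1, d)             # hyperspectral query matrix
    k = (rgb_feat @ w_k).reshape(-1, d)             # RGB key matrix
    v = (rgb_feat @ w_v).reshape(-1, d)             # RGB value matrix
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)   # attention weight matrix
    fused = (attn @ v).reshape(h, w, d)             # weighted sum, reshaped to a tensor
    return fused, attn

rng = np.random.default_rng(0)
H, W, C, D = 4, 4, 16, 8                            # hypothetical sizes
hsi_feat = rng.normal(size=(H, W, C))
rgb_feat = rng.normal(size=(H, W, C))
w_q, w_k, w_v = (rng.normal(scale=0.1, size=(C, D)) for _ in range(3))
fused, attn = cross_attention_fusion(hsi_feat, rgb_feat, w_q, w_k, w_v)
fusion_vector = fused.mean(axis=(0, 1))             # global average pooling
```

Each row of `attn` sums to 1, so every hyperspectral position receives a convex combination of RGB value vectors. The final linear mapping and GELU activation of the component prediction module are omitted here.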

[0028] In this embodiment, the hyperspectral backbone encoder includes a spectral band sensing mapping module, an input mapping downsampling module, a multi-stage feature extraction backbone, and a stage-level spectral band weight modulation module, specifically including: In the spectral band sensing mapping module, the spectral band embedding channels of the initial spectral feature tensor are mapped to the backbone channel dimension by 1×1 convolution to obtain the channel mapping feature tensor. One-dimensional convolution is then applied to the channel mapping feature tensor along the channel dimension to obtain the spectral band-aware feature tensor. In the input mapping downsampling module, the spectral band-aware feature tensor is downsampled and channel mapped by a 4×4 convolution with a stride of 4 to obtain the stage mapping feature tensor. The multi-stage feature extraction backbone consists of four feature extraction stages. The input feature tensor of the first feature extraction stage is the stage mapping feature tensor, and the input feature tensor of each subsequent feature extraction stage is the modulation output feature tensor produced by the stage-level spectral band weight modulation module after the preceding stage. Each feature extraction stage consists of several spectral band hybrid enhancement blocks, and each spectral band hybrid enhancement block includes a spatial feature branch and a spectral band feature branch. In the spatial feature branch, the stage input feature tensor is processed by 7×7 convolution and layer normalization to extract and normalize spatial features, thereby obtaining the intermediate spatial feature tensor. The intermediate spatial feature tensor is subjected to channel expansion and nonlinear mapping by 1×1 convolution and the GELU activation function to obtain the expanded spatial feature tensor. The expanded spatial feature tensor is then subjected to channel back mapping by 1×1 convolution to generate the spatial feature tensor. In the spectral band feature branch, the stage input feature tensor undergoes spectral band mixing mapping through 1×1 grouped convolution along the channel dimension, nonlinear mapping through the GELU activation function, and channel integration mapping through 1×1 convolution to obtain the spectral band feature tensor. The spatial feature tensor and the spectral band feature tensor are concatenated along the channel dimension and fused by a trainable fusion mapping matrix to obtain the fused feature tensor. The stage input feature tensor and the fused feature tensor are then combined through a residual connection to obtain the stage output feature tensor. In the stage-level spectral band weight modulation module, the stage output feature tensor of each feature extraction stage is subjected to global average pooling to obtain the stage channel statistical vector. The stage channel statistical vector is then mapped through the Sigmoid function to generate the stage spectral band modulation vector. Based on the stage spectral band modulation vector, channel-by-channel modulation is performed on the stage output feature tensor to obtain the modulation output feature tensor. The modulation output feature tensor generated by the last stage-level spectral band weight modulation module is used as the hyperspectral feature tensor.
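The stage-level spectral band weight modulation can be sketched as below. The linear mapping (`w`, `b`) applied before the Sigmoid function is an assumption made for illustration, since the specification only states that the stage channel statistical vector is mapped and passed through the Sigmoid function; the function names are likewise hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def stage_band_modulation(stage_out, w, b):
    """Channel-by-channel modulation of a stage output feature tensor.

    stage_out: (H, W, C) stage output feature tensor
    w: (C, C), b: (C,)  assumed trainable mapping applied before the Sigmoid
    """
    stats = stage_out.mean(axis=(0, 1))   # global average pooling -> (C,)
    gate = sigmoid(stats @ w + b)         # modulation vector with entries in (0, 1)
    return stage_out * gate, gate         # gate is broadcast over H and W

rng = np.random.default_rng(0)
stage_out = rng.normal(size=(4, 4, 6))    # hypothetical stage output
w = np.zeros((6, 6))                      # zero mapping gives a gate of 0.5 everywhere
b = np.zeros(6)
modulated, gate = stage_band_modulation(stage_out, w, b)
```

With trained parameters, channels that carry informative spectral band responses would receive gate values closer to 1, and less informative channels would be attenuated.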

[0029] In this embodiment, the RGB auxiliary encoder includes a texture enhancement mapping module, an input mapping downsampling module, a four-stage feature extraction module, and a spatial weight guidance module, specifically including: In the texture enhancement mapping module, the initial appearance feature tensor is subjected to two 3×3 convolution operations to generate the texture enhancement feature tensor. In the input mapping downsampling module, the texture enhancement feature tensor is downsampled and channel mapped by a 4×4 convolution with a stride of 4 to obtain the RGB mapping feature tensor. The four-stage feature extraction module includes four RGB feature extraction stages. The RGB input feature tensor of the first RGB feature extraction stage is the RGB mapping feature tensor, and the RGB input feature tensor of each of the second to fourth RGB feature extraction stages is the RGB output feature tensor of the preceding RGB feature extraction stage. The RGB output feature tensor of the fourth RGB feature extraction stage is input into the spatial weight guidance module, and a spatial weight logit map is generated through 1×1 convolution. The spatial weight logit map is then normalized by Softmax to obtain the spatial weight map. Based on the spatial weight map, position-wise weighted modulation is performed on the RGB output feature tensor to generate the RGB visual feature tensor.
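A sketch of the spatial weight guidance module follows. It assumes that the 1×1 convolution produces a single logit per spatial position and that the Softmax is taken over all spatial positions; neither detail is fixed by the specification, and the function name and sizes are hypothetical.

```python
import numpy as np

def spatial_weight_guidance(rgb_out, w, b=0.0):
    """Position-wise modulation of the RGB output feature tensor.

    rgb_out: (H, W, C) output of the fourth RGB feature extraction stage
    w: (C,) weights of a 1x1 convolution with one output channel (assumption)
    """
    h, wd, _ = rgb_out.shape
    logits = rgb_out @ w + b                      # spatial weight logit map, (H, W)
    flat = logits.reshape(-1)
    e = np.exp(flat - flat.max())                 # numerically stable Softmax
    weight_map = (e / e.sum()).reshape(h, wd)     # spatial weight map, sums to 1
    return rgb_out * weight_map[..., None], weight_map

rng = np.random.default_rng(0)
rgb_out = rng.normal(size=(3, 3, 4))              # hypothetical sizes
modulated, weight_map = spatial_weight_guidance(rgb_out, np.zeros(4))
```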

[0030] In this invention, both the hyperspectral backbone encoder and the RGB auxiliary encoder are built on the original ConvNeXt V2 network structure and inherit its hierarchical convolutional backbone framework. The overall structure still adopts a hierarchical feature extraction structure consisting of an input mapping layer and four feature extraction stages. Between stages, downsampling layers reduce the spatial resolution and increase the channel dimension. Within each stage, the basic computational paradigm of ConvNeXt V2 is followed, that is, large-kernel depthwise separable convolutions serve as the spatial modeling units and are combined with layer normalization, a pointwise convolution structure for channel expansion and back mapping, and a residual connection mechanism to realize local spatial feature modeling and channel dimension feature transformation. At the same time, the stage-by-stage stacking method and feature pyramid structure are kept in line with ConvNeXt V2, so that the two branches are consistent with the original ConvNeXt V2 network in terms of overall network topology, hierarchical organization, and hierarchical feature abstraction mechanism.

[0031] While maintaining the aforementioned backbone structure, the hyperspectral backbone encoder and the RGB auxiliary encoder each receive structural improvements tailored to the characteristics of their input data. Specifically, the hyperspectral backbone encoder adds a spectral band sensing mapping module at the input end, which performs spectral band mixing modeling on the multi-band spectral input through one-dimensional convolution and channel mapping operations. It also introduces, within the backbone, spectral band hybrid enhancement blocks composed of a spatial feature branch and a spectral band feature branch. Furthermore, a stage-level spectral band weight modulation module is placed after each stage to modulate the responses of different spectral bands channel by channel, thus explicitly adding spectral band dimension modeling and spectral band weight modulation mechanisms on top of the original ConvNeXt V2 spatial modeling. The RGB auxiliary encoder, while maintaining the ConvNeXt V2 backbone structure, adds a texture enhancement mapping module at the input end and a spatial weight guidance module at the output end. It enhances appearance texture and structural regions through lightweight convolution and spatial weight modulation without introducing additional spectral band dimension modeling structures, thus forming a differentiated dual-branch structure with the hyperspectral branch as the main branch and the RGB branch as the auxiliary branch.

[0032] Through the aforementioned structural improvements, the hyperspectral backbone encoder inherits the strong spatial feature modeling capability of ConvNeXt V2 and further introduces explicit modeling of correlations along the spectral band dimension and adaptive modulation of spectral band weights. This enables the network to characterize more effectively the response differences of food components across different spectral bands, thereby enhancing its ability to express and distinguish component-related spectral features. The RGB auxiliary encoder, through texture enhancement and spatial weight guidance, emphasizes only key appearance regions and texture structures, providing stable spatial prior constraints for the hyperspectral branch without significantly increasing model complexity. The combination of these two branches gives the food component detection model both a dedicated modeling capability for hyperspectral component information and a region guidance capability based on RGB appearance information, thereby improving feature discriminability, modeling stability, and detection accuracy in food component detection tasks.

[0033] In this embodiment, step five specifically includes: Obtain the number of target ingredient types in the food to be tested; Based on the number of target component types, a component content decoding parameter matrix and a component content bias vector are constructed through parameter initialization. The component content decoding parameter matrix is structured as the number of target component types × the feature dimension of the component prediction feature vector, and the feature dimension of the component content bias vector is equal to the number of target component types. Based on the component content decoding parameter matrix and the component content bias vector, a linear mapping is performed on the component prediction feature vector to obtain the component content prediction vector; where the i-th component of the component content prediction vector represents the predicted content value of the i-th target component. The component distribution decoding parameter matrix and component distribution bias vector are constructed using parameter initialization methods. Based on the component distribution decoding parameter matrix and the component distribution bias vector, a linear mapping is performed on the component prediction feature vector to generate the component distribution logit vector; The component composition distribution vector is obtained by performing Softmax normalization on the component distribution logit vector. Here, the i-th component of the component composition distribution vector represents the compositional proportion of the i-th target component in the food to be tested. 
For example, if the number of target component types in the food to be tested is set to 4, corresponding to moisture, protein, fat, and sugar respectively, and the feature dimension of the component prediction feature vector is 128, the component content prediction vector obtained through linear mapping is [62.5, 12.3, 8.7, 5.4], where the first to fourth dimensions represent the predicted content values of moisture, protein, fat, and sugar in the food to be tested, respectively; the generated component distribution logit vector is [2.1, 1.3, 0.2, −0.4], and the component composition distribution vector obtained after performing Softmax normalization is approximately [0.59, 0.27, 0.09, 0.05], where the first to fourth dimensions represent the composition ratio of moisture, protein, fat, and sugar in the food to be tested, respectively.
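The two decoding heads of step five can be sketched as follows. The parameter matrices are random placeholders rather than trained values, and the names `decode`, `w_content`, and `w_dist` are hypothetical. The last line applies Softmax normalization to the component distribution logit vector [2.1, 1.3, 0.2, −0.4] from the example, so the resulting proportions can be checked numerically.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))          # subtract the maximum for numerical stability
    return e / e.sum()

def decode(feature, w_content, b_content, w_dist, b_dist):
    """Decode the component prediction feature vector with two linear heads."""
    content = w_content @ feature + b_content     # component content prediction vector
    logits = w_dist @ feature + b_dist            # component distribution logit vector
    return content, softmax(logits)               # second output sums to 1

rng = np.random.default_rng(0)
K, D = 4, 128                                     # 4 target component types, 128-dim feature
feature = rng.normal(size=D)
w_content = rng.normal(scale=0.01, size=(K, D))   # placeholder parameters
w_dist = rng.normal(scale=0.01, size=(K, D))
content, distribution = decode(feature, w_content, np.zeros(K), w_dist, np.zeros(K))

example_distribution = softmax(np.array([2.1, 1.3, 0.2, -0.4]))
```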

[0034] In this invention, the component prediction feature vector already integrates the component-related discriminative information contained in the hyperspectral features and the RGB appearance features, and its response patterns to different components are separable in the feature space. Therefore, by constructing the component content decoding parameter matrix and the component distribution decoding parameter matrix and performing a linear mapping on the component prediction feature vector, the high-dimensional discriminative features can be mapped to a component semantic space whose dimension is the number of target component types. This yields the component content prediction vector, which represents the quantitative value of each target component, and the component composition distribution vector, which represents the relative proportion of each target component. The Softmax normalization ensures that the entries of the component composition distribution vector are non-negative and sum to 1, so that each entry can be used directly as a component proportion. The above decoding method can therefore stably and interpretably convert the component prediction feature vector output by the network into the quantitative results and structured distribution results required for the food component detection task.

[0035] In this embodiment, step six specifically includes: Set a lower content threshold and an upper content threshold for each target component, and perform a compliance check on each component of the component content prediction vector: if the component in the current dimension is greater than or equal to the lower content threshold and less than or equal to the upper content threshold of the corresponding target component, a compliance label of 1 is generated; otherwise, a compliance label of 0 is generated. All compliance labels are summed and recorded as the compliance score. If the compliance score equals the number of target component types, the compliance judgment result of the food to be tested is qualified; otherwise, the compliance judgment result is abnormal. Set the formula distribution vector and the formula consistency judgment threshold, and perform a consistency judgment on the component composition distribution vector: calculate the L1 norm of the difference between the component composition distribution vector and the formula distribution vector to obtain the consistency deviation value. If the consistency deviation value is less than or equal to the formula consistency judgment threshold, the consistency judgment result of the current food to be tested is that the formula is consistent; otherwise, the consistency judgment result is that the formula is inconsistent. Based on the compliance judgment result and the consistency judgment result, a risk level label is generated for the food to be tested: Set the number of risk levels; for each target component, subtract its compliance label from 1, and sum these differences to obtain the complementary compliance score, that is, the number of non-compliant components. Based on the consistency judgment result, a formula indication score is generated: if the formula is consistent, the formula indication score is 0; if the formula is inconsistent, the formula indication score is 1. Calculate the sum of the complementary compliance score, the formula indication score, and 1 to obtain the original risk level value. The smaller of the number of risk levels and the original risk level value is used as the risk level label of the food to be tested.
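The judgment logic of step six is simple enough to express directly in code. The sketch below follows the rules stated above; the function name `assess_food` and all threshold values, the formula distribution vector, and the number of risk levels in the example call are hypothetical and chosen only to exercise the logic.

```python
def assess_food(content, lower, upper, distribution, formula,
                consistency_threshold, num_risk_levels):
    """Compliance check, formula consistency check, and risk level label."""
    # Compliance label per component: 1 if within [lower, upper], else 0.
    labels = [1 if lo <= c <= hi else 0 for c, lo, hi in zip(content, lower, upper)]
    compliance_score = sum(labels)
    compliant = compliance_score == len(labels)

    # Consistency deviation: L1 norm of the difference between the two distributions.
    deviation = sum(abs(p - f) for p, f in zip(distribution, formula))
    consistent = deviation <= consistency_threshold

    # Risk level: non-compliant count + formula indication score + 1, capped.
    complementary_score = sum(1 - label for label in labels)
    formula_score = 0 if consistent else 1
    raw_risk = complementary_score + formula_score + 1
    risk_level = min(num_risk_levels, raw_risk)
    return compliant, consistent, risk_level

# Hypothetical thresholds: the third component (8.7) exceeds its upper limit of 8.0.
compliant, consistent, risk_level = assess_food(
    content=[62.5, 12.3, 8.7, 5.4],
    lower=[55.0, 10.0, 5.0, 0.0],
    upper=[65.0, 15.0, 8.0, 6.0],
    distribution=[0.59, 0.27, 0.09, 0.05],
    formula=[0.60, 0.25, 0.10, 0.05],
    consistency_threshold=0.1,
    num_risk_levels=4,
)
```

Under these rules, a fully compliant and formula-consistent sample receives risk level 1, and each non-compliant component or a formula inconsistency raises the level by one until the cap is reached.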

[0036] A deep learning-based intelligent food component detection system includes: The image acquisition module is used to acquire food image sample pairs of the food to be tested; The image preprocessing module is used to preprocess food image sample pairs to generate standard image sample pairs; The feature construction module is used to construct spectral and appearance features from standard image sample pairs, generating initial spectral feature tensors and initial appearance feature tensors. The component feature modeling module is used to build a food component detection model based on the improved ConvNeXt V2 network. It inputs the initial spectral feature tensor and the initial appearance feature tensor into the food component detection model to perform joint feature modeling and prediction inference, and obtains the component prediction feature vector. The component decoding module is used to perform decoding operations on the component prediction feature vector to obtain the component content prediction vector and the component composition distribution vector. The component determination and risk assessment module is used to perform compliance and consistency determination of food components based on the component content prediction vector and component composition distribution vector, and generate the risk level label of the food to be tested. The model update module is used to update the version of the food component detection model.

[0037] Example 1: To verify the feasibility of this invention in practice, the method was applied to the online sampling inspection stage of a sausage filling and packaging production line in a meat processing company. This production line produces approximately 120,000 pre-packaged sausages daily. The key target components in the formula include six categories: protein, fat, moisture, starch, salt, and nitrite. Both regulatory and internal control requirements mandate consistency verification of the component content range and formula composition ratios. Existing practices mainly rely on laboratory physicochemical testing or empirical judgment based on visible light appearance. Physicochemical testing requires sampling, testing, and pretreatment, with results typically taking hours, making timely interception on the production line difficult.

[0038] In the implementation scenario, the hyperspectral imaging acquisition device and the visible light image acquisition device are placed above the conveyor belt after packaging. The acquisition station and time are synchronized with the production line MES system, and sample identification information is generated for each sample. The sample identification information includes the food number to be tested, batch number, acquisition time, and acquisition station identification. When the conveyor belt passes the fixed trigger line, the hyperspectral imaging acquisition device acquires a hyperspectral imaging image covering the range of 400 to 1000 nm, and the visible light image acquisition device simultaneously acquires an RGB visible light image. Then, based on the sample identification information, the hyperspectral imaging image and the RGB visible light image of the same sausage are matched to form a food image sample pair. Standard image sample pairs are generated through preprocessing, and then spectral and appearance feature construction is performed on the standard image sample pairs to generate initial spectral feature tensors and initial appearance feature tensors.

[0039] The initial spectral feature tensor is input into the hyperspectral backbone encoder to generate a hyperspectral feature tensor, and the initial appearance feature tensor is input into the RGB auxiliary encoder to generate an RGB visual feature tensor. Attention operations are performed in the spectral-appearance feature fusion module to obtain the food component fusion feature tensor, which is then output as a component prediction feature vector by the component prediction module. By decoding the component prediction feature vector, a component content prediction vector and a component composition distribution vector are obtained. The compliance of each component content prediction vector is then determined based on the enterprise's internal control threshold library, and the consistency between the component composition distribution vector and the target formula distribution vector is determined to generate a risk level identifier, which is written back to the production line MES system. When the risk level identifier reaches the preset interception condition, the sampling rejection and re-inspection process is triggered. The laboratory performs near-infrared / liquid chromatography re-inspection on the sample to generate a true component content label vector and a true component composition distribution vector, which are then fed back as re-inspection component label data for updating the food component detection model version through backpropagation.

[0040] To further verify the actual effect of the present invention, comparative experiments were conducted with the following alternative solutions: Comparative solution 1 is a food component detection method based on traditional manual features and support vector machine classifiers. This method extracts only band mean, principal component features and texture statistical features from hyperspectral images and uses a support vector machine model for component content regression and category determination; Comparative solution 2 is a method based on ConvNeXt V2 network that only processes RGB visible light images; Comparative solution 3 is a method based on ConvNeXt V2 network that only processes hyperspectral imaging images. The results of the comparative experiments are shown in Table 1.

[0041] Table 1. Performance Comparison of Different Solutions in Intelligent Food Component Detection Scenarios on Sausage Production Lines

[0042] As shown in Table 1, the present invention outperforms the comparative solutions on multiple performance indicators. Specifically, the present invention achieves an MAE of 0.34 and an RMSE of 0.49 for component content, which are significantly lower than those of comparative solution 2 (1.05 and 1.48) and comparative solution 3 (0.62 and 0.91), indicating a smaller component content prediction error. In terms of the content prediction R², the present invention reaches 0.968, significantly higher than comparative solution 2 (0.861) and comparative solution 3 (0.927), indicating a stronger ability to fit the trend of component content changes and a higher consistency between the predicted results and the actual values. In terms of the accuracy of formula consistency determination and risk level identification, the present invention reaches 97.8% and 96.9% respectively, both significantly better than comparative solution 2 (86.4% and 84.9%) and comparative solution 3 (91.7% and 90.1%), indicating higher reliability in formula consistency determination and risk level identification. In terms of false alarm rate and missed detection rate, the present invention achieves 1.9% and 2.6% respectively, which are significantly lower than the 6.8% and 8.7% of comparative solution 2 and the 4.3% and 6.0% of comparative solution 3, indicating better control of both false alarms and missed detections.
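For reference, the three content prediction metrics named above can be computed as in the following sketch, which assumes their standard definitions; the specification does not state how the metrics were calculated. The input values are arbitrary illustrative numbers and are not experimental data from Table 1.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and the coefficient of determination R^2."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mae = np.mean(np.abs(err))                        # mean absolute error
    rmse = np.sqrt(np.mean(err ** 2))                 # root mean squared error
    ss_res = np.sum(err ** 2)                         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2

# Arbitrary illustrative values, not measurements.
mae, rmse, r2 = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
```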

[0043] Furthermore, regarding single-sample detection time, the present invention achieves 86 ms, which is about twice the 42 ms of comparative solution 2 but lower than the 118 ms of comparative solution 3. It is also several orders of magnitude faster than the 720,000 ms (12 minutes) of comparative solution 1, thus meeting the real-time detection requirements of actual production scenarios. In terms of accuracy degradation after 4 weeks, the present invention exhibits a decrease of only 1.8%, significantly lower than the 7.6% decrease of comparative solution 2 and the 4.9% decrease of comparative solution 3, indicating that the present invention possesses better stability and robustness under long-term operating conditions.

[0044] The above description is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitutions or modifications made by those skilled in the art within the scope of the technology disclosed in the present invention, based on the technical solution and inventive concept of the present invention, should be covered within the scope of protection of the present invention.

Claims

1. A deep learning-based intelligent detection method for food components, characterized in that it includes the following steps: Step 1: Collect food image sample pairs of the food to be tested; Step 2: Perform image preprocessing on the food image sample pairs to generate standard image sample pairs; Step 3: Construct spectral and appearance features from the standard image sample pairs to generate initial spectral feature tensors and initial appearance feature tensors; Step 4: Construct a food component detection model based on the improved ConvNeXt V2 network, and input the initial spectral feature tensor and the initial appearance feature tensor into the food component detection model for joint feature modeling and prediction inference to obtain the component prediction feature vector, wherein the food component detection model includes a hyperspectral backbone encoder, an RGB auxiliary encoder, a spectral-appearance feature fusion module, and a component prediction module; Step 5: Perform a decoding operation on the component prediction feature vector to obtain the component content prediction vector and the component composition distribution vector of the food to be tested; Step 6: Based on the component content prediction vector and the component composition distribution vector, perform food component compliance and consistency determination, determine the risk level, and generate a risk level label for the food to be tested; Step 7: Collect the re-inspection component label data of the food to be tested, and update the food component detection model using the backpropagation method.

2. The intelligent food component detection method based on deep learning according to claim 1, characterized in that step one specifically includes: Obtain the sample identification information of the food to be tested; Hyperspectral imaging images of the food to be tested are acquired using a hyperspectral imaging acquisition device, and RGB visible light images of the food to be tested are acquired using a visible light image acquisition device. Based on the sample identification information, the hyperspectral imaging image of the food to be tested is matched with the RGB visible light image to form a food image sample pair.

3. The intelligent food component detection method based on deep learning according to claim 1, characterized in that step two specifically includes: The image preprocessing includes noise suppression, resolution resampling, intensity normalization, and spatial registration; Noise suppression is performed on the hyperspectral imaging image and the RGB visible light image using median filtering to obtain a denoised hyperspectral image and a denoised RGB image. The denoised hyperspectral image and the denoised RGB image are resampled to the target resolution using bilinear interpolation to obtain the resampled hyperspectral image and the resampled RGB image. The intensity normalization specifically involves performing minimum-maximum normalization on the intensity value of each pixel in the resampled hyperspectral image and the resampled RGB image to obtain the standard hyperspectral image and the normalized RGB image. The spatial registration specifically involves: using the standard hyperspectral image as a reference image, mapping the normalized RGB image to the spatial coordinate system of the standard hyperspectral image through geometric transformation to obtain a standard RGB image; Standard image sample pairs are constructed by combining standard hyperspectral images with standard RGB images.

4. The intelligent food component detection method based on deep learning according to claim 1, characterized in that step three specifically includes: Based on the standard hyperspectral image, the pixel intensity values of each hyperspectral band are extracted at each pixel coordinate and arranged in the order of the spectral band index to obtain the spectral band vector. The spectral band vector is compressed and mapped by a linear mapping and the GELU activation function to obtain the spectral embedding vector. The spectral embedding vectors are stacked according to their spatial positions to obtain the initial spectral feature tensor; The standard RGB image is convolved position by position in the spatial dimension using a sliding window method through 3×3 convolution to generate the initial appearance feature tensor.

5. The intelligent food component detection method based on deep learning according to claim 1, characterized in that step four specifically includes: the initial spectral feature tensor is input into the hyperspectral backbone encoder to generate the hyperspectral feature tensor; the initial appearance feature tensor is input into the RGB auxiliary encoder to generate the RGB visual feature tensor; the hyperspectral feature tensor and the RGB visual feature tensor are input into the spectral-appearance feature fusion module for an attention operation to obtain the food component fusion feature tensor, specifically: a feature mapping space is defined, and the hyperspectral feature tensor is mapped to the feature mapping space through a trainable query mapping matrix to obtain the hyperspectral query tensor; the RGB visual feature tensor is mapped to the feature mapping space through a trainable key mapping matrix and a trainable value mapping matrix, respectively, to obtain the RGB key tensor and the RGB value tensor; the hyperspectral query tensor, the RGB key tensor, and the RGB value tensor are flattened into matrix form along the spatial dimensions to obtain the hyperspectral query matrix, the RGB key matrix, and the RGB value matrix; the hyperspectral query matrix is multiplied by the transpose of the RGB key matrix, the product is divided by the square root of the dimension of the feature mapping space, and the result is normalized by the Softmax function to generate the food component attention weight matrix; based on the food component attention weight matrix, the RGB value matrix is weighted and summed to obtain the food component fusion feature matrix; the food component fusion feature matrix is then reshaped into tensor form to obtain the food component fusion feature tensor.
In the component prediction module, the food component fusion feature tensor is subjected to global average pooling to obtain the food component fusion feature vector; the food component fusion feature vector is then subjected to linear mapping and GELU activation to obtain the component prediction feature vector.
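By way of non-limiting illustration, the attention operation and the subsequent global average pooling of claim 5 can be sketched as below. This is a sketch under stated assumptions: the inputs are taken to be already flattened to N spatial positions by channels, the helper names (`matmul`, `cross_attention`, `global_average_pool`) are hypothetical, and the mapping matrices are supplied by the caller rather than trained. The final linear mapping and GELU activation are omitted.

```python
import math


def matmul(a, b):
    # Plain matrix product of two nested-list matrices.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(hsi, rgb, w_q, w_k, w_v):
    """Queries come from hyperspectral features, keys and values from RGB.

    hsi: N x C_h matrix, rgb: N x C_r matrix (flattened over space).
    w_q: C_h x d, w_k and w_v: C_r x d (hypothetical mapping matrices).
    Returns the N x N attention weight matrix and the N x d fused matrix.
    """
    q = matmul(hsi, w_q)
    k = matmul(rgb, w_k)
    v = matmul(rgb, w_v)
    d = len(w_q[0])  # dimension of the feature mapping space
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    fused = matmul(weights, v)  # weighted sum of the RGB value rows
    return weights, fused


def global_average_pool(fused):
    # Average over spatial positions to obtain one vector of length d.
    n = len(fused)
    return [sum(col) / n for col in zip(*fused)]
```

Because each row of the attention weight matrix sums to 1, every fused position is a convex combination of RGB value vectors selected by the hyperspectral query at that position.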

6. The intelligent food component detection method based on deep learning according to claim 5, characterized in that the hyperspectral backbone encoder includes a spectral band perception mapping module, an input mapping downsampling module, a multi-stage feature extraction backbone, and a stage-level spectral band weight modulation module, specifically including: in the spectral band perception mapping module, the spectral band embedding channels of the initial spectral feature tensor are mapped to the backbone channel dimension by 1×1 convolution to obtain the channel mapping feature tensor; the channel mapping feature tensor is modeled along the channel dimension by one-dimensional convolution to obtain the spectral band perception feature tensor; in the input mapping downsampling module, the spectral band perception feature tensor is downsampled and channel mapped by a 4×4 convolution with a stride of 4 to obtain the stage mapping feature tensor; the multi-stage feature extraction backbone includes four feature extraction stages, wherein the stage input feature tensor of the first feature extraction stage is the stage mapping feature tensor, and the stage input feature tensor of each subsequent feature extraction stage is the modulated output feature tensor of the stage-level spectral band weight modulation module of the previous stage; each feature extraction stage consists of several spectral band hybrid enhancement blocks, and each spectral band hybrid enhancement block includes a spatial feature branch and a spectral band feature branch; in the spatial feature branch, the stage input feature tensor is processed by 7×7 convolution and layer normalization to extract and normalize spatial features, thereby obtaining the intermediate spatial feature tensor; the intermediate spatial feature tensor is subjected to channel expansion and nonlinear mapping by 1×1 convolution and the GELU activation function to obtain the expanded spatial feature tensor.
The expanded spatial feature tensor is then subjected to channel back mapping by 1×1 convolution to generate the spatial feature tensor; in the spectral band feature branch, the stage input feature tensor is processed by spectral dimension mixing mapping through 1×1 group convolution along the channel dimension, nonlinear mapping through the GELU activation function, and channel integration mapping through 1×1 convolution to obtain the spectral band feature tensor; the spatial feature tensor and the spectral band feature tensor are concatenated along the channel dimension, and the concatenated features are fused by a trainable fusion mapping matrix to obtain the fused feature tensor; the stage input feature tensor and the fused feature tensor are then combined by a residual connection to obtain the stage output feature tensor; in the stage-level spectral band weight modulation module, the stage output feature tensor of each feature extraction stage is subjected to global average pooling to obtain the stage channel statistical vector; the stage channel statistical vector is then mapped through the Sigmoid function to generate the stage spectral band modulation vector; based on the stage spectral band modulation vector, channel-by-channel modulation is performed on the stage output feature tensor to obtain the modulated output feature tensor; the modulated output feature tensor generated by the last stage-level spectral band weight modulation module is used as the hyperspectral feature tensor.
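By way of non-limiting illustration, the stage-level spectral band weight modulation of claim 6 can be sketched as below. This is a sketch under stated assumptions: the layout `feature[c][h][w]` and the function name `modulate_stage_output` are hypothetical, and the Sigmoid function is applied directly to the pooled statistics because the claim does not specify any intermediate mapping; a trained mapping before the Sigmoid may exist in practice.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def modulate_stage_output(feature):
    """Channel-by-channel modulation of a stage output feature tensor.

    feature[c][h][w] is a nested list with C channels.
    Returns (stage channel statistical vector,
             stage spectral band modulation vector,
             modulated output feature tensor).
    """
    # Global average pooling: one statistic per channel.
    stats = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature
    ]
    # Squash each statistic into (0, 1) to act as a per-channel gate.
    gates = [sigmoid(s) for s in stats]
    # Scale every value of a channel by that channel's gate.
    modulated = [
        [[gate * value for value in row] for row in channel]
        for channel, gate in zip(feature, gates)
    ]
    return stats, gates, modulated
```

The gate values lie strictly between 0 and 1, so the modulation can attenuate a channel but never amplify it or flip its sign.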

7. The intelligent food component detection method based on deep learning according to claim 5, characterized in that the RGB auxiliary encoder includes a texture enhancement mapping module, an input mapping downsampling module, a four-stage feature extraction module, and a spatial weight guidance module, specifically including: in the texture enhancement mapping module, the initial appearance feature tensor is subjected to two 3×3 convolution operations to generate the texture enhancement feature tensor; in the input mapping downsampling module, the texture enhancement feature tensor is downsampled and channel mapped by a 4×4 convolution with a stride of 4 to obtain the RGB mapping feature tensor; the four-stage feature extraction module includes four RGB feature extraction stages, wherein the RGB input feature tensor of the first RGB feature extraction stage is the RGB mapping feature tensor, and the RGB input feature tensor of each of the second to fourth RGB feature extraction stages is the RGB output feature tensor of the previous RGB feature extraction stage; the RGB output feature tensor of the fourth RGB feature extraction stage is input into the spatial weight guidance module, and a spatial weight logit map is generated through 1×1 convolution; the spatial weight logit map is then normalized by Softmax to obtain the spatial weight map; based on the spatial weight map, position-wise weighted modulation is performed on the RGB output feature tensor to generate the RGB visual feature tensor.
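By way of non-limiting illustration, the spatial weight guidance of claim 7 can be sketched as below. This is a sketch under stated assumptions: the 1×1 convolution is taken to have a single output channel, the Softmax is taken over all spatial positions, and the names `spatial_weight_guidance`, `conv_weight`, and `conv_bias` are hypothetical.

```python
import math


def spatial_weight_guidance(feature, conv_weight, conv_bias=0.0):
    """Position-wise weighting of an RGB output feature tensor.

    feature[c][h][w] is a nested list with C channels.
    conv_weight[c] holds the weights of a 1x1 convolution with one
    output channel (an assumption; the claim does not fix this).
    Returns (spatial weights, modulated feature tensor).
    """
    height, width = len(feature[0]), len(feature[0][0])
    # A 1x1 convolution is a per-position weighted sum over channels.
    logits = [
        [
            conv_bias + sum(w * ch[i][j] for w, ch in zip(conv_weight, feature))
            for j in range(width)
        ]
        for i in range(height)
    ]
    # Softmax over all H*W positions (maximum subtracted for stability).
    peak = max(max(row) for row in logits)
    exps = [[math.exp(v - peak) for v in row] for row in logits]
    total = sum(sum(row) for row in exps)
    weights = [[e / total for e in row] for row in exps]
    # Scale every channel at position (i, j) by the weight at (i, j).
    modulated = [
        [[weights[i][j] * ch[i][j] for j in range(width)] for i in range(height)]
        for ch in feature
    ]
    return weights, modulated
```

Under this reading the spatial weights sum to 1 over the whole image, so positions compete with each other for emphasis rather than being gated independently.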

8. The intelligent food component detection method based on deep learning according to claim 1, characterized in that step five specifically includes: the number of target component types in the food to be tested is obtained; based on the number of target component types, a component content decoding parameter matrix and a component content bias vector are constructed by a parameter initialization method; based on the component content decoding parameter matrix and the component content bias vector, a linear mapping is performed on the component prediction feature vector to obtain the component content prediction vector; a component distribution decoding parameter matrix and a component distribution bias vector are constructed by a parameter initialization method; based on the component distribution decoding parameter matrix and the component distribution bias vector, a linear mapping is performed on the component prediction feature vector to generate the component distribution logit vector; the component composition distribution vector is obtained by performing Softmax normalization on the component distribution logit vector.
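By way of non-limiting illustration, the two decoding heads of claim 8 can be sketched as below. This is a sketch under stated assumptions: the function names `linear`, `softmax`, and `decode_components` are hypothetical, and the parameter matrices and bias vectors are passed in explicitly rather than produced by any particular initialization or training procedure.

```python
import math


def linear(matrix, bias, vector):
    # y = W x + b, with W given as a list of rows.
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(matrix, bias)]


def softmax(values):
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def decode_components(feature, w_content, b_content, w_dist, b_dist):
    """Decode a component prediction feature vector with two linear heads.

    w_content and w_dist are K x F matrices, b_content and b_dist have
    length K, where K is the number of target component types.
    Returns (component content prediction vector,
             component composition distribution vector).
    """
    content = linear(w_content, b_content, feature)
    logits = linear(w_dist, b_dist, feature)
    distribution = softmax(logits)
    return content, distribution
```

The content head is left unnormalized so that each entry can be compared against absolute content thresholds, while the distribution head is normalized so that its entries sum to 1 and can be compared against a reference formula.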

9. The intelligent food component detection method based on deep learning according to claim 1, characterized in that step six specifically includes: a lower limit threshold and an upper limit threshold of content are set for each target component, and a compliance check is performed on each dimension of the component content prediction vector: if the current dimension of the component content prediction vector is greater than or equal to the lower limit threshold of the corresponding target component content and less than or equal to the upper limit threshold of the corresponding target component content, a compliance label of 1 is generated; otherwise, a compliance label of 0 is generated; all compliance labels are summed and recorded as the compliance score; if the compliance score equals the number of target component types, the compliance judgment result of the food to be tested is qualified; otherwise, the compliance judgment result is abnormal; a formula distribution vector and a formula consistency judgment threshold are set, and a consistency judgment is performed on the component composition distribution vector: the L1 norm of the difference between the component composition distribution vector and the formula distribution vector is calculated to obtain the consistency deviation value; if the consistency deviation value is less than or equal to the formula consistency judgment threshold, the consistency judgment result of the food to be tested is that the formula is consistent; otherwise, the consistency judgment result is that the formula is inconsistent.
Based on the compliance judgment result and the consistency judgment result, a risk level label is generated for the food to be tested: the number of risk levels is set, and the difference between 1 and each compliance label is summed over all target components to obtain the complementary compliance score; based on the consistency judgment result, a formula indication score is generated: if the formula is consistent, the formula indication score is 0; if the formula is inconsistent, the formula indication score is 1; the sum of the complementary compliance score, the formula indication score, and 1 is calculated to obtain the original risk level value; the minimum of the number of risk levels and the original risk level value is used as the risk level label of the food to be tested.
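By way of non-limiting illustration, the compliance check, consistency judgment, and risk level computation of claim 9 can be sketched as below. This is a sketch under stated assumptions: the function name `assess_food`, the dictionary keys, and all numeric values in the usage example are hypothetical and are not taken from the application.

```python
def assess_food(content, lower, upper, distribution, formula, threshold, num_levels):
    """Compliance, formula consistency, and risk level for one food sample.

    content, lower, upper: per-component predicted content and thresholds.
    distribution, formula: predicted and reference composition distributions.
    threshold: formula consistency judgment threshold.
    num_levels: number of risk levels.
    """
    # Compliance label is 1 when the content lies within [lower, upper].
    labels = [1 if lo <= c <= hi else 0 for c, lo, hi in zip(content, lower, upper)]
    compliance_score = sum(labels)
    qualified = compliance_score == len(labels)

    # Consistency deviation is the L1 norm of the distribution difference.
    deviation = sum(abs(p - q) for p, q in zip(distribution, formula))
    consistent = deviation <= threshold

    # Count of non-compliant components, plus 1 if the formula is inconsistent.
    complementary_score = sum(1 - label for label in labels)
    formula_score = 0 if consistent else 1
    raw_level = complementary_score + formula_score + 1

    return {
        "labels": labels,
        "qualified": qualified,
        "deviation": deviation,
        "consistent": consistent,
        "risk_level": min(num_levels, raw_level),  # capped at the top level
    }


# Usage with made-up numbers: one component out of range, formula inconsistent.
result = assess_food(
    content=[10.0, 5.0, 30.0],
    lower=[8.0, 6.0, 20.0],
    upper=[12.0, 9.0, 40.0],
    distribution=[0.5, 0.3, 0.2],
    formula=[0.4, 0.4, 0.2],
    threshold=0.1,
    num_levels=4,
)
```

A fully compliant, formula-consistent sample receives risk level 1; each non-compliant component and an inconsistent formula each raise the level by 1 until the configured number of risk levels is reached.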

10. A deep learning-based intelligent food component detection system, executing the deep learning-based intelligent food component detection method according to any one of claims 1 to 9, characterized in that the system comprises: an image acquisition module, used to acquire food image sample pairs of the food to be tested; an image preprocessing module, used to preprocess the food image sample pairs to generate standard image sample pairs; a feature construction module, used to construct spectral and appearance features from the standard image sample pairs to generate the initial spectral feature tensor and the initial appearance feature tensor; a component feature modeling module, used to build a food component detection model based on the improved ConvNeXt V2 network, and to input the initial spectral feature tensor and the initial appearance feature tensor into the food component detection model for joint feature modeling and prediction inference to obtain the component prediction feature vector; a component decoding module, used to perform decoding operations on the component prediction feature vector to obtain the component content prediction vector and the component composition distribution vector; a component determination and risk assessment module, used to perform compliance and consistency determination of food components based on the component content prediction vector and the component composition distribution vector, and to generate the risk level label of the food to be tested; and a model update module, used to update the version of the food component detection model.