Single-camera near-infrared-based living body detection method and system, and readable storage medium
By employing a monocular near-infrared-based liveness detection method, and utilizing multi-scale behavior fusion analysis and an encoder-decoder structure, the accuracy and cost issues of existing liveness detection methods under highly realistic forgery techniques are resolved, achieving efficient and accurate liveness detection.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHANGHAI FULLHAN MICROELECTRONICS
- Filing Date
- 2026-03-25
- Publication Date
- 2026-06-19
AI Technical Summary
Existing liveness detection methods are easily compromised by highly sophisticated forgery techniques, and they rely on expensive hardware or require user cooperation, making it difficult to balance user experience and deployment costs.
A liveness detection method based on monocular near-infrared imaging is adopted. Near-infrared image sequences are acquired and preprocessed, and multi-scale behavior fusion analysis is performed. An encoder-decoder structure then decomposes the image sequence into multiple latent space vectors, from which it is determined whether the target object is live.
The method improves detection accuracy and generalization at low cost, responds effectively to unknown attack samples, requires no user cooperation, and improves operational efficiency.
Description
Technical Field
[0001] This invention relates to the field of image recognition technology, and in particular to a method, system, and readable storage medium for liveness detection based on monocular near-infrared imaging.
Background Technology
[0002] While current liveness detection technology has achieved some success in preventing common attacks such as photo and video playback attacks, it still faces many challenges in practical applications. When faced with highly realistic silicone masks, 3D-printed models, or deepfake videos, many systems are easily compromised because these forgeries are becoming increasingly similar to the real human body in texture, three-dimensional structure, and even physiological signals.
[0003] In addition, many methods rely on expensive, dedicated hardware, such as depth cameras or time-of-flight (ToF) sensors, to obtain depth information. When only an ordinary RGB camera is available, their performance drops significantly, and the misjudgment rate rises markedly in scenarios with large lighting changes, non-standard poses, or subjects of different ethnicities.
[0004] At the same time, existing recognition methods often require users to perform specific actions, such as blinking or shaking their heads, which reduces ease of use. Furthermore, collecting physiological information such as heart rate and micro-expressions through remote photoplethysmography (rPPG) has raised user concerns about privacy leaks.
[0005] Ultimately, the root of the problem lies in the limited information available from a single modality, the model's overfitting to the training environment, and the difficulty of balancing user experience against deployment cost.
Summary of the Invention
[0006] The purpose of this invention is to provide a method, system, and readable storage medium for liveness detection based on monocular near-infrared, in order to solve the problem of difficulty in balancing user experience and deployment cost in existing liveness detection methods.
[0007] To address the aforementioned technical problems, this invention provides a liveness detection method based on monocular near-infrared imaging, comprising:
[0008] Acquire a near-infrared image sequence of the target object, and preprocess the near-infrared image sequence;
[0009] Multi-scale behavior fusion analysis is performed on the preprocessed data to determine whether the target object conforms to a natural behavior pattern. If so, the near-infrared image sequence of the target object is determined to pass the behavior analysis.
[0010] An encoder-decoder structure is applied to the near-infrared image sequence that passed the behavior analysis to decompose the near-infrared image sequence into multiple latent space vectors;
[0011] Based on the multiple latent space vectors, the result of whether the target object is a living being is output.
[0012] Optionally, the preprocessing steps include:
[0013] The near-infrared image sequence is cropped, and the cropped image sequence is stitched together along the channel direction in video-stream order;
[0014] The stitched image sequence is then normalized.
[0015] Optionally, the monocular near-infrared-based liveness detection method is used to detect faces; the step of cropping the near-infrared image sequence includes:
[0016] Each frame of the near-infrared image sequence is input into a face detection model, which outputs a confidence and location information. When the confidence is greater than a set threshold, the frame image is center-cropped and affine-aligned based on the bounding box represented by the location information, and a feature region of standardized size is output.
[0017] Optionally, the normalization process includes:
[0018] The stitched image sequence is stacked into a uniform tensor along the channel dimension;
[0019] Each pixel is normalized based on the global maximum and global minimum values among all pixel values of the tensor.
[0020] Optionally, the multi-scale behavior fusion analysis uses a pyramid-shaped image behavior analysis network to analyze the preprocessed data; wherein, the bottom layer of the image behavior analysis network processes high-frequency frames, and the top layer processes low-frequency frames.
[0021] Optionally, the steps for analyzing the preprocessed data using a pyramid-shaped image behavior analysis network include:
[0022] The preprocessed data is concatenated into a tensor;
[0023] Introducing a temporal attention mechanism to enhance key action frames;
[0024] Behavior classification is performed on multiple frames of images by using global pooling and fully connected layers to output behavior category probabilities.
[0025] Optionally, the step of applying an encoder-decoder structure to the near-infrared image sequence that passed the behavior analysis, so as to decompose it into multiple latent space vectors, includes:
[0026] A visual transformer is used to extract single-frame spatial features from the near-infrared image sequence to obtain a frame feature sequence.
[0027] Temporal position encoding is added to the frame feature sequence and temporal feature fusion is performed. The result is then input into the encoder, which outputs a global temporal latent vector through a self-attention mechanism.
[0028] The global temporal latent vector is input into the decoder and decomposed to obtain multiple latent space vectors representing the physiological attributes of living organisms.
[0029] Optionally, the step of using a visual transformer to extract single-frame spatial features from the near-infrared image sequence to obtain a frame feature sequence includes:
[0030] Each frame of the near-infrared image sequence is mapped into multiple image blocks;
[0031] The image patches are linearly projected and positionally encoded to obtain a token sequence;
[0032] Multiple tokens are selected from the token sequence as the frame features for that frame;
[0033] The frame features of each frame in the near-infrared image sequence are sorted to obtain the frame feature sequence.
[0034] To address the aforementioned technical problems, the present invention also provides a readable storage medium storing a program thereon, which, when executed, implements the steps of the monocular near-infrared-based liveness detection method as described above.
[0035] To address the aforementioned technical problems, the present invention also provides a liveness detection system based on monocular near-infrared, comprising: an object detection and preprocessing module, a multi-scale fusion behavior analysis module, and an encoder-decoder analysis module;
[0036] The object detection and preprocessing module is used to acquire near-infrared image sequences of the target object and to preprocess the near-infrared image sequences.
[0037] The multi-scale fusion behavior analysis module is used to perform multi-scale behavior fusion analysis on the preprocessed data to determine whether the target object conforms to the natural behavior pattern. If so, the near-infrared image sequence of the target object is determined to pass the behavior analysis.
[0038] The encoder-decoder analysis module is used to apply an encoder-decoder structure to the near-infrared image sequence that passed the behavior analysis, decomposing the near-infrared image sequence into multiple latent space vectors, and then to output, based on the multiple latent space vectors, the result of whether the target object is live.
[0039] In summary, in the monocular near-infrared-based liveness detection method, system, and readable storage medium provided by this invention, the monocular near-infrared-based liveness detection method includes: acquiring a near-infrared image sequence of a target object and preprocessing the near-infrared image sequence; performing multi-scale behavior fusion analysis on the preprocessed data to determine whether the target object conforms to a natural behavior pattern; if so, determining that the near-infrared image sequence of the target object passes the behavior analysis; using an encoder-decoder structure on the near-infrared image sequence that passes the behavior analysis to decompose the near-infrared image sequence into multiple latent space vectors; and outputting the result of whether the target object is a live object based on the multiple latent space vectors.
[0040] With this configuration, multi-scale behavior fusion analysis determines whether a target object conforms to a natural behavior pattern. Furthermore, applying an encoder-decoder structure to the near-infrared image sequence that passed the behavior analysis decomposes that sequence into multiple latent space vectors representing the physiological attributes of a living organism, so that the result of whether the target object is live can be output accurately. This method addresses the low accuracy of low-cost products, copes better with unknown attack samples, is broadly applicable, and does not require the user to perform specified actions, effectively improving operational efficiency, detection accuracy, and generalization.
Attached Figure Description
[0041] Those skilled in the art will understand that the accompanying drawings are provided to better understand the invention and do not constitute any limitation on the scope of the invention.
[0042] Figure 1 is a schematic flowchart of a liveness detection method based on monocular near-infrared imaging according to an embodiment of the present invention.
Detailed Implementation
[0043] To make the objectives, advantages, and features of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and specific embodiments. It should be noted that the drawings are all in a very simplified form and are not drawn to scale, and are only used to facilitate and clarify the explanation of the embodiments of this invention. Furthermore, the structures shown in the drawings are often part of the actual structures. In particular, different figures may emphasize different aspects and may sometimes use different scales.
[0044] As used in this invention, the singular forms “a,” “an,” “one,” and “the” include plural objects; the term “or” is generally used to mean “and / or”; the term “a number” is generally used to mean “at least one”; and the term “at least two” is generally used to mean “two or more”. Furthermore, the terms “first,” “second,” and “third” are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of indicated technical features. Thus, a feature defined as “first,” “second,” or “third” may explicitly or implicitly include one or at least two of that feature; “one end” and “the other end,” and “proximal end” and “distal end” generally refer to two corresponding parts, which include not only endpoints. Furthermore, the terms "installed," "connected," and "attached," as used in this invention, and the term "set" on one element from another, should be interpreted broadly. They generally only indicate a connection, coupling, cooperation, or transmission relationship between the two elements, which can be direct or indirect through an intermediate element. They should not be construed as indicating or implying a spatial relationship between the two elements, meaning one element can be located inside, outside, above, below, or to one side of another element, unless otherwise explicitly stated. Those skilled in the art can understand the specific meaning of these terms in this invention based on the specific circumstances. Additionally, directional terms such as above, below, up, down, upward, downward, left, and right are used relative to exemplary embodiments as shown in the figures, with upward or upper directions pointing towards the top of the corresponding figure, and downward or lower directions pointing towards the bottom of the corresponding figure.
[0045] The purpose of this invention is to provide a method, system, and readable storage medium for liveness detection based on monocular near-infrared imaging, in order to solve the problem of balancing user experience and deployment cost in existing liveness detection methods. The following description refers to the accompanying drawings.
[0046] Referring to Figure 1, this invention provides a liveness detection method based on monocular near-infrared imaging. In the fields of computer vision and liveness detection, "monocular" typically refers to image acquisition using a single camera (single lens), as opposed to binocular (stereo vision) or multi-view (multiple cameras working together) methods that require multiple sensors. Because the liveness detection method of this embodiment needs only one camera, the hardware is simpler, which reduces cost and simplifies deployment. Near-infrared imaging means that the camera's photosensitive element responds to electromagnetic waves with wavelengths from 700 nm to 1400 nm and typically outputs a single-channel grayscale image sequence. It is sensitive to subtle changes in subcutaneous blood flow and thermal radiation distribution, and can achieve stable all-weather acquisition under active illumination.
[0047] The liveness detection method based on monocular near-infrared includes:
[0048] Step S1: Obtain the near-infrared image sequence of the target object and preprocess the near-infrared image sequence;
[0049] Step S2: Perform multi-scale behavior fusion analysis on the preprocessed data to determine whether the target object conforms to the natural behavior pattern. If so, determine that the near-infrared image sequence of the target object passes the behavior analysis.
[0050] Step S3: Apply an encoder-decoder structure to the near-infrared image sequence that passed the behavior analysis to decompose the near-infrared image sequence into multiple latent space vectors;
[0051] Step S4: Based on the multiple latent space vectors, output the result of whether the target object is a living being.
[0052] Step S1 is the data acquisition and preprocessing step. The target object here can be a living object, such as a human face, or other living objects with detection features, such as animals. It should be noted that the target object may also be a non-living object, such as a highly realistic 3D-printed human face model. Whether the target object is truly alive is the objective of this embodiment. Near-infrared image sequences can be continuously acquired, for example, using a monocular near-infrared camera, and consist of multiple frames ordered chronologically. The acquisition process does not require the active cooperation of the target object; it is a passive acquisition method. After acquiring the near-infrared image sequence, it can be preprocessed to provide standardized, high-quality temporal input for subsequent analysis.
[0053] Step S2 is a multi-scale behavior fusion analysis step, the purpose of which is to dynamically screen effective samples through natural behavior analysis, filter static attacks (such as simulation models, photos or masks) and abnormal behaviors (fixed postures), and reduce the subsequent computational load.
[0054] Step S3 is the encoder-decoder latent space vector decomposition step. It further analyzes the physiological features of the image sequences that passed the behavior analysis, decoupling the fused spatiotemporal information into multi-dimensional independent evidence in order to defeat highly realistic attacks (such as silicone masks).
[0055] Step S4 is a fusion and determination step of multiple latent space vectors, which integrates multi-dimensional physiological evidence to achieve high-confidence live decision-making with high robustness and anti-attack capability.
[0056] The liveness detection method in this embodiment, based on the configuration of steps S1 to S4, uses multi-scale behavior fusion analysis to determine whether the target object conforms to a natural behavior pattern. Then, using an encoder-decoder structure, the near-infrared image sequence that passed the behavior analysis is decomposed into multiple latent space vectors representing the physiological attributes of a living organism, so that the result of whether the target object is live can be output accurately. This method addresses the low accuracy of low-cost products, copes better with unknown attack samples, is broadly applicable, and does not require the user to perform specified actions, effectively improving operational efficiency, detection accuracy, and generalization. The liveness detection method of this embodiment is further explained below with several examples.
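The control flow of steps S1 to S4 can be illustrated with a minimal sketch. The function name `detect_liveness` and its four callable parameters are hypothetical placeholders introduced here for illustration; the patent does not prescribe an interface. The sketch only shows that the behavior analysis of step S2 acts as a gate, so that sequences failing it never reach the encoder-decoder stage.

```python
def detect_liveness(frames, preprocess, passes_behavior, decompose, decide):
    """Illustrative S1-S4 control flow; all four callables are hypothetical stand-ins."""
    clip = preprocess(frames)        # S1: crop, stitch, normalize
    if not passes_behavior(clip):    # S2: behavior gate rejects static or abnormal input
        return False
    latents = decompose(clip)        # S3: encoder-decoder latent decomposition
    return decide(latents)           # S4: fuse latent vectors into a live / not-live result
```

Rejecting a sequence at step S2 skips steps S3 and S4 entirely, which is how the early gate reduces the subsequent computational load.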
[0057] Optionally, the preprocessing steps in step S1 include:
[0058] Step S11: Crop the near-infrared image sequence, and then stitch the cropped image sequence together according to the channel direction in the order of the video stream;
[0059] Step S12: Normalize the stitched image sequence.
[0060] In one embodiment, the monocular near-infrared-based liveness detection method is used to detect faces; the step S11 of cropping the near-infrared image sequence includes:
[0061] Step S111: Input each frame of the near-infrared image sequence into the face detection model, and output confidence and location information; when the confidence is greater than a set threshold, perform center cropping and affine alignment on the frame image based on the bounding box represented by the location information, and output a feature region of standardized size. The face detection model used in step S111 can be a commonly used face detection model in the field, such as YOLOv5-face or RetinaFace-Mobile.
[0062] Taking one frame of a near-infrared image sequence as an example, the frame is input into a face detection model. The model outputs a confidence score that the content of the frame is a face, along with the face's location information. If the confidence score is less than the set threshold, the content of that frame is treated as not being a face and the frame is excluded. Only when the confidence score is greater than the set threshold is the frame center-cropped and affine-aligned based on the face's location information. The output is a feature region (i.e., the face region) of standardized size (e.g., 112 × 112 pixels). When the liveness detection method is used to detect other live animals, the face detection model can be replaced with the corresponding animal detection model.
[0063] The above exemplifies the cropping steps for a single frame in a near-infrared image sequence. In step S11, the above cropping steps can be repeatedly performed to crop multiple frames of the near-infrared image sequence sequentially. The cropped image sequence is then stitched together according to the video stream order and channel direction to ensure consistency and numerical stability of subsequent inputs.
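As a minimal sketch of step S11, the following NumPy code filters frames by detection confidence, crops a square region centered on the bounding box, resizes it to a standardized size, and stacks the crops in video order. It makes several assumptions not stated in the source: detections are given as `(confidence, (x0, y0, x1, y1))` pairs produced by some external detector, resizing uses nearest-neighbor sampling, and affine alignment is omitted. The names `crop_and_resize` and `build_clip` are hypothetical.

```python
import numpy as np

def crop_and_resize(frame, box, size=112):
    """Square crop centered on the box, nearest-neighbor resize to size x size."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    half = max(x1 - x0, y1 - y0) / 2.0
    # Sample positions of the output grid, clipped to the frame borders.
    ys = np.clip(np.round(np.linspace(cy - half, cy + half, size)).astype(int),
                 0, frame.shape[0] - 1)
    xs = np.clip(np.round(np.linspace(cx - half, cx + half, size)).astype(int),
                 0, frame.shape[1] - 1)
    return frame[np.ix_(ys, xs)]

def build_clip(frames, detections, threshold=0.5, size=112):
    """Keep frames whose confidence exceeds the threshold; stack crops in video order."""
    crops = [crop_and_resize(f, box, size)
             for f, (conf, box) in zip(frames, detections) if conf > threshold]
    return np.stack(crops, axis=0)  # shape (T_kept, size, size); axis 0 is the channel axis
```

Because near-infrared frames are single-channel, stacking along the first axis gives one channel per retained frame. Note that `np.stack` raises an error if no frame passes the threshold, so a caller would need to handle that case.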
[0064] Optionally, the normalization process in step S12 includes:
[0065] Step S121: The stitched image sequences are stacked into a uniform tensor along the channel dimension;
[0066] Step S122: Normalize each pixel based on the global maximum and global minimum values among all pixel values of the tensor.
[0067] To facilitate subsequent multi-scale behavior fusion analysis, the cropped and stitched image sequence needs to be normalized after processing. Steps S121 and S122 illustrate an example of this normalization process. First, step S121 stacks the stitched image sequence into a unified tensor. Then, in step S122, normalization is calculated based on the global maximum and global minimum values of this tensor. This maintains inter-frame consistency, eliminates the influence of inter-frame illumination variations, and ensures consistency of temporal features.
[0068] In one example, the normalization of step S122 can be calculated using the following formula:
[0069] $$\hat{x}_{c,i,j} = \frac{x_{c,i,j} - x_{\min}}{x_{\max} - x_{\min}}$$
[0070] where $\hat{x}_{c,i,j}$ is the normalized pixel value (channel c, row i, column j), $x_{c,i,j}$ is the original pixel value, $x_{\min}$ is the minimum pixel value in the entire tensor, and $x_{\max}$ is the maximum pixel value in the entire tensor.
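The global min-max normalization of step S122 can be sketched in a few lines of NumPy. The function name `global_min_max_normalize` and the small constant `eps` are assumptions added here; `eps` only guards against division by zero when every pixel in the tensor has the same value and is not part of the described method.

```python
import numpy as np

def global_min_max_normalize(clip, eps=1e-8):
    """Scale all pixels with one minimum and one maximum taken over the whole tensor."""
    clip = clip.astype(np.float32)
    lo, hi = clip.min(), clip.max()        # global extrema across channels, rows, columns
    return (clip - lo) / (hi - lo + eps)   # eps is an assumed safeguard for constant input
```

Because a single minimum and maximum are shared by all frames, relative brightness differences between frames are preserved, unlike per-frame normalization, which would rescale each frame independently.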
[0071] Optionally, in step S2, the multi-scale behavior fusion analysis employs a pyramid-shaped image behavior analysis network to analyze the preprocessed data. The bottom layer of this network processes high-frequency frames, while the higher layers process low-frequency frames. It should be noted that high-frequency and low-frequency frames refer to the temporal frequency of behavioral features, not the video frame rate. High-frequency frames represent frame sequences of rapid, minute movements (such as blinking), while low-frequency frames represent frame sequences of slow, overall behaviors (such as breathing). The pyramid-shaped image behavior analysis network enables the extraction of multi-scale behavioral features.
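One way to picture the multi-scale idea is a temporal pyramid in which each level averages a longer window of consecutive frames. This is only an illustrative assumption: the source does not specify how the pyramid levels are built, and the names `temporal_pyramid` and `motion_energy` are hypothetical. Level 0 keeps every frame and therefore retains fast, minute movements, while higher levels smooth them away and leave slow, overall changes.

```python
import numpy as np

def temporal_pyramid(clip, levels=3):
    """Return a list of clips; level k averages 2**k consecutive frames of a (T, H, W) clip."""
    pyramid = []
    for k in range(levels):
        w = 2 ** k
        t = (clip.shape[0] // w) * w  # drop trailing frames that do not fill a window
        pooled = clip[:t].reshape(t // w, w, *clip.shape[1:]).mean(axis=1)
        pyramid.append(pooled)
    return pyramid

def motion_energy(level):
    """Mean absolute inter-frame difference per time step at one pyramid level."""
    return np.abs(np.diff(level, axis=0)).mean(axis=(1, 2))
```

For a 16-frame clip, the three levels contain 16, 8, and 4 time steps, and `motion_energy` on each level gives a crude per-scale activity signal that a learned network would replace with trained features.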
[0072] Optionally, the steps for analyzing the preprocessed data using a pyramid-shaped image behavior analysis network include:
[0073] Step S21: Concatenate the preprocessed data into a tensor;
[0074] Step S22: Introduce a temporal attention mechanism to enhance key action frames;
[0075] Step S23: Perform behavior classification on multiple frames of images by outputting behavior category probabilities through global pooling and fully connected layers.
[0076] First, step S21 takes the preprocessed data from step S1 as input, such as the cropped and normalized image sequence. Step S21 concatenates this input data into a tensor, which is then input into the pyramid-shaped image behavior analysis network.
[0077] Step S22 then introduces a temporal attention mechanism to dynamically identify and enhance frames strongly correlated with live physiological behaviors (such as blinking and breathing fluctuations), while suppressing invalid frames (static or occluded frames), thus addressing interference from abnormal behaviors such as fixed-pose attacks and lack of natural facial expressions. The micro-motion features output from the bottom branches (high-frequency frames) of the pyramid-shaped image behavior analysis network and the overall behavioral features output from the top branches (low-frequency frames) jointly participate in attention calculation.
[0078] Step S23 outputs the behavior category probability through global pooling and a fully connected layer. Global pooling, such as Global Average Pooling (GAP), compresses the high-dimensional spatiotemporal features weighted by temporal attention in step S22 into a discriminative vector, generating quantifiable confidence scores regarding natural behavior. This allows for behavior classification of multiple frames in an image sequence, filtering out abnormal behaviors such as "fixed pose" or "lack of natural expression," and capturing natural behavior patterns such as "slight nodding," "blinking," and "breathing fluctuations." For target objects matching natural behavior patterns, their near-infrared image sequences are deemed to pass behavior analysis; otherwise, they are deemed to fail.
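Steps S22 and S23 can be sketched as follows, assuming per-frame feature vectors have already been extracted by the analysis network. The attention scoring by a single learned vector `w_att`, the parameter names, and the function `classify_behavior` are assumptions made for illustration; the source does not specify the form of the temporal attention.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def classify_behavior(features, w_att, w_fc, b_fc):
    """features: (T, D) per-frame features. Returns (attention (T,), class probabilities (C,))."""
    scores = features @ w_att             # one relevance score per frame
    alpha = softmax(scores)               # temporal attention weights, summing to 1
    enhanced = features * alpha[:, None]  # emphasize key frames, suppress the rest
    pooled = enhanced.mean(axis=0)        # global average pooling over time
    probs = softmax(pooled @ w_fc + b_fc) # fully connected layer plus softmax
    return alpha, probs
```

With two output classes, for example "natural behavior" and "abnormal behavior", the sequence would be deemed to pass the behavior analysis when the probability of the natural class exceeds a chosen threshold.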
[0079] After completing the multi-scale behavior fusion analysis, proceed to step S3. In one example, step S3 includes:
[0080] Step S31: Apply a visual transformer to the near-infrared image sequence to extract single-frame spatial features and obtain a frame feature sequence;
[0081] Step S32: Add time position encoding to the frame feature sequence, perform temporal feature fusion, and then input it into the encoder. Learn through the self-attention mechanism to output a global temporal latent vector.
[0082] Step S33: Input the global temporal latent vector into the decoder to decompose it into multiple latent space vectors representing the physiological attributes of living organisms.
[0083] The Vision Transformer (ViT) is a deep learning model for image processing based on the Transformer architecture. In step S31, for the high-quality near-infrared image sequence that has already passed behavioral analysis, physiological latent features can be further extracted using the Vision Transformer.
[0084] Optionally, step S31 includes:
[0085] Step S311: Map each frame of the near-infrared image sequence into multiple image patches; in one example, each frame is mapped into multiple 16*16 pixel image patches.
[0086] Step S312: Apply linear projection and positional encoding to the image patches to obtain a token sequence;
[0087] Step S313: Select multiple (e.g., 10) tokens from the token sequence as the frame features of the frame;
[0088] Step S314: Sort the frame features of each frame in the near-infrared image sequence to obtain the frame feature sequence.
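Steps S311 to S313 can be sketched for a single frame as below. The Transformer blocks of a full Vision Transformer are omitted, the projection matrix `w_proj` and positional embedding `pos_emb` stand in for learned parameters, and selecting the tokens with the largest L2 norm is purely an assumption, since the source does not state the selection criterion. The names `patchify` and `frame_tokens` are hypothetical.

```python
import numpy as np

def patchify(frame, patch=16):
    """Split an (H, W) frame into non-overlapping patch x patch blocks, one flattened row each."""
    h, w = frame.shape
    gh, gw = h // patch, w // patch
    x = frame[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch)
    return x.transpose(0, 2, 1, 3).reshape(gh * gw, patch * patch)

def frame_tokens(frame, w_proj, pos_emb, k=10):
    """Project patches to tokens, add positional encoding, keep k tokens as frame features."""
    patches = patchify(frame)                  # (N, 256) for 16 x 16 patches
    tokens = patches @ w_proj + pos_emb        # (N, D) linear projection plus position
    norms = np.linalg.norm(tokens, axis=1)
    idx = np.sort(np.argsort(-norms)[:k])      # assumed criterion: largest-norm tokens
    return tokens[idx]                         # (k, D), kept in original patch order
```

For a 112 × 112 face crop, 16 × 16 patches give 7 × 7 = 49 tokens per frame. Applying `frame_tokens` to each frame in turn and stacking the results in frame order yields the frame feature sequence of step S314.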
[0089] The encoder in step S32, such as a 4-layer standard Transformer Encoder, is used for cross-frame temporal modeling. After adding temporal position encoding to the frame feature sequence obtained in step S31, it can be input into the 4-layer standard Transformer Encoder. A self-attention mechanism is used to automatically learn which time points are more important for liveness detection (e.g., frames corresponding to the heartbeat cycle). A global temporal latent vector is then output, preferably the sequence average. This global temporal latent vector integrates spatial semantics and temporal dynamics, representing the "physiological fingerprint" of the near-infrared image sequence.
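The temporal modeling of step S32 can be illustrated with a heavily simplified sketch: one single-head self-attention layer instead of the 4-layer standard Transformer Encoder mentioned above, with no residual connections, layer normalization, or feed-forward sublayers. Sinusoidal position encoding is an assumption, since the source does not name the encoding scheme, and each frame is assumed to be summarized by one D-dimensional feature vector. All function and parameter names are hypothetical.

```python
import numpy as np

def sinusoidal_positions(t, d):
    """Standard sinusoidal position table of shape (t, d)."""
    pos = np.arange(t)[:, None]
    i = np.arange(d)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over the time axis."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)
    a = np.exp(scores)
    a = a / a.sum(axis=1, keepdims=True)  # each row weights all time steps
    return a @ v

def global_temporal_latent(frame_feats, wq, wk, wv):
    """frame_feats: (T, D). Add temporal positions, attend across frames, average over time."""
    x = frame_feats + sinusoidal_positions(*frame_feats.shape)
    return self_attention(x, wq, wk, wv).mean(axis=0)  # sequence average
```

The attention matrix lets every time step draw on every other, which is how the encoder can weight some time points more heavily than others before the sequence average collapses the result into a single vector.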
[0090] Step S33 is the decoder part in the encoder-decoder structure. In one example, the decoder adopts a multi-branch structure, which includes decoding sub-networks for multiple aspects such as material, temperature, and depth, and maps the global temporal latent vector to multiple latent space vectors that represent the physiological attributes of living organisms (such as material, temperature, and depth).
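A multi-branch decoder of this kind can be sketched as a set of independent small networks that all read the same global temporal latent vector. The two-layer structure with a ReLU activation, the parameter layout, and the function name `decode_branches` are assumptions made for illustration; the source only states that there is one decoding sub-network per attribute.

```python
import numpy as np

def decode_branches(z, branches):
    """branches maps an attribute name to (W1, b1, W2, b2); returns name -> latent vector."""
    out = {}
    for name, (w1, b1, w2, b2) in branches.items():
        h = np.maximum(z @ w1 + b1, 0.0)  # ReLU hidden layer of this branch
        out[name] = h @ w2 + b2           # latent space vector for this attribute
    return out
```

Each branch has its own parameters, so the material, temperature, and depth vectors can be supervised or inspected separately even though they derive from one shared input.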
[0091] Finally, step S4 evaluates liveness based on one or more of the latent space vectors obtained in step S33, such as the latent space vector corresponding to temperature, integrates the individual detection results, and outputs the final result of whether the target object is live.
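The source does not specify how the individual results are integrated in step S4. The following sketch assumes one possible scheme: each selected latent space vector is scored by a logistic head, and the scores are combined by a weighted average that is compared with a threshold. The heads, weights, threshold, and the function name `liveness_decision` are all hypothetical.

```python
import numpy as np

def liveness_decision(latents, heads, weights, threshold=0.5):
    """Score each selected latent vector, fuse by weighted average, compare to a threshold."""
    scores = {}
    for name, (w, b) in heads.items():
        logit = latents[name] @ w + b
        scores[name] = 1.0 / (1.0 + np.exp(-logit))  # per-attribute liveness score in (0, 1)
    total = sum(weights[n] for n in scores)
    fused = sum(weights[n] * scores[n] for n in scores) / total
    return bool(fused > threshold), float(fused), scores
```

Only the attributes listed in `heads` take part in the decision, which mirrors the statement that the determination may rest on certain latent space vectors, such as the one corresponding to temperature.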
[0092] The present invention also provides a readable storage medium having a program stored thereon, which, when executed, implements the steps of the monocular near-infrared-based liveness detection method as described above.
[0093] This invention also provides a liveness detection system based on monocular near-infrared, which includes: an object detection and preprocessing module, a multi-scale fusion behavior analysis module, and an encoder-decoder analysis module;
[0094] The object detection and preprocessing module is used to acquire the near-infrared image sequence of the target object and preprocess the near-infrared image sequence; that is, the object detection and preprocessing module is used to perform step S1.
[0095] The multi-scale fusion behavior analysis module is used to perform multi-scale behavior fusion analysis on the preprocessed data to determine whether the target object conforms to the natural behavior pattern. If so, the near-infrared image sequence of the target object is determined to pass the behavior analysis; that is, the multi-scale fusion behavior analysis module is used to execute step S2 accordingly.
[0096] The encoder-decoder analysis module is used to apply an encoder-decoder structure to the near-infrared image sequence that passed the behavior analysis, decomposing the near-infrared image sequence into multiple latent space vectors, and then to output, based on the multiple latent space vectors, the result of whether the target object is live. That is, the encoder-decoder analysis module is used to execute steps S3 and S4.
[0097] Furthermore, the liveness detection system can also implement any one of the sub-steps in the above-mentioned liveness detection method. The specific physical components of the liveness detection system can be configured by those skilled in the art in combination with actual conditions, and this embodiment is not limited in this regard.
[0098] In summary, the monocular near-infrared-based liveness detection method, system, and readable storage medium provided by this invention include: acquiring a near-infrared image sequence of a target object and preprocessing the near-infrared image sequence; performing multi-scale behavior fusion analysis on the preprocessed data to determine whether the target object conforms to a natural behavior pattern; if so, determining that the near-infrared image sequence of the target object passes the behavior analysis; using an encoder-decoder structure on the near-infrared image sequence that passes the behavior analysis to decompose the near-infrared image sequence into multiple latent space vectors; and outputting the result of whether the target object is a live object based on the multiple latent space vectors. With this configuration, multi-scale behavior fusion analysis can determine whether the target object conforms to a natural behavior pattern; furthermore, using an encoder-decoder structure on the near-infrared image sequence that passes the behavior analysis can decompose the near-infrared image sequence into multiple latent space vectors representing the physiological attributes of a live object, thereby accurately outputting the result of whether the target object is a live object. This method can solve the problem of low product accuracy under low cost, better cope with unknown attack samples, has wide applicability, and does not require users to complete specified actions, effectively improving operating efficiency, detection accuracy and generalization.
[0099] It should be noted that the above embodiments can be combined with each other. The above description is only a description of preferred embodiments of the present invention and is not intended to limit the scope of the present invention in any way. Any changes or modifications made by those skilled in the art based on the above disclosure shall fall within the protection scope of the present invention.
Claims
1. A liveness detection method based on monocular near-infrared spectroscopy, characterized in that, include: Acquire a near-infrared image sequence of the target object, and preprocess the near-infrared image sequence; Multi-scale behavior fusion analysis is performed on the preprocessed data to determine whether the target object conforms to a natural behavior pattern. If so, the near-infrared image sequence of the target object is determined to pass the behavior analysis. An encoder-decoder structure is used on the near-infrared image sequence obtained through behavior analysis to decompose the near-infrared image sequence into multiple latent space vectors; Based on the multiple latent space vectors, the result of whether the target object is a living being is output.
2. The liveness detection method based on monocular near-infrared as described in claim 1, characterized in that the preprocessing comprises: cropping the near-infrared image sequence, and concatenating the cropped image sequence along the channel dimension in the order of the video stream; and normalizing the concatenated image sequence.
3. The liveness detection method based on monocular near-infrared as described in claim 2, characterized in that the method is used to detect faces, and the step of cropping the near-infrared image sequence comprises: inputting each frame of the near-infrared image sequence into a face detection model, which outputs a confidence and location information; and when the confidence is greater than a set threshold, center-cropping and affine-aligning the frame based on the bounding box represented by the location information, and outputting a feature region of standardized size.
4. The liveness detection method based on monocular near-infrared as described in claim 2, characterized in that the normalization comprises: stacking the concatenated image sequence into a unified tensor along the channel dimension; and normalizing each pixel based on the global maximum and global minimum of all pixel values in the tensor.
5. The liveness detection method based on monocular near-infrared as described in claim 1, characterized in that the multi-scale behavior fusion analysis uses a pyramid-shaped image behavior analysis network to analyze the preprocessed data, wherein the bottom layer of the image behavior analysis network processes high-frequency frames and the top layer processes low-frequency frames.
6. The liveness detection method based on monocular near-infrared as described in claim 5, characterized in that the step of analyzing the preprocessed data using the pyramid-shaped image behavior analysis network comprises: concatenating the preprocessed data into a tensor; introducing a temporal attention mechanism to enhance key action frames; and performing behavior classification on multiple frames of images by using global pooling and fully connected layers to output behavior category probabilities.
7. The liveness detection method based on monocular near-infrared as described in claim 1, characterized in that the step of applying an encoder-decoder structure to the near-infrared image sequence that passes the behavior analysis to decompose it into multiple latent space vectors comprises: using a vision transformer to extract single-frame spatial features from the near-infrared image sequence to obtain a frame feature sequence; applying temporal position encoding to the frame feature sequence, performing temporal feature fusion, inputting the result into the encoder, and outputting a global temporal latent vector through a self-attention mechanism; and inputting the global temporal latent vector into the decoder, which decomposes it into multiple latent space vectors representing the physiological attributes of a live object.
8. The liveness detection method based on monocular near-infrared as described in claim 7, characterized in that the step of using a vision transformer to extract single-frame spatial features from the near-infrared image sequence to obtain a frame feature sequence comprises: dividing each frame of the near-infrared image sequence into multiple image patches; linearly projecting and position-encoding the image patches to obtain a token sequence; selecting multiple tokens from the token sequence as the frame features of that frame; and arranging the frame features of each frame of the near-infrared image sequence in order to obtain the frame feature sequence.
9. A readable storage medium having a program stored thereon, characterized in that, when the program is executed, it implements the steps of the liveness detection method based on monocular near-infrared according to any one of claims 1 to 8.
10. A liveness detection system based on monocular near-infrared, characterized in that it comprises: an object detection and preprocessing module, a multi-scale fusion behavior analysis module, and an encoder-decoder analysis module; the object detection and preprocessing module is used to acquire a near-infrared image sequence of a target object and to preprocess the near-infrared image sequence; the multi-scale fusion behavior analysis module is used to perform multi-scale behavior fusion analysis on the preprocessed data to determine whether the target object conforms to a natural behavior pattern, and if so, to determine that the near-infrared image sequence of the target object passes the behavior analysis; and the encoder-decoder analysis module is used to apply an encoder-decoder structure to the near-infrared image sequence that passes the behavior analysis, so as to decompose the near-infrared image sequence into multiple latent space vectors, and then to output the result of whether the target object is a live object based on the multiple latent space vectors.
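As an illustration of the preprocessing recited in claims 2 and 4, the following NumPy sketch stacks already-cropped frames along the channel dimension in stream order and then scales every pixel using the global minimum and maximum of the resulting tensor. It is a sketch under stated assumptions, not the claimed implementation: the function names `stack_frames` and `minmax_normalize` are hypothetical, the frames are assumed to be single-channel and equally sized, and the exact formula `(x - min) / (max - min)` and the small `eps` term are assumptions, since the claims only say that normalization is based on the global maximum and minimum.

```python
from typing import Sequence

import numpy as np


def stack_frames(crops: Sequence[np.ndarray]) -> np.ndarray:
    """Stack equally sized single-channel crops along the channel axis, in stream order."""
    if len(crops) == 0:
        raise ValueError("need at least one cropped frame")
    # Result shape is (H, W, T); channel t holds the t-th frame of the video stream.
    return np.stack(crops, axis=-1).astype(np.float32)


def minmax_normalize(tensor: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Scale every pixel with the global min and max taken over the whole tensor."""
    lo = tensor.min()
    hi = tensor.max()
    # eps (an assumption) avoids division by zero when all pixels are equal.
    return (tensor - lo) / (hi - lo + eps)


# Usage: three synthetic 2x2 "frames" with constant intensities 0, 128, and 255.
crops = [np.full((2, 2), v, dtype=np.uint8) for v in (0, 128, 255)]
x = minmax_normalize(stack_frames(crops))
print(x.shape)  # (2, 2, 3)
```

Because the minimum and maximum are taken over the whole stacked tensor rather than per frame, relative brightness differences between frames are preserved after scaling.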