Methods for constructing and detecting remote photoplethysmography signals and heart rate detection models

By constructing a remote photoplethysmography (rPPG) signal and heart rate detection model built on a 3D spatiotemporal convolution and deconvolution encoder-decoder structure, the invention addresses the low accuracy of rPPG signals in unconstrained environments and achieves high-precision heart rate detection.

CN116012916B · Active · Publication Date: 2026-06-30 · SHAN XI XIN HE PU GUANG DIAN YOU XIAN GONG SI

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SHAN XI XIN HE PU GUANG DIAN YOU XIAN GONG SI
Filing Date
2023-01-04
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing remote photoplethysmography signals have low accuracy in unconstrained environments, and existing methods cope poorly with complex, ever-changing real-world conditions and data noise.

Method used

A remote photoplethysmography (rPPG) signal and heart rate detection model is constructed, including a video preprocessing module, a physiological feature extraction module, and a signal estimation module. A 3D spatiotemporal convolution and deconvolution encoder-decoder structure is adopted. Multi-level feature fusion and multi-scale enhancement modules are used to extract and recover accurate remote photoplethysmography signals.

Benefits of technology

It improves the accuracy and stability of the model in complex environments, effectively eliminates noise and light interference, and achieves high-precision heart rate detection.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention provides a method for constructing a remote photoplethysmography (rPPG) signal and heart rate detection model, and a detection method using it. The construction method includes: acquiring and preprocessing a sequence of facial video images as an initial dataset; acquiring photoplethysmography (PPG) signals as a label set; and constructing an rPPG signal and heart rate detection model, which is trained with the initial dataset as input and the label set as output and which outputs the final rPPG signal. The invention adds a multi-level feature fusion module to the encoder and a multi-scale enhancement module to the decoder, so that effective physiological features highly correlated with the temporal information of the rPPG signal are retained during feature extraction, thereby recovering accurate rPPG signals and improving model performance. The signal estimation module filters the predicted signal, which to some extent accommodates waveform distortion and improves the model's prediction accuracy.

Description

Technical Field

[0001] This invention relates to the field of physiological signal detection and processing technology, specifically to a method for constructing and detecting remote photoplethysmography signals and heart rate detection models.

Background Technology

[0002] Physiological signals such as heart rate, respiratory rate, and heart rate variability are closely related to human health and are widely used in fields such as healthcare, health assessment, liveness detection, and affective computing. Two common methods for measuring physiological signals such as heart rate are electrocardiography (ECG) and photoplethysmography (PPG). ECG measures changes in the electrical activity of the heart muscle and usually requires electrodes to be attached to the body. The pads and gels used to attach the electrodes can irritate the skin, causing discomfort and severely restricting the patient's freedom of movement. PPG combines a light source with a small optical sensor to measure changes in the absorption of light by blood vessels over the pulse cycle. Both methods rely on contact measurement and may cause discomfort or pain.

[0003] Remote photoplethysmography (rPPG) can solve the problems of the two methods mentioned above. However, most existing rPPG methods rely on the researcher's prior knowledge and cope poorly with complex, ever-changing real environments and data noise, so the rPPG signals they obtain have low accuracy. Therefore, this invention studies and designs a method for constructing a remote photoplethysmography signal and heart rate detection model, as well as a detection method.

Summary of the Invention

[0004] Therefore, the technical problem to be solved by the present invention is to overcome the defect of low accuracy of remote photoplethysmography signals in unconstrained environments in the prior art, thereby providing a method for constructing and detecting remote photoplethysmography signals and heart rate detection models.

[0005] To address the aforementioned problems, this invention provides a method for constructing a remote photoplethysmography signal and heart rate detection model, comprising:

[0006] S1: Acquire face video image sequences and remote photoplethysmography signals. Preprocess the acquired face video images to obtain preprocessed face video image sequences. Use the preprocessed face video image sequences as the initial dataset and the acquired remote photoplethysmography signals as the label set.

[0007] S2: Construct a remote photoplethysmography signal and heart rate detection model, using the initial dataset as input and the label set as output to train the remote photoplethysmography signal and heart rate detection model;

[0008] S3: Input the initial dataset into the trained remote photoplethysmography signal and heart rate detection model, and output the final remote photoplethysmography signal.

[0009] Preferably, the remote photoplethysmography signal and heart rate detection model includes a video preprocessing module, a physiological feature extraction module, and a signal estimation module. The physiological feature extraction module is used to extract physiological features, and the signal estimation module is used to merge signals.

[0010] The video preprocessing module is used to crop out the cheeks containing rich physiological information from the face image sequence. The video preprocessing module separates the RGB three channels in the acquired face video image sequence and sets different weights for the RGB three channels.

[0011] Preferably, the physiological feature extraction module includes an encoder and a decoder, and three encoders and three decoders are arranged in parallel. The encoders are used to extract the physiological features of the RGB three channels respectively, and the decoders are used to recover the signals of the RGB three channels respectively.

[0012] The encoder is equipped with a multi-level feature fusion module to sequentially fuse the output features of each previous layer, and the decoder is equipped with a multi-scale enhancement module to achieve skip-layer connections and recover accurate remote photoplethysmography signals.

[0013] Preferably, the signal estimation module is used to merge the RGB three-channel signals from the three decoders, and filter the merged signal to output a remote photoplethysmography signal.

[0014] Preferably, the encoder has four layers and the decoder has three layers, wherein the encoder consists of a 1*5*5 three-dimensional convolutional block and three 3*3*3 spatiotemporal blocks, and a normalization layer and a ReLU layer are set after each convolutional block;

[0015] Each convolutional block is followed by a residual group, which is used to transmit physiological features lost during feature size changes, thereby improving the stability of the network.

[0016] The multi-level feature fusion module is respectively set in the 2nd to 4th layers of the encoder to fuse the output features of each previous layer in sequence and enhance the spatiotemporal correlation of each channel feature.

[0017] The decoder has three deconvolutional layers with a kernel size of 3*3*3. Each decoder layer has an added multi-scale enhancement module, which provides skip-layer connections and strengthens learning of the facial physiological features from the previous layer, so as to recover accurate remote photoplethysmography signals.

[0018] Preferably, the multi-level feature fusion module is defined as:

[0019] $e_i^{*} = E(e_i \mid 1, 2, \ldots, i-1)$;

[0020] where $e_i$ is the hidden feature of the i-th layer of the encoder and $e_i^{*}$ is the feature enhanced through feature fusion;

[0021] {1,2,…,i-1} represents the features fused by all multi-level feature fusion modules in the first i-1 layers of the encoder.

[0022] Preferably, the multi-scale enhancement module is defined as:

[0023] The feature $d_i$ from the previous layer is upsampled by interpolation and enhanced with the output feature $r$ of each encoder layer's residual group, which realizes the skip-layer connection. A refinement unit then generates the enhanced feature $d_{i+1}$ according to $d_{i+1} = R\big(r + (d_i)\uparrow\big) - (d_i)\uparrow$;

[0024] where $\uparrow$ is the interpolation upsampling operator, $r + (d_i)\uparrow$ is the enhanced feature, and $R$ is the trainable refinement unit; each refinement unit is implemented as a residual group containing three residual blocks.

[0025] This invention also provides a detection method based on the remote photoplethysmography signal and heart rate detection model, the method comprising the following steps:

[0026] Step 1: Acquire a sequence of facial video images;

[0027] Step 2: Preprocess the acquired face video image sequence and input it into a remote photoplethysmography signal and heart rate detection model constructed by any of the aforementioned construction methods to obtain the predicted remote photoplethysmography signal;

[0028] Step 3: The predicted remote photoplethysmography signal is bandpass filtered and then converted to a power spectral density (PSD), from which the corresponding predicted heart rate value is calculated.

[0029] Preferably, step 3 further includes filtering with a one-dimensional convolutional filter to better accommodate waveform distortion and calculating the power spectral density of the resulting remote photoplethysmography signal; the frequency with the highest power corresponds to the predicted heart rate value.

[0030] The detection method based on the remote photoplethysmography signal and heart rate detection model provided by this invention is used for heart rate estimation applications.

[0031] The remote photoplethysmography signal and heart rate detection model construction method and detection method provided by this invention have the following beneficial effects:

[0032] 1. This invention designs an encoder-decoder structure based on 3D spatiotemporal convolution and deconvolution. The model comprises a video preprocessing module, a physiological feature extraction module, and a signal estimation module; the physiological feature extraction module in turn includes spatiotemporal blocks, residual groups, a multi-level feature fusion module, and a multi-scale enhancement module. The video preprocessing module crops the cheeks from the face video to form the facial image sequence, eliminating the interference of background information with signal prediction. The RGB channels of the facial image sequence are separated and used as input, with a different weight set for each channel, so that the network counters illumination interference while adaptively focusing on the channel that carries the most physiological information. The physiological feature extraction module contains three parallel encoders and decoders. A multi-level feature fusion module and a multi-scale enhancement module are added to the encoders and decoders to ensure that effective physiological features highly correlated with the temporal information of the remote photoplethysmography signal are retained during feature extraction, thereby improving the performance of the model. The signal estimation module filters the predicted signal, which to some extent accommodates waveform distortion and improves the prediction accuracy of the model.

[0033] 2. This invention also improves the accuracy of the network by adding multi-level feature fusion modules to the encoder and decoder respectively, which effectively fuses features from non-adjacent layers while retaining the spatial information of high-resolution features. This fully combines the semantic information of deep features with the spatial information of shallow features. In addition, a multi-scale enhancement module is added to the decoder to enhance the features of the same level in the decoder and encoder by scale transformation. This not only realizes skip-layer connections, but also strengthens the network to learn the physiological features of the face, thereby recovering accurate remote photoplethysmography signals.

[0034] 3. Furthermore, this invention extracts only the cheeks from the face video as input, without complex preprocessing steps, and provides a high-precision end-to-end remote photoplethysmography signal and heart rate detection model, solving the noise interference and illumination interference problems in existing remote photoplethysmography signal and heart rate estimation.

Description of the Drawings

[0035] Figure 1 is a schematic diagram illustrating the construction process of the remote photoplethysmography signal and heart rate detection model of the present invention;

[0036] Figure 2 is a schematic diagram of the feature extraction module of the present invention;

[0037] Figure 3 is a schematic diagram of the multi-level feature fusion module structure in the feature extraction module of the present invention;

[0038] Figure 4 is a schematic diagram of the spatiotemporal block and residual block in the feature extraction module of the present invention;

[0039] Figure 5 is a schematic diagram of the signal estimation module of the present invention.

Detailed Implementation

[0040] As shown in Figures 1-5, this invention provides a method for constructing a remote photoplethysmography signal and heart rate detection model, which includes:

[0041] S1: Acquire face video image sequences and remote photoplethysmography signals. Preprocess the acquired face video images to obtain preprocessed face video image sequences. Use the preprocessed face video image sequences as the initial dataset and the acquired remote photoplethysmography signals as the label set.

[0042] S2: Construct a remote photoplethysmography signal and heart rate detection model, using the initial dataset as input and the label set as output to train the remote photoplethysmography signal and heart rate detection model;

[0043] S3: Input the initial dataset into the trained remote photoplethysmography signal and heart rate detection model, and output the final remote photoplethysmography signal.

[0044] In some embodiments, the remote photoplethysmography signal and heart rate detection model includes a video preprocessing module, a physiological feature extraction module, and a signal estimation module. The physiological feature extraction module is used to extract physiological features, and the signal estimation module is used to merge signals.

[0045] The video preprocessing module is used to crop out the cheeks containing rich physiological information from the face image sequence. The video preprocessing module separates the RGB three channels in the acquired face video image sequence and sets different weights for the RGB three channels.

[0046] In some embodiments, the physiological feature extraction module includes an encoder and a decoder, and three encoders and three decoders are arranged in parallel. The encoders are used to extract the physiological features of the RGB three channels respectively, and the decoders are used to recover the signals of the RGB three channels respectively.

[0047] The encoder is equipped with a multi-level feature fusion module to sequentially fuse the output features of each previous layer, and the decoder is equipped with a multi-scale enhancement module to achieve skip-layer connections and recover accurate remote photoplethysmography signals.

[0048] In some implementations, the signal estimation module is used to combine the RGB three-channel signals from the three decoders, and filter the combined signal to output a remote photoplethysmography signal.

[0049] In some implementations, the encoder has four layers and the decoder has three layers, wherein the encoder consists of a 1*5*5 three-dimensional convolutional block and three 3*3*3 spatiotemporal blocks, and each convolutional block is followed by a normalization layer and a ReLU layer.

[0050] Each convolutional block is followed by a residual group, which is used to transmit physiological features lost during feature size changes, thereby improving the stability of the network.

[0051] The multi-level feature fusion module is respectively set in the 2nd to 4th layers of the encoder to fuse the output features of each previous layer in sequence and enhance the spatiotemporal correlation of each channel feature.

[0052] The decoder has three deconvolutional layers with a kernel size of 3*3*3. Each decoder layer has an added multi-scale enhancement module, which provides skip-layer connections and strengthens learning of the facial physiological features from the previous layer, so as to recover accurate remote photoplethysmography signals.

[0053] In some implementations, the multi-level feature fusion module is defined as:

[0054] $e_i^{*} = E(e_i \mid 1, 2, \ldots, i-1)$;

[0055] where $e_i$ is the hidden feature of the i-th layer of the encoder, $e_i^{*}$ is the enhanced feature after feature fusion, and {1,2,…,i-1} denotes the features fused by all multi-level feature fusion modules in the first i-1 layers of the encoder.

[0056] In some implementations, the multi-scale enhancement module is defined as:

[0057] The feature $d_i$ from the previous layer is upsampled by interpolation and enhanced with the output feature $r$ of each encoder layer's residual group, which realizes the skip-layer connection. A refinement unit then generates the enhanced feature $d_{i+1}$ according to $d_{i+1} = R\big(r + (d_i)\uparrow\big) - (d_i)\uparrow$;

[0058] where $\uparrow$ is the interpolation upsampling operator, $r + (d_i)\uparrow$ is the enhanced feature, and $R$ is the trainable refinement unit; each refinement unit is implemented as a residual group containing three residual blocks.

[0059] This invention also provides a detection method based on the remote photoplethysmography signal and heart rate detection model, the method comprising the following steps:

[0060] Step 1: Acquire a sequence of facial video images;

[0061] Step 2: Preprocess the acquired face video image sequence and input it into a remote photoplethysmography signal and heart rate detection model constructed by any of the aforementioned construction methods to obtain the predicted remote photoplethysmography signal;

[0062] Step 3: The predicted remote photoplethysmography signal is bandpass filtered and then converted to a power spectral density (PSD), from which the corresponding predicted heart rate value is calculated.

[0063] In some implementations, step 3 further includes filtering with a one-dimensional convolutional filter to better accommodate waveform distortion and calculating the power spectral density of the resulting remote photoplethysmography signal; the frequency with the highest power corresponds to the predicted heart rate value.

[0064] The detection method based on the remote photoplethysmography signal and heart rate detection model provided by this invention is used for heart rate estimation applications.

[0065] It should be noted that the pulse wave signal in the label set is the photoplethysmography (PPG) signal collected by a contact measuring instrument, whereas the output rPPG signal is a remote photoplethysmography signal. Although the two signals are collected in different ways, they contain the same physiological information. The pulse wave signal is therefore used as the label set to train the constructed remote photoplethysmography signal and heart rate detection model, yielding a well-trained remote photoplethysmography signal detection model.

[0066] A sequence of face video images and remote photoplethysmography signals are acquired. The acquired face video image sequence is preprocessed to obtain the face video image sequence used as the initial dataset. The acquired remote photoplethysmography signals are bandpass filtered and downsampled to obtain the label set.
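The downsampling step can be illustrated with a minimal sketch. It assumes linear interpolation is used to produce one label per video frame; the source does not state the resampling scheme, the sampling rates, or the bandpass settings, so the function name `resample_labels` and all parameters are hypothetical, and the bandpass stage is omitted.

```python
def resample_labels(ppg, fs_ppg, fps, n_frames):
    """Linearly interpolate a contact PPG trace to one label per video frame.

    Assumption: the trace starts at the same instant as the first frame.
    """
    labels = []
    last = len(ppg) - 1
    for frame in range(n_frames):
        pos = (frame / fps) * fs_ppg      # fractional index into the PPG trace
        lo = min(int(pos), last)          # clamp so short traces do not overflow
        hi = min(lo + 1, last)
        frac = pos - lo if hi > lo else 0.0
        labels.append((1.0 - frac) * ppg[lo] + frac * ppg[hi])
    return labels
```

For example, a trace sampled at 4 Hz and a video at 2 frames per second yields every second sample as a label.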

[0067] Specifically, the training set contains 750 facial video image sequences of the subjects, and the final output is 750 remote photoplethysmography signals. The test set contains 150 facial video image sequences of the subjects, and the final output is 150 remote photoplethysmography signals.

[0068] As shown in Figure 1, the initial dataset and label set are used to train the proposed remote photoplethysmography signal and heart rate detection model. The model includes a video preprocessing module, a physiological feature extraction module, and a signal estimation module. The physiological feature extraction module includes a spatiotemporal block, a residual group, a multi-level feature fusion module, and a multi-scale enhancement module. Training yields the trained remote photoplethysmography signal and heart rate detection model, which completes the model construction.

[0069] The video preprocessing module performs segmentation, cropping, and resizing of the face video image sequence. Starting from the first frame, key points are located in the acquired face video image sequence, and the cheeks are cropped frame by frame. Subsequent frames are cropped based on the face position in the first frame, and background information is removed to obtain the preprocessed face video image sequence. The obtained face video image sequence is then matched frame by frame with the corresponding remote photoplethysmography (rPPG) signal labels. Every 320 consecutive face images are treated as a batch, and each image frame is scaled to 32*32 to form the initial dataset. Finally, the RGB channels of the face image sequence are separated and used as input, with a different weight set for each channel.
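The batching and channel-separation steps can be sketched as follows. This is only an illustration under stated assumptions: frames are nested lists of (R, G, B) tuples that have already been cropped and resized, the channel weights are fixed scalars with placeholder values (the source gives no values and does not say whether they are learned), and the function names are hypothetical.

```python
CLIP_LEN = 320  # frames per batch, as stated in the text


def split_into_clips(frames, clip_len=CLIP_LEN):
    """Group consecutive frames into non-overlapping clips, dropping a short tail."""
    n_clips = len(frames) // clip_len
    return [frames[i * clip_len:(i + 1) * clip_len] for i in range(n_clips)]


def separate_and_weight(clip, weights=(0.3, 0.5, 0.2)):
    """Split a clip of H x W frames of (R, G, B) pixels into weighted channels.

    The default weights are placeholders, not values from the source.
    Returns [R, G, B], each a T x H x W nested list.
    """
    channels = []
    for c, w in enumerate(weights):
        channels.append(
            [[[w * pixel[c] for pixel in row] for row in frame] for frame in clip]
        )
    return channels
```

Each of the three returned sequences would then feed one of the three parallel encoders.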

[0070] The physiological feature extraction module includes three identical, parallel encoder-decoder structures. The three encoders are used to extract physiological features from the RGB channels, and the three decoders are used to recover the signals from the three channels. Each encoder has a multi-level feature fusion module that sequentially fuses the output features of each previous layer. Each decoder has a multi-scale enhancement module that enables skip-layer connections and recovers accurate rPPG signals.

[0071] Specifically, as shown in Figure 2, the physiological feature extraction module has a four-layer encoder and a three-layer decoder. The encoder consists of a 1*5*5 three-dimensional convolutional layer and three 3*3*3 spatiotemporal block layers. The 1*5*5 three-dimensional convolutional layer initially extracts the spatial color distribution of the face, and the 3*3*3 spatiotemporal blocks split the spatial and temporal dimensions, allowing the network to better focus on the temporal information related to physiological signals. Each convolutional layer is followed by a batch normalization layer and a ReLU layer. Each convolutional block is followed by a residual group that transmits physiological features lost during feature size changes, thereby improving the stability of the network. Each of encoder layers 2-4 has a multi-level feature fusion module. The decoder has three deconvolutional layers with a kernel size of 3*3*3. A multi-scale enhancement module is added to each layer of the decoder.

[0072] As shown in Figure 3, the multi-level feature fusion module is defined as $e_i^{*} = E(e_i \mid 1, 2, \ldots, i-1)$, where $e_i$ is the hidden feature of the i-th layer of the encoder, $e_i^{*}$ is the feature enhanced through feature fusion, and {1,2,…,i-1} represents the features fused by all multi-level feature fusion modules in the first i-1 layers of the encoder. A progressive process is adopted: the feature $e_i$ of the i-th encoder layer is enhanced by supplying one previously fused feature $k$ ($k = 1, 2, \ldots, i-1$) at a time. The update is defined as $c_t = v_t(e_i^{t}) - k$, where $c_t$ is the difference, in the t-th iteration, between the fused feature $e_i^{t}$ (with $e_i^{0} = e_i$) and the previously fused feature $k$, and $v_t$ is the upsampling operator, which upsamples $e_i^{t}$ to the same dimension as $k$. The fused feature is then updated with the projected difference: $e_i^{t+1} = u_t(c_t) + e_i^{t}$, where $u_t$ is the downsampling operator, which downsamples the difference $c_t$ of the t-th iteration to the same dimension as $e_i^{t}$. After fusion with all features from the previous i-1 layers, the final fused feature $e_i^{*}$ is obtained. The sampling operators $u_t$ and $v_t$ are implemented with convolution and deconvolution, respectively.
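The progressive update described above can be sketched on one-dimensional features. In this illustration, nearest-neighbour upsampling and average pooling are stand-ins for the learned deconvolution and convolution operators, which the source does not specify in detail; the function names are hypothetical, and each earlier feature is assumed to be an integer multiple of the current feature's length.

```python
def upsample(x, factor):
    """Nearest-neighbour upsampling; stands in for the learned operator v_t."""
    return [value for value in x for _ in range(factor)]


def downsample(x, factor):
    """Average pooling; stands in for the learned operator u_t."""
    return [sum(x[i:i + factor]) / factor for i in range(0, len(x), factor)]


def fuse(e_i, previous):
    """Enhance e_i by iterating over the earlier fused features k, one at a time."""
    e = list(e_i)
    for k in previous:
        factor = len(k) // len(e)
        # c_t = v_t(e) - k: difference at the resolution of k
        c = [a - b for a, b in zip(upsample(e, factor), k)]
        # e <- u_t(c_t) + e: project the difference back and update
        e = [a + b for a, b in zip(downsample(c, factor), e)]
    return e
```

With these fixed stand-in operators, a feature that already matches the earlier feature after upsampling is left unchanged, because the difference is zero; in the trained network the operators are learned, so the sketch shows only the data flow.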

[0073] Specifically, for the multi-level feature fusion of the i-th layer in the decoder, $d_i$ is the feature enhanced by the i-th layer of the decoder, its fused counterpart is the feature enhanced through feature fusion, and {1,2,…,j-1} denotes the enhanced features of the first j-1 layers of the multi-level feature fusion module in the decoder. The multi-level feature fusion module of the decoder has the same structure as that of the encoder, except that the positions of the downsampling operator $u_t$ and the upsampling operator $v_t$ are swapped.

[0074] The decoder is used to recover the signal. To progressively refine the output feature $d_i$ of each decoder layer, a multi-scale enhancement module is added to the decoder. In the multi-scale enhancement module of the i-th layer, the feature $d_i$ from the previous layer is upsampled by interpolation and enhanced with the output feature $r$ of each encoder layer's residual group, which realizes the skip connection. A refinement unit then generates the enhanced feature $d_{i+1}$ according to the following formula:

[0075] $d_{i+1} = R\big(r + (d_i)\uparrow\big) - (d_i)\uparrow$, where $\uparrow$ is the interpolation upsampling operator, $r + (d_i)\uparrow$ is the enhanced feature, and $R$ is the trainable refinement unit; each refinement unit is implemented as a residual group containing three residual blocks.
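A minimal one-dimensional sketch of the enhancement formula follows. It assumes linear interpolation for the upsampling operator and takes the refinement unit as an arbitrary callable; in the source the refinement unit is a trainable residual group, so `refine` here is a hypothetical placeholder, as are the function names.

```python
def upsample_linear(x, factor=2):
    """Linear-interpolation upsampling of a 1-D feature (the up-arrow operator)."""
    last = len(x) - 1
    out = []
    for j in range(len(x) * factor):
        pos = min(j / factor, last)       # clamp at the right edge
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append((1.0 - frac) * x[lo] + frac * x[hi])
    return out


def enhance(d_i, r, refine):
    """Compute d_{i+1} = R(r + up(d_i)) - up(d_i)."""
    up = upsample_linear(d_i, len(r) // len(d_i))
    boosted = [a + b for a, b in zip(r, up)]   # r + up(d_i): the enhanced feature
    return [a - b for a, b in zip(refine(boosted), up)]
```

If `refine` is the identity, the output reduces to the encoder feature `r`, which shows that the subtraction removes the upsampled decoder feature again and leaves the refinement unit to contribute everything beyond the skip connection.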

[0076] As shown in Figure 4(a), the spatiotemporal block in the physiological feature extraction module splits the 3D convolution into a 2D convolution and a 1D convolution. The 3D convolution kernel has size t*w*h, where t is the temporal dimension and w*h is the spatial width and height. After splitting, the kernel sizes of the spatial 2D convolution and the temporal 1D convolution are 1*w*h and t*1*1, respectively. A 3D convolution extracts temporal and spatial features simultaneously, but temporal information matters more for video-based non-contact physiological signal measurement, so the temporal and spatial dimensions are split to let the network focus better on temporal information. Cross-layer residual groups are introduced in the encoder and decoder to transmit the information lost during feature simplification. A residual group consists of three residual blocks with the same structure. As shown in Figure 4(b), the residual block in the physiological feature extraction module consists of two three-dimensional convolutions with the same number of input and output channels and a kernel size of 3*3*3, with a ReLU operation between them. The features from the previous layer are input into the first three-dimensional convolution, activated by the ReLU function, and then passed to the second three-dimensional convolution. The output of the second convolution is added to the input of the residual block to give the block's output features.
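As a side effect of the split, the factorized block also needs fewer weights than a full 3D convolution. The sketch below counts weights for both variants; the channel counts are hypothetical (the source does not give them), biases are ignored, and the intermediate channel count between the spatial and temporal convolutions is assumed to equal the output channel count.

```python
def conv3d_params(c_in, c_out, t, w, h):
    """Weight count of a full t*w*h 3D convolution (bias ignored)."""
    return c_in * c_out * t * w * h


def factorized_params(c_in, c_out, t, w, h, c_mid=None):
    """Weight count of a 1*w*h spatial conv followed by a t*1*1 temporal conv."""
    c_mid = c_out if c_mid is None else c_mid  # assumed intermediate width
    spatial = c_in * c_mid * 1 * w * h
    temporal = c_mid * c_out * t * 1 * 1
    return spatial + temporal


print(conv3d_params(32, 32, 3, 3, 3))      # 27648
print(factorized_params(32, 32, 3, 3, 3))  # 12288
```

For a 3*3*3 kernel with 32 input and 32 output channels, the full convolution has 27648 weights and the factorized pair has 12288 under these assumptions.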

[0077] As shown in Figure 5, the signal estimation module merges the signals from the three decoder channels. After adaptive average pooling and dimensionality compression, the video information is converted into an rPPG signal. A one-dimensional convolutional filter is then applied, which to some extent accommodates waveform distortion. Finally, the power spectral density (PSD) of the resulting rPPG signal is calculated, and the frequency with the highest power corresponds to the predicted heart rate value.
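The last two steps, filtering and reading the heart rate off the spectral peak, can be sketched with the standard library alone. Several details are assumptions rather than the patented design: a fixed smoothing kernel stands in for the learned one-dimensional convolutional filter, the PSD is estimated with a plain periodogram, and the peak search is limited to 0.7-4.0 Hz (42-240 beats per minute), a band the source does not specify.

```python
import math


def smooth(signal, kernel=(0.25, 0.5, 0.25)):
    """1-D convolution with a fixed kernel; stands in for the learned filter."""
    half = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = min(max(i + j - half, 0), n - 1)  # replicate the edges
            acc += w * signal[idx]
        out.append(acc)
    return out


def heart_rate_bpm(signal, fs, f_lo=0.7, f_hi=4.0):
    """Return 60 times the frequency (Hz) with the largest periodogram power."""
    n = len(signal)
    mean = sum(signal) / n
    x = [v - mean for v in signal]        # remove the DC component
    best_f, best_p = 0.0, -1.0
    for k in range(1, n // 2 + 1):
        f = k * fs / n
        if f < f_lo or f > f_hi:          # assumed plausible heart-rate band
            continue
        re = sum(v * math.cos(2 * math.pi * k * i / n) for i, v in enumerate(x))
        im = sum(v * math.sin(2 * math.pi * k * i / n) for i, v in enumerate(x))
        power = (re * re + im * im) / n
        if power > best_p:
            best_f, best_p = f, power
    return 60.0 * best_f
```

A 10-second synthetic pulse at 1.2 Hz sampled at 30 frames per second gives a peak at 1.2 Hz, that is, about 72 beats per minute. The frequency resolution of this estimate is `fs / n`, so longer clips give finer heart-rate steps.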

[0078] A detection method using the remote photoplethysmography signal and heart rate detection model comprises the following steps:

[0079] A sequence of facial video images is acquired; after preprocessing, the acquired facial video image sequence is input into the remote photoplethysmography signal detection model obtained by the construction method described in this invention to obtain the predicted remote photoplethysmography signal; the predicted signal is then bandpass filtered and converted to a power spectral density (PSD) to calculate the corresponding predicted heart rate value.

[0080] The remote photoplethysmography signal detection method is used for heart rate estimation applications.

[0081] It should be noted that this application compares the evaluation metrics of the proposed method with those of current mainstream methods on two datasets. The evaluation metrics are MAE (mean absolute error), RMSE (root mean square error), SD (standard deviation), and R (Pearson correlation coefficient). A smaller MAE indicates higher accuracy in predicting heart rate; a smaller RMSE indicates smaller error and a more stable model; a smaller SD indicates that the predicted heart rate is closer to the average heart rate value; and a higher R indicates higher correlation and better performance. Green, ICA, POS, and CHROM are traditional remote photoplethysmography methods; PhysNet, MSTmap+CVD, Meta-rPPG, and ETA-rPPGNet are deep learning-based remote photoplethysmography methods. First, the results on the private dataset are shown in Table 1, where the proposed method achieves excellent performance. Second, the results on the UBFC dataset, where the central tendency fluctuates significantly, are shown in Table 2. On the private dataset, the proposed method achieves the best results, with an MAE of 1.11, RMSE of 2.7, SD of 2.46, and R of 0.94. On the UBFC dataset, it also achieves the best results, with an MAE of 0.52, RMSE of 1.46, SD of 0.63, and R of 0.99. This indicates that the invention has better adaptability to the smaller UBFC dataset.
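The four metrics can be computed as in the sketch below. One detail is an assumption: SD is taken here as the population standard deviation of the per-sample error, a common convention in heart rate estimation work, since the source does not give its exact definition. The function name `evaluate` is hypothetical.

```python
import math


def evaluate(pred, true):
    """Return MAE, RMSE, SD of the error, and Pearson R for heart-rate lists."""
    n = len(pred)
    err = [p - t for p, t in zip(pred, true)]
    mae = sum(abs(e) for e in err) / n
    rmse = math.sqrt(sum(e * e for e in err) / n)
    mean_err = sum(err) / n
    sd = math.sqrt(sum((e - mean_err) ** 2 for e in err) / n)  # assumed definition
    mean_p, mean_t = sum(pred) / n, sum(true) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, true))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in true)
    r = cov / math.sqrt(var_p * var_t)
    return {"MAE": mae, "RMSE": rmse, "SD": sd, "R": r}
```

The Pearson coefficient is undefined when either list is constant, so a real evaluation would need to guard against a zero denominator.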

[0082] Table 1 shows a comparison of the evaluation metrics results of the method of this application with other methods on the private dataset, and Table 2 shows a comparison of the evaluation metrics results of the method of this application with other methods on the UBFC dataset.

[0083] Table 1: Comparison of evaluation metrics between the proposed method and other methods on private datasets

[0084]

[0085] Table 2: Comparison of evaluation metrics between the proposed method and other methods on the UBFC dataset.

[0086]

[0087] The above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present invention should be included within the protection scope of the present invention. The above description is only a preferred embodiment of the present invention. It should be noted that for those skilled in the art, several improvements and modifications can be made without departing from the technical principles of the present invention, and these improvements and modifications should also be considered within the protection scope of the present invention.

Claims

1. A method for constructing a remote photoplethysmography signal and heart rate detection model, characterized in that it includes the following steps: S1: Acquire face video image sequences and remote photoplethysmography signals; preprocess the acquired face video images to obtain preprocessed face video image sequences; use the preprocessed face video image sequences as the initial dataset and the acquired remote photoplethysmography signals as the label set. S2: Construct a remote photoplethysmography signal and heart rate detection model, and train it using the initial dataset as input and the label set as output. S3: Input the initial dataset into the trained remote photoplethysmography signal and heart rate detection model, and output the final remote photoplethysmography signal. The remote photoplethysmography signal and heart rate detection model includes a video preprocessing module, a physiological feature extraction module, and a signal estimation module. The physiological feature extraction module is used to extract physiological features, and the signal estimation module is used to merge signals. The video preprocessing module is used to crop the cheek region, which contains rich physiological information, from the face image sequence. The video preprocessing module separates the three RGB channels in the acquired face video image sequence and sets different weights for the three RGB channels. The physiological feature extraction module includes encoders and decoders, with three encoders and three decoders arranged in parallel. The encoders are used to extract the physiological features of the three RGB channels, and the decoders are used to recover the signals of the three RGB channels.
Each encoder is equipped with a multi-level feature fusion module to sequentially fuse the output features of each previous layer, and each decoder is equipped with a multi-scale enhancement module to achieve skip-layer connections and recover accurate remote photoplethysmography signals. The encoder has four layers and the decoder has three layers. The encoder consists of one 1×5×5 three-dimensional convolutional block and three 3×3×3 spatiotemporal blocks. Each convolutional block is followed by a normalization layer and a ReLU layer, and each convolutional block is also followed by a residual group, which is used to transmit physiological features lost during feature size changes, thereby improving the stability of the network. The multi-level feature fusion modules are set in the 2nd to 4th layers of the encoder, respectively, to fuse the output features of each previous layer in sequence and enhance the spatiotemporal correlation of each channel feature. The decoder has three deconvolutional layers with a kernel size of 3×3×3. A multi-scale enhancement module is added to each layer of the decoder to enable skip-layer connections and to strengthen learning of the facial physiological features from the previous layer, so as to recover accurate remote photoplethysmography signals.
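The channel separation and weighting performed by the video preprocessing module can be sketched as follows. This is a minimal illustration in plain Python, not the claimed implementation: the function name `split_and_weight_rgb`, the nested-list frame layout, and the weight values are all hypothetical, since the claim only states that the three channels receive different weights.

```python
def split_and_weight_rgb(frames, weights=(0.25, 0.5, 0.25)):
    """Split a clip into weighted R, G and B sequences.

    frames:  list of frames; each frame is a list of rows; each row is a
             list of (r, g, b) pixel tuples (assumed layout).
    weights: hypothetical per-channel weights for R, G and B.
    Returns a tuple of three clips, one per channel, with scalar pixels.
    """
    channels = []
    for index, weight in enumerate(weights):
        channel_clip = [
            [[weight * pixel[index] for pixel in row] for row in frame]
            for frame in frames
        ]
        channels.append(channel_clip)
    return tuple(channels)

# A tiny hypothetical clip: one frame, one row, two pixels.
clip = [[[(10, 20, 30), (40, 50, 60)]]]
red, green, blue = split_and_weight_rgb(clip)
```

Each of the three returned sequences would then be passed to its own encoder and decoder pair, in line with the parallel arrangement described in the claim.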

2. The method for constructing a remote photoplethysmography signal and heart rate detection model according to claim 1, characterized in that: the signal estimation module is used to merge the three RGB channel signals from the three decoders, and to filter the merged signal to output a remote photoplethysmography signal.
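The merge-then-filter behaviour of the signal estimation module can be sketched as follows. This is an illustration under stated assumptions: the claim does not specify how the three channel signals are merged or which filter is used, so the element-wise average in `merge_channels` and the moving-average kernel in `moving_average` are hypothetical choices.

```python
def merge_channels(red, green, blue):
    """Merge three per-channel signals by element-wise averaging (assumed rule)."""
    return [(r + g + b) / 3.0 for r, g, b in zip(red, green, blue)]

def moving_average(signal, kernel_size=3):
    """Smooth a 1D signal with a uniform kernel, shrinking the window at the edges."""
    half = kernel_size // 2
    smoothed = []
    for i in range(len(signal)):
        window = signal[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed

# Hypothetical decoder outputs for the R, G and B branches.
merged = merge_channels([1, 2, 3], [4, 5, 6], [7, 8, 9])
filtered = moving_average([0, 3, 0, 3, 0], kernel_size=3)
```

A learned one-dimensional convolution would replace the fixed uniform kernel with trainable weights; the sliding-window structure stays the same.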

3. The method for constructing a remote photoplethysmography signal and heart rate detection model according to claim 1, characterized in that: the multi-level feature fusion module is defined by a formula [formula not reproduced in the source text], whose terms are, respectively: the hidden feature of a given layer of the encoder; the feature enhanced through feature fusion; and the features fused by all multi-level feature fusion modules in the preceding layers of the encoder.

4. The method for constructing a remote photoplethysmography signal and heart rate detection model according to claim 1, characterized in that: the multi-scale enhancement module is defined as follows: the features from the previous layer undergo interpolation upsampling and are enhanced using the output features of each layer of the encoder's residual group, so as to achieve skip-layer connections, and the enhanced features are generated by a refinement unit. The formula [formula not reproduced in the source text] contains, respectively: the interpolation upsampling operator; the enhanced features; and a trainable refinement unit, each refinement unit being implemented as a residual group containing three residual blocks.

5. A method for detecting remote photoplethysmography signals and heart rate using a detection model, characterized in that the method includes the following steps: Step 1: Acquire a sequence of facial video images; Step 2: After preprocessing the acquired face video image sequence, input it into the remote photoplethysmography signal and heart rate detection model constructed by the method for constructing a remote photoplethysmography signal and heart rate detection model according to any one of claims 1-4, to obtain the predicted remote photoplethysmography signal; Step 3: The predicted remote photoplethysmography signal is sequentially subjected to bandpass filtering and an energy spectral density conversion algorithm to calculate the corresponding predicted heart rate value.

6. The remote photoplethysmography signal and heart rate detection model detection method according to claim 5, characterized in that: Step 3 also includes filtering with a one-dimensional convolutional filter to better adapt to waveform distortion, and calculating the power spectral density of the obtained remote photoplethysmography signal, wherein the frequency corresponding to the highest power is taken as the predicted heart rate value.
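The heart rate calculation of Step 3 can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function name `estimate_heart_rate`, the 0.7 Hz to 4.0 Hz pass band, the 30 frames per second sampling rate, and the synthetic test signal are all assumptions. Instead of an explicit bandpass filter, the sketch simply restricts the spectral search to the pass band, which has a similar effect for peak picking.

```python
import math

def estimate_heart_rate(signal, fs, low_hz=0.7, high_hz=4.0):
    """Return heart rate in beats per minute from the strongest in-band frequency.

    signal: predicted remote photoplethysmography samples
    fs:     sampling rate in Hz (the video frame rate)
    low_hz, high_hz: assumed pass band, about 42 to 240 beats per minute
    """
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]  # remove the DC component

    best_freq, best_power = 0.0, -1.0
    for k in range(1, n // 2 + 1):
        freq = k * fs / n
        if freq < low_hz or freq > high_hz:
            continue  # skip bins outside the assumed pass band
        # Direct DFT of bin k; power is the squared magnitude scaled by n.
        re = sum(x * math.cos(2 * math.pi * k * t / n) for t, x in enumerate(centered))
        im = sum(x * math.sin(2 * math.pi * k * t / n) for t, x in enumerate(centered))
        power = (re * re + im * im) / n
        if power > best_power:
            best_freq, best_power = freq, power
    return best_freq * 60.0

# Synthetic 10-second pulse at 1.2 Hz (72 beats per minute), sampled at 30 Hz.
fs = 30.0
pulse = [math.sin(2 * math.pi * 1.2 * t / fs) for t in range(300)]
bpm = estimate_heart_rate(pulse, fs)
```

The frequency resolution is `fs / n`, so a 10-second window resolves heart rate in steps of 6 beats per minute; longer windows or spectral interpolation would be needed for finer estimates.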

7. Application of the remote photoplethysmography signal and heart rate detection model detection method according to claim 5 to heart rate estimation.