Visible light and infrared image fusion method and comfort assessment system

By synchronizing and spatially aligning infrared and visible light images, extracting dual-modal temporal enhancement features and calculating associated weights, the temporal synchronization and spatial alignment problems in image fusion are solved, generating high-quality color fused images and achieving accurate assessment of thermal comfort.

CN122244619A · Pending · Publication Date: 2026-06-19 · XI'AN UNIVERSITY OF ARCHITECTURE AND TECHNOLOGY


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: XI'AN UNIVERSITY OF ARCHITECTURE AND TECHNOLOGY
Filing Date: 2026-04-07
Publication Date: 2026-06-19

Abstract

This application discloses a method for fusing visible light and infrared images and a comfort assessment system, relating to the field of image detection. Features are extracted from infrared and visible light image sequences and correlated to obtain bimodal temporal enhancement features. These features are then segmented into different spatial nodes, node features are extracted frame by frame, and the association weights of cross-modal node pairs are calculated. Based on the association weight matrix, the bimodal temporal enhancement features are spatially weighted. Feature fusion is then performed according to the proportion of different modalities in the bimodal temporal enhancement features to generate a fused image. Finally, a color fused image is generated and a comfort score is evaluated. By capturing the temporal and spatial correlation characteristics of bimodal sequences and performing spatial weighting and feature fusion on the bimodal temporal enhancement features, this scheme effectively solves the problem of collaborative fusion of bimodal data in the temporal, spatial, and semantic dimensions. This avoids semantic misalignment and improves the accuracy and stability of the assessment.

Description

Technical Field

[0001] This application relates to the field of image fusion, and in particular to a visible light and infrared image fusion method and a comfort assessment system.

Background Technology

[0002] With the rise of smart home, building energy conservation, and human health monitoring technologies, thermal comfort assessment has become one of the core needs in environmental control and health monitoring. The core of thermal comfort assessment is to quantify the human body's perception of thermal comfort in the environment by sensing the distribution of ambient temperature and human behavioral characteristics, providing a basis for decision-making in scenarios such as intelligent control of air conditioning systems and optimization of indoor environments. Current technology has gradually shifted from single-sensor monitoring to multimodal image fusion. Infrared images can capture precise temperature distribution information, while visible light images can acquire rich details of human behavior and spatial structure. Combining the two is expected to achieve complementary advantages.

[0003] However, current technologies face multiple challenges in bimodal fusion, severely limiting the accuracy and practicality of assessments. In single-modal assessments, methods relying solely on infrared image temperature data ignore the impact of dynamic changes in human behavior on local temperature. For example, the heat dissipation effect caused by human movement cannot be effectively correlated, leading to a disconnect between assessment results and actual thermal perception, making it difficult to reflect the true comfort state in complex scenarios. At the bimodal fusion level, spatial alignment issues are particularly prominent. Due to the differences in imaging principles between infrared and visible light, original images often exhibit pixel-level positional deviations. Without precise calibration, misalignment of human contours or inaccurate region matching will occur after fusion, resulting in fundamental distortion of feature extraction. Simultaneously, the lack of a time synchronization mechanism hinders sequence processing in dynamic scenes. Inconsistent frame sequences in bimodal images cause temporal feature breaks, resulting in drastic fluctuations in assessment results during continuous monitoring and a significant decrease in stability. More critically, insufficient cross-modal semantic association modeling leads to a lack of effective correspondence between infrared thermal regions and visible light behavioral regions. For example, the human action region and the heat distribution region fail to establish a logical connection, causing semantic misalignment in the fused image, directly affecting the reliability of the thermal comfort score. In addition, temperature visualization solutions have obvious defects. 
Traditional color mapping methods struggle to achieve an accurate correspondence between temperature and color and are prone to color distortion or oversaturation, making it impossible for users to intuitively interpret the distribution of the environmental thermal state.

[0004] To address the aforementioned issues, the relevant technologies urgently need improvement.

Summary of the Invention

[0005] This application provides a visible light and infrared image fusion method and a comfort assessment system, which effectively solves the problem of collaborative fusion of dual-modal data in the time, spatial and semantic dimensions, improves the accuracy and reliability of fused images, and thus provides a precise basis for comfort assessment.

[0006] In one aspect, this application provides a method for fusing visible light and infrared images, the method comprising:
Feature extraction and feature association are performed on a dual-modal sequence containing an infrared image sequence and a visible light image sequence to obtain dual-modal temporal enhancement features; the dual-modal temporal enhancement features include temporal action features extracted from visible light images and temporal thermal features extracted from infrared images;
The dual-modal temporal enhancement features are segmented into different spatial nodes, the node features of each node are extracted frame by frame, the association weights of cross-modal node pairs are calculated based on the node features, and the association weights of all cross-modal node pairs are combined to form an association weight matrix;
The dual-modal temporal enhancement features are spatially weighted based on the association weight matrix, and feature fusion is performed according to the proportion of different modalities in the dual-modal temporal enhancement features to generate a fused image; the fused image is used to generate a color fused image and to evaluate comfort scores.

[0007] Specifically, the infrared image sequence and the visible light image sequence are acquired based on an infrared camera and a visible light camera, respectively, and the field of view of the two cameras is consistent. The dual-modal sequence is obtained by performing frame synchronization sorting and spatial alignment between the infrared image sequence and the visible light image sequence.

[0008] Specifically, the frame synchronization and sorting operation includes: the infrared images and the visible light images are sorted in ascending order according to the sampling timestamp; the frames within the common time interval of the two sequences are extracted and frame synchronization matching is performed.
The spatial alignment operation includes: the SIFT algorithm is used to extract feature points from the infrared and visible light images, and the spatial transformation matrix between the two modes is calculated based on feature point matching; geometric correction and alignment are performed on the infrared or visible light image based on the spatial transformation matrix.

[0009] Specifically, the step of performing feature extraction and feature association on the aligned bimodal sequence to capture the temporal and spatial correlation characteristics of the sequence and obtain bimodal temporal enhancement features includes: high-dimensional features are extracted frame by frame from the aligned infrared image sequence to obtain an action feature map; high-dimensional features are extracted frame by frame from the aligned visible light image sequence to obtain a thermal feature map; temporal-spatial correlation modeling is performed on the action feature map and the thermal feature map, and the correlation features of the bimodal sequence in time and space are captured by feature dimension rearrangement and matching to obtain the bimodal temporal enhancement features.

[0010] Specifically, the temporal action features and the temporal heat features are divided into corresponding spatial nodes according to the grid specifications, and the action node features and heat node features of each spatial node region are extracted frame by frame. The step of calculating the association weights of cross-modal node pairs based on the node features includes: Two spatial nodes at the same position in the features of two modal nodes are selected as the cross-modal node pair; The action node features and heat node features in the cross-modal node pair are concatenated and nonlinearly transformed, and the transformation result is used as the association weight between the cross-modal node pairs.

[0011] Specifically, the spatial weighting of the dual-modal temporal enhancement features based on the correlation weight matrix includes: The dual-modal temporal enhancement features are concatenated and nonlinearly transformed to generate a spatial attention mask corresponding to the spatial location. The mean weight of the association weight matrix is calculated, and the mean weight is used as a semantic prior and multiplied by the spatial attention mask to obtain an adjustment mask that incorporates semantic information. The adjustment mask is spatially weighted with the dual-modal temporal enhancement features to obtain the spatial features; the spatial features include action spatial features and heat spatial features.

[0012] Specifically, the step of feature fusion based on the proportion of different modalities in the dual-modal temporal enhancement features and generating a fused image includes: the modality ratio of the different modalities in the final fusion of the dual-modal temporal enhancement features is calculated, weighted fusion of the action spatial features and the heat spatial features is performed according to the modality ratio, and the fused features are output; the fused features are decoded to generate the fused image frame by frame; the fused image integrates the temperature distribution information of the infrared mode with the behavior and spatial structure information of the visible light mode.

[0013] In another aspect, this application provides a comfort assessment system based on visible light and infrared image fusion. The system includes a forward processing module and a task-oriented post-processing module. The forward processing module acquires a dual-modal sequence containing an infrared image sequence and a visible light image sequence, and obtains undecoded fusion features based on the visible light and infrared image fusion method. The task-oriented post-processing module performs the following steps based on the undecoded fusion features:
The fusion features are decoded frame by frame to generate a fused image, and MAGMA color mapping is performed based on the grayscale values of the fused image and a preset temperature range to generate a color fused image;
Feature extraction is performed on each fused image in the sequence to obtain a global feature vector containing global information;
The global feature vector is subjected to dimensionality expansion, regularization, and normalization operations, and the normalized output is determined as the thermal comfort score;
N consecutive frames of thermal comfort scores are extracted from the thermal comfort score sequence, and their average is determined as the comfort score; the comfort score is used to provide feedback and adjust the operating parameters of the terminal device.

[0014] Specifically, the system also includes an execution module, and the terminal device is a temperature control device. If the comfort score exceeds the maximum score threshold, the execution module generates and issues a command to control the terminal device to lower the temperature; if the comfort score is lower than the minimum score threshold, the execution module generates and issues a command to control the terminal device to raise the temperature; if the comfort score is within the score threshold range, the execution module keeps the operating parameters of the terminal device unchanged.
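The averaging and threshold logic described in [0013] and [0014] can be illustrated with a minimal Python sketch. The function names, the command strings, and the choice of the most recent N frames are assumptions made for illustration; the application only states that N consecutive scores are averaged and compared against a minimum and a maximum threshold.

```python
from statistics import mean


def comfort_score(frame_scores, n):
    """Average n consecutive per-frame thermal comfort scores.

    Assumption: the most recent n scores are used; the source only
    requires that the n frames be consecutive.
    """
    if n <= 0 or len(frame_scores) < n:
        raise ValueError("need at least n frame scores")
    return mean(frame_scores[-n:])


def control_command(score, min_threshold, max_threshold):
    """Map a comfort score to a command for the temperature control device.

    The returned strings are hypothetical placeholders for real commands.
    """
    if score > max_threshold:
        return "lower_temperature"
    if score < min_threshold:
        return "raise_temperature"
    return "hold"  # score within the threshold range: keep parameters unchanged
```

For instance, with hypothetical thresholds of 0.3 and 0.7, an averaged score of 0.9 would yield the command to lower the temperature, while 0.5 would leave the operating parameters unchanged.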

[0015] The beneficial effects of the technical solution provided in this application include at least the following: This application captures the temporal and spatial correlation characteristics of bimodal sequences to obtain bimodal temporal enhancement features, and performs spatial weighting and feature fusion based on the correlation weight matrix, which effectively solves the problem of collaborative fusion of bimodal data in the temporal, spatial and semantic dimensions, avoids semantic misalignment, and improves the accuracy and stability of evaluation.

Attached Figure Description

[0016] Figure 1 is a flowchart of the visible light and infrared image fusion method provided in the embodiments of this application;
Figure 2 is a schematic diagram of a comfort assessment system based on visible light and infrared image fusion provided in an embodiment of this application;
Figure 3 is a structure diagram of the network model for the infrared image fusion method and comfort assessment;
Figure 4 is a schematic diagram of the data input and alignment module provided in an embodiment of this application;
Figure 5 is a schematic diagram of the structure of the initial feature extraction module provided in an embodiment of this application;
Figure 6 is a schematic diagram of the action-heat correlation module provided in an embodiment of this application;
Figure 7 is a schematic diagram of the structure of the dual-modal attention fusion module provided in an embodiment of this application;
Figure 8 is a schematic diagram of the structure of the task-oriented post-processing module provided in an embodiment of this application;
Figure 9 illustrates a possible form of the color fused image provided in an embodiment of this application.

Detailed Implementation

[0017] To make the objectives, technical solutions, and advantages of this application clearer, the embodiments of this application will be described in further detail below with reference to the accompanying drawings.

[0018] In this article, "multiple" refers to two or more. "And / or" describes the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A alone, A and B simultaneously, or B alone. The character " / " generally indicates that the preceding and following related objects have an "or" relationship.

[0019] Traditional thermal comfort assessment technologies fall short in one of two ways. Some rely solely on single-modal data and lack complementary information such as human behavior and spatial structure, so their assessment results are not accurate enough. Others fuse two modalities but fail to address the temporal and spatial alignment of the image sequences and do not model the semantic relationship between infrared thermal information and visible light behavior; the fused images then suffer from spatial or semantic misalignment, which degrades the accuracy and stability of thermal comfort assessment. The visualization effect also needs improvement.

[0020] To address this, this application provides a method for fusing visible light and infrared images. As shown in Figure 1, the solution includes the following steps:
S1. Extract and associate features from the dual-modal sequence containing infrared image sequences and visible light image sequences to obtain dual-modal temporal enhancement features.
In this embodiment, the dual-modal sequence refers to a data stream composed of both infrared image sequences and visible light image sequences. The infrared image sequence records the temperature distribution information of the scene, while the visible light image sequence captures visual details, human behavior, and spatial structure information within the scene.

[0021] During the acquisition of dual-modal sequences, inconsistencies in the viewing angles of the infrared and visible light cameras, or the lack of frame synchronization and spatial alignment, can lead to spatial position deviations and temporal misalignments, affecting the accuracy and stability of subsequent feature association and fusion. Therefore, in this embodiment, the infrared image sequence and the visible light image sequence are acquired using the infrared camera and the visible light camera, respectively, with the viewing angles of both cameras remaining consistent and the timing also aligned. The infrared camera is specifically used to capture the thermal radiation information of objects, forming an infrared image sequence; the visible light camera is used to capture the reflected light information of objects within the visible spectrum, forming a visible light image sequence.

[0022] The bimodal temporal enhancement features include temporal action features extracted from visible light images and temporal thermal features extracted from infrared images. Specifically, independent feature extraction networks can be used to process the infrared image sequence and the visible light image sequence, extracting deep features from each frame of the image, capturing the temporal and spatial correlation characteristics of the sequence, and then simply concatenating these features of the corresponding modalities to form the preliminary bimodal temporal enhancement features.

[0023] S2. Divide the dual-modal temporal enhancement features into different spatial nodes, extract the node features of each node frame by frame, and calculate the association weight of cross-modal node pairs based on the node features. Specifically, the bimodal temporal enhancement feature map can be divided into several regular grid regions. For example, the feature map can be uniformly divided into 2x2 or 3x3 regions, with each region representing a spatial node. Then, average pooling or max pooling can be performed on the features within each spatial node to obtain the node features. To calculate the association weights of cross-modal node pairs, a simple similarity metric can be used. For example, the cosine similarity or Euclidean distance between the infrared node features and the visible light node features at the same spatial location can be calculated, and this similarity or distance value can be used as the association weight. The calculated association weights are then organized into a matrix, namely the association weight matrix.
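As an illustration of step S2, the following Python sketch average-pools two feature maps into grid nodes and uses cosine similarity at matching grid positions as the association weight. It is a sketch under stated assumptions, not the embodiment's network: feature maps are represented as nested lists of shape C x H x W, H and W are assumed divisible by the grid size, and all function names are hypothetical.

```python
import math


def grid_node_features(fmap, grid):
    """Average-pool a C x H x W feature map into grid x grid spatial nodes.

    Returns a dict mapping (row, col) to a list of C pooled values.
    Assumes H and W are divisible by grid.
    """
    channels, height, width = len(fmap), len(fmap[0]), len(fmap[0][0])
    cell_h, cell_w = height // grid, width // grid
    nodes = {}
    for r in range(grid):
        for c in range(grid):
            vec = []
            for ch in range(channels):
                total = 0.0
                for y in range(r * cell_h, (r + 1) * cell_h):
                    for x in range(c * cell_w, (c + 1) * cell_w):
                        total += fmap[ch][y][x]
                vec.append(total / (cell_h * cell_w))
            nodes[(r, c)] = vec
    return nodes


def cosine_similarity(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def association_weight_matrix(action_fmap, heat_fmap, grid=2):
    """Build a grid x grid matrix of weights for same-position node pairs."""
    action_nodes = grid_node_features(action_fmap, grid)
    heat_nodes = grid_node_features(heat_fmap, grid)
    return [[cosine_similarity(action_nodes[(r, c)], heat_nodes[(r, c)])
             for c in range(grid)]
            for r in range(grid)]
```

Cosine similarity is only one of the metrics named above; Euclidean distance could be substituted in `association_weight_matrix` without changing the grid logic.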

[0024] S3. Spatially weight the bimodal temporal enhancement features based on the correlation weight matrix, and perform feature fusion according to the proportion of different modes in the bimodal temporal enhancement features, and generate a fused image; the fused image is used to convert and generate a color fused image, and to evaluate the comfort score.

[0025] Building upon this, spatial weighting can be achieved by simply multiplying each weight value in the association weight matrix element-wise with the corresponding bimodal temporal enhancement feature. During the feature fusion stage, a modality ratio can be calculated; for example, the features of the infrared and visible light modes can be weighted and summed at a 1:1 ratio or based on the calculated modality ratio. Subsequently, the fused features are processed through a decoder network, for example, using a series of deconvolutional or upsampling layers to progressively restore the spatial resolution of the image, ultimately generating the fused image.
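The element-wise weighting and ratio-based fusion in this simple variant can be sketched as follows. This is an illustrative Python sketch only: feature maps are single-channel H x W nested lists, the node weights are broadcast over the pixels of each grid cell, `action_ratio` is a hypothetical parameter standing in for the modality ratio, and the decoder stage is not shown.

```python
def apply_node_weights(fmap, weights):
    """Scale each value of an H x W map by the weight of its grid node.

    Assumes H and W are divisible by the grid size len(weights).
    """
    grid = len(weights)
    height, width = len(fmap), len(fmap[0])
    cell_h, cell_w = height // grid, width // grid
    return [[fmap[y][x] * weights[y // cell_h][x // cell_w]
             for x in range(width)]
            for y in range(height)]


def fuse(action_fmap, heat_fmap, weights, action_ratio=0.5):
    """Weight both modalities spatially, then blend them by modality ratio.

    action_ratio=0.5 corresponds to the 1:1 weighted sum mentioned above.
    """
    weighted_action = apply_node_weights(action_fmap, weights)
    weighted_heat = apply_node_weights(heat_fmap, weights)
    return [[action_ratio * a + (1.0 - action_ratio) * h
             for a, h in zip(row_a, row_h)]
            for row_a, row_h in zip(weighted_action, weighted_heat)]
```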

[0026] The resulting fused image is used to generate a color fused image and to evaluate comfort scores. Specifically, the pixel values of the fused image can be directly mapped to a preset grayscale or pseudo-color range to generate a color fused image. For comfort score evaluation, a comprehensive comfort score can be calculated based on the overall brightness, contrast, or certain statistical characteristics of the fused image using a simple linear function, normalization, or lookup table method.

[0027] In summary, this application effectively addresses the temporal misalignment and insufficient spatial alignment issues inherent in traditional methods by extracting spatiotemporal correlation features from dual-modal sequences of visible and infrared light. Through refined spatial node association weight calculation and adaptive modal proportion fusion, semantic misalignment is avoided, key region features are strengthened, and the quality of the fused image and the accuracy of thermal comfort assessment are improved. Therefore, it is possible to more accurately quantify human thermal comfort perception of the environment, providing a reliable basis for intelligent environmental regulation.

[0028] As can be seen from the above embodiments, the consistency of the viewing angle range is the foundation for achieving accurate spatial alignment. It can effectively avoid image content deviations caused by different viewing angles, thus providing favorable conditions for subsequent image registration and fusion. Temporal alignment is the key to achieving dual-modal image frame matching and fusion. This is crucial for capturing the temporal correlation characteristics in dynamic scenes and can effectively avoid inaccurate fusion information caused by temporal misalignment. In other words, it is necessary to perform frame synchronization sorting and spatial alignment between the infrared image sequence and the visible light image sequence to obtain the dual-modal sequence.

[0029] In one possible implementation, frame synchronization sorting can sort infrared images and visible light images in ascending order according to sampling timestamps; extract frame images within the common time interval of the two sequences after sorting, and perform frame synchronization matching.

[0030] For example, infrared and visible light cameras typically record a precise sampling timestamp when acquiring each frame of an image. By reading these timestamps, the image sequences of their respective modalities can be sorted in ascending order of time. Based on this, frames within the common time interval of the two sorted sequences are extracted and frame synchronization matching is performed. In particular, it is necessary to ensure strict correspondence in timestamps and to discard missing or dropped frames.
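The frame synchronization sorting described in [0029] and [0030] can be sketched in Python as below. The representation of a frame as a `(timestamp, frame)` tuple and the `tolerance` parameter are assumptions made for illustration; the application requires strict timestamp correspondence but does not state how close two timestamps must be to count as matching.

```python
def synchronize_frames(ir_frames, vis_frames, tolerance):
    """Pair infrared and visible light frames by sampling timestamp.

    Each input is a list of (timestamp, frame) tuples. Frames outside the
    common time interval, or without a partner within `tolerance`, are
    discarded. Returns a list of (ir_frame, vis_frame) pairs.
    """
    ir_sorted = sorted(ir_frames, key=lambda f: f[0])
    vis_sorted = sorted(vis_frames, key=lambda f: f[0])
    if not ir_sorted or not vis_sorted:
        return []

    # Common time interval of the two sorted sequences.
    start = max(ir_sorted[0][0], vis_sorted[0][0])
    end = min(ir_sorted[-1][0], vis_sorted[-1][0])
    ir_common = [f for f in ir_sorted if start <= f[0] <= end]
    vis_common = [f for f in vis_sorted if start <= f[0] <= end]

    pairs = []
    j = 0
    for ts, ir in ir_common:
        if j >= len(vis_common):
            break
        # Advance to the visible light frame closest in time to ts.
        while (j + 1 < len(vis_common)
               and abs(vis_common[j + 1][0] - ts) <= abs(vis_common[j][0] - ts)):
            j += 1
        if abs(vis_common[j][0] - ts) <= tolerance:
            pairs.append((ir, vis_common[j][1]))
            j += 1  # each visible light frame is used at most once
    return pairs
```

Frames that find no partner within the tolerance are simply skipped, which corresponds to discarding missing or dropped frames.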

[0031] Spatial alignment can be performed by using the SIFT algorithm to extract feature points from infrared and visible light images, and then calculating the spatial transformation matrix between the two modes based on feature point matching. After that, the infrared or visible light image is geometrically corrected and aligned based on the spatial transformation matrix.

[0032] The SIFT algorithm can detect key points in images and generate descriptors robust to changes in scale, rotation, and brightness. SIFT feature points and their descriptors are extracted from infrared and visible light images respectively, laying the foundation for subsequent matching. After mapping corresponding feature points between two frames, the spatial transformation matrix can be calculated. This matrix describes the geometric mapping relationship from one image coordinate system to another; common spatial transformations include affine transformations and perspective transformations.
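To make the transformation-matrix step concrete, the sketch below estimates an affine matrix from point pairs that are assumed to have already been matched (for example by SIFT descriptors, which are not implemented here). It solves the least-squares normal equations in pure Python; the function names are hypothetical, and an affine model is chosen only because it is one of the common transformations mentioned above. A perspective transformation would need a different solver.

```python
def solve_3x3(m, rhs):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    a = [row[:] + [r] for row, r in zip(m, rhs)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, 3):
            factor = a[r][col] / a[col][col]
            for k in range(col, 4):
                a[r][k] -= factor * a[col][k]
    x = [0.0] * 3
    for i in (2, 1, 0):
        x[i] = (a[i][3] - sum(a[i][k] * x[k] for k in range(i + 1, 3))) / a[i][i]
    return x


def estimate_affine(src_pts, dst_pts):
    """Least-squares 2x3 affine matrix mapping src_pts onto dst_pts."""
    if len(src_pts) < 3 or len(src_pts) != len(dst_pts):
        raise ValueError("need at least three matched point pairs")
    rows = [(x, y, 1.0) for x, y in src_pts]
    # Normal equations (A^T A) p = A^T b, solved once per output coordinate.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atu = [sum(r[i] * d[0] for r, d in zip(rows, dst_pts)) for i in range(3)]
    atv = [sum(r[i] * d[1] for r, d in zip(rows, dst_pts)) for i in range(3)]
    return [solve_3x3(ata, atu), solve_3x3(ata, atv)]


def apply_affine(matrix, point):
    """Map a single (x, y) point through a 2x3 affine matrix."""
    x, y = point
    return (matrix[0][0] * x + matrix[0][1] * y + matrix[0][2],
            matrix[1][0] * x + matrix[1][1] * y + matrix[1][2])
```

In practice, feature matches between infrared and visible light images often contain outliers, so a robust estimator would typically wrap a step like this; that refinement is outside what the text above describes.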

[0033] In summary, the frame synchronization and sorting operation ensures the temporal consistency between the infrared and visible light image sequences, eliminating temporal misalignment caused by differences in acquisition devices or processing timing. This provides accurate and synchronized input for subsequent temporal feature extraction and fusion. The spatial alignment operation, through precise calculation and application of the spatial transformation matrix, eliminates spatial deviations between the infrared and visible light images. This ensures that objects in the same scene can accurately correspond in different modal images, avoiding "spatial misalignment" during fusion. This, in turn, ensures the accuracy and stability of the fused image, thereby improving the precision of comfort assessment.

[0034] In some embodiments, the process of extracting bimodal temporal enhancement features based on aligned bimodal sequences can be implemented through the following scheme:
1. Extract high-dimensional features frame by frame from the aligned infrared image sequence to obtain action feature maps; extract high-dimensional features frame by frame from the aligned visible light image sequence to obtain thermal feature maps.
In the process of extracting high-dimensional features frame by frame from aligned infrared image sequences to obtain action feature maps, high-dimensional feature extraction aims to extract deep, abstract features from infrared image sequences. These features can characterize dynamic information or potential "action" patterns in the images. Even though infrared images mainly reflect temperature distribution, their changes over time can indirectly reflect action.

[0035] In the process of extracting high-dimensional features frame by frame from the aligned visible light image sequence to obtain thermal feature maps, high-dimensional feature extraction aims to extract high-dimensional features from the visible light image sequence. Although these features are not directly temperature, they can contain visual cues related to heat, such as human posture, activity area, and changes in ambient light. When these cues are combined with infrared information, they help to understand the semantics of "heat" more comprehensively.

[0036] 2. Perform temporal-spatial correlation modeling on the action feature map and the heat map, and capture the temporal and spatial correlation features of the bimodal sequence through feature dimension rearrangement and matching to obtain bimodal temporal enhancement features.

[0037] This modeling aims to establish deep temporal and spatial connections between action feature maps extracted from infrared image sequences and thermal feature maps extracted from visible light image sequences, thereby capturing the collaborative changes of bimodal data in dynamic scenes. Specifically, a cross-modal attention mechanism can be employed, in which attention weights are used to aggregate information from the thermal feature maps in a weighted manner; alternatively, the spatial locations or semantic regions in the action and thermal feature maps can be abstracted as graph nodes to construct a bimodal graph structure. Information propagation and aggregation can then occur on this graph structure, enabling the modeling of temporal and spatial relationships between different modal features.

[0038] Feature dimensional rearrangement and matching aims to adjust the dimensional structure of different modal features so that they can be effectively matched and fused, thereby capturing the temporal and spatial correlation features of bimodal sequences.

[0039] This refined correlation modeling avoids the semantic misalignment problems caused by simple splicing or weighting in traditional methods, ensuring that the temperature distribution information of the infrared mode and the behavioral and spatial structure information of the visible light mode can be accurately corresponded and fused. The resulting bimodal temporal enhancement features not only contain rich temporal and spatial information, but also establish clear semantic relationships between modes, providing a more accurate and robust foundation for subsequent feature fusion and comfort assessment.

[0040] In the above embodiments, if no specific mechanism ensures that node features are extracted precisely and weights are calculated accurately, semantic misalignment between behavioral regions and heat zones may occur, affecting the quality of the fused image and the accuracy of subsequent evaluation.

[0041] To this end, this embodiment divides the temporal action features and temporal heat features into corresponding spatial nodes according to a grid size (e.g., 4×4 grid), and extracts the action node features and heat node features of each spatial node region frame by frame. This grid division method can decompose complex scenes into smaller, easier-to-analyze local regions, thereby capturing more accurate dynamic behavior information and temperature distribution information in each local region.

[0042] Based on this, the process of calculating association weights can be summarized as follows: A. Select two spatial nodes at the same position in the features of two modal nodes as cross-modal node pairs; B. Perform feature concatenation and nonlinear transformation on the action node features and heat node features in the cross-modal node pair, and use the transformation result as the association weight between the cross-modal node pairs.

[0043] First, two spatial nodes at the same location in the two modal node features are selected as the cross-modal node pair. Since the infrared and visible light image sequences have already undergone frame synchronization sorting and spatial alignment in previous processing, nodes located at the same grid coordinates (e.g., the top-left node of the visible light feature map and the top-left node of the infrared feature map) are considered to be nodes at the "same location." This direct node pairing method effectively utilizes the pre-completed spatial alignment, ensuring the correspondence of the selected node pairs in physical space, thereby avoiding semantic mismatches caused by spatial misalignment.

[0044] Concatenating cross-modal node features combines local feature vectors from different modalities along a certain dimension. For example, action node feature vectors and heat node feature vectors can be connected along the channel dimension to form a longer, comprehensive feature vector containing bimodal information. Nonlinear transformation refers to applying one or more nonlinear functions to the concatenated feature vector, such as processing it through multiple convolutional layers with activation functions. This nonlinear transformation can learn and represent high-level features of cross-modal interactions, capturing the complex nonlinear relationship between actions and heat, thereby generating association weights that can accurately quantify the "behavior-heat zone" correspondence. Finally, the transformation results are normalized or quantified into specific numerical values as association weights. All association weights together form an association weight matrix; this matrix quantifies the semantic correspondence of "behavior-heat zone" at the spatial node level.
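The concatenation and nonlinear transformation of a cross-modal node pair can be sketched as a small two-layer perceptron with a sigmoid output. This is a stand-in chosen for brevity: the text above mentions convolutional layers with activation functions, and the parameters `hidden_w`, `hidden_b`, `out_w`, and `out_b` are hypothetical values that a real model would learn during training.

```python
import math


def sigmoid(z):
    """Logistic function mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def association_weight(action_vec, heat_vec, hidden_w, hidden_b, out_w, out_b):
    """Concatenate two node feature vectors and map them to a weight in (0, 1).

    hidden_w is a list of rows, one per hidden unit, each as long as the
    concatenated vector; out_w has one entry per hidden unit.
    """
    joint = list(action_vec) + list(heat_vec)  # feature concatenation
    hidden = [max(0.0, sum(w * x for w, x in zip(row, joint)) + b)  # ReLU layer
              for row, b in zip(hidden_w, hidden_b)]
    return sigmoid(sum(w * h for w, h in zip(out_w, hidden)) + out_b)
```

Computing this value for every same-position node pair and arranging the results by grid coordinates yields the association weight matrix; the sigmoid plays the role of the normalization mentioned above.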

[0045] Through the above technical solution, this application can accurately calculate the semantic association weight between behavioral regions and heat zones by finely segmenting the feature space, extracting local features frame by frame, and performing feature concatenation and nonlinear transformation on cross-modal node pairs based on spatially aligned features. This effectively solves the semantic misalignment problem existing in traditional technologies, thereby ensuring semantic accuracy in the subsequent fusion process.

[0046] Although the above embodiments propose spatial weighting based on the association weight matrix to fuse bimodal temporal features, in this process, the lack of effective integration of semantic prior information may cause the spatial attention mechanism to fail to fully capture cross-modal semantic associations, thereby causing semantic misalignment of the fused features and affecting the accuracy of subsequent image generation and evaluation.

[0047] Therefore, in some possible embodiments, this application details a specific scheme for spatially weighting the bimodal temporal enhancement features based on the association weight matrix, as follows: A. Perform feature concatenation and nonlinear transformation on the bimodal temporal enhancement features to generate a spatial attention mask corresponding to each spatial location. This step integrates feature information from both modalities and, through the transformation, generates an attention map that focuses on key spatial regions and suppresses irrelevant interference. For example, the temporal action features and temporal heat features can be concatenated along the channel dimension and passed through one or more convolutional or fully connected layers for nonlinear transformation, with a sigmoid activation producing an attention weight map with values between 0 and 1.

[0048] B. Calculate the mean weight of the association weight matrix and use it as a semantic prior; multiply it with the spatial attention mask to obtain an adjusted mask that incorporates semantic information. This semantic prior draws on the overall statistical properties of the cross-modal node association weights and represents global association information at the semantic level, such as the overall strength of the "behavior-heat zone" correspondence. For example, the global mean can be calculated by summing all elements of the matrix and dividing by the total number of elements.

[0049] The adjusted mask is obtained by multiplying the weight mean (semantic prior) with the spatial attention mask. This operation achieves a deep fusion of semantic knowledge and spatial attention, generating an adjusted mask that considers both local spatial saliency and global semantic understanding, thereby strengthening the weight allocation of features in semantically relevant regions.

[0050] C. Spatially weight the adjusted mask with the dual-modal temporal enhancement features to obtain spatial features; the spatial features include action spatial features and heat spatial features.

[0051] This step specifically adjusts the contributions of action space features and heat space features to ensure that the fused features are consistent with their semantic association in spatial location, laying the foundation for the subsequent generation of high-quality fused images.
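Steps A to C above can be sketched for a single frame as follows. This is a minimal illustration, assuming the spatial attention mask has already been produced by step A; the function name `spatially_weight` and the random inputs are hypothetical.

```python
import numpy as np

def spatially_weight(action_feat, heat_feat, attn_mask, corr_weight):
    """Apply the semantic prior (mean association weight) to the attention mask,
    then weight both modalities with the adjusted mask.

    action_feat, heat_feat: [C, H, W]; attn_mask: [H, W]; corr_weight: [N, N].
    """
    prior = corr_weight.mean()          # step B: scalar semantic prior
    adjusted = prior * attn_mask        # step B: adjusted mask, [H, W]
    # Step C: the [H, W] mask broadcasts over the channel dimension.
    return action_feat * adjusted, heat_feat * adjusted, adjusted

# Hypothetical inputs for one frame.
rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
action_feat = rng.standard_normal((C, H, W))
heat_feat = rng.standard_normal((C, H, W))
attn_mask = rng.uniform(0.0, 1.0, (H, W))       # stands in for the sigmoid output of step A
corr_weight = rng.uniform(0.0, 1.0, (16, 16))   # stands in for the association weight matrix

action_sp, heat_sp, adjusted_mask = spatially_weight(
    action_feat, heat_feat, attn_mask, corr_weight)
```

Because the prior is a single scalar between 0 and 1, it rescales the whole mask uniformly; the spatial pattern still comes from the attention mask itself.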

[0052] In some embodiments, the process of fusing features of dual-modal temporal enhancement features to generate a fused image can be achieved through the following steps: 1. Calculate the modality ratio of different modalities in the final fusion of the bimodal temporal enhancement features, and perform weighted fusion of action space features and heat space features based on the modality ratio to output the fused features; The purpose of this step is to determine the relative importance or contribution of infrared and visible light modes in the final fusion result, which can be weighted based on the proportion of dual-modal features in the global features.

[0053] 2. Decode the fusion features and generate a fused image frame by frame; the fused image integrates the temperature distribution information of the infrared mode with the behavior and spatial structure information of the visible light mode.

[0054] The purpose of decoding is to transform abstract, high-dimensional fused features back into a pixel-level representation suitable for image generation. This is typically the inverse process of feature extraction, such as using deconvolution operations for the decoding output. The decoder receives the fused features of each frame as input and independently generates the corresponding fused image. The fused image integrates the temperature distribution information of the infrared modality with the behavioral and spatial structure information of the visible light modality. This means that the generated fused image visually and semantically combines the thermal data of the infrared image with the visual details (such as motion, shape, and structure) of the visible light image. For example, in the color fusion image shown in Figure 9, the human body outline and movement details are clearly visible, while the temperature distribution is intuitively reflected through the color depth or color changes in specific areas.

[0055] In summary, this application addresses the information imbalance problem in feature fusion by introducing modal proportion calculation and a weighted fusion mechanism, ensuring that the fused image accurately integrates complementary information from infrared and visible light, thereby avoiding semantic misalignment. The fused image combines temperature distribution information from the infrared mode with behavioral and spatial structure information from the visible light mode, ensuring that the output image simultaneously contains temperature and behavioral details, improving the reliability and intuitiveness of thermal comfort assessment. This adaptive modal fusion strategy allows the system to dynamically adjust the weights of infrared and visible light information according to actual needs in different scenarios, thereby generating a fused image with greater semantic accuracy and visual expressiveness, significantly improving the accuracy of thermal comfort assessment.
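The modality-ratio weighting summarized above can be illustrated with a two-way softmax. In this sketch the two logits are hypothetical placeholders for whatever the modality fusion weight component would output; only the normalization and the weighted sum are shown.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def fuse(action_sp, heat_sp, logits):
    """Blend the two spatially weighted feature maps by their modality ratio."""
    ratio = softmax(np.asarray(logits, dtype=float))   # [visible, infrared], sums to 1
    fused = ratio[0] * action_sp + ratio[1] * heat_sp
    return fused, ratio

# Toy features: constant maps make the blend easy to read.
action_sp = np.full((2, 4, 4), 1.0)
heat_sp = np.full((2, 4, 4), 3.0)
fused, ratio = fuse(action_sp, heat_sp, logits=[0.0, 0.0])
```

With equal logits each modality contributes half, so every fused value here is the midpoint of the two inputs; unequal logits would shift the blend toward one modality.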

[0056] This application also discloses a comfort assessment system based on the fusion of visible light and infrared images. As shown in Figure 2, the system includes a forward processing module and a task-oriented post-processing module. The forward processing module acquires a dual-modal sequence containing infrared and visible light image sequences, and obtains the undecoded fusion features based on the visible light and infrared image fusion method. The specific steps can be designed as follows: S1. Decode the fusion features frame by frame to generate a fusion image, and perform MAGMA color mapping based on the grayscale values of the fusion image and the preset temperature range to generate a color fusion image; S2. Extract features from each fused image in the sequence to obtain a global feature vector containing global information; S3. Perform dimensional flattening, regularization, and normalization operations on the global feature vector, and determine the normalized output as the thermal comfort score; S4. Extract N consecutive frames of thermal comfort scores from the thermal comfort score sequence, and determine the average score as the comfort score; the comfort score is used to provide feedback for adjusting the operating parameters of the terminal device.

[0057] The undecoded fusion features acquired by the system can be obtained from the output of the forward processing module. This forward processing module can be a module unit that integrates visible light and infrared image fusion methods. This module unit uses dual cameras to acquire image sequences and outputs the undecoded fusion features. The task-oriented post-processing module performs relevant operations on the undecoded fusion features.

[0058] In S1, the preset temperature range can be set in advance by the system, such as the range of 19.0℃ to 35.0℃. The fused image is then converted into a color fusion image through MAGMA color mapping.

[0059] MAGMA color mapping is a perceptually uniform color mapping scheme that avoids color distortion and perceptual unevenness that may occur with traditional color mapping (such as JET), thus reflecting temperature changes more accurately. The system maps the grayscale values of the fused image (usually associated with temperature information) to a preset temperature range, and then converts these temperature values into corresponding color pixels according to MAGMA color mapping rules to generate a color fused image. The preset temperature range can be configured according to the actual application scenario or user needs to ensure the effectiveness and accuracy of the color mapping.
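The first half of this mapping, from normalized grayscale to temperature, can be sketched as below. The linear relationship is an assumption made for illustration (the text does not state the mapping function), and the final colormap lookup is left out because it would depend on a plotting library.

```python
import numpy as np

T_MIN, T_MAX = 19.0, 35.0   # preset temperature range from the text, in degrees Celsius

def gray_to_celsius(gray, t_min=T_MIN, t_max=T_MAX):
    """Map normalized grayscale values in [0, 1] to temperatures.

    Assumes a linear mapping; values outside [0, 1] are clipped first.
    """
    gray = np.clip(np.asarray(gray, dtype=float), 0.0, 1.0)
    return t_min + gray * (t_max - t_min)

# A tiny 2x2 "fused image" with normalized grayscale values.
fused = np.array([[0.0, 0.5],
                  [0.8125, 1.0]])
temps = gray_to_celsius(fused)
```

Changing `t_min` and `t_max` is how a different preset range would be configured for another scenario.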

[0060] In S2, the system further analyzes the fused image generated in each frame to extract its overall, high-level features, aiming to capture comprehensive information from all regions of the image, not just local details.

[0061] In S3, the extracted global feature vector undergoes a series of processing steps that turn it into the thermal comfort score. Dimensional flattening converts multi-dimensional features into one-dimensional vectors for subsequent mathematical operations. Regularization operations may include scaling or transforming feature values to reduce overfitting risk or conform features to a specific distribution, such as L1 or L2 regularization. Normalization scales the feature vector values to a standard range (e.g., 0 to 1), eliminating dimensional differences between features and ensuring a fair contribution of all features to the score. Finally, the normalized value is determined as the thermal comfort score for the current frame, either directly or through a simple mapping function.

[0062] In S4, the thermal comfort scores of N consecutive frames are extracted from the thermal comfort score sequence, and the average score is determined as the comfort score. This means that, to improve the stability of the assessment, the system does not rely on the instantaneous score of a single frame, but considers the score trend over a period of time. The system maintains a sequence containing the thermal comfort scores of the most recent N frames and calculates the arithmetic mean of these scores, using this average as the final comfort score. This temporal smoothing helps filter out short-term score fluctuations, making the comfort score more representative of the overall comfort state of the current environment.
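The N-frame smoothing in S4 can be sketched with a fixed-length queue. The class name `ComfortSmoother` is hypothetical, and averaging over fewer than N scores during warm-up is an assumption, since the text does not say how the first frames are handled.

```python
from collections import deque

class ComfortSmoother:
    """Keep the most recent N per-frame scores and report their arithmetic mean."""

    def __init__(self, n_frames):
        # A deque with maxlen drops the oldest score automatically.
        self.scores = deque(maxlen=n_frames)

    def update(self, score):
        self.scores.append(score)
        # Before N scores have arrived, average over what is available (assumption).
        return sum(self.scores) / len(self.scores)

smoother = ComfortSmoother(n_frames=5)
outputs = [smoother.update(s) for s in [0.6, 0.7, 0.5, 0.6, 0.6, 0.1]]
```

The last input is a sudden low score, yet the smoothed value moves only part of the way toward it, which is the damping effect the paragraph describes.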

[0063] Comfort scores are used to provide feedback and adjust the operating parameters of terminal devices. The calculated comfort score serves as the basis for decision-making, automatically adjusting the operating parameters of environmental control equipment (such as air conditioners, fans, and heaters). For example, when the comfort score falls below a certain threshold, the system can send a command to increase the indoor temperature or fan speed; when the score exceeds another threshold, it can decrease the temperature or fan speed. This closed-loop feedback mechanism enables intelligent environmental control, aiming to continuously maintain or optimize the user's thermal comfort experience.

[0064] In some embodiments, the system is further provided with an execution module. Taking the terminal device as a temperature regulating device as an example, if the comfort score exceeds the maximum score threshold, the execution module generates and issues an instruction to control the terminal device to lower the temperature. If the comfort score is lower than the minimum score threshold, the execution module will generate and issue a command to control the terminal device to raise the temperature. If the comfort score is within the score threshold range, the execution module maintains the control terminal's operating parameters unchanged.
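The three-way decision of the execution module can be written as a small function. The function name, the returned strings, and the treatment of scores exactly equal to a threshold (held, not adjusted) are illustrative assumptions.

```python
def decide_action(score, low, high):
    """Return the command for a temperature-regulating terminal device.

    score above `high`  -> lower the temperature
    score below `low`   -> raise the temperature
    otherwise           -> keep the current operating parameters
    """
    if score > high:
        return "lower_temperature"
    if score < low:
        return "raise_temperature"
    return "hold"

# Example thresholds on a 10-point scale (hypothetical values).
commands = [decide_action(s, low=3, high=7) for s in (8, 2, 5)]
```

A real controller would translate these strings into device-specific commands; that mapping is outside this sketch.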

[0065] Furthermore, to facilitate observation and data storage, after acquiring the temperature heatmap, the execution module performs real-time temperature identification on the regions of the image. When an abnormal region exceeding the set temperature threshold is detected, it is marked and displayed in the color fusion map, and the color fusion map, comfort score, and abnormal record are stored.

[0066] This process can directly convert the grayscale or color value of each pixel in the color fusion map (i.e., temperature heatmap) into the corresponding temperature value using a pre-established temperature mapping table or function, thereby identifying the temperature of each point in the image. Alternatively, the system can perform region segmentation using the heatmap, for example, by identifying different objects or background regions in the image through image processing algorithms, and then calculating the average temperature, maximum temperature, or minimum temperature for each identified region, performing temperature identification on a region-by-region basis.
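The region-by-region variant can be sketched with a fixed grid standing in for a real segmentation algorithm. The grid partition, the function names, and the use of the per-region maximum with a "greater than or equal to" comparison are assumptions for illustration; the image height and width must be divisible by the grid size.

```python
import numpy as np

def region_max_temps(temps, grid=(2, 2)):
    """Split an [H, W] temperature map into grid cells and take each cell's maximum."""
    h, w = temps.shape
    gh, gw = grid
    blocks = temps.reshape(gh, h // gh, gw, w // gw)   # [gh, cell_h, gw, cell_w]
    return blocks.max(axis=(1, 3))                     # [gh, gw]

def abnormal_regions(temps, threshold, grid=(2, 2)):
    """List (row, col, max_temp) for every cell whose maximum reaches the threshold."""
    maxima = region_max_temps(temps, grid)
    rows, cols = np.where(maxima >= threshold)
    return [(int(r), int(c), float(maxima[r, c])) for r, c in zip(rows, cols)]

# Toy temperature map: uniform 25 degrees with one hot pixel in the bottom-right cell.
temps = np.full((4, 4), 25.0)
temps[3, 3] = 37.5
flags = abnormal_regions(temps, threshold=37.0)
```

Swapping `max` for `mean` or `min` gives the other per-region statistics mentioned above.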

[0067] Once an anomaly is detected, the system immediately displays a clear annotation on the color fusion map and provides alerts through various means, ensuring users can promptly perceive and respond to potential comfort issues or safety hazards. Simultaneously, the system stores the color fusion map, comfort score, and detailed anomaly records. This not only provides data support for subsequent optimization of the comfort assessment model and adjustment of environmental control strategies but also ensures the traceability of anomalies, enhancing the system's intelligence, proactive early warning capabilities, and data management capabilities.

[0068] Based on the aforementioned visible light and infrared image fusion method and comfort assessment system, this application also designs a corresponding algorithm model. As shown in Figure 3, the network model structure diagram for the visible light and infrared image fusion method and comfort assessment, the model is divided into the following modules according to function: 1. Data Input and Alignment Module (see Figure 4) Input: Infrared (IR) image sequence and visible light (VIS) image sequence, both of which are single-channel grayscale images with dimensions [B,T,C,H,W] (B is the batch size, T is the number of time frames, C=1 is the number of channels, and H and W are the image height and width); preset temperature mapping range (default 19.0~35.0℃, adapted to human thermal comfort scenarios).

[0069] Function: Spatial alignment: The SIFT algorithm is used to extract feature points from the dual-modal image. The spatial transformation matrix is calculated by matching the feature points, and the infrared image is geometrically corrected to eliminate the spatial position deviation of the dual-modal image. Time alignment: Based on the image acquisition timestamp, the bimodal sequence is sorted for frame synchronization, retaining frame data within the common time interval to generate a time-synchronized bimodal sequence; Format validation and normalization: Map the aligned sequence pixel values to the 0~1 range to provide standardized data for subsequent processing.
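The time alignment step can be sketched as nearest-timestamp pairing inside the common interval. This is one plausible reading, not the specified procedure: the function name `synchronize`, the `tolerance` parameter, and the assumption that both timestamp lists are already sorted are all hypothetical. Spatial alignment with SIFT is not shown.

```python
from bisect import bisect_left

def synchronize(ir_ts, vis_ts, tolerance):
    """Pair each visible-light frame with its nearest infrared frame.

    Only frames inside the interval covered by both cameras are kept, and a
    pair is dropped if the timestamps differ by more than `tolerance` seconds.
    Returns a list of (ir_index, vis_index) tuples. Both lists must be sorted.
    """
    start = max(ir_ts[0], vis_ts[0])
    end = min(ir_ts[-1], vis_ts[-1])
    pairs = []
    for vi, t in enumerate(vis_ts):
        if not (start <= t <= end):
            continue                                   # outside the common interval
        k = bisect_left(ir_ts, t)
        candidates = [i for i in (k - 1, k) if 0 <= i < len(ir_ts)]
        ii = min(candidates, key=lambda i: abs(ir_ts[i] - t))
        if abs(ir_ts[ii] - t) <= tolerance:
            pairs.append((ii, vi))
    return pairs

# Hypothetical timestamps in seconds.
ir_ts = [0.00, 0.10, 0.20, 0.30]
vis_ts = [-0.05, 0.01, 0.11, 0.19, 0.35]
pairs = synchronize(ir_ts, vis_ts, tolerance=0.02)
```

The first and last visible-light frames fall outside the interval covered by the infrared camera, so they are discarded rather than matched to a distant frame.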

[0070] Connection method: The input end receives the original bimodal sequence; the output end is connected to the initial feature extraction module to transmit the time- and space-aligned bimodal sequence.

[0071] 2. Initial Feature Extraction Module (see Figure 3) Structure: It consists of a sequence of "convolutional layer + batch normalization layer (BN) + ReLU activation layer + Dropout2D layer"; the parameters of the convolutional layer are: 1 input channel, 64 output channels, 3×3 kernel size, padding=1, stride 1, no bias; the dropout probability of Dropout2D is p=0.2.

[0072] Function: Extract high-dimensional features frame by frame from the aligned infrared and visible light sequences, and output bimodal initial features (action_feats and heat_feats) with dimensions [B,T,64,H,W], ensuring the consistency of bimodal feature dimensions.

[0073] Connection method: The input end is connected to the data input and alignment module to receive the aligned bimodal sequence; the output end is connected to the timing association module to transmit the initial bimodal features.

[0074] 3. Temporal correlation module (see Figure 5) Structure: It consists of two parts: "3D convolutional blocks" and "residual connections"; 3D convolutional block: Composed of two stacked "Conv3D+BN3D+ReLU+Dropout3D" layers. The parameters of Conv3D are: 64 input channels, 64 output channels, kernel size 3×3×3, padding=1, stride 1, no bias; the dropout probability of Dropout3D is p=0.2. Residual connections: Dimensions are matched using 1×1×1 3D convolutions (64 input channels, 64 output channels, no bias).

[0075] Function: Performs temporal-spatial correlation modeling on bimodal initial features. First, the feature dimensions are rearranged to [B, 64, T, H, W] to adapt to 3D convolution. After convolutional blocks and residuals are superimposed, the dimensions are restored to [B, T, 64, H, W]. Temporal enhancement features (action_feat_temporal, heat_feat_temporal) are output to strengthen the temporally aligned sequence correlation features.
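The layout change and residual sum described here can be illustrated without a deep learning framework. In this sketch the Conv3D stack and the 1×1×1 shortcut are replaced by stand-in callables (simple scalings), so only the axis reordering and the addition are demonstrated; the function name `temporal_block` is hypothetical.

```python
import numpy as np

def temporal_block(x, conv_block, shortcut):
    """Rearrange [B, T, C, H, W] to [B, C, T, H, W], add block and shortcut, restore.

    conv_block and shortcut are any shape-preserving callables; in the described
    model they would be the 3D convolutional block and the 1x1x1 convolution.
    """
    x3d = np.transpose(x, (0, 2, 1, 3, 4))     # swap the T and C axes for 3D convolution
    y = conv_block(x3d) + shortcut(x3d)        # residual superposition
    return np.transpose(y, (0, 2, 1, 3, 4))    # back to [B, T, C, H, W]

# Stand-ins for the learned layers (assumptions, not the real operators).
x = np.random.default_rng(0).standard_normal((1, 5, 64, 8, 8))
out = temporal_block(x, conv_block=lambda t: 0.5 * t, shortcut=lambda t: t)
```

Because padding and stride keep every dimension unchanged in the described configuration, the output shape matches the input shape, which is what lets the block sit between modules without any resizing.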

[0076] Connection method: The input end is connected to the initial feature extraction module to receive bimodal initial features; the output end is connected to the action-heat correlation module to transmit temporal enhancement features.

[0077] 4. Action-Heat Correlation Module (see Figure 6) Structure: Includes a "node extraction component" and a "graph attention component"; Node extraction component: It consists of "adaptive average pooling 2D (output size 1×1) + 1×1 convolution + Dropout2D + Flatten", with 64 input and output channels for the 1×1 convolution and a Dropout2D probability of p=0.2. Graph attention component: It consists of two fully connected layers, ReLU, Dropout, and Sigmoid. The first fully connected layer has an input dimension of 128 (64×2) and an output dimension of 64. The second layer has an input dimension of 64 and an output dimension of 1. The Dropout probability is p=0.3.

[0078] Function: Divide the temporal enhancement features into 16 spatial nodes using a 4×4 grid, extract the feature vector of each node frame by frame, calculate the association weights of cross-modal node pairs using a graph attention component, and output the association weight matrix (corr_weight) with dimensions [B,T,16,16] and bimodal node features (action_nodes, heat_nodes). The spatially aligned features are used to improve the accuracy of association weight calculation.
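The 4×4 grid division and per-node pooling can be sketched with a reshape and a mean. This is a simplified illustration: the function name `grid_node_features` is hypothetical, the 1×1 convolution and dropout of the node extraction component are omitted, and H and W are assumed divisible by the grid size.

```python
import numpy as np

def grid_node_features(feat, grid=4):
    """Average-pool each grid cell to one feature vector.

    feat: [T, C, H, W]  ->  returns [T, grid*grid, C], nodes in row-major order.
    """
    t, c, h, w = feat.shape
    cells = feat.reshape(t, c, grid, h // grid, grid, w // grid)
    pooled = cells.mean(axis=(3, 5))                       # [T, C, grid, grid]
    return pooled.reshape(t, c, grid * grid).transpose(0, 2, 1)

# Toy input: 2 frames, 1 channel, 8x8 pixels filled with consecutive integers.
feat = np.arange(2 * 1 * 8 * 8, dtype=float).reshape(2, 1, 8, 8)
nodes = grid_node_features(feat)
```

Node 0 of the first frame covers the top-left 2×2 pixels (values 0, 1, 8, 9), so its pooled value is their mean.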

[0079] Connection method: The input end is connected to the temporal correlation module to receive bimodal temporal enhancement features; the output end is connected to the bimodal attention fusion module to transmit the correlation weight matrix and node features.

[0080] 5. Dual-modal attention fusion module (see Figure 7) Structure: Includes a "spatial attention component" and a "modal fusion weight component"; Spatial attention component: It consists of "Conv2D+BN2D+ReLU+Dropout2D+Conv2D+Sigmoid". The first layer of Conv2D has 128 input channels (64×2) and 64 output channels. The second layer of Conv2D has 64 input channels and 1 output channel. The Dropout2D probability is p=0.2. Modality fusion weight component: It consists of "two fully connected layers + ReLU + Dropout + Softmax". The first layer has an input dimension of 128 (64×2) and an output dimension of 64. The second layer has an input dimension of 64 and an output dimension of 2. The Dropout probability is p=0.3.

[0081] Function: The mean of the association weight matrix is used as a semantic prior and multiplied with the spatial attention mask to obtain an adjusted mask that incorporates semantic information. The adjusted mask is then used to spatially weight the bimodal temporal enhancement features. The modality proportion is calculated through the modality fusion weight component to achieve adaptive fusion of bimodal features. The output is a fused feature (fused_feat) with dimensions [B,T,64,H,W]. The fusion result benefits from the prior temporal and spatial alignment, avoiding misalignment interference.

[0082] Connection method: The input end is connected to the action-heat correlation module to receive the correlation weight matrix, bimodal node features and bimodal temporal enhancement features; the output end is connected to the task-oriented post-processing module to transmit the fused features.

[0083] The above functional modules constitute the forward processing module of the comfort assessment system.

[0084] 6. Task-oriented post-processing module (see Figure 8) Structure: Includes a "fusion image decoding component" and a "thermal comfort scoring component"; Fusion image decoding component: This component can be composed of "Conv2D+BN2D+ReLU+Dropout2D+Conv2D+Sigmoid", with the first layer Conv2D having 64 input channels and 32 output channels, the second layer having 32 input channels and 1 output channel, and the Dropout2D probability p=0.2. Thermal comfort scoring component: It consists of "Conv2D+ReLU+Adaptive average pooling 2D (output size 1×1)+Flatten+Dropout+Fully connected layer+Sigmoid", with 1 input channel and 8 output channels in Conv2D, 8 input dimensions and 1 output dimension in the fully connected layer, and a Dropout probability of p=0.5.

[0085] Function: Decodes the fusion features frame by frame to generate a fusion image with dimensions [B,T,1,H,W]; at the same time, extracts temperature distribution features based on the fusion image and outputs a normalized thermal comfort score of 0~1 with dimensions [B,T,1].

[0086] Connection method: The input end is connected to the dual-modal attention fusion module to receive fused features; the output end is connected to the result output module to transmit the fused image and scoring sequence.

[0087] In addition to the components mentioned above, the task-oriented post-processing module also includes a "temperature visualization unit" and a "result storage unit"; Temperature visualization unit: Using the MAGMA color mapping scheme, the grayscale values of the fused image are accurately mapped to the preset temperature range to generate an infrared-style color fused image; Results storage unit: Supports the storage and export of fused images, high-temperature target area masks, and scoring sequences.

[0088] Function: Outputs visualized fused images (including color fused images) and thermal comfort score sequences to provide results support for subsequent applications.

[0089] 7. Execution Module: This includes controllers or execution units that send parameter adjustment commands to terminal devices.

[0090] Example 1: Application of Human Thermal Comfort Assessment and Intelligent Air Conditioning Control in Indoor Office Scenarios

This invention is applied to a 100㎡ open-plan office space, where a single intelligent inverter air conditioner (supporting Modbus communication protocol) is deployed. The system needs to monitor the thermal comfort status of office workers in real time and automatically adjust the air conditioner's airflow parameters (temperature, fan speed) to achieve a balance between energy saving and worker comfort. The system needs to process dual-modal time-series data under dynamic scenarios (changes in posture, slight movement), outputting accurate thermal comfort scores and visualization results to provide a basis for air conditioning control decisions.

[0091] 1. System Initialization: Start the edge terminal, dual cameras and air conditioner controller, and complete the equipment calibration -- adjust the position of the dual cameras through the checkerboard calibration board to ensure that the shooting fields of view are completely overlapped; initialize the thermal comfort assessment system of this invention, load the pre-trained model weights (trained based on 500 sets of dual-modal time-series data of office scenarios, 10 training rounds, using cosine annealing learning rate scheduling and supporting optimization strategies), and set the thermal comfort score threshold (low comfort threshold ≤ 3 points, high comfort threshold ≥ 7 points, full score 10 points).

[0092] 2. Data Acquisition and Triggering: When the human body sensor detects a person in the office area (distance sensor ≤ 5m), it triggers the dual cameras to synchronously acquire data, continuously acquiring 5 frames of infrared-visible light time-series images at an interval of 0.067s (matching the visible light camera's 30fps frame rate). The raw image data is then transmitted to the edge terminal via the camera SDK and stored in PNG grayscale format.

[0093] 3. Bimodal Alignment and Preprocessing: The edge terminal calls the data input and alignment module. First, it extracts feature points of the bimodal image using the SIFT algorithm (≥200 feature points per frame), performs feature point matching using the FLANN matcher, calculates the homography matrix (reprojection error ≤1.5 pixels), and performs geometric correction on the infrared image to eliminate spatial position deviation. Then, based on the camera acquisition timestamp, it performs frame synchronization sorting on the bimodal sequence, retains 5 frames of data within the common time interval, and removes abnormal frames (abnormal brightness, blurry frames). Finally, it normalizes the pixel values of the aligned image to the [0,1] interval, and outputs standardized bimodal time-series data with dimensions [2,5,1,256,256].

[0094] 4. Feature Extraction and Fusion Inference: The standardized data is input to the initial feature extraction module, which extracts bimodal high-dimensional features frame by frame through a preset convolutional layer (3×3 convolutional kernel, 64 output channels) and outputs initial features of [2,5,64,256,256]. Subsequently, the temporal correlation module (3D convolution + residual connection) captures the temporal-spatial correlation between frames, generating temporal enhancement features. The action-heat correlation module divides the features into 16 spatial nodes according to a 4×4 grid, calculates the cross-modal node association weights through a graph attention component, and outputs an association weight matrix of [2,5,16,16] together with the node features. The bimodal attention fusion module combines the spatial attention mask and the modal weights to achieve adaptive fusion, outputting fused features of [2,5,64,256,256]. Finally, the task-oriented post-processing module decodes the fused image (dimension [2,5,1,256,256]) and calculates the thermal comfort score sequence.

[0095] 5. Results Output and Visualization: The results output module adopts the MAGMA color mapping scheme to accurately map the grayscale values of the fused image to the temperature range of [19.0, 35.0]℃, generating a color fused image and storing the high-temperature area mask (areas with a temperature ≥ 32℃). The edge terminal calculates the average of the scores of 5 frames as the final thermal comfort score. If there are abnormal scores (deviating from the mean by more than ±1 point), they are automatically removed and the mean is recalculated to ensure the stability of the score.
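The outlier handling for the 5-frame score can be sketched as a single rejection pass. The function name `robust_mean`, the single-pass design, and the fallback when every score is rejected are assumptions; the text only states that scores deviating from the mean by more than ±1 point are removed and the mean is recalculated.

```python
def robust_mean(scores, max_dev=1.0):
    """Average the scores after dropping those far from the initial mean.

    A score is dropped when it deviates from the initial mean by more than
    `max_dev` points. If every score would be dropped, the initial mean is
    returned unchanged (assumption).
    """
    mean = sum(scores) / len(scores)
    kept = [s for s in scores if abs(s - mean) <= max_dev]
    if not kept:
        return mean
    return sum(kept) / len(kept)

# Five per-frame scores on the 10-point scale; the last one is an outlier.
final_score = robust_mean([6.0, 6.2, 5.8, 6.0, 3.0])
```

Here the initial mean is 5.4, only the 3.0 score deviates by more than one point, and the recalculated mean of the remaining four scores is 6.0.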

[0096] 6. Intelligent Air Conditioning Control: The edge terminal transmits thermal comfort scores to the air conditioning controller via the Modbus protocol, enabling adaptive parameter adjustment: If the score is ≥7 (people feel too hot), a command is issued to lower the air conditioning supply temperature by 1℃ and increase the fan speed by 1 level (from automatic to medium); if the score is ≤3 (people feel too cold), a command is issued to raise the air conditioning supply temperature by 1℃ and decrease the fan speed by 1 level; if 3 < score < 7 (comfort range), the current air conditioning parameters are maintained. After the control command is issued, the air conditioning controller provides feedback on the execution result, and the edge terminal records the control log, corresponding score, and fused image.

[0097] 7. Cyclic monitoring: Repeat steps 2-6 every 3 minutes to achieve continuous monitoring of thermal comfort in the office environment and dynamic control of air conditioning. At the same time, the fused images, scoring sequences, and control logs are stored to the edge terminal, which supports querying and exporting by timestamp.

[0098] Example 2: Application of Human Thermal Comfort Assessment and Health Monitoring in Low-Light Nighttime Environments

This embodiment is applied to a night duty room in a residential community (20㎡, 5m long × 4m wide × 2.8m high). In this scenario, the nighttime light intensity is low (≤5 lux), and the system of this invention is needed to monitor the thermal comfort status of the duty personnel. At the same time, the temperature distribution characteristics of the fused image help to judge the health status of the personnel (for example, an abnormal increase in local temperature may indicate discomfort). The system outputs a visual result for the duty management personnel to view, which addresses the insufficient accuracy and poor visualization of single-modal infrared assessment in low-light nighttime scenarios.

[0099] 1. System initialization: Start the edge terminal, dual cameras and display device, complete the dual camera calibration (calibrate the field of view overlap through SIFT feature matching, and the reprojection error is ≤2 pixels); load the pre-trained model weights (trained based on 300 sets of dual-modal time-series data of low-light night scenes, including cosine annealing and supporting optimization strategies to suppress low-light noise interference); set the display parameters (real-time refresh rate of 10fps) and the abnormal temperature alarm threshold.

[0100] 2. Data Acquisition and Preprocessing: When the low-light sensor detects an ambient light intensity ≤5 lux, it triggers the dual cameras to continuously acquire 5 frames of time-series images. The visible light camera automatically turns on the infrared fill light (the fill light intensity is adjusted to medium to avoid strong light interfering with the staff on duty), and the infrared camera simultaneously acquires temperature images at a time interval of 0.1s (matching the infrared camera's 10fps frame rate). After the raw data is transmitted to the edge terminal, the data input and alignment module first performs low-light enhancement on the visible light image (improving brightness and contrast through the Retinex algorithm), then achieves dual-modal spatial alignment through the SIFT algorithm, completes frame synchronization based on timestamps, and outputs standardized data of [1,5,1,256,256] after normalization.

[0101] 3. Fusion Reasoning and Result Generation: Standardized data undergoes initial feature extraction, temporal correlation modeling, cross-modal semantic correlation calculation, and bimodal attention fusion to generate fusion features; the task-oriented post-processing module decodes to obtain 5 frames of fusion images and thermal comfort score sequences, and the result output module generates a MAGMA color mapping color fusion map, while identifying abnormal areas with temperatures ≥37℃ and labeling the location and temperature value of the areas.

[0102] 4. Results Display and Alarm: The display device shows the color fusion image, current thermal comfort score, and abnormal temperature prompts in real time (if there are no abnormalities, it displays "normal temperature"; if there are abnormalities, the abnormal area is marked in red); at the same time, the fusion image, score sequence, and abnormal records are stored on the terminal, archived by date, and made available to administrators for retrospective query.

[0103] 5. Continuous monitoring: The system runs 24 hours a day without interruption, collecting data and completing inference every 2 minutes. In low-light scenarios, it automatically activates supplemental lighting and low-light enhancement processing. When the light intensity is ≥30 lux, it switches to normal mode to ensure the accuracy of the assessment at all times.

[0104] Both of the above embodiments verify the feasibility and superiority of the technical solution. Through core designs such as bimodal time-space alignment, temporal correlation modeling, and cross-modal semantic correlation, the solution effectively solves the problems of misalignment, scoring fluctuation, and insufficient accuracy of related technologies. It can be adapted to the human thermal comfort assessment needs of different scenarios. Moreover, the system configuration is universal and the operation process is standardized. Those skilled in the art can reproduce the technical effects of this solution based on the above parameters and steps.

[0105] This specific embodiment is merely an explanation of the present invention and is not intended to limit the invention. After reading this specification, those skilled in the art can make modifications to this embodiment without contributing any inventive step, but such modifications are protected by patent law as long as they fall within the scope of the claims of the present invention.

Claims

1. A method for fusing visible light and infrared images, characterized in that, The method includes: The dual-modal sequence containing both infrared and visible light image sequences is subjected to feature extraction and feature association to obtain dual-modal temporal enhancement features; the dual-modal temporal enhancement features include temporal action features extracted from visible light images and temporal thermal features extracted from infrared images; The dual-modal temporal enhancement features are segmented into different spatial nodes, and the node features of each node are extracted frame by frame. The association weights of cross-modal node pairs are calculated based on the node features. The association weights of all cross-modal node pairs are combined to form an association weight matrix. The dual-modal temporal enhancement features are spatially weighted based on the association weight matrix, and feature fusion is performed according to the proportion of different modalities in the dual-modal temporal enhancement features to generate a fused image; the fused image is used to convert and generate a color fused image, and to evaluate comfort scores.

2. The method according to claim 1, characterized in that, The infrared image sequence and the visible light image sequence are acquired based on an infrared camera and a visible light camera, respectively, and the field of view of the two cameras is kept consistent. The dual-modal sequence is obtained by performing frame synchronization sorting and spatial alignment between the infrared image sequence and the visible light image sequence.

3. The method according to claim 2, characterized in that, Frame synchronization and sorting operations include: The infrared image and the visible light image are sorted in ascending order according to the sampling timestamp; the frames within the common time interval of the two sequences are extracted and frame synchronization matching is performed. Spatial alignment operations include: The SIFT algorithm is used to extract feature points from infrared and visible light images, and the spatial transformation matrix between the two modes is calculated based on feature point matching. Geometric correction and alignment are performed on the infrared or visible light image based on the spatial transformation matrix.
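
The frame synchronization and sorting operation of this claim can be sketched as below. The sketch sorts both sequences by timestamp, keeps only frames inside the common time interval, and then pairs each infrared frame with the nearest unused visible light frame. The nearest-neighbor pairing rule and the tolerance `max_gap_s` are assumptions, since the claim does not state how "frame synchronization matching" is decided; the function name `synchronize` is hypothetical.

```python
def synchronize(ir_frames, vis_frames, max_gap_s=0.05):
    """Pair frames given as (timestamp_s, frame_id) tuples."""
    # Step 1: ascending sort by sampling timestamp.
    ir = sorted(ir_frames, key=lambda f: f[0])
    vis = sorted(vis_frames, key=lambda f: f[0])
    if not ir or not vis:
        return []
    # Step 2: keep only frames inside the interval covered by both sequences.
    start = max(ir[0][0], vis[0][0])
    end = min(ir[-1][0], vis[-1][0])
    ir = [f for f in ir if start <= f[0] <= end]
    vis = [f for f in vis if start <= f[0] <= end]
    # Step 3 (assumed rule): match each infrared frame to the closest
    # unused visible frame, provided the gap is within the tolerance.
    pairs, used = [], set()
    for t_ir, ir_id in ir:
        best = None
        for idx, (t_vis, _) in enumerate(vis):
            if idx in used:
                continue
            gap = abs(t_vis - t_ir)
            if gap <= max_gap_s and (best is None or gap < best[0]):
                best = (gap, idx)
        if best is not None:
            used.add(best[1])
            pairs.append((ir_id, vis[best[1]][1]))
    return pairs


ir_frames = [(0.20, "ir2"), (0.00, "ir0"), (0.30, "ir3"), (0.10, "ir1")]
vis_frames = [(0.21, "v2"), (0.09, "v1"), (0.50, "v3")]
pairs = synchronize(ir_frames, vis_frames)
```

Frames outside the common interval (here `ir0` and `v3`) are dropped before matching, and `ir3` stays unpaired because no unused visible frame lies within the tolerance.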

4. The method according to claim 1, characterized in that, The step of performing feature extraction and feature association on the aligned dual-modal sequence, so as to capture the temporal and spatial correlation characteristics of the sequence and obtain the dual-modal temporal enhancement features, includes: High-dimensional features are extracted frame by frame from the aligned visible light image sequence to obtain an action feature map; high-dimensional features are extracted frame by frame from the aligned infrared image sequence to obtain a thermal feature map. Spatio-temporal correlation modeling is performed on the action feature map and the thermal feature map. The correlation features of the dual-modal sequence in time and space are captured by feature dimension rearrangement and matching to obtain the dual-modal temporal enhancement features.

5. The method according to claim 1, characterized in that, The temporal action features and the temporal thermal features are respectively divided into corresponding spatial nodes according to the grid specifications, and the action node features and heat node features of each spatial node region are extracted frame by frame. The step of calculating the association weights of cross-modal node pairs based on the node features includes: The two spatial nodes at the same position in the node features of the two modalities are selected as a cross-modal node pair; The action node features and heat node features in the cross-modal node pair are concatenated and nonlinearly transformed, and the transformation result is used as the association weight of the cross-modal node pair.
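
The node division and association weight calculation of this claim can be illustrated with a simplified sketch. It assumes each node feature is a single scalar obtained by mean pooling one grid cell, and that the nonlinear transformation is a sigmoid applied to a linear map of the concatenated pair. The claim specifies neither choice, and the parameters `w_action`, `w_heat`, and `bias` stand in for learned values.

```python
import math


def grid_node_features(feature_map, grid):
    """Mean-pool a 2-D feature map into grid x grid node features.

    Assumes height and width are divisible by `grid`; any remainder
    rows or columns would be ignored.
    """
    h, w = len(feature_map), len(feature_map[0])
    ch, cw = h // grid, w // grid
    nodes = []
    for gy in range(grid):
        for gx in range(grid):
            cell = [feature_map[y][x]
                    for y in range(gy * ch, (gy + 1) * ch)
                    for x in range(gx * cw, (gx + 1) * cw)]
            nodes.append(sum(cell) / len(cell))
    return nodes


def association_weights(action_nodes, heat_nodes,
                        w_action=1.0, w_heat=1.0, bias=0.0):
    """Weight for each same-position node pair: sigmoid of a linear map
    over the concatenated (action, heat) features (assumed transform)."""
    return [1.0 / (1.0 + math.exp(-(w_action * a + w_heat * h + bias)))
            for a, h in zip(action_nodes, heat_nodes)]


action_map = [[0, 0, 2, 2], [0, 0, 2, 2], [4, 4, 6, 6], [4, 4, 6, 6]]
heat_map = [[0] * 4 for _ in range(4)]
action_nodes = grid_node_features(action_map, grid=2)
heat_nodes = grid_node_features(heat_map, grid=2)
weights = association_weights(action_nodes, heat_nodes)
```

Collecting the weights of all node pairs in grid order gives the entries of the association weight matrix referred to in claim 1.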

6. The method according to claim 5, characterized in that, The spatial weighting of the dual-modal temporal enhancement features based on the association weight matrix includes: The dual-modal temporal enhancement features are concatenated and nonlinearly transformed to generate a spatial attention mask corresponding to the spatial location. The mean weight of the association weight matrix is calculated, and the mean weight is used as a semantic prior and multiplied by the spatial attention mask to obtain an adjustment mask that incorporates semantic information. The adjustment mask is spatially weighted with the dual-modal temporal enhancement features to obtain the spatial features; the spatial features include action spatial features and heat spatial features.
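
A simplified sketch of the spatial weighting in this claim follows. It treats each modality as a single 2-D feature map and assumes the spatial attention mask is a per-pixel sigmoid over a linear map of the two concatenated features; the actual transformation is not specified, and `w_action` and `w_heat` are placeholder parameters.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def spatial_weighting(action, heat, assoc_weights, w_action=1.0, w_heat=1.0):
    """Apply the adjustment mask (semantic prior x attention mask) to both maps."""
    # Semantic prior: mean of all entries of the association weight matrix.
    prior = sum(assoc_weights) / len(assoc_weights)
    action_out, heat_out = [], []
    for a_row, h_row in zip(action, heat):
        # Assumed attention mask: sigmoid of a linear map over the
        # concatenated per-pixel features, scaled by the prior.
        masks = [prior * sigmoid(w_action * a + w_heat * h)
                 for a, h in zip(a_row, h_row)]
        action_out.append([m * a for m, a in zip(masks, a_row)])
        heat_out.append([m * h for m, h in zip(masks, h_row)])
    return action_out, heat_out


action = [[0.0, 1.0]]
heat = [[0.0, -1.0]]
action_spatial, heat_spatial = spatial_weighting(action, heat, [0.4, 0.6])
```

In this toy input the prior is 0.5 and every attention value is 0.5, so the adjustment mask is 0.25 at both positions.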

7. The method according to claim 6, characterized in that, The step of fusing features based on the proportion of different modalities in the dual-modal temporal enhancement features and generating a fused image includes: The modality ratio of each modality in the final fusion of the dual-modal temporal enhancement features is calculated, weighted fusion of the action spatial features and the heat spatial features is performed according to the modality ratio, and the fusion features are output; The fusion features are decoded to generate the fused image frame by frame; the fused image integrates the temperature distribution information of the infrared mode with the behavior and spatial structure information of the visible light mode.
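
The ratio-weighted fusion in this claim can be sketched as below. How the modality ratio is computed is not stated in the claim; the sketch assumes a softmax over one scalar score per modality so that the two ratios sum to 1. The function names and the scalar scores are hypothetical.

```python
import math


def modality_ratios(action_score, heat_score):
    """Softmax over two modality scores (assumed way to obtain the ratios)."""
    m = max(action_score, heat_score)  # subtract the max for numerical stability
    ea = math.exp(action_score - m)
    eh = math.exp(heat_score - m)
    total = ea + eh
    return ea / total, eh / total


def fuse(action_spatial, heat_spatial, ratios):
    """Per-pixel weighted sum of the two spatially weighted feature maps."""
    ra, rh = ratios
    return [[ra * a + rh * h for a, h in zip(a_row, h_row)]
            for a_row, h_row in zip(action_spatial, heat_spatial)]


ratios = modality_ratios(0.0, 0.0)
fused = fuse([[2.0, 0.0]], [[4.0, 1.0]], ratios)
```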

8. A comfort assessment system based on visible light and infrared image fusion, characterized in that, The system includes a forward processing module and a task-oriented post-processing module. The forward processing module acquires a dual-modal sequence comprising an infrared image sequence and a visible light image sequence, and obtains undecoded fusion features based on the visible light and infrared image fusion method described in claim 1. The task-oriented post-processing module performs the following steps based on the undecoded fusion features: The fusion features are decoded frame by frame to generate a fused image, and MAGMA color mapping is performed based on the grayscale values of the fused image and a preset temperature range to generate a color fused image. Feature extraction is performed on each frame of the fused image sequence to obtain a global feature vector containing global information; The global feature vector is subjected to dimensionality expansion, regularization, and normalization operations, and the normalization output is determined as the thermal comfort score of that frame. The thermal comfort scores of N consecutive frames are extracted from the thermal comfort score sequence, and their average is determined as the comfort score; the comfort score is used to provide feedback and adjust the operating parameters of the terminal device.
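
The final averaging step of this claim can be sketched in a few lines. The claim does not say which N consecutive frames are taken; using the most recent N is an assumption, and `comfort_score` is a hypothetical name.

```python
def comfort_score(score_sequence, n):
    """Average the thermal comfort scores of the last n frames (assumed window)."""
    if n <= 0 or n > len(score_sequence):
        raise ValueError("n must be between 1 and the sequence length")
    window = score_sequence[-n:]
    return sum(window) / n


scores = [0.2, 0.4, 0.6, 0.8]
score = comfort_score(scores, n=2)
```

Averaging over several frames smooths frame-to-frame fluctuation in the per-frame score before it is used as feedback for the terminal device.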

9. The comfort assessment system based on visible light and infrared image fusion according to claim 8, characterized in that, The system also includes an execution module, and the terminal device is a temperature control device; If the comfort score exceeds the maximum score threshold, the execution module generates and issues a command to control the terminal device to lower the temperature; If the comfort score is lower than the minimum score threshold, the execution module generates and issues a command to control the terminal device to raise the temperature; If the comfort score is within the score threshold range, the execution module keeps the operating parameters of the terminal device unchanged.
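
The decision rule of the execution module can be expressed as a short sketch. The command strings and the threshold values used in the example are hypothetical; the claim only fixes the three-way comparison.

```python
def control_command(comfort_score, min_threshold, max_threshold):
    """Map a comfort score to a temperature-control command."""
    if comfort_score > max_threshold:
        return "lower_temperature"
    if comfort_score < min_threshold:
        return "raise_temperature"
    # Within [min_threshold, max_threshold]: keep current operating parameters.
    return "hold"


# Hypothetical thresholds, for illustration only.
commands = [control_command(s, min_threshold=0.3, max_threshold=0.7)
            for s in (0.9, 0.1, 0.5, 0.7)]
```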