A pilot emotion monitoring method based on multi-modal analysis

By acquiring real-time video and audio streams in the aircraft cockpit, combining them with flight parameters to generate context vectors, and using a gated neural network to dynamically adjust modal weights, the problem of insufficient accuracy and robustness of multimodal emotion monitoring in the aircraft cockpit in existing technologies is solved, and the intelligence and adaptability of emotion monitoring are improved.

CN122241552A · Pending · Publication Date: 2026-06-19 · NANJING LUKOU INT AIRPORT AIRPORT TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
NANJING LUKOU INT AIRPORT AIRPORT TECH CO LTD
Filing Date
2026-02-02
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies for multimodal emotion monitoring in aircraft cockpits lack dynamic adjustment capabilities and cannot intelligently arbitrate the importance of visual and auditory modalities based on flight missions and environmental changes, resulting in insufficient monitoring accuracy and robustness in complex environments.

Method used

By acquiring real-time video and audio streams from the cockpit, as well as flight parameter data, a flight context vector is generated. A gated neural network is used to dynamically adjust the modal weights, and a cross-modal attention mechanism is combined to fuse features and generate emotional states.

Benefits of technology

It improves the accuracy and robustness of emotion monitoring in complex cockpit environments, can cope with modal information failure, provides key emotional state judgments unique to aviation, and enhances the reliability of flight safety management and fatigue risk warning.



Abstract

This invention discloses a pilot emotion monitoring method based on multimodal analysis, comprising: acquiring flight parameter data synchronized with real-time video and audio streams and generating a flight context vector; extracting a first feature vector representing facial visual features and a second feature vector representing speech acoustic features from the video and audio streams, respectively; determining a dynamic modal weight set for fusion based on the flight context vector using a gated neural network; weighting and fusing the first and second feature vectors using the dynamic weight set to generate a fused feature vector; and determining the pilot's current emotional state based on the fused feature vector. This invention solves the problem of context-independent multimodal fusion strategies in existing technologies, which are unable to adapt to the dynamic cockpit environment. By introducing a dynamic weight mechanism driven by the flight context vector, it achieves intelligent arbitration of the importance of each modality's information, improving the accuracy and scene adaptability of monitoring.

Description

Technical Field

[0001] This invention relates to the field of emotion monitoring technology under artificial intelligence, and in particular to a method for monitoring pilot emotions based on multimodal analysis.

Background Technology

[0002] With the development of artificial intelligence technology, the automatic recognition of human emotions and other cognitive states using computers has become a research hotspot in the field of human-computer interaction. Emotion recognition technology, in particular, has demonstrated significant application value in key scenarios such as aviation safety and autonomous driving. Currently, emotion recognition technology mainly focuses on two modalities: visual and auditory. In the visual channel, mainstream methods use deep convolutional neural networks (CNNs) to analyze facial images, inferring emotional states by capturing changes in facial action units. This type of method is relatively mature in recognizing static expressions. In the auditory channel, acoustic features such as Mel-frequency cepstral coefficients (MFCCs), pitch, and speech rate are typically extracted from speech signals. Recurrent neural networks (RNNs) or long short-term memory networks (LSTMs) are then used to process temporal information to identify the emotions contained in speech. Simultaneously, to improve the robustness and accuracy of recognition models, research is gradually shifting from single-modality to multi-modal information fusion, combining complementary visual and auditory information to obtain more reliable analysis results in complex environments.

[0003] However, existing technologies still have significant limitations when applied to the highly dynamic and complex scenario of aircraft cockpits. First, most current multimodal fusion strategies are context-independent, typically using fixed weighting coefficients to sum the features of different modalities or employing standard attention mechanisms, failing to consider the dynamic impact of the flight mission itself on the reliability of modal information. For example, when pilots perform high-load maneuvering tasks such as manual approaches, the information value conveyed by their facial micro-expressions is far greater than that of potentially silent voice channels; and during radio communications, changes in voice tone become a key indicator of cognitive load. Second, existing methods lack an intelligent arbitration mechanism, failing to dynamically adjust the dependence on visual and auditory modalities based on real-time flight phases, automation levels, and maneuvering loads. This leads to lower-reliability modal information interfering with the analysis results in specific scenarios, thus limiting the accuracy and scenario adaptability of monitoring. More importantly, when the face is covered by an oxygen mask or there is non-task-related voice communication in the cockpit, the static fusion model cannot adaptively suppress the input of invalid modalities, resulting in insufficient robustness of the model.

Summary of the Invention

[0004] The purpose of this section is to outline some aspects of embodiments of the present invention and to briefly describe some preferred embodiments. Simplifications or omissions may be made in this section, as well as in the abstract and title of this application, to avoid obscuring the purpose of these documents; however, such simplifications or omissions should not be construed as limiting the scope of the invention.

[0005] In view of the aforementioned existing problems, this invention is proposed. Therefore, this invention provides a pilot emotion monitoring method based on multimodal analysis to address the problems mentioned in the background art.

[0006] To address the aforementioned technical problems, this invention provides the following technical solution: a pilot emotion monitoring method based on multimodal analysis, comprising: acquiring a real-time video stream and a real-time audio stream of the pilot in the cockpit, acquiring real-time flight parameter data synchronized with the real-time video and audio streams, and generating a flight context vector based on the real-time flight parameter data; extracting a first feature vector from the real-time video stream, the first feature vector representing the pilot's facial visual features, and extracting a second feature vector from the real-time audio stream, the second feature vector representing the pilot's speech acoustic features; determining, based on the flight context vector, a dynamic modal weight set for fusing the first feature vector and the second feature vector, the dynamic modal weight set including a first weight and a second weight; and weighting and fusing the first feature vector and the second feature vector using the dynamic modal weight set to generate a fused feature vector, and determining the pilot's current emotional state based on the fused feature vector.

[0007] As a preferred embodiment of the pilot emotion monitoring method based on multimodal analysis described in this invention, the real-time flight parameter data includes at least one of flight phase information, aircraft automation level information, and pilot control load information.

[0008] As a preferred embodiment of the pilot emotion monitoring method based on multimodal analysis described in this invention, determining the flight phase information includes: based on the aircraft's altitude, speed, landing gear status, and flap position data, categorizing the current flight phase into one of the preset flight phase categories, which include takeoff, climb, cruise, descent, approach, and landing.

[0009] As a preferred embodiment of the pilot emotion monitoring method based on multimodal analysis described in this invention, the step of extracting a first feature vector from the real-time video stream includes: detecting and locating the pilot's facial region in each frame of the real-time video stream; and inputting the facial region into a pre-trained visual analysis neural network to extract the first feature vector related to facial micro-expressions.

[0010] As a preferred embodiment of the pilot emotion monitoring method based on multimodal analysis described in this invention, the step of extracting the second feature vector from the real-time audio stream includes: preprocessing the real-time audio stream to generate a time-frequency spectrogram; and inputting the time-frequency spectrogram into a pre-trained speech analysis neural network to extract the second feature vector related to emotional prosody.

[0011] As a preferred embodiment of the pilot emotion monitoring method based on multimodal analysis described in this invention, the step of determining the dynamic modal weight set includes: inputting the flight context vector into a pre-defined gated neural network; and using the output of the gated neural network to calculate the first weight and the second weight, which correspond to the first feature vector and the second feature vector, respectively.

[0012] As a preferred embodiment of the pilot emotion monitoring method based on multimodal analysis described in this invention, the step of weighted fusion using the dynamic modal weight set includes: multiplying the first feature vector by the first weight to obtain the first weighted feature; multiplying the second feature vector by the second weight to obtain the second weighted feature; and summing the first weighted feature and the second weighted feature by vector addition to generate the fused feature vector.
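
The weighted fusion in this paragraph reduces to a scalar-weighted vector sum. The following is a minimal Python sketch of that arithmetic only; the function name `fuse`, the toy 4-dimensional vectors, and the weight values are illustrative assumptions and do not come from the patent text.

```python
def fuse(f_visual, f_audio, w_visual, w_audio):
    """Scale each modality's feature vector by its weight, then add them."""
    if len(f_visual) != len(f_audio):
        raise ValueError("feature vectors must have the same dimension")
    return [w_visual * v + w_audio * a for v, a in zip(f_visual, f_audio)]


# Toy 4-dimensional features; real feature vectors would be much longer.
fused = fuse([1.0, 0.0, 2.0, 4.0], [0.0, 2.0, 2.0, 0.0],
             w_visual=0.75, w_audio=0.25)
print(fused)  # [0.75, 0.5, 2.0, 3.0]
```

Because the two vectors are added element by element, they must share one dimension, which is why the description later keeps the two extractors' output sizes consistent.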

[0013] As a preferred embodiment of the pilot emotion monitoring method based on multimodal analysis described in this invention, the method further includes, before the weighted fusion: using a cross-modal attention mechanism to interactively process the first and second feature vectors, achieving feature alignment and information complementarity in the time dimension, and generating the aligned first and second feature vectors.

[0014] As a preferred embodiment of the pilot emotion monitoring method based on multimodal analysis described in this invention, the step of determining the pilot's current emotional state includes: inputting the fused feature vector into a classifier; and outputting, by the classifier, a classification result selected from a predefined set of labels containing multiple basic emotional states and at least one aviation-specific emotional state.

[0015] As a preferred embodiment of the pilot emotion monitoring method based on multimodal analysis described in this invention, the method further includes: during non-critical flight phases, continuously collecting the pilot's first and second feature vectors to establish a personalized emotional baseline model for the pilot; and, before performing weighted fusion, calibrating the currently extracted first and second feature vectors using the personalized emotional baseline model.

[0016] Compared with existing technologies, the beneficial effects of this solution are: 1. This invention abandons the context-independent static fusion strategy of traditional technologies and introduces a dynamic modal weight set driven by real-time flight parameters. Through a gated neural network, it can intelligently arbitrate the relative importance of visual and auditory modalities based on objective contexts such as flight phase, automation level, and control load. This allows the fusion process to be "adapted to local conditions," emphasizing modalities with richer information in different scenarios, effectively avoiding interference from low-quality or irrelevant modal information, and improving the accuracy and reliability of emotion monitoring in complex cockpit environments.

[0017] 2. The dynamic weighting mechanism of this invention can effectively handle special cases where modal information fails. For example, when the pilot's face is obscured by an oxygen mask, resulting in a decrease in visual feature quality, or when auditory information is missing during silent flight, the solution of this invention can automatically reduce the weight of the corresponding modality to an extremely low level, and instead rely on the effective modality for analysis. This ensures that emotion monitoring can still operate stably even when some sensory data is unavailable, enhancing the robustness of the model and its applicability across the entire flight profile.

[0018] 3. When determining emotional states, this invention not only includes general basic emotion categories, but also specifically adds a key state label of "high cognitive load / stress" unique to the aviation field. This makes the monitoring results no longer broad emotional descriptions, but key indicators that directly reflect the pilot's workload and core cognitive state, providing more targeted and actionable decision-making basis for flight safety management, fatigue risk warning, and pilot training assessment.

Attached Figure Description

[0019] To more clearly illustrate the technical solutions of the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort. Wherein: Figure 1 is a flowchart illustrating the overall process of a pilot emotion monitoring method based on multimodal analysis, as described in one embodiment of the present invention.

Detailed Implementation

[0020] To make the above-mentioned objects, features, and advantages of the present invention more apparent and understandable, specific embodiments of the present invention will be described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, and not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the protection scope of the present invention.

[0021] Many specific details are set forth in the following description in order to provide a full understanding of the invention. However, the invention may also be practiced in other ways different from those described herein, and those skilled in the art can make similar extensions without departing from the spirit of the invention. Therefore, the invention is not limited to the specific embodiments disclosed below.

[0022] Secondly, the term "one embodiment" or "embodiment" as used herein refers to a specific feature, structure, or characteristic that may be included in at least one implementation of the present invention. The phrase "in one embodiment" appearing in different places in this specification does not necessarily refer to the same embodiment, nor is it a single or selective embodiment that is mutually exclusive with other embodiments.

[0023] This invention is described in detail with reference to the schematic diagrams. When detailing the embodiments of this invention, for ease of explanation, the cross-sectional views illustrating the device structure may be partially enlarged, not adhering to the usual scale. Furthermore, the schematic diagrams are merely examples and should not be construed as limiting the scope of protection of this invention. In actual fabrication, the three-dimensional spatial dimensions of length, width, and depth should be included.

[0024] Furthermore, in the description of this invention, it should be noted that the terms "upper," "lower," "inner," and "outer," etc., indicate the orientation or positional relationship based on the orientation or positional relationship shown in the accompanying drawings. These terms are used solely for the convenience of describing the invention and for simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation. Therefore, they should not be construed as limitations on the invention. In addition, the terms "first," "second," or "third" are used for descriptive purposes only and should not be construed as indicating or implying relative importance.

[0025] Unless otherwise explicitly specified and limited, the terms "installation," "connection," and "joining" in this invention should be interpreted broadly. For example, they can refer to fixed connections, detachable connections, or integral connections; similarly, they can refer to mechanical connections, electrical connections, or direct connections, or indirect connections through an intermediate medium, or internal connections between two components. Those skilled in the art can understand the specific meaning of the above terms in this invention based on the specific circumstances.

[0026] Example 1. Referring to Figure 1, the first embodiment of the present invention provides a pilot emotion monitoring method based on multimodal analysis, including: S1. Acquire the real-time video stream and real-time audio stream of the pilot in the cockpit, acquire real-time flight parameter data synchronized with the real-time video stream and real-time audio stream, and generate a flight context vector based on the real-time flight parameter data.

[0027] Furthermore, a wide-angle camera installed in the cockpit, positioned directly above the pilot's upper body, captures a real-time video stream with a resolution of at least 1920×1080 pixels and a frame rate of at least 30 fps. This ensures clear and continuous capture of the pilot's facial expressions and subtle changes in head posture. Simultaneously, a dual-microphone array deployed in the cockpit, combined with beamforming and deep learning noise reduction algorithms, preprocesses the acquired audio signals to suppress interference from background noise such as engine and wind noise, ultimately outputting a clean real-time audio stream with a sampling rate of 16 kHz and a bit depth of 16 bits.

[0028] It is important to emphasize that, in order to achieve accurate alignment of multimodal data, this embodiment employs a unified system clock. This system clock is capable of marking each frame of the video stream, each data block of the audio stream, and each data packet acquired from the aircraft data bus with a high-precision (e.g., millisecond-level) synchronization timestamp at the time of acquisition.
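
Once every video frame, audio block, and bus packet carries a timestamp from the same clock, alignment can be done by nearest-timestamp lookup. The sketch below is one possible way to do this, not the patent's stated procedure; the function names, the 50 ms skew tolerance, and the sample timestamps are all hypothetical.

```python
from bisect import bisect_left


def nearest_index(timestamps_ms, t_ms):
    """Index of the entry in a sorted timestamp list that is closest to t_ms."""
    i = bisect_left(timestamps_ms, t_ms)
    if i == 0:
        return 0
    if i == len(timestamps_ms):
        return len(timestamps_ms) - 1
    before, after = timestamps_ms[i - 1], timestamps_ms[i]
    return i if after - t_ms < t_ms - before else i - 1


def align(video_ts, audio_ts, bus_ts, max_skew_ms=50):
    """Pair each video frame with the closest audio block and bus packet.

    Frames whose nearest partner is further away than max_skew_ms (an
    assumed tolerance) are dropped rather than paired with stale data.
    """
    triples = []
    for vi, t in enumerate(video_ts):
        ai = nearest_index(audio_ts, t)
        bi = nearest_index(bus_ts, t)
        if (abs(audio_ts[ai] - t) <= max_skew_ms
                and abs(bus_ts[bi] - t) <= max_skew_ms):
            triples.append((vi, ai, bi))
    return triples


video_ts = [0, 33, 67, 100]                # roughly 30 fps frames
audio_ts = list(range(0, 101, 10))         # 10 ms audio blocks
bus_ts = [5, 55, 105]                      # hypothetical bus packet times
print(align(video_ts, audio_ts, bus_ts))
# [(0, 0, 0), (1, 3, 1), (2, 7, 1), (3, 10, 2)]
```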

[0029] Furthermore, this embodiment introduces the perception of flight status, quantifying the objective environment in which the pilot operates into a flight context vector. Specifically, by connecting to the aircraft's avionics data buses (such as ARINC 429), a series of parameters can be acquired in real time, including the aircraft's altitude, airspeed, Mach number, vertical speed, attitude angle, landing gear status, flap / slat position, autopilot (AP) and autothrottle (A / T) engagement status, and input deflection signals from the control stick / wheel and pedals.

[0030] Furthermore, the acquired raw real-time flight parameter data is processed and encoded to generate structured contextual information. This raw real-time flight parameter data includes at least one of the following: flight phase information, aircraft automation level information, and pilot control load information.

[0031] Specifically, for flight phase information, based on key parameters such as the aircraft's altitude, speed, landing gear, and flap status, a pre-defined decision tree or state machine model is used to categorize the current flight phase into one of six pre-defined categories: "takeoff," "climb," "cruise," "descent," "approach," and "landing." Subsequently, one-hot encoding is used to convert this classification result into a 6-dimensional vector, denoted as $C_{phase}$. For example, if the current phase is determined to be cruise, then $C_{phase} = [0, 0, 1, 0, 0, 0]$. One-hot encoding is used so that the model does not assume a meaningless ordinal relationship between different flight phases; each phase is treated as an independent conditional input.
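
As a small illustration of the one-hot step, the following Python sketch encodes a phase label in the order the six categories are listed above. The names `PHASES` and `one_hot_phase` are hypothetical helpers introduced only for this example.

```python
# Category order follows the list in the description.
PHASES = ["takeoff", "climb", "cruise", "descent", "approach", "landing"]


def one_hot_phase(phase):
    """Return a 6-dimensional one-hot vector for a flight phase label."""
    if phase not in PHASES:
        raise ValueError(f"unknown flight phase: {phase!r}")
    return [1 if p == phase else 0 for p in PHASES]


print(one_hot_phase("cruise"))  # [0, 0, 1, 0, 0, 0]
```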

[0032] Specifically, in this embodiment, a state machine model is used, and the transition logic between the six preset categories is illustrated by the following rule set: Landing to Takeoff: When the aircraft is on the ground (altitude < 50 feet AGL, where AGL (Above Ground Level) is the aircraft's vertical height above the ground), the speed increases from below 40 knots to above 80 knots, and the engine thrust is > 80% N1 (where N1, the fan speed, is the thrust indication parameter of a high-bypass turbofan engine), the aircraft is determined to have entered the "takeoff" phase.

[0033] Takeoff to climb: When the landing gear is retracted, the altitude exceeds 1,500 feet AGL, and the vertical speed remains positive, the "climb" phase is determined.

[0034] Climb to cruise: When the aircraft reaches the preset cruise altitude and the vertical speed remains stable within ±100 feet / minute for more than 1 minute, it is determined to have entered the "cruise" phase.

[0035] Cruise to Descent: When the aircraft leaves the cruise altitude and its vertical speed remains below -500 feet per minute (that is, a sustained descent rate above 500 feet per minute), it is determined to have entered the "descent" phase.

[0036] Descent to Approach: The “approach” phase is determined when flaps are deployed, or the altitude is below 10,000 feet and the speed is below 250 knots, or the Instrument Landing System (ILS) signal has been intercepted.

[0037] Approach to Landing: When the landing gear is down, the altitude is below 50 feet AGL, and the aircraft touches the ground, the "landing" phase is determined.

[0038] It should be emphasized that the above thresholds (altitude, speed, thrust, etc.) are all empirical values preset in the model and can be adjusted for different aircraft types.
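
The transition rules above can be sketched as a small state machine. This is an illustrative simplification under stated assumptions, not a certified implementation: the `Sample` fields are hypothetical names, conditions that the text describes as sustained over time ("remains positive", "for more than 1 minute", "increases from below 40 knots") are assumed to be summarised upstream into single values or flags, and the thresholds are the empirical values quoted in the rules.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One snapshot of bus parameters (field names are hypothetical)."""
    agl_ft: float                 # height above ground level
    alt_ft: float                 # altitude used for the 10,000 ft check
    speed_kt: float
    vs_fpm: float                 # vertical speed in feet per minute
    n1_pct: float                 # engine fan speed, percent N1
    gear_down: bool
    flaps_deployed: bool
    ils_captured: bool = False
    touched_down: bool = False
    at_cruise_alt: bool = False
    stable_s: float = 0.0         # seconds with |vs| within 100 ft/min
    sped_up_from_40kt: bool = False  # speed was below 40 kt earlier in the roll


def next_phase(phase, s):
    """Apply the rule for the current phase; stay put if it does not fire."""
    if phase == "landing":
        if (s.agl_ft < 50 and s.sped_up_from_40kt
                and s.speed_kt > 80 and s.n1_pct > 80):
            return "takeoff"
    elif phase == "takeoff":
        if not s.gear_down and s.agl_ft > 1500 and s.vs_fpm > 0:
            return "climb"
    elif phase == "climb":
        if s.at_cruise_alt and abs(s.vs_fpm) <= 100 and s.stable_s > 60:
            return "cruise"
    elif phase == "cruise":
        if not s.at_cruise_alt and s.vs_fpm < -500:
            return "descent"
    elif phase == "descent":
        if (s.flaps_deployed or s.ils_captured
                or (s.alt_ft < 10000 and s.speed_kt < 250)):
            return "approach"
    elif phase == "approach":
        if s.gear_down and s.agl_ft < 50 and s.touched_down:
            return "landing"
    return phase


climbing = Sample(agl_ft=2000, alt_ft=2500, speed_kt=180, vs_fpm=1800,
                  n1_pct=92, gear_down=False, flaps_deployed=True)
print(next_phase("takeoff", climbing))  # climb
```

Keeping the thresholds in one function makes it straightforward to swap in per-aircraft values, as the preceding paragraph allows.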

[0039] Specifically, for aircraft automation level information: the Boolean state values of the autopilot (AP) and autothrottle (A / T) are read directly and encoded into a 2-dimensional binary vector, denoted as $C_{auto} = [b_{AP}, b_{AT}]$, where $b_{AP}$ is the Boolean state value of the autopilot and $b_{AT}$ is the Boolean state value of the autothrottle. A value of 1 indicates that the system is engaged, and 0 indicates that it is disengaged.

[0040] Specifically, regarding pilot control load information: to quantify the intensity of the pilot's manual control, this embodiment calculates a control load index, denoted as $L_{ctrl}(t)$. This index reflects the intensity of the pilot's input activity on the primary flight controls (stick / wheel and pedals) within the most recent time window $T$. It can be calculated using the following formula:

$$L_{ctrl}(t) = \alpha \cdot \sigma_T\left(x_{stick}\right) + \beta \cdot \sigma_T\left(x_{pedal}\right)$$

where $L_{ctrl}(t)$ is the control load index at the current moment $t$; $x_{stick}$ and $x_{pedal}$ are the input signal sequences of the control stick / wheel and the pedals, respectively; and $\sigma_T(\cdot)$ is the standard deviation of the corresponding input signal sequence over the past time window $T$. The standard deviation is used instead of the mean in order to capture the "activity" or "instability" of the control inputs rather than their absolute position: a high standard deviation means frequent or large control adjustments, which corresponds directly to a higher control load. $\alpha$ and $\beta$ are preset weighting coefficients used to balance the contribution of the different control devices to the overall load; their values can be determined through expert experience or data statistics.

[0041] Specifically, the calculated control load index is normalized so that its value falls within the range [0,1]; the normalized index is denoted as $\hat{L}_{ctrl}$.
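
A control load index of this kind, a weighted sum of the standard deviations of the stick / wheel and pedal inputs over a recent window followed by normalization to [0,1], might be sketched as follows. The weight values, the choice of population standard deviation, and the divide-and-clip normalization with an assumed maximum are illustrative assumptions; the description leaves all three open.

```python
from statistics import pstdev


def control_load(stick, pedal, w_stick=0.7, w_pedal=0.3):
    """Weighted sum of input standard deviations over one time window.

    stick and pedal are the deflection samples inside the window.
    The weights are placeholder values, not figures from the patent.
    """
    return w_stick * pstdev(stick) + w_pedal * pstdev(pedal)


def normalize(value, max_value):
    """Scale to [0, 1] by an assumed maximum load, clipping any overshoot."""
    return min(max(value / max_value, 0.0), 1.0)


steady = control_load([0.10, 0.10, 0.11, 0.10], [0.0, 0.0, 0.0, 0.0])
active = control_load([0.0, 1.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0])
print(normalize(steady, 0.5) < normalize(active, 0.5))  # True
```

A stick held at a constant offset yields a standard deviation of zero, so the index responds to activity rather than to position, which matches the rationale given for preferring the standard deviation over the mean.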

[0042] Furthermore, the flight phase vector $C_{phase}$, the automation level vector $C_{auto}$, and the normalized control load index $\hat{L}_{ctrl}$ obtained above are concatenated to form a real-time, low-dimensional (6+2+1=9 dimensions in this embodiment), information-dense flight context vector $C = [C_{phase}; C_{auto}; \hat{L}_{ctrl}]$. This flight context vector constitutes a digital snapshot of the pilot's current working situation.
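
The concatenation into a 9-dimensional context vector can be illustrated with a short Python sketch. The function name `build_context` and its argument names are hypothetical; the layout (six phase entries, two automation flags, one normalized load value) follows the 6+2+1 split stated above.

```python
PHASES = ["takeoff", "climb", "cruise", "descent", "approach", "landing"]


def build_context(phase, ap_engaged, at_engaged, load_norm):
    """Concatenate phase one-hot (6), automation flags (2), and load (1)."""
    if phase not in PHASES:
        raise ValueError(f"unknown flight phase: {phase!r}")
    if not 0.0 <= load_norm <= 1.0:
        raise ValueError("load_norm must already be normalized to [0, 1]")
    phase_vec = [1.0 if p == phase else 0.0 for p in PHASES]
    auto_vec = [float(ap_engaged), float(at_engaged)]
    return phase_vec + auto_vec + [load_norm]


# Manual approach with autothrottle still engaged and fairly high load.
print(build_context("approach", False, True, 0.62))
# [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.62]
```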

[0043] S2. Extract a first feature vector from the real-time video stream, which represents the pilot's facial visual features. Extract a second feature vector from the real-time audio stream, which represents the pilot's speech acoustic features.

[0044] It should be noted that the task of this step is to convert the high-dimensional, raw video and audio data collected in step S1 into two low-dimensional, information-dense feature vectors (facial visual features and speech acoustic features) through a deep neural network model. These two feature vectors encapsulate visual and acoustic information that is strongly correlated with the pilot's emotional state, respectively.

[0045] Furthermore, the first feature vector representing facial visual features is extracted. The extraction process is as follows: For each frame of real-time video image acquired from the camera, this embodiment employs a lightweight and efficient object detection model (e.g., a variant of YOLOv8n-face) for processing. This model is specifically optimized to stably detect and output the bounding box coordinates of the pilot's facial region under complex lighting conditions within the cockpit, partial occlusion, and even when the pilot is wearing sunglasses or headphones. It should be noted that a single-stage detector like YOLOv8n is chosen to balance detection accuracy with inference speed, ensuring the real-time performance of the entire process.

[0046] The detected facial region image is cropped from the original frame and subjected to a series of normalization preprocessing operations, including: scaling it to a fixed size (e.g., 224×224 pixels); and linearly mapping pixel values from the [0,255] interval to the [0,1] or [-1,1] interval. It should be noted that this preprocessing aims to eliminate scale inconsistencies caused by distance and pose variations, and to ensure the input data meets the requirements of the neural network.
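
The crop, resize, and pixel-scaling steps can be sketched with NumPy as below. This is a minimal illustration that assumes NumPy is available and that the detector returns a box as `(x0, y0, x1, y1)` pixel coordinates; the nearest-neighbour resize is a simple stand-in for the interpolating resize an image library would normally provide.

```python
import numpy as np


def preprocess_face(frame, box, size=224, signed=False):
    """Crop a face box from an HxWx3 uint8 frame, resize it, scale pixels.

    box is (x0, y0, x1, y1) in pixel coordinates (assumed detector output).
    """
    x0, y0, x1, y1 = box
    crop = frame[y0:y1, x0:x1]
    if crop.size == 0:
        raise ValueError("empty face box")
    # Nearest-neighbour resize: pick one source row/column per output pixel.
    rows = np.arange(size) * crop.shape[0] // size
    cols = np.arange(size) * crop.shape[1] // size
    resized = crop[rows][:, cols]
    scaled = resized.astype(np.float32) / 255.0       # [0, 255] -> [0, 1]
    return scaled * 2.0 - 1.0 if signed else scaled   # optionally [-1, 1]


frame = np.zeros((1080, 1920, 3), dtype=np.uint8)     # one blank 1080p frame
face = preprocess_face(frame, (800, 300, 1100, 700))
print(face.shape)  # (224, 224, 3)
```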

[0047] The preprocessed facial image is input into a pre-trained visual analysis neural network for extracting facial micro-expression features. In this embodiment, the neural network can be a Transformer-based model (such as Vision Transformer, ViT) or an optimized convolutional neural network (such as ResNet50 or EfficientNet). These neural networks have been pre-trained on large facial expression datasets (such as RAF-DB and AffectNet) and already possess powerful facial representation learning capabilities. After receiving the facial image, the neural network performs multiple nonlinear transformations and outputs, before the classification head, a high-dimensional feature vector, denoted as the first feature vector $F_v$:

$$F_v = f_{vis}\left(I_{face}\right)$$

where $F_v$ is the extracted first feature vector, which typically has a fixed dimension, such as 768 or 2048. This vector semantically encodes subtle emotion-related features of facial muscle movement, such as raised eyebrows, changes at the corners of the mouth, and contraction around the eyes, rather than just macroscopic expression categories. $f_{vis}$ denotes the visual analysis neural network described above, and $I_{face}$ denotes the preprocessed facial image.

[0048] Furthermore, before the real-time audio stream is segmented, a Voice Activity Detection (VAD) module can be applied. This VAD module (based on energy, zero-crossing rate, or a pre-trained deep learning model) determines whether the current audio segment contains human speech. If speech activity is detected, subsequent operations are performed to generate a second feature vector. If no speech activity is detected (i.e., silence or only background noise), the second feature vector for that time segment can be directly set to a zero vector or marked as invalid. Simultaneously, this "silent" state information can be added to the flight context vector as an additional input (a Boolean state value), giving the gated neural network a more direct basis for decisively reducing the second weight to near zero and thus handling silent segments better.
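
An energy and zero-crossing-rate check, with the zero-vector fallback for silent segments, can be sketched as follows. This is a minimal illustration assuming samples scaled to [-1, 1]; the threshold values, the function names, and the `extractor` callable standing in for the speech analysis network are all hypothetical.

```python
import math


def frame_energy(samples):
    """Mean squared amplitude of the segment."""
    return sum(x * x for x in samples) / len(samples)


def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(samples, samples[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(samples) - 1)


def is_speech(samples, energy_thresh=1e-3, zcr_max=0.35):
    """Crude VAD: enough energy, and not noise-like in its sign changes."""
    return (frame_energy(samples) > energy_thresh
            and zero_crossing_rate(samples) < zcr_max)


def audio_feature_or_zero(samples, extractor, dim=768):
    """Return (feature vector, silent flag); zero vector when no speech."""
    if is_speech(samples):
        return extractor(samples), False
    return [0.0] * dim, True


tone = [0.5 * math.sin(2 * math.pi * 200 * n / 16000) for n in range(400)]
silence = [0.0] * 400
print(is_speech(tone), is_speech(silence))  # True False
```

The returned silent flag is the Boolean that the paragraph above suggests feeding into the flight context vector.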

[0049] Furthermore, a second feature vector representing the acoustic features of the speech is extracted. The extraction process is as follows: The clean real-time audio stream obtained in step S1 is segmented. Specifically, a sliding window of fixed length (e.g., 25 milliseconds) is used to slice the audio stream with a certain step size (e.g., 10 milliseconds), forming a series of overlapping audio frames. A Hamming window is applied to each frame of audio signal to reduce spectral leakage. It should be explained that this frame-segmentation and windowing process is designed to satisfy the short-time stationarity assumption when performing the Fourier transform.
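
The framing and windowing step can be illustrated in a few lines of Python. The sketch below uses the example values from the text (25 ms window, 10 ms step, 16 kHz sampling) and the standard Hamming window formula; the function names are hypothetical.

```python
import math


def hamming(n):
    """Standard Hamming window: 0.54 - 0.46 * cos(2*pi*i / (n - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1))
            for i in range(n)]


def frame_signal(x, sr=16000, win_ms=25, hop_ms=10):
    """Slice x into overlapping frames and apply a Hamming window to each."""
    win = sr * win_ms // 1000      # 400 samples at 16 kHz
    hop = sr * hop_ms // 1000      # 160 samples at 16 kHz
    if len(x) < win:
        raise ValueError("signal shorter than one window")
    w = hamming(win)
    return [[s * c for s, c in zip(x[start:start + win], w)]
            for start in range(0, len(x) - win + 1, hop)]


one_second = [1.0] * 16000
frames = frame_signal(one_second)
print(len(frames), len(frames[0]))  # 98 400
```

With a 400-sample window and a 160-sample step, consecutive frames overlap by 240 samples, so one second of audio yields 98 frames.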

[0050] For each windowed audio frame, a Short-Time Fourier Transform (STFT) is performed to obtain its spectral representation. Then, the energy of the spectrum is passed through a series of Mel filter banks and logarithmic operations are performed to finally generate a Log-Mel spectrogram. The advantage of the Mel spectrogram is that its frequency scale is closer to the auditory perception characteristics of the human ear, and it can more effectively highlight the prosodic and timbre features of emotion-related speech.
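
The chain from windowed frames to a Log-Mel spectrogram can be sketched with NumPy as below. This is an illustrative implementation under stated assumptions: NumPy is available, the FFT size (512) and the number of Mel bands (40) are placeholder choices not given in the text, and the triangular filterbank uses the common 2595·log10(1 + f/700) Mel formula.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters spaced evenly on the Mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):          # rising edge
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling edge
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb


def log_mel(frames, sr=16000, n_fft=512, n_mels=40):
    """frames: (n_frames, win) windowed audio -> (n_frames, n_mels)."""
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    mel_energy = power @ mel_filterbank(n_mels, n_fft, sr).T
    return np.log(mel_energy + 1e-10)          # small offset avoids log(0)


sr = 16000
x = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)       # 1 s test tone
frames = np.stack([x[i:i + 400] * np.hamming(400)
                   for i in range(0, sr - 400 + 1, 160)])
feats = log_mel(frames)
print(feats.shape)  # (98, 40)
```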

[0051] The generated Mel spectrogram sequence is input into a pre-trained speech analysis neural network for extracting speech emotion features. In this embodiment, a pre-trained model based on self-supervised learning, such as Wav2Vec 2.0 or its fine-tuned variant for emotion recognition (e.g., Emotion2Vec), is preferably used. It is important to emphasize that such models, by learning on massive amounts of unlabeled speech data, can capture more fundamental acoustic representations of speech signals that are independent of specific text content. After processing the Mel spectrogram sequence, the model outputs a fixed-dimensional feature vector corresponding to that speech segment, denoted as the second feature vector $F_a$:

$$F_a = f_{aud}\left(M\right)$$

where $F_a$ is the extracted second feature vector, whose dimension is kept consistent with that of the first feature vector (e.g., 768 dimensions) to facilitate fusion. This vector encodes complex prosodic information in speech, such as pitch contours, energy variations, speech rate and rhythm, and formant structure. $f_{aud}$ denotes the speech analysis neural network, and $M$ denotes the input Mel spectrogram sequence.

[0052] It should be noted that, through the parallel feature extraction operations described above, raw signals from different sensors with vastly different physical properties are converted into two comparable feature vectors within the same high-dimensional semantic space: the first feature vector $F_v$ and the second feature vector $F_a$.

[0053] S3. Based on the flight context vector, determine a dynamic modal weight set for fusing the first feature vector and the second feature vector. The dynamic modal weight set includes a first weight and a second weight.

[0054] It should be noted that the purpose of this step is to abandon the fixed-weight or context-independent fusion strategies used in traditional multimodal fusion and instead adopt an intelligent dynamic arbitration mechanism. Based on the pilot's real-time, objective flight scenario (i.e., the flight context vector $C_t$ generated in step S1), this mechanism dynamically evaluates the relative importance or reliability of the visual modality (first feature vector $F_v$) and the auditory modality (second feature vector $F_a$), and generates a set of dynamically changing modal weights. This ensures that subsequent feature fusion is tailored to the situation, focusing on the more informative modality in each flight scenario, thereby improving the accuracy and robustness of emotion monitoring.

[0055] Furthermore, a gated neural network is introduced as a dynamic weight generator. Specifically, this embodiment employs a pre-defined, lightweight gated neural network designed to learn the mapping from the flight context vector $C_t$ to the modal weight set. This gated neural network can consist of several fully connected layers and a final activation function layer. In this embodiment, its structure can be designed as follows: an input layer with 9 neurons receives the 9-dimensional flight context vector; this is followed by two hidden layers, each consisting of 16 neurons with a ReLU (Rectified Linear Unit) activation function; finally, an output layer with 2 neurons outputs the raw weight vector directly, without an activation function, for subsequent normalization by the Softmax function. This structure is sufficient for the network to learn the nonlinear mapping between the context vector and the modal weights while remaining lightweight. The reason for choosing a gated neural network is that its structure simulates a "gate": a control signal (here, the context vector $C_t$) determines the extent to which each information flow (here, each modal feature) passes through.

[0056] Specifically, the weight generation process is as follows: the 9-dimensional flight context vector $C_t$ generated in step S1 serves as the input to the gated neural network. After a series of nonlinear transformations within the network, a two-dimensional raw weight vector $W_{raw}$ is output: $W_{raw} = \mathrm{GateNet}(C_t)$, where $C_t$ is the flight context vector at time $t$, $\mathrm{GateNet}(\cdot)$ denotes the gated neural network described above, and $W_{raw}$ is the two-dimensional raw output vector, whose values are unbounded at this stage.
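The 9-16-16-2 structure and the raw (unnormalized) output described above can be sketched in plain Python as follows. The weights here are random and untrained, and the layout of the context vector (six one-hot flight phases, autopilot flag, autothrottle flag, control load) is an assumption made for illustration; all function names are hypothetical.

```python
import random

def make_layer(n_in, n_out, rng):
    """Random (untrained) weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def dense(x, layer, relu):
    """Fully connected layer, optionally followed by ReLU."""
    weights, biases = layer
    out = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [max(0.0, v) for v in out] if relu else out

def build_gate_network(seed=0):
    rng = random.Random(seed)
    # 9 inputs -> 16 (ReLU) -> 16 (ReLU) -> 2 raw outputs
    return [make_layer(9, 16, rng), make_layer(16, 16, rng), make_layer(16, 2, rng)]

def gate_forward(network, context):
    """Map a 9-dimensional context vector to a 2-dimensional raw weight vector."""
    if len(context) != 9:
        raise ValueError("expected a 9-dimensional flight context vector")
    hidden = dense(context, network[0], relu=True)
    hidden = dense(hidden, network[1], relu=True)
    return dense(hidden, network[2], relu=False)  # no activation on the output layer

# Assumed layout: [takeoff, climb, cruise, descent, approach, landing, AP, AT, load]
context = [0, 0, 0, 0, 1, 0, 0, 0, 0.9]
raw_weights = gate_forward(build_gate_network(), context)
```

In a real system the layer parameters would come from end-to-end training rather than from a random generator.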

[0057] Specifically, in order to give the generated weights a clear physical meaning (i.e., the relative contribution ratio of each modality), the raw weight vector $W_{raw}$ needs to be normalized to a defined range. This embodiment uses the Softmax function for this purpose, ensuring that the two weight values are non-negative and sum to 1. The resulting dynamic modal weight set is $[w_v, w_a] = \mathrm{Softmax}(W_{raw})$, where $w_v$ is the first weight, corresponding to the first feature vector (visual modality), and $w_a$ is the second weight, corresponding to the second feature vector (auditory modality). Furthermore, $w_v + w_a = 1$.
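A minimal sketch of the Softmax normalization step is given below. The max-subtraction trick for numerical stability and the sample raw values are assumptions added for illustration.

```python
import math

def softmax(values):
    """Numerically stable softmax: outputs are non-negative and sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw output of the gating network
first_weight, second_weight = softmax([1.2, -0.3])
```

Because only the difference between the two raw values matters, the network is free to output values of any scale; the normalization always yields a valid pair of contribution ratios.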

[0058] It should be noted that the gated neural network is trained end-to-end, jointly with the final emotion classification task, through the backpropagation algorithm. This means that the network learns its arbitration logic from data rather than from hand-written rules. This learned behavior is illustrated in this embodiment using three typical scenarios.

[0059] Specifically, the first scenario is the manual approach phase. When the context vector $C_t$ indicates that the flight is in the "approach" phase (the corresponding position in $C_t$ is 1), the automation level is low (AP / AT is 0), and the control load is high, pilots typically focus on instruments and external visuals, with less verbal activity. In this context, facial micro-expressions (such as tension or focus) become crucial for assessing their state. A well-trained gated neural network will tend to output a larger first weight $w_v$ and a smaller second weight $w_a$, giving visual features a greater influence during fusion.

[0060] Specifically, the second scenario occurs during radio communication. When a pilot is detected engaging in ATC (Air Traffic Control) communication (which can be determined with the aid of Voice Activity Detection (VAD)), their facial expression may be relatively neutral, but changes in speech rate, pitch, and rhythm can effectively reflect their cognitive load and emotional state (e.g., faster speech and higher pitch when nervous). In this case, the gated neural network dynamically increases the value of the second weight $w_a$ based on the context (e.g., cruise phase, low control load but with voice activity), so that the analysis focuses on the more informative speech features.

[0061] Specifically, the third scenario is modality failure. When the face is obscured by an oxygen mask, causing a sharp decline in the quality of visual feature extraction (this can be detected by adding a quality assessment step to the feature extraction module and inputting the quality signal as part of the context), the gated neural network can learn to drop the weight of that modality, $w_v$, to almost zero and rely almost entirely on the auditory modality, thus ensuring the robustness of the system.

[0062] It should be noted that this invention, by introducing a gated neural network driven by flight context vectors, enables intelligent and dynamic arbitration of multimodal information flows. This allows the entire emotion monitoring method to adaptively cope with the complex and ever-changing working scenarios within the cockpit, ensuring the accuracy, reliability, and scenario adaptability of the final analysis results, and overcoming the fundamental shortcomings of existing static fusion strategies.

[0063] S4. Use the dynamic modal weight set to weight and fuse the first feature vector and the second feature vector to generate a fused feature vector, and determine the pilot's current emotional state based on the fused feature vector.

[0064] It should be noted that this step aims to utilize the extracted modal features and dynamically generated fusion weights to arrive at an accurate judgment of the pilot's emotional state through a fusion and classification process.

[0065] Furthermore, context-aware weighted fusion is performed as follows: the first feature vector $F_v$ (visual modality) and the second feature vector $F_a$ (auditory modality) extracted in step S2 are multiplied by the corresponding first weight $w_v$ and second weight $w_a$ generated in step S3 (scalar-vector multiplication): $F_v' = w_v \cdot F_v$, $F_a' = w_a \cdot F_a$, where $F_v'$ is the first weighted feature, adjusted by the dynamic weight $w_v$, and $F_a'$ is the second weighted feature, adjusted by the dynamic weight $w_a$.

[0066] It is important to note that this multiplication can be understood as scaling the "confidence" of each modal information channel according to the current flight context vector. When a modality is judged to be more important in the current context (its weight is high, e.g., close to 0.8), the "voice" of its feature vector in the fusion is amplified; otherwise, it is suppressed.

[0067] Furthermore, the first weighted feature $F_v'$ and the second weighted feature $F_a'$ are summed to generate the final fused feature vector $F_{fused}$: $F_{fused} = F_v' + F_a'$. It should be noted that, because the vectors $F_v$ and $F_a$ extracted in step S2 have the same dimension, element-wise addition can be performed directly. $F_{fused}$ is a condensed, comprehensive feature representation that incorporates visual and auditory information after context-driven arbitration. Unlike traditional simple concatenation or averaging, the fused feature vector not only contains information from both modalities but also encodes the relative importance of the two modalities in the specific flight scenario.
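The scalar-vector multiplication and element-wise summation described above can be sketched as follows. The function name and the toy two-dimensional vectors are illustrative; the embodiment mentions 768 dimensions as an example.

```python
def fuse(first_vec, second_vec, first_weight, second_weight):
    """Weighted fusion: scale each modality by its weight, then add element-wise."""
    if len(first_vec) != len(second_vec):
        raise ValueError("both feature vectors must have the same dimension")
    return [first_weight * v + second_weight * a
            for v, a in zip(first_vec, second_vec)]

# Toy example: visual features dominate with weights (0.8, 0.2)
fused = fuse([1.0, 2.0], [3.0, 4.0], 0.8, 0.2)
```

With weights of (0.5, 0.5) this reduces to plain averaging, which shows that the static strategy is a special case of the dynamic one.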

[0068] Furthermore, based on the generated fused feature vector, the pilot's current emotional state is determined. The specific process is as follows: this embodiment uses a classifier to decode the fused feature vector $F_{fused}$ and output the final emotion category. The classifier can be a simple multilayer perceptron (MLP) consisting of one or two fully connected layers and a Softmax output layer: $P = \mathrm{Classifier}(F_{fused})$, where $\mathrm{Classifier}(\cdot)$ denotes the classifier and $P$ is a probability distribution vector whose dimension equals the number of predefined emotion labels. Each value in $P$ represents the probability that the fused feature $F_{fused}$ belongs to the corresponding emotion category.

[0069] Furthermore, to make the monitoring results more targeted and practical, the predefined emotion tag set used in this invention not only includes general basic emotional states (e.g., neutral, happy, sad, angry, surprised, fearful), but also adds at least one key emotional state specific to aviation. In this embodiment, this state is defined as "high cognitive load / stress".

[0070] It should be explained that in the field of aviation safety, identifying "happiness" or "sadness" has limited significance. However, identifying whether a pilot is in a state of "high cognitive load / stress" caused by challenging maneuvers, complex emergency handling, or information overload is of crucial early warning value for predicting operational errors and preventing a decline in situational awareness. Therefore, the establishment of this aviation-specific label enables this methodology to directly output conclusions that are most instructive for flight safety.

[0071] Furthermore, the category with the highest probability value in the probability distribution vector $P$ is selected as the pilot's current emotional state: $E_t = \arg\max_i P_i$, where $E_t$ is the emotional state label detected at time $t$.
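A small sketch of the classification and label selection steps is shown below. It assumes a one-hidden-layer MLP with random, untrained parameters and a toy 4-dimensional fused vector; the label list follows the examples named in this embodiment, and the helper names are hypothetical.

```python
import math
import random

EMOTION_LABELS = ["neutral", "happy", "sad", "angry", "surprised", "fearful",
                  "high cognitive load / stress"]

def init_layer(n_in, n_out, rng):
    """Random (untrained) weights and zero biases."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def linear(x, layer):
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def classify(fused, hidden_layer, output_layer):
    """Return (label, probabilities) for one fused feature vector."""
    hidden = [max(0.0, v) for v in linear(fused, hidden_layer)]   # ReLU
    probs = softmax(linear(hidden, output_layer))
    best = max(range(len(probs)), key=lambda i: probs[i])         # arg max
    return EMOTION_LABELS[best], probs

rng = random.Random(0)
hidden_layer = init_layer(4, 8, rng)
output_layer = init_layer(8, len(EMOTION_LABELS), rng)
label, probs = classify([0.2, -0.1, 0.7, 0.4], hidden_layer, output_layer)
```

Since the parameters are random, the predicted label here carries no meaning; the sketch only shows the data flow from fused vector to probability distribution to selected label.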

[0072] It should be noted that the weighted fusion operation ensures that the feature input to the classifier, $F_{fused}$, has already been "preprocessed" and "refined" according to the flight context and is therefore of high quality. Introducing aviation-specific emotion labels ensures that the method's output directly serves the core needs of aviation safety management. For example, when the system outputs "high cognitive load / stress," it can trigger corresponding alerts from the cockpit intelligent system, or, in post-event analysis, mark that period as a key point requiring focused review.

[0073] Furthermore, before performing weighted fusion, a cross-modal attention mechanism can be used to let $F_v$ and $F_a$ "attend" to each other, achieving information complementarity and alignment at the feature level before fusion. This generates more closely related feature vectors, which are then weighted and fused.

[0074] Specifically, the first and second feature vector sequences extracted in step S2 are input into a cross-modal attention module. With the facial visual features as the query and the speech acoustic features as the key and value, the visual features attend to the speech features; conversely, with the speech acoustic features as the query and the facial visual features as the key and value, the speech features attend to the visual features: $F_v^{att} = \mathrm{Attention}(F_v, F_a, F_a)$, $F_a^{att} = \mathrm{Attention}(F_a, F_v, F_v)$, where $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(QK^{T} / \sqrt{d_k}\right)V$ is the standard scaled dot-product attention function and $d_k$ is the key dimension.

[0075] It should be noted that after processing by the cross-modal attention module, the aligned first and second feature vectors are obtained (either from the last time step of the sequence or after pooling). These two new feature vectors not only contain their original information but also incorporate contextual information from the other modality.
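The bidirectional cross-modal attention can be sketched with standard scaled dot-product attention as below. This simplified version omits the learned query, key, and value projections and any residual connection that a trained module would normally include, and it uses mean pooling to obtain one vector per modality; these simplifications and all names are assumptions for illustration.

```python
import math

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over sequences of vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

def cross_modal(visual_seq, audio_seq):
    """Each modality queries the other; returns both aligned sequences."""
    visual_aligned = attention(visual_seq, audio_seq, audio_seq)
    audio_aligned = attention(audio_seq, visual_seq, visual_seq)
    return visual_aligned, audio_aligned

def mean_pool(sequence):
    """Collapse a sequence of vectors into one vector by averaging."""
    n = len(sequence)
    return [sum(vec[j] for vec in sequence) / n for j in range(len(sequence[0]))]

# Toy sequences: 3 video steps and 4 audio steps, 2 dimensions each
visual_seq = [[0.1, 0.9], [0.4, 0.6], [0.8, 0.2]]
audio_seq = [[0.5, 0.5], [0.9, 0.1], [0.2, 0.8], [0.3, 0.3]]
visual_aligned, audio_aligned = cross_modal(visual_seq, audio_seq)
first_vec, second_vec = mean_pool(visual_aligned), mean_pool(audio_aligned)
```

Note that the two sequences may have different lengths; each output sequence keeps the length of its query sequence.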

[0076] Specifically, by continuously collecting data during non-critical flight phases (such as smooth cruise), a personalized emotion baseline model is established for each pilot. In subsequent monitoring, this baseline model is first used to calibrate the features extracted in real time (e.g., by subtracting the baseline mean) to effectively eliminate noise caused by differences in individual expression habits, making emotion judgments more accurate.

[0077] Specifically, when the method is first applied to a particular pilot, the first and second feature vectors of that pilot under a "neutral" mood are continuously collected during non-critical flight phases (such as long-duration smooth cruise). From the collected feature vectors (e.g., more than 1000), the mean vectors $\mu_v$ and $\mu_a$ and the standard deviation vectors $\sigma_v$ and $\sigma_a$ are calculated. These statistics constitute the pilot's personalized emotional baseline model. After step S2 extracts the original first and second feature vectors, the following calibration is performed (in this embodiment, z-score normalization is used): $F_v^{cal} = (F_v - \mu_v) / \sigma_v$, $F_a^{cal} = (F_a - \mu_a) / \sigma_a$, with the division applied element-wise, where $F_v^{cal}$ is the calibrated first feature vector and $F_a^{cal}$ is the calibrated second feature vector.

[0078] It should be noted that the calibrated feature vectors $F_v^{cal}$ and $F_a^{cal}$ obtained through this calibration eliminate static biases caused by differences in individual facial structure or pronunciation habits, allowing emotion judgments to be based on changes relative to the pilot's "normal" state rather than on absolute values.
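The baseline estimation and z-score calibration can be sketched as follows. The small `eps` term guarding against zero standard deviation, the population (rather than sample) standard deviation, and the two-sample toy data are assumptions for illustration; the embodiment suggests collecting more than 1000 vectors.

```python
import math

def fit_baseline(samples):
    """Per-dimension mean and standard deviation of neutral-state feature vectors."""
    n, dim = len(samples), len(samples[0])
    mean = [sum(s[j] for s in samples) / n for j in range(dim)]
    std = [math.sqrt(sum((s[j] - mean[j]) ** 2 for s in samples) / n)
           for j in range(dim)]
    return mean, std

def calibrate(feature, mean, std, eps=1e-8):
    """Element-wise z-score: (x - mean) / std, with eps to avoid division by zero."""
    return [(x - m) / (s + eps) for x, m, s in zip(feature, mean, std)]

# Toy baseline from two neutral-state vectors
mean, std = fit_baseline([[1.0, 10.0], [3.0, 30.0]])
calibrated = calibrate([3.0, 30.0], mean, std)
```

The same two functions would be applied separately to the visual and auditory feature vectors, each with its own baseline statistics.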

[0079] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, this application can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code. The solutions in the embodiments of this application can be implemented using various computer languages, such as the object-oriented programming language Java and the interpreted scripting language JavaScript.

[0080] This application is described with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this application. It will be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in one or more processes of the flowchart and / or one or more blocks of the block diagram.

[0081] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more processes of the flowchart and / or one or more blocks of the block diagram.

[0082] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowchart and / or one or more blocks of the block diagram.

[0083] Although preferred embodiments of this application have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments as well as all changes and modifications falling within the scope of this application.

[0084] Obviously, those skilled in the art can make various modifications and variations to this application without departing from the spirit and scope of this application. Therefore, if such modifications and variations fall within the scope of the claims of this application and their equivalents, this application also intends to include such modifications and variations.

Claims

1. A pilot emotion monitoring method based on multimodal analysis, characterized by comprising: acquiring real-time video and audio streams of the pilot in the cockpit, acquiring real-time flight parameter data synchronized with the real-time video and audio streams, and generating a flight context vector based on the real-time flight parameter data; extracting a first feature vector from the real-time video stream, the first feature vector representing the pilot's facial visual features, and extracting a second feature vector from the real-time audio stream, the second feature vector representing the pilot's speech acoustic features; determining, based on the flight context vector, a dynamic modal weight set for fusing the first feature vector and the second feature vector, the dynamic modal weight set including a first weight and a second weight; and weighting and fusing the first feature vector and the second feature vector using the dynamic modal weight set to generate a fused feature vector, and determining the pilot's current emotional state based on the fused feature vector.

2. The pilot emotion monitoring method based on multimodal analysis as described in claim 1, characterized in that the real-time flight parameter data includes at least one of the following: flight phase information, aircraft automation level information, and pilot control load information.

3. The pilot emotion monitoring method based on multimodal analysis as described in claim 2, characterized in that determining the flight phase information includes: classifying, based on the aircraft's altitude, speed, landing gear status, and flap position data, the current flight phase into one of the preset flight phase categories, which include takeoff, climb, cruise, descent, approach, and landing.

4. The pilot emotion monitoring method based on multimodal analysis as described in claim 1, characterized in that the step of extracting the first feature vector from the real-time video stream includes: detecting and locating the pilot's facial region in each frame of the real-time video stream; and inputting the facial region into a pre-trained visual analysis neural network to extract the first feature vector related to facial micro-expressions.

5. The pilot emotion monitoring method based on multimodal analysis as described in claim 1, characterized in that the step of extracting the second feature vector from the real-time audio stream includes: preprocessing the real-time audio stream to generate a time-frequency spectrogram; and inputting the time-frequency spectrogram into a pre-trained speech analysis neural network to extract the second feature vector related to emotional prosody.

6. The pilot emotion monitoring method based on multimodal analysis as described in claim 1, characterized in that the step of determining the dynamic modal weight set includes: inputting the flight context vector into a pre-defined gated neural network; and calculating, from the output of the gated neural network, the first weight and the second weight, which correspond to the first feature vector and the second feature vector, respectively.

7. The pilot emotion monitoring method based on multimodal analysis as described in claim 6, characterized in that the step of weighted fusion using the dynamic modal weight set includes: multiplying the first feature vector by the first weight to obtain a first weighted feature; multiplying the second feature vector by the second weight to obtain a second weighted feature; and summing the first weighted feature and the second weighted feature by vector addition to generate the fused feature vector.

8. The pilot emotion monitoring method based on multimodal analysis as described in claim 1, characterized in that, prior to the weighted fusion, the method further includes: using a cross-modal attention mechanism to interactively process the first and second feature vectors, so as to achieve feature alignment and information complementarity in the time dimension and to generate aligned first and second feature vectors.

9. The pilot emotion monitoring method based on multimodal analysis as described in claim 1, characterized in that the step of determining the pilot's current emotional state includes: inputting the fused feature vector into a classifier; and outputting, by the classifier, a classification result selected from a predefined set of labels containing multiple basic emotional states and at least one aviation-specific emotional state.

10. The pilot emotion monitoring method based on multimodal analysis as described in claim 1, characterized in that the method further includes: during non-critical flight phases, continuously collecting the pilot's first and second feature vectors to establish a personalized emotional baseline model for the pilot; and, before performing weighted fusion, calibrating the currently extracted first and second feature vectors using the personalized emotional baseline model.