An intrapartum fetal risk assessment method based on multimodal data fusion

MIRF-Net combines a PatchTST encoder for CTG time series, a GAF image representation of the fetal heart rate, and encoded maternal data in a multimodal fusion Transformer. It addresses the insufficient information available to single-modality CTG models and the inefficiency of long-term dependency modeling, enabling a more comprehensive and robust assessment of fetal risk.

CN122245781A · Pending · Publication Date: 2026-06-19 · GUANGZHOU LIAN MED TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
GUANGZHOU LIAN MED TECH CO LTD
Filing Date
2026-04-07
Publication Date
2026-06-19

Abstract

This invention discloses a method for intrapartum fetal risk assessment based on multimodal data fusion, comprising: S1, collecting intrapartum CTG data and preprocessing it to obtain CTG time-series data including fetal heart rate (FHR) and uterine contraction signal (UC) and maternal structured metadata; S2, constructing a MIRF-Net network framework composed of a CTG signal encoder, an image encoder, a structured data encoder, and a Transformer-based multimodal fusion module, and outputting risk prediction results through a fully connected classification head; S3, training the MIRF-Net network framework to obtain an intrapartum fetal risk assessment model based on multimodal data fusion, wherein the intrapartum fetal risk assessment model based on multimodal data fusion is used to perform intrapartum fetal risk assessment based on the input intrapartum CTG data. This invention achieves collaborative modeling of multi-source heterogeneous data, realizing cross-modal feature interaction and joint representation learning within a unified feature space, and is used for intrapartum fetal hypoxia risk prediction.

Description

Technical Field

[0001] This invention relates to the field of fetal risk assessment technology, and more specifically, to a method for intrapartum fetal risk assessment based on multimodal data fusion.

Background Technology

[0002] Continuous assessment of fetal health is crucial for ensuring maternal and infant safety during labor and delivery. The core objective of intrapartum fetal monitoring is to identify potential hypoxia-related risk states before irreversible fetal injury occurs, providing a basis for clinical intervention. Fetal heart rate (FHR) and uterine contraction (UC) signals are acquired from the pregnant woman's abdomen using an ultrasound probe, forming cardiotocography (CTG) signals, which are currently the primary means of intrapartum fetal monitoring in clinical practice. The physiological patterns reflected in the CTG signal provide important information for intrapartum fetal risk assessment, and the timeliness and accuracy of the assessment directly affect the appropriateness of obstetric intervention decisions and the safety of maternal and infant outcomes. Abnormal CTG waveforms usually indicate fetal hypoxemia; failure to identify the related risks in time may significantly increase the probability of long-term complications for both mother and fetus. Conversely, when no significant fetal risk is observed, unnecessary emergency operative delivery should be avoided, especially in late labor, as the procedure itself carries high risks for both mother and fetus. Therefore, developing reliable methods for intrapartum fetal risk assessment is of significant clinical importance.

[0003] In actual clinical practice, fetal risk assessment still primarily relies on obstetricians' manual judgment based on CTG waveforms. This process is inevitably affected by individual differences in experience, workload, and subjective perception, leading to uncertainty and time delays in risk signal identification. Computerized CTG (cCTG) analysis systems are unaffected by human factors such as fatigue and distraction, can operate around the clock, and support more systematic signal analysis, thus providing assistance for clinical decision-making. Existing cCTG methods can be mainly divided into two categories: machine learning (ML) based methods and deep learning (DL) based methods.

[0004] Early studies largely employed machine learning (ML) methods, manually extracting time-domain, frequency-domain, and time-frequency-domain features from CTG or FHR signals and combining them with classifiers to determine fetal status. Saleem et al. constructed a robust automatic classification model for fetal dynamic status analysis using time-frequency domain features of fetal heart rate signals. Ramla et al. used a decision tree model to predict high-risk pregnancies based on fetal health status. Zeng et al. addressed the non-stationarity and class imbalance of CTG signals by combining linear fetal heart rate features with time-frequency features to train a cost-sensitive support vector machine classifier. Thippa et al. trained a fetal status classification model using nine CTG features based on Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA). However, these methods largely rely on manual feature design, with feature definition and selection limited by domain experience, and may introduce non-negligible measurement errors during secondary feature extraction. Furthermore, their performance is highly dependent on sample size, signal quality in the database, and the performance of the classifier itself, limiting the model's full utilization of the complex dynamic information in CTG signals.

[0005] To overcome the limitations of relying on manual features, researchers have begun to use deep learning models for end-to-end automatic modeling of CTG signals, reducing feature engineering requirements and improving the model's generalization ability. Existing research has modeled fetal states using convolutional neural networks (CNNs), recurrent neural networks (RNNs), and their variants. Liang et al. proposed a hybrid model combining a one-dimensional CNN and a bidirectional gated recurrent unit (BiGRU) to achieve fetal health status detection based on FHR signals. Mujun et al. proposed a CNN-BiLSTM network that incorporates an attention mechanism and introduces the discrete wavelet transform to enhance feature extraction for fetal acidosis classification. Fei et al. further proposed a multimodal bidirectional GRU network that fuses FHR, uterine contraction, and fetal movement signals to achieve end-to-end modeling of CTG data, achieving good performance in both classification accuracy and overall effectiveness. Xiao et al. proposed a Deep Feature Fusion Network (DFFN) that extracts complex features from FHR signals using a multi-scale CNN-BiLSTM network and fuses them with statistical features. However, the aforementioned methods still have limitations in modeling long-term dependencies, and most studies set the modeling goal as the prediction of fetal acidosis or similar postpartum outcomes. These outcome indicators (such as umbilical cord blood pH or Apgar score) are usually only available after delivery, so they amount to "post-hoc prediction" and are difficult to use directly for real-time intrapartum risk assessment. In addition, existing methods mostly focus on a single modality, mainly the FHR signal, and do not adequately consider uterine contraction information and the broader clinical context. This limits the model's ability to characterize complex intrapartum risk states and prevents it from effectively mining complementary information across modalities.

[0006] In recent years, the introduction of self-attention and the Transformer architecture has provided stronger global dependency modeling capabilities for time series classification. Compared with traditional CNNs and RNNs, self-attention-based models can dynamically focus on the time segments most relevant to the task across the entire sequence and more effectively capture complex long-range dependencies. In response to the long-duration, non-stationary, and highly variable characteristics of CTG signals, some studies have attempted to introduce attention mechanisms into CTG analysis tasks. For example, Asfaw et al. [21] proposed the Gated Convolutional Multi-Head Attention (GCMHA) model, which combines a CNN with attention mechanisms to enhance the modeling of temporal dependencies. Wu et al. further proposed a hybrid model combining a Transformer and a convolutional network (ETCNN), which divides the fetal heart rate pattern into acceleration and deceleration stages to capture both short-term and long-term features, thereby improving feature expression to a certain extent. However, the computational complexity of attention in the standard Transformer grows quadratically with sequence length, which often results in high computational and storage overhead for long CTG recordings sampled at a high rate and limits its potential for real-time applications. Nie et al. proposed a patch-based time-series Transformer framework (PatchTST), which compresses the original sequence by dividing a long sequence into fixed-length subsequence segments and feeding the patches to the attention network as modeling tokens. Their research shows that this strategy significantly reduces computational complexity while preserving local semantic features and modeling long-range temporal dependencies across segments, achieving superior performance in long-sequence prediction tasks. Based on this idea, introducing patch-based Transformers (such as PatchTST) into intrapartum fetal risk assessment is expected to characterize the multi-timescale dynamic patterns and long-term dependency structures in FHR and UC signals more efficiently.
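To make the cost argument concrete, the following sketch compares the number of attention tokens with and without patching. The record length (30 minutes at 4 Hz) is taken from the embodiment described later in this document; the patch length of 16 and stride of 8 are illustrative assumptions and are not stated in the source, and `num_patches` is a hypothetical helper.

```python
import math


def num_patches(seq_len: int, patch_len: int, stride: int) -> int:
    """Patches needed to cover seq_len samples when the tail is padded."""
    if seq_len <= patch_len:
        return 1
    return math.ceil((seq_len - patch_len) / stride) + 1


# 30 minutes of CTG at 4 Hz (figures from the embodiment).
seq_len = 30 * 60 * 4            # 7200 samples per channel

# Assumed values for illustration only; the source does not state them.
patch_len, stride = 16, 8

tokens = num_patches(seq_len, patch_len, stride)
pointwise_pairs = seq_len ** 2   # attention score entries with one token per sample
patch_pairs = tokens ** 2        # attention score entries with one token per patch

print(tokens, pointwise_pairs // patch_pairs)
```

Under these assumed settings the attention score matrix shrinks by roughly the square of the stride, which is the saving the paragraph above attributes to patching.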

[0007] From a signal representation perspective, FHR signals exhibit significant non-stationarity and complex temporal structures, making it difficult to fully characterize the underlying patterns using only one-dimensional temporal representations. Fortunately, multimodal learning provides a feasible method for utilizing multimodal features. In recent years, time-series to image conversion methods have offered new insights for modeling complex physiological signals. Among them, Gramian Angular Field (GAF) can map one-dimensional time-series signals into two-dimensional image forms, thus explicitly characterizing the spatial correlations between time samples and providing supplementary information to the model that differs from traditional time-domain representations, achieving good performance in many time-series tasks. Therefore, image-based modeling of FHR signals using the GAF method helps enhance the ability to characterize complex heart rate dynamic patterns from a structural perspective, complementing patch-based temporal modeling.

[0008] On the other hand, fetal risk during labor is not determined solely by the fetus's own physiological signals. The mother's baseline health condition significantly affects fetal oxygenation and tolerance to labor stress during delivery. In 2020, approximately 287,000 women worldwide died from pregnancy and childbirth-related complications, indicating that maternal health remains a crucial factor influencing maternal and infant outcomes. During labor, the fetus's oxygen supply depends entirely on the mother's respiratory and circulatory systems, and maternal health is closely related to the fetus's tolerance to hypoxia and labor stress. The American College of Obstetricians and Gynecologists (ACOG) points out that factors such as hypertension, Rh incompatibility, a history of infertility, advanced maternal age, and primiparity can significantly increase fetal health risks. However, most previous studies have not systematically incorporated structured maternal clinical information into fetal status assessment, which limits the comprehensiveness and clinical applicability of the resulting risk assessments.

[0009] In summary, existing technologies generally suffer from drawbacks such as insufficient information in single-modal CTG models, inefficient modeling of long-term temporal dependencies, and poor performance of simple multimodal fusion. Therefore, there is a need for a method for intrapartum fetal risk assessment based on multimodal data fusion.

Summary of the Invention

[0010] The purpose of this invention is to provide a method for intrapartum fetal risk assessment based on multimodal data fusion, so as to overcome the technical problems existing in the prior art.

[0011] To achieve the above objectives, the technical solution adopted by the present invention is as follows: A method for intrapartum fetal risk assessment based on multimodal data fusion includes the following steps:

S1. Collect CTG data during labor and preprocess it to obtain CTG time-series data, including fetal heart rate (FHR) and uterine contraction signal (UC), and maternal structured metadata.

S2. Construct a MIRF-Net network framework consisting of a CTG signal encoder, an image encoder, a structured data encoder, and a Transformer-based multimodal fusion module, and output risk prediction results through a fully connected classification head. The input modalities of the MIRF-Net network framework include a CTG time-series signal containing fetal heart rate (FHR) and uterine contraction signal (UC), a GADF image generated from the fetal heart rate (FHR), and maternal structured metadata. In the CTG signal encoder, a time-series Transformer based on PatchTST is used to perform multi-scale modeling of FHR and UC. Local morphological changes and long-term dependencies are captured simultaneously through patch partitioning and the self-attention mechanism to obtain a temporal embedding representation. In the image encoder, the FHR is mapped to a two-dimensional image through the GADF transformation to explicitly encode the global correlation structure, and the image is then input into a pre-trained ResNet101 to extract texture and structural features. In the structured data encoder, an autoencoder (MAE) is introduced to perform nonlinear compression and representation learning on the structured features, yielding compact latent variables that retain key clinical priors and suppress redundancy and noise. In the Transformer-based multimodal fusion module, the three embeddings are first linearly projected to a unified dimension and input to the multimodal fusion Transformer as a token sequence. Multi-head self-attention models the cross-modal interactions and learns the complementary information among temporal dynamics, image texture structure, and maternal priors to obtain a fused global representation. The fused representation is then used by a fully connected classifier to perform binary classification of the fetal state and output the corresponding raw scores (logits) and predicted probability.

[0012] S3. The MIRF-Net network framework is trained using CTG time series data including fetal heart rate (FHR) and uterine contraction signal (UC) obtained in step S1 and maternal structured metadata to obtain a multimodal data fusion-based intrapartum fetal risk assessment model. The multimodal data fusion-based intrapartum fetal risk assessment model is used to perform intrapartum fetal risk assessment based on the input intrapartum CTG data.

[0013] Furthermore, step S1 specifically involves: acquiring CTG data based on the CTU-UHB public dataset; firstly, identifying and removing invalid data points through outlier detection; then, reconstructing the missing signal using a linear interpolation method between valid data points; and finally, smoothing the data.

[0014] Furthermore, the processing steps of the CTG signal encoder in step S2 specifically include: Suppose a single CTG sample contains K channels, and the one-dimensional time series of the k-th channel is represented as:

$$x^{(k)} = \left[x^{(k)}_1, x^{(k)}_2, \dots, x^{(k)}_L\right] \in \mathbb{R}^{L}, \quad k = 1, \dots, K$$

[0015] In the formula, L represents the length of the time series. For each channel, instance normalization is performed separately, using the following formula:

$$\tilde{x}^{(k)} = \frac{x^{(k)} - \mu_k}{\sigma_k}$$

[0016] In the formula, $\mu_k$ and $\sigma_k$ represent the mean and standard deviation of the k-th channel time series, respectively. PatchTST employs a channel-independent modeling strategy, treating the channels of a multi-channel CTG signal as independent one-dimensional time series and modeling them separately. Each channel's one-dimensional time series is divided into patches. Given the patch length P and stride S, the i-th patch is defined as:

$$p^{(k)}_i = \left[\tilde{x}^{(k)}_{(i-1)S+1}, \dots, \tilde{x}^{(k)}_{(i-1)S+P}\right] \in \mathbb{R}^{P}$$

[0017] By padding the end of the sequence with zeros to ensure complete coverage, each channel can ultimately be represented as a patch sequence consisting of N patches, where:

$$N = \left\lceil \frac{L - P}{S} \right\rceil + 1$$

[0018] Each patch is linearly embedded and projected onto a d-dimensional feature space:

$$e^{(k)}_i = W_p\, p^{(k)}_i + b_p$$

[0019] In the formula, $W_p \in \mathbb{R}^{d \times P}$ and $b_p \in \mathbb{R}^{d}$ are learnable parameters. To preserve the relative position information of each patch in the original time series, a positional encoding $E_{pos} \in \mathbb{R}^{N \times d}$ is introduced as a learnable parameter. The final token input to the Transformer is represented as:

$$z^{(k)}_i = e^{(k)}_i + E_{pos,i}$$

[0020] The embedded patch sequence is input into the Transformer encoder. By stacking multi-head self-attention and feedforward networks, the encoder jointly models the local dynamic changes and long-term dependencies in CTG signals. The computational form of a single self-attention layer is as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

[0021] In the formula, $Q$, $K$, and $V$ represent the query, key, and value matrices obtained by linear mapping of the input features, respectively, and $\sqrt{d_k}$ is used to scale the inner product to stabilize the training process. After the last layer of the Transformer encoder, global average pooling is performed over the patch dimension to obtain the overall temporal representation of each channel, where $o^{(k)}_i$ denotes the encoder output for the i-th patch token:

$$h^{(k)} = \frac{1}{N}\sum_{i=1}^{N} o^{(k)}_i$$

[0022] The feature representations of the channels are concatenated to form the final CTG embedding representation: $h_{ctg} = \left[h^{(1)}; \dots; h^{(K)}\right] \in \mathbb{R}^{Kd}$.
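The instance normalization, patching, embedding, and pooling steps described above can be sketched with NumPy as follows. This is a minimal illustration rather than the patented implementation: the values of P, S, and d are assumptions, the weights are random stand-ins for learned parameters, the small `eps` term is added only for numerical safety, and the Transformer encoder layers themselves are omitted.

```python
import numpy as np


def instance_normalize(x, eps=1e-8):
    """Normalize each channel of x (shape K x L) by its own mean and std."""
    mu = x.mean(axis=1, keepdims=True)
    sigma = x.std(axis=1, keepdims=True)
    return (x - mu) / (sigma + eps)


def make_patches(x, patch_len, stride):
    """Split each channel (K x L) into patches; returns shape K x N x P."""
    k, length = x.shape
    n = max(int(np.ceil((length - patch_len) / stride)), 0) + 1
    padded = np.zeros((k, (n - 1) * stride + patch_len), dtype=x.dtype)
    padded[:, :length] = x                      # zero-pad the tail for full coverage
    idx = (np.arange(n) * stride)[:, None] + np.arange(patch_len)[None, :]
    return padded[:, idx]


rng = np.random.default_rng(0)
K, L = 2, 7200                 # FHR and UC, 30 minutes at 4 Hz
P, S, d = 16, 8, 32            # assumed values, not given in the source

ctg = rng.normal(size=(K, L))  # synthetic stand-in for a preprocessed record
patches = make_patches(instance_normalize(ctg), P, S)

# Random stand-ins for the learned projection and positional encoding,
# shared across channels (channel-independent modeling).
W_p = rng.normal(scale=0.02, size=(P, d))
b_p = np.zeros(d)
E_pos = rng.normal(scale=0.02, size=(patches.shape[1], d))
tokens = patches @ W_p + b_p + E_pos            # K x N x d

# The Transformer encoder would transform `tokens` here (omitted).
h = tokens.mean(axis=1)        # global average pooling over patches: K x d
h_ctg = h.reshape(-1)          # concatenated channel representations: K*d
```

Because the projection and positional encoding are shared, adding a channel changes only the leading dimension of `tokens`, which is the practical meaning of the channel-independent strategy.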

[0023] Furthermore, the image encoder processing steps in step S2 specifically include: Let a single fetal heart rate (FHR) sequence be a time series of length $L$, $x = [x_1, x_2, \dots, x_L]$.

[0024] Normalize the time series to the interval [a, b], where −1 ≤ a < b ≤ 1, and denote the normalized sequence by $\tilde{x}$.

[0025] The normalized sequence is mapped to polar coordinate space, where the angle corresponding to each sampling point is represented as:

$$\phi_i = \arccos(\tilde{x}_i), \quad i = 1, \dots, L$$

[0026] The angle difference between any two time points is mapped to matrix elements using the Gramian Angular Difference Field (GADF):

$$G_{ij} = \sin(\phi_i - \phi_j)$$

[0027] In the formula, $G \in \mathbb{R}^{L \times L}$ is the GADF image matrix.

[0028] The GADF image matrix is scaled to a fixed resolution $H \times W$ using bilinear interpolation:

$$I = \mathrm{Resize}_{\mathrm{bilinear}}(G) \in \mathbb{R}^{H \times W}$$

[0029] In the formula, $I$ represents the final GADF image input, and $i$, $j$ are the pixel indices of the GADF image. A pre-trained ResNet101 is used as the image encoder to extract hierarchical visual features from the GADF image. Let the output of ResNet101 after removing the final classification layer be the feature vector. Then the image encoding process can be represented as follows:

$$h_{img} = f_{\mathrm{ResNet101}}(I)$$

[0030] In the formula, $h_{img}$ represents the image modality embedding representation.
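As an illustration of the GADF steps above, the sketch below rescales a series to [a, b], takes the arccosine, forms the pairwise sine of angle differences, and resizes the result. It is a sketch under stated assumptions: min-max rescaling is one common way to normalize to [a, b], the 224 × 224 target size and the corner-aligned bilinear scheme are assumptions, `gadf` and `resize_bilinear` are hypothetical helper names, and the ResNet101 stage is not shown.

```python
import numpy as np


def gadf(x, a=-1.0, b=1.0):
    """Gramian Angular Difference Field of a 1-D series; returns (L, L)."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        scaled = np.full_like(x, (a + b) / 2)     # constant series
    else:
        scaled = a + (b - a) * (x - x.min()) / span
    scaled = np.clip(scaled, -1.0, 1.0)           # guard arccos against rounding
    phi = np.arccos(scaled)
    return np.sin(phi[:, None] - phi[None, :])    # G[i, j] = sin(phi_i - phi_j)


def resize_bilinear(img, out_h, out_w):
    """Bilinear resize of a 2-D array with corner-aligned sampling."""
    in_h, in_w = img.shape
    ys = np.linspace(0, in_h - 1, out_h)
    xs = np.linspace(0, in_w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]
    wx = (xs - x0)[None, :]
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy


# Synthetic FHR-like segment (beats per minute), for illustration only.
fhr = 140 + 10 * np.sin(np.linspace(0, 4 * np.pi, 256))
G = gadf(fhr)
image = resize_bilinear(G, 224, 224)   # assumed input size for the CNN
```

The matrix is antisymmetric with a zero diagonal, so unlike the summation variant of GAF it encodes the direction of change between any two time points.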

[0031] Furthermore, the processing steps of the structured data encoder in step S2 are as follows: Let the metadata input be:

$$m = \left[m_{age}, m_{grav}, m_{par}, m_{gdm}\right]$$

[0032] In the formula, $m_{age}$ is the maternal age, $m_{grav}$ is the gravidity, $m_{par}$ is the parity, and $m_{gdm}$ is a binary indicator variable for gestational diabetes mellitus. Any numerical feature $m_j$ is standardized as $\tilde{m}_j = (m_j - \mu_j) / \sigma_j$, where $\mu_j$ and $\sigma_j$ are the mean and standard deviation computed on the training set. After standardization, the preprocessed metadata vector is obtained as:

$$\tilde{m} \in \mathbb{R}^{d_m}$$

[0033] A structured data encoder $f_{enc}$ maps the input vector $\tilde{m}$ to a latent variable $z_m$, and an auxiliary reconstruction branch introduced through a decoder $f_{dec}$ forms a reconstruction $\hat{m}$ of the input:

$$z_m = f_{enc}(\tilde{m}) \in \mathbb{R}^{d_z}, \qquad \hat{m} = f_{dec}(z_m)$$

[0034] In the formula, $d_z$ represents the dimension of the latent variable. Both the encoder and decoder adopt a lightweight multilayer perceptron structure. The encoder consists of two fully connected layers with ReLU activation, and its forward mapping is represented as:

$$a_1 = \mathrm{ReLU}(W_1 \tilde{m} + b_1), \qquad z_m = W_2 a_1 + b_2$$

[0035] In the formula, $W_1$, $b_1$, $W_2$, and $b_2$ are learnable parameters. The decoder employs a symmetrical structure, and its reconstruction process is represented as follows:

$$a_2 = \mathrm{ReLU}(W_3 z_m + b_3), \qquad \hat{m} = W_4 a_2 + b_4$$

In the formula, $W_3$, $b_3$, $W_4$, and $b_4$ are learnable parameters.
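The following sketch shows one way the metadata branch described above could look: training-set standardization of the numerical features, a two-layer MLP encoder, and a mirrored decoder. It is an untrained illustration under assumptions: the four columns stand for age, a pregnancy-count feature, a birth-count feature, and a gestational-diabetes flag, the rows are made up, the hidden and latent sizes are arbitrary, and placing ReLU only after the first layer of each half is a guess that the source does not confirm.

```python
import numpy as np


def relu(v):
    return np.maximum(v, 0.0)


class MetadataAutoencoder:
    """Two-layer MLP encoder with a mirrored decoder (untrained sketch)."""

    def __init__(self, in_dim, hidden_dim, latent_dim, seed=0):
        rng = np.random.default_rng(seed)

        def init(rows, cols):
            return rng.normal(scale=0.1, size=(rows, cols))

        self.W1, self.b1 = init(in_dim, hidden_dim), np.zeros(hidden_dim)
        self.W2, self.b2 = init(hidden_dim, latent_dim), np.zeros(latent_dim)
        self.W3, self.b3 = init(latent_dim, hidden_dim), np.zeros(hidden_dim)
        self.W4, self.b4 = init(hidden_dim, in_dim), np.zeros(in_dim)

    def encode(self, m):
        return relu(m @ self.W1 + self.b1) @ self.W2 + self.b2

    def decode(self, z):
        return relu(z @ self.W3 + self.b3) @ self.W4 + self.b4


# Made-up training rows: age, pregnancy count, birth count, diabetes flag.
train = np.array([[29.0, 1, 0, 0],
                  [34.0, 3, 2, 1],
                  [26.0, 2, 1, 0]])
numeric = [0, 1, 2]                      # columns to standardize
mu = train[:, numeric].mean(axis=0)      # statistics from the training set only
sigma = train[:, numeric].std(axis=0)


def standardize(m):
    out = np.array(m, dtype=float)
    out[..., numeric] = (out[..., numeric] - mu) / sigma
    return out


model = MetadataAutoencoder(in_dim=4, hidden_dim=16, latent_dim=8)
m_tilde = standardize(train)
z = model.encode(m_tilde)     # latent variables passed to the fusion module
m_hat = model.decode(z)       # used only for the reconstruction loss
```

Validation and test rows would be passed through the same `standardize` function so that no statistics leak from held-out data.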

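[0036] Furthermore, the processing steps of the Transformer-based multimodal fusion module in step S2 are as follows: Three independent linear mappings project the CTG temporal branch output $h_{ctg}$, the image branch output $h_{img}$, and the metadata branch latent variable $z_m$ into a shared space of unified dimension $d_f$, yielding three modal tokens:

$$t_{ctg} = W_{ctg} h_{ctg} + b_{ctg}, \qquad t_{img} = W_{img} h_{img} + b_{img}, \qquad t_{meta} = W_{meta} z_m + b_{meta}$$

In the formula, the projection matrices $W_{ctg}$, $W_{img}$, $W_{meta}$ and the corresponding biases $b_{ctg}$, $b_{img}$, $b_{meta}$ are learnable parameters. The three modal tokens are stacked in a fixed order to form an input sequence of length 3:

$$X = \left[t_{ctg}; t_{img}; t_{meta}\right] \in \mathbb{R}^{3 \times d_f}$$

[0037] Learnable position/modal embeddings $E_{mod} \in \mathbb{R}^{3 \times d_f}$ are added to $X$ to obtain the final input:

$$X^{(0)} = X + E_{mod}$$

[0038] The multimodal fusion Transformer is composed of $N_f$ standard Transformer encoder layers, each containing a multi-head self-attention (MHSA) network and a feedforward network (FFN), with residual connections and layer normalization (LN) to stabilize training. The calculation of the $l$-th layer can be represented as:

$$\bar{X}^{(l)} = \mathrm{LN}\!\left(X^{(l-1)} + \mathrm{MHSA}\!\left(X^{(l-1)}\right)\right), \qquad X^{(l)} = \mathrm{LN}\!\left(\bar{X}^{(l)} + \mathrm{FFN}\!\left(\bar{X}^{(l)}\right)\right)$$

In the formula, $X^{(l)}$ is the output of the $l$-th layer. The core of MHSA is the self-attention mechanism, and its single-head form is as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

[0039] In the formula, $Q = XW_Q$, $K = XW_K$, $V = XW_V$, and $d_k$ is the dimension of a single attention head. Multi-head attention is achieved by computing multiple attention heads in parallel, concatenating them, and then performing a linear transformation.

[0040] After $N_f$ layers of fusion encoding, the output $X^{(N_f)}$ is average-pooled over the token dimension to yield a globally fused representation:

$$g = \frac{1}{3}\sum_{i=1}^{3} X^{(N_f)}_i$$

[0041] Finally, the fully connected classifier outputs the fetal risk prediction probability:

$$\hat{p} = \sigma\!\left(W_c\, g + b_c\right)$$

[0042] In the formula, $\sigma$ is the Sigmoid function, and $W_c$ and $b_c$ are learnable parameters.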
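The fusion steps above can be illustrated with a single encoder layer over three modality tokens. The sketch below is deliberately reduced: it uses one attention head and one layer, a post-norm layout, random stand-in weights, and assumed embedding sizes (64 for the CTG branch, 2048 for the image branch, which matches the pooled feature width of a standard ResNet101, and 8 for the metadata latent). None of these sizes are stated in the source.

```python
import numpy as np

rng = np.random.default_rng(0)


def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def dense(in_dim, out_dim):
    """Random stand-in for a learned linear layer: (weights, bias)."""
    return rng.normal(scale=0.1, size=(in_dim, out_dim)), np.zeros(out_dim)


def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # 3 x 3 attention map
    return weights @ V, weights


# Placeholder branch outputs with assumed sizes.
h_ctg, h_img, z_meta = rng.normal(size=64), rng.normal(size=2048), rng.normal(size=8)
d_f = 32                                                # assumed shared dimension

tokens = []
for h in (h_ctg, h_img, z_meta):                        # one projection per modality
    W, b = dense(h.shape[0], d_f)
    tokens.append(h @ W + b)
X = np.stack(tokens) + rng.normal(scale=0.02, size=(3, d_f))   # add modal embeddings

# One post-norm encoder layer: attention, then feedforward, each with a residual.
Wq, Wk, Wv = (dense(d_f, d_f)[0] for _ in range(3))
attn_out, attn = self_attention(X, Wq, Wk, Wv)
X1 = layer_norm(X + attn_out)
(W1, b1), (W2, b2) = dense(d_f, 4 * d_f), dense(4 * d_f, d_f)
X2 = layer_norm(X1 + np.maximum(X1 @ W1 + b1, 0.0) @ W2 + b2)

g = X2.mean(axis=0)                                     # pool over the three tokens
w_c, b_c = rng.normal(scale=0.1, size=d_f), 0.0
prob = 1.0 / (1.0 + np.exp(-(g @ w_c + b_c)))           # sigmoid risk probability
```

With only three tokens the attention map is 3 × 3, so each row can be read directly as how much one modality draws on the other two.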

[0043] Furthermore, step S3 specifically includes: End-to-end training is performed using a joint objective of primary task classification loss and auxiliary reconstruction loss. Given a mini-batch:

$$\mathcal{B} = \left\{\left(x^{ctg}_i, x^{img}_i, x^{meta}_i, y_i\right)\right\}_{i=1}^{B}$$

In the formula, $x^{ctg}_i$ indicates the CTG time-series input, $x^{img}_i$ represents the GADF image input generated from the FHR, and $x^{meta}_i$ indicates the structured clinical features of the mother. The model's forward output includes the logits $s_i$ of the classification head and the reconstructed vector $\hat{m}_i$ of the metadata autoencoder. The overall training objective is defined as:

$$\mathcal{L} = \mathcal{L}_{cls} + \lambda\, \mathcal{L}_{rec}$$

[0044] In the formula, $\mathcal{L}_{cls}$ is the primary task classification loss, $\mathcal{L}_{rec}$ is the metadata reconstruction loss, and $\lambda$ is its weighting coefficient. Label-smoothing cross-entropy (LSCE) is used as the classification loss. For the $i$-th sample, the one-hot distribution of the true label $y_i$ is smoothed to obtain the soft label $q_i$ over $C$ classes:

$$q_{i,c} = (1 - \varepsilon)\,\mathbb{1}[c = y_i] + \frac{\varepsilon}{C}$$

[0045] In the formula, $\varepsilon$ is the smoothing coefficient. The class probability distribution is obtained from the logits $s_i$ by softmax:

$$p_{i,c} = \frac{\exp(s_{i,c})}{\sum_{c'=1}^{C} \exp(s_{i,c'})}$$

[0046] The LSCE on a mini-batch is defined as follows:

$$\mathcal{L}_{cls} = -\frac{1}{B}\sum_{i=1}^{B}\sum_{c=1}^{C} q_{i,c} \log p_{i,c}$$

During the training phase, a reconstruction constraint is introduced. Let the reconstructed vector output by the maternal structured metadata autoencoder be $\hat{m}_i$ and the standardized metadata input be $\tilde{m}_i \in \mathbb{R}^{d_m}$. The reconstruction loss is then calculated using the mean square error:

$$\mathcal{L}_{rec} = \frac{1}{B\, d_m}\sum_{i=1}^{B} \left\lVert \hat{m}_i - \tilde{m}_i \right\rVert_2^2$$

[0047] During the training phase, the model parameters are updated by minimizing the joint loss $\mathcal{L}$. The gradient of $\mathcal{L}_{cls}$ propagates through the three encoders, the fusion module, and the classification head, while the gradient of $\mathcal{L}_{rec}$ primarily acts on the metadata encoder/decoder branch to provide additional structural constraints. Only the classification path is retained during the inference phase: the model outputs class logits and predicted probabilities for risk discrimination. The metadata reconstruction output $\hat{m}$ does not participate in the final prediction and is used only as an auxiliary regularization term during the training phase.
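A compact sketch of the joint objective described above follows. It assumes the common label-smoothing form in which the true class receives 1 − ε plus an equal share of ε, and it uses ε = 0.1 and a reconstruction weight of 0.1 purely as placeholders; the source does not give these values, and the function names are hypothetical.

```python
import numpy as np


def label_smoothing_ce(logits, labels, eps=0.1):
    """Mean cross-entropy between softmax(logits) and smoothed one-hot targets."""
    n, c = logits.shape
    soft = np.full((n, c), eps / c)
    soft[np.arange(n), labels] += 1.0 - eps
    shifted = logits - logits.max(axis=1, keepdims=True)        # stable log-softmax
    log_p = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-(soft * log_p).sum(axis=1).mean())


def joint_loss(logits, labels, recon, target, lam=0.1, eps=0.1):
    """Classification loss plus weighted metadata reconstruction loss."""
    cls = label_smoothing_ce(logits, labels, eps)
    rec = float(np.mean((recon - target) ** 2))                 # mean squared error
    return cls + lam * rec, cls, rec


# Toy mini-batch: 4 samples, 2 classes, 4 metadata features (made-up values).
logits = np.array([[2.0, -1.0], [0.5, 0.3], [-1.2, 1.7], [0.0, 0.0]])
labels = np.array([0, 1, 1, 0])
target = np.zeros((4, 4))
recon = target + 0.5

total, cls, rec = joint_loss(logits, labels, recon, target)
```

At inference time only the logits are used, so the reconstruction term adds no cost to prediction.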

[0048] Compared with the prior art, the advantages of the present invention are as follows: This invention achieves collaborative modeling of multi-source heterogeneous data by jointly modeling the temporal signals of fetal heart rate and uterine contractions, the image representation of fetal heart rate signals, and maternal electronic medical record information, providing a more comprehensive and robust feature representation for complex intrapartum risk states. This invention applies the PatchTST time-series Transformer structure to model fetal heart rate (FHR) and uterine contraction (UC) signals; through patch segmentation, instance normalization, and channel-independent processing, it effectively captures local dynamic changes and long-term temporal dependencies in CTG signals. This invention uses a GAF-based image representation method to convert one-dimensional fetal heart rate signals into two-dimensional images, supplementing the modeling of complex heart rate dynamic patterns from the perspective of spatial correlation and enhancing the characterization of non-stationary signal features. This invention applies a Transformer-based multimodal feature fusion module to achieve cross-modal feature interaction and joint representation learning within a unified feature space, which is used for predicting the risk of fetal hypoxia during labor.

Attached Figure Description

[0049] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0050] Figure 1 is a flowchart of the intrapartum fetal risk assessment method based on multimodal data fusion according to the present invention; Figure 2 is a schematic diagram of the MIRF-Net network framework in this invention; Figure 3 is a schematic diagram of the CTG signal encoder based on PatchTST in this invention; Figure 4 shows examples of normal and abnormal fetal heart rate (FHR) segments and their corresponding GADF images in this invention; Figure 5 is a schematic diagram of the maternal structured metadata encoder in this invention; Figure 6 shows the ROC curves of different methods on the test set; Figure 7 is a comparison chart of QI (%) in the input modality ablation experiment of the MIRF-Net network framework; Figure 8 is a graph of the QI (%) results comparing multimodal feature fusion strategies; Figure 9 is a comparison chart of QI (%) with a single metadata feature as input; Figure 10 is a comparison chart of QI (%) under different metadata combinations.

Detailed Implementation

[0051] The preferred embodiments of the present invention will now be described in detail with reference to the accompanying drawings, so that the advantages and features of the present invention can be more easily understood by those skilled in the art, thereby providing a clearer and more explicit definition of the scope of protection of the present invention.

[0052] As shown in Figure 1, this invention provides a method for intrapartum fetal risk assessment based on multimodal data fusion, comprising the following steps: Step S1: Collect CTG data during labor and preprocess it to obtain CTG time-series data, including fetal heart rate (FHR) and uterine contraction signal (UC), and maternal structured metadata.

[0053] In this embodiment, fetal status assessment was based on 552 CTG records from the CTU-UHB public dataset. These 552 records were selected from 9,164 intrapartum CTG records collected between April 2010 and August 2012 at Brno University Hospital in the Czech Republic using STAN S21 and S31 electronic fetal monitoring devices (Neoventa Medical, Mölndal, Sweden). The original CTG records are available on the PhysioNet website (https://physionet.org/content/ctu-uhb-ctg-db/1.0.0/). In addition, the CTU-UHB database also contains maternal electronic medical record statistics and data related to delivery outcomes. The CTG signal was sampled at a frequency of 4 Hz, and all records were from singleton pregnancies with a gestational age exceeding 36 weeks.

[0054] The CTU-UHB dataset, a high-quality dataset with expert annotations, was labeled by nine senior obstetricians from six hospitals in the Czech Republic according to FIGO guidelines, with an average clinical experience of 15 years. In this embodiment, CTG data from the 30 minutes before the end of the first stage of labor were used to build a classification model. Of the 552 records ultimately included, 330 were abnormal and 222 were normal. For data partitioning, the data was stratified and randomly divided according to fetal status labels, and the training, validation, and test sets were constructed in a 6:2:2 ratio to maintain consistency in the class ratios among the subsets. Specifically, the training set contained 331 samples (198 abnormal, 133 normal), the validation set contained 110 samples (66 abnormal, 44 normal), and the test set contained 111 samples (66 abnormal, 45 normal).
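The stratified 6:2:2 partition described above can be sketched in plain Python. The function below is an illustration, not the authors' code: it shuffles the indices of each class separately and rounds the per-class training and validation counts, leaving the remainder for the test set. The rounding rule and the seed are assumptions; with this rule the per-class counts match those reported above.

```python
import random


def stratified_split(labels, ratios=(0.6, 0.2, 0.2), seed=0):
    """Return train/val/test index lists that keep per-class proportions."""
    rng = random.Random(seed)
    splits = ([], [], [])
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_train = round(ratios[0] * len(idx))
        n_val = round(ratios[1] * len(idx))
        splits[0].extend(idx[:n_train])
        splits[1].extend(idx[n_train:n_train + n_val])
        splits[2].extend(idx[n_train + n_val:])     # remainder goes to the test set
    return splits


# 1 = abnormal, 0 = normal; class counts taken from the text above.
labels = [1] * 330 + [0] * 222
train, val, test = stratified_split(labels)

for name, part in (("train", train), ("val", val), ("test", test)):
    abnormal = sum(labels[i] for i in part)
    print(name, len(part), abnormal, len(part) - abnormal)
```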

[0055] The preprocessing flow in this embodiment mainly includes three steps: outlier detection, linear interpolation, and signal smoothing. First, invalid data points are identified and removed through outlier detection; then, the missing signal is reconstructed using linear interpolation between valid data points; finally, signal quality is further improved through smoothing.
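The three preprocessing steps above can be sketched as follows. The source does not specify the outlier criterion or the smoothing filter, so the plausibility range of 50 to 200 bpm, the moving-average filter, and its window length are all assumptions made for illustration, and `preprocess_fhr` is a hypothetical helper.

```python
import numpy as np


def preprocess_fhr(fhr, low=50.0, high=200.0, window=5):
    """Mask implausible samples, interpolate linearly, then smooth."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    fhr = np.asarray(fhr, dtype=float)

    # 1) Outlier detection: treat non-finite or out-of-range samples as invalid.
    valid = np.isfinite(fhr) & (fhr >= low) & (fhr <= high)
    if not valid.any():
        raise ValueError("no valid samples to interpolate from")

    # 2) Linear interpolation between the remaining valid points.
    t = np.arange(fhr.size)
    filled = np.interp(t, t[valid], fhr[valid])

    # 3) Smoothing with a centered moving average; edges repeat the end values.
    pad = window // 2
    padded = np.pad(filled, pad, mode="edge")
    kernel = np.ones(window) / window
    return np.convolve(padded, kernel, mode="valid")


# Toy segment with signal loss (zeros) and a spike, in beats per minute.
raw = [140, 142, 0, 0, 146, 148, 250, 150]
clean = preprocess_fhr(raw)
```

The same function could be applied to the UC channel with a different valid range, which would likewise need to be chosen by the user.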

[0056] Step S2: Construct the MIRF-Net network framework, which consists of a CTG signal encoder, an image encoder, a structured data encoder, and a Transformer-based multimodal fusion module. Output the risk prediction results through a fully connected classification head. The MIRF-Net network framework adopts the overall approach of "multimodal representation learning + cross-modal joint fusion" to integrate multi-source heterogeneous information in the intrapartum scenario and achieve a comprehensive assessment of fetal hypoxia-related risks.

[0057] The input modalities of the MIRF-Net network framework include a CTG time-series signal containing fetal heart rate (FHR) and uterine contraction signal (UC), a GADF image generated from the fetal heart rate (FHR), and maternal structured metadata.

[0058] In the CTG signal encoder, a PatchTST-based time-series Transformer is used to perform multi-scale modeling of fetal heart rate (FHR) and uterine contraction signal (UC). Through patch partitioning and a self-attention mechanism, local morphological changes and long-term dependencies are simultaneously captured to obtain a temporal embedding representation. In the image encoder, the FHR is mapped to a two-dimensional image via the GADF transformation to explicitly encode the global correlation structure, and the image is then input into a pre-trained ResNet101 to extract texture and structural features. In the structured data encoder, an autoencoder (MAE) is introduced to perform nonlinear compression and representation learning of the structured features, obtaining compact latent variables that retain key clinical priors and suppress redundancy and noise. In the Transformer-based multimodal fusion module, the three embeddings are first linearly projected to a unified dimension and input to the multimodal fusion Transformer as a token sequence. Through multi-head self-attention modeling of cross-modal interactions, complementary information among temporal dynamics, image texture structure, and maternal priors is learned to obtain a fused global representation. This fused representation is then used by a fully connected classifier to perform binary classification of the fetal state (e.g., normal / hypoxic), outputting the corresponding raw scores (logits) and predicted probabilities.

[0059] Figure 2 shows the principle of the MIRF-Net network framework. The CTG time-series signals (FHR, UC), the GADF image generated from the FHR, and the maternal structured metadata are each passed through a dedicated encoder to extract an embedding. The three features are aligned to a unified dimension by linear projection and then form a modal token sequence, which is input to the multimodal fusion Transformer to model cross-modal interactions. From the fused representation, a fully connected classifier outputs the normal / hypoxia logits and predicted probabilities.

[0060] Figure 3 shows the principle of the PatchTST-based CTG signal encoder. The input CTG sequence consists of two channels: fetal heart rate (FHR) and uterine contraction (UC). The model first performs instance normalization on each channel and divides the normalized sequence into several patches. Each patch is then linearly projected and summed with a positional encoding before being input into the Transformer encoder for feature modeling. Finally, global average pooling yields a channel-level feature representation, and the channel representations are concatenated to form the final CTG embedding representation.

[0061] The core of intrapartum fetal risk assessment lies in the accurate modeling of complex temporal patterns in CTG signals, such as fetal heart rate (FHR) and uterine contraction (UC). CTG signals typically exhibit significant non-stationarity, multi-timescale characteristics, and strong individual variability. Their dynamic changes not only reflect the current physiological state of the fetus but are also closely related to the continuous intrauterine stress during labor. Therefore, effectively modeling long-term dependencies while capturing local dynamic changes is one of the key challenges in CTG signal analysis. To address this, this embodiment introduces the Patch Time Series Transformer (PatchTST) as a CTG signal encoder for efficient and robust temporal feature representation learning of FHR and UC signals. This encoder employs a channel-independent strategy to model FHR and UC signals separately and shares Transformer encoder parameters to enhance generalization ability.

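[0062] Suppose a single CTG sample contains K channels (K=2 in this embodiment, corresponding to FHR and UC respectively), and the one-dimensional time series of the k-th channel is represented as:

$$x^{(k)} = [x^{(k)}_1, x^{(k)}_2, \dots, x^{(k)}_L] \in \mathbb{R}^{L}, \quad k = 1, \dots, K$$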

[0063] In the formula, L represents the length of the time series.

[0064] Considering the significant differences in baseline levels and amplitude distribution of CTG signals among different pregnant women, this embodiment performs instance normalization on each channel separately, using the following formula:

$$\tilde{x}^{(k)}_t = \frac{x^{(k)}_t - \mu_k}{\sigma_k}, \quad t = 1, \dots, L$$

[0065] In the formula, $\mu_k$ and $\sigma_k$ represent the mean and standard deviation of the k-th channel time series, respectively. This normalization is performed independently within each sample and each channel, which helps to reduce the impact of individual differences, device differences, and baseline drift on model training.
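The per-sample, per-channel normalization described above can be sketched as follows. This is a minimal illustration on synthetic data, not the embodiment's implementation; the function name `instance_normalize` and the small constant `eps` (added only to guard against division by zero on flat segments) are assumptions.

```python
import numpy as np

def instance_normalize(ctg: np.ndarray) -> np.ndarray:
    """Normalize each channel of one CTG sample to zero mean and unit variance.

    ctg has shape (K, L): K channels (FHR, UC) and L time steps.
    """
    mean = ctg.mean(axis=1, keepdims=True)
    std = ctg.std(axis=1, keepdims=True)
    eps = 1e-8  # assumption: numerical guard, not stated in the text
    return (ctg - mean) / (std + eps)

# Synthetic sample: FHR near 140 bpm, UC near 30 (arbitrary units).
rng = np.random.default_rng(0)
sample = np.stack([140 + 10 * rng.standard_normal(1000),
                   30 + 15 * rng.standard_normal(1000)])
normed = instance_normalize(sample)
```

Because the statistics come from the sample itself, no training-set statistics need to be stored for this step.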

[0066] Subsequently, PatchTST employs a channel-independent modeling strategy, treating different channels in a multi-channel CTG signal as independent one-dimensional time series and modeling them separately. In CTG applications, this design has clear physiological justification: fetal heart rate and uterine contraction signals originate from different physiological mechanisms, and their statistical characteristics and dynamic change patterns differ significantly. By performing time-series modeling on the FHR and UC signals separately, unreasonable cross-channel attention interference can be avoided, thus enabling more accurate learning of their respective temporal dynamic characteristics. It should be noted that although each channel is independent during forward propagation, they share the same set of Transformer encoder parameters to improve generalization ability while controlling model complexity.

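[0067] After normalization and channel splitting, the one-dimensional time series of each channel is divided into patches. Given the patch length P and stride S, the i-th patch of the normalized series $\tilde{x}^{(k)}$ of channel k is defined as:

$$p^{(k)}_i = \tilde{x}^{(k)}_{(i-1)S+1 \,:\, (i-1)S+P} \in \mathbb{R}^{P}, \quad i = 1, \dots, N$$

[0068] By padding the end of the sequence with zeros to ensure complete coverage, each channel can ultimately be represented as a patch sequence consisting of N patches, where:

$$N = \left\lceil \frac{L - P}{S} \right\rceil + 1$$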

[0069] Through the patching operation described above, the number of input tokens for the Transformer is reduced from the original time series length L to N, thereby reducing the computational complexity of the Transformer's self-attention from $O(L^2)$ to $O(N^2)$. This significantly improves the computational efficiency of the model in long-term CTG signal modeling.
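As a rough illustration of the patching step and the resulting token reduction, the sketch below splits a series into overlapping patches with zero padding at the tail. The patch count is computed as ceil((L - P) / S) + 1, which is an assumption consistent with "zero padding for complete coverage"; the values P = 16 and S = 8 are illustrative only.

```python
import math
import numpy as np

def make_patches(x: np.ndarray, patch_len: int, stride: int) -> np.ndarray:
    """Split a 1-D series into patches, zero-padding the tail for full coverage."""
    length = len(x)
    n_patches = max(math.ceil((length - patch_len) / stride), 0) + 1
    padded_len = (n_patches - 1) * stride + patch_len
    padded = np.zeros(padded_len, dtype=x.dtype)
    padded[:length] = x
    starts = np.arange(n_patches) * stride
    return np.stack([padded[s:s + patch_len] for s in starts])

x = np.arange(1000, dtype=float)
patches = make_patches(x, patch_len=16, stride=8)   # shape (124, 16)

# Self-attention cost scales with the square of the token count.
pairs_before = len(x) ** 2
pairs_after = patches.shape[0] ** 2
```

With these illustrative settings the number of attention pairs drops from 1,000,000 to 15,376.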

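[0070] Then, a linear embedding is performed on each patch $p_i \in \mathbb{R}^{P}$, projecting it onto a d-dimensional feature space:

$$e_i = W_p\, p_i + b_p$$

[0071] In the formula, $W_p \in \mathbb{R}^{d \times P}$ and $b_p \in \mathbb{R}^{d}$ are learnable parameters. To preserve the relative position information of each patch in the original time series, a learnable positional encoding $E_{pos} \in \mathbb{R}^{N \times d}$ is introduced. The final token input to the Transformer is represented as:

$$z_i = e_i + E_{pos,i}, \quad i = 1, \dots, N$$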
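The patch embedding and positional encoding can be sketched with plain matrix operations. All sizes and the random initialization below are illustrative assumptions; in a trained model the projection weights and positional encoding would be learned.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, d = 124, 16, 64   # patches per channel, patch length, model width (illustrative)

patches = rng.standard_normal((N, P))         # stand-in for one channel's patch sequence
W_p = rng.standard_normal((d, P)) * 0.02      # projection weights (learned in practice)
b_p = np.zeros(d)                             # projection bias
E_pos = rng.standard_normal((N, d)) * 0.02    # learnable positional encoding, one row per patch

# Each patch is projected to d dimensions, then its positional row is added.
tokens = patches @ W_p.T + b_p + E_pos        # shape (N, d)
```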

[0072] The embedded patch sequence is input into the Transformer encoder. By stacking multi-head self-attention and feedforward networks, the encoder jointly models local dynamic change features and long-term dependencies in CTG signals. The computational form of a single self-attention layer is as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

[0073] In the formula, $Q$, $K$, and $V$ represent the query, key, and value matrices obtained by linear mapping of the input features, respectively, and $\sqrt{d_k}$ is used to scale the inner product to stabilize the training process.
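Scaled dot-product self-attention, as described above, can be written in a few lines. This is a single-head sketch on random data; the helper names and dimensions are assumptions made for illustration.

```python
import numpy as np

def softmax(a: np.ndarray, axis: int = -1) -> np.ndarray:
    a = a - a.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over the rows (tokens) of X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))   # (tokens, tokens); each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))                            # 5 tokens of width 8
W_q, W_k, W_v = (rng.standard_normal((8, 4)) for _ in range(3))
out, weights = self_attention(X, W_q, W_k, W_v)
```

Each row of `weights` shows how strongly one patch token attends to every other patch token, which is how distant parts of the recording can influence each other in one layer.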

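[0074] After the output of the last layer of the Transformer encoder, global average pooling is performed over the patch dimension to obtain the overall temporal representation of each channel:

$$h^{(k)} = \frac{1}{N} \sum_{i=1}^{N} z^{(k),\mathrm{out}}_i$$

where $z^{(k),\mathrm{out}}_i$ denotes the encoder output for the i-th patch of channel k.

[0075] Finally, the feature representations of each channel are concatenated to form the final CTG embedding representation: $h_{ctg} = [h^{(1)}; \dots; h^{(K)}]$.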
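The channel-independent design with shared weights, followed by pooling and concatenation, can be sketched as below. `toy_encoder` is a hypothetical stand-in for the Transformer encoder stack (a single shared nonlinear map), used only to show that the same parameters process every channel; the sizes are illustrative.

```python
import numpy as np

def ctg_embedding(channel_tokens, encoder):
    """Encode each channel with the same encoder, mean-pool over patches, concatenate."""
    pooled = [encoder(tokens).mean(axis=0) for tokens in channel_tokens]
    return np.concatenate(pooled)

rng = np.random.default_rng(0)
N, d = 124, 64
W_shared = rng.standard_normal((d, d)) * 0.1

def toy_encoder(tokens: np.ndarray) -> np.ndarray:
    # Hypothetical stand-in for the encoder stack; identical weights for every channel.
    return np.tanh(tokens @ W_shared)

fhr_tokens = rng.standard_normal((N, d))   # embedded FHR patches
uc_tokens = rng.standard_normal((N, d))    # embedded UC patches
h_ctg = ctg_embedding([fhr_tokens, uc_tokens], toy_encoder)   # shape (K * d,) = (128,)
```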

[0076] It is important to emphasize that the original PatchTST model is primarily designed for time series prediction tasks, while in this embodiment, PatchTST is used as a temporal feature encoder for CTG signals. Therefore, this embodiment does not use its prediction output head, but instead uses the high-level implicit representation output by the Transformer encoder as the CTG modal feature. This feature is then input along with the image features of fetal heart rate and the structured clinical information of the pregnant woman into the subsequent multimodal feature fusion module for a comprehensive assessment of the risk of fetal hypoxia during labor.

[0077] Figure 4 shows examples of normal and abnormal FHR segments and their corresponding GADF images. The top row shows the FHR waveform (unit: bpm), and the bottom row shows the GADF images generated from the same FHR segments; the left column represents normal samples (label=0), and the right column represents abnormal samples (label=1). The horizontal axis $i$ and the vertical axis $j$ of the GADF image are pixel indices (corresponding to the time indices after resampling), and the pixel value $G_{i,j}$ represents the angular difference between the two moments. The main diagonal ($i = j$) corresponds to the pairing of a moment with itself.

[0078] In intrapartum fetal risk assessment, waveform morphological changes in fetal heart rate (FHR) signals often contain important pathological information, such as baseline drift, acceleration, deceleration, and variability. Traditional methods typically model the one-dimensional time series directly, but this representation struggles to explicitly characterize the global correlation structure between different time sampling points and may overlook the underlying nonlinear dynamic relationships in the FHR signal. To further enhance the model's ability to express complex temporal morphological patterns, this embodiment introduces the Gramian Angular Field (GAF) method, which maps the one-dimensional FHR time series to a two-dimensional image representation, and uses a deep convolutional network to extract its spatial structural features, thereby forming image-modality information complementary to the CTG temporal features.

[0079] Gramian Angular Field (GAF) is a commonly used method for converting time series data into images, and it has achieved good results in various time series analysis tasks. GAF images are generated by mapping one-dimensional time series signals, and can explicitly encode the correlation information between time sampling points in two-dimensional space. Specifically, by calculating the inverse cosine function (arccos) of the normalized fetal heart rate (FHR) signal, each time point in the sequence can be mapped to a polar coordinate system, thus forming a new time series representation. The Gramian matrix constructed on this basis can characterize the angular relationships between different time points, allowing GAF images to not only retain temporal order information but also express the spatial structural features between sample points.

[0080] Let a single fetal heart rate (FHR) sequence be a time series of length $L$:

$$x = [x_1, x_2, \dots, x_L]$$

[0081] First, to ensure the validity of the polar coordinate mapping, the time series is normalized to the interval [a, b], where -1 ≤ a < b ≤ 1:

$$\tilde{x}_t = a + (b - a)\,\frac{x_t - \min(x)}{\max(x) - \min(x)}$$

[0082] This normalization ensures that the sequence values fall within the range [-1, 1], which makes the subsequent inverse cosine mapping well defined and improves the information resolution of the GAF image. The normalized sequence is then mapped to polar coordinate space, where the angle corresponding to each sampling point is represented as:

$$\phi_t = \arccos(\tilde{x}_t)$$

[0083] After the polar coordinate mapping, the GAF method transforms the time series into a two-dimensional image by constructing a Gramian matrix. This embodiment uses the GADF (Gramian Angular Difference Field) form, in which the angle difference between any two time points is mapped to a matrix element:

$$G_{i,j} = \sin(\phi_i - \phi_j), \quad i, j = 1, \dots, L$$

[0084] In the formula, $G \in \mathbb{R}^{L \times L}$ is the GADF image matrix. It explicitly describes the angular difference relationship between sampling points at different times, thus encoding the global structural information of the time-series signal in two-dimensional space.
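The GADF construction (min-max normalization, arccos mapping, sine of pairwise angle differences) can be sketched in NumPy as follows. The function name `gadf`, the default interval [-1, 1], and the synthetic FHR-like segment are assumptions for illustration.

```python
import numpy as np

def gadf(x, a: float = -1.0, b: float = 1.0) -> np.ndarray:
    """Gramian Angular Difference Field of a 1-D series: G[i, j] = sin(phi_i - phi_j)."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        raise ValueError("constant series cannot be min-max normalized")
    x_norm = a + (b - a) * (x - x.min()) / span
    x_norm = np.clip(x_norm, -1.0, 1.0)        # guard against rounding outside arccos domain
    phi = np.arccos(x_norm)                    # polar angle of each sampling point
    return np.sin(phi[:, None] - phi[None, :])

# Synthetic FHR-like segment: slow oscillation around 140 bpm.
fhr = 140 + 10 * np.sin(np.linspace(0, 4 * np.pi, 256))
G = gadf(fhr)
```

Two properties follow directly from the definition: the main diagonal is zero (a point paired with itself) and the matrix is antisymmetric, since swapping the two time points flips the sign of the angle difference.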

[0085] Since the directly generated GADF matrix has a size of $L \times L$, a long sequence significantly increases the computational cost of the subsequent convolutional network. To unify the input scale and improve computational efficiency, this embodiment uses bilinear interpolation to scale the GADF image matrix to a fixed resolution:

$$I = \mathrm{Resize}_{\mathrm{bilinear}}(G) \in \mathbb{R}^{H \times W}$$

[0086] In the formula, $I$ represents the final GADF image input, and $i$, $j$ are the pixel indices of the GADF image (corresponding to the time indices after resampling).
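Bilinear rescaling of a matrix to a fixed resolution can be sketched without any imaging library. The sampling convention used here (output corners aligned with input corners) is an assumption; library implementations may place sample points differently, and the target resolution used by the embodiment is not restated here.

```python
import numpy as np

def resize_bilinear(img: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Resize a 2-D array with bilinear interpolation (corner-aligned sampling)."""
    in_h, in_w = img.shape
    rows = np.linspace(0, in_h - 1, out_h)           # fractional source row per output row
    cols = np.linspace(0, in_w - 1, out_w)
    r0 = np.floor(rows).astype(int)
    c0 = np.floor(cols).astype(int)
    r1 = np.minimum(r0 + 1, in_h - 1)
    c1 = np.minimum(c0 + 1, in_w - 1)
    wr = (rows - r0)[:, None]                        # vertical interpolation weights
    wc = (cols - c0)[None, :]                        # horizontal interpolation weights
    top = img[r0][:, c0] * (1 - wc) + img[r0][:, c1] * wc
    bottom = img[r1][:, c0] * (1 - wc) + img[r1][:, c1] * wc
    return top * (1 - wr) + bottom * wr

# A linear ramp is reproduced exactly by bilinear interpolation.
ramp = np.add.outer(np.arange(5.0), np.arange(5.0))
resized = resize_bilinear(ramp, 9, 9)
```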

[0087] Figure 4 shows an example of a normal (label=0) and an abnormal (label=1) FHR segment and their corresponding GADF images. Each pixel $(i, j)$ in the GADF image characterizes the angular difference relationship of the sequence between two time points (with corresponding indices $i$ and $j$), which allows explicit encoding of the global temporal correlation structure in a two-dimensional space; the main diagonal ($i = j$) corresponds to the pairing of a time point with itself. As Figure 4 shows, abnormal samples tend to exhibit stronger structured texture changes (e.g., high-contrast stripes or blocky regions) in the GADF image during time intervals with significant deceleration or abrupt baseline changes, while normal samples show a relatively smoother texture distribution and weaker, more uniform structural differences. This suggests that the GADF representation can transform temporal morphological changes in one-dimensional waveforms into two-dimensional spatial patterns that are more easily captured by convolutional networks, providing complementary information for subsequent image encoding and multimodal fusion.

[0088] After obtaining the GADF image, this embodiment uses a pre-trained ResNet101 as the image encoder to extract hierarchical visual features from the GADF image matrix. The choice of ResNet101 is primarily based on the following considerations: First, GADF representations mainly encode structural and texture-like patterns, rather than natural image semantic categories; therefore, under limited data conditions, a CNN-based architecture is more suitable than a transformer. Second, compared to shallower networks such as ResNet50, ResNet101 provides stronger hierarchical feature representation capabilities, while the residual structure ensures stable optimization. Third, under limited data conditions, a pre-trained CNN model can provide more robust initialization. Therefore, ResNet101 achieves a good balance between representational power and training stability.

[0089] Taking the output of ResNet101 after removing the final classification layer as a feature vector, the image encoding process can be represented as:

$$h_{img} = f_{\mathrm{ResNet101}}(I)$$

[0090] In the formula, $h_{img}$ is the image-modality embedding representation and $I$ is the GADF image input. By stacking residual blocks, ResNet101 can effectively extract texture and structural patterns in GADF images, thereby capturing the discriminative information of FHR signals in the spatial correlation domain.

[0091] Ultimately, the obtained image-modality feature $h_{img}$, together with the CTG time-series feature $h_{ctg}$ obtained in the previous section and the maternal structured clinical features, is input into the subsequent multimodal fusion module to achieve cross-modal information interaction and joint representation learning, thereby supporting the comprehensive assessment of the risk of fetal hypoxia during labor.

[0092] Figure 5 shows the structure of the maternal structured metadata encoder. The solid lines represent the backbone embedding path used for multimodal fusion (input to latent variable), and the dashed lines represent the auxiliary reconstruction path used during the training phase (latent variable to reconstructed input).

[0093] In intrapartum fetal risk assessment, relying solely on CTG signals such as fetal heart rate (FHR) and uterine contractions (UC) often fails to adequately cover the risk heterogeneity caused by differences in maternal baseline condition, pregnancy history, and metabolic environment. Clinical practice bulletins related to perinatal monitoring indicate that one of the important goals of fetal monitoring is to identify potential adverse fetal outcomes during high-risk pregnancies and deliveries, and maternal complications (such as diabetes) are a significant background factor motivating monitoring and intensified risk assessment. On the other hand, from the perspective of hypoxia-related pathophysiology, when fetal oxygen supply decreases, aerobic metabolism can be maintained to some extent through compensatory mechanisms. However, when compensatory capacity is exhausted or oxygen reserves are insufficient, anaerobic metabolism occurs, leading to adverse conditions such as acidosis. In pregnancies with abnormal glucose metabolism, fetal hyperglycemia and hyperinsulinemia can increase fetal oxygen consumption and induce chronic intrauterine hypoxia, providing a mechanistic explanation for the link between maternal metabolic environment and fetal hypoxia susceptibility. Therefore, this embodiment introduces maternal electronic medical records (MEMRs) as a structured metadata modality into the multimodal framework, to supplement maternal background risk information that is difficult to express directly using the CTG modality alone, as shown in Figure 5.

[0094] Based on data availability and perinatal risk evidence, this embodiment selects four maternal structured features from the MEMRs: maternal age, gravidity, parity, and gestational diabetes.

[0095] The processing steps of the structured data encoder are as follows. At the data level, this embodiment represents the maternal structured features of each sample as a fixed-dimensional vector and aligns it one-to-one with the corresponding CTG segment according to the record identifier. Let the metadata input be:

$$m = [m_{age}, m_{grav}, m_{par}, m_{gdm}]^{\top} \in \mathbb{R}^{4}$$

[0096] In the formula, $m_{age}$ is maternal age, $m_{grav}$ is gravidity, $m_{par}$ is parity, and $m_{gdm} \in \{0, 1\}$ is a binary indicator variable for gestational diabetes mellitus. Because the scales and distributions of the metadata variables differ significantly, this embodiment applies z-score standardization based on training set statistics to the numerical variables (Age, Gravidity, Parity) to improve training stability. For any of these numerical features $m_j$, its standardized form is:

$$\tilde{m}_j = \frac{m_j - \mu_j}{\sigma_j}$$

[0097] In the formula, $\mu_j$ and $\sigma_j$ are the mean and standard deviation of feature $j$ on the training set. After applying the standardization to the numerical features, the preprocessed metadata vector is obtained as:

$$\tilde{m} = [\tilde{m}_{age}, \tilde{m}_{grav}, \tilde{m}_{par}, m_{gdm}]^{\top}$$
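A minimal sketch of the metadata preprocessing: statistics are fitted on the training split only and then reused for other splits, and the binary gestational diabetes flag is left unchanged. The column order and the toy records below are invented for illustration.

```python
import numpy as np

NUMERIC_COLS = [0, 1, 2]   # age, gravidity, parity; column 3 is the binary GDM flag

def fit_standardizer(train: np.ndarray):
    """Compute per-feature mean and standard deviation on the training split only."""
    mu = train[:, NUMERIC_COLS].mean(axis=0)
    sigma = train[:, NUMERIC_COLS].std(axis=0)
    return mu, sigma

def standardize(meta: np.ndarray, mu: np.ndarray, sigma: np.ndarray) -> np.ndarray:
    """Apply z-score standardization to the numeric columns; keep the flag as is."""
    out = meta.astype(float).copy()
    out[:, NUMERIC_COLS] = (out[:, NUMERIC_COLS] - mu) / sigma
    return out

# Toy records (invented): [age, gravidity, parity, gdm]
train = np.array([[25, 1, 0, 0], [31, 2, 1, 0], [38, 4, 2, 1], [29, 1, 0, 0]], dtype=float)
test = np.array([[33, 3, 1, 1]], dtype=float)

mu, sigma = fit_standardizer(train)
train_std = standardize(train, mu, sigma)
test_std = standardize(test, mu, sigma)    # uses training statistics, avoiding leakage
```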

[0098] Although the metadata has low dimensionality, its variable types are heterogeneous, and it may contain noise and redundant correlations. Furthermore, directly using the original features in small-sample scenarios can easily lead the model to ignore or overfit this modality. To obtain a more robust metadata representation that is easier to fuse, this embodiment designs a metadata autoencoder (MAE) that performs nonlinear compression of the preprocessed metadata vector $\tilde{m}$ to learn a compact latent variable representation. The overall structure and information flow of the MAE are shown in Figure 5: the encoder maps the input vector $\tilde{m}$ to the latent variable $z_{meta}$, which serves as the backbone output of the metadata branch for subsequent fusion with the CTG and image modalities. At the same time, to constrain representation fidelity during joint training and to mitigate the problem of the metadata modality being ignored or its representation degrading during fusion training, the model introduces an auxiliary reconstruction branch in which the decoder produces a reconstruction $\hat{m}$ of the input from $z_{meta}$. It is important to emphasize that the reconstruction branch only provides an auxiliary constraint during the training phase; during the inference phase, only $z_{meta}$ participates in fusion and final prediction.

[0099] Formally, the encoder maps the input to a low-dimensional latent variable: $z_{meta} = f_{enc}(\tilde{m}) \in \mathbb{R}^{d_z}$, where $d_z$ is the dimension of the latent variable ($d_z$ = 32 in this embodiment). The decoder generates a reconstruction of the input from the latent variable: $\hat{m} = f_{dec}(z_{meta})$. In the engineering implementation, both the encoder and the decoder adopt a lightweight multilayer perceptron structure. The encoder consists of two fully connected layers with ReLU activation, and its forward mapping is: $h_1 = \mathrm{ReLU}(W_1 \tilde{m} + b_1)$, $z_{meta} = W_2 h_1 + b_2$.

[0100] In the formula, $W_1$, $b_1$, $W_2$, and $b_2$ are learnable parameters. The decoder employs a symmetrical structure, and its reconstruction process is: $h_2 = \mathrm{ReLU}(W_3 z_{meta} + b_3)$, $\hat{m} = W_4 h_2 + b_4$. In the formula, $W_3$, $b_3$, $W_4$, and $b_4$ are learnable parameters. The reconstruction branch constrains joint training so that $z_{meta}$ maintains the fidelity of the maternal structured information and the metadata modality is not ignored or degraded in multimodal fusion. During the inference phase, only the latent variable $z_{meta}$ participates in fusion; the reconstruction output $\hat{m}$ serves only as an auxiliary constraint during training and does not participate in the final prediction. Through this nonlinear compression, the latent variable $z_{meta}$ can suppress redundant correlations while retaining information related to maternal risk, yielding a more robust and structured representation that is better suited to subsequent multimodal modeling and fusion.
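The encoder and decoder of the metadata autoencoder can be sketched as two small multilayer perceptrons. The latent width of 32 follows the text; the hidden width of 16, the random weights, and the absence of an activation after the second layer are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_z = 4, 16, 32   # d_z = 32 per the text; d_h is illustrative

def dense(n_in: int, n_out: int):
    """Randomly initialized weight matrix and zero bias (learned in practice)."""
    return rng.standard_normal((n_in, n_out)) * 0.1, np.zeros(n_out)

W1, b1 = dense(d_in, d_h)
W2, b2 = dense(d_h, d_z)     # encoder: 4 -> 16 -> 32
W3, b3 = dense(d_z, d_h)
W4, b4 = dense(d_h, d_in)    # decoder mirrors the encoder: 32 -> 16 -> 4

def encode(m: np.ndarray) -> np.ndarray:
    return np.maximum(m @ W1 + b1, 0) @ W2 + b2

def decode(z: np.ndarray) -> np.ndarray:
    return np.maximum(z @ W3 + b3, 0) @ W4 + b4

batch = rng.standard_normal((8, d_in))     # stand-in for standardized metadata vectors
z = encode(batch)                          # used for fusion at training and inference
recon = decode(z)                          # used only for the auxiliary loss in training
recon_mse = float(np.mean((recon - batch) ** 2))
```

At inference time only `encode` would be called; `decode` exists solely to provide the reconstruction constraint during training.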

[0101] In this embodiment, the processing steps of the Transformer-based multimodal fusion module in step S2 are as follows. In intrapartum fetal risk assessment tasks, CTG temporal signals (FHR / UC) primarily characterize the immediate response and dynamic changes of the fetus to intrauterine stimuli, while the GADF-based image modality explicitly expresses the global correlation structure of FHR sequences. Maternal structured clinical information provides background risk priors such as the pregnant woman's baseline condition, pregnancy history, and metabolic environment. These three modalities exhibit significant heterogeneity in information type and statistical characteristics: CTG temporal features lean towards dynamic representations of local and long-range dependencies, GADF image features lean towards spatial texture and structural patterns, and maternal metadata consists of low-dimensional, heterogeneous, static variables related to long-term risk. Therefore, effectively aligning and interactively modeling these three modalities within the same framework is crucial for improving the model's comprehensive evaluation capability. Because the modalities differ significantly in information source, statistical characteristics, and representation space, fusing them by simple concatenation or fixed-weight summation often fails to model the adaptive complementary relationship in which a certain modality is more critical for specific samples, and may lead to weak modalities being masked by the dominant modality during training. To this end, this embodiment introduces a multimodal fusion module based on a Transformer encoder in the fusion stage, which learns cross-modal interaction dependencies through a multi-head self-attention mechanism, thereby forming a more discriminative joint representation.

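[0102] Because the dimensions and statistical distributions of the different modal features differ, this embodiment uses three independent linear mappings to project the CTG time-series branch output $h_{ctg}$, the image branch output $h_{img}$, and the metadata branch latent variable $z_{meta}$ into a shared space of unified dimension $d_f$, obtaining three modal tokens: $t_{ctg} = W_{ctg} h_{ctg} + b_{ctg}$, $t_{img} = W_{img} h_{img} + b_{img}$, $t_{meta} = W_{meta} z_{meta} + b_{meta}$. In the formula, $t_{ctg}, t_{img}, t_{meta} \in \mathbb{R}^{d_f}$, and the projection matrices $W_{ctg}$, $W_{img}$, $W_{meta}$ and the corresponding biases $b_{ctg}$, $b_{img}$, $b_{meta}$ are learnable parameters.

[0103] The three modal tokens are stacked in a fixed order to form an input sequence of length 3:

$$T = [t_{ctg};\, t_{img};\, t_{meta}] \in \mathbb{R}^{3 \times d_f}$$

[0104] To preserve the order information of the tokens and distinguish the modal types, this embodiment adds a learnable position / modal embedding $E_{mod} \in \mathbb{R}^{3 \times d_f}$ to $T$ to obtain the final input:

$$X^{(0)} = T + E_{mod}$$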
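The projection of three differently sized embeddings into a shared space and their assembly into a three-token sequence can be sketched as follows. The latent width 32 follows the text and 2048 is the pooled feature width of a standard ResNet101; the CTG width, the fused width, and all weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_ctg, d_img, d_z, d_f = 128, 2048, 32, 64   # d_ctg and d_f are illustrative

h_ctg = rng.standard_normal(d_ctg)     # stand-in for the CTG branch output
h_img = rng.standard_normal(d_img)     # stand-in for the image branch output
z_meta = rng.standard_normal(d_z)      # stand-in for the metadata latent variable

def linear(d_in: int, d_out: int):
    """One independent projection per modality (weights learned in practice)."""
    return rng.standard_normal((d_out, d_in)) / np.sqrt(d_in), np.zeros(d_out)

W_c, b_c = linear(d_ctg, d_f)
W_i, b_i = linear(d_img, d_f)
W_m, b_m = linear(d_z, d_f)

# Stack the projected tokens in a fixed order: CTG, image, metadata.
tokens = np.stack([W_c @ h_ctg + b_c, W_i @ h_img + b_i, W_m @ z_meta + b_m])
E_mod = rng.standard_normal((3, d_f)) * 0.02   # learnable position / modal embedding
X0 = tokens + E_mod                            # shape (3, d_f), input to the fusion encoder
```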

[0105] The multimodal fusion Transformer is composed of $M$ stacked standard Transformer encoder blocks, each containing a multi-head self-attention (MHSA) network and a feedforward network (FFN), with residual connections and layer normalization (LN) to stabilize training. The calculation of the $l$-th layer can be represented as: $\tilde{X}^{(l)} = \mathrm{LN}(X^{(l-1)} + \mathrm{MHSA}(X^{(l-1)}))$, $X^{(l)} = \mathrm{LN}(\tilde{X}^{(l)} + \mathrm{FFN}(\tilde{X}^{(l)}))$. In the formula, $X^{(l)}$ is the output of the $l$-th layer. The core of MHSA is the self-attention mechanism, whose single-head form is:

$$\mathrm{head} = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

[0106] In the formula, $Q = XW_Q$, $K = XW_K$, $V = XW_V$, and $d_k$ is the dimension of a single attention head. Multi-head attention is obtained by computing multiple attention heads in parallel, concatenating them, and then applying a linear transformation.
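One fusion encoder block over the three modal tokens can be sketched in NumPy. The post-normalization ordering, the omission of learnable layer-norm gain and bias, and all sizes (width 64, 4 heads, feedforward width 128) are assumptions; the text only states that standard encoder blocks are used.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Simplified layer normalization without learnable gain and bias.
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def mhsa(X, Wq, Wk, Wv, Wo, n_heads):
    n, d = X.shape
    d_k = d // n_heads
    split = lambda A: A.reshape(n, n_heads, d_k).transpose(1, 0, 2)   # (heads, n, d_k)
    Q, K, V = split(X @ Wq), split(X @ Wk), split(X @ Wv)
    attn = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_k))           # (heads, n, n)
    heads = (attn @ V).transpose(1, 0, 2).reshape(n, d)               # concatenate heads
    return heads @ Wo, attn

def encoder_block(X, p, n_heads):
    a, attn = mhsa(X, p["Wq"], p["Wk"], p["Wv"], p["Wo"], n_heads)
    X = layer_norm(X + a)                                             # residual + norm
    f = np.maximum(X @ p["W1"] + p["b1"], 0) @ p["W2"] + p["b2"]      # feedforward network
    return layer_norm(X + f), attn

rng = np.random.default_rng(0)
d, n_heads, d_ff = 64, 4, 128
shapes = {"Wq": (d, d), "Wk": (d, d), "Wv": (d, d), "Wo": (d, d),
          "W1": (d, d_ff), "W2": (d_ff, d)}
p = {name: rng.standard_normal(shape) * 0.1 for name, shape in shapes.items()}
p["b1"], p["b2"] = np.zeros(d_ff), np.zeros(d)

X = rng.standard_normal((3, d))        # tokens: CTG, image, metadata
Y, attn = encoder_block(X, p, n_heads)
```

Each `attn[h]` is a 3 × 3 matrix whose rows show how much one modality token draws on the other two, which is the cross-modal weighting the fusion module relies on.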

[0107] Through the self-attention calculation described above, the model can explicitly learn the interdependencies between the three tokens (CTG, image, and metadata). For example, in some samples, it may rely more on the temporal changes of CTG, while increasing the weight of image texture or metadata tokens when there is significant deceleration or high metabolic risk, thereby achieving adaptive cross-modal information fusion.

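[0108] After fusion encoding by all $M$ layers, the output token sequence $X^{(M)} \in \mathbb{R}^{3 \times d_f}$ is average-pooled over the token dimension to yield a globally fused representation:

$$g = \frac{1}{3} \sum_{i=1}^{3} X^{(M)}_i$$

[0109] Finally, the fully connected classifier outputs the fetal risk prediction probability:

$$\hat{p} = \sigma(W_c\, g + b_c)$$

[0110] In the formula, $\sigma$ is the Sigmoid function, and $W_c$ and $b_c$ are learnable parameters.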
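The final pooling and classification step reduces to a mean over the three tokens followed by a linear layer and a sigmoid. The sketch below uses hand-picked toy values so the result can be checked by hand; the function name `classify` and the weights are hypothetical.

```python
import numpy as np

def classify(fused_tokens: np.ndarray, w: np.ndarray, b: float):
    """Mean-pool the fused tokens, then apply a linear layer and a sigmoid."""
    g = fused_tokens.mean(axis=0)            # global fused representation
    logit = float(w @ g + b)                 # raw score
    prob = 1.0 / (1.0 + np.exp(-logit))      # predicted risk probability
    return logit, float(prob)

fused_tokens = np.ones((3, 4))               # toy fused tokens (3 modalities, width 4)
w = np.array([0.5, -0.5, 1.0, 0.0])          # toy classifier weights
logit, prob = classify(fused_tokens, w, b=0.0)
```

Here the pooled vector is all ones, so the logit is 0.5 - 0.5 + 1.0 + 0.0 = 1.0 and the probability is sigmoid(1.0), about 0.731.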

[0111] Step S3: The MIRF-Net network framework is trained using CTG time series data including fetal heart rate (FHR) and uterine contraction signal (UC) obtained in step S1, along with maternal structured metadata, to obtain a multimodal data fusion-based intrapartum fetal risk assessment model. This multimodal data fusion-based intrapartum fetal risk assessment model is used to assess intrapartum fetal risk based on the input intrapartum CTG data.

[0112] Specifically, to simultaneously optimize intrapartum fetal risk classification performance and constrain the fidelity of the maternal structured metadata representation, a joint objective combining the main-task classification loss and an auxiliary reconstruction loss is adopted for end-to-end training. A mini-batch is given as $\mathcal{B} = \{(x^{ctg}_n, x^{img}_n, m_n, y_n)\}_{n=1}^{B}$. In the formula, $x^{ctg}_n$ denotes the CTG time-series input, $x^{img}_n$ denotes the GADF image input generated from FHR, $m_n$ denotes the maternal structured clinical features, and $y_n$ is the category label ($y_n \in \{0, 1\}$ for binary classification).

[0113] The model's forward output includes the logits $o_n$ of the classification head and the reconstructed vector $\hat{m}_n$ of the metadata autoencoder. The overall training objective is defined as:

$$\mathcal{L} = \mathcal{L}_{cls} + \lambda\, \mathcal{L}_{rec}$$

[0114] In the formula, $\mathcal{L}_{cls}$ is the main-task classification loss, $\mathcal{L}_{rec}$ is the metadata reconstruction loss, and $\lambda$ is its weighting coefficient, whose value is fixed in the implementation.

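[0115] Considering that models are prone to overconfident predictions in scenarios with small sample sizes and noisy labels, this embodiment uses label smoothing cross-entropy (LSCE) as the classification loss. The one-hot distribution of the true label $y_n$ of each sample is smoothed to obtain the soft label $\tilde{y}_n$:

$$\tilde{y}_{n,c} = (1 - \epsilon)\, \mathbb{1}[c = y_n] + \frac{\epsilon}{C}$$

[0116] In the formula, $\epsilon$ is the smoothing coefficient, whose value is fixed in the experiment, and $C$ is the number of classes.

[0117] The class probability distribution is obtained from the logits $o_n$ by softmax:

$$p_{n,c} = \frac{\exp(o_{n,c})}{\sum_{c'=1}^{C} \exp(o_{n,c'})}$$

[0118] The LSCE on a mini-batch is then defined as: $\mathcal{L}_{cls} = -\frac{1}{B} \sum_{n=1}^{B} \sum_{c=1}^{C} \tilde{y}_{n,c} \log p_{n,c}$. By assigning a small amount of probability mass to the non-true categories, label smoothing suppresses extremely peaked output distributions, which mitigates overfitting and improves robustness to out-of-distribution or noisy samples.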
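Label smoothing cross-entropy can be sketched directly from its definition. The smoothing form used here, (1 - eps) on the true class plus eps / C spread over all classes, is one common convention and is an assumption, as is the example value eps = 0.1.

```python
import numpy as np

def label_smoothing_ce(logits: np.ndarray, labels: np.ndarray, eps: float) -> float:
    """Mean cross-entropy between softmax(logits) and smoothed one-hot labels."""
    n, n_classes = logits.shape
    soft = np.full((n, n_classes), eps / n_classes)     # mass spread over all classes
    soft[np.arange(n), labels] += 1.0 - eps             # remaining mass on the true class
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_p = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))  # log-softmax
    return float(-(soft * log_p).sum(axis=1).mean())

logits = np.array([[2.0, 0.0], [0.5, 1.5], [-1.0, 1.0]])
labels = np.array([0, 1, 1])
loss_plain = label_smoothing_ce(logits, labels, eps=0.0)    # ordinary cross-entropy
loss_smooth = label_smoothing_ce(logits, labels, eps=0.1)   # illustrative eps
```

For confident correct predictions like these, the smoothed loss is larger than the plain one, which is the penalty on overconfidence that the paragraph describes.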

[0119] The maternal structured features have low dimensionality but high heterogeneity, and are at risk of being overwhelmed by the dominant modalities (CTG temporal / image) in multimodal fusion. To maintain the ability of the metadata branch to express prior clinical information, this embodiment introduces a reconstruction constraint during the training phase, requiring the reconstruction vector $\hat{m}_n$ output by the maternal structured metadata autoencoder to approximate its input $\tilde{m}_n$. The reconstruction loss is calculated using the mean squared error over the $B$ samples of the mini-batch and the $D$ metadata dimensions:

$$\mathcal{L}_{rec} = \frac{1}{B D} \sum_{n=1}^{B} \left\lVert \hat{m}_n - \tilde{m}_n \right\rVert_2^2$$

[0120] This loss enables the encoded metadata representation to retain key clinical information while being compressed, thereby suppressing representation collapse of metadata modalities during joint training and enhancing their usability in fused attention.
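The joint objective (classification loss plus a weighted reconstruction loss) can be illustrated with a small worked example. The weighting value 0.1 and the toy vectors are assumptions chosen so the arithmetic is easy to follow, and the mean squared error is averaged over all elements.

```python
import numpy as np

def reconstruction_mse(recon: np.ndarray, target: np.ndarray) -> float:
    """Mean squared error averaged over the batch and the metadata dimensions."""
    return float(np.mean((recon - target) ** 2))

def joint_loss(cls_loss: float, recon: np.ndarray, target: np.ndarray, lam: float) -> float:
    """Classification loss plus a weighted auxiliary reconstruction loss."""
    return cls_loss + lam * reconstruction_mse(recon, target)

target = np.array([[0.5, -1.0, 0.0, 1.0]])   # toy standardized metadata vector
recon = np.array([[0.0, -1.0, 0.0, 0.0]])    # toy decoder output
total = joint_loss(cls_loss=0.6, recon=recon, target=target, lam=0.1)
```

The squared errors are 0.25, 0, 0, and 1, so the mean squared error is 0.3125 and the joint loss is 0.6 + 0.1 × 0.3125 = 0.63125.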

[0121] During the training phase, the model parameters are updated by minimizing the joint loss $\mathcal{L}$. The gradient of the classification loss $\mathcal{L}_{cls}$ updates the three encoders, the fusion module, and the classification head simultaneously, while the gradient of the reconstruction loss $\mathcal{L}_{rec}$ acts primarily on the metadata encoder / decoder branch to provide additional structural constraints. Only the classification path is retained during the inference phase: the model outputs class logits and predicted probabilities for risk discrimination, and the metadata reconstruction output $\hat{m}$ does not participate in the final prediction, serving only as an auxiliary regularization term during the training phase.

[0122] This embodiment was implemented and trained in an AutoDL cloud environment, using Python 3.8 (Ubuntu 20.04), PyTorch 2.0.0, and CUDA 11.8. Model training was performed on a single vGPU (48 GB VRAM). The optimizer used was Adam, with an initial learning rate of 1×10⁻³, a weight decay of 1×10⁻⁴, a batch size of 64, and a maximum of 100 training epochs. Adam was chosen because of its good adaptability and optimization efficiency across various deep learning tasks; a learning rate of 1×10⁻³ is a commonly used initial setting that typically achieves a good trade-off between convergence speed and training stability; and a weight decay of 1×10⁻⁴ is used to suppress unconstrained parameter growth, thereby mitigating overfitting to some extent. For model selection, this embodiment used the checkpoint with the best performance on the validation set as the final model.

[0123] To improve the robustness and reproducibility of the experimental results, the CTU-UHB dataset was stratified and randomly partitioned according to fetal status labels, and the training, validation, and test sets were constructed in a 6:2:2 ratio. To address the potential for randomness in a single random partition, this embodiment used three different random seeds (0, 42, and 3407) to conduct independent experiments repeatedly. Specifically, each random seed corresponded to an independent data partitioning, model training, and evaluation process, while the remaining experimental configurations remained consistent. The final results are reported as the mean ± standard deviation of the three experiments.
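A stratified 6:2:2 split repeated under the three stated seeds can be sketched with the standard library alone. The function name and the class counts below are illustrative assumptions (552 is the number of recordings in CTU-UHB; the normal / abnormal counts depend on the labeling criterion and are invented here).

```python
import random
from collections import defaultdict

def stratified_split(labels, seed, ratios=(0.6, 0.2, 0.2)):
    """Split sample indices into train/val/test while preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(ratios[0] * len(indices))
        n_val = round(ratios[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]        # remainder goes to the test set
    return train, val, test

labels = [0] * 400 + [1] * 152                   # illustrative class counts
splits = {seed: stratified_split(labels, seed) for seed in (0, 42, 3407)}
sizes = {seed: tuple(len(part) for part in parts) for seed, parts in splits.items()}
```

Each seed yields an independent partition of the same sizes; metrics from the three runs would then be summarized as mean ± standard deviation.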

[0124] Table 1 below shows the comparison results of different models: Table 1

[0125] To evaluate the effectiveness of MIRF-Net in fetal risk assessment during labor, this embodiment compares it with several representative baseline methods under the same data partitioning and evaluation protocol. As shown in Table 1, the baseline models include CNN-RNN hybrid architectures (CNN-BiLSTM, CNN-RNN), convolutional-based models (CTGNet, LARA, 1D-SEResNet50), Transformer variants (Medformer), and a multimodal method (MMDLA) that integrates electronic medical record information. Overall, MIRF-Net achieves best or near-best performance on most metrics, indicating that joint modeling of CTG temporal signals (FHR+UC), GADF-based image representation, and maternal structured metadata can provide complementary information, thereby achieving more robust risk assessment.

[0126] Specifically, MIRF-Net achieved an ACC of 74.63% (±6.05) and a QI of 74.76% (±6.40), indicating a more balanced performance between sensitivity and specificity. Furthermore, MIRF-Net achieved the highest F1 score (75.03% ±6.20) and AUC (0.7413 ±0.0663) among the comparative methods, demonstrating the strongest overall correlation at MCC = 0.4740 (±0.1058). In addition, MIRF-Net had the lowest Brier Score (BS = 0.2537 ±0.0605), indicating better probabilistic calibration capabilities compared to the baseline model (see Table 1).

[0127] Compared to the best-performing single-modal CTG baseline, MIRF-Net still demonstrates a stable advantage. For example, Xiao et al. (2022)'s CNN-BiLSTM achieved ACC = 70.28% and AUC = 0.6800, while MIRF-Net improved to 74.63% and 0.7413, respectively. Lin et al. (2024)'s LARA has extremely high sensitivity (SEN = 98.12%), but extremely low specificity (SPE = 28.23%), resulting in poor overall balance (QI = 52.56%); in contrast, MIRF-Net maintains a more symmetrical tradeoff (SEN = 74.86%, SPE = 74.72%), thus achieving a significantly higher QI. Similar patterns have been observed in other models that only use CTG (such as CTGNet, CNN-RNN, Medformer, and 1D-SEResNet50): MIRF-Net provides more comprehensive overall performance rather than being biased towards a particular class of samples.

[0128] It is worth noting that although Cao et al.'s (2023) MMDLA incorporates electronic medical record information, its performance is limited by an extreme imbalance between sensitivity and specificity (e.g., SEN = 19.24% while SPE = 87.85%), resulting in a low QI of 41.09%. In contrast, MIRF-Net employs a dedicated metadata encoder and combines it with a Transformer-based multimodal fusion strategy, achieving a more robust and balanced risk assessment. The above results validate the effectiveness of the trimodal design and fusion strategy proposed by MIRF-Net in fetal risk assessment during labor.
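The QI values quoted above can be roughly cross-checked against the reported sensitivity and specificity. This section does not define QI; the sketch assumes the definition common in the CTG literature, the geometric mean of SEN and SPE. Small differences are expected because the reported figures are averages over three runs.

```python
import math

def quality_index(sen: float, spe: float) -> float:
    """Geometric mean of sensitivity and specificity (both in percent)."""
    return math.sqrt(sen * spe)

# (SEN, SPE, reported QI) as quoted in the text.
reported = {
    "MIRF-Net": (74.86, 74.72, 74.76),
    "LARA": (98.12, 28.23, 52.56),
    "MMDLA": (19.24, 87.85, 41.09),
}
recomputed = {name: quality_index(sen, spe) for name, (sen, spe, _) in reported.items()}
gaps = {name: abs(recomputed[name] - reported[name][2]) for name in reported}
```

Under this assumption every recomputed value lies within 0.1 percentage points of the reported one, and the example makes concrete why a model with very high sensitivity but very low specificity (or the reverse) receives a low QI.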

[0129] To evaluate the discriminative performance of each model more intuitively, Figure 6 shows the ROC curves of the different methods on the test set. MIRF-Net (red curve) lies above the other baselines over most of the threshold range and is generally closer to the upper left corner, with an AUC of 0.7413, significantly higher than the other comparison methods. This indicates that the model of this embodiment achieves a better trade-off between true positive and false positive rates under different decision thresholds, demonstrating more stable discrimination ability and better generalization robustness, and providing a basis for subsequent clinical decision support applications.

[0130] Table 2 below shows the input characterization ablation experiment, i.e. the effect of different combinations of input modalities on fetal state classification performance.

[0131] Table 2

[0132] To verify the contribution of each modal input in MIRF-Net to the performance of intrapartum fetal risk assessment and to analyze the effectiveness of multimodal fusion, this embodiment conducts modal ablation experiments under a unified data partitioning, training strategy, and evaluation process. Specifically, starting from the complete model (three-modal fusion: CTG temporal signal + GADF image + structured metadata), one modal input is removed at a time to form three ablation settings: no signal modality (removing the CTG temporal branch), no image modality (removing the GADF image branch), and no structured metadata modality (removing the maternal metadata branch). The results of each model on the test set are shown in Table 2. To demonstrate the impact of the different modalities on the overall consistency index more intuitively, Figure 7 further provides a comparison of the QI (%) for each ablation setting.

[0133] Overall, the trimodal fusion model achieves the best and most balanced comprehensive performance: ACC = 74.63%, SEN = 74.86%, SPE = 74.72%, QI = 74.76%, F1 = 75.03%, AUC = 0.7413, MCC = 0.4740, and simultaneously obtains the lowest BS = 0.2537 (Table 2). This result demonstrates the significant complementarity of trimodal information; multimodal fusion not only improves classification performance but also enhances the reliability and calibration of risk probability outputs.
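For readers who want to relate these metrics to one another, the sketch below derives them from confusion-matrix counts and predicted probabilities. It assumes that QI is the geometric mean of SEN and SPE (the usual definition in the CTG literature; the formal definition lies outside this section) and that BS is the mean squared difference between predicted probability and binary label. The function names and the toy counts are illustrative assumptions, not values from Table 2.

```python
import math


def confusion_metrics(tp, fn, tn, fp):
    """Return SEN, SPE, QI, ACC, F1 and MCC from confusion-matrix counts."""
    sen = tp / (tp + fn)            # sensitivity: recall on the abnormal class
    spe = tn / (tn + fp)            # specificity: recall on the normal class
    qi = math.sqrt(sen * spe)       # assumed QI: geometric mean of SEN and SPE
    acc = (tp + tn) / (tp + fn + tn + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SEN": sen, "SPE": spe, "QI": qi, "ACC": acc, "F1": f1, "MCC": mcc}


def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


# Hypothetical counts, chosen only to exercise the functions.
m = confusion_metrics(tp=30, fn=10, tn=45, fp=15)
bs = brier_score([0.8, 0.2], [1, 0])
print(m, bs)
```

Under this assumed definition, QI is high only when SEN and SPE are both high, which is why it is used throughout this section as the balance-sensitive summary metric.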

[0134] After removing the CTG time-series signal (no-signal mode), the model performance significantly declined, particularly in its ability to identify positive (abnormal CTG) samples: SEN decreased from 74.86% to 36.67%, leading to a drop in QI from 74.76% to 53.67% (a decrease of 21.09 percentage points), AUC from 0.7413 to 0.5818 (a decrease of 0.1595), MCC from 0.4740 to 0.1680 (a decrease of 0.3060), while BS increased from 0.2537 to 0.3581 (an increase of 0.1044). Notably, SPE remained high under this setting (78.64%), indicating a stronger bias towards the negative category and a more conservative prediction tendency. This further illustrates that the CTG time-series signal remains the core information source for intrapartum risk assessment, and its absence makes it difficult for the model to stably capture key dynamic features related to fetal hypoxia.

[0135] When the GADF image modality was removed (without image modality), the model still experienced performance degradation compared to the complete model, but the decline was relatively mild: QI decreased from 74.76% to 59.24% (a decrease of 15.52 percentage points), AUC decreased from 0.7413 to 0.6141 (a decrease of 0.1272), MCC decreased from 0.4740 to 0.1777 (a decrease of 0.2963), and BS increased from 0.2537 to 0.3919 (an increase of 0.1382). This indicates that although the image branch is not the only decisive source of information, the two-dimensional global structural representation it provides can supplement the shortcomings of pure temporal modeling in characterizing complex nonlinear morphological patterns, especially helping to improve overall discriminative ability and stability.
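The two-dimensional representation discussed here is a Gramian Angular Difference Field (GADF). As a minimal pure-Python sketch of the standard GADF construction (rescale the series into an interval [a, b] within [-1, 1], map each sample to an angle with arccos, then fill the matrix with the sine of pairwise angle differences), consider the code below. The function name `gadf`, the defaults a = -1 and b = 1, and the toy FHR values are assumptions; the resizing step and the ResNet encoder are omitted.

```python
import math


def gadf(series, a=-1.0, b=1.0):
    """Gramian Angular Difference Field of a 1-D series.

    Element (i, j) is sin(phi_i - phi_j), where phi_i = arccos of the
    i-th sample after min-max rescaling into [a, b] (with -1 <= a < b <= 1).
    """
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("series must not be constant")
    scaled = [a + (b - a) * (x - lo) / (hi - lo) for x in series]
    # Clamp to guard against tiny floating-point overshoot outside [-1, 1].
    scaled = [min(1.0, max(-1.0, x)) for x in scaled]
    phi = [math.acos(x) for x in scaled]
    return [[math.sin(p_i - p_j) for p_j in phi] for p_i in phi]


# Hypothetical short FHR excerpt in beats per minute.
g = gadf([120.0, 130.0, 140.0, 135.0])
print(len(g), len(g[0]))
```

The resulting matrix is antisymmetric with a zero diagonal, so every entry encodes a relation between two time points rather than a single sample, which is the sense in which the image carries a global structural view of the signal.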

[0136] When the structured metadata modality is removed (no structured metadata modality), the model performance degrades most significantly, particularly in the overall consistency metrics: QI drops from 74.76% to 52.97% (a decrease of 21.79 percentage points, the largest among the three ablation settings), AUC drops from 0.7413 to 0.5642 (a decrease of 0.1771, the largest among the three ablation settings), MCC drops from 0.4740 to 0.0561 (a decrease of 0.4179, the largest among the three ablation settings), and BS increases from 0.2537 to 0.4562 (an increase of 0.2025, the largest among the three ablation settings). This result demonstrates that the maternal structured clinical information (age, gravidity, parity, gestational diabetes) plays an indispensable "prior constraint" role in the risk assessment task: rather than improving only a single indicator, it provides basic risk background and clinical priors that significantly improve the model's balance between positive and negative classes (QI, MCC) and the calibration of its probability output (BS), thereby enhancing the model's overall discriminative quality and robustness.

[0137] Based on the above ablation results, two conclusions can be drawn: First, the CTG temporal modality is the core information source for intrapartum risk assessment, and its absence will significantly reduce the model's recall of abnormal samples. Second, GADF images and maternal structured metadata provide complementary information for the model, with the metadata modality being particularly crucial for improving the comprehensive indicators (QI, MCC) and calibration indicators (BS). The joint fusion of the three modalities can maintain good specificity while retaining high sensitivity, enabling the model to achieve more stable, balanced, and reliable predictive performance in intrapartum fetal risk assessment scenarios.
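The differences quoted in the three ablation paragraphs can be recomputed directly from the reported values. The short sketch below does this arithmetic for QI, AUC, MCC and BS; the dictionary keys (`no_signal`, `no_image`, `no_metadata`) are labels chosen here for illustration, while the numbers are those stated in the text.

```python
# Reported test-set values for the full model and the three ablation settings.
full = {"QI": 74.76, "AUC": 0.7413, "MCC": 0.4740, "BS": 0.2537}
ablations = {
    "no_signal":   {"QI": 53.67, "AUC": 0.5818, "MCC": 0.1680, "BS": 0.3581},
    "no_image":    {"QI": 59.24, "AUC": 0.6141, "MCC": 0.1777, "BS": 0.3919},
    "no_metadata": {"QI": 52.97, "AUC": 0.5642, "MCC": 0.0561, "BS": 0.4562},
}

# Change relative to the full model (negative = drop, positive = increase).
deltas = {
    name: {k: round(vals[k] - full[k], 4) for k in full}
    for name, vals in ablations.items()
}

# Setting with the largest QI drop and the largest BS increase.
largest_qi_drop = min(deltas, key=lambda n: deltas[n]["QI"])
largest_bs_rise = max(deltas, key=lambda n: deltas[n]["BS"])

for name, d in deltas.items():
    print(name, d)
print(largest_qi_drop, largest_bs_rise)
```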

[0138] Input signal composition ablation experiment: fetal heart rate (FHR) alone versus fetal heart rate (FHR) + uterine contraction signal (UC)

[0139] To analyze the impact of CTG signal composition on the risk assessment performance of MIRF-Net, this embodiment compares the model performance using only fetal heart rate (FHR) and using a combination of fetal heart rate and uterine contractions (FHR+UC) as input settings. The results are shown in Table 3.

[0140] Overall, the model showed consistent improvement across all metrics after incorporating the uterine contraction signal (UC), indicating that the UC provides crucial physiological information complementary to fetal heart rate (FHR) for intrapartum fetal risk assessment. When only FHR was input, the model achieved ACC = 67.88%, QI = 68.48%, F1 = 68.45%, AUC = 0.7023, MCC = 0.3520, and BS = 0.3212. When the input was expanded to include both FHR and UC, the performance significantly improved to ACC = 74.63%, QI = 74.76%, F1 = 75.03%, AUC = 0.7413, MCC = 0.4740, and BS = 0.2537. Among them, the overall quality index (QI) increased by 6.28 percentage points (68.48%→74.76%), the AUC increased by 0.0390 (0.7023→0.7413), and the MCC increased by 0.1220 (0.3520→0.4740), while the BS decreased by 0.0675 (0.3212→0.2537). This indicates that adding the UC not only improved the discrimination ability but also significantly improved the calibration and stability of probability prediction.

[0141] Key metrics derived from the confusion matrix show that the introduction of the uterine contraction signal UC enables the model to achieve a more balanced identification between positive and negative classes: with FHR alone, SEN = 69.10% and SPE = 68.75%, while with FHR+UC, these figures increase to SEN = 74.86% and SPE = 74.72%, respectively. This means that the UC signal can provide the model with clearer clues about the "contraction-heart rate response" association, helping the model to more stably capture patterns related to hypoxia risk (e.g., deceleration patterns after contraction triggering, recovery speed, and variability changes at different stages), thereby reducing ambiguity caused by "FHR morphology alone," improving the ability to simultaneously cover both abnormal and normal classes, and ultimately driving the synchronous growth of comprehensive indicators such as QI and F1.

[0142] Table 3. Impact of different CTG signal compositions on MIRF-Net performance

[0143] Comparison of different fusion strategies

[0144] To verify the crucial role of the cross-modal fusion module in intrapartum fetal risk assessment, this embodiment keeps the three-branch encoder structure and training settings unchanged and compares four typical fusion strategies: feature concatenation (Concat-Fusion), element-wise addition (Add-Fusion), multilayer perceptron nonlinear combination (MLP-Fusion), and the Multimodal Fusion Transformer module used in this embodiment. The results of each method under the same data partitioning and evaluation process are summarized in Table 4. To compare the impact of the different fusion strategies on the overall consistency index more intuitively, Figure 8 presents the QI of each method.

[0145] Overall, the Multimodal Fusion Transformer module achieved the best comprehensive performance, demonstrating outstanding performance in key metrics such as ACC, SEN, QI, F1, AUC, MCC, and BS: ACC=74.63%, QI=74.76%, F1=75.03%, AUC=0.7413, MCC=0.4740, while achieving the lowest BS=0.2537 (Table 4). This indicates that the attention-based fusion mechanism can more fully exploit the complementary information among the CTG temporal sequence, the GADF image, and the maternal structured features, achieving more reliable risk probability estimation and more balanced classification performance.

[0146] In comparison, while Concat-Fusion achieved the highest SPE (80.44%), its SEN was significantly lower (52.71%), resulting in a QI of only 65.04%. This indicates that simple concatenation tends to form a "conservative" decision boundary: it favors normal samples but fails to detect many high-risk (positive) samples, easily leading to missed detections. Add-Fusion performed slightly better than concatenation (QI=65.46%), but it is still essentially a linear fusion and has difficulty characterizing the complex conditional dependencies between modalities; it therefore still lags significantly behind the Multimodal Fusion Transformer overall. MLP-Fusion, as a nonlinear combination strategy, remains competitive in AUC (0.7172) but failed to improve QI (64.13%) and MCC (0.2897), suggesting that, given the limited sample size and noisy annotations, pure MLP fusion may be more prone to unstable feature coupling or overfitting, thus limiting generalization ability.

[0147] The results above show that traditional fusion methods (Concat / Add / MLP) either only perform coarse-grained integration or struggle to stably learn cross-modal complementary relationships. In contrast, the Multimodal Fusion Transformer achieves adaptive weighting and information interaction of cross-modal tokens through multi-head self-attention, significantly improving the overall discriminative ability and the balance between SEN and SPE, resulting in the maximum improvement in comprehensive quality indicators, such as QI. Therefore, this paper ultimately adopts the Multimodal Fusion Transformer as the default fusion scheme for MIRF-Net.
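To make the contrast between the fusion strategies concrete, the following simplified sketch applies concatenation, element-wise addition, and a single-head self-attention step to three modality embeddings that have already been projected to a common dimension. It is not the MIRF-Net fusion module: the learned query/key/value projections, multiple heads, feed-forward layers, and the MLP-Fusion variant are all omitted, and the function names and toy vectors are assumptions made for illustration.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def concat_fusion(tokens):
    """Concat-Fusion: stack all embeddings into one long vector."""
    return [v for tok in tokens for v in tok]


def add_fusion(tokens):
    """Add-Fusion: element-wise sum of same-sized embeddings."""
    return [sum(vals) for vals in zip(*tokens)]


def attention_fusion(tokens):
    """Single-head self-attention with identity Q/K/V, then mean pooling."""
    d = len(tokens[0])
    mixed = []
    for q in tokens:
        # Scaled dot-product scores of this token against every token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        mixed.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    # Pool the attended tokens into one fused representation.
    return [sum(tok[j] for tok in mixed) / len(mixed) for j in range(d)]


# Hypothetical 4-dimensional embeddings for the CTG, image and metadata branches.
ctg = [1.0, 0.0, 0.5, 0.2]
img = [0.3, 0.8, 0.1, 0.0]
meta = [0.0, 0.2, 0.9, 0.4]
fused = attention_fusion([ctg, img, meta])
print(concat_fusion([ctg, img, meta]), add_fusion([ctg, img, meta]), fused)
```

The point of the sketch is structural: concatenation and addition combine the modalities with fixed rules, whereas the attention step computes input-dependent weights for each pair of tokens, which is the "adaptive weighting" referred to above.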

[0148] Table 4 Comparison Experiments of Multimodal Feature Fusion Strategies

[0149] To systematically evaluate the gain contributed by maternal structured metadata to intrapartum fetal hypoxia risk assessment, this embodiment keeps the CTG temporal branch (FHR+UC) and the GADF image branch of MIRF-Net fixed, changes only the set of input variables of the metadata branch (MAE), and compares model performance under the same data partitioning and evaluation process (Tables 5-6). Tables 5 and 6 therefore reflect the "marginal contribution" of maternal metadata, as third-modality prior information, to the multimodal fusion discrimination process.

[0150] First, the univariate ablation results (Table 5) show that the contributions of different metadata variables to model performance are significantly heterogeneous. To demonstrate the impact of a single metadata variable on the overall consistency index intuitively, Figure 9 compares the QI under different single-feature inputs. When gestational diabetes mellitus (GDM) is used as the sole metadata input, it provides the most stable and largest improvement across multiple metrics (e.g., QI, F1, AUC, and MCC are all at their highest levels, while BS is lower), indicating that this variable has stronger risk-indicative power in this task and can provide the model with more identifiable maternal priors related to hypoxia. In contrast, "parity" shows the weakest overall performance when used as a single input, even exhibiting a negative MCC, suggesting that this variable lacks sufficient separability and offers limited help in class differentiation when used alone. "Maternal age" and "parity" show moderate predictive ability, providing some risk priors, but are still insufficient to support reliable independent discrimination.

[0151] Furthermore, the clinically based combination experiments (Table 6) revealed that the utility of metadata depends not only on the variables themselves but also on how variable combinations affect the model's decision threshold and error types. Accordingly, Figure 10 summarizes the QI comparison under different metadata combination settings to visually characterize the impact of combined priors on fusion discrimination. When only "reproductive history" (gravidity + parity) is introduced, the model exhibits a significant bias: higher SPE and lower SEN (69.32% vs. 29.03%), with a QI of only 44.82%, indicating that this combination tends to classify samples as low-risk during the fusion stage, leading to more missed detections of high-risk samples. Conversely, "basic information + comorbidities / medical history" (age + gestational diabetes) shows a higher SEN but a lower SPE (72.22% vs. 35.42%), suggesting that this combination makes the model more sensitive to potential risks but also introduces more false positives. It is worth noting that directly combining reproductive history with comorbidity information did not bring a stable improvement (QI=40.93%, MCC was negative), suggesting that certain variable combinations may introduce redundant or noisy priors under the constraints of sample size and variable correlation, thereby weakening the effectiveness of the fusion module in modeling cross-modal complementary relationships. A relatively more balanced setting was "reproductive history + basic information" (gravidity + parity + age), with closer SEN and SPE values (57.08% / 55.91%) and an improved QI of 56.41%. This indicates that the combination of "reproductive background + basic characteristics" is more conducive to forming a neutral baseline risk profile, but it is still insufficient to fully realize the potential of the metadata branch.

[0152] The most crucial finding is that when the metadata branch adopts all maternal features (gravidity + parity + age + gestational diabetes), the model achieves optimal overall performance (e.g., simultaneous improvement in QI, F1, AUC, and MCC with the lowest BS). Given that the CTG and image modalities remain unchanged, this result strongly suggests that multivariate maternal information forms a more complete risk prior in the fusion Transformer, enabling the model to perform individualized risk correction under similar CTG dynamic patterns, thereby simultaneously improving discriminative ability (QI, AUC, MCC) and probabilistic reliability (lower BS).

[0153] Table 5. Impact of a single maternal metadata feature input on multimodal model performance

[0154] Table 6. Impact of different combinations of maternal metadata on multimodal model performance

[0155] Hyperparameter sensitivity analysis of patch length and stride: To analyze the hyperparameter sensitivity of PatchTST in CTG signal encoding, this embodiment conducts a comparative experiment on patch length and stride, while keeping the other training settings, data partitioning, and multimodal structures (image branch and maternal metadata branch) consistent. Performance results under the different configurations are summarized in Table 7.

[0156] The results show that the model has a clear optimal range for patch scale. The optimal and most balanced overall performance was achieved with patch=64 and stride=32 (ACC=74.63%, QI=74.76%, F1=75.03%, AUC=0.7413, MCC=0.4740, BS=0.2537), indicating that this setting is superior in both discriminative ability and probabilistic prediction reliability. In contrast, excessive overlap leads to a significant performance drop. For example, with patch=64 and stride=16, the QI drops to 49.92%, the AUC drops to 0.5383, and the MCC is approximately 0, suggesting that highly correlated redundant tokens may increase optimization difficulty and weaken the focus of attention on key morphological patterns. On the other hand, increasing the patch size does not necessarily bring benefits: although patch=128 and stride=32 still maintain a high level (QI=71.46%, AUC=0.7165), the overall volatility is greater; while patch=128 and stride=24 show a significant SEN / SPE imbalance (SEN=37.43%, SPE=59.13%), suggesting that an excessively large patch size may reduce the ability to resolve short-term morphological changes (such as deceleration / variability details).
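To see how patch length and stride change the token sequence seen by the temporal encoder, the sketch below counts patches and the overlap between consecutive patches for the four configurations mentioned above. The patch-count formula (ceiling division after zero-padding the tail) is one common convention and is an assumption here, as are the helper names and the sequence length of 4800 samples; the section itself does not state the exact formula or the input length.

```python
import math


def num_patches(length, patch_len, stride):
    """Patches needed to cover `length` samples when the tail is zero-padded.

    Assumed convention: N = ceil((L - P) / S) + 1 for L > P.
    """
    if length <= patch_len:
        return 1
    return math.ceil((length - patch_len) / stride) + 1


def overlap_ratio(patch_len, stride):
    """Fraction of each patch shared with the next one."""
    return max(0.0, 1.0 - stride / patch_len)


SEQ_LEN = 4800  # hypothetical number of samples per CTG segment

configs = [(64, 32), (64, 16), (128, 32), (128, 24)]
summary = {
    cfg: (num_patches(SEQ_LEN, *cfg), overlap_ratio(*cfg)) for cfg in configs
}
for (p, s), (n, ov) in summary.items():
    print(f"patch={p} stride={s}: tokens={n}, overlap={ov:.2%}")
```

Halving the stride roughly doubles the number of tokens per channel, so a smaller stride increases both sequence length and redundancy between neighbouring tokens, while a longer patch averages over more samples per token.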

[0157] In summary, the hyperparameters of PatchTST need to achieve a reasonable balance between local morphological resolution, modeling stability, and redundancy control. Based on the above results, this embodiment uses patch = 64 and stride = 32 as the default configuration for the CTG signal encoding branch in subsequent experiments.

[0158] Table 7. Effects of Patch Length and Sliding Step Size on Model Performance in PatchTST Temporal Encoders

[0159] Hyperparameter sensitivity analysis of ResNet backbone network depth: In order to examine the sensitivity of MIRF-Net to the depth of image encoder, this embodiment uses ResNet-50 / ResNet-101 / ResNet-152 as image encoders while keeping the rest of the MIRF-Net structure, training strategy and data partition unchanged, and compares their performance in the intrapartum fetal risk assessment task (Table 8).

[0160] The results show that ResNet-101 significantly outperforms the other depth configurations in terms of overall metrics, achieving the most balanced and optimal overall performance (ACC=74.63%, QI=74.76%, F1=75.03%, AUC=0.7413, MCC=0.4740). In contrast, ResNet-50 has a significantly lower SEN (23.61%). Although its SPE is higher (78.64%), it exhibits an imbalance of "high specificity and low sensitivity," leading to a significant decrease in QI (43.08%) and AUC (0.4937), indicating that shallower networks are insufficient in extracting fine-grained textures and structural patterns from GADF images. On the other hand, deepening to ResNet-152 did not bring performance improvements. Although the SPE increased to 86.66%, the SEN was only 37.43%, again showing predictions biased toward the negative class. The AUC and MCC were also lower than those of ResNet-101. This phenomenon indicates that, given the data scale and noise level of this task, an excessively deep backbone may introduce greater optimization difficulty or an overfitting tendency, thereby weakening the stable capture of abnormal risk patterns.

[0161] In summary, this embodiment uses ResNet-101 as the image encoder for MIRF-Net by default in subsequent experiments. It achieves a better trade-off between sensitivity and specificity and exhibits the most stable and strongest overall discrimination performance.

[0162] Table 8. Impact of ResNet backbone depth on model performance in the image encoding branch.

[0163] Sensitivity analysis of hyperparameters for the number of attention heads in Multimodal Fusion Transformer: To evaluate the sensitivity of MIRF-Net to multi-head self-attention configuration in the cross-modal fusion stage, this embodiment fixed the number of FusionTransformer layers at 2, only changing the number of attention heads (2 / 4 / 8 / 16), while keeping other training settings consistent. The results are shown in Table 9. Overall, the number of attention heads significantly affects the fusion effect, with 4 heads showing the best and most balanced overall performance (ACC = 74.63%, QI = 74.76%, F1 = 75.03%, MCC = 0.4740, BS = 0.2537), indicating that this configuration achieves the best trade-off between intermodal interaction modeling ability and generalization stability. Too few heads (2 heads) will limit cross-modal information interaction, leading to a significant decrease in overall discrimination performance (QI = 56.70%, MCC = 0.1401). When the number of heads increased to 8, the AUC reached its highest value (0.7586), but the QI, MCC, and calibration metrics did not improve proportionally (QI = 69.11%, MCC = 0.3856, BS = 0.2807), suggesting that its ranking and discrimination advantage was not fully translated into more robust overall decision quality. Further increasing to 16 heads resulted in a significant degradation (QI = 55.62%, BS = 0.4329), possibly related to representation fragmentation and training instability. Therefore, subsequent experiments used a 2-layer, 4-head fusion configuration as the default.

[0164] Table 9. Impact of the number of attention heads on model performance in a fused Transformer (fixed number of layers)

[0165] Hyperparameter sensitivity analysis of the number of Fusion Transformer layers: To analyze the impact of fusion module depth on MIRF-Net cross-modal interaction modeling, this embodiment keeps the number of attention heads fixed at 4 (Heads = 4) and the other training and data partitioning settings consistent, changes only the number of Fusion Transformer layers, and compares model performance. The results are shown in Table 10. When the number of layers is 2, the model achieves the best and most stable overall performance (ACC = 74.63%, QI = 74.76%, F1 = 75.03%, AUC = 0.7413, MCC = 0.4740, and the lowest BS of 0.2537). This indicates that under the current data scale and task difficulty, shallow fusion is sufficient to capture the key complementary relationships between the CTG temporal embedding, the GADF image texture structure, and the maternal metadata priors, and achieves a good balance between discriminative ability and probability calibration. In contrast, increasing the number of layers to 3 or 4 leads to a significant performance degradation: the ACC drops to approximately 57%, the QI decreases to 47.90% to 57.93%, while the BS increases to 0.4243 to 0.4267, indicating that deeper stacking may introduce redundant interactions or noise propagation, and exacerbate optimization instability and overfitting risks with small sample sizes. Further increasing to 5 layers yields some recovery relative to the 3- and 4-layer settings (ACC = 69.77%, QI = 65.18%, AUC = 0.7418), but overall performance remains significantly lower than the 2-layer setting, and the calibration metric (BS = 0.3023) is also inferior to the optimal configuration. Therefore, this embodiment defaults to a 2-layer (Heads = 4) fusion module configuration.

[0166] Table 10. Impact of the number of layers in a fused Transformer on model performance (fixed attention head count)

[0167] Hyperparameter sensitivity analysis of the label smoothing coefficient: To analyze the impact of the label smoothing coefficient on model performance, this embodiment sets different label smoothing coefficients for comparative experiments, and the results are shown in Table 11. The results show that the label smoothing coefficient has a significant impact on the overall discriminative ability and class balance of the model.

[0168] Table 11 Hyperparameter sensitivity analysis of label smoothing coefficient

[0169] When the label smoothing coefficient is 0.1, the model achieves the best overall results, with ACC, QI, F1, AUC, and MCC reaching 74.63%, 74.76%, 75.03%, 0.7413, and 0.4740, respectively, while BS is the lowest at 0.2537. When the label smoothing coefficient increases to 0.2 and 0.3, the model performance significantly deteriorates. At 0.2, SEN decreases sharply while SPE increases significantly, indicating that the model is overly strong in identifying normal samples but insufficient in detecting abnormal samples. At 0.3, all indicators further worsen, showing that an excessively large label smoothing coefficient weakens the model's ability to learn class boundaries.

[0170] In summary, moderate label smoothing helps improve the model's generalization performance, but an excessively large smoothing coefficient can lead to a decrease in discriminative ability. Experimental results show that the model performs best when the label smoothing coefficient is 0.1; therefore, 0.1 is used as the default setting in subsequent experiments in this embodiment.
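As a reminder of what the label smoothing coefficient changes, the sketch below builds smoothed targets for a two-class problem and evaluates the cross-entropy of one confident, correct prediction with and without smoothing. It assumes the common convention in which a fraction ε of the probability mass is spread uniformly over all classes; the section does not state which convention was used, and the function names and logits are illustrative.

```python
import math


def smooth_targets(label, num_classes=2, eps=0.1):
    """Smoothed target distribution: (1 - eps) on the true class plus eps / K on every class."""
    off = eps / num_classes
    return [1.0 - eps + off if c == label else off for c in range(num_classes)]


def cross_entropy(logits, target_dist):
    """Cross-entropy between a target distribution and softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(t * (x - log_z) for t, x in zip(target_dist, logits))


logits = [-4.0, 4.0]          # hypothetical, very confident prediction of class 1
hard = cross_entropy(logits, smooth_targets(1, eps=0.0))
soft = cross_entropy(logits, smooth_targets(1, eps=0.1))
print(smooth_targets(1, eps=0.1), hard, soft)
```

With smoothing, an extremely confident prediction is still penalized even when it is correct, which discourages over-confident outputs; larger coefficients flatten the targets further and give the model a weaker signal about the class boundary.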

[0171] Hyperparameter sensitivity analysis of the reconstruction loss weight λ: To analyze the impact of the reconstruction loss weight λ on model performance, this embodiment sets different values of λ and conducts comparative experiments. The results are shown in Table 12. Overall, changes in λ significantly affect the model's classification performance and training stability.

[0172] When λ is small, the constraint effect of the reconstruction branch is weak, and the auxiliary supervision provided by structured metadata is limited, resulting in low overall model performance. For example, when λ=0.1 and λ=0.3, ACC, QI, AUC, and MCC are all significantly lower than the optimal results. Although the sensitivity increases to 78.19% when λ=0.3, the specificity is only 54.00%, indicating that the model has a significant bias towards anomalous samples and insufficient overall class balance. When λ=0.5, the model achieves the best overall performance, with ACC, QI, F1, AUC, and MCC reaching 74.63%, 74.76%, 75.03%, 0.7413, and 0.4740, respectively, while the BS is the lowest at 0.2537. This indicates that at this value, a good balance is achieved between classification loss and reconstruction loss, ensuring both the discriminative ability of the main task and fully utilizing the auxiliary role of the metadata reconstruction branch. When λ continues to increase to 0.7 and 1.0, the model performance actually decreases, and the fluctuations of multiple indicators increase significantly. This indicates that excessively large reconstruction loss weights will cause the model to focus too much on the reconstruction task, thereby weakening the optimization effect on the main classification task and reducing training stability.

[0173] In summary, the reconstruction loss weight λ has a significant impact on model performance. Experimental results show that the model performs best in terms of overall discriminative ability, class balance, and output stability when λ=0.5. Therefore, subsequent experiments in this embodiment use λ=0.5 as the default setting.
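The role of λ can be illustrated with a small sketch of a weighted two-term objective. It assumes the total loss has the form L = L_cls + λ · L_rec with a mean-squared-error reconstruction term for the metadata autoencoder; this exact form, the function names, and all numeric inputs are assumptions for illustration, while the λ values are the ones compared above.

```python
def mse(x, x_hat):
    """Mean squared reconstruction error between a vector and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def total_loss(cls_loss, recon_loss, lam=0.5):
    """Assumed objective: classification loss plus lambda-weighted reconstruction loss."""
    return cls_loss + lam * recon_loss


# Hypothetical standardized metadata vector and its reconstruction.
meta = [0.2, -1.0, 0.5, 1.0]
meta_hat = [0.1, -0.8, 0.4, 0.9]
recon = mse(meta, meta_hat)

cls = 0.60  # hypothetical classification loss for one batch
for lam in (0.1, 0.3, 0.5, 0.7, 1.0):
    loss = total_loss(cls, recon, lam)
    share = lam * recon / loss  # fraction of the objective due to reconstruction
    print(f"lambda={lam}: total={loss:.4f}, reconstruction share={share:.2%}")
```

As λ grows, a larger share of the gradient comes from the reconstruction branch, which is consistent with the observation above that very large weights pull optimization away from the main classification task.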

[0174] Table 12 Hyperparameter sensitivity analysis of reconstruction loss weight λ

[0175] This invention proposes a multimodal data fusion framework, MIRF-Net. This framework is the first to achieve joint modeling of CTG temporal signals (FHR and UC), fetal heart rate GADF spatial representation, and maternal heterogeneous structured information within a unified end-to-end architecture. Methodologically, MIRF-Net employs a PatchTST-based temporal Transformer to capture the multi-timescale dynamic patterns and long-range dependencies of CTG, introduces a pre-trained ResNet to learn the global correlation structure of GADF images, and designs a metadata autoencoder (MAE) to robustly compress and represent low-dimensional heterogeneous MEMRs features. Subsequently, a multimodal fusion Transformer is used to model cross-modal interactions, achieving adaptive fusion and joint discrimination of different information sources. On the CTU-UHB dataset, MIRF-Net achieves superior overall performance and more reliable probability output compared to representative baselines. Ablation and contrast experiments further demonstrate that CTG temporal modality provides primary discriminative evidence, while GADF images and maternal metadata provide complementary information; the introduction of UC and attention-based fusion mechanisms can further improve sensitivity-specificity balance and calibration quality.

[0176] Although embodiments of the present invention have been described in conjunction with the accompanying drawings, the patent owner may make various modifications or alterations within the scope of the appended claims; as long as they do not exceed the protection scope described in the claims of the present invention, such modifications and alterations shall fall within the protection scope of the present invention.

Claims

1. A perinatal fetal risk assessment method based on multi-modal data fusion, characterized in that it includes the following steps:
S1. Collect CTG data during labor and preprocess it to obtain CTG time-series data, including fetal heart rate (FHR) and uterine contraction signal (UC), and maternal structured metadata.
S2. Construct a MIRF-Net network framework consisting of a CTG signal encoder, an image encoder, a structured data encoder, and a Transformer-based multimodal fusion module, and output risk prediction results through a fully connected classification head. The input modalities of the MIRF-Net network framework include a CTG time-series signal containing fetal heart rate (FHR) and uterine contraction signal (UC), a GADF image generated from the fetal heart rate (FHR), and maternal structured metadata. In the CTG signal encoder, a time-series Transformer based on PatchTST is used to perform multi-scale modeling of fetal heart rate (FHR) and uterine contraction signal (UC); local morphological changes and long-term dependencies are captured simultaneously through patch partitioning and the self-attention mechanism to obtain a temporal embedding representation. In the image encoder, the fetal heart rate (FHR) is mapped to a two-dimensional image through the GADF transformation to explicitly encode the global correlation structure, and is then input into a pre-trained ResNet101 to extract texture and structural features. In the structured data encoder, a metadata autoencoder (MAE) is introduced to perform nonlinear compression and representation learning on the structured features, obtaining compact latent variables that retain key clinical priors and suppress redundancy and noise. In the Transformer-based multimodal fusion module, the three embeddings are first linearly projected to a unified dimension and input as a token sequence into the multimodal fusion Transformer. Through multi-head self-attention modeling of cross-modal interactions, the complementary information between temporal dynamics, image texture structure, and maternal priors is learned to obtain a fused global representation. The fused representation is then used by a fully connected classifier to perform binary classification of the fetal state and output the corresponding raw scores (logits) and predicted probability.
S3. Train the MIRF-Net network framework using the CTG time-series data, including fetal heart rate (FHR) and uterine contraction signal (UC), and the maternal structured metadata obtained in step S1 to obtain a multimodal data fusion-based intrapartum fetal risk assessment model. The multimodal data fusion-based intrapartum fetal risk assessment model is used to perform intrapartum fetal risk assessment based on the input intrapartum CTG data.

2. The method for perinatal fetal risk assessment based on multi-modal data fusion according to claim 1, characterized in that step S1 specifically comprises: acquiring CTG data from the CTU-UHB public dataset; first, identifying and removing invalid data points through outlier detection; then, reconstructing the missing signal using linear interpolation between valid data points; and finally, smoothing the data.

3. The method for intrapartum fetal risk assessment based on multi-modal data fusion according to claim 2, characterized in that, The processing steps of the CTG signal encoder in step S2 specifically include: Suppose a single CTG sample contains K channels, and the one-dimensional time series representation of each channel is as follows: In the formula, L represents the length of the time series; For each channel, instance normalization is performed separately, using the following formula: wherein and denote the mean and standard deviation of the kth channel time series, respectively. PatchTST employs a channel-independent modeling strategy, treating different channels in a multi-channel CTG signal as independent one-dimensional time series and modeling them separately. For each channel's one-dimensional time series, a patch is created. Given the patch length P and step size S, the i-th patch is defined as: By padding the end of the sequence with zeros to ensure complete coverage, each channel can ultimately be represented as a patch sequence consisting of N patches, where: For each patch, perform a linear embedding and project it onto a d-dimensional feature space: where, and are learnable parameters, to preserve the relative position information of the patch in the original time series, a learnable position encoding is introduced The token representation of the final input to the Transformer is: The embedded patch sequence is input into the Transformer encoder. Through the stacking of multi-head self-attention mechanism and feedforward network, the joint modeling of local dynamic change features and long-term dependencies in CTG signals is achieved. 
The computational form of a single self-attention layer is as follows:
$$\mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d_k}}\right)\mathbf{V},$$
wherein $\mathbf{Q}$, $\mathbf{K}$ and $\mathbf{V}$ denote the query, key and value matrices obtained by linear mapping of the input features, and $\sqrt{d_k}$ scales the inner product to stabilize the training process;
After the last layer of the Transformer encoder, global average pooling is performed over the patch dimension to obtain the overall temporal representation of each channel:
$$h^{(k)} = \frac{1}{N}\sum_{i=1}^{N} \hat{z}^{(k)}_i,$$
where $\hat{z}^{(k)}_i$ is the encoder output for the i-th patch token of the k-th channel;
The feature representations of all channels are concatenated to form the final CTG embedding representation:
$$h_{ctg} = \left[h^{(1)}; h^{(2)}; \dots; h^{(K)}\right].$$

4. The method for perinatal fetal risk assessment based on multi-modal data fusion according to claim 3, characterized in that the image encoder processing steps in step S2 specifically include:
Let a single fetal heart rate (FHR) sequence be a time series of length L of the form:
$$X = \{x_1, x_2, \dots, x_L\};$$
The time series is normalized to the interval [a, b], where -1 ≤ a < b ≤ 1:
$$\tilde{x}_i = a + (b - a)\,\frac{x_i - \min(X)}{\max(X) - \min(X)};$$
The normalized sequence is mapped to polar coordinate space, where the angle corresponding to each sampling point is represented as:
$$\phi_i = \arccos(\tilde{x}_i);$$
The angle difference between any two time points is mapped to matrix elements using the Gramian Angular Difference Field (GADF) method:
$$G_{ij} = \sin(\phi_i - \phi_j),$$
where $G \in \mathbb{R}^{L \times L}$ is the GADF image matrix;
The GADF image matrix is scaled to a fixed resolution $H \times W$ using bilinear interpolation:
$$I(u, v) = \mathrm{Bilinear}(G)(u, v), \quad I \in \mathbb{R}^{H \times W},$$
wherein $I$ denotes the resulting GADF image input and $(u, v)$ is the pixel index of the GADF image;
A pre-trained ResNet101 is used as the image encoder to extract hierarchical visual features from the GADF image. Let the output of ResNet101 after removing the final classification layer be the feature vector; the image encoding process can then be represented as:
$$h_{img} = f_{\mathrm{ResNet101}}(I),$$
where $h_{img}$ is the image modality embedding representation.
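
The GADF construction in claim 4 can be sketched as follows. This is a minimal illustration in plain Python under stated assumptions: the function names are hypothetical, the default interval is [a, b] = [-1, 1], and the bilinear resizing and ResNet101 encoding steps are omitted.

```python
import math
from typing import List


def rescale(x: List[float], a: float = -1.0, b: float = 1.0) -> List[float]:
    """Min-max rescale the series to [a, b], with -1 <= a < b <= 1."""
    lo, hi = min(x), max(x)
    if hi == lo:
        raise ValueError("constant series cannot be rescaled")
    return [a + (b - a) * (v - lo) / (hi - lo) for v in x]


def gadf(x: List[float], a: float = -1.0, b: float = 1.0) -> List[List[float]]:
    """Gramian Angular Difference Field: G[i][j] = sin(phi_i - phi_j)."""
    scaled = rescale(x, a, b)
    # Clamp to [-1, 1] so floating-point rounding cannot push acos out of its domain.
    phi = [math.acos(max(-1.0, min(1.0, v))) for v in scaled]
    return [[math.sin(phi_i - phi_j) for phi_j in phi] for phi_i in phi]
```

The resulting matrix is antisymmetric with a zero diagonal, because sin(phi_i - phi_j) = -sin(phi_j - phi_i). In a full pipeline this matrix would be resized and passed to the image encoder.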

5. The intrapartum fetal risk assessment method based on multimodal data fusion according to claim 4, characterized in that the processing steps of the structured data encoder in step S2 are as follows:
Let the metadata input be:
$$m = \left[m_{age}, m_{grav}, m_{par}, m_{gdm}\right],$$
where $m_{age}$ is the maternal age, $m_{grav}$ is the gravidity (number of pregnancies), $m_{par}$ is the parity (number of births), and $m_{gdm}$ is a binary indicator variable for gestational diabetes mellitus;
For any numerical feature $m_j$, its standardized form is:
$$\tilde{m}_j = \frac{m_j - \mu_j}{\sigma_j},$$
where $\mu_j$ and $\sigma_j$ are the mean and standard deviation of that feature on the training set; after the standardization operation is applied, the preprocessed metadata vector is obtained as:
$$\tilde{m} = \left[\tilde{m}_{age}, \tilde{m}_{grav}, \tilde{m}_{par}, m_{gdm}\right];$$
A structured data encoder $f_{enc}$ maps the input vector $\tilde{m}$ to latent variables $z_m$, and a decoder $f_{dec}$, introduced as an auxiliary reconstruction branch, forms a reconstruction $\hat{m}$ of the input:
$$z_m = f_{enc}(\tilde{m}) \in \mathbb{R}^{d_z}, \quad \hat{m} = f_{dec}(z_m),$$
where $d_z$ represents the dimension of the latent variables;
Both the encoder and the decoder adopt a lightweight multilayer perceptron structure. The encoder consists of two fully connected layers with ReLU activation, and its forward mapping is represented as:
$$h_1 = \mathrm{ReLU}(W_1 \tilde{m} + b_1), \quad z_m = W_2 h_1 + b_2,$$
where $W_1$, $b_1$, $W_2$ and $b_2$ are learnable parameters;
The decoder employs a symmetrical structure, and its reconstruction process is represented as:
$$h_2 = \mathrm{ReLU}(W_3 z_m + b_3), \quad \hat{m} = W_4 h_2 + b_4,$$
where $W_3$, $b_3$, $W_4$ and $b_4$ are learnable parameters.
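
The forward pass of the metadata autoencoder in claim 5 can be sketched as follows. This is a minimal illustration in plain Python, not the claimed implementation: the class name, layer widths, random initialization, and the training-set statistics in the usage example are all hypothetical, and no training step is shown.

```python
import random
from typing import List, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weight matrix, bias vector)


def standardize(value: float, mu: float, sigma: float) -> float:
    """Z-score a numerical feature with training-set statistics."""
    return (value - mu) / sigma


def linear(W: List[List[float]], b: List[float], x: List[float]) -> List[float]:
    """Fully connected layer: W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(v: List[float]) -> List[float]:
    return [max(0.0, a) for a in v]


def init_layer(n_out: int, n_in: int, rng: random.Random) -> Layer:
    """Small random weights and zero bias (illustrative initialization)."""
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


class MetadataAutoencoder:
    """Two-layer MLP encoder with a symmetric two-layer MLP decoder."""

    def __init__(self, d_in: int = 4, d_hidden: int = 8,
                 d_latent: int = 3, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.enc1 = init_layer(d_hidden, d_in, rng)
        self.enc2 = init_layer(d_latent, d_hidden, rng)
        self.dec1 = init_layer(d_hidden, d_latent, rng)
        self.dec2 = init_layer(d_in, d_hidden, rng)

    def encode(self, m: List[float]) -> List[float]:
        return linear(*self.enc2, relu(linear(*self.enc1, m)))

    def decode(self, z: List[float]) -> List[float]:
        return linear(*self.dec2, relu(linear(*self.dec1, z)))

    def forward(self, m: List[float]) -> Tuple[List[float], List[float]]:
        """Return the latent variables and the reconstruction of the input."""
        z = self.encode(m)
        return z, self.decode(z)


# Usage with made-up training-set statistics (age, gravidity, parity, GDM flag).
m_tilde = [standardize(32.0, 30.0, 5.0), standardize(2.0, 1.5, 1.0),
           standardize(1.0, 0.5, 1.0), 0.0]
z_m, m_hat = MetadataAutoencoder().forward(m_tilde)
```

Only `z_m` would be passed on to the fusion module; `m_hat` exists to supply the auxiliary reconstruction signal during training.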

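6. The intrapartum fetal risk assessment method based on multimodal data fusion according to claim 5, characterized in that the processing steps of the Transformer-based multimodal fusion module in step S2 are as follows:
Three independent linear mappings project the CTG temporal branch output $h_{ctg}$, the image branch output $h_{img}$ and the metadata branch latent variables $z_m$ into a shared space of unified dimension $d_f$, yielding three modal tokens:
$$t_{ctg} = W_{ctg} h_{ctg} + b_{ctg}, \quad t_{img} = W_{img} h_{img} + b_{img}, \quad t_{meta} = W_{meta} z_m + b_{meta},$$
where $t_{ctg}, t_{img}, t_{meta} \in \mathbb{R}^{d_f}$, and $W_{ctg}$, $W_{img}$, $W_{meta}$ and the corresponding biases $b_{ctg}$, $b_{img}$, $b_{meta}$ are learnable parameters;
The three modal tokens are stacked in a fixed order to form an input sequence of length 3:
$$T = \left[t_{ctg}; t_{img}; t_{meta}\right] \in \mathbb{R}^{3 \times d_f};$$
Learnable position / modal embeddings $E_{mod} \in \mathbb{R}^{3 \times d_f}$ are superimposed on $T$ to obtain the final input:
$$Z^{(0)} = T + E_{mod};$$
The multimodal fusion Transformer is composed of $L_f$ standard Transformer encoder layers, each containing a multi-head self-attention (MHSA) network and a feedforward network (FFN), and employs residual connections and layer normalization (LN) to stabilize training. The calculation of the l-th layer can be represented as:
$$\tilde{Z}^{(l)} = \mathrm{LN}\!\left(Z^{(l-1)} + \mathrm{MHSA}(Z^{(l-1)})\right), \quad Z^{(l)} = \mathrm{LN}\!\left(\tilde{Z}^{(l)} + \mathrm{FFN}(\tilde{Z}^{(l)})\right),$$
where $Z^{(l)}$ is the output of the l-th layer;
The core of MHSA is the self-attention mechanism, whose single-head form is:
$$\mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d_k}}\right)\mathbf{V},$$
where $\mathbf{Q} = Z W_Q$, $\mathbf{K} = Z W_K$, $\mathbf{V} = Z W_V$, and $d_k$ is the dimension of a single attention head; multi-head attention is achieved by computing multiple attention heads in parallel, concatenating them, and then applying a linear transformation;
After $L_f$ layers of fusion encoding, the result $Z^{(L_f)}$ is average-pooled over the token dimension to yield a globally fused representation:
$$h_{fuse} = \frac{1}{3}\sum_{j=1}^{3} Z^{(L_f)}_j;$$
Finally, the fully connected classifier outputs the fetal risk prediction probability:
$$\hat{y} = \sigma\!\left(W_c h_{fuse} + b_c\right),$$
where $\sigma$ is the Sigmoid function, and $W_c$ and $b_c$ are learnable parameters.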
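
The core of the fusion step in claim 6 (scaled dot-product self-attention over three modal tokens, token-wise average pooling, and a sigmoid classifier) can be sketched as follows. This is a deliberately reduced illustration in plain Python: it uses a single head and a single residual connection, and omits layer normalization, the feedforward network, the modal embeddings, and the per-branch projections. All function and parameter names are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(A: Matrix, B: Matrix) -> Matrix:
    """Plain matrix product A @ B."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(Z: Matrix, Wq: Matrix, Wk: Matrix, Wv: Matrix) -> Tuple[Matrix, Matrix]:
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V. Returns (output, weights)."""
    Q, K, V = matmul(Z, Wq), matmul(Z, Wk), matmul(Z, Wv)
    d_k = len(Q[0])
    K_t = [list(col) for col in zip(*K)]  # transpose of K
    scores = matmul(Q, K_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights


def mean_pool(Z: Matrix) -> List[float]:
    """Average over the token dimension."""
    return [sum(col) / len(Z) for col in zip(*Z)]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def predict_risk(tokens: Matrix, Wq: Matrix, Wk: Matrix, Wv: Matrix,
                 w_c: List[float], b_c: float) -> float:
    """Attend across the modal tokens, add a residual, pool, and classify."""
    attended, _ = self_attention(tokens, Wq, Wk, Wv)
    fused = [[t + a for t, a in zip(t_row, a_row)]
             for t_row, a_row in zip(tokens, attended)]
    h_fuse = mean_pool(fused)
    return sigmoid(sum(w * h for w, h in zip(w_c, h_fuse)) + b_c)
```

With three tokens the attention weight matrix is only 3 × 3, so each row shows directly how much one modality draws on the other two.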

7. The intrapartum fetal risk assessment method based on multimodal data fusion according to claim 6, characterized in that the specific steps of step S3 include:
End-to-end training is performed using a joint objective of primary task classification loss and auxiliary reconstruction loss. A mini-batch is given as:
$$\mathcal{B} = \left\{\left(x^{ctg}_i, x^{img}_i, \tilde{m}_i, y_i\right)\right\}_{i=1}^{B},$$
where $x^{ctg}_i$ denotes the CTG time-series input, $x^{img}_i$ denotes the GADF image input generated from the FHR, $\tilde{m}_i$ denotes the structured clinical features of the mother, and $y_i$ is the true label;
The forward output of the model includes the logits $o_i$ of the classification head and the reconstructed vector $\hat{m}_i$ of the metadata autoencoder. The overall training objective is defined as:
$$\mathcal{L} = \mathcal{L}_{cls} + \lambda\,\mathcal{L}_{rec},$$
where $\mathcal{L}_{cls}$ is the primary task classification loss, $\mathcal{L}_{rec}$ is the metadata reconstruction loss, and $\lambda$ is its weighting coefficient;
Label-smoothing cross-entropy (LSCE) is used as the classification loss. For the true label $y_i$ of the i-th sample, the one-hot distribution is smoothed to obtain soft labels $q_i$:
$$q_{i,c} = (1 - \epsilon)\,\mathbb{1}[c = y_i] + \frac{\epsilon}{C},$$
where $\epsilon$ is the smoothing coefficient and $C$ is the number of classes;
The class probability distribution is obtained from the logits $o_i$ by softmax:
$$p_{i,c} = \frac{\exp(o_{i,c})}{\sum_{c'=1}^{C}\exp(o_{i,c'})};$$
The LSCE on a mini-batch is defined as:
$$\mathcal{L}_{cls} = -\frac{1}{B}\sum_{i=1}^{B}\sum_{c=1}^{C} q_{i,c}\,\log p_{i,c};$$
During the training phase, a reconstruction constraint is introduced so that the reconstructed vector $\hat{m}_i$ output by the maternal structured-metadata autoencoder stays close to its input $\tilde{m}_i$. The reconstruction loss is calculated using the mean square error:
$$\mathcal{L}_{rec} = \frac{1}{B\,d_m}\sum_{i=1}^{B}\left\lVert \hat{m}_i - \tilde{m}_i \right\rVert_2^2,$$
where $d_m$ is the number of metadata features;
During the training phase, the model parameters are updated by minimizing the joint loss $\mathcal{L}$, wherein the gradient of $\mathcal{L}_{cls}$ simultaneously updates the three encoders, the fusion module and the classification head, and the gradient of $\mathcal{L}_{rec}$ primarily acts on the metadata encoder / decoder branch to provide additional structural constraints;
Only the classification path is retained during the inference phase: the model outputs class logits and predicted probabilities for risk discrimination; the metadata reconstruction output $\hat{m}_i$ does not participate in the final prediction, but is only used as an auxiliary regularization term during the training phase.
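
The joint objective in claim 7 can be sketched as follows. This is a minimal illustration in plain Python that only evaluates the loss value for given model outputs; it does not perform gradient updates. The function names, the uniform smoothing variant (epsilon spread over all C classes), and the default values of `lam` and `eps` are assumptions rather than values stated in the claim.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over class logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def smooth_labels(y: int, num_classes: int, eps: float) -> List[float]:
    """Soft labels: (1 - eps) on the one-hot target plus eps / C on every class."""
    return [(1.0 - eps) * (1.0 if c == y else 0.0) + eps / num_classes
            for c in range(num_classes)]


def lsce(batch_logits: List[List[float]], labels: List[int], eps: float = 0.1) -> float:
    """Label-smoothing cross-entropy averaged over the mini-batch."""
    total = 0.0
    for logits, y in zip(batch_logits, labels):
        p = softmax(logits)
        q = smooth_labels(y, len(logits), eps)
        total -= sum(q_c * math.log(p_c) for q_c, p_c in zip(q, p))
    return total / len(labels)


def mse(recon: List[List[float]], target: List[List[float]]) -> float:
    """Mean squared error over all samples and all metadata features."""
    count = sum(len(t) for t in target)
    squared = sum((r - t) ** 2
                  for r_vec, t_vec in zip(recon, target)
                  for r, t in zip(r_vec, t_vec))
    return squared / count


def joint_loss(batch_logits: List[List[float]], labels: List[int],
               recon: List[List[float]], target: List[List[float]],
               lam: float = 0.1, eps: float = 0.1) -> float:
    """Classification loss plus lam times the metadata reconstruction loss."""
    return lsce(batch_logits, labels, eps) + lam * mse(recon, target)
```

At inference time only the softmax over the logits would be evaluated; `mse` and `joint_loss` are training-time quantities.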