Biomedical model training method based on length and communication awareness

The BEAM-Pipe framework combines length-aware bucketed dynamic padding and length-adaptive recomputation with communication-aware scheduling and hierarchical synchronization. It addresses uneven load and insufficient overlap between communication and computation in biomedical model training, improving training efficiency and GPU utilization and reducing memory pressure.

CN122241236A · Pending · Publication Date: 2026-06-19 · CHINA UNIV OF PETROLEUM (EAST CHINA)

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
CHINA UNIV OF PETROLEUM (EAST CHINA)
Filing Date
2026-05-18
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing pipelined parallel training methods suffer from uneven load distribution, pipeline bubbles, and insufficient overlap between communication and computation in biomedical model training, especially in scenarios with long-tail length distribution and significant intra-batch heterogeneity, resulting in low training efficiency.

Method used

The BEAM-Pipe integrated pipelined parallel optimization framework is adopted, which combines length-aware bucket dynamic filling, length-adaptive partial recomputation, communication-aware scheduling, hierarchical all-reduce, and phase shift and wavefront alignment to optimize the training process of biomedical models.

Benefits of technology

Without altering the model structure and training semantics, it improves GPU utilization, shortens iteration time, reduces peak GPU memory usage, enhances training efficiency and stability, and adapts to the long-tailed input distribution and significant intra-batch heterogeneity of biomedical models.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122241236A_ABST

Abstract

This invention discloses a biomedical model training method based on length and communication awareness, relating to the field of artificial intelligence technology. The method includes the following steps: acquiring raw corpus and constructing training samples based on the maximum sequence length; selecting consecutive layers within each stage to selectively recalculate the dynamically filled training set based on the attention mask of the initial training sample set, obtaining an adaptively recalculated activation tensor; jointly modeling the collected hierarchical computation time, hierarchical communication time, and activation memory usage data, minimizing stage latency differences under memory constraints, and generating a balanced pipeline stage partitioning scheme; performing gradient synchronization optimization on the adaptively recalculated activation tensor and the balanced pipeline stage partitioning scheme, and performing the first round of phase offset and waveform alignment to generate steady-state pipeline execution timing data; and collaboratively inputting the adaptively recalculated activation tensor, the balanced pipeline stage partitioning scheme, and the steady-state pipeline execution timing data to conduct pipeline parallel training.

Description

Technical Field

[0001] This invention relates to the field of artificial intelligence technology, and in particular to a biomedical model training method based on length and communication awareness.

Background Technology

[0002] In recent years, pre-trained models have expanded rapidly in natural language processing and related fields, with significant increases in parameter count and context length. In the biomedical field, this trend has spurred a series of increasingly capable domain-specific models, such as BioBERT (a biomedical bidirectional encoder representation model), BioGPT (a biomedical generative pre-trained Transformer), and SciFive (a scientific text-to-text transfer Transformer). As model size, the number of training tokens, and context length grow together, activations, optimizer states, and communication overhead rise rapidly, while the quadratic complexity of self-attention makes single-device training increasingly impractical.

[0003] These trends make distributed training essential. Data parallelism (DP) offers strong scalability but is limited by single-device memory and global all-reduce overhead; tensor parallelism (TP) can overcome single-device capacity constraints but introduces frequent fine-grained cross-device communication; pipelined parallelism (PP) divides the model into multiple stages and executes them in a streaming manner, providing a practical trade-off between memory usage and communication costs. Among these paradigms, PP is particularly attractive for training biomedical models because it can scale deep models without requiring all parameters or activations to reside on a single device.

[0004] However, most existing pipelined parallel training methods are designed for general-domain tasks, where input lengths are relatively concentrated and batch composition is usually more uniform. In such cases, static padding, fixed micro-batch sizes, static stage partitioning, and fixed recomputation ratios can often achieve a reasonable compromise between throughput and memory usage. But biomedical workloads are significantly different: titles, short questions, and structured fields generate a large number of short samples, while a small number of longer evidence fragments, full-text excerpts, or clinical records dominate the total number of tokens and peak memory usage. Therefore, biomedical inputs typically exhibit both long-tail length distributions and significant intra-batch heterogeneity, leading to uneven load distribution, pipeline bubbles, and insufficient overlap between communication and computation.

Summary of the Invention

[0005] To address the aforementioned issues, this invention provides an integrated pipelined parallel optimization framework (BEAM-Pipe) for biomedical models. BEAM-Pipe combines three-factor stage partitioning with length-aware bucket dynamic filling and length-adaptive partial recomputation, and further improves execution efficiency through communication-aware scheduling, hierarchical all-reduce, and wavefront alignment with phase shift.

[0006] The biomedical model training method based on length and communication awareness provided by this invention includes the following steps:

S1: Obtain the original corpus, clean its text, and concatenate it into blocks to generate a block-level biomedical corpus; then construct training samples according to the preset maximum sequence length to generate an initial training sample set.

S2: Based on the attention mask of the initial training sample set, calculate the number of valid tokens of each sample, take the maximum valid length of the current training batch, and map it to a preset set of length buckets to obtain the running bucket length; pad the samples uniformly to the running bucket length to generate a dynamically padded training set.

S3: Based on the running bucket length, calculate the target recalculation ratio of the stage, and determine the number of recalculated layers in combination with the number of layers in the pipeline stage; select a contiguous block of layers in each stage and perform selective recalculation on the dynamically padded training set to obtain the adaptively recalculated activation tensor.

S4: Perform lightweight runtime profiling on the adaptively recalculated activation tensor, and collect per-layer computation time, per-layer communication time, and activation memory usage data; jointly model the collected data, minimize the difference in stage latency under memory constraints, and generate a balanced pipeline stage partitioning scheme.

S5: Based on the balanced pipeline stage partitioning scheme, perform communication-aware scheduling and gradient synchronization optimization. For the forward activation tensors and backward gradient tensors between adjacent pipeline stages, grouped asynchronous receive / send is adopted and the tensors are transmitted over the GPU-side point-to-point communication path. For the gradient tensors produced by the data-parallel replicas within the same pipeline stage, an intra-node all-reduce, a cross-node all-reduce among node coordinators, and an intra-node broadcast are executed in sequence for each gradient sub-block to complete the bucketed hierarchical gradient synchronization. The first-round phase offset and waveform alignment are then performed to generate steady-state pipeline execution timing data.

S6: Input the adaptively recalculated activation tensor, the balanced pipeline stage partitioning scheme, and the steady-state pipeline execution timing data together to carry out pipeline parallel training.

[0007] In summary, the present invention has at least the following beneficial effects: 1. Without changing the model structure, loss definition, or basic training semantics, this invention integrates input length distribution, stage partitioning, memory control, communication organization, and bidirectional pipeline stabilization into a single system design. Three-factor stage partitioning reduces the critical-path pressure of the bottleneck stage. Length-aware bucketed dynamic padding and length-adaptive partial recalculation adapt on demand to the long-tail input distribution. Communication-aware scheduling, hierarchical all-reduce, and lightweight first-round phase stabilization further increase the overlap of computation and communication, shorten the warm-up phase, and reduce waiting and bubbles at stage boundaries.

[0008] 2. Based on lightweight runtime profiling, this invention jointly collects the per-layer forward / backward time, the communication overhead at stage boundaries, and the activation lifetimes, and solves a min-max partitioning problem (minimizing the duration of the slowest stage) that accounts for both boundary costs and memory constraints, so that a more balanced stage partitioning scheme is generated automatically. Building on this, to address the long-tail distribution and batch heterogeneity commonly found in biomedical inputs at the data block granularity, this invention maps the current batch to a discrete length bucket based on its effective input length and pads it dynamically to the bucket length; it then calculates the recalculation ratio from the running bucket length and performs selective recalculation within each stage under budget constraints. To reduce scheduling overhead and keep pipeline execution stable, the selected recalculation layers are organized as contiguous blocks and aligned with stage boundaries, which avoids cross-stage dependencies and additional communication overhead.

[0009] 3. This invention reduces scheduling bubbles in the initial and steady-state phases through offset and waveform alignment. The overall framework requires no modification to the model structure or loss definition, is compatible with strategies such as 1F1B and Chimera, and achieves higher GPU utilization, shorter iteration times, and lower peak memory usage in training large biomedical models. For certain bidirectional pipeline waveforms, such as Chimera, controllable initial offsets are applied to the up and down pipelines through offset and waveform alignment, so that the leading and trailing bubbles are filled by the opposing phase, forming a smoother steady-state waveform and reducing the risk of bubbles from the beginning of the run.

Description of the Drawings

[0010] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0011] Figure 1 is a flowchart of the mapping of inputs to buckets and the runtime decision-making process in this invention;
Figure 2 is a schematic diagram of the communication-aware pipeline scheduling and intra-stage data parallel synchronization structure provided in an embodiment of the present invention;
Figure 3 is a schematic diagram of length-adaptive recalculation and layer selection in this invention;
Figure 4 is an end-to-end performance comparison chart of this invention for the PubMedGPT-2.7B scenario dominated by long buckets;
Figure 5 is an end-to-end performance comparison chart of this invention for the PubMedGPT-2.7B scenario dominated by short buckets;
Figure 6 is an end-to-end performance comparison chart of this invention for the PubMedBERT-base scenario;
Figure 7 is an end-to-end performance comparison chart of this invention for the BioBERT-large scenario;
Figure 8 is a comparison chart showing the impact of the three-factor stage partitioning in this invention on iteration time and GPU utilization;
Figure 9 is a comparison chart showing the impact of the three-factor stage partitioning in this invention on peak memory and average memory;
Figure 10 is a diagram showing the experimental results of communication-aware scheduling in this invention;
Figure 11 is a diagram showing the experimental results of the hierarchical synchronization mechanism in this invention;
Figure 12 is a memory-time tradeoff diagram for length-adaptive recalculation under dynamic padding in this invention.

Detailed Implementation

[0012] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0013] The present invention is described in further detail below with reference to Figures 1 to 12.

[0014] A biomedical model training method based on length and communication awareness includes the following steps: Step 1: Preprocessing of the raw corpus.

[0015] The PubMed title-abstract corpus is used. After the raw PubMed XML data are parsed, the title and abstract fields are extracted. After unified cleaning and WordPiece tokenization, sentence-level segmentation is not applied; instead, block-level training samples are constructed directly according to the preset maximum sequence length, which is 512.

[0016] It should be noted that the length setting here refers to the upper limit of the externally configured length, rather than the actual running length of each iteration step; during training, the actual filling length is dynamically determined by mapping the effective length of the current batch to the corresponding bucket length.

[0017] Step 2: Dynamic binning of training samples.

[0018] Unlike fixed-length padding, the goal of this invention is not merely to reduce the invalid computation caused by padding tokens, but to preserve and utilize the inherent long-tail characteristics of biomedical inputs at the system level. This prevents the training load from being uniformly compressed into a single fixed shape, allowing it to adapt to the input distribution. This enables the "majority of short samples, minority of long samples" characteristic at the data distribution level to be truly transmitted to the runtime behavior of the training system, providing a more granular basis for memory management and scheduling optimization.

[0019] Specifically, let the attention mask of the n-th sample in a batch be $m^{(n)} \in \{0,1\}^{S}$. Its valid token count is defined as:

$$\ell_n = \sum_{j=1}^{S} m^{(n)}_j$$

In the formula, $S$ denotes the maximum sequence length allowed by the model.

[0020] For the current batch, the upper bound $L$ of the effective length at runtime is:

$$L = \max_{1 \le n \le N} \ell_n$$

In the formula, $N$ denotes the total number of samples in the batch.

[0021] Although a fully continuous dynamic tensor size fits the input length most closely, it significantly increases the complexity of communication organization and scheduling in pipelined parallel scenarios. To balance system controllability and input adaptability, this invention introduces a finite set of length buckets:

$$\mathcal{B} = \{b_1, b_2, \dots, b_M\}, \quad b_1 < b_2 < \dots < b_M$$

In the formula, $M$ denotes the number of buckets in the set, for example $\mathcal{B} = \{64, 128, 256, 512\}$ with $M = 4$.

[0022] For any given batch, the running length is mapped to the smallest bucket length that can accommodate the current maximum effective length, i.e.:

$$L_b = \min \{\, b \in \mathcal{B} : b \ge L \,\}$$

Subsequently, within the current training iteration step, padding, forward and backward computation, and cross-stage communication all uniformly use $L_b$. In other words, the training process no longer always runs at the global maximum length $S$, but instead constrains each iteration step to a discrete runtime length level. This design has two implications: first, a large number of short samples are no longer forced to pad to the global upper limit determined by a few long samples, which effectively reduces the wasted computation and memory introduced by fixed padding; second, the input length distribution is explicitly projected as a runtime variable in the training system, making the typical long-tail structure of biomedical scenarios visible at the system level. The mapping from length distribution to bucket allocation is shown in Figure 1.
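The bucket mapping described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the plain-list data layout, the right-padding convention, and `pad_id=0` are assumptions made for illustration; only the example bucket set {64, 128, 256, 512} comes from the text.

```python
from bisect import bisect_left
from typing import List, Sequence, Tuple

BUCKETS = (64, 128, 256, 512)  # example bucket set from the text, sorted ascending


def valid_lengths(attention_mask: Sequence[Sequence[int]]) -> List[int]:
    """Valid token count of each sample: the sum of its 0/1 attention mask."""
    return [sum(row) for row in attention_mask]


def running_bucket_length(attention_mask: Sequence[Sequence[int]],
                          buckets: Sequence[int] = BUCKETS) -> int:
    """Smallest bucket that can hold the longest valid sample in the batch."""
    longest = max(valid_lengths(attention_mask))
    idx = bisect_left(buckets, longest)
    if idx == len(buckets):
        raise ValueError("sample longer than the largest bucket")
    return buckets[idx]


def pad_to_bucket(input_ids: Sequence[Sequence[int]],
                  attention_mask: Sequence[Sequence[int]],
                  buckets: Sequence[int] = BUCKETS,
                  pad_id: int = 0) -> Tuple[List[List[int]], List[List[int]], int]:
    """Re-pad every sample to the running bucket length.

    Assumes valid tokens are stored contiguously at the front (right padding).
    """
    target = running_bucket_length(attention_mask, buckets)
    ids_out, mask_out = [], []
    for ids, mask in zip(input_ids, attention_mask):
        n = sum(mask)
        ids_out.append(list(ids[:n]) + [pad_id] * (target - n))
        mask_out.append([1] * n + [0] * (target - n))
    return ids_out, mask_out, target
```

For a batch whose longest sample has 70 valid tokens, every sample is padded to 128 rather than to the global maximum of 512.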

[0023] Step 3: Selective recalculation of bucket length.

[0024] After bucketed dynamic padding is adopted, the runtime input length is no longer constant across training iteration steps, so the activation tensor size, peak memory usage, and single-step computation cost all change dynamically with the input length. In particular, although long-bucket samples are relatively few in number, they often dominate the peak memory usage and tail latency during training. In this case, a fixed recalculation ratio cannot serve short and long samples at the same time: a uniformly large recalculation ratio introduces unnecessary extra computation on the many short samples, while a uniformly small recalculation ratio cannot adequately relieve memory pressure in long-bucket scenarios. Therefore, this invention further introduces a length-adaptive recalculation ratio function $r(L_b)$, which dynamically adjusts the recalculation intensity according to the current running bucket length $L_b$. It can be expressed in the following piecewise form:

$$r(L_b) = \begin{cases} r_s, & L_b \le T_1 \\ r_m, & T_1 < L_b \le T_2 \\ r_l, & L_b > T_2 \end{cases}$$

In the formula, $T_1$ denotes the first length threshold, $T_2$ denotes the second length threshold, $r_s$ denotes the recalculation ratio for short inputs, $r_m$ denotes the recalculation ratio for medium inputs, and $r_l$ denotes the recalculation ratio for long inputs.

[0025] This design allows short-bucket samples to typically use a lower recalculation ratio, or even no recalculation at all, in order to maintain high throughput; while long-bucket samples use a higher recalculation ratio in exchange for more memory reclamation space and reduced OOM risk.
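A minimal sketch of such a segmented ratio function is given below. The threshold values (128 and 256), the three ratios, and the choice of inclusive upper bounds are hypothetical placeholders chosen for illustration; the text does not specify them.

```python
def recompute_ratio(bucket_len: int,
                    t1: int = 128,          # hypothetical first length threshold
                    t2: int = 256,          # hypothetical second length threshold
                    r_short: float = 0.0,   # hypothetical ratio for short inputs
                    r_mid: float = 0.25,    # hypothetical ratio for medium inputs
                    r_long: float = 0.5) -> float:  # hypothetical ratio for long inputs
    """Piecewise recalculation ratio selected by the running bucket length."""
    if bucket_len <= t1:
        return r_short
    if bucket_len <= t2:
        return r_mid
    return r_long
```

With these placeholder settings, batches in the 64 and 128 buckets skip recalculation entirely, while batches in the 512 bucket recompute half of the layers in each stage.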

[0026] It should be noted that the length adaptation here does not perform arbitrary fine-grained continuous tensor size adjustment on the input, but rather makes online decisions within the discrete operating space defined by the finite bucket, combined with the statistics of the effective length within the batch. Therefore, it can achieve a balance between system complexity and input adaptability.

[0027] The recalculation ratio alone is insufficient to guide execution in pipeline parallelism, because the computational load, activation lifetimes, and memory budget vary across stages. Therefore, this invention further makes length-adaptive recalculation concrete as an intra-stage layer selection problem. Suppose a pipeline stage contains $m$ local layers and the recalculation ratio corresponding to the current input length is $r(L_b)$, where $L_b$ is the running bucket length. The target number of recalculated layers $k$ in this stage can then be defined as:

$$k = \left[\, r(L_b) \cdot m \,\right]$$

where $[\cdot]$ denotes rounding to an integer. Furthermore, $k$ is clipped to the valid range by combining the per-stage budget with the boundary constraints. After the number of recalculated layers $k$ is determined, this invention does not apply a uniform recalculation across all stages, but performs selective recalculation within each stage: a contiguous block of layers is preferentially selected from the local layer set of that stage as the candidate recalculation region, rather than a scattered selection across stages. The layer selection method within each stage is shown in Figure 3.
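The intra-stage layer selection can be sketched as follows. Several details here are assumptions rather than the patent's stated method: rounding up when converting the ratio to a layer count, expressing the stage budget as an optional maximum number of layers, and choosing the contiguous window with the largest total activation size. All function and parameter names are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def recompute_layer_count(ratio: float, num_layers: int,
                          budget: Optional[int] = None) -> int:
    """Turn a ratio into a layer count and clip it to the valid range.

    Rounding up is an assumption; the text only says the count is derived
    from the ratio and the number of layers in the stage.
    """
    k = math.ceil(ratio * num_layers)
    if budget is not None:
        k = min(k, budget)
    return max(0, min(k, num_layers))


def pick_contiguous_block(activation_bytes: Sequence[int], k: int) -> List[int]:
    """Indices of the contiguous k-layer window with the largest activation total.

    The selection criterion is an illustrative assumption; the window never
    crosses the stage boundary because only the stage's local layers are passed in.
    """
    if k <= 0:
        return []
    k = min(k, len(activation_bytes))
    window = sum(activation_bytes[:k])
    best, best_start = window, 0
    for start in range(1, len(activation_bytes) - k + 1):
        # Slide the window one layer to the right.
        window += activation_bytes[start + k - 1] - activation_bytes[start - 1]
        if window > best:
            best, best_start = window, start
    return list(range(best_start, best_start + k))
```

Keeping the selected layers contiguous means a single checkpoint boundary per stage, which matches the stated goal of low scheduling overhead and no extra cross-stage dependencies.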

[0028] From a system perspective, this mechanism transforms the long tail of biomedical input length from a "data distribution characteristic" into a "running variable in the training system." With dynamic bucket filling, the runtime length, activation scale, and memory pressure of different iteration steps dynamically change with the input distribution. With length-adaptive recalculation, the memory risk brought by long-bucket samples can be specifically mitigated within a stage, while short-bucket samples do not incur the additional computational overhead introduced by uniform recalculation. More importantly, this method is less invasive: it does not change model parameters, loss definitions, or basic training interfaces; it only adjusts the runtime level at the iteration step level and adjusts the local "save / recalculate" strategy within a stage, without introducing additional cross-stage synchronization dependencies. Therefore, this mechanism can be naturally embedded into the subsequent stage partitioning, communication-aware scheduling, and waveform alignment framework: the input length distribution determines the runtime shape and memory budget of the current iteration step; stage partitioning is responsible for balancing the time and memory pressure of each stage globally; and communication overlap, offset alignment, and hierarchical synchronization further absorb the rhythm fluctuations caused by switching between different-length buckets.

[0029] Step 4: Joint modeling of the three factors.

[0030] A single profiling pass is performed at runtime to collect the steady-state median forward / backward time of each layer. Communication time is measured in a way that closely resembles actual operation: this invention records the average time from submission to completion at the sending end at the granularity of "stage boundary × direction (forward / backward) × message type", and archives it together with the message size in bytes; during training, the per-boundary average is converted into the layer-level communication time.

[0031] It should be noted that this step does not deduct the part that may overlap with the calculation. The communication time is included in this model as an explicit term alongside the calculation time, which is consistent with the actual measurement method used.

[0032] The overall time $T_p$ of stage $p$ is:

$$T_p = \sum_{i \in \mathcal{S}_p} \left( t_i^{f} + t_i^{b} + t_i^{c} \right)$$

In the formula, $\mathcal{S}_p$ denotes the set of model layers contained in the p-th stage, $t_i^{f}$ denotes the forward computation time of the i-th layer, $t_i^{b}$ denotes the backward computation time of the i-th layer, and $t_i^{c}$ denotes the layer-level communication time of the i-th layer.

[0033] Let the output activation size of layer $i$ be $a_i$. Under the adopted scheduling (e.g., 1F1B), the effective time period from forward generation to backward use and release is called the "activation lifetime". This invention uses a "lifetime scan" to approximate the peak memory:

$$M_p = \max_{t} \sum_{i \in \mathcal{S}_p} a_i \, I_i(t)$$

$$M_p \le M_p^{\max}$$

In the formula, $a_i$ denotes the output activation size of the i-th layer, $I_i(t) \in \{0, 1\}$ indicates whether the activation of the i-th layer is still alive at time $t$, and $M_p^{\max}$ denotes the peak memory limit of the p-th stage.

[0034] The stage partitioning is expressed in an intuitive form: while satisfying the memory constraints, the goal is to make the slowest stage as fast as possible.

[0035]

$$\min \; T_{\max}$$

$$\text{s.t.} \quad T_{\max} = \max_{p} T_p, \qquad M_p \le M_p^{\max} \;\; \forall p$$

In the formula, $T_{\max}$ denotes the maximum execution time across all stages.
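The stage time and the lifetime scan can be sketched in a few lines of Python. This is an illustrative approximation under stated assumptions: lifetimes are given as half-open `(start, end)` tick intervals, and the function names and inputs are hypothetical.

```python
from typing import List, Sequence, Tuple


def stage_time(fwd: Sequence[float], bwd: Sequence[float],
               comm: Sequence[float]) -> float:
    """Stage time: sum of forward, backward, and communication time over its layers."""
    return sum(f + b + c for f, b, c in zip(fwd, bwd, comm))


def peak_activation_memory(sizes: Sequence[int],
                           lifetimes: Sequence[Tuple[int, int]]) -> int:
    """Lifetime scan: sweep over allocation/release events and track the live total.

    Each lifetime is a half-open interval [start, end): the activation is
    produced at `start` and released at `end` (an assumed convention).
    """
    events: List[Tuple[int, int]] = []
    for size, (start, end) in zip(sizes, lifetimes):
        events.append((start, size))   # activation becomes live
        events.append((end, -size))    # activation is released
    # At equal timestamps, releases (negative deltas) sort before allocations.
    events.sort(key=lambda e: (e[0], e[1]))
    live = peak = 0
    for _, delta in events:
        live += delta
        peak = max(peak, live)
    return peak


def stage_is_feasible(sizes: Sequence[int], lifetimes: Sequence[Tuple[int, int]],
                      mem_limit: int) -> bool:
    """Memory constraint check: the scanned peak must not exceed the stage limit."""
    return peak_activation_memory(sizes, lifetimes) <= mem_limit
```

Nested lifetimes, as produced by a forward pass followed by a backward pass in reverse layer order, give a peak equal to the sum of all activations that are live at the turnaround point.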

[0036] Based on the measured time and memory data, the final stage partitioning uses a two-step method that is both reproducible and efficient. First, a composite weight is defined for each layer, and the weights are accumulated from front to back. A stage boundary is placed when the accumulated weight approaches the per-stage limit without exceeding it; if adding the next layer would cause the current stage to exceed its limits, that layer is moved to the next stage instead, which yields a feasible and roughly balanced initial partition. Afterward, small-scale boundary adjustments and layer swaps are made between adjacent stages. If an adjustment decreases the maximum stage time while the constraints are still met, the update is accepted.
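The two-step procedure can be sketched as below. This is a simplified illustration, not the patented algorithm: the per-stage weight limit is assumed to be the mean weight per stage, stage memory is approximated as the sum of per-layer memory values, and refinement only moves a boundary by one layer at a time. All names are hypothetical.

```python
from typing import List, Sequence


def stage_sums(values: Sequence[float], cuts: Sequence[int]) -> List[float]:
    """Per-stage totals; `cuts` holds the exclusive end index of each stage."""
    sums, start = [], 0
    for end in cuts:
        sums.append(sum(values[start:end]))
        start = end
    return sums


def greedy_partition(weights: Sequence[float], mems: Sequence[float],
                     num_stages: int, mem_limit: float) -> List[int]:
    """Step 1: accumulate weights front to back and cut before a limit is exceeded."""
    target = sum(weights) / num_stages  # assumed per-stage weight limit
    cuts: List[int] = []
    acc_w = acc_m = 0.0
    count, n = 0, len(weights)
    for i in range(n):
        stages_to_open = num_stages - len(cuts) - 1
        if count > 0 and stages_to_open > 0:
            over = acc_w + weights[i] > target or acc_m + mems[i] > mem_limit
            forced = (n - i) == stages_to_open  # keep every later stage non-empty
            if over or forced:
                cuts.append(i)
                acc_w = acc_m = 0.0
                count = 0
        acc_w += weights[i]
        acc_m += mems[i]
        count += 1
    cuts.append(n)
    return cuts


def refine(weights: Sequence[float], mems: Sequence[float],
           cuts: Sequence[int], mem_limit: float) -> List[int]:
    """Step 2: shift boundaries by one layer while the slowest stage gets faster."""
    cuts = list(cuts)
    improved = True
    while improved:
        improved = False
        for j in range(len(cuts) - 1):
            for step in (-1, 1):
                trial = cuts[:]
                trial[j] += step
                lower = cuts[j - 1] if j > 0 else 0
                if not (lower < trial[j] < cuts[j + 1]):
                    continue  # would create an empty stage
                if max(stage_sums(mems, trial)) > mem_limit:
                    continue  # violates the memory constraint
                if max(stage_sums(weights, trial)) < max(stage_sums(weights, cuts)):
                    cuts, improved = trial, True
    return cuts
```

For layer weights `[2, 2, 2, 2, 5, 5]` and three stages, the greedy pass yields stage times of 6, 2, and 10; moving one boundary by a single layer then lowers the slowest stage from 10 to 7.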

[0037] Step 5: Layered synchronization of tensor gradients.

[0038] As shown in Figure 2, this invention optimizes both the inter-stage point-to-point communication paths and the intra-stage data parallel synchronization paths in communication-aware scheduling. For forward activation transmission and backward gradient transmission between adjacent pipeline stages, a grouped asynchronous receive / send method is used to organize communication events, and tensors are preferentially transmitted through the GPU-side point-to-point communication path to reduce the additional overhead caused by CPU relay and distributed communication calls. For data parallel replicas within the same pipeline stage, an intra-stage hierarchical synchronization structure is adopted. Local reduction is first performed within the node, then cross-node reduction is performed by the node coordinators, and finally the reduction result is broadcast back to the replicas within each node, thereby reducing cross-node communication traffic and lowering synchronization tail latency.

[0039] To balance high-performance execution with compatibility with general deployment, the system adopts a dual-branch design at the communication backend level. For the point-to-point backend supporting efficient device-side communication, the system keeps tensors resident on the GPU as much as possible and organizes send and receive operations in groups to reduce latency disturbances caused by host involvement and additional copying. For the more general communication backend, a message-by-message transmission path based on message identifiers is retained to ensure stability and availability under different operating environments. The two implementation paths maintain consistency in training semantics, i.e., they do not change the forward dependencies, backward gradient propagation order, or parameter update timing. However, their differences in event submission granularity, data residence location, and runtime call patterns provide optimization space for the coordinated scheduling of communication and computation.

[0040] At the path layer, this invention integrates the four types of point-to-point events in a single iteration—forward reception, forward transmission, reverse reception, and reverse transmission—into grouped processing. Specifically, the receiving side pre-attaches reception requests to the tensors required for the current iteration step and records handles. The sending side, while ensuring continuous tensor layout and consistent data types, organizes the tensors to be sent within the same stage into a communication operation list and submits them uniformly in the form of group operations. This "attach reception first, then batch send" processing method reduces the idle overhead caused by waiting for the other side and converges the communication events originally initiated at the tensor granularity into a small number of group-level operations, thereby reducing the runtime system call frequency and scheduling jitter.

[0041] For device-side communication paths, the entire data transmission process is completed on the GPU side as much as possible to avoid additional copying, layout disturbances, and synchronization delays caused by CPU relay. For general communication paths, a blocking message-by-message implementation is retained to ensure stable execution capability of the system under heterogeneous network conditions. Through this unified and hierarchical event organization method, communication is transformed from a "discrete additional process" into an explicit execution phase that is "schedulable, overlapping, and optimizable".

[0042] Furthermore, to address the issue that data parallel gradient synchronization in multi-node training can easily become a tail bottleneck, this invention introduces a hierarchical synchronization mechanism for data parallel replicas within a stage, while keeping the gradient reduction result unchanged.

[0043] It's important to note that in data-parallel scenarios, if replicas within the same pipeline stage can be completely localized within a single node, their gradient synchronization is naturally limited by the high-speed interconnects within that node, typically resulting in lower communication costs and better execution stability. Therefore, under the premise of satisfying pipeline mapping, stage partitioning, and load balancing constraints, the system prioritizes preserving this node locality to reduce unnecessary cross-node synchronization overhead. Only when limitations imposed by the total number of cards, stages, or replica placement inevitably lead to replicas of the same stage being distributed across nodes does cross-node gradient synchronization become a significant factor affecting tail latency. In this case, further enabling hierarchical all-reduce to enhance intra-stage DP synchronization efficiency is necessary.

[0044] Specifically, layered synchronization consists of a two-level reduction process. First, an intra-node all-reduce is executed within the data parallel synchronization group of each pipeline stage to obtain the local reduction result on the current node. Then, each node selects a master process to form a master process group, and an inter-node all-reduce is executed within this group to complete the global reduction. Finally, the global reduction result is distributed to other replicas within the same node via intra-node broadcast. This two-level structure of "intra-node reduction - inter-node reduction by master process - intra-node broadcast" restricts most data exchange to high-speed links within the node, exposing only the necessary global reduction path to the cross-node network. This effectively reduces cross-node bandwidth pressure, alleviates synchronization tail congestion, and improves inter-step latency jitter in multi-node scenarios.

[0045] To further improve throughput and reduce communication overhead, this invention introduces a bucketing mechanism in hierarchical synchronization. Large gradient vectors are divided into multiple sub-blocks, and a reduction process of "intra-node reduction—inter-node reduction in the main process—intra-node broadcast" is executed sequentially according to the bucket granularity. Compared to centralized synchronization of the entire gradient vector at once, the bucketing strategy can initiate the communication process of some gradients earlier, forming a closer pipeline overlap with subsequent backpropagation, thereby improving communication hiding and shortening the critical path length. Simultaneously, sub-block-level synchronization also helps reduce the instantaneous peak pressure of a single reduction, enhancing latency smoothness during operation. In summary, this invention does not simply pursue local optimization at the communication bandwidth level, but rather, through the design principles of "locality priority, cross-node hierarchical, and full overlap," integrates point-to-point transmission and gradient synchronization into a unified scheduling system, achieving more stable and efficient end-to-end execution without altering the training semantics.
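The two-level reduction and its bucketed variant can be illustrated with a single-process simulation. This sketch only models the data flow on plain Python lists and performs no real communication; the type aliases and function names are hypothetical, and gradients are assumed to be equal-length vectors indexed by node and local rank.

```python
from typing import List

Vector = List[float]
NodeGrads = List[List[Vector]]  # NodeGrads[node][local_rank] -> gradient vector


def _elementwise_sum(vectors: List[Vector]) -> Vector:
    return [sum(values) for values in zip(*vectors)]


def hierarchical_all_reduce(grads: NodeGrads) -> NodeGrads:
    """Intra-node reduce, inter-node reduce among node leaders, intra-node broadcast."""
    # Level 1: each node reduces the gradients of its local replicas.
    node_partials = [_elementwise_sum(node) for node in grads]
    # Level 2: only one partial result per node crosses the inter-node network.
    global_sum = _elementwise_sum(node_partials)
    # Level 3: every node broadcasts the global result to its local replicas.
    return [[list(global_sum) for _ in node] for node in grads]


def bucketed_hierarchical_all_reduce(grads: NodeGrads, bucket_size: int) -> NodeGrads:
    """Run the same three-level reduction sub-block by sub-block."""
    num_elems = len(grads[0][0])
    out: NodeGrads = [[[] for _ in node] for node in grads]
    for start in range(0, num_elems, bucket_size):
        sub_blocks = [[g[start:start + bucket_size] for g in node] for node in grads]
        reduced = hierarchical_all_reduce(sub_blocks)
        for n, node in enumerate(reduced):
            for r, piece in enumerate(node):
                out[n][r].extend(piece)
    return out
```

Both functions return the same values as a flat sum over all replicas, which mirrors the statement that the gradient reduction result is unchanged. In a real system each sub-block could be launched as soon as its gradients are ready, which is where the overlap with backpropagation comes from.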

[0046] Step 6: First-round phase offset and waveform alignment.

[0047] In a "bidirectional pipeline" scenario, that is, an execution model in which two opposing waveforms propagate simultaneously across stages within the same training iteration (represented by Chimera), the timing assumptions differ from those of unidirectional 1F1B or interleaved 1F1B. Within the same stage, the uplink and downlink waveforms share the same GPU computing resources and the same adjacent-stage P2P links, which makes the startup phase more prone to mutual waiting and boundary bubbles. To address this, this invention introduces a bounded phase offset in the first iteration. By gating the forward transmission rates of the two waveforms, controllable peak shifts are created in critical stages, so that computation in one direction covers the communication window of the opposite direction, reducing warm-up bubbles at startup and enabling a faster transition to steady state. The offset takes effect only in the first iteration; subsequent iterations revert to offset-free execution, so it imposes no additional burden on long-term throughput and complements the communication grouping of the previous step.

[0048] The implementation is based on runtime-generated symbol-stream scheduling and is realized by configuring an offset parameter. The absolute value of this parameter, k, defines the maximum number of micro-batch transmission clock cycles by which one direction may be delayed during the first round. The sign of the parameter selects the direction: for one sign, the uplink pipeline deliberately lags by k clock cycles in its first-round micro-batch transmissions; for the opposite sign, the downlink pipeline lags by k clock cycles. Since the offset only changes the initial transmission timing and does not alter the dependency topology, the operator set, or the communication pairing relationships (message identifier, step number), the event types and communication rounds remain unchanged during training. At the same time, the computation segment of one waveform becomes more continuous and can cover a larger portion of the communication segment of the other waveform. Combined with the path-layer grouping (receive first, send in batches, wait once) from the previous section, the covered communication segments exhibit better continuity and predictability, effectively suppressing mutual waiting between peers and short-cycle jitter during the startup phase.
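The effect of the offset on first-round transmission timing can be sketched as below. This is a hedged illustration only: the function name `injection_clocks` is hypothetical, one micro-batch per clock cycle per direction is assumed, and the sign convention (positive delays the uplink, negative delays the downlink) is an assumption of this sketch rather than something the text confirms.

```python
from typing import Dict, List


def injection_clocks(
    microbatches_per_direction: int, offset: int, iteration: int
) -> Dict[str, List[int]]:
    """Return the clock cycle at which each micro-batch is sent, per direction.

    Assumed sign convention (not confirmed by the text): offset > 0 delays the
    uplink direction, offset < 0 delays the downlink direction, by k = |offset|.
    """
    # The offset acts only in iteration 0; later iterations run offset-free.
    k = abs(offset) if iteration == 0 else 0
    up_delay = k if offset > 0 else 0
    down_delay = k if offset < 0 else 0
    # Only the start time shifts; micro-batch order and count are unchanged.
    return {
        "up": [up_delay + i for i in range(microbatches_per_direction)],
        "down": [down_delay + i for i in range(microbatches_per_direction)],
    }


first_round = injection_clocks(4, offset=2, iteration=0)
steady = injection_clocks(4, offset=2, iteration=1)
```

Here `first_round` shifts the uplink transmissions by two cycles while the downlink starts at cycle 0, and `steady` shows both directions starting together again from the second iteration onward.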

[0049] In terms of parameterization and constraints, the offset parameter can be set as a global constant or given as a list per node or per direction; by default it falls back to a single constant. The scope of the offset is strictly limited to the first iteration, i.e., it applies only to iteration 0, and subsequent iterations default to offset-free execution. To ensure that the steady state is reachable, the following basic structural conditions must be met: the number of stages must be even, and 2 × the number of micro-batches must be an integer multiple of the number of stages, thereby ensuring that the interleaved uplink and downlink waveforms seamlessly fill the steady-state window.
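The parameterization and structural conditions above can be checked with a few lines of code. This is a minimal sketch under stated assumptions: `check_structure` and `resolve_offsets` are hypothetical helper names, and the "list per node or per direction" is modeled simply as a sequence with one integer per entry.

```python
from typing import List, Sequence, Union


def check_structure(num_stages: int, num_microbatches: int) -> None:
    """Raise ValueError unless the two structural conditions hold."""
    if num_stages <= 0 or num_stages % 2 != 0:
        raise ValueError("number of stages must be a positive even number")
    if (2 * num_microbatches) % num_stages != 0:
        raise ValueError("2 x micro-batches must be a multiple of the stage count")


def resolve_offsets(
    offset: Union[int, Sequence[int]], num_entries: int, iteration: int
) -> List[int]:
    """Expand the offset setting to one value per entry (node or direction)."""
    if iteration != 0:
        # Outside iteration 0 the schedule is offset-free.
        return [0] * num_entries
    if isinstance(offset, int):
        # Fallback: a single global constant applied to every entry.
        return [offset] * num_entries
    if len(offset) != num_entries:
        raise ValueError("offset list length must match the number of entries")
    return list(offset)


check_structure(num_stages=8, num_microbatches=8)  # 16 is a multiple of 8
per_direction = resolve_offsets([1, -1], num_entries=2, iteration=0)
```

For example, 8 stages with 6 micro-batches would be rejected, because 12 is not a multiple of 8.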

[0050] Experimental verification. Hardware environment: This invention was evaluated on a multi-node GPU cluster consisting of 32 NVIDIA server nodes, each equipped with four NVIDIA Tesla A100-PCIe (40 GB) GPUs. By default, the scheduler bound resources per GPU, allocating 32 CPU cores and 55 GB of main memory to each GPU. The nodes were interconnected via a 4×100 Gb/s RoCE high-speed network.

[0051] Models and tasks. Two types of models are used. Encoder-based pre-trained model: a biomedical variant of the BERT model (BioBERT), trained with the masked language modeling (MLM) objective; whether next sentence prediction (NSP) is enabled depends on the corresponding open-source implementation and configuration (this invention uses the original implementation by default). Decoder-based autoregressive model: BioMedLM (PubMedGPT 2.7B), a 2.7B-parameter biomedical language model based on the GPT-2 architecture, trained with the causal language modeling (next-token prediction) objective. Unless otherwise specified, the vocabulary, positional encoding, and output head of each model are consistent with the corresponding open-source implementation.

[0052] Baselines and parallel configuration. This invention uses the pipelined parallel training method GPipe, the one-forward-one-backward pipeline scheduling method (1F1B), and the bidirectional pipelined parallel method Chimera as comparative schemes, and evaluates the method of this invention as the optimized scheme. The main experiments were conducted on 4 and 8 A100 GPUs to compare the training performance of the different pipelined parallel strategies on an encoder model, a large-scale encoder model, and a decoder model. The 8-GPU experiments used PP=8 and DP=1 to analyze the effect of different pipeline scheduling and communication optimization strategies with a single data-parallel replica.

[0053] Multi-node data-parallel scaling configuration. In addition to the main experiments described above, this invention conducted supplementary experiments on 16 GPUs with PP=8 and DP=2 to evaluate the impact of ordinary all-reduce and hierarchical all-reduce on iteration time, GPU utilization, and peak memory usage in a data-parallel scaling scenario. The experiment numbers, models, parallel scales, workloads, comparison schemes, and verification objectives are listed in Table 1, where S denotes the sequence length and microBS denotes the micro-batch size of each pipeline stage.

[0054] In the E4 data-parallel scaling experiment, two data-parallel replica placement methods were further configured: non-cross-node placement, in which the two data-parallel replicas of the same logical stage are confined to a single node as far as possible; and cross-node placement, in which the two data-parallel replicas of the same logical stage are explicitly distributed to different nodes to amplify the inter-node synchronization cost.
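The two placement methods can be illustrated with a simple rank-to-node mapping. This is a sketch under assumptions that the text does not specify: the function name `place_replicas` and the two rank orderings (stage-major and replica-major) are hypothetical, and nodes are filled in rank order with four GPUs each, matching the hardware described above.

```python
from typing import Dict, Tuple


def place_replicas(
    pp: int, dp: int, gpus_per_node: int, cross_node: bool
) -> Dict[Tuple[int, int], int]:
    """Map each (stage, replica) pair to a node index."""
    placement = {}
    for stage in range(pp):
        for replica in range(dp):
            if cross_node:
                # Replica-major ranks: replicas of one stage are pp ranks apart.
                rank = replica * pp + stage
            else:
                # Stage-major ranks: replicas of one stage are adjacent.
                rank = stage * dp + replica
            placement[(stage, replica)] = rank // gpus_per_node
    return placement


same_node = place_replicas(pp=8, dp=2, gpus_per_node=4, cross_node=False)
split_node = place_replicas(pp=8, dp=2, gpus_per_node=4, cross_node=True)
```

With PP=8, DP=2, and four GPUs per node, `same_node` keeps both replicas of every stage on one node, while `split_node` puts them on different nodes, so every gradient synchronization has to cross the inter-node network.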

[0055] Table 1: Evaluation configuration for each model experiment

| Experiment number | Model | Scale / parallel configuration | Workload (S, microBS) | Comparison schemes | Verification objective |
| --- | --- | --- | --- | --- | --- |
| E1 | PubMedBERT-base | 4×A100, DP=1 | (512, 8), (512, 16) | GPipe, 1F1B, Chimera, method of this invention | Encoder model comparison |
| E2 | BioBERT-large | 8×A100, PP=8, DP=1 | (512, 8), (512, 16) | GPipe, 1F1B, Chimera, method of this invention | Large-scale encoder model evaluation |
| E3 | PubMedGPT-2.7B | 8×A100, PP=8, DP=1 | (1024, 1) | GPipe, 1F1B, Chimera, method of this invention | Decoder model evaluation |
| E4 | BioBERT-large | 16×A100, PP=8, DP=2 | (512, 8), (512, 16) | Method of this invention + ordinary all-reduce / method of this invention + hierarchical all-reduce | Evaluation of data-parallel scaling and synchronization mechanisms |

As shown in Figures 4-12, the proposed method was experimentally verified in terms of end-to-end training performance, the effect of three-factor stage partitioning, memory behavior, the benefits of communication-aware scheduling, and the effect of length-adaptive recomputation. BEAM-Pipe denotes the method of this invention; the comparative schemes are GPipe, 1F1B, and Chimera.

[0056] End-to-end performance: As shown in Figures 4-7, this invention compares the end-to-end performance of different pipelined parallel schemes on representative biomedical models: a 2.7B-parameter biomedical language model based on the GPT architecture (PubMedGPT-2.7B), a BERT-base biomedical encoder model pre-trained on PubMed corpus (PubMedBERT-base), and a BERT-large biomedical encoder model for biomedical text mining tasks (BioBERT-large). In Figures 4-7, the four columns correspond to the long-bucket-dominated PubMedGPT-2.7B scenario, the short-bucket-dominated PubMedGPT-2.7B scenario, the PubMedBERT-base scenario, and the BioBERT-large scenario, respectively, and the three rows report GPU utilization, single-step iteration time, and peak memory usage, respectively. The results show that the method of this invention maintains relatively stable training performance across different models and runtime length distributions: in most scenarios it improves or maintains high GPU utilization, shortens end-to-end iteration time, and reduces or controls peak memory usage. This indicates that this invention is not limited to a single model or a fixed sequence length configuration, but adapts to the long-tailed input distribution and dynamic runtime length variations commonly found in biomedical corpora.

[0057] Effect of three-factor stage partitioning: As shown in Figure 8 and Figure 9, with the model structure and training configuration fixed, this invention further verifies the impact of three-factor stage partitioning on runtime, GPU utilization, and memory behavior. Figure 8 compares iteration time and GPU utilization under different pipeline parallelization schemes. Figure 9 compares peak and average GPU memory usage. The experimental results show that, by jointly considering forward / backward computation time, stage-boundary communication overhead, and activation memory usage, this invention generates a more balanced pipeline stage partitioning scheme, which reduces stage imbalance, shortens the critical path of the bottleneck stage, improves device utilization, and achieves a more reasonable balance between peak and average GPU memory usage.
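The three-factor partitioning can be sketched as a front-to-back accumulation followed by small boundary adjustments between adjacent stages. This is a simplified illustration, not the procedure of this invention: the names `stage_sums` and `partition_layers` are hypothetical, each layer's time is taken as the sum of its forward, backward, and communication times, and a stage's memory is approximated by the sum of its layers' activation sizes rather than a time-resolved peak.

```python
from typing import List, Sequence


def stage_sums(bounds: Sequence[int], per_layer: Sequence[float]) -> List[float]:
    """Sum a per-layer quantity over each stage [bounds[p], bounds[p + 1])."""
    return [sum(per_layer[bounds[p]:bounds[p + 1]]) for p in range(len(bounds) - 1)]


def partition_layers(
    layer_time: Sequence[float],
    layer_mem: Sequence[float],
    num_stages: int,
    mem_limit: float,
) -> List[int]:
    """Return stage boundaries [0, ..., n] balancing time under a memory cap."""
    n = len(layer_time)
    if not 0 < num_stages <= n:
        raise ValueError("need between 1 and len(layer_time) stages")

    # Pass 1: accumulate layer time front to back and cut at equal shares,
    # forcing a cut when the remaining layers are only just enough.
    target = sum(layer_time) / num_stages
    bounds, acc = [0], 0.0
    for i, t in enumerate(layer_time):
        acc += t
        cuts_left = num_stages - len(bounds)
        layers_left = n - (i + 1)
        if cuts_left > 0 and (acc >= target * len(bounds) or layers_left == cuts_left):
            bounds.append(i + 1)
    bounds.append(n)

    def score(b: Sequence[int]):
        # Memory overflow is compared first, then the slowest stage time.
        overflow = sum(max(0.0, m - mem_limit) for m in stage_sums(b, layer_mem))
        return (overflow, max(stage_sums(b, layer_time)))

    # Pass 2: move one layer across a boundary whenever the score improves.
    improved = True
    while improved:
        improved = False
        for j in range(1, num_stages):
            for step in (-1, 1):
                cand = list(bounds)
                cand[j] += step
                if cand[j - 1] < cand[j] < cand[j + 1] and score(cand) < score(bounds):
                    bounds, improved = cand, True
    return bounds


fwd = [1.0, 0.25, 0.25, 0.25, 0.25, 1.0]
bwd = [2.0, 0.5, 0.5, 0.5, 0.5, 2.0]
comm = [1.0, 0.25, 0.25, 0.25, 0.25, 1.0]
layer_time = [f + b + c for f, b, c in zip(fwd, bwd, comm)]
plan = partition_layers(layer_time, [1.0] * 6, num_stages=3, mem_limit=10.0)
```

In this example the two heavy end layers each get their own stage and the four light layers share the middle stage, so all three stages take the same time. When the memory cap is tight, the second pass accepts a slower but feasible split, which mirrors the "minimize under memory constraints" formulation.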

[0058] Effects of communication-aware scheduling and hierarchical synchronization: As shown in Figure 10 and Figure 11, this invention compares the impact of different communication implementations and synchronization mechanisms on training performance. For inter-stage communication, this invention replaces the original CPU-dependent communication path with a GPU-side point-to-point communication path and organizes inter-stage forward activation transmission and backward gradient transmission with grouped asynchronous receive / send operations, thereby reducing the additional overhead caused by CPU relay and distributed communication calls and improving the overlap between communication and computation. For intra-stage data-parallel synchronization, this invention further compares ordinary all-reduce with hierarchical all-reduce. The experimental results show that hierarchical reduction, through intra-node reduction, cross-node reduction by the node master processes, and intra-node broadcast, confines most gradient synchronization traffic to within the node and exposes only the necessary reduction paths to the cross-node network, thereby reducing cross-node synchronization overhead, shortening synchronization tail latency, and improving overall execution efficiency in multi-node data-parallel scaling scenarios.

[0059] Effects of length-aware dynamic padding and adaptive recomputation: As shown in Figure 12, this invention further analyzes the memory-time trade-offs of different recomputation strategies under dynamic padding. Compared with fixed padding, length-aware dynamic padding avoids forcing large numbers of short input batches to be padded to the global maximum length, thereby reducing wasted computation and peak memory pressure. On this basis, length-adaptive partial recomputation dynamically adjusts the recomputation intensity according to the running bucket length, freeing more memory on long-bucket samples and avoiding unnecessary recomputation on short-bucket samples. The experimental results show that, compared with the no-recomputation strategy, length-adaptive partial recomputation reduces peak memory usage and improves trainability when runtime pressure increases; compared with the full recomputation strategy, it avoids uniformly imposing a large recomputation overhead on all training steps, thus achieving a better balance between training efficiency and memory savings.
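The length-aware bucketing and the length-adaptive recomputation count can be sketched together. This is a minimal illustration under stated assumptions: all function names are hypothetical, samples are assumed to be right-padded to the maximum sequence length, and the bucket set, thresholds, and ratios in the example are made-up values, not settings from this invention.

```python
import math
from typing import List, Sequence, Tuple


def running_bucket_length(attention_masks: Sequence[Sequence[int]],
                          buckets: Sequence[int]) -> int:
    """Smallest preset bucket that covers the longest valid sample."""
    valid_lengths = [sum(mask) for mask in attention_masks]
    upper = max(valid_lengths)
    candidates = [b for b in sorted(buckets) if b >= upper]
    if not candidates:
        raise ValueError("no bucket covers the current batch")
    return candidates[0]


def trim_to_bucket(batch: Sequence[Sequence[int]], bucket_len: int) -> List[List[int]]:
    """Keep only the first bucket_len positions (assumes right padding)."""
    return [list(row[:bucket_len]) for row in batch]


def recompute_layers(bucket_len: int, local_layers: int,
                     thresholds: Tuple[int, int],
                     ratios: Tuple[float, float, float]) -> int:
    """Number of layers to recompute in one stage for this bucket length."""
    t1, t2 = thresholds
    r_short, r_mid, r_long = ratios
    if bucket_len <= t1:
        ratio = r_short
    elif bucket_len <= t2:
        ratio = r_mid
    else:
        ratio = r_long
    k = math.floor(ratio * local_layers)
    # Clip to the valid range for a stage with local_layers layers.
    return max(0, min(k, local_layers))


masks = [[1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 0, 0, 0]]
tokens = [[11, 12, 13, 0, 0, 0, 0, 0], [21, 22, 23, 24, 25, 0, 0, 0]]
bucket = running_bucket_length(masks, buckets=(2, 4, 6, 8))
padded = trim_to_bucket(tokens, bucket)
k = recompute_layers(bucket, local_layers=4, thresholds=(4, 6), ratios=(0.0, 0.5, 1.0))
```

In this example the longest valid sample has 5 tokens, so the batch runs at bucket length 6 instead of 8, and a stage with 4 layers recomputes 2 of them under the assumed medium ratio.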

[0060] In summary, the experimental results in Figures 4-12 show that this invention integrates input length distribution, stage partitioning, memory control, communication organization, and bidirectional pipeline stabilization into a unified optimization process without changing the model structure, loss definition, or basic training semantics. Through three-factor stage partitioning, length-aware dynamic padding, length-adaptive partial recomputation, communication-aware scheduling, hierarchical all-reduce, and first-round phase offset and waveform alignment, this invention reduces stage imbalance, peak memory pressure, communication exposure, and synchronization tail latency in pipelined parallel training, thereby improving the overall efficiency of distributed training of biomedical models.

[0061] The above are merely preferred embodiments of the invention and are not intended to limit the invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the invention should be included within the protection scope of the invention.

Claims

1. A biomedical model training method based on length and communication awareness, characterized in that it includes the following steps:
S1: Obtain the original corpus, perform text cleaning on the original corpus, and concatenate it into blocks to generate a block-level biomedical corpus; then construct training samples according to the preset maximum sequence length to generate an initial training sample set;
S2: Based on the attention mask of the initial training sample set, calculate the number of valid tokens for each sample, take the maximum valid length of the current training batch, and map it to a preset length bucket set to obtain the running bucket length; perform uniform dynamic padding on the samples according to the running bucket length to generate a dynamically padded training set;
S3: Based on the running bucket length, calculate the stage target recomputation ratio, and determine the number of recomputed layers in combination with the number of layers in the pipeline stage; in each stage, select a contiguous block of layers to perform selective recomputation on the dynamically padded training set to obtain the adaptively recomputed activation tensor;
S4: Perform lightweight runtime profiling on the adaptively recomputed activation tensor and collect layer-level computation time, layer-level communication time, and activation memory usage data; jointly model the collected layer-level computation time, layer-level communication time, and activation memory usage data to minimize the stage latency difference under GPU memory constraints, thereby generating a balanced pipeline stage partitioning scheme;
S5: Based on the balanced pipeline stage partitioning scheme, perform communication-aware scheduling and gradient synchronization optimization to generate steady-state pipeline execution timing data;
S6: Jointly input the adaptively recomputed activation tensor, the balanced pipeline stage partitioning scheme, and the steady-state pipeline execution timing data to carry out pipeline parallel training.

2. The biomedical model training method based on length and communication awareness according to claim 1, characterized in that the specific steps for calculating, based on the attention mask of the initial training sample set, the number of valid tokens for each sample and taking the maximum valid length of the current training batch are as follows: Let the attention mask at position i of the n-th sample in a set of input samples be $a_{n,i} \in \{0, 1\}$. The valid token count $\ell_n$ is defined as: $\ell_n = \sum_{i=1}^{S_{\max}} a_{n,i}$; in the formula, $S_{\max}$ denotes the maximum sequence length allowed by the model. For the current input sample set, the runtime upper bound L of the effective length is: $L = \max_{1 \le n \le N} \ell_n$; in the formula, N denotes the total number of samples in the input sample set.

3. The biomedical model training method based on length and communication awareness according to claim 2, characterized in that the specific steps for mapping the maximum valid length of the sample set to the preset length bucket set to obtain the running bucket length are as follows: The preset length bucket set B is: $B = \{b_1, b_2, \dots, b_M\}$; in the formula, M denotes the number of elements in the length bucket set. Based on the upper bound L of the effective length of the current input sample set, the smallest bucket value not less than L is selected from the preset length bucket set B to obtain the running bucket length $L_b$: $L_b = \min\{\, b \in B \mid b \ge L \,\}$.

4. The biomedical model training method based on length and communication awareness according to claim 3, characterized in that the specific steps for calculating, based on the running bucket length, the stage target recomputation ratio and determining the number of recomputed layers in combination with the number of layers in the pipeline stage are as follows: Based on the running bucket length $L_b$, the length-adaptive recomputation ratio $\rho(L_b)$ is calculated through a piecewise function: $\rho(L_b) = \rho_s$ if $L_b \le \theta_1$; $\rho(L_b) = \rho_m$ if $\theta_1 < L_b \le \theta_2$; $\rho(L_b) = \rho_l$ if $L_b > \theta_2$; in the formula, $\theta_1$ denotes the first length threshold, $\theta_2$ denotes the second length threshold, $\rho_s$ denotes the recomputation ratio for short inputs, $\rho_m$ denotes the recomputation ratio for medium inputs, and $\rho_l$ denotes the recomputation ratio for long inputs. The obtained recomputation ratio is multiplied by the number of local layers m contained in the current pipeline stage and rounded down to obtain the target number of recomputed layers for this stage: $k = \lfloor \rho(L_b) \cdot m \rfloor$. Then, the number of recomputed layers k is clipped to the valid range: $k \leftarrow \min(\max(k, 0), m)$.

5. The biomedical model training method based on length and communication awareness according to claim 4, characterized in that the specific steps for selecting a contiguous block of layers within each stage to selectively recompute the dynamically padded training set and obtain the adaptively recomputed activation tensor are as follows: In each pipeline stage, a contiguous block of layers within the local layer set is selected as the recomputation region; discrete, scattered selection is not performed, and the block does not cross stages. Based on the selected contiguous block of layers, length-adaptive selective recomputation is performed on the dynamically padded training set to generate the adaptively recomputed activation tensor.

6. The biomedical model training method based on length and communication awareness according to claim 1, characterized in that the specific steps for jointly modeling the collected layer-level computation time, layer-level communication time, and activation memory usage data, minimizing the stage latency difference under GPU memory constraints, and generating a balanced pipeline stage partitioning scheme are as follows: For each pipeline stage p, the total execution time $T_p$ is obtained by summing the forward computation time, backward computation time, and layer-level communication time of the layers belonging to it: $T_p = \sum_{l \in \mathcal{S}_p} \left( t_l^{\mathrm{fwd}} + t_l^{\mathrm{bwd}} + t_l^{\mathrm{comm}} \right)$; in the formula, $\mathcal{S}_p$ denotes the set of model layers contained in the p-th stage, $t_l^{\mathrm{fwd}}$ denotes the forward computation time of the l-th layer, $t_l^{\mathrm{bwd}}$ denotes the backward computation time of the l-th layer, and $t_l^{\mathrm{comm}}$ denotes the layer-level communication time of the l-th layer. The activation sizes of the layers within stage p are accumulated, weighted by whether each activation is resident at time t, to obtain the memory usage $\mathrm{Mem}_p(t)$ of that stage at time t: $\mathrm{Mem}_p(t) = \sum_{l \in \mathcal{S}_p} A_l \cdot \delta_l(t)$; in the formula, $A_l$ denotes the size of the output activation of the l-th layer, and $\delta_l(t) \in \{0, 1\}$ indicates whether the activation of the l-th layer is still resident at time t. The peak memory usage of stage p is recorded as $\mathrm{Mem}_p^{\mathrm{peak}} = \max_t \mathrm{Mem}_p(t)$, and the memory constraint is set as: $\mathrm{Mem}_p^{\mathrm{peak}} \le \mathrm{Mem}_p^{\max}$; in the formula, $\mathrm{Mem}_p^{\max}$ denotes the peak memory limit of stage p. The optimization objective is: $\min T_{\max}$, with $T_{\max} = \max_p T_p$; in the formula, $T_{\max}$ denotes the maximum execution time across all stages. Then, the stages are split by accumulating layer weights from front to back, and small-scale fine-tuning exchanges of layers are performed between adjacent stages to generate the balanced pipeline stage partitioning scheme.

7. The biomedical model training method based on length and communication awareness according to claim 1, characterized in that the specific steps for performing, based on the balanced pipeline stage partitioning scheme, communication-aware scheduling and gradient synchronization optimization to generate steady-state pipeline execution timing data are as follows: The adaptively recomputed activation tensor and the gradient vector of the balanced pipeline stage partitioning scheme are divided into multiple sub-blocks, and the intra-node all-reduce, the inter-node all-reduce among master processes, and the intra-node broadcast are executed sequentially for each sub-block to complete the hierarchical gradient synchronization. A phase offset parameter is configured and its absolute value k is taken; in the first iteration, the uplink or downlink waveform of the bidirectional pipeline is delayed by k micro-batch transmission clock cycles, and phase offset and waveform alignment processing is performed to generate the steady-state pipeline execution timing data.