Frame-level automated analysis of medical videos

A frame-level encoder pre-trained with self-supervised learning is combined with a downstream attention-based encoder and an inference network. This addresses the difficulty of capturing long-term temporal dependencies and the high memory cost of existing medical video analysis, enabling efficient automatic classification and inference on endoscopic videos.

CN122228531A · Pending · Publication Date: 2026-06-16 · JANSSEN RES & DEV LLC

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
JANSSEN RES & DEV LLC
Filing Date
2024-11-15
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

Existing medical video analysis methods struggle to capture long-term temporal dependencies, and transformer-based approaches incur high memory costs on long medical videos, such as endoscopic videos that can exceed 60,000 frames.

Method used

A frame-level encoder pre-trained with self-supervised learning produces frame embeddings. A downstream network combining an attention-based encoder, such as a vision transformer (ViT) encoder, with an inference network processes the frame embedding sequence. During training, a video-level augmenter applies temporal augmentation to the frame embedding sequence.

Benefits of technology

It improves the accuracy and efficiency of medical video analysis, captures long-term temporal dependencies, reduces memory costs, and suits automatic classification and inference tasks on endoscopic videos.

✦ Generated by Eureka AI based on patent content.


Abstract

Embodiments of computerized analysis systems and methods of medical videos are disclosed. In some embodiments, the downstream attention-based network includes an attention-based encoder and an inference network. In some embodiments, the attention-based encoder includes a vision transformer (ViT) encoder that processes those frame embeddings to obtain attention-based vector representations, each corresponding to each frame, which can then be processed by an MLP or other neural network for an inference task such as classification, regression, and / or segmentation. In some embodiments, during training of the downstream attention-based network, a video-level augmenter is used to selectively apply temporal augmentation to the sequence of frame embeddings prior to being processed by the downstream attention-based network. Some embodiments are specifically trained and adapted for analysis of endoscopy videos. These and other aspects of the present disclosure are described in further detail herein.

Description

Cross-references to related applications

[0001] This application claims the benefit of U.S. Provisional Application 63/599,994, filed November 16, 2023, and U.S. Provisional Application 63/555,883, filed February 20, 2024. This application also shares some subject matter with International Application PCT/IB2024/057930, filed August 16, 2024. The contents of these applications are incorporated herein by reference.

Background Technology

[0002] This disclosure relates generally to computerized techniques for processing video of a region of interest of a patient.

Summary of the Invention

[0003] In some medical video analytics applications, it is helpful to automatically identify anatomical regions in video frames as a camera or other sensor traverses one or more regions of interest of a patient. Endoscopic analysis for assessing inflammatory bowel disease (IBD) is one such application. For example, in a typical Crohn's disease assessment, the Simple Endoscopic Score for Crohn's Disease (SES-CD) is used to assess severity, and this score is determined across five anatomical segments: rectum (RM), left colon / sigmoid colon (LC), transverse colon (TC), right colon (RC), and ileum (IL). Automated endoscopic analysis relies in part on the automatic classification of video frames across these five anatomical segments.

[0004] Several past methods have used fixed points and travel distances to map the current camera position to intestinal segments based on a template of colonic segmentation. However, these methods do not readily account for differences in segment length and intestinal elasticity among patients.

[0005] Some methods avoid these problems by attempting to classify frame segments directly based on image information within the frame using convolutional neural networks (CNNs) or long short-term memory (LSTM) networks. However, such methods have limitations in their ability to effectively capture long-term temporal dependencies.

[0006] Transformers show promise for capturing long-term temporal dynamics in other contexts. However, the memory costs associated with transformers make their use challenging in the context of long medical videos. For example, endoscopic videos can exceed 60,000 frames.

[0007] The embodiments of this disclosure address these challenges with a system and method that use a frame-level encoder, pre-trained using self-supervised learning (SSL) techniques, to obtain frame-level embeddings corresponding to frame sequences in medical videos. The frame embeddings are then processed by a downstream attention-based network. In some embodiments, the downstream attention-based network includes an attention-based encoder and an inference network, which may include a multilayer perceptron (MLP) or other neural network. In some embodiments, the attention-based encoder includes a vision transformer (ViT) encoder that processes the frame embeddings to obtain attention-based vector representations, one for each frame, which can then be processed by an MLP or other neural network for inference tasks such as classification, regression, and/or segmentation. In some embodiments, during the training of the downstream attention-based network, a video-level augmenter is used to selectively apply temporal augmentation to the frame embedding sequence before it is processed by the downstream attention-based network. These variations and other modifications of embodiments consistent with this disclosure are disclosed more fully below.

Brief Description of the Drawings

[0008] Figure 1 illustrates an example of a medical video processing system for training attention-based networks according to an embodiment of the present disclosure.

[0009] Figure 2 illustrates an example of a medical video processing system, according to an embodiment of the present disclosure, that uses a trained attention-based network to compute classification inferences for frames in a medical video.

[0010] Figure 3 illustrates an example of a method for processing medical videos to train an attention-based deep learning network to automatically compute inferences about frames in the medical videos, according to an embodiment of the present disclosure.

[0011] Figure 4 illustrates a video-level augmentation process for augmenting training according to an embodiment of this disclosure.

[0012] Figure 5 illustrates an example of a method for processing medical videos to compute inferences about frames of the medical videos using an attention-based deep learning network trained according to an embodiment of the present disclosure.

[0013] Figure 6 shows an example of a computer system, one or more of which can be used to implement one or more of the devices, systems and methods illustrated herein.

[0014] While embodiments of the present disclosure are described with reference to the accompanying drawings, the drawings are intended to be illustrative. Other embodiments are consistent with the spirit and scope of this disclosure.

Detailed Description

[0015] Various embodiments will now be described more fully below with reference to the accompanying drawings, which form part of this document and illustrate specific examples of practical embodiments by way of illustration. However, this specification may be embodied in many different forms and should not be construed as limited to the embodiments listed herein; rather, these embodiments are provided so that this specification will be comprehensive and complete and will fully convey the scope of this disclosure to those skilled in the art. Among other things, this specification may also be embodied as a method or apparatus. Therefore, any of the various embodiments described herein may take the form of a completely hardware implementation, a completely software implementation, or an implementation combining software and hardware aspects. Therefore, the following description should not be construed as limiting.

[0016] Figure 1 illustrates a medical video processing system 1000 according to an embodiment of this disclosure. This embodiment and other embodiments will be described with reference to endoscopic video. However, the basic principles of this disclosure are applicable to other medical image/video analysis applications.

[0017] This example illustrates a system 1000 for processing medical videos 10 (such as endoscopic videos) from a medical video training set 100. The system 1000 processes the training set 100, which includes the medical videos 10, to train a downstream attention-based deep learning network, including an attention-based encoder 140 and an inference network 150, on an inference task. In one example, the training set 100 includes over 21 million labeled frames from 1,335 endoscopic videos corresponding to 753 patients for training blocks 140 and 150, and over 61 million frames from 4,160 endoscopic videos from four clinical trials for training block 120.

[0018] Preprocessing block 110 preprocesses the video data corresponding to video 10 by generating and resizing frames and removing annotations. Block 110 processes the video data such that each frame has a uniform dimension D1. In this example, after preprocessing, each frame 11 in the sequence 101 corresponding to video 10 has a dimension D1 of 224 × 224 × 3 (if RGB color channels are used). Other dimensions may be used in different specific implementations. Preprocessing block 110 prepares the sequence 101 of frames 11 corresponding to medical video 10. Therefore, the complete sequence 101 of frames 11 corresponding to medical video 10 has a dimension of N × D1, where N is the total number of frames obtained from the video for processing.

[0019] A self-supervised learning (SSL) pre-trained encoder 120 processes each frame 11 to obtain a sequence 102 of frame embeddings 12. For each frame 11 processed by the SSL pre-trained encoder 120, there is one frame embedding 12, and the order of the frame embeddings 12 is the same as the order of the frames 11 in the frame sequence 101. Each frame embedding 12 has a dimension D2.

[0020] The value of D2 depends on the underlying SSL model used for SSL pre-training of encoder 120. In one example, the DINOv2 objective is used to pre-train a vision transformer (ViT) encoder (see the “DINOv2” method described by Oquab et al. in “DINOv2: Learning Robust Visual Features without Supervision” (2023)). However, other SSL pre-trained encoders and pre-training techniques may be used without departing from the spirit and scope of this disclosure. In another example, the DINOv1 objective can be used to pre-train an SSL pre-trained encoder, such as encoder 120 (see the “DINOv1” method described by Caron et al. in “Emerging Properties in Self-Supervised Vision Transformers” (2021)). In yet another example, a convolutional neural network (CNN) encoder, such as ResNet (a CNN with residual connections), is used with the “SimCLR” method (see Chen et al. in “A Simple Framework for Contrastive Learning of Visual Representations” (2020)). All three papers cited in this paragraph are incorporated herein by reference.

[0021] In one example, the SSL pre-trained encoder 120 has been pre-trained using publicly available clinical trial endoscopy videos. In one example, over 61 million frames from more than 4,000 endoscopy videos were used for SSL pre-training. In one example, the SSL pre-trained encoder is a ViT-B/16 encoder pre-trained using the DINOv2 method. In one example, a batch size of 256 was used for pre-training, with the Adam optimizer and a cosine-decay learning rate of ... In one example, 15 pre-training iterations were performed on four NVIDIA A10G GPUs. For further details, see Mobadersany, Pooya, et al., “Harnessing Temporal Information for Precise Frame-Level Predictions in Endoscopy Videos,” International Conference on Medical Image Computing and Computer-Assisted Intervention, Cham: Springer Nature Switzerland, 2024, which is incorporated herein by reference.

[0022] In one such DINOv2 example, the value of D2 is 768 (each frame embedding is a one-dimensional vector of 768 values). Therefore, the complete sequence 102 of frame embeddings 12 corresponding to medical video 10 has a dimension of N × D2, where N is the total number of frames obtained from the video for processing.
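The effect of replacing N × D1 frames with N × D2 embeddings can be illustrated with simple arithmetic. The sketch below compares the two sequence sizes for a 60,000-frame video. The storage types (one byte per pixel value, four bytes per embedding value) are assumptions for illustration and are not stated in the disclosure, and the helper name sequence_bytes is hypothetical.

```python
def sequence_bytes(num_frames, per_frame_dims, bytes_per_value):
    """Bytes needed to hold a sequence of num_frames items of shape per_frame_dims."""
    values_per_frame = 1
    for dim in per_frame_dims:
        values_per_frame *= dim
    return num_frames * values_per_frame * bytes_per_value


N = 60_000           # a long endoscopic video, per paragraph [0006]
D1 = (224, 224, 3)   # preprocessed RGB frame dimension
D2 = (768,)          # embedding dimension in the DINOv2 example

# Assumption: pixels stored as uint8 (1 byte), embeddings as float32 (4 bytes).
raw_bytes = sequence_bytes(N, D1, 1)
embedding_bytes = sequence_bytes(N, D2, 4)
ratio = raw_bytes / embedding_bytes

print(f"frames: {raw_bytes / 1e9:.2f} GB, embeddings: {embedding_bytes / 1e6:.2f} MB")
print(f"reduction factor: {ratio:.0f}x")
```

Under these assumptions the embedding sequence is roughly 49 times smaller than the preprocessed frame sequence, which is part of what makes attention over the whole video tractable.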

[0023] Video-level augmenter (VLA) 130 processes the frame embeddings 12 of frame embedding sequence 102. In one embodiment, VLA 130 randomly applies one or more video-level augmentations to sequence 102 of frame embeddings 12. In one embodiment, VLA 130 randomly determines whether to apply a first temporal modification operation to sequence 102 of embeddings 12, and then randomly determines whether to apply a second temporal modification operation to the resulting sequence. An example of such an operation is splitting, i.e., cropping the sequence, where a contiguous subset of frame embeddings in the sequence is retained and the remainder is discarded. Another example of such an operation is reversing the frame embedding sequence. In one example, the first temporal modification is splitting and the second temporal modification is reversing, as described in further detail in the context of Figure 4.

[0024] Continuing with the description of Figure 1, VLA 130 outputs an augmented sequence 103 of frame embeddings 13. Sequence 103 has a dimension N_A × D2, where N_A is the number of frame embeddings after the augmentation operations. In this example, if VLA 130 did not apply the splitting operation to sequence 102, then N_A = N. Otherwise, N_A is less than N. The frame embeddings are processed by an attention-based deep learning network, which in this example includes an attention-based encoder 140 and an inference network 150. The attention-based encoder 140 processes the sequence 103 of frame embeddings 13 and outputs a sequence 104 of attention vectors 14 with dimension N_A × D2. Each attention vector 14 corresponds to a frame embedding 13 that has been processed by the attention-based encoder 140.

[0025] In one example, the attention-based encoder 140 is a ViT encoder based on the model described by Dosovitskiy et al. in “An Image is Worth 16x16 Words”, 2021, which is incorporated herein by reference. In Dosovitskiy's paper, an additional classification token is added and is used to provide a summary output token for classifying the image corresponding to the set of input tokens (in Dosovitskiy's paper, each input token corresponds to a patch of the image to be analyzed). However, in the illustrated implementation, frame-by-frame classification is desired. The additional classification token used in Dosovitskiy's paper is therefore unnecessary, and the inference network directly uses the output token corresponding to each input frame embedding. An example of encoder 140 uses a ViT encoder with four layers and eight self-attention heads in each layer.
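The point that every input frame embedding yields its own output token, with no extra classification token, can be seen in a toy self-attention pass. The sketch below is a deliberately reduced illustration, not encoder 140: it uses a single head, a single layer, and identity query/key/value projections (all assumptions), whereas the example encoder has four layers, eight heads per layer, and learned weights.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: identity Q/K/V projections, so each token attends to all
    tokens using its raw values. Returns one output vector per input token.
    """
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * tok[j] for w, tok in zip(weights, tokens)) for j in range(dim)]
        )
    return outputs


# Toy sequence: N_A = 3 frame embeddings with a toy D2 of 2.
frame_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attention_vectors = self_attention(frame_embeddings)

# The output keeps the N_A x D2 shape: one attention vector per frame.
print(len(attention_vectors), len(attention_vectors[0]))
```

Because the output sequence has the same length as the input sequence, a per-frame inference head can be applied to each output vector directly.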

[0026] When performing classification, each attention vector 14 of sequence 104 is processed by an inference network 150 comprising a multilayer perceptron (MLP) 150-1 and a softmax layer 150-2. When performing regression, the softmax layer 150-2 is not required. In one example, the dropout rate of the attention-based encoder is 0.25, the dropout rate of the classification network is 0.5, the batch size is 1, and the Adam optimizer is used (learning rate ... and weight decay ...). Network 150 outputs the set of computed inferences 15. In classification problems, the inferences 15 can take the form of the probabilities that a frame belongs to each class; in regression problems, they can be the regression values corresponding to each frame. The dimension of the inference set 15 is N_A × C, where C is the number of classes (or, in the case of regression inference, the number of regression outputs), depending on the application.
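A minimal sketch of the per-frame inference head follows, showing how N_A attention vectors become an N_A × C set of inferences. A single linear layer stands in for MLP 150-1, and the weights, bias, and toy vector size are made-up values for illustration only. Passing classify=False skips the softmax, as in the regression case.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(vector, weights, bias):
    """One dense layer: weights has one row per output."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weights, bias)]


def infer_frames(attention_vectors, weights, bias, classify=True):
    """Return one output row per frame: class probabilities or raw regression values."""
    outputs = []
    for vector in attention_vectors:
        logits = linear(vector, weights, bias)
        outputs.append(softmax(logits) if classify else logits)
    return outputs


SEGMENTS = ["RM", "LC", "TC", "RC", "IL"]            # C = 5 classes
toy_vectors = [[0.2, -0.1, 0.4], [1.0, 0.5, -0.3]]   # N_A = 2, toy D2 = 3
toy_weights = [[0.1 * (c + 1), -0.2 * c, 0.05 * c] for c in range(len(SEGMENTS))]
toy_bias = [0.0] * len(SEGMENTS)

probs = infer_frames(toy_vectors, toy_weights, toy_bias)
for row in probs:
    best = max(range(len(row)), key=row.__getitem__)
    print(SEGMENTS[best], [round(p, 3) for p in row])
```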

[0027] In one example, system 1000 is trained to infer colonic segment class probabilities corresponding to the following five intestinal segments: rectum (RM), left colon / sigmoid colon (LC), transverse colon (TC), right colon (RC), and ileum (IL). In this type of application, C equals 5, meaning that for each frame, inference network 150 outputs five class probabilities. In another example, system 1000 is trained to infer whether a frame is taken from the forward (insertion) path or the backward (withdrawal) path of the procedure. In this type of application, C equals 2, meaning that for each frame, inference network 150 outputs two class probabilities. In another example, system 1000 is trained to infer whether a frame is taken from the left or right side of the colon. In this type of application, C also equals 2. In yet another example, system 1000 is trained to infer, for each frame from the left side of the colon, which of three colonic segments the frame is taken from: rectum (RM), sigmoid colon (SC), or descending colon (DC). In this 3-class example, C is 3.

[0028] In another example, multiple models are combined in an end-to-end manner as follows: First, a first trained model selects frames corresponding to the withdrawal path. Then, another model selects frames that correspond to the left side of the colon. Then, one or more inference models analyze the selected frames (which correspond to the withdrawal path and the left side of the colon) to perform one or more inference tasks, such as frame-by-frame segmental classification and / or regression, thereby outputting, for example, a disease severity score or other possible analytical results relevant to the colon being studied.
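The cascade described above can be sketched as a chain of sequence-level models, each operating on the frames retained by the previous one. In the sketch below, the three models are hypothetical stand-ins passed as plain callables that return one output per input frame; the toy frames carry their ground truth so the stubs can read it back.

```python
def cascade(frames, withdrawal_model, left_colon_model, segment_model):
    """Run two frame-selection models followed by an inference model.

    Each model takes a list of frames and returns one output per frame.
    Returns a dict mapping original frame index to the final inference.
    """
    withdrawal_flags = withdrawal_model(frames)
    kept = [i for i, flag in enumerate(withdrawal_flags) if flag]

    left_flags = left_colon_model([frames[i] for i in kept])
    kept = [i for i, flag in zip(kept, left_flags) if flag]

    inferences = segment_model([frames[i] for i in kept])
    return dict(zip(kept, inferences))


# Toy frames: (path, side, segment). Real inputs would be frame embeddings.
toy_frames = [
    ("insertion", "left", "RM"),
    ("withdrawal", "right", "TC"),
    ("withdrawal", "left", "DC"),
    ("withdrawal", "left", "SC"),
]

result = cascade(
    toy_frames,
    withdrawal_model=lambda seq: [f[0] == "withdrawal" for f in seq],
    left_colon_model=lambda seq: [f[1] == "left" for f in seq],
    segment_model=lambda seq: [f[2] for f in seq],
)
print(result)
```

Keeping the original frame indices through each selection step lets the final per-frame outputs be mapped back to positions in the source video.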

[0029] The learning module 160 performs typical learning processing: it compares the inferences 15 with the training data labels to calculate the loss (error) according to the selected loss function, and then backpropagates the loss to tune the learnable parameters in the MLP 150-1 and the attention-based encoder 140.

[0030] While specific types of classification inference have been described above, in some implementations, the computed "inference" can be a prediction, estimate, rating, suggestion, classification, or other inference. In some implementations, instead of using a softmax layer to provide class probabilities, one or more outputs of an MLP can be used to provide a regression output in the form of a predicted rating within a range of values.

[0031] Figure 2 illustrates a medical video processing system 2000, according to an embodiment of the present disclosure, for computing classification inferences for a medical video 20 using a trained attention-based network including an attention-based encoder 140 and an inference network 150. In the illustrated example, the trained attention-based encoder 140 is a trained version of the attention-based encoder 140 of system 1000 illustrated in Figure 1, and the trained inference network 150 is a trained version of the inference network 150 of system 1000 illustrated in Figure 1.

[0032] Inference system 2000 is similar to system 1000, except that it does not include the VLA block 130 and learning block 160 illustrated in Figure 1. In the example illustrated in Figure 2, medical video 20 (such as an endoscopic video) is processed by preprocessing block 210 to generate frames and resize them to a common size with dimension D1, for example, 224 × 224 × 3. Preprocessing block 210 outputs a frame sequence 201 of frames 21 corresponding to medical video 20. The same SSL pre-trained encoder 120 used by system 1000 during training (as illustrated in Figure 1) is used by system 2000 during inference (as illustrated in Figure 2). In system 2000, the SSL pre-trained encoder 120 outputs a sequence 202 of frame embeddings 22. Each frame embedding has a dimension D2 (e.g., 768 for a DINOv2 SSL pre-trained encoder). The trained attention-based encoder 140 generates a sequence 204 of attention vectors 24, which is then processed by the inference network 150 to compute a classification inference 25 for each frame. As described above in the context of Figure 1, the number of inferences 25 output for each frame depends on the number of classes associated with a particular application, or on the number of outputs associated with a particular regression application, such as a disease severity score for each frame.

[0033] Figure 3 illustrates a method 3000, according to an embodiment of the present disclosure, for processing medical videos to train an attention-based deep learning network (such as the attention-based deep learning network with attention-based encoder 140 and inference network 150 shown in Figure 1) to automatically compute classification inferences for frames in medical videos.

[0034] Step 301 preprocesses the medical videos in the training set to generate frames, resizes them to a uniform size, and masks any annotations. Step 302 encodes each frame of the medical videos using an SSL pre-trained encoder to obtain a frame embedding sequence corresponding to each medical video. Step 303 selectively applies one or more temporal augmentations to the frame embedding sequence. In the illustrated implementation, the one or more temporal augmentations include splitting (i.e., cropping the video to a selected subsequence of the frame embedding sequence) and reversal. Step 303 outputs the augmented sequence of frame embeddings. In implementations where the application of each operation is randomly determined, the output for some sequences may be the same as the sequence before step 303.

[0035] Step 304 processes the augmented sequence of frame embeddings using an attention-based network to compute an attention-based vector for each frame embedding of the augmented sequence. Step 305 processes the attention-based vectors to compute classification or regression inference on a frame-by-frame basis. As previously discussed, the number of outputs depends on the application.

[0036] Step 306 uses the frame labels and the computed inference (e.g., class probability) to calculate a loss value, which quantifies the error in the inference using a chosen loss function. In one example, the cross-entropy loss function is used. However, other loss functions can be used. Step 307 determines whether the error is now minimized. If yes, training method 3000 ends. If not, step 308 adjusts the learnable parameters of the classification and attention-based deep learning networks to further reduce the error, and the method returns to step 302.
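As an illustration of step 306, the sketch below computes a frame-averaged cross-entropy loss from per-frame class probabilities and integer frame labels. Averaging over frames and clamping probabilities with a small epsilon are assumptions made for this sketch; the disclosure only names cross-entropy as one possible loss function.

```python
import math


def cross_entropy(probabilities, labels, eps=1e-12):
    """Mean negative log-probability assigned to the true class of each frame.

    probabilities: one list of C class probabilities per frame.
    labels: one integer class index per frame.
    """
    total = 0.0
    for frame_probs, label in zip(probabilities, labels):
        total += -math.log(max(frame_probs[label], eps))
    return total / len(labels)


# Two frames, C = 5 segment classes; true classes are 0 and 1.
toy_probs = [
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.20, 0.60, 0.10, 0.05, 0.05],
]
toy_labels = [0, 1]

loss = cross_entropy(toy_probs, toy_labels)
print(round(loss, 4))
```

The loss falls toward zero as the probability assigned to each frame's true class approaches one, which is the quantity steps 307 and 308 drive down.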

[0037] Figure 4 illustrates method 4000. In one implementation, the method can be implemented by VLA 130 of Figure 1 to selectively perform one or more temporal augmentation operations. As shown, step 401 determines, via random selection, whether to perform a splitting (temporal cropping or temporal pruning) operation on the frame embedding sequence F_i. Step 402 determines whether sequence F_i has been selected. If the result of step 402 is no, then step 403 sets the augmented sequence F_aug equal to the initial sequence F_i. If the result of step 402 is yes, then step 404 initiates the splitting operation by, for example, randomly selecting a starting frame number r_s (an integer) between 0 and N/2 - L/2, where N is the number of frames in sequence F_i and L is an integer greater than 0 and less than or equal to N, and randomly selecting an end frame number r_e between N/2 + L/2 and N. Step 405 then sets F_aug equal to the frame sequence from frame r_s to frame r_e. Therefore, after the splitting operation, F_aug is a subset of F_i (which may be F_i itself, or a smaller subset of F_i spanning at least L frames). In one example, L is set to 2, but other numbers can also be used.

[0038] Step 406 randomly determines whether to select F_aug for sequence reversal. Step 407 determines whether F_aug was selected for reversal in step 406. If the result of step 407 is no, then frame sequence F_aug is not reversed. If the result of step 407 is yes, then step 408 reverses the sequence F_aug. Then, step 409 outputs F_aug. Those skilled in the art will understand that when the sequence is reversed, the label data may also need to be modified so that the labels match the downstream inferences for supervised (or weakly supervised) learning purposes. Such modification of the label data may also be necessary for other temporal augmentations (such as temporal pruning (splitting)).

[0039] In one example, method 4000 of Figure 4 can be more formally represented as the following algorithm.

[0040] Algorithm 1

Those skilled in the art will understand that the example of Figure 4 illustrates one instance of introducing temporal augmentation into a training video sequence. In the example above, a given video sequence processed by method 4000 can be split (cropped), reversed, split and reversed, or left unchanged based on randomization operations within the method. In alternative implementations, these and/or other operations can be performed based on random or predefined selection techniques. In a general example of a temporal cropping operation, the function "min" is defined such that at least min(A, N) frames are available in the video after temporal cropping, where A is defined based on the desired severity of augmentation, and N is the number of available frames in the video. In one example, a lower value of A indicates more severe augmentation, i.e., more aggressive temporal cropping.
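Method 4000 as described in paragraphs [0037] and [0038] can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not a reproduction of Algorithm 1: the selection probabilities p_split and p_reverse, the half-open slice from r_s to r_e, and the choice to transform labels in lockstep with embeddings are all assumptions, and the function and parameter names are hypothetical.

```python
import random


def video_level_augment(embeddings, labels, min_len=2,
                        p_split=0.5, p_reverse=0.5, rng=random):
    """Randomly split (temporally crop) and/or reverse a frame embedding sequence.

    embeddings, labels: equal-length per-frame sequences (F_i and its labels).
    min_len: the parameter L; the crop always keeps at least L frames.
    Returns the augmented sequence F_aug and its matching labels.
    """
    n = len(embeddings)
    aug_embeddings, aug_labels = list(embeddings), list(labels)

    # Steps 401-405: optional splitting around the middle of the video.
    if n >= min_len and rng.random() < p_split:
        r_s = rng.randint(0, (n - min_len) // 2)        # start in [0, N/2 - L/2]
        r_e = rng.randint((n + min_len + 1) // 2, n)    # end in [N/2 + L/2, N]
        aug_embeddings = aug_embeddings[r_s:r_e]
        aug_labels = aug_labels[r_s:r_e]

    # Steps 406-408: optional reversal, applied to labels as well.
    if rng.random() < p_reverse:
        aug_embeddings.reverse()
        aug_labels.reverse()

    return aug_embeddings, aug_labels               # step 409


# Toy usage: frame indices stand in for embeddings and labels.
frames = list(range(20))
augmented, augmented_labels = video_level_augment(
    frames, frames, rng=random.Random(0)
)
print(len(augmented), augmented[:3])
```

Because the start is drawn from the first half and the end from the second half, the retained window always straddles the middle of the video and never drops below L frames.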

[0041] Figure 5 illustrates a method 5000, according to an embodiment of the present disclosure, for processing medical videos using a trained attention-based deep learning network (such as the attention-based deep learning network with attention-based encoder 140 and inference network 150 shown in Figure 2) to automatically compute classification inferences for frames in the medical videos.

[0042] Step 501 preprocesses the medical video to generate frames and resizes them to a uniform size. Step 502 encodes each frame of the medical video using an SSL pre-trained encoder to obtain a frame embedding sequence corresponding to the medical video.

[0043] Step 503 processes the frame embedding sequence corresponding to the medical video using an attention-based network to compute an attention-based vector for each frame embedding in the sequence. Step 504 processes the attention-based vectors to compute classification inference on a frame-by-frame basis. As previously discussed, the number of categories depends on the application.
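Steps 501 to 504 can be sketched as a short pipeline that makes explicit which stages act on single frames and which act on the whole sequence. The four stage functions below are hypothetical stand-ins passed as callables; the toy stubs exist only so the sketch runs and do not represent trained models.

```python
def run_inference(video_frames, preprocess, encode_frame, attend, infer):
    """Compute one inference per frame of a video.

    preprocess, encode_frame, infer: applied independently to each frame.
    attend: applied once to the whole embedding sequence, so each output
    vector can depend on every frame in the video.
    """
    frames = [preprocess(frame) for frame in video_frames]    # step 501
    embeddings = [encode_frame(frame) for frame in frames]    # step 502
    attention_vectors = attend(embeddings)                    # step 503
    return [infer(vector) for vector in attention_vectors]    # step 504


def toy_attend(sequence):
    """Toy sequence-level stage: center the first component on the sequence mean."""
    mean = sum(vec[0] for vec in sequence) / len(sequence)
    return [[vec[0] - mean, vec[1]] for vec in sequence]


inferences = run_inference(
    video_frames=[1, 2, 3],
    preprocess=float,
    encode_frame=lambda f: [f, 2 * f],
    attend=toy_attend,
    infer=lambda vec: "high" if vec[0] > 0 else "low",
)
print(inferences)
```

Unlike the training method, this path has no augmentation or learning step, and the per-frame encoder can be run frame by frame before the sequence-level stage sees the full embedding sequence.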

Examples of training data and selected results

[0044] In some implementations, data from four different clinical trials involving 1,847 patients and 4,160 videos were used for training, validation, and testing. In a specific example, two datasets included patients with Crohn's disease (CD) (ClinicalTrials.gov ID: NCT03464136, NCT02877134), while the other two datasets included patients with ulcerative colitis (UC) (ClinicalTrials.gov ID: NCT02407236, NCT01959282). The UC datasets lack anatomical segment labels because UC is scored using a global disease severity score that does not involve the Simple Endoscopic Score for Crohn's Disease (SES-CD) criteria. Therefore, the UC datasets were used only for SSL pre-training of the initial encoder and not for training the downstream attention-based network. Text annotations were extracted automatically using an OCR-based algorithm, followed by manual review and refinement. This process was applied to all frames from the endoscopic videos of the CD clinical trials. Approximately 50% of these frames lacked text annotations for anatomical segments, primarily because they were captured during the forward path. These frames were categorized as "unknown" and excluded from the downstream supervised learning task. The remaining frames were mapped to normalized anatomical segment labels: IL, RC, TC, LC, and RM. In one example, over 21 million labeled frames from 1,335 videos from 753 patients were used for training the downstream attention-based network and for validation and testing, using a split ratio of 70:10:20. The 20% of CD data allocated for testing was excluded from the initial encoder pre-training.
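A 70:10:20 split of this kind can be sketched as follows. The disclosure does not state the unit at which the split is made; splitting by patient, so that no patient's videos appear in more than one partition, is an assumption of this sketch, as are the function name, the fixed seed, and the rounding rule.

```python
import random


def split_patients(patient_ids, ratios=(0.7, 0.1, 0.2), seed=0):
    """Shuffle unique patient IDs and partition them into train/val/test lists.

    Assumption: the split is made per patient. The test partition takes
    whatever remains after rounding the train and validation counts.
    """
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)
    n_train = round(ratios[0] * len(ids))
    n_val = round(ratios[1] * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


# 753 patients, as in the example above; integer IDs are placeholders.
train_ids, val_ids, test_ids = split_patients(range(753))
print(len(train_ids), len(val_ids), len(test_ids))
```

Any patient assigned to the test partition here would also need to be withheld from SSL pre-training of the initial encoder to match the exclusion described above.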

[0045] The performance of the model consistent with the implementation of this disclosure was tested against pre-existing models and other models. The results are shown in Table 1 below. These tests measured five different models: a template-based model, a CNN-based model, an LSTM-based model, a baseline model, and finally, the model consistent with the implementation of this disclosure (referred to as "EndoFormer" for simplicity). All models were optimized on the same training set used to train the downstream attention-based network consistent with this disclosure.

[0046] The template-based model relies on Yao, H. et al., “Motion-based camera localization system in colonoscopy videos,” Medical Image Analysis 73, 102180 (2021), and employs Gunnar Farnebäck’s dense optical flow (OF) method (see Farnebäck, G., “Two-frame motion estimation based on polynomial expansion,” Image Analysis: 13th Scandinavian Conference, SCIA 2003 Halmstad, Sweden, June 29–July 2, 2003, Vol. 13, pp. 363–370, Springer (2003)). The CNN-based model relies on Azagra et al., “Endomapper dataset of complete calibrated endoscopy procedures,” Scientific Data 10(1), 671 (2023). The baseline model uses an SSL pre-trained ViT encoder as the initial encoder, consistent with EndoFormer, but replaces the downstream attention-based deep learning network with linear layers. The LSTM-based model is based on Jin, Y. et al., “Temporal memory relation network for workflow recognition from surgical video”, IEEE Transactions on Medical Imaging 40(7), 1911–1923 (2021). It also uses an SSL pre-trained ViT encoder as the initial encoder, consistent with EndoFormer, but replaces the downstream attention-based network with an LSTM network. Apart from EndoFormer, none of the models were trained using video-level augmentations (such as the video-level augmentations illustrated in the example of Figure 4). The models were evaluated using the area under the curve (AUC), F1 score, accuracy, and neighbor accuracy.

[0047] Table 1

Variations of the embodiments disclosed herein have also been tested. In one variation, a ViT encoder with a DINOv2 training objective is used to implement the SSL pre-trained encoder (such as encoder 120 in Figure 1 and Figure 2). In another variant, a ResNet encoder is used to implement the SSL pre-trained encoder (such as encoder 120 in Figure 1 and Figure 2), and the SimCLR objective is used to pre-train it. Tests showed that the two variants performed similarly in terms of accuracy (the latter with slightly lower accuracy). However, the ResNet / SimCLR variant required a longer training time. Furthermore, variants were tested with and without a video-level augmenter (such as VLA 130 of Figure 1) during training. For videos without perturbations during inference (i.e., no random splitting or random reversal), the two variants performed similarly. For videos with only random splitting during inference, the variant using VLA during training performed slightly better. For videos with random reversal during inference, the variant using VLA during training performed significantly better. For further details, see Mobadersany et al., “Harnessing Temporal Information for Precise Frame-Level Predictions in Endoscopy Videos,” International Conference on Medical Image Computing and Computer-Assisted Intervention, Cham: Springer Nature Switzerland, 2024, which is cited and incorporated by reference above.

[0048] Figure 6 shows an example of a computer system 6000, one or more of which can be used to implement one or more of the devices, systems, and methods illustrated herein. The computer system 6000 executes instruction code contained in a computer program product 660. The computer program product 660 includes executable code in an electronically readable medium that can instruct one or more computers (such as the computer system 6000) to perform processing that implements the exemplary method steps described herein.

[0049] An electronically readable medium can be any transitory or non-transitory medium that stores information electronically and can be accessed locally or remotely, for example, via a network connection. The medium may include multiple geographically dispersed media, each configured to store different portions of executable code at different locations and / or at different times. The executable instruction code in the electronically readable medium instructs the illustrated computer system 6000 to perform the various exemplary tasks described herein. The executable code used to instruct the performance of the tasks described herein will typically be implemented in software. However, those skilled in the art will understand that a computer or other electronic device may utilize code implemented in hardware to perform many or all of the identified tasks. Those skilled in the art will understand that many variations of the executable code implementing the exemplary methods can be found within the spirit and scope of this disclosure.

[0050] Code or copies of code contained in computer program product 660 may reside in one or more persistent storage media (not shown separately) communicatively linked to computer system 6000 for loading and storage into persistent storage device 670 and / or memory 610 for execution by processor 620. Computer system 6000 also includes I / O subsystem 630 and peripheral devices 640. I / O subsystem 630, peripheral devices 640, processor 620, memory 610, and persistent storage device 670 are linked via bus 650. Like persistent storage device 670 and any other persistent storage device that may contain computer program product 660, memory 610 is a non-transitory medium (even if implemented as a typical volatile computer memory device). Furthermore, those skilled in the art will understand that, in addition to storing computer program product 660 for performing the processes described herein, memory 610 and / or persistent storage device 670 may also be configured to store various data elements referenced and illustrated herein.

[0051] Those skilled in the art will understand that computer system 6000 is merely one example of a system that can implement a computer program product according to this disclosure. By way of example only, execution of instructions contained in the computer program product can be distributed across multiple computers, such as, for example, across computers in a distributed computing network.

[0052] Instructions for implementing artificial neural networks or other deep learning networks may reside in computer program product 660. When processor 620 is executing the instructions of computer program product 660, the instructions, or portions thereof, are typically loaded into working memory 610 from which processor 620 can easily access the instructions.

[0053] Processor 620 may include multiple processors, which may include corresponding additional working memory (the additional processors and memory are not separately illustrated). The multiple processors may include one or more graphics processing units (GPUs), the one or more GPUs including at least several thousand arithmetic logic units supporting massively parallel computing. GPUs are frequently used in deep learning applications because they can perform the relevant processing tasks more efficiently than typical general-purpose processors (CPUs). Processor 620 may additionally or alternatively include one or more dedicated processing units that include systolic arrays and / or other hardware arrangements supporting efficient parallel processing. Such dedicated hardware may work in conjunction with CPUs and / or GPUs to perform the various processing described herein. Such dedicated hardware may include application-specific integrated circuits (ASICs, which may refer to a portion of an ASIC), field-programmable gate arrays (FPGAs), or combinations thereof. However, a processor (such as processor 620) may be implemented as one or more general-purpose processors (preferably having multiple cores) without necessarily departing from the spirit and scope of this disclosure.

[0054] While the term “inference” may be used in various ways herein, those skilled in the art will understand that the systems and methods described herein are not limited thereto, and that the term “inference” herein may refer to the execution of any of various computations and / or the generation of various outputs, which may include, but are not limited to, any one or more of inference, rating, estimation, prediction, suggestion, recommendation, classification, categorization, annotation, conclusion, etc., or any combination of the foregoing.

[0055] While this disclosure has been described in detail with respect to the illustrated embodiments, it should be understood that various changes, modifications, and adaptations may be made based on this disclosure, and such changes, modifications, and adaptations are intended to be within the scope of this disclosure. Although this disclosure has been described in conjunction with embodiments currently considered to be most practical and preferred, it should be understood that this disclosure is not limited to the disclosed embodiments, but rather, is intended to cover various modifications and equivalent arrangements that fall within the scope of the basic principles of the invention as described in the various embodiments referenced above and below.

Claims

1. A method for analyzing one or more medical videos corresponding to examinations of one or more patients using one or more computers, the method comprising: preprocessing video data to obtain a plurality of video frames corresponding to a medical video of the one or more medical videos, the medical video corresponding to a patient examination of the one or more patient examinations; encoding the plurality of video frames using a pre-trained encoder to obtain respective frame embeddings corresponding to respective frames of the plurality of video frames, wherein the pre-trained video encoder has been pre-trained using self-supervised learning; processing the respective frame embeddings using an attention-based deep learning encoder to obtain respective attention vectors generated by the attention processing, the respective attention vectors corresponding to the respective frames of the plurality of video frames; and submitting the respective attention vectors to an inference network to obtain one or more computed inferences corresponding to the respective frames of the plurality of video frames.

2. The method according to claim 1, wherein the one or more computed inferences include at least one computed inference corresponding to each respective frame.

3. The method according to any one of claims 1 to 2, wherein the one or more computed inferences include classification inferences.

4. The method according to any one of claims 1 to 3, wherein the one or more computed inferences include regression inferences.

5. The method according to any one of claims 1 to 4, wherein the one or more medical videos include endoscopic videos.

6. The method according to claim 2, wherein the one or more medical videos include endoscopic videos, and the at least one computed inference for each frame includes at least one classification inference.

7. The method according to claim 6, wherein the at least one classification inference includes a classification among categories including the rectum, left colon / sigmoid colon, transverse colon, right colon, and ileum.

8. The method according to claim 6, wherein the at least one classification inference includes a classification among categories including the left side and the right side of the colon.

9. The method according to any one of claims 1 to 8, wherein the attention-based deep learning encoder and the inference network are trained using supervised learning.

10. The method according to any one of claims 1 to 8, wherein the attention-based deep learning encoder and the inference network are trained using weakly supervised learning.

11. The method according to any one of claims 1 to 10, wherein the pre-trained encoder includes a Vision Transformer (ViT) encoder.

12. The method according to any one of claims 1 to 10, wherein the pre-trained encoder includes a convolutional neural network.

13. The method according to claim 11, wherein the pre-trained encoder was trained using DINOv2.

14. The method according to claim 12, wherein the pre-trained encoder was trained using SimCLR.

15. The method according to any one of claims 1 to 14, wherein, during the training of the attention-based deep learning encoder and the inference network, one or more temporal augmentations are selectively applied to a frame embedding sequence corresponding to a training video.

16. The method according to claim 15, wherein the one or more temporal augmentations include temporal clipping and reversal.

17. The method according to any one of claims 15 to 16, wherein the one or more temporal augmentations are applied selectively based on a selection function, such that, for each sequence in a plurality of frame embedding sequences, none or one or more of the temporal augmentations are applied.

18. The method according to claim 17, wherein the selection function randomly selects whether to apply each of the one or more temporal augmentations to the frame embedding sequence currently being processed.

19. A computerized deep learning system for processing medical videos, comprising: a pre-trained encoder configured to encode a plurality of video frames corresponding to a medical video to obtain respective frame embeddings corresponding to respective frames of the plurality of video frames, wherein the pre-trained video encoder has been pre-trained using self-supervised learning; an attention-based deep learning encoder configured to process the respective frame embeddings to obtain respective attention vectors generated by the attention processing, the respective attention vectors corresponding to the respective frames of the plurality of video frames; and an inference network configured to process the respective attention vectors and generate one or more computed inferences corresponding to the respective frames of the plurality of video frames.

20. A computer program product stored in a non-transitory tangible medium and comprising instructions executable on one or more processors of one or more computers to perform processing for analyzing one or more medical videos, the processing including performing the method according to any one of claims 1 to 18.

21. A computerized deep learning system for processing medical videos, comprising a plurality of computerized deep learning systems according to claim 19, wherein: a first computerized deep learning system of the plurality of computerized deep learning systems is configured to generate a first computed inference on a frame-by-frame basis to identify a first set of frames from the medical video for further analysis by a second computerized deep learning system of the plurality of computerized deep learning systems; and the second computerized deep learning system is configured to generate one or more second computed inferences about the first set of frames.

22. The computerized system according to claim 21, further wherein: the one or more second computed inferences about the first set of frames are inferences computed on a frame-by-frame basis to identify a second set of frames that is a subset of the first set of frames, the second set of frames being identified for further analysis by a third computerized deep learning system of the plurality of computerized deep learning systems; and the third computerized deep learning system is configured to generate one or more third computed inferences about the second set of frames.

23. The computerized system according to claim 22, wherein: the medical video is an endoscopic video; the first computed inference concerns whether a frame comes from the forward path or the retraction path of the endoscopic video, and the first set of frames is inferred to come from the retraction path of the endoscopic video; the one or more second computed inferences concern whether a frame originates from the left or right colonic region, and the second set of frames is inferred to originate from the left colonic region; and the one or more third computed inferences include inferences about the severity of disease based on the second set of frames.
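By way of illustration only, the three-stage arrangement recited in claims 21 to 23 can be sketched in plain Python as below. The sketch assumes each stage is a callable that takes a list of frames and returns one prediction per frame; the function `cascade`, the label strings, and the three stub models are hypothetical placeholders standing in for trained systems, and they are not part of the disclosure or the claims.

```python
def _per_frame(model, frames):
    """Run a stage and check that it returns one prediction per frame."""
    preds = list(model(frames))
    if len(preds) != len(frames):
        raise ValueError("a stage must return exactly one prediction per frame")
    return preds


def cascade(frames, phase_model, side_model, severity_model):
    """Narrow the frame set stage by stage, tracking original frame indices."""
    indices = list(range(len(frames)))

    # Stage 1: keep frames inferred to come from the retraction path.
    phase = _per_frame(phase_model, frames)
    first_set = [i for i, p in zip(indices, phase) if p == "retraction"]

    # Stage 2: within the first set, keep frames inferred to be left colon.
    side = _per_frame(side_model, [frames[i] for i in first_set])
    second_set = [i for i, s in zip(first_set, side) if s == "left"]

    # Stage 3: severity inference on the second set only.
    severity = _per_frame(severity_model, [frames[i] for i in second_set])
    return first_set, second_set, dict(zip(second_set, severity))


# Toy stand-ins: each "frame" is just its index, and the stubs use fixed rules.
def phase_stub(fs):
    return ["forward" if f < 3 else "retraction" for f in fs]


def side_stub(fs):
    return ["right" if f < 5 else "left" for f in fs]


def severity_stub(fs):
    return [f % 3 for f in fs]


first_set, second_set, severities = cascade(
    list(range(8)), phase_stub, side_stub, severity_stub)
print(first_set, second_set, severities)
```

In this arrangement each later stage sees only the frames retained by the previous one, so the second set is always a subset of the first set.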