System and method for converting a landscape-format video into a portrait-format video

The video editing system effectively converts landscape to portrait videos by using neural networks and tracking algorithms to preserve contextual elements and maintain video quality, addressing the limitations of conventional methods.

WO2026133374A1 · PCT designated stage · Publication Date: 2026-06-25 · FANBUFF TECHNOLOGY INDIA PVT LTD

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: FANBUFF TECHNOLOGY INDIA PVT LTD
Filing Date: 2025-12-20
Publication Date: 2026-06-25

AI Technical Summary

Technical Problem

Conventional methods for converting landscape-format videos to portrait-format videos, particularly in sports broadcasting, suffer from labor-intensive manual editing, loss of contextual information, and suboptimal automated solutions that degrade video quality, especially in high-definition sports footage.

Method used

A video editing system utilizing a processing unit with neural networks and tracking algorithms to identify and track multiple objects across frames, dynamically determining primary and secondary objects, and generating portrait frames that maintain critical contextual elements.

Benefits of technology

Enables efficient, high-quality conversion of landscape to portrait videos by preserving contextual continuity and maintaining clarity, suitable for real-time processing of sports content.

✦ Generated by Eureka AI based on patent content.

Abstract

A video editing system for converting a landscape video into a portrait video includes a processing unit having one or more processors and a memory storing a set of instructions. The processors receive an input video in a landscape format and extract a plurality of frames. The processors identify a plurality of objects in each frame using a first neural network and track the objects using a tracking algorithm to generate a trajectory for each object. For each frame, the processors dynamically determine a primary object and at least one secondary object based on the trajectories of the objects, an activity score of the objects within a scene, and scene changes. The processors generate a portrait frame corresponding to each frame such that the primary object and the at least one secondary object are positioned within the portrait frame, and generate a portrait-format video based upon the plurality of portrait frames.

Description

SYSTEM AND METHOD FOR CONVERTING A LANDSCAPE-FORMAT VIDEO INTO A PORTRAIT-FORMAT VIDEO

FIELD OF INVENTION

[0001] The present disclosure relates to video processing. More particularly, the present disclosure relates to a system and a method for converting a landscape-format video into a portrait-format video.

BACKGROUND OF INVENTION

[0002] The increasing reliance on mobile devices for media consumption has fundamentally altered user viewing preferences. Mobile users predominantly hold and view content on vertically oriented screens, resulting in a strong industry shift toward portrait-format (vertical) videos, such as 9:16 or 3:4 aspect ratios. This transformation is further accelerated by social media platforms, including Instagram, TikTok, and YouTube Shorts, which are designed to display and promote vertically formatted clips. Even traditionally landscape-oriented content, such as live sports, is now frequently viewed on mobile devices in portrait orientation.

[0003] Despite this shift, most professional content, particularly sports broadcasts, is still captured in landscape format (e.g., 16:9, 5:4, 4:3). The landscape format is ideal for large screens such as television, where wider horizontal coverage enables viewers to observe the full field of play, player positioning, and tactical formations. For instance, football broadcasts rely on wide shots to reveal the entire pitch coverage, cricket broadcasts simultaneously show bowler, batter, wicketkeeper, and fielders, and tennis relies on full-court views to convey both players' positions. Converting such landscape sports footage into a portrait format without losing these critical contextual elements presents a significant technical challenge. Existing approaches are unable to preserve the completeness and interpretability of such content.

[0004] Conventional techniques rely heavily on manual editing using software tools such as Adobe Premiere Pro or Final Cut Pro. Editors typically crop and pan across the landscape frame to fit important subjects within the vertical frame. Maintaining contextual continuity often requires manual object tracking, such as following multiple players during a football attack, keeping both players in view during a tennis rally, or tracking the basketball as it transitions across the court. These tasks are labor-intensive, time-consuming, and impractical for long-duration games or real-time highlights. Moreover, they require skilled editors with a deep understanding of sport-specific dynamics to avoid loss of context, visibility, or narrative relevance.

[0005] Recent AI-based tools (e.g., Adobe Auto Reframe) automate aspects of subject tracking by focusing on a single dominant object and reframing a 16:9 shot into 9:16. However, such tools are fundamentally limited to single-object or single-region tracking. This leads to substantial information loss in sports scenarios where multiple focal points are simultaneously critical. For example, a football reframing tool may track only the ball-carrying player while omitting nearby defenders or attacking runs; in cricket, the tool may lock onto the batter while excluding the bowler's action or field placement; in basketball, it may follow the player in possession, thereby losing the defensive formation essential for understanding play development. Additionally, these tools lack adaptability across the diverse motion patterns, speed variations, and rule-based contextual needs of different sports, resulting in inconsistent or suboptimal outputs.

[0006] Many existing automated landscape-to-portrait conversion solutions rely primarily on simple cropping and zooming operations. While these operations achieve the desired aspect ratio, they often degrade output quality. Excessive zooming reduces resolution and causes blurring or pixelation, particularly problematic in high-definition sports footage where viewers expect clarity to track fast-moving elements like footballs, shuttlecocks, or tennis balls. Such degradation impairs viewer understanding and diminishes the overall viewing experience.

[0007] Therefore, there is a need for a system that overcomes the drawbacks associated with conventional systems.

SUMMARY OF INVENTION

[0008] The present invention relates to a video editing system and a method for converting a landscape-format video into a portrait-format video. The video editing system includes a processing unit having one or more processors and a memory coupled to the one or more processors and storing a set of instructions that, when executed by the one or more processors, cause the processing unit to receive an input video in a landscape format and extract a plurality of frames from the input video. Additionally, the processing unit is configured to identify a plurality of objects within each frame of the plurality of frames using a first neural network. Each frame is associated with a corresponding scene. Further, the processing unit is configured to track the plurality of objects across the plurality of frames to generate a trajectory for each of the plurality of objects using a tracking algorithm and dynamically determine, for each frame, a primary object and at least one secondary object from the plurality of objects based upon the trajectory of the plurality of objects, activity score of the plurality of objects within the scene associated with the frame, and a change in the scene, using a machine learning model. Furthermore, the processing unit is configured to generate a portrait frame corresponding to each frame of the plurality of frames based upon the primary object, such that the primary object and at least one secondary object are positioned within the portrait frame. Moreover, the processing unit is configured to generate a portrait video based upon the plurality of portrait frames.

[0009] The foregoing features and other features, as well as the advantages of the invention, will become more apparent from the following detailed description, which proceeds with reference to the accompanying figures.

BRIEF DESCRIPTION OF THE DRAWINGS

[0010] The summary above, as well as the following detailed description of illustrative embodiments, is better understood when read in conjunction with the apportioned drawings. For the purpose of illustrating the present disclosure, exemplary constructions of the disclosure are shown in the drawings. However, the disclosure is not limited to specific methods and instrumentalities disclosed herein. Moreover, those in the art will understand that the drawings are not to scale.

[0011] Fig. 1 depicts a block diagram of a video editing system 100 for converting a landscape-format video into a portrait-format video, according to an embodiment of the present disclosure.

[0012] Fig. 2 depicts a block diagram of a video generation module 600 of the video editing system 100, according to an embodiment of the present disclosure.

[0013] Fig. 3 illustrates one exemplary frame 402 depicting various objects identified by the object identification module 250 and corresponding bounding boxes 404, according to an embodiment of the present disclosure.

[0014] Fig. 4A depicts an exemplary region of interest (ROI) segment 501 determined by a portrait frame generation module 500 within a frame 503, according to an embodiment of the present disclosure.

[0015] Fig. 4B depicts an exemplary region of interest (ROI) segment 504 determined by the portrait frame generation module 500 within a frame 505, according to an embodiment of the present disclosure.

[0016] Fig. 5 depicts a portrait frame 502 generated from the ROI segment 501 by the portrait frame generation module 500, according to an embodiment of the present disclosure.

[0017] Fig. 6 depicts a flowchart of a method 800 for converting an aspect ratio of a video, according to an embodiment of the present disclosure.

DETAILED DESCRIPTION OF THE ACCOMPANYING DRAWINGS

[0018] Prior to describing the invention in detail, definitions of certain words or phrases used throughout this patent document will be defined: the terms "include" and "comprise", as well as derivatives thereof, mean inclusion without limitation; the term "or" is inclusive, meaning and / or; the phrases "coupled with" and "associated therewith", as well as derivatives thereof, may mean to include, be included within, interconnect with, contain, be contained within, connect to or with, couple to or with, be communicable with, cooperate with, interleave, juxtapose, be proximate to, be bound to or with, have a property of, or the like. Definitions of certain words and phrases are provided throughout this patent document, and those of ordinary skill in the art will understand that such definitions apply in many, if not most, instances to prior as well as future uses of such defined words and phrases.

[0019] Reference throughout this specification to "one embodiment," "an embodiment," or similar language means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. Thus, appearances of the phrases "in one embodiment," "in an embodiment," and similar language throughout this specification may, but do not necessarily, all refer to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise. The terms "including," "comprising," "having," and variations thereof mean "including but not limited to" unless expressly specified otherwise. An enumerated listing of items does not imply that any or all of the items are mutually exclusive and / or mutually inclusive, unless expressly specified otherwise. The terms "a," "an," and "the" also refer to "one or more" unless expressly specified otherwise.

[0020] Although the operations of exemplary embodiments of the disclosed method may be described in a particular, sequential order for convenient presentation, it should be understood that the disclosed embodiments can encompass an order of operations other than the particular, sequential order disclosed. For example, operations described sequentially may in some cases be rearranged or performed concurrently. Further, descriptions and disclosures provided in association with one particular embodiment are not limited to that embodiment, and may be applied to any embodiment disclosed herein. Moreover, for the sake of simplicity, the attached figures may not show the various ways in which the disclosed system, method, and apparatus can be used in combination with other systems, methods, and apparatuses.

[0021] The embodiments are described below with reference to block diagrams and / or data flow illustrations of methods, apparatus, systems, and computer program products. It should be understood that each block of the block diagrams and / or data flow illustrations, respectively, may be implemented in part by computer program instructions, e.g., as logical steps or operations executing on a processor in a computing system. These computer program instructions may be loaded onto a computer, such as a special purpose computer or other programmable data processing apparatus to produce a specifically-configured machine, such that the instructions which execute on the computer or other programmable data processing apparatus implement the functions specified in the data flow illustrations or blocks.

[0022] These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including computer-readable instructions for implementing the functionality specified in the flowchart block or blocks. The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process such that the instructions that execute on the computer or other programmable apparatus provide operations for implementing the functions specified in the data flow illustrations or blocks.

[0023] Accordingly, blocks of the block diagrams and data flow illustrations support various combinations for performing the specified functions, combinations of operations for performing the specified functions and program instructions for performing the specified functions. It should also be understood that each block of the block diagrams and flowchart illustrations, and combinations of blocks in the block diagrams and flowchart illustrations, can be implemented by special-purpose hardware-based computer systems that perform the specified functions or operations, or combinations of special-purpose hardware and computer instructions. Further, applications, software programs or computer-readable instructions may be referred to as components or modules. Applications may be hardwired or hardcoded in hardware or take the form of software executing on a general-purpose computer, such that when the software is loaded into and / or executed by the computer, the computer becomes an apparatus for practicing the disclosure, or they are available via a web service. Applications may also be downloaded in whole or in part through the use of a software development kit or a toolkit that enables the creation and implementation of the present disclosure. In this specification, these implementations, or any other form that the disclosure may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the disclosure.

[0024] Furthermore, the described features, advantages, and characteristics of the embodiments may be combined in any suitable manner. One skilled in the relevant art will recognize that the embodiments may be practiced without one or more of the specific features or advantages of a particular embodiment. In other instances, additional features and advantages may be recognized in certain embodiments that may not be present in all embodiments. These features and advantages of the embodiments will become more fully apparent from the following description and apportioned claims, or may be learned by the practice of embodiments as set forth hereinafter.

[0025] Fig. 1 illustrates a block diagram of a video editing system 100 (interchangeably referred to as system 100 hereinafter), according to an embodiment of the present disclosure. The system 100 is configured to convert a landscape-format video into a portrait-format video. In an embodiment, the landscape-format video has an aspect ratio of 16:9. The portrait-format video has an aspect ratio of 9:16. It should be appreciated, however, that these aspect ratios are merely illustrative, and the system 100 is capable of performing conversions between any suitable aspect ratios without departing from the scope of the present disclosure. The present disclosure has been explained in the context of converting the aspect ratio of a sports video. It should be understood, though, that the teachings of the present disclosure may be extended to videos in other application scenarios and are within the scope of the present disclosure. The system 100 may be employed for automatically reformatting videos for streaming platforms, social media applications, or mobile-first environments. Additional non-limiting examples include converting landscape videos to portrait videos for platforms like YouTube Shorts, TikTok, Instagram, etc., repurposing corporate training videos for mobile-based learning, reformatting video advertisements for mobile-centric marketing, and other similar applications where optimized portrait content is desirable.
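
By way of a non-limiting illustration of the aspect-ratio arithmetic referred to above, the following minimal Python sketch computes the largest 9:16 window that fits inside a landscape frame. The function name `portrait_crop_size` and the choice to round each dimension down to an even number of pixels are assumptions made for this example only and are not taken from the disclosure.

```python
from fractions import Fraction


def portrait_crop_size(src_w: int, src_h: int,
                       target_w: int = 9, target_h: int = 16) -> tuple[int, int]:
    """Return (width, height) of the largest target_w:target_h window
    that fits inside a src_w x src_h frame.

    Hypothetical helper for illustration; not part of the disclosed system.
    """
    target = Fraction(target_w, target_h)
    source = Fraction(src_w, src_h)
    if source >= target:
        # Source is at least as wide as the target ratio:
        # keep the full height and narrow the width.
        crop_h = src_h
        crop_w = int(src_h * target)
    else:
        # Source is narrower than the target ratio:
        # keep the full width and shorten the height.
        crop_w = src_w
        crop_h = int(src_w / target)
    # Assumption: round down to even dimensions, which is convenient
    # for commonly used chroma-subsampled pixel formats.
    crop_w -= crop_w % 2
    crop_h -= crop_h % 2
    return crop_w, crop_h


print(portrait_crop_size(1920, 1080))  # (606, 1080)
```

For a 1920 x 1080 input, the window is 606 x 1080 pixels, so roughly two thirds of the horizontal field of view lies outside any single full-height crop. This is why the position of the window within each frame matters for preserving context.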

[0026] The system 100 includes an input module 150, a frame extraction module 200, an object identification module 250, a scene change detection module 300, an object tracking module 350, an activity analysis module 400, a focus determination module 450, a portrait frame generation module 500, a database 550, a video generation module 600, and a processing unit 650 coupled to each other. The system 100 may further include one or more additional modules configured to perform auxiliary or complementary functions, such as metadata processing, resolution enhancement, noise reduction, or rendering management. It shall be understood that the inclusion or omission of such additional modules does not limit the scope of the present disclosure, and variations thereof are contemplated within the framework of the described system 100.

[0027] In an embodiment, the input module 150, the frame extraction module 200, the object identification module 250, the scene change detection module 300, the object tracking module 350, the activity analysis module 400, the focus determination module 450, the portrait frame generation module 500, the database 550, and the video generation module 600 are executed by the processing unit 650. The processing unit 650 includes one or more processors 655 operating individually or in combination. Examples of the one or more processors 655 include a microprocessor, a central processing unit (CPU), a personal computer, a graphical processing unit (GPU), a neural processing unit (NPU), a vision processing unit (VPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a cloud-based server, an edge computing device, or any other computing device capable of executing machine-readable instructions. In another embodiment, the processing unit 650 may employ hardware-optimized libraries (e.g., cuDNN, TensorFlow Lite, PyTorch Mobile, OpenCV, FFmpeg, ONNX Runtime) to accelerate AI / ML tasks, video decoding, and frame generation.

[0028] The one or more processors 655 may be configured in a multi-core or many-core architecture, enabling parallel execution of computationally intensive operations such as neural network inference, image processing, video decoding, object detection, and trajectory estimation. The processing unit 650 may further incorporate hardware accelerators optimized for deep-learning workloads, including tensor cores, matrix multiplication units, and specialized instruction sets for accelerating convolution, attention, or transformation operations.

[0029] In an embodiment, the processing unit 650 includes a memory 660 coupled to the one or more processors 655. The memory 660 may include volatile memory (e.g., random access memory (RAM)), non-volatile memory (e.g., solid-state drives (SSD), flash memory, a hard disk drive, magnetic or optical storage media), cache memory, or combinations thereof. The memory 660 stores a set of instructions that is executed by the one or more processors 655. The memory 660 may store model parameters, extracted frames, intermediate tensors, feature maps, scene metadata, and other data required for real-time or near-real-time execution of the modules described herein. The processing unit 650 may also access external storage or network-attached storage for handling large video files or high-volume intermediate datasets.

[0030] In an embodiment, the input module 150, the frame extraction module 200, the scene change detection module 300, the object tracking module 350, the activity analysis module 400, the focus determination module 450, the portrait frame generation module 500, and the video generation module 600 are implemented in the form of a machine-readable set of instructions stored in the memory 660. The memory 660 may include any non-transitory computer-readable storage medium configured to store machine-executable instructions and data used by the system 100. The memory 660 may be a hard disk drive (HDD), a solid-state drive (SSD), a flash memory device, a magnetic or optical storage medium, a distributed storage database, or any other persistent storage mechanism capable of storing machine-readable instructions. The set of instructions, when executed by the one or more processors 655, causes the processing unit 650 to perform various functions of the modules as described herein. The memory 660 may further include volatile memory. In one example implementation, the system 100 may be hosted on one or more servers and provided as a network-accessible or web-based service. The server may be a local or remote server, or a cloud-based server infrastructure capable of handling user requests, processing uploaded content, and executing the set of instructions stored in the memory 660. The system 100 may further support multi-tenant operation, load balancing, and dynamic resource allocation depending on user demand.

[0031] In another example implementation, the system 100 may be deployed as a stand-alone application installable on a user device, such as a desktop computer, a laptop, a tablet, or a mobile device. In such implementations, the memory 660 may correspond to the internal storage medium of the user device, and the set of instructions may be executed locally without requiring a persistent network connection. In yet another implementation, the processing unit 650 may be deployed in the form of a hybrid architecture in which one or more modules of the system 100 are partially executed locally on the user device, while computationally intensive processing tasks and large-scale data storage operations are offloaded to a remote server or cloud-based environment. Such a hybrid configuration may facilitate optimized system 100 performance by reducing processing load on the local device, minimizing latency associated with complex operations, and enabling dynamic scalability of computing resources.

[0032] The input module 150 is configured to receive an input video. The input video is in the landscape format and has the first aspect ratio. In an embodiment, the input video corresponds to either a live sports stream or a pre-recorded sports video. The input module 150 may receive the input video from a user interface via a communication network or from one or more external devices, including a recording device, a streaming server, or a server associated with a broadcasting or social media platform. The input video may correspond to any form of video content, including but not limited to a live stream, a pre-recorded video file, or a combination thereof. In an embodiment, the input module 150 receives the input video directly from a user. For example, the user may upload the input video to the system 100 via a suitable user interface, such as a graphical user interface provided on the user device or a web-based interface associated with the system 100, which may be accessed over a communication network such as the Internet. In another embodiment, the input module 150 receives the input video from one or more recording devices (e.g., a video camera) capturing a video of a sport or from an external server (e.g., a recording server or a streaming server of a broadcasting entity, or a server associated with a social media entity). The input module 150 may send the input video to the frame extraction module 200 for further processing. Additionally, or alternatively, the input module 150 may store the input video, either temporarily or persistently, in the database 550, thereby enabling subsequent retrieval and processing by other modules of the system 100.

[0033] The frame extraction module 200 is configured to obtain the input video from the input module 150 and / or from the memory 660. The frame extraction module 200 is configured to extract a plurality of frames from the input video. Each frame may include a plurality of objects. Each frame includes an image of a shot of the input video and has an associated time stamp. The frame extraction module 200 extracts the plurality of frames in a lossless manner so that the raw, uncompressed quality of each frame is maintained. This helps in preserving fidelity and preventing data degradation during further processing of the input video. The frame extraction module 200 may utilize any suitable or known frame extraction technique or tool, including but not limited to FFmpeg, MEncoder, OpenCV, or a deep-learning toolchain such as PyTorch in combination with CUDA-accelerated libraries, depending on the requirements of the application. For example, in the application scenario where real-time processing of a live stream is required, the frame extraction module 200 may employ a high-speed extraction technique capable of rapidly decoding and extracting frames at the incoming stream's frame rate.

[0034] In an example implementation, the frame extraction module 200 uses FFmpeg to extract the plurality of frames. Additionally, the frame extraction module 200 may employ GPU-accelerated processing to accelerate the frame extraction process, thereby reducing latency and ensuring that the plurality of frames are extracted at a desired frame rate to support real-time analysis and downstream decision making. Further, the frame extraction module 200 may select or adjust various extraction parameters, such as codec type, video format, bit depth, chroma sampling structure, and resolution, to maximize image quality while balancing computational speed, memory utilization, and extraction time. The extraction parameters may be dynamically adjusted based on factors such as the required output quality of the frames, the file size of the input video, the encoding characteristics of the input video, and the time needed for frame extraction. Further, the frame extraction module 200 may be configured to convert the frames into a format that facilitates rapid accessibility, low-latency retrieval, and efficient downstream processing. Such formats may include, without limitation, JPEG, TIFF, PNG, or any other format that supports rapid retrieval while maintaining frame integrity. Preferably, the JPEG format may be used due to its efficient compression and relatively small file size. The TIFF format may be used when high-quality video is needed, as the TIFF format retains more image detail without significant compression loss. The PNG format may be used when lossless compression is required. This frame extraction process establishes a high-quality, frame-by-frame foundation for the system 100, enabling accurate downstream processing throughout the system's analytical pipeline.
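
By way of a non-limiting illustration, the following Python sketch shows one way an FFmpeg command line for frame extraction might be assembled and how a time stamp might be associated with each frame. The helper names `build_extract_cmd` and `frame_timestamp`, the output file naming pattern, the JPEG quality setting, and the assumption of a constant frame rate are all hypothetical choices made for this example; the disclosure does not specify them.

```python
from pathlib import Path
from typing import Optional


def build_extract_cmd(video: str, out_dir: str, fmt: str = "png",
                      fps: Optional[float] = None) -> list[str]:
    """Assemble an FFmpeg argument list that writes one image per frame.

    Hypothetical helper for illustration only.
    """
    if fmt not in {"png", "jpg", "tiff"}:
        raise ValueError(f"unsupported frame format: {fmt}")
    cmd = ["ffmpeg", "-hide_banner", "-i", video]
    if fps is not None:
        # Resample to a fixed output rate with FFmpeg's fps video filter.
        cmd += ["-vf", f"fps={fps}"]
    if fmt == "jpg":
        # Assumption: a low quantizer value (high quality) for JPEG output.
        cmd += ["-q:v", "2"]
    # Numbered output files, e.g. frame_000001.png, frame_000002.png, ...
    cmd.append(str(Path(out_dir) / f"frame_%06d.{fmt}"))
    return cmd


def frame_timestamp(index: int, fps: float) -> float:
    """Time stamp in seconds of a zero-based frame index,
    assuming a constant frame rate."""
    return index / fps


cmd = build_extract_cmd("match.mp4", "frames", fmt="png", fps=25)
print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where FFmpeg is installed. PNG is the default here because it is lossless, which matches the lossless extraction described above; JPEG trades fidelity for smaller files.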

[0035] The object identification module 250 is configured to identify a plurality of objects within each frame of the plurality of frames of the input video using a first neural network 251. Each frame is associated with a corresponding scene of the sport. The plurality of identified objects depends upon the sport associated with the input video. In an embodiment, the plurality of objects (or objects) includes at least one of participants, game-play objects, and sport-arena structures, or any combination thereof. In an embodiment, the participants may include, without limitation, player(s), referee(s), umpire(s), spectator(s), coach(es), etc. The game-play objects may include, without limitation, ball(s), bat(s), shuttle(s), stick(s), baton(s), hoop(s), goalpost(s), net(s), etc. The sport-arena structures may include a court, a pitch, field marking(s), boundary line(s), or any other sport-specific elements. The types, sizes, and motion characteristics of the objects may vary depending on the sport (e.g., football, basketball, cricket, hockey, tennis, athletics, etc.). For example, in the context of a football game, the plurality of objects may include one or more of: players, ball, referee, goalkeeper, audience, cards, flags, penalty spot, corner flags, goal posts, ground markings, etc.

[0036] The object identification module 250 is configured to generate a tag for each object of the plurality of objects. The tag includes one or more of: one or more bounding boxes 404, an object class identifier, a corresponding class label (e.g., "player", "ball", "goal post", etc.), and a confidence score indicative of the accuracy of detection / identification. Fig. 3 illustrates one exemplary frame 402 depicting various objects identified by the object identification module 250 and corresponding bounding boxes 404. These outputs serve as inputs for subsequent modules that perform various tasks such as object tracking, primary object selection, and portrait-mode video generation. Sports videos exhibit complexities such as rapid player movement, small fast-moving objects (e.g., a football, a basketball, or a cricket ball), varying stadium lighting conditions, crowd interference, background clutter, sudden occlusions, and the simultaneous presence of multiple objects. The first neural network 251 is designed to address these challenges, especially for high-speed and high-variability sports video analysis. In an embodiment, the first neural network 251 is configured to prioritize features relevant to sports content, including small objects such as balls, rapid movements, and specific player poses.
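
By way of a non-limiting illustration, the tag described above might be represented as a small record type. In the following Python sketch, the type names `BoundingBox` and `ObjectTag`, the field names, the use of a single box per tag, the pixel-coordinate convention (top-left origin), and the `filter_confident` helper with its 0.5 threshold are all assumptions made for this example rather than details of the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned box in pixel coordinates (assumed top-left origin)."""
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height

    @property
    def center(self) -> tuple[float, float]:
        return (self.x + self.w / 2, self.y + self.h / 2)


@dataclass(frozen=True)
class ObjectTag:
    """Hypothetical per-object tag: box, class identifier, label, confidence."""
    frame_index: int
    box: BoundingBox
    class_id: int
    label: str
    confidence: float


def filter_confident(tags: list[ObjectTag], threshold: float = 0.5) -> list[ObjectTag]:
    """Keep only tags whose confidence meets an assumed threshold."""
    return [t for t in tags if t.confidence >= threshold]


tags = [
    ObjectTag(0, BoundingBox(900, 500, 20, 20), 0, "ball", 0.42),
    ObjectTag(0, BoundingBox(850, 400, 60, 160), 1, "player", 0.93),
]
print([t.label for t in filter_confident(tags)])  # ['player']
```

Note that a small, fast-moving ball often receives a lower confidence than a player, so a single fixed threshold such as the one assumed here could discard the object that matters most; a per-class threshold may be more appropriate in practice.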

[0037] The first neural network 251 is further configured to process frames affected by motion blur, diverse lighting conditions, and environmental variations present in stadiums to ensure robust performance across different sport scenarios. In an embodiment, the first neural network 251 is able to handle edge cases, such as when two or more of the plurality of objects are too close together or where one object occludes another object (e.g., one runner occluding another runner in an athletics race). Such conditions are a common cause of background confusion and object overlap-based misclassification in conventional detection systems. This ensures that the first neural network 251 maintains stable detection performance and retains an ability to distinguish between overlapping or densely packed objects.

[0038] In an embodiment, the first neural network 251 employs the YOLOv8 framework. The first neural network 251 receives the plurality of frames as input. In an embodiment, the first neural network 251 receives the plurality of frames in Full HD resolution (e.g., 1920 x 1080 or higher). Using Full HD resolution improves the identification of small, fast-moving objects such as the ball, players, and court features. The first neural network 251 implements the standard backbone and neck architecture of the standard YOLOv8 configuration. In an embodiment, a custom object class set containing a reduced set of sport-specific object classes is defined (e.g., ball, player / person, and goalpost for football). The identification head of the YOLOv8 architecture is re-parameterized to predict outputs for the reduced set of sport-specific object classes. This configuration enables real-time processing while focusing on objects relevant to the sport-specific application. In addition, the plurality of objects (or objects) includes context-aware features related to the detected objects. In an embodiment, in addition to features extracted by the first neural network 251 (e.g., conventional features extracted by YOLOv8), the object identification module 250 is configured to extract context-aware features from the plurality of frames. The context-aware features may include one or more visual cues, such as texture, shape, movement patterns, etc., of the plurality of objects. The context-aware features further improve the performance of the first neural network 251.

[0039] In an embodiment, the first neural network 251 is trained with a training dataset customized for sports video applications. The training dataset includes frames (approx. 10,000 to 15,000) from multiple sports with annotations for associated objects such as players, balls, bats, goalposts, referees, bowlers, fielders, and baskets. In an embodiment, separate sport-specific instances of the first neural network 251 are trained individually using the training data set containing frames from the respective sports. The training dataset is annotated with bounding boxes 404 corresponding to various objects belonging to the custom object class set. The annotations are provided manually or through human-assisted methods. Data augmentation techniques, such as random flips, random cropping / resizing, color and brightness adjustments, or the like, are applied to simulate variations common in sports footage, including changes in lighting, camera angles, object occlusions, viewpoints, and rapid motion. These training strategies directly contribute to improved identification robustness and accuracy under complex sports scenarios involving rapid motion, overlapping objects, and varying environmental conditions.

[0040] In an embodiment, the training process for the first neural network 251 employs the standard training loop of the YOLOv8 framework. A standard YOLOv8 multi-task loss including bounding-box regression loss, classification loss, and objectness loss is used. The bounding-box regression loss includes an IoU-based regression loss, such as CIoU, DIoU, or GIoU, for regressing predicted boxes toward ground-truth boxes for the ball, players, and goalposts. The classification loss includes binary cross-entropy or classification loss applied over the custom set of classes. The objectness loss relates to the distinction between object regions and background regions. In an example implementation, the Adam optimizer is used, though other optimization algorithms such as stochastic gradient descent with momentum or AdamW may also be used. The momentum parameter is in the range of approximately 0.9 to 0.95, and the weight decay parameter is in the range of approximately 1×10⁻⁴ to 5×10⁻⁴. The initial learning rate is selected within the object-detection range of approximately 1×10⁻³ to 1×10⁻² depending upon the optimizer algorithm, with cosine decay used as the learning-rate schedule across the training epochs, though other learning-rate schedules may also be employed.
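The cosine learning-rate decay mentioned in this paragraph can be illustrated with a minimal sketch. The function name `cosine_lr`, the floor value `lr_min`, and the example epoch count are assumptions made for illustration only; they are not taken from the disclosed training loop.

```python
import math


def cosine_lr(epoch: int, total_epochs: int,
              lr_init: float = 1e-3, lr_min: float = 0.0) -> float:
    """Return the cosine-decayed learning rate for a 0-based epoch index.

    lr_init is taken from the approximate 1e-3 to 1e-2 range given in the text;
    lr_min is an assumed floor (hypothetical, not from the source).
    """
    if total_epochs <= 1:
        return lr_init
    # Clamp the epoch so out-of-range values do not extrapolate the curve.
    progress = min(max(epoch, 0), total_epochs - 1) / (total_epochs - 1)
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + math.cos(math.pi * progress))


# One learning rate per epoch for an assumed 100-epoch run.
schedule = [cosine_lr(e, total_epochs=100) for e in range(100)]
```

The schedule starts at `lr_init`, decreases slowly at first, and reaches `lr_min` at the final epoch.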

[0041] The batch size is determined based on GPU memory availability for Full HD input. In an example implementation, 8 to 16 images per batch are used for 1920x1080 resolution. The first neural network 251 is fine-tuned over 100 to 200 epochs, with validation monitoring or early-stopping criteria applied to reduce overfitting. Various training parameters given herein are merely exemplary, and other values may also be employed based upon requirements. The selection of hyperparameters, including learning rate, optimizer configuration, momentum, weight decay, batch size, and number of epochs, is performed to balance accuracy with real-time processing capability. Further, it should be understood that the use of the YOLOv8 architecture for object detection is merely exemplary. The first neural network 251 may include any other neural network suitable for detecting small, fast-moving objects in a sports environment with rapid movements, object occlusions, and so on. Such examples include YOLOv11 (nano, medium, large), YOLOv10, or the like.

[0042] The scene change detection module 300 is configured to process the plurality of frames of the input video. The scene change detection module 300 may obtain the plurality of frames from the frame extraction module 200 or from the memory 660. The scene change detection module 300 is configured to identify scene changes in the input video content based upon the plurality of frames. A scene change may represent a change in the context or action within the input video. The scene change may be marked by a shift in visual content, such as a transition from one camera view to another camera view.

[0043] In an embodiment, the scene change detection module 300 is configured to determine whether a scene change has occurred based on a set of predefined features in a given set of consecutive frames, the set including a current frame and one or more frames adjacent to the current frame, and to assign a scene change tag to the current frame based upon the detection. In an example implementation, the set of consecutive frames includes a pair of consecutive frames (e.g., the current frame and a preceding frame or the current frame and a succeeding frame). In another embodiment, the set of consecutive frames may include more than two consecutive frames, which is particularly useful for identifying camera view changes or detecting gradual transitions such as fade-in, fade-out, or slow camera switches. The scene change tag is indicative of the scene change. For example, the current frame is tagged using one of: "same scene" or "different scene". Based upon the analysis, the scene change detection module 300 may also assign, for each scene, a scene identifier, a start frame, an end frame, a start timestamp, and / or an end timestamp, and associate such data with each current frame.
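The bookkeeping described in this paragraph (scene identifier, start and end frames, start and end timestamps) can be sketched as follows. The names `Scene` and `segment_scenes`, and the convention that each tag compares a frame with its preceding frame, are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Scene:
    scene_id: int
    start_frame: int
    end_frame: int
    start_ts: float  # seconds
    end_ts: float    # seconds


def segment_scenes(tags: List[str], fps: float) -> List[Scene]:
    """Group frames into scenes from per-frame scene change tags.

    tags[i] is "same scene" or "different scene" for frame i relative to
    frame i - 1 (assumed convention); the tag of frame 0 is ignored.
    """
    scenes: List[Scene] = []
    if not tags:
        return scenes
    start = 0
    for i in range(1, len(tags)):
        if tags[i] == "different scene":
            # Close the scene that ended on the previous frame.
            scenes.append(Scene(len(scenes), start, i - 1, start / fps, (i - 1) / fps))
            start = i
    last = len(tags) - 1
    scenes.append(Scene(len(scenes), start, last, start / fps, last / fps))
    return scenes
```

For instance, four frames tagged same, same, different, same would yield two scenes covering frames 0-1 and 2-3.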

[0044] In an embodiment, the scene change detection module 300 includes an AI-driven technique and / or a deep learning model such as a Convolutional Neural Network (CNN), a supervised binary classifier model, or the like, trained to determine whether the set of consecutive frames belongs to the same scene or a transition has occurred, indicating a scene change. The deep learning model is configured to distinguish between minor visual changes (e.g., slight camera movements within the same scene) and major visual changes (e.g., transitioning from one camera view to another camera view). Accordingly, the set of consecutive frames that exhibit minor visual changes may be treated as belonging to the same scene.

[0045] In an embodiment, the scene change detection module 300 extracts a set of predefined features from each frame using a deep machine learning technique. The set of predefined features may include, without limitation, pixel-level color histograms, the number of the plurality of objects, positions of the plurality of objects, object detection results, texture patterns, or a combination thereof. The object detection results include the plurality of objects identified in the frame and corresponding positions within the frame. In an embodiment, the set of predefined features includes one or more of: the structural similarity index (SSIM) between the set of consecutive frames, histograms in the H, S, and V channels, correlation coefficient distance per channel, an edge / texture difference parameter (e.g., Laplacian variance difference between the consecutive frames), the change in the number of objects detected in the consecutive frames, and the distribution of the bounding box positions for the detected objects. The set of predefined features may be determined using any techniques known in the art. The set of predefined features is stacked into a feature vector, which is provided as input to the deep learning model.
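Two of the listed features, the per-channel correlation distance between histograms and the change in the number of detected objects, can be sketched in plain Python. The bin count, the helper names, and the representation of a channel as a flat list of 8-bit values are assumptions; the remaining features (e.g., SSIM, Laplacian variance) are not shown.

```python
import math
from typing import List, Sequence


def histogram(values: Sequence[int], bins: int = 8, max_value: int = 256) -> List[float]:
    """Normalized histogram of one channel with values in [0, max_value)."""
    counts = [0] * bins
    for v in values:
        counts[min(v * bins // max_value, bins - 1)] += 1
    total = float(len(values)) or 1.0
    return [c / total for c in counts]


def correlation_distance(h1: Sequence[float], h2: Sequence[float]) -> float:
    """1 - Pearson correlation of two histograms (0 means identical shape)."""
    m1, m2 = sum(h1) / len(h1), sum(h2) / len(h2)
    num = sum((a - m1) * (b - m2) for a, b in zip(h1, h2))
    den = math.sqrt(sum((a - m1) ** 2 for a in h1) * sum((b - m2) ** 2 for b in h2))
    if den == 0.0:
        # Correlation is undefined for a flat histogram; fall back to equality.
        return 0.0 if list(h1) == list(h2) else 1.0
    return 1.0 - num / den


def pair_features(prev_channels: Sequence[Sequence[int]],
                  curr_channels: Sequence[Sequence[int]],
                  prev_count: int, curr_count: int) -> List[float]:
    """Stack per-channel distances and the object-count change into one vector."""
    feats = [correlation_distance(histogram(p), histogram(c))
             for p, c in zip(prev_channels, curr_channels)]
    feats.append(float(abs(curr_count - prev_count)))
    return feats
```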

[0046] In an embodiment, the predefined features are extracted using an advanced computer vision technique. The advanced computer vision technique may include a custom AI model or a multi-model approach having one or more deep-learning-based neural networks designed for extracting different types of features. For example, a convolutional neural network (CNN) may be used for spatial features, a region-based CNN (R-CNN) or YOLO may be used for object detection, and an optical flow algorithm may be used for motion analysis. The set of predefined features is provided as input to the deep learning model. The deep learning model is configured to determine whether a scene change has occurred and to identify the corresponding frame(s). In other words, the deep learning model is configured to classify a given set of consecutive frames as belonging to the same scene or different scenes. Accordingly, the scene change detection module 300 enables precise segmentation of the input video for further processing, thereby allowing the system 100 to dynamically adjust to the flow of the sport. Further, by identifying scene changes in real-time, the scene change detection module 300 ensures that subsequent processing (explained later) may be done on distinct segments of the input video, thereby increasing efficiency and accuracy in tracking and analyzing key moments in the input video.

[0047] In an embodiment, the deep learning model of the scene change detection module 300 employs a supervised binary classification model configured to determine whether a scene transition has occurred between a pair of consecutive frames. In one implementation, the deep learning model includes a logistic-regression model having a single linear transformation applied to the extracted feature vector, followed by a sigmoid activation function. The output of the logistic-regression model is given by Eq. (1)

[0048] $y = \sigma(w^{T}x + b)$ ...Eq. (1)

[0049] In Eq. (1), x represents the feature vector, w represents a learned weight vector, and b represents a bias term. The resulting output value y lies within the range of 0 to 1 and represents the probability that the pair of consecutive frames corresponds to a scene change. This architecture enables efficient, real-time classification with low computational complexity while maintaining high sensitivity to transitions in sports video content.
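Eq. (1) can be illustrated with a short sketch: a dot product of the learned weights with the feature vector, plus the bias, passed through a sigmoid. The function names and the default threshold of 0.5 (the lower end of the 0.5 to 0.7 range mentioned for the classification threshold) are choices made for illustration.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def scene_change_probability(x: Sequence[float], w: Sequence[float], b: float) -> float:
    """Eq. (1): probability that the frame pair corresponds to a scene change."""
    if len(x) != len(w):
        raise ValueError("feature vector and weight vector must have equal length")
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def is_scene_change(x: Sequence[float], w: Sequence[float], b: float,
                    threshold: float = 0.5) -> bool:
    """Tag the pair as 'different scene' when the probability reaches the threshold."""
    return scene_change_probability(x, w, b) >= threshold
```

Raising `threshold` toward 0.7 trades some missed transitions for fewer false positives.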

[0050] In an embodiment, the deep learning model, such as the logistic-regression model, is trained using a training dataset including a plurality of pairs of images (approx. 10,000 pairs) from sports videos. The plurality of pairs of images covers a diverse set of sports and captures variations in camera angles, lighting, object movements, background changes, and the like, reflecting real-world transitions in sports videos and providing the deep learning model a comprehensive understanding of scene changes in dynamic environments of multiple sports. Each pair of images is labelled as either "same scene" or "different scene". In an example implementation, the labeled pair of images was generated by manually annotating transitions between the pair of images. The set of predefined features is extracted from each pair of images. The set of predefined features, along with the corresponding labels, is fed to the deep learning model, which uses them for learning to classify the pair of frames as belonging to the "same scene" or "different scene".

[0051] Further, the deep learning model is trained using a binary cross-entropy loss function to supervise the model. The Adam optimization algorithm is used, though any other optimization algorithm (e.g., the Limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm) compatible with logistic regression can be used. The learning rate is selected within a range of approximately 1×10⁻³ to 5×10⁻³. The deep learning model is trained until convergence using a training-validation split, and a classification threshold in the range of approximately 0.5 to 0.7 is selected to minimize false positives in sports-video scenarios.

[0052] The object tracking module 350 (interchangeably referred to as a tracking module 350 hereinafter) is configured to track the plurality of objects across the plurality of frames to generate a trajectory for each of the plurality of objects using a tracking algorithm. The trajectory for each object may include a location (i.e., spatial position) of the object within each frame, and changes in the spatial position over time. In an embodiment, the tracking module 350 may selectively track the trajectory of each object having a confidence score greater than or equal to a predefined confidence threshold. The predefined confidence threshold is set based on the target sport. In an embodiment, the tracking module 350 tracks the plurality of objects across the plurality of frames and generates the trajectory of each of the plurality of objects using an example tracking algorithm explained below. The tracking algorithm enables the tracking module 350 to track the trajectory of objects precisely, even in difficult-to-identify conditions, such as tracking the trajectory of small and / or blurry objects, when there are abrupt movements, or when any of the objects are occluded, etc. The tracking algorithm may be tailored to the application scenario to accurately track the plurality of objects across the plurality of frames under a real-world scenario.

[0053] In an embodiment, the tracking module 350 implements the following tracking algorithm. For each of the plurality of frames, the tracking module 350 receives the bounding boxes 404, the object class labels, and the confidence scores corresponding to the identified objects, from the object identification module 250. These inputs serve as the basis for subsequent trajectory prediction and association. For each frame, the tracking module 350 executes a series of operations. In an embodiment, the tracking module 350 is configured to estimate the trajectory of each object based on a bounding box, a class label, and a confidence score of each of the plurality of objects using the tracking algorithm. The tracking module 350 adapts a matching threshold for each trajectory based upon the speed of the corresponding object and scene complexity. For example, the tracking module 350 predicts a next state of the trajectory of each object based upon a current state of the trajectory using a constant-velocity Kalman filter, which estimates the center coordinates, width, height, and velocities of each bounding box, collectively a state of the trajectory. The predicted next state includes a predicted position of the bounding box for each object. The tracking module 350 computes a cost matrix between the predicted trajectories and the new detections (i.e., the input received at the current frame). The new detections include the positions of the bounding boxes 404 for the objects identified in the current frame. The cost matrix incorporates the intersection-over-union (IoU) values between predicted and detected bounding boxes 404, optionally combined with appearance-based distances, and class-specific constraints distinguishing, for example, players from the ball. In an example implementation, the cost matrix as defined in the ByteTracker framework is used. The tracking module 350 applies the Hungarian algorithm to the cost matrix to associate the predicted trajectories with the positions of the bounding boxes 404 of the plurality of objects identified in the current frame. In an embodiment, the tracking module 350 identifies the optimal assignment or matching between the predicted trajectories and the detected positions of the plurality of objects in the current frame.
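The association step can be sketched as an IoU-based cost matrix followed by a minimum-cost one-to-one assignment. To keep the example dependency-free, an exhaustive search over permutations stands in for the Hungarian algorithm; it returns the same optimal assignment but is only practical for a handful of objects. The function names and the (x1, y1, x2, y2) box layout are assumptions, and appearance distances and class constraints are omitted.

```python
from itertools import permutations
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def cost_matrix(predicted: Sequence[Box], detected: Sequence[Box]) -> List[List[float]]:
    """Rows are predicted track boxes, columns are detections; cost = 1 - IoU."""
    return [[1.0 - iou(p, d) for d in detected] for p in predicted]


def assign(cost: List[List[float]]) -> List[Tuple[int, int]]:
    """Minimum-cost (track_index, detection_index) pairs via exhaustive search."""
    n_tracks = len(cost)
    n_dets = len(cost[0]) if cost else 0
    if n_tracks == 0 or n_dets == 0:
        return []
    if n_tracks <= n_dets:
        best = min(permutations(range(n_dets), n_tracks),
                   key=lambda cols: sum(cost[r][c] for r, c in enumerate(cols)))
        return list(enumerate(best))
    best = min(permutations(range(n_tracks), n_dets),
               key=lambda rows: sum(cost[r][c] for c, r in enumerate(rows)))
    return sorted((r, c) for c, r in enumerate(best))
```

In practice, a polynomial-time solver would replace `assign`, and each returned pair would still be checked against the matching threshold before being accepted.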

[0054] The tracking module 350 compares the resulting matches against adaptive matching (or association) thresholds (explained later) to determine whether to accept or reject the matches, and determines the trajectory of each object of the plurality of objects in the current frame based upon the comparison. The tracking module 350 accepts the matches that satisfy the adaptive thresholds, and updates the corresponding trajectory with the new observations. In an embodiment, the tracking module 350 initiates a new trajectory for the unmatched detections that exceed the minimum confidence threshold for a given camera view. Further, the tracking module 350 marks the trajectories that remain unmatched as missed, and deletes the trajectories exceeding their maximum permitted age. In an example implementation, the tracking module 350 implements the tracking algorithm using the ByteTracker framework. The use of adaptive matching thresholds helps in maintaining tracking stability during fast-motion scenarios or when visual distinction between objects becomes challenging.

[0055] In an embodiment, the tracking module 350 adapts the association thresholds based on the speed of the plurality of objects (hereinafter, referred to as object speed or velocity) and scene complexity. For each active trajectory, the tracking module 350 maintains a velocity estimate for each object derived from a Kalman filter state, a class label (player or ball), and a short history of IoU values and match scores. In one example, the IoU threshold for associating a player's bounding box is adapted according to the estimated velocity magnitude. When the velocity is below approximately 5 pixels per frame, a higher IoU threshold (approximately 0.5) is applied; the IoU threshold is reduced to approximately 0.4 for velocities between 5-20 pixels per frame, and to approximately 0.3 for velocities of 20 pixels per frame or higher. For ball tracking, which involves significantly higher velocities, the IoU threshold is adapted more aggressively (e.g., an IoU threshold of approximately 0.45 for velocities below 10 pixels per frame, approximately 0.35 for velocities between 10-40 pixels per frame, and approximately 0.25 for velocities beyond 40 pixels per frame).
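The velocity bands in this paragraph map directly onto a small lookup function. The function name is hypothetical, and the handling of the exact boundary values (5, 10, 20, and 40 pixels per frame) follows one reading of the example ranges.

```python
def iou_threshold(class_label: str, speed_px_per_frame: float) -> float:
    """Return the example IoU association threshold for a track.

    class_label is "ball" or anything else (treated as a player);
    speed_px_per_frame is the velocity magnitude from the Kalman state.
    """
    if class_label == "ball":
        if speed_px_per_frame < 10:
            return 0.45
        if speed_px_per_frame <= 40:
            return 0.35
        return 0.25  # beyond 40 pixels per frame
    if speed_px_per_frame < 5:
        return 0.5
    if speed_px_per_frame < 20:
        return 0.4
    return 0.3  # 20 pixels per frame or higher
```

Faster objects overlap less with their predicted boxes from one frame to the next, which is why the required overlap drops as speed rises.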

[0056] In an embodiment, the matching threshold is adapted according to the scene complexity as follows. For example, when the number of active detections (i.e., number of detected objects) exceeds a threshold (for example, more than 15 detections), the association threshold for player trajectory is incrementally increased (e.g., in steps of +0.05) to reduce identity switches. During low-complexity scenes, indicated by the number of detections less than the aforesaid threshold (e.g., 15 detections), the tracking module 350 relies more heavily on the velocity profile and applies slightly relaxed association thresholds.
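A possible form of the complexity-based adjustment is sketched below. The source gives the detection limit (15) and the step size (+0.05) but not how many steps to apply, so the rule of one step per five extra detections and the upper cap of 0.9 are assumptions. The relaxation applied in low-complexity scenes is not quantified in the text and is therefore left out.

```python
def adjust_for_complexity(base_threshold: float, num_detections: int,
                          crowd_limit: int = 15, step: float = 0.05,
                          detections_per_step: int = 5,
                          ceiling: float = 0.9) -> float:
    """Raise a player association threshold in crowded scenes.

    detections_per_step and ceiling are hypothetical tuning values.
    """
    if num_detections <= crowd_limit:
        return base_threshold
    extra = num_detections - crowd_limit
    steps = -(-extra // detections_per_step)  # ceiling division
    return min(base_threshold + steps * step, ceiling)
```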

[0057] In an embodiment, the tracking module 350 further employs a camera-view-adaptive tracking process to improve tracking accuracy in sports broadcast scenarios featuring fast-moving objects (balls often exceeding 100 km/h), frequent camera cuts, and scene changes (e.g., wide shots, close-ups, etc.). The tracking module 350 classifies each incoming frame into one of a plurality of camera-view categories, such as wide-angle pitch view, close-up view, or goal-line view. For each camera-view category, the tracking parameters are independently tuned to account for the apparent object size, expected motion characteristics, and scene geometry. Further, the tracking module 350 incorporates view-specific motion assumptions based on constant-velocity behaviour. The constant-velocity parameters are adjusted to account for typical object speeds observed within each view category. For instance, wide-angle views generally assume slower velocity updates, while close-up views incorporate higher apparent velocities. Goal-line views apply intermediate values based on expected rapid direction changes near the scoring area.

[0058] In addition to these view-based adjustments, the tracking module 350 incorporates sports-dynamics priors to improve trajectory estimation. Representative priors include expected player acceleration ranges (e.g., 0-10 m/s² for a football player), ball deceleration due to ground friction, and projectile constraints for aerial ball motion influenced by gravity. The prediction logic may also consider the game state. For example, during set-piece situations, the tracking module 350 restricts the predicted displacement of players to a smaller region, while during open play, the prediction allows longer trajectories. In addition, the tracking module 350 utilizes camera metadata to compensate for camera-induced apparent motion. Pan, tilt, and zoom parameters are used to separate true object motion from motion introduced by camera movement. This correction improves the accuracy of both Kalman-based prediction and subsequent association stages.

[0059] Further, the tracking module 350 incorporates per-object motion features into the prediction and association logic. These features include velocity and acceleration profiles for each tracked object, category-specific movement patterns such as stationary behaviour, jogging, or sprinting for players, and historical trajectory patterns inferred from preceding frames. These object-specific features allow the tracking module 350 to adapt its prediction for individual objects, improving performance in crowded scenarios, partial occlusions, and rapid-movement sequences.

[0060] In an embodiment, the tracking parameters for each camera-view category include a high IoU threshold (r_high), a low IoU threshold (r_low), a maximum permissible track age, and a minimum confidence requirement for initiating new tracks. Table 1 illustrates one example of such parameter tuning:

Table 1

[0061] Optionally, or in addition, the tracking module 350 further incorporates an occlusion-handling mechanism to maintain accurate trajectories even when the objects become temporarily invisible or partially occluded. For each active trajectory, the tracking module 350 maintains a standard constant-velocity Kalman filter that predicts the bounding-box state, including the center coordinates (cx, cy), width, height, and corresponding velocities. When a matching detection is unavailable in the current frame, the tracking module 350 propagates the predicted state using the Kalman filter, increments a missed-frame counter, and retains the trajectory in an inactive-but-alive state for a predefined maximum duration depending on object type and scene dynamics. For example, player trajectories may remain active for 10-20 frames, while ball trajectories may remain active for 3-8 frames.
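The coasting behaviour for an unmatched trajectory can be sketched as below. Only the state mean of a constant-velocity model is propagated here; a full Kalman filter would also propagate the covariance, which is omitted for brevity. The `Track` fields, the `coast` function, and the specific limits (15 frames for players, 5 for the ball, chosen from within the example ranges) are assumptions.

```python
from dataclasses import dataclass

# Assumed values picked from the example ranges (players 10-20, ball 3-8 frames).
MAX_MISSED = {"player": 15, "ball": 5}


@dataclass
class Track:
    label: str
    cx: float
    cy: float
    w: float
    h: float
    vx: float = 0.0
    vy: float = 0.0
    missed: int = 0
    alive: bool = True


def coast(track: Track) -> Track:
    """Advance a track by one frame when no detection matched it."""
    # Constant-velocity prediction of the box center; size is held fixed.
    track.cx += track.vx
    track.cy += track.vy
    track.missed += 1
    # Retire the track once it has been unmatched for longer than permitted.
    if track.missed > MAX_MISSED.get(track.label, 10):
        track.alive = False
    return track
```

When a detection is later matched to the track, the caller would reset `missed` to zero and update the state from the new observation.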

[0062] When a previously detected object reappears near the predicted position, the tracking module 350 may re-associate the trajectory to such object using a combined cost function that incorporates both spatial overlap and appearance similarity. Specifically, the tracking module 350 computes the intersection-over-union (IoU) between the predicted bounding box and the candidate detection, and compares a simple appearance descriptor, such as a 128-dimensional embedding obtained from a pre-trained ReID or CNN model, using cosine similarity. The combined association cost is computed using, e.g., Eq. (2) below:

[0063] $\text{cost} = \alpha\,(1 - \text{IoU}) + (1 - \alpha)\,(1 - \text{cos\_sim})$ ...Eq. (2)

[0064] In Eq. (2), α is a weighting factor in the range of 0.5-0.8, IoU represents the IoU value, and cos_sim represents the cosine similarity value. If the computed cost is below a predetermined threshold, the detection is re-associated with the previously occluded trajectory, thereby restoring continuity in the object trajectory.
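Eq. (2) can be expressed as a short function that blends the spatial term and the appearance term with the weighting factor. The parameter name `alpha`, its default of 0.6 (within the stated 0.5-0.8 range), and the re-association threshold `max_cost` are assumptions, since the source does not give a threshold value.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two embeddings; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def association_cost(iou_value: float, cos_sim: float, alpha: float = 0.6) -> float:
    """Eq. (2): weighted sum of the overlap cost and the appearance cost."""
    return alpha * (1.0 - iou_value) + (1.0 - alpha) * (1.0 - cos_sim)


def should_reassociate(iou_value: float, cos_sim: float,
                       alpha: float = 0.6, max_cost: float = 0.5) -> bool:
    """Re-associate when the combined cost is below the (assumed) threshold."""
    return association_cost(iou_value, cos_sim, alpha) < max_cost
```

A larger weighting factor favours spatial overlap, which suits scenes where players wear near-identical kits and appearance embeddings are less discriminative.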

[0065] The tracking module 350 may also consider contextual information from surrounding trajectories when determining whether to maintain an occluded trajectory. For example, in situations where multiple players congregate in a confined area, such as a penalty box, the tracking module 350 allows longer occlusion durations to account for likely temporary overlaps. Conversely, if the predicted position of an object moves outside the field boundaries or exits the frame, the tracking module 350 may terminate the trajectory earlier. This combination of trajectory prediction, visual similarity, and contextual reasoning ensures robust and accurate tracking of objects even under challenging occlusion and crowding conditions.

[0067] In an embodiment, the tracking module 350 works in tandem with the object identification module 250. For example, when a new frame is processed, the tracking module 350 may receive a new plurality of objects from the object identification module 250 and begin tracking their movement across subsequent frames. When the object identification module 250 detects a new object or when the tracking module 350 loses focus on any object, the system 100 may re-invoke the object identification module 250 to reassess the scene and update the state of the tracking module 350. For example, when an object (such as a player or the ball) becomes temporarily occluded or obscured by another object, the tracking module 350 may reassess the scene to maintain an accurate trajectory for that object. In an embodiment, when such a situation occurs, the tracking module 350 may update its state by recalculating the position of such an object by backtracking through previous frames to regain accurate trajectory estimation. Such a synergy between the object identification module 250 and the tracking module 350 ensures consistent and reliable trajectory tracking throughout the input video, enabling smooth transitions in the portrait-format output video. The proposed customized tracking algorithm allows the system 100 to respond quickly to the scene changes and maintain trajectory accuracy for objects despite motion blur, occlusion, or fast movement. This significantly enhances tracking performance in dynamic sports video scenarios compared to conventional systems.

[0068] The activity analysis module 400 is configured to analyze and monitor the activity of each object detected within the input video, including but not limited to human objects (e.g., players, referees, coaches), non-human objects such as sports equipment (e.g., balls, bats, rackets), and other relevant entities. The activity analysis module 400 determines the activity of each object based upon one or more factors such as spatial movement, velocity, acceleration, interaction with other objects, object trajectories, positional relevance, and centrality within a scene. In an embodiment, the activities for the human objects may include, for example, kicking or passing the ball, striking with a bat, dancing, running, sprinting, jumping, colliding with other players, celebrating a goal, falling on the ground, blocking, tackling, or remaining idle, etc. The activities for the non-human objects may include actions such as ball deflection, bounce, trajectory shift, or abrupt changes in direction, etc.

[0069] In an embodiment, the activity analysis module 400 is configured to classify each activity into predefined or dynamically generated action categories, such as passing, shooting, dribbling, defending, tackling, receiving the ball, intercepting, or initiating a counterattack. Based upon the activities, the activity analysis module 400 assigns an activity weight (also, referred to as activity score) to each object. The activity weight is indicative of the level of involvement, significance, or contribution of the object to the ongoing action within the scene. For example, in a football match, a higher activity weight may be assigned to a player actively dribbling, shooting, or passing the ball, whereas a player jogging away from the main action or standing idle may receive a lower activity weight. Similarly, when the ball exhibits a rapid trajectory change, such as during a goal attempt, the ball's weight may be temporarily elevated to reflect its heightened relevance.

[0070] In an embodiment, the activity weights are determined based on one or more parameters including, but not limited to, motion intensity (e.g., speed, acceleration, and direction change), interaction frequency (e.g., interactions with the ball or other players), proximity to key events (e.g., being near the ball or goalpost), pose-based indicators (e.g., limb movement associated with striking or defensive actions), and temporal continuity (e.g., sustained participation in an action over multiple frames). In an embodiment, the activity weights may be updated dynamically as the scene progresses. The activity analysis module 400 may compute these weights on a frame-by-frame basis or over sliding temporal windows to capture short-term and long-term behavioral cues. These dynamically computed weights may then be provided to other modules of the system 100, for example, to determine the primary object of each frame, the region of interest (ROI), or to prioritize a particular object when generating the portrait-mode output video.
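One way to combine the listed parameters into an activity weight is a weighted sum per frame, averaged over a sliding window. The source does not specify how the parameters are combined, so the weighted-sum form, the relative importances in `WEIGHTS`, the cue names, and the window length are all assumptions for illustration.

```python
from collections import deque
from typing import Deque, Dict

# Hypothetical relative importances of the five cues; they sum to 1.
WEIGHTS = {"motion": 0.30, "interaction": 0.25, "proximity": 0.20,
           "pose": 0.15, "continuity": 0.10}


def frame_activity(cues: Dict[str, float]) -> float:
    """Weighted sum of cues, each clamped to [0, 1]; missing cues count as 0."""
    return sum(w * min(max(cues.get(name, 0.0), 0.0), 1.0)
               for name, w in WEIGHTS.items())


class SlidingActivity:
    """Average of per-frame activity over the most recent `window` frames."""

    def __init__(self, window: int = 16) -> None:
        self.scores: Deque[float] = deque(maxlen=window)

    def update(self, cues: Dict[str, float]) -> float:
        self.scores.append(frame_activity(cues))
        return sum(self.scores) / len(self.scores)
```

A short window reacts quickly to events such as a shot on goal, while a longer window smooths out brief spikes.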

[0071] The focus determination module 450 is configured to dynamically determine, for each frame, a primary object and at least one secondary object (hereinafter, referred to as the secondary object) from among the plurality of objects present in the scene, based upon the trajectories of the plurality of objects. The primary object corresponds to the object of highest contextual relevance for the frame, while the secondary object represents additional object(s) whose spatial or semantic relationship with the primary object contributes to determining a region-of-interest (ROI). The determination of the primary object and the secondary object is based on contextual parameters (or features), including the trajectories of the plurality of objects, the activity weights (or scores) of the plurality of objects within the scene associated with the frame, and a change in the scene. The focus determination module 450 is configured to dynamically evaluate the activity scores of the plurality of objects within the scene associated with the frame, and the change in the scene, using a machine learning model 451. In an embodiment, the focus determination module 450 employs a context-aware time-series machine learning model 451 to identify the primary object over time while dynamically selecting the secondary object conditioned or based on the selected primary object, thereby ensuring smooth and responsive selection of the primary object and the secondary object, such as a player, ball, or other game-play elements.

[0072] In an embodiment, the focus determination module 450 receives calculated data from the tracking module 350, the activity analysis module 400, and the scene change detection module 300 to determine or predict the secondary object relative to the primary object. The secondary object may include, for example, one or more interacting players, a ball associated with the primary player, a goalkeeper, or the like.

[0073] In one implementation, the machine learning model 451 is a Transformer-based sequence encoder that processes a short temporal window of per-frame features to compute object relevance. It should be understood that any other machine learning models such as RNN, LSTM, GRU, etc. may also be used for determining the primary object.

[0074] The machine learning model 451 receives multiple features for each object, including one or more spatial features, one or more temporal features, activity features, interaction features of the plurality of objects, and the scene change indicator (or tag). The spatial features include, for example, bounding boxes 404 and object shapes. The temporal features include, for example, trajectory histories of the plurality of objects from the tracking module 350. The activity features include, for example, the activity weights generated by the activity analysis module 400. The scene-change tag is received from the scene change detection module 300. The interaction features include, for example, distances, relative motion vectors, and relational encodings between the plurality of objects. The machine learning model 451 processes these features to generate a normalized relevance score, representing the likelihood that a given object is the primary object for the current frame, and interactional relevance scores representing the likelihood of each remaining object being selected as a secondary object given the selected primary object. Temporal attention layers of the machine learning model 451 assign a higher weight to recent or significant events, ensuring rapid reflection of sudden changes, such as a ball being kicked or a player initiating an action.

[0075] The focus determination module 450 identifies the object with the highest relevance score as the primary object. Thereafter, the focus determination module 450 determines the secondary objects by evaluating interaction relevance scores relative to the primary object. The machine learning model 451 outputs the coordinates of the primary object and the coordinates of the determined secondary object(s).
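The selection described in paragraph [0075] can be illustrated with a minimal sketch. It assumes the relevance scores and the interaction relevance scores have already been produced by the model; the function name, the dictionary layout, and the `max_secondary` and `min_score` parameters are hypothetical and are not taken from the disclosure.

```python
def select_focus(relevance, interaction, max_secondary=1, min_score=0.0):
    """Pick the primary object and rank secondary objects.

    relevance:   {object_id: relevance score}
    interaction: {(primary_id, other_id): interaction relevance score}
    Both layouts are illustrative assumptions.
    """
    # Primary object: the object with the highest relevance score.
    primary = max(relevance, key=relevance.get)

    # Score every remaining object relative to the chosen primary.
    candidates = [
        (interaction.get((primary, oid), 0.0), oid)
        for oid in relevance
        if oid != primary
    ]
    candidates.sort(reverse=True)

    # Keep the best-scoring objects above the (hypothetical) cutoff.
    secondary = [oid for score, oid in candidates[:max_secondary] if score > min_score]
    return primary, secondary
```

For instance, with the ball scoring highest and a nearby player having the strongest interaction score, the sketch returns the ball as the primary object and that player as the secondary object.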

[0076] In an embodiment, for each frame, the focus determination module 450 builds a feature vector for one or more candidate objects of the plurality of objects. For example, the candidate objects include the top 8 - 10 objects by importance. The feature vector includes a set of features, including one or more of: the normalized center, width, and height of the bounding boxes 404 of the candidate objects, a class label for the candidate objects encoded as one-hot, a confidence score for the candidate objects, the magnitude of the trajectory velocity of the candidate objects, proximity to other objects, and an activity score for the candidate objects. For example, the features include the normalized bounding box center (cx, cy), width, and height, the object class (e.g., player, goalkeeper, ball, goalpost) encoded as one-hot, the detection confidence, the trajectory velocity (vx, vy), and the activity score. The activity score may include motion (or velocity) magnitude and proximity to other objects. These features are either flattened into a single vector of dimension D (e.g., 128) or treated as object tokens concatenated with a special frame token. The features for a temporal window of frames (e.g., 16 - 32 frames) are fed into the machine learning model 451 to capture spatiotemporal dependencies and evolving object relationships. In an embodiment, the focus determination module 450 provides the feature vector to the machine learning model 451. The machine learning model 451 provides a relevance score for each candidate object based upon the feature vector, identifies the candidate object with the highest relevance score as the primary object, and determines the coordinates of the primary object.
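As an illustration of the per-object features listed in paragraph [0076], the following sketch assembles one feature list for a single detection. The dictionary keys (`box`, `cls`, `conf`, `velocity`, `activity`), the class ordering, and the feature ordering are assumptions made for the example; the disclosure does not fix any of them.

```python
import math

# Assumed class vocabulary, taken from the examples in the text.
CLASSES = ("player", "goalkeeper", "ball", "goalpost")


def object_features(det, frame_w, frame_h):
    """Build a flat feature list for one detected object.

    det is a hypothetical dict with keys:
      box (x1, y1, x2, y2) in pixels, cls, conf, velocity (vx, vy), activity.
    """
    x1, y1, x2, y2 = det["box"]
    # Normalized bounding-box center, width, and height.
    cx = (x1 + x2) / 2 / frame_w
    cy = (y1 + y2) / 2 / frame_h
    w = (x2 - x1) / frame_w
    h = (y2 - y1) / frame_h
    # One-hot encoding of the object class.
    one_hot = [1.0 if det["cls"] == c else 0.0 for c in CLASSES]
    vx, vy = det["velocity"]
    speed = math.hypot(vx, vy)  # magnitude of the trajectory velocity
    return [cx, cy, w, h, *one_hot, det["conf"], vx, vy, speed, det["activity"]]
```

With four classes this yields 13 values per object; padding or projecting to the dimension D mentioned above would be a separate step.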

[0077] In an example implementation, the machine learning model 451 is a Transformer-based sequence encoder having the following architecture. An embedding layer performs a linear projection from the input dimension D to the model dimension d_model (e.g., 128). Positional encoding, such as sinusoidal or learned encoding, is added to the frame embeddings, i.e., the feature vector. The Transformer encoder includes 2 - 4 blocks, each containing multi-head self-attention (4 - 8 heads), a feed-forward layer (hidden size 2x - 4x d_model), and LayerNorm with residual connections. ReLU or GELU activation is used in the feed-forward layer. In the output head, pooled encoder outputs are passed through a small MLP to either compute probabilities over K candidate objects (softmax) or directly output continuous primary object coordinates (cx_focus, cy_focus). Optionally, the output head also outputs a zoom factor.

[0078] In an embodiment, the Transformer-based sequence encoder is trained on time-series data from the object identification module 250 and the tracking module 350. The training data includes annotated multiple-frame sequences (e.g., approx. 80,000 sequences are used) from various sports. The frame sequences are extracted from annotated sports videos. Each frame is annotated with a ground-truth primary object (e.g., this player or this ball), one or more ground-truth secondary objects, and a desired crop center and zoom for the portrait view. The annotations may be manually created or may be based upon heuristic rules (e.g., a crop that keeps both the ball and the active player). A loss such as Smooth L1 or MSE between the predicted and ground-truth crop center (and, optionally, zoom) is used as the loss function. An Adam or AdamW optimizer is used with the following learning parameters: an initial learning rate between 1e-4 and 5e-4, a batch size of 16 - 64 sequences, 20 - 80 epochs, a weight decay between 1e-5 and 1e-4, and a dropout of 0.1 - 0.3 inside the Transformer layers to avoid overfitting. During training, the machine learning model 451 learns to associate object behaviors (including motion intensity, acceleration, deceleration, direction changes, frequency of interaction, proximity to high-activity regions, and semantic cues from human activities) with contextual relevance. The machine learning model 451 also predicts near-future positions of the primary object based on the temporal progression of all objects, allowing anticipation of significant actions, such as a player preparing to take a shot.

[0079] To ensure smooth and natural camera motion in the portrait output video, the focus determination module 450 processes the coordinates (or trajectory) of the primary object and the secondary object predicted by the machine learning model 451 across the frames using a temporal smoothing pipeline. This produces smooth, natural camera movements in the output video while preserving responsiveness to fast action and preventing frame-to-frame jitter, which would otherwise result in an unpleasant viewing experience in the final portrait output video. The focus determination module 450 implements a temporal buffer that stores the coordinates of the primary object predicted by the machine learning model 451 for N frames. N may be set at a value between 15 - 30 frames. In an embodiment, the focus determination module 450 calculates the acceleration of the primary object based upon the coordinates predicted by the machine learning model 451 for the primary object for a set of previous frames stored in the temporal buffer, the set of previous frames comprising between 15 - 30 frames. The buffer is used for look-ahead and look-behind smoothing. In an embodiment of the temporal smoothing pipeline, the predicted coordinates of the primary object are processed using one or more of the following filtering algorithms: Savitzky-Golay (i.e., Savgol) filter, Kalman filter, and spline interpolation. The Savitzky-Golay filter fits a local polynomial of a predefined order (e.g., 2nd or 3rd order) to the predicted coordinates. This preserves sharp transitions during genuine actions. The Kalman filter models the predicted coordinates as a dynamic system with position and velocity states and generates an optimal estimate of the position of the primary object under constant-velocity motion assumptions. The spline interpolation generates smooth curves for the predicted coordinates. This ensures C1 / C2 continuity for natural-looking camera motion. In one embodiment, the focus determination module 450 applies exponential polynomial weighting to the smoothed trajectory of the primary object resulting from the temporal smoothing pipeline. This reduces excessive smoothing during high-acceleration events (e.g., ball kicks) to maintain responsiveness. In an example implementation, the focus determination module 450 applies the exponential weighting using Eq. (3) given below.
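To make the Kalman filtering step concrete, the following is a minimal sketch of a constant-velocity Kalman filter applied to one coordinate (e.g., the x coordinate of the primary object). The noise values `q` and `r`, the diagonal process noise, and the initial covariance are illustrative assumptions; the disclosure specifies only that position and velocity states are modeled under a constant-velocity assumption.

```python
def kalman_cv_1d(zs, dt=1.0, q=1e-2, r=4.0):
    """Smooth a sequence of 1-D positions with a constant-velocity model.

    zs: measured positions, one per frame.
    q:  process noise added to each state variance (assumed value).
    r:  measurement noise variance (assumed value).
    """
    p, v = zs[0], 0.0                # state: position and velocity
    P = [[r, 0.0], [0.0, 1.0]]       # state covariance (assumed start)
    out = []
    for z in zs:
        # Predict: x = F x and P = F P F^T + Q, with F = [[1, dt], [0, 1]].
        p = p + dt * v
        P00 = P[0][0] + dt * (P[0][1] + P[1][0]) + dt * dt * P[1][1] + q
        P01 = P[0][1] + dt * P[1][1]
        P10 = P[1][0] + dt * P[1][1]
        P11 = P[1][1] + q
        # Update with the position measurement z, H = [1, 0].
        s = P00 + r                  # innovation variance
        k0, k1 = P00 / s, P10 / s    # Kalman gain
        y = z - p                    # innovation
        p, v = p + k0 * y, v + k1 * y
        P = [[(1 - k0) * P00, (1 - k0) * P01],
             [P10 - k1 * P00, P11 - k1 * P01]]
        out.append(p)
    return out
```

Running the filter independently on the x and y coordinates is one simple way to apply it to a two-dimensional trajectory; a joint four-state filter would be an equally valid choice.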

[0081] In Eq. (3), w(t) is the smoothing weight applied at time t, |a(t)| is the magnitude of the detected acceleration at time t, α is a parameter controlling the smoothing weight and having a value between 0.3 - 0.7, β is a parameter controlling sensitivity to acceleration and having a value between 0.01 - 0.1, γ is a polynomial base weight, and poly(t, n) is an n-th order polynomial fitted to the recent trajectory of the primary object. Thus, when the acceleration is low, indicating normal play, the weight is high, which results in strong smoothing, whereas when the acceleration is high, indicating a rapid action (e.g., a kick), the weight decreases, resulting in reduced smoothing and a faster response to the action.
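Eq. (3) itself is not reproduced in this text, so the following sketch only illustrates the behavior described above: a weight that is high at low acceleration and decays as acceleration grows. The functional form (an exponential decay scaled by alpha, plus gamma times a polynomial term) is an assumption chosen to match the listed parameters and is not the patent's equation. The finite-difference acceleration estimate and both function names are likewise hypothetical.

```python
import math


def acceleration_magnitude(points):
    """Estimate |a(t)| from the last three (x, y) points by a second difference."""
    (x0, y0), (x1, y1), (x2, y2) = points[-3:]
    return math.hypot(x2 - 2 * x1 + x0, y2 - 2 * y1 + y0)


def smoothing_weight(acc_mag, alpha=0.5, beta=0.05, gamma=0.0, poly_term=0.0):
    """ASSUMED form: alpha * exp(-beta * |a|) + gamma * poly.

    alpha and beta default to values inside the ranges given in the text;
    gamma and poly_term default to zero so the polynomial part is inactive.
    """
    return alpha * math.exp(-beta * acc_mag) + gamma * poly_term
```

Under this assumed form, a stationary or uniformly moving object gets the maximum weight alpha, while a sudden kick (large second difference) drives the weight toward zero.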

[0082] In an embodiment, the focus determination module 450 determines the final primary object and associated coordinates as follows. The acceleration is calculated from the predicted coordinates. The smoothing weight is calculated using Eq. (3) as above. The focus determination module 450 compares the smoothing weight with a weight threshold. When the smoothing weight is high (e.g., greater than the weight threshold), the Savgol and / or Kalman filters are applied to the predicted coordinates for further smoothing. If the weight is low (e.g., less than the weight threshold), no further smoothing is applied. The final smooth trajectory (coordinates) of the primary object is determined by applying the spline interpolation with continuity constraints to the predicted coordinates or the smoothed predicted coordinates, as the case may be. Consequently, there is no jitter during normal play and a faster response to rapid actions, thereby maintaining aesthetic, broadcast-quality camera motion. The weight threshold is set based upon the target sport and the expected velocity and acceleration for a given object in the target sport. Once the primary object is determined, the focus determination module 450 determines the secondary object based upon the interactional relevance score of the objects relative to the determined primary object. In an embodiment, an object having the highest interactional relevance score is selected as the secondary object. For example, if the ball is the primary object, a player closest to the ball has a higher interactional relevance score and distant players are assigned a lower interactional relevance score. Accordingly, the closest player is selected as the secondary object. It should be appreciated that the number of secondary objects depends upon the target sport and the contextual information of a given sport moment. For example, in a penalty kick scenario in football, the ball is the primary object, and the player taking the penalty kick, the goalkeeper, and the goalpost are the secondary objects.

[0083] The portrait frame generation module 500 is configured to generate a portrait frame corresponding to each frame of the plurality of frames based upon the primary object, such that the primary object and at least one secondary object are positioned within the portrait frame. The portrait frame generation module 500 is configured to receive, for each frame, the primary object and the secondary object identified by the focus determination module 450. The focus determination module 450 provides the identity of the primary object and the secondary object, the bounding box of the primary object and the secondary object, and the trajectory of the primary object and the secondary object derived from the tracking module 350 to the portrait frame generation module 500. In an embodiment, using this information, the portrait frame generation module 500 dynamically identifies a region of interest (ROI) segment within the frame based upon the primary object and the corresponding secondary object, such that the primary object and the secondary object remain visually emphasized in the ROI segment. The ROI segment is defined by its position and size. In an embodiment, the portrait frame generation module 500 determines a position and a size of the ROI segment based upon a distance between the primary object and the secondary object such that the ROI segment encloses the primary and secondary objects. The ROI segment is defined as a content-adaptive spatial region enclosing the primary object and selectively enclosing the secondary object based on contextual relevance.

[0084] As explained, the position and size of the ROI segment are dependent upon the proximity between the primary object and the secondary object and may be calculated as a minimal bounding area that contains the primary and secondary objects. For example, when the ball and the player are in close proximity (e.g., in situations like dribbling, skill moves, or imminent kicking of the ball), the size of the ROI segment is smaller. This helps in providing a "close-up" feel, maximizing vertical pixels for the viewer. Fig. 4A illustrates one such example scenario and depicts an ROI segment 501 identified by the portrait frame generation module 500 within a frame 503. In another example, when the distance between the ball and the player increases (e.g., in situations like a long shot or a cross pass), the size of the ROI segment is larger to keep both in view. Fig. 4B illustrates one such example scenario and depicts an ROI segment 504 identified by the portrait frame generation module 500 within a frame 505. As can be seen, the ROI segment 504 is larger than the ROI segment 501.
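The dependence of the ROI size on object proximity can be sketched as follows. The example computes the minimal box enclosing the given bounding boxes, grows it to a 9:16 width-to-height ratio, and clamps it to the frame. The function name, the `margin` parameter, and the clamping strategy are assumptions for illustration only; when the objects are farther apart than the widest portrait window that fits in the frame, this sketch simply returns that widest window centered between them.

```python
def roi_for_objects(boxes, frame_w, frame_h, aspect=9 / 16, margin=0.1):
    """Return (x, y, w, h) of a portrait ROI around the given boxes.

    boxes:  iterable of (x1, y1, x2, y2) pixel boxes (primary and secondary).
    aspect: target width / height ratio of the ROI.
    margin: hypothetical fractional padding around the enclosing box.
    """
    boxes = list(boxes)
    x1 = min(b[0] for b in boxes)
    y1 = min(b[1] for b in boxes)
    x2 = max(b[2] for b in boxes)
    y2 = max(b[3] for b in boxes)

    # Minimal enclosing area plus padding.
    bw = (x2 - x1) * (1 + margin)
    bh = (y2 - y1) * (1 + margin)

    # Grow whichever side is too small so that w / h equals the aspect ratio.
    h = max(bh, bw / aspect)
    w = h * aspect

    # The ROI cannot exceed the landscape frame.
    if h > frame_h:
        h = frame_h
        w = h * aspect
    if w > frame_w:
        w = frame_w
        h = w / aspect

    # Center on the objects, then clamp inside the frame.
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    x = min(max(cx - w / 2, 0), frame_w - w)
    y = min(max(cy - h / 2, 0), frame_h - h)
    return x, y, w, h
```

In a 1920 x 1080 frame, a ball next to a player yields a small ROI (a close-up), while a ball several hundred pixels away from the player yields a visibly larger one, mirroring the contrast between Fig. 4A and Fig. 4B.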

[0085] Once the ROI segment is identified, the portrait frame generation module 500 is configured to generate a portrait frame from the ROI segment. The portrait frame has a second aspect ratio corresponding to a portrait format. For example, the second aspect ratio is 9:16. The portrait frame generation module 500 crops the frame (which is in the landscape format) according to the size and position of the ROI segment, removing the portion of the frame outside the ROI segment to generate a cropped frame, and resizes the cropped frame based upon the dimensions of the ROI segment and the portrait frame. In an embodiment, the portrait frame generation module 500 applies a bicubic interpolation algorithm on the cropped frame to generate the portrait frame. This is done to ensure that even when zoomed in, the players' features remain sharp and clear. Fig. 5 illustrates a portrait frame 502 generated by the portrait frame generation module 500 corresponding to the ROI segment. In another embodiment, a buffer zone of black pixels is added around the ROI segment when the ROI segment is resized to the larger size of the portrait frame, to ensure that no body parts are clipped. The buffer zone may be 10% - 15% of the size of the ROI segment. It should be understood that any other technique or algorithm may be used to generate the portrait frame from the ROI segment.

[0086] The focus determination module 450 and the portrait frame generation module 500 work in tandem to ensure dynamic, context-aware primary object switching across the successive portrait frames. In a sports video or other scenarios involving multiple moving objects within a single camera view, the system 100 adaptively adjusts the position and size of the ROI segment to maintain visual focus on the primary object and the secondary object identified by the focus determination module 450. For example, in a fast-paced sport such as football, the primary object may shift from a player initiating an action to the ball in motion or to another player receiving the ball. The focus determination module 450 determines such shifts based on contextual cues, including activity levels, trajectories, and relevance of the primary object and the secondary object. Then, the portrait frame generation module 500 uses the updated primary object and secondary object information to reposition the ROI segment as needed so that the newly identified primary object and the secondary object remain within the view. The portrait frame generation module 500 may move the ROI segment horizontally or vertically, depending on the position of the primary object and the secondary object, so that the ROI segment continues to capture the most relevant region of the scene. For example, consider a sequence in which a first player has a ball and initiates a pass to a second player. During frames preceding the pass, the focus determination module 450 identifies the ball as the primary object based on its high activity weight and interaction relevance, while the first player is identified as the secondary object due to spatial proximity and direct interaction with the ball. Upon execution of the pass, the focus determination module 450 dynamically updates the secondary object from the first player to the second player as the ball transitions toward the second player and interaction relevance shifts accordingly.

[0087] In response to this change, the portrait frame generation module 500 dynamically updates the ROI segment by adjusting its spatial position and size based on the relative positions and smoothed trajectories of the ball, the first player, and the second player. The ROI segment is repositioned along the predicted motion path of the ball such that the ball remains emphasized within the central portion of the portrait frame, while the currently interacting player is retained within the ROI segment to preserve contextual continuity. The adjustment may include horizontal and / or vertical displacement and adaptive scaling of the ROI segment based on inter-object distance and motion direction.

[0088] In an example implementation, the object identification module 250, the tracking module 350, the activity analysis module 400, the focus determination module 450, and the portrait frame generation module 500 operate in a synchronized manner to determine the primary object and the secondary object, and dynamically adjust the portrait frame 502 based on the contextual changes. Such an integrated approach ensures that the portrait video remains centered on the most relevant portions of the scene, delivering a seamless and engaging viewing experience.

[0089] The video generation module 600 is configured to generate a portrait video based on the plurality of portrait frames. In an embodiment, the video generation module 600 is configured to compile the portrait frames and generate an output video (also referred to as a portrait mode video or a portrait video). The portrait mode video has the second aspect ratio, for example, 9:16. The portrait frames are produced with adjusted zoom levels, updated primary objects, and the secondary object, the positions of the primary objects, the position of the secondary object, and the positions of surrounding objects. The video generation module 600 ensures that the resulting portrait format video maintains smoothness, visual coherence, and clarity of the primary objects and the secondary object, while keeping the overall viewer experience natural and engaging. In an embodiment, the video generation module 600 includes, without limitation, a frame composition module 605, an audio sync module 610, and an encoding module 615 as depicted in Fig. 2.

[0090] The frame composition module 605 assembles the portrait frames in their temporal sequence. In an embodiment, the frame composition module 605 applies frame interpolation and temporal smoothing techniques known in the art on the portrait frames to reduce abrupt visual jumps and ensure a smooth transition between successive portrait frames (especially when the zoom level changes and / or the primary objects shift), giving the portrait mode video a fluid, cinematic quality. Further, the frame composition module 605 adjusts the portrait frame smoothly. In an embodiment, the frame composition module 605 uses context-aware video composition techniques known in the art to avoid sudden cuts or awkward framing changes, creating a seamless viewing experience. The frame composition module 605 also ensures that the resolution and quality of the input video are preserved.

[0091] Additionally, or optionally, the audio sync module 610 is configured to synchronize the audio of the input video with the portrait video using any audio synchronization technique known in the art. For example, when the primary object and the secondary object are changed based upon a person's activity or speech, the audio sync module 610 adjusts a corresponding audio track to match the new primary object. The encoding module 615 is configured to encode the portrait video having the synced audio to maintain the quality and ensure compatibility with various devices, thereby allowing the portrait video to be easily shared or broadcast. The portrait video may then be rendered on a device of the viewer in portrait mode.

[0092] The database 550 is configured to store data including, without limitation, the input video, the extracted frames, details about scene changes, the plurality of objects, trajectories of the plurality of objects across multiple frames, context of the scenes, details of the primary objects, the portrait frames, the portrait video, and the like. The data interface may be wired or wireless. The network may include, without limitation, a local area network, a wide area network, a private network, a wireless network (such as Wi-Fi, Bluetooth), a cellular network (such as 2G, 3G, 4G, 5G, etc.), the Internet, or any combination thereof. The database 550 may be a local data store or a remote database (e.g., a cloud database) accessible by the server.

[0093] In an embodiment, the system 100 is configured to receive raw video data from one or more sources, such as cameras, satellite feeds, or broadcast encoders, and to dynamically generate a portrait-format video stream with minimal latency. The system 100 continuously updates the position and status of the primary object and the secondary object to maintain centralization, visibility, and contextual relevance even during rapid or unpredictable movement within the scene. The system 100 is further configured to output a real-time portrait-mode video stream compliant with mobile-platform specifications, including, but not limited to, 9:16 resolution profiles, bitrate limitations, safe-area constraints, and adaptive encoding requirements. Because the system 100 operates entirely through software and does not require dedicated vertical-format camera arrays or hardware-based switching equipment, it provides a scalable, cost-efficient solution for broadcasters seeking to deploy portrait-mode live video on social-media platforms, mobile applications, and digital-first streaming channels.

[0094] In an embodiment, the system 100 is further adapted to optimize the portrait-mode output for consumption on mobile devices and social-networking platforms, addressing the increasing demand for mobile-first viewing. The system 100 may automatically generate output in a 9:16 aspect ratio or other portrait-oriented formats based on platform-specific guidelines, including vertical-centering rules, safe-zone definitions, and maximum permissible cropping thresholds. The system 100 is configured to identify and preserve semantically relevant regions within the original landscape frame (such as player positions, scoreboards, on-field graphics, interview backdrops, or branding elements) despite the reduced horizontal field of view. In an embodiment, the system 100 determines which regions contain essential visual information and ensures their inclusion in the portrait-mode output. The system 100 is further scalable to additional device categories, including tablets, foldable displays, portable monitors, smart televisions operating in portrait-enabled modes, and public-screen installations. The system 100 may generate multiple portrait-oriented variants simultaneously, thereby enabling broadcasters to distribute content across diverse platforms without redundant or repetitive processing.

[0095] Fig. 6 depicts a flowchart of a method 800 for converting a landscape video into a portrait video, according to an embodiment of the present disclosure. The method 800 is performed by various modules executed by the processing unit 650.

[0096] At step 802, an input video is received by an input module 150. The input video is in the landscape format and has the first aspect ratio. The input video may be a live stream, a prerecorded video, or a combination thereof. The input module 150 preprocesses the received video stream to ensure compatibility with downstream processing.

[0097] At step 804, a plurality of frames are extracted from the input video by a frame extraction module 200. The extracted frames may be converted into a format that makes the frames easily accessible and processable as needed.

[0098] At step 806, a plurality of objects is identified within each frame of the plurality of frames by the object identification module 250. The plurality of objects may include one or more human objects and / or one or more non-human objects. Each frame is associated with a corresponding scene. In an embodiment, the plurality of objects is identified using the first neural network 251 in a similar manner as described earlier.

[0099] At step 808, a trajectory of each of the plurality of objects across the plurality of frames is determined by the tracking module 350. The trajectories capture temporal motion patterns, object displacement, and continuity across frames. In an embodiment, the trajectories are determined in a similar manner as described earlier.

[0100] At step 810, a primary object is determined from the plurality of objects for each frame by the focus determination module 450 based upon the trajectory of the plurality of objects, the activity weight of the plurality of objects within the scene associated with the frame, and a change in the scene, using the machine learning model 451, such as a Transformer-based sequence encoder, as explained earlier. In an embodiment, for each frame, at least one secondary object is identified from the plurality of objects by the focus determination module 450 in a similar manner as described earlier. The scene change is detected by the scene change detection module 300 based on analysis of a set of consecutive frames, and a scene change tag is assigned to the frame as explained earlier.

[0101] At step 812, a portrait frame corresponding to each of the plurality of frames is generated by the portrait frame generation module 500. The portrait frame corresponding to each landscape frame is produced based on the primary object, e.g., the location, size, and contextual relevance of the primary object in a similar manner as explained earlier.

[0102] At step 814, a portrait video is generated by a video generation module 600 based upon the plurality of portrait frames. The generated portrait video corresponds to a transformed version of the original landscape video, with the primary object maintained within the field of view in each frame, as explained earlier.

[0103] The performance of the system 100 was benchmarked against conventional AI-based techniques using a football video. The conventional system achieved normalized accuracy of 0.28 - 0.47 and 0.77 - 0.90 when detecting the ball and players, respectively. The corresponding values for the system 100 were 0.97 and 0.98, respectively. Thus, the system 100 demonstrated near-complete identification of relevant objects in sports videos when compared to conventional systems. ID switching was used as a metric to measure the performance of both systems in object tracking. The conventional system resulted in 617 ID switches, whereas the system 100 resulted in 42 ID switches, thus exhibiting superior performance through customizations (e.g., adaptive matching thresholds) in the object tracking algorithm as described herein. The effectiveness of the portrait framing was also assessed quantitatively: context-aware tracking and focal point determination were evaluated using temporal alignment metrics, including Intersection over Union (IoU) and Coverage, which compare an AI-generated bounding span corresponding to the tracked ROI region with a manually annotated ground-truth span. In this one-dimensional temporal formulation, IoU reflects the tightness of overlap between the predicted and ground-truth spans, while Coverage measures the proportion of the AI-generated span that correctly aligns with the intended ROI region, thereby providing objective indicators of tracking robustness during occlusions, overlaps, and rapid motion events. The quality of the portrait video generated by the conventional system and the system 100 was assessed both qualitatively and quantitatively vis-a-vis ground truth. The conventional system exhibited incomplete temporal coverage, delayed adaptation to subject transitions, unstable framing during pans, zooms, and cuts, and an overall accuracy of approximately 75%, manifested as intermittent reductions in IoU and consecutive low-coverage segments. In contrast, the system 100 (having the capability of context-aware, dynamic primary object-based portrait frame generation) resulted in consistent alignment between the ROI segments generated by the system 100 and the ground-truth focus regions. The system 100 also had improved framing stability and a marked reduction in low-coverage intervals, yielding accuracy levels in the range of 90% to 95%. The observed reduction in ID switches from 617 to 42 further evidences the technical advantage of the disclosed approach. The adaptive thresholds, view-adaptive tuning, occlusion handling, and temporal context modeling collectively improve identity preservation and tracking reliability for downstream applications, including performance analytics, automated highlight generation, and portrait-mode video rendering.
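The two span metrics named above can be written down directly for one-dimensional spans. The sketch below assumes each span is a (start, end) pair on a common axis and that Coverage is the overlap divided by the length of the AI-generated span, which is how the text describes it; the function names are hypothetical.

```python
def span_overlap(pred, truth):
    """Length of the intersection of two (start, end) spans."""
    return max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))


def span_iou(pred, truth):
    """Intersection over Union of two 1-D spans."""
    inter = span_overlap(pred, truth)
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0


def span_coverage(pred, truth):
    """Fraction of the predicted span that lies inside the ground-truth span."""
    length = pred[1] - pred[0]
    return span_overlap(pred, truth) / length if length > 0 else 0.0
```

For a predicted span (2, 8) and a ground-truth span (4, 10), the overlap is 4, giving an IoU of 0.5 and a Coverage of about 0.67.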

[0104] The present disclosure offers several advantages over conventional editing systems. The AI-enabled editing system proposed by the present disclosure automates the reframing of videos in landscape format into portrait format. The system is configured to quickly generate portrait videos from live sports feeds, highlight reels, and event recaps, making it possible to deliver timely, mobile-optimized content to the viewers. For example, during a live football match, broadcasters can use the editing system to create portrait format (e.g., vertical) highlight clips in real-time or near-real-time, focusing on key actions like goals, tackles, and closeups of players celebrating. This automation significantly reduces human effort, operational delay, cost, and dependency on large editing teams.

[0105] Further, the system provides improved accuracy and consistency by deploying AI-enabled techniques for object detection, scene change detection, and primary object determination, especially for sports video, which involves fast-moving, small, and / or occluding objects. For example, in a cricket match, the video editing system is capable of dynamically tracking the ball from the bowler's release, through the batter's shot, and to the fielder's catch or boundary play. This consistent tracking ensures that each phase of the play is in focus without losing context, providing fans with a high-quality, immersive experience on mobile screens. The system ensures superior framing accuracy, consistency, and contextual preservation.

[0106] Furthermore, the editing system adapts to a broad spectrum of content types, each with unique framing demands. The system dynamically adjusts object tracking strategies, cropping logic, and portrait-framing rules based on the detected content type, ensuring optimal preservation of context and semantic relevance. The system supports real-time, software-based live streaming in portrait mode, enabling broadcasters to dynamically reframe live video feeds without specialized vertical-format cameras or hardware switching systems. The proposed system enhances flexibility, reduces equipment requirements, and allows rapid deployment across multiple production environments. The system is designed to optimize video for mobile consumption and social-media distribution, addressing the global shift toward mobile-first viewing. It automatically generates 9:16 or other portrait-oriented outputs that comply with platform-specific constraints, including safe-zone rules, content visibility requirements, and maximum cropping thresholds. The system also supports adaptation to other device categories, including tablets, portrait-enabled televisions, foldable displays, and public digital signage, providing an integrated solution for multi-device content delivery.

[0107] The scope of the invention is only limited by the appended patent claims. More generally, those skilled in the art will readily appreciate that all parameters, dimensions, materials, and configurations described herein are meant to be exemplary and that the actual parameters, dimensions, materials, and / or configurations will depend upon the specific application or applications for which the teachings of the present invention is / are used.

Claims

WE CLAIM

1. A video editing system (100) for converting a landscape video into a portrait video, the video editing system (100) comprising: a. a processing unit (650) comprising one or more processors (655) and a memory (660) coupled to the one or more processors (655) and storing a set of instructions, that when executed by the one or more processors (655), cause the processing unit (650) to: i. receive an input video in a landscape format; ii. extract a plurality of frames from the input video; iii. identify a plurality of objects within each frame of the plurality of frames using a first neural network (251), each frame associated with a corresponding scene; iv. track the plurality of objects across the plurality of frames to generate a trajectory for each of the plurality of objects using a tracking algorithm; v. dynamically determine, for each frame, a primary object and at least one secondary object from the plurality of objects based upon the trajectory of the plurality of objects, activity score of the plurality of objects within the scene associated with the frame, and a change in the scene, using a machine learning model (451); vi. generate a portrait frame corresponding to each frame of the plurality of frames based upon the primary object, such that the primary object and at least one secondary object are positioned within the portrait frame; and vii. generate a portrait video based upon the plurality of portrait frames.

2. The video editing system (100) as claimed in claim 1, wherein the set of instructions causes the processing unit (650) to: a. extract a predefined set of features from a set of consecutive frames comprising a current frame and one or more adjacent frames to the current frame, wherein the set of predefined features comprises one or more of: pixel-level color histograms, the number of the plurality of objects, positions of the plurality of objects, and texture patterns using a deep learning model; and b. determine whether the change in the scene has occurred based upon the set of predefined features.
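By way of illustration only, the scene-change determination of claim 2 can be sketched in Python using just one of the listed features, the pixel-level color histogram. This sketch is not the deep learning model recited in the claim: it substitutes a fixed threshold on the histogram distance between two consecutive frames, and the function names `color_histogram` and `is_scene_change`, the bin count, and the threshold value are all assumptions made for the example.

```python
import numpy as np

def color_histogram(frame: np.ndarray, bins: int = 16) -> np.ndarray:
    """Per-channel histogram of an HxWx3 uint8 frame, normalized to sum to 1."""
    hists = [
        np.histogram(frame[..., c], bins=bins, range=(0, 256))[0]
        for c in range(3)
    ]
    hist = np.concatenate(hists).astype(np.float64)
    return hist / hist.sum()

def is_scene_change(prev_frame: np.ndarray, cur_frame: np.ndarray,
                    threshold: float = 0.5) -> bool:
    """Flag a scene change when the L1 histogram distance exceeds a threshold.

    The distance lies in [0, 2]. The threshold is a hypothetical value
    standing in for the learned decision described in the claim.
    """
    distance = np.abs(
        color_histogram(prev_frame) - color_histogram(cur_frame)
    ).sum()
    return bool(distance > threshold)
```

In a fuller version, the object count, object positions, and texture descriptors would be concatenated with the histogram and passed to a trained classifier instead of a fixed threshold.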

3. The video editing system (100) as claimed in claim 1, wherein the plurality of objects comprises one or more of: participants, game-play objects, sport-arena structures, or combinations thereof.

4. The video editing system (100) as claimed in claim 1, wherein the set of instructions causes the processing unit (650) to estimate the trajectory of each object based on a bounding box, a class label, and a confidence score of each of the plurality of objects using the tracking algorithm.

5. The video editing system (100) as claimed in claim 4, wherein the set of instructions causes the processing unit (650) to, for the trajectory of each object of the plurality of objects in a current frame: a. adapt a matching threshold for each trajectory based upon a speed of the corresponding object and scene complexity; b. predict a next state of the trajectory of the object based upon a current state of the trajectory of the object using a constant-velocity Kalman filter; c. compute a cost matrix between the predicted next state of the trajectory and positions of the bounding boxes (404) of the plurality of objects identified in the current frame; d. apply a Hungarian algorithm to the cost matrix to associate the predicted trajectory with the positions of the bounding boxes (404) of the plurality of objects identified in the current frame; e. compare the association against the respective adaptive matching threshold; and f. update the trajectory based upon the comparison.
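The association steps a. to e. of claim 5 may be sketched as follows, under stated assumptions. Only the state-mean prediction of a constant-velocity Kalman filter is shown (covariance propagation and the measurement update are omitted), the cost is the Euclidean distance between predicted and detected centers, and the assignment is solved with SciPy's `linear_sum_assignment`, which solves the same assignment problem as the Hungarian algorithm. The threshold formula in `adaptive_threshold` and its constants are hypothetical, since the claim does not specify how speed and scene complexity adjust the threshold.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def predict_constant_velocity(state: np.ndarray, dt: float = 1.0) -> np.ndarray:
    """Advance a state [x, y, vx, vy] by dt under a constant-velocity model."""
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)
    return F @ state

def adaptive_threshold(speed: float, scene_complexity: float,
                       base: float = 30.0) -> float:
    """Hypothetical rule: allow larger gaps for fast objects and busy scenes."""
    return base * (1.0 + 0.05 * speed) * (1.0 + scene_complexity)

def associate(track_states: np.ndarray, detections: np.ndarray,
              scene_complexity: float = 0.0) -> list:
    """Match tracks (T x 4 states) to detected box centers (D x 2).

    Returns (track_index, detection_index) pairs whose cost is within the
    per-track adaptive threshold; rejected pairs are left unmatched.
    """
    if len(track_states) == 0 or len(detections) == 0:
        return []
    predicted = np.array([predict_constant_velocity(s)[:2] for s in track_states])
    # Cost matrix: distance from each predicted center to each detection.
    cost = np.linalg.norm(predicted[:, None, :] - detections[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)
    matches = []
    for r, c in zip(rows, cols):
        speed = float(np.linalg.norm(track_states[r, 2:]))
        if cost[r, c] <= adaptive_threshold(speed, scene_complexity):
            matches.append((int(r), int(c)))
    return matches
```

The caller would then update each matched trajectory with its detection, and decide whether unmatched trajectories are kept, coasted on the prediction, or terminated.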

6. The video editing system (100) as claimed in claim 1, wherein the set of instructions causes the processing unit (650) to generate a tag for each object of the plurality of objects, the tag comprising one or more of: a bounding box, an object class identifier, and a confidence metric.

7. The video editing system (100) as claimed in claim 1, wherein the set of instructions causes the processing unit (650) to, for each frame: a. identify a Region of Interest (ROI) segment within the frame, the ROI segment defined by a corresponding position and size determined based upon a distance between the primary object and the at least one secondary object; b. crop a portion of the frame outside the ROI segment to generate a cropped frame; and c. generate the portrait frame based upon the cropped frame.
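A minimal sketch of the ROI selection and cropping of claim 7 is given below, assuming a single secondary object and a 9:16 target. The sizing rule (window width grows with the horizontal gap between the two objects, bounded by a hypothetical minimum and by the frame height) and the names `portrait_roi`, `crop_to_portrait`, `margin`, and `aspect` are assumptions for illustration; the claim only requires that position and size depend on the distance between the objects.

```python
import numpy as np

def portrait_roi(frame_w, frame_h, primary_xy, secondary_xy,
                 aspect=9 / 16, margin=1.2):
    """Return (x0, y0, w, h) of a portrait window inside the frame."""
    px, py = primary_xy
    sx, _ = secondary_xy
    # Widest portrait window that still fits the frame height.
    max_w = min(frame_w, int(frame_h * aspect))
    min_w = max_w // 2  # hypothetical lower bound to limit zoom-in
    gap = abs(px - sx)
    # Size: wide enough to span both objects plus a margin, within bounds.
    w = max(min_w, min(int(gap * margin), max_w))
    h = min(frame_h, int(w / aspect))
    # Position: center between the objects if both fit, else on the primary.
    cx = (px + sx) / 2 if gap <= w else px
    x0 = int(min(max(cx - w / 2, 0), frame_w - w))
    y0 = int(min(max(py - h / 2, 0), frame_h - h))
    return x0, y0, w, h

def crop_to_portrait(frame: np.ndarray, roi) -> np.ndarray:
    """Discard everything outside the ROI of an HxWxC frame."""
    x0, y0, w, h = roi
    return frame[y0:y0 + h, x0:x0 + w]
```

The cropped frame would normally be resized to the output resolution (for example 1080x1920) as the final step; that resize is left out here to keep the sketch dependency-free.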

8. The video editing system (100) as claimed in claim 1, wherein the set of instructions causes the processing unit (650) to: a. build a feature vector for one or more candidate objects of the plurality of objects, the feature vector comprising one or more of: normalized center, width and height of bounding boxes (404) of the candidate objects, a class label for the candidate objects encoded as one-hot, a confidence score for the candidate objects, magnitude of trajectory velocity of the candidate objects, proximity to other objects, an activity score for the candidate objects; b. provide the feature vector to the machine learning model (451), wherein the machine learning model (451) provides a relevance score for each candidate object based upon the feature vector; and c. identify the candidate object with the highest relevance score as the primary object and determine coordinates of the primary object.
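The feature-vector construction and primary-object selection of claim 8 can be illustrated as follows. The machine learning model (451) is not described in enough detail here to reproduce, so a plain linear scorer with caller-supplied weights stands in for it; the class list `CLASSES`, the function names, and the feature ordering are likewise assumptions made for the example.

```python
import numpy as np

# Hypothetical class vocabulary used for the one-hot encoding.
CLASSES = ["player", "ball", "structure"]

def build_feature_vector(box, class_label, confidence, velocity,
                         proximity, activity, frame_w, frame_h) -> np.ndarray:
    """Assemble the per-candidate features listed in the claim.

    box is (x0, y0, x1, y1) in pixels; velocity is (vx, vy) in pixels/frame.
    """
    x0, y0, x1, y1 = box
    geometry = [
        (x0 + x1) / 2 / frame_w,   # normalized center x
        (y0 + y1) / 2 / frame_h,   # normalized center y
        (x1 - x0) / frame_w,       # normalized width
        (y1 - y0) / frame_h,       # normalized height
    ]
    one_hot = [1.0 if c == class_label else 0.0 for c in CLASSES]
    speed = float(np.hypot(velocity[0], velocity[1]))
    return np.array(geometry + one_hot + [confidence, speed, proximity, activity])

def select_primary(features: np.ndarray, weights: np.ndarray):
    """Score each candidate (rows of features) and pick the highest.

    A linear scorer is used only as a stand-in for the trained model.
    """
    scores = features @ weights
    return int(np.argmax(scores)), scores
```

For example, with weights that favor the "ball" class and a high activity score, a fast-moving ball would be selected over a stationary player, and its bounding-box center would be taken as the coordinates of the primary object.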

9. The video editing system (100) as claimed in claim 8, wherein the set of instructions causes the processing unit (650) to: a. calculate acceleration of the primary object based upon predicted coordinates by the machine learning model (451) for the primary object for a set of previous frames stored in a temporal buffer, the set of previous frames comprising between 15 and 30 frames; b. calculate a smoothing weight using w(t) =+ y. poly(t, ri); c. compare the smoothing weight with a weight threshold; d. when the smoothing weight is greater than the weight threshold, apply one of Savgol filter or Kalman filter to the predicted coordinates to obtain smoothed predicted coordinates; and e. apply spline interpolation to the predicted coordinates or the smoothed predicted coordinates to determine the coordinates of the primary object.
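The conditional smoothing of claim 9 can be sketched as below, with one important caveat: the smoothing-weight formula is not legible in this text, so the sketch does not implement it. As a loudly labeled stand-in, the weight is taken to be the mean acceleration magnitude over the buffer. The Savitzky-Golay option is shown (via `scipy.signal.savgol_filter`), followed by cubic spline interpolation (via `scipy.interpolate.CubicSpline`); the function name, window length, polynomial order, and threshold are assumptions.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import savgol_filter

def smooth_primary_path(coords, weight_threshold=1.0, window=9,
                        polyorder=2, query_t=None) -> np.ndarray:
    """Smooth buffered (x, y) predictions and interpolate them with a spline.

    coords: N x 2 predicted centers for the buffered frames (the claim
    uses 15 to 30 frames, so N is assumed to be at least `window`).
    query_t: frame times to evaluate; defaults to the buffered frames.
    """
    coords = np.asarray(coords, dtype=float)
    # Acceleration as the second finite difference, in pixels/frame^2.
    accel = np.diff(coords, n=2, axis=0)
    # ASSUMPTION: stand-in weight, not the formula recited in the claim.
    weight = float(np.linalg.norm(accel, axis=1).mean())
    if weight > weight_threshold:
        coords = savgol_filter(coords, window_length=window,
                               polyorder=polyorder, axis=0)
    t = np.arange(len(coords))
    spline = CubicSpline(t, coords, axis=0)
    return spline(t if query_t is None else np.asarray(query_t, dtype=float))
```

A steady pan therefore passes through unchanged, while a jittery trajectory is filtered before the crop window follows it; evaluating the spline at fractional times gives sub-frame positions if the output frame rate differs from the input.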

10. A method (800) for converting a landscape video into a portrait video, the method (800) comprising: a. receiving, by a processing unit (650), an input video in a landscape format; b. extracting, by the processing unit (650), a plurality of frames from the input video; c. identifying, by the processing unit (650), a plurality of objects within each frame of the plurality of frames using a first neural network (251), each frame associated with a corresponding scene; d. tracking, by the processing unit (650), the plurality of objects across the plurality of frames to generate a trajectory of each of the plurality of objects using a tracking algorithm; e. dynamically determining, by the processing unit (650), for each frame, a primary object and at least one secondary object from the plurality of objects based upon the trajectory of the plurality of objects, activity of the plurality of objects within the scene associated with the frame, and a change in the scene, using a machine learning model (451); f. generating, by the processing unit (650), a portrait frame corresponding to each frame of the plurality of frames based upon the primary object such that the primary object and at least one secondary object are positioned within the portrait frame; and g. generating, by the processing unit (650), a portrait video based upon the plurality of portrait frames.

11. The method (800) as claimed in claim 10, wherein the method (800) comprises: a. extracting, by the processing unit (650), a predefined set of features from a set of consecutive frames, including a current frame and one or more frames adjacent to the current frame, the predefined set of features comprising one or more of: pixel-level color histograms, the plurality of objects, object positions, and texture patterns; and b. determining, by the processing unit (650), whether a change in the scene has occurred based upon the predefined set of features.

12. The method (800) as claimed in claim 10, wherein the plurality of objects comprises at least one of participants, game-play objects, and sport-arena structures, or a combination thereof.

13. The method (800) as claimed in claim 10, wherein the step of tracking the plurality of objects comprises estimating a trajectory of each object based on a bounding box, a class label, and a confidence score of each of the plurality of objects.

14. The method (800) as claimed in claim 13, wherein the step of estimating the trajectory for each object of the plurality of objects comprises: a. adapting a matching threshold for each trajectory based upon a speed of the corresponding object and scene complexity; b. predicting a next state of the trajectory of the object based upon a current state of the trajectory of the object using a constant-velocity Kalman filter; c. computing a cost matrix between the predicted next state of the trajectory and positions of the bounding boxes (404) of the plurality of objects identified in the current frame; d. applying a Hungarian algorithm to the cost matrix to associate the predicted trajectory with the positions of the bounding boxes (404) of the plurality of objects identified in the current frame; e. comparing the association against the respective adaptive matching threshold; and f. updating the trajectory based upon the comparison.

15. The method (800) as claimed in claim 10, wherein the step of identifying the plurality of objects comprises generating, for each identified object, a tag comprising one or more of: a bounding box, an object class identifier, and a confidence metric.

16. The method (800) as claimed in claim 10, wherein the step of generating the portrait frame comprises: a. identifying a Region of Interest (ROI) segment within the frame, the ROI segment defined by a corresponding position and size determined based upon a distance between the primary object and the at least one secondary object; b. cropping a portion of the frame outside the ROI segment to generate a cropped frame; and c. generating the portrait frame based upon the cropped frame.

17. The method (800) as claimed in claim 10, wherein the step of dynamically determining the primary object comprises: a. building a feature vector for one or more candidate objects of the plurality of objects, the feature vector comprising one or more of: normalized center, width and height of bounding boxes (404) of the candidate objects, a class label for the candidate objects encoded as one-hot, a confidence score for the candidate objects, magnitude of trajectory velocity of the candidate objects, proximity to other objects, an activity score for the candidate objects; b. providing the feature vector to the machine learning model (451) to obtain a relevance score for each candidate object based upon the feature vector; and c. identifying the candidate object with the highest relevance score as the primary object and determining coordinates of the primary object.

18. The method (800) as claimed in claim 17, wherein the step of determining the coordinates of the primary object comprises: a. calculating acceleration of the primary object based upon predicted coordinates by the machine learning model (451) for the primary object for a set of previous frames stored in a temporal buffer, the set of previous frames comprising between 15 and 30 frames; b. calculating a smoothing weight using w(t)_|_ y. poly(t, n'); c. comparing the smoothing weight with a weight threshold; d. when the smoothing weight is greater than the weight threshold, applying one of Savgol filter or Kalman filter to the predicted coordinates to obtain smoothed predicted coordinates; and e. applying spline interpolation to the predicted coordinates or the smoothed predicted coordinates to determine the coordinates of the primary object.