A fast variational on-line learning technique for training a transformed hidden Markov model. A simplified general model and an associated estimation algorithm are provided for modeling visual data such as a video sequence. Specifically, once the model has been initialized, an expectation-maximization (“EM”) algorithm is used to learn one or more object class models, so that the video sequence has high marginal probability under the model. In the expectation step (the “E-Step”), the model parameters are assumed to be correct, and for an input image, probabilistic inference is used to fill in the values of the unobserved or hidden variables, e.g., the object class and appearance. In one embodiment of the invention, a Viterbi algorithm and a latent image are employed for this purpose. In the maximization step (the “M-Step”), the model parameters are adjusted using the values of the unobserved variables calculated in the previous E-Step. Instead of the batch processing typically used in EM, the system and method according to the invention employ an on-line algorithm that passes through the data only once and introduces new classes as new data is observed. Through parameter estimation and inference in the model, visual data is segmented into components, which facilitates sophisticated applications in video or image editing, such as object removal or insertion, tracking and visual surveillance, video browsing, photo organization, video compositing, and metadata creation.
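
The single-pass E-Step/M-Step loop with on-the-fly class creation described above can be illustrated with a heavily simplified sketch. It is not the transformed hidden Markov model itself: it omits transformations, temporal transitions, the Viterbi algorithm, and the latent image, and instead uses plain isotropic Gaussian classes over small feature vectors. The class name `OnlineMixture`, the fixed variance `var`, and the novelty threshold `new_class_threshold` are all hypothetical choices made only for illustration.

```python
import math


class OnlineMixture:
    """Illustrative single-pass EM-style learner (not the patented model)."""

    def __init__(self, var=1.0, new_class_threshold=-10.0):
        self.var = var                        # fixed shared variance (assumption)
        self.threshold = new_class_threshold  # hypothetical log-likelihood cutoff
        self.means = []                       # one mean vector per class
        self.counts = []                      # accumulated responsibility per class

    def _log_lik(self, x, mean):
        # Log density of x under an isotropic Gaussian centred at mean.
        sq = sum((a - b) ** 2 for a, b in zip(x, mean))
        return -0.5 * (sq / self.var + len(x) * math.log(2 * math.pi * self.var))

    def update(self, x):
        """Process one observation once; return the index of its most likely class."""
        liks = [self._log_lik(x, m) for m in self.means]

        # Introduce a new class when no existing class explains the observation.
        if not liks or max(liks) < self.threshold:
            self.means.append(list(x))
            self.counts.append(1.0)
            return len(self.means) - 1

        # E-step: parameters held fixed, compute the posterior over classes.
        logs = [l + math.log(c) for l, c in zip(liks, self.counts)]
        top = max(logs)
        weights = [math.exp(v - top) for v in logs]
        total = sum(weights)
        resp = [w / total for w in weights]

        # M-step: incremental responsibility-weighted update of each class mean.
        for k, r in enumerate(resp):
            self.counts[k] += r
            step = r / self.counts[k]
            self.means[k] = [m + step * (a - m) for m, a in zip(self.means[k], x)]

        return max(range(len(resp)), key=resp.__getitem__)


# Usage: two well-separated groups of observations, seen once and in order.
model = OnlineMixture()
data = [(0.0, 0.0), (0.2, -0.1), (10.0, 10.0), (10.2, 9.9), (-0.1, 0.1)]
labels = [model.update(x) for x in data]
print(labels, len(model.means))
```

Each observation is visited exactly once, and the second class is created only when the third observation arrives, which mirrors the on-line behavior described above under these simplifying assumptions.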