A method, system, and apparatus for facilitating transcription and captioning of multi-media content are presented. The method, system, and apparatus include automatic multi-media analysis operations that produce information which is presented to an operator as suggestions for spoken words, spoken word timing, caption segmentation, caption playback timing, caption mark-up such as non-spoken cues or speaker identification, caption formatting, and caption placement. Spoken word suggestions are primarily created through an automatic speech recognition operation, but may be enhanced by leveraging other elements of the multi-media content, such as correlated text and imagery, by using text extracted with an optical character recognition operation. Also included is an operator interface that allows the operator to efficiently correct any of the aforementioned suggestions. In the case of word suggestions, in addition to best-hypothesis word choices being presented to the operator, alternate word choices are presented for quick selection via the operator interface. Ongoing operator corrections can be leveraged to improve the remaining suggestions. Additionally, an automatic multi-media playback control capability further assists the operator during the correction process.
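
The handling of word suggestions described above can be illustrated with a minimal sketch. It is not the disclosed implementation: the `WordSuggestion` structure, the `select_alternate` function, and the re-ranking rule are all hypothetical, chosen only to show how a best-hypothesis word, its alternates, and an operator correction might interact. The sketch assumes that when the operator replaces one word with an alternate, later unconfirmed suggestions carrying the same rejected word and the same alternate are updated to match.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WordSuggestion:
    """One suggested spoken word with timing and ranked alternates (hypothetical structure)."""
    best: str                      # best-hypothesis word shown to the operator
    start: float                   # suggested start time, in seconds
    end: float                     # suggested end time, in seconds
    alternates: List[str] = field(default_factory=list)
    confirmed: bool = False        # True once the operator has reviewed this word


def select_alternate(words: List[WordSuggestion], index: int, choice: str) -> None:
    """Apply an operator's word choice and propagate it to later suggestions.

    Assumption (not from the source): a correction from word A to word B is
    reused for later, unconfirmed suggestions whose best hypothesis is A and
    whose alternates already include B.
    """
    word = words[index]
    rejected = word.best
    word.confirmed = True
    if choice == rejected:
        return  # operator accepted the best hypothesis; nothing to propagate

    # Swap the chosen word in and keep the rejected word as the top alternate.
    word.alternates = [rejected] + [a for a in word.alternates if a != choice]
    word.best = choice

    # Leverage the correction to improve the remaining suggestions.
    for later in words[index + 1:]:
        if (not later.confirmed and later.best == rejected
                and choice in later.alternates):
            later.alternates = [rejected] + [a for a in later.alternates if a != choice]
            later.best = choice


words = [
    WordSuggestion("caption", 0.0, 0.5),
    WordSuggestion("plane", 0.5, 0.9, ["playback", "play"]),
    WordSuggestion("timing", 0.9, 1.3),
    WordSuggestion("plane", 2.0, 2.4, ["plain", "playback"]),
]
select_alternate(words, 1, "playback")
print([w.best for w in words])
```

In this sketch the later suggestion is changed but left unconfirmed, so the operator still reviews it; a real system might instead re-score the remaining suggestions with the speech recognizer.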