“Skimming” video, alternatively known as browsing, has been a technical challenge for a long time.
a. Fast forwarding: Fast forwarding (also known as increasing the video playback speed) shortens the video viewing time. However, speeding up the video rate distorts the video information and may eliminate short events. This method has been the most popular browsing technique so far. Fast forwarding is discussed in more detail below.
b. Text Based Queries: This refers to querying metadata associated with the full length video or video chapters for specific textual information. For example, a text based query may take the form “scene with George falling off the bridge”. Text based queries today require the video to be annotated, mostly by a manual process, before it can be queried. Although text based video query has existed for a long time, only a few applications can afford the intense human effort needed to intelligently categorize and annotate the videos. One example of video content containing metadata that enables text based queries is the medical records used in some systems.
c. Automatic Indexing: In the academic literature [for example, Cees G. M. Snoek and Marcel Worring, “Multimodal Video Indexing: A Review of the State-of-the-art,” Multimedia Tools and Applications, Volume 25, Number 1 / January, 2005, Springer], techniques have been proposed to automatically index video for browsing representations based on information within the video. These indexing systems can use, for example, any of the following information aspects to generate video chapters:
Motion of the video;
Scene changes;
Image statistics, such as color and shape;
Audio information; and/or
Specific object types in the video.
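As an illustration of how scene changes and image statistics could drive chapter generation, the following is a minimal sketch, not a method taken from the cited literature. It assumes each frame is a flat sequence of 8-bit gray levels and marks a chapter boundary wherever the histogram of consecutive frames differs by more than a threshold. The function names, the bin count, and the threshold value are all hypothetical choices made for this example.

```python
from typing import List, Sequence


def gray_histogram(frame: Sequence[int], bins: int = 16) -> List[float]:
    """Normalized histogram of the 8-bit gray levels in one frame."""
    counts = [0] * bins
    for pixel in frame:
        counts[pixel * bins // 256] += 1
    total = len(frame)
    return [count / total for count in counts]


def detect_scene_changes(
    frames: Sequence[Sequence[int]], bins: int = 16, threshold: float = 0.5
) -> List[int]:
    """Return indices of frames that start a new scene.

    The distance is half the L1 difference between consecutive
    normalized histograms, so it ranges from 0.0 (identical
    distributions) to 1.0 (no overlap). The threshold is an
    assumed, illustrative value.
    """
    if not frames:
        return []
    boundaries = []
    previous = gray_histogram(frames[0], bins)
    for index in range(1, len(frames)):
        current = gray_histogram(frames[index], bins)
        distance = 0.5 * sum(abs(a - b) for a, b in zip(previous, current))
        if distance > threshold:
            boundaries.append(index)
        previous = current
    return boundaries
```

A histogram-only detector of this kind ignores motion, audio, and object content, so a scene whose gray-level distribution resembles that of its neighbor would pass unnoticed.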
Today, with any form of automated video skim generation, a scene in which a user may be interested frequently goes unidentified by the skimming process.
In summary, the automated context-sensitive generation of video skims, despite the significant research conducted over the past decade, has remained a difficult task that requires high computational complexity and involves human interaction, such as filtering and processing.
(1) The search may still take a long time depending on where the specific video segment of interest is located within the full length video sequence (particularly if it is located towards the end).
(2) The video segment of interest may be made unnoticeable or totally lost during sub-sampling as it may fall on the deleted frames (especially when large sub-sampling intervals are in use).
(3) The associated audio information, if any, often cannot be meaningfully presented.
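The first two disadvantages can be made concrete with a small sketch. It assumes that temporal sub-sampling keeps frames 0, step, 2*step, and so on, and that the kept frames are shown at the original frame rate. The function names and the example numbers are hypothetical.

```python
def first_kept_frame(event_start: int, step: int) -> int:
    """Index of the first retained frame at or after event_start,
    assuming frames 0, step, 2*step, ... are the ones kept."""
    return -(-event_start // step) * step  # ceiling division, then scale


def event_survives(event_start: int, event_length: int, step: int) -> bool:
    """True if at least one frame of the event is retained."""
    return first_kept_frame(event_start, step) < event_start + event_length


def seconds_until_shown(event_start: int, step: int, fps: float) -> float:
    """Viewing time that elapses before the first kept frame at or
    after the event start is displayed, with kept frames shown at fps."""
    return first_kept_frame(event_start, step) / step / fps
```

Under these assumptions, a five-frame event starting at frame 101 is still visible with a step of 8 (frame 104 is kept) but disappears entirely with a step of 16 (the next kept frame is 112). The waiting time grows in proportion to the event's position in the sequence.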
This process may require “spatial sub-sampling” to reduce the resolution of the original video to fit into smaller windows, because of display size limitations, as illustrated in FIG. 2.
(1) Performing spatial sub-sampling in real-time to generate smaller versions of the full-length video is computationally intensive and time-consuming. Depending on how many windows are generated and the size of each window, the sub-sampling may require significant computing resources.
(2) Information may be lost during sub-sampling due to side effects of spatial sub-sampling, such as filtering or aliasing.
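Both disadvantages can be illustrated with a short sketch that assumes an image is a list of rows of gray levels. Plain decimation (keeping every n-th pixel) is cheap but aliases, while a box filter avoids the aliasing by averaging fine detail away. The function names are hypothetical, and the cost estimate assumes every window is produced from its own full-resolution source frame.

```python
from typing import List, Sequence


def decimate(image: Sequence[Sequence[float]], factor: int) -> List[List[float]]:
    """Keep every factor-th pixel in both directions (no filtering)."""
    return [list(row[::factor]) for row in image[::factor]]


def box_downsample(image: Sequence[Sequence[float]], factor: int) -> List[List[float]]:
    """Average each factor x factor block into one output pixel."""
    height = len(image) // factor
    width = len(image[0]) // factor
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            block = [
                image[y * factor + dy][x * factor + dx]
                for dy in range(factor)
                for dx in range(factor)
            ]
            row.append(sum(block) / len(block))
        result.append(row)
    return result


def input_pixels_per_second(width: int, height: int, fps: int, windows: int) -> int:
    """Source pixels read per second when each window is generated
    from its own full-resolution frame (an assumption of this sketch)."""
    return width * height * fps * windows
```

Applied to an image of alternating one-pixel white and black columns with a factor of 2, `decimate` returns a uniformly white image (aliasing), whereas `box_downsample` returns a uniform mid-gray (detail removed by filtering). In both cases the stripes are no longer recoverable.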
However, a compressed video file cannot be temporally sub-sampled at arbitrary positions, because compressed frames may depend on other frames due to inter-picture prediction.
If there are no IDR frames, or if they occur infrequently, fast forwarding will not be feasible without decoding a large percentage of the coded pictures of the full length video sequence.
(2) As the number of IDR frames increases, the compression ratio decreases. The transcoded full length sequence with a higher number of IDR frames may be significantly larger than the original compressed full length sequence.
(3) The disadvantages of fast forwarding with raw files still remain.
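The trade-off between IDR frequency, decoding effort, and file size can be sketched as follows. The sketch assumes a simplified coding structure in which every non-IDR frame is predicted from the frame immediately before it, so reaching a frame requires decoding everything from the preceding IDR frame onward. The function names and the per-frame bit counts used in the example are hypothetical.

```python
def frames_decoded_for_fast_forward(num_frames: int, idr_interval: int, step: int) -> int:
    """Number of coded pictures that must be decoded to display
    frames 0, step, 2*step, ... when an IDR frame occurs every
    idr_interval frames and every other frame depends on its
    predecessor (a simplifying assumption)."""
    decoded = set()
    for target in range(0, num_frames, step):
        nearest_idr = (target // idr_interval) * idr_interval
        decoded.update(range(nearest_idr, target + 1))
    return len(decoded)


def estimated_stream_bits(
    num_frames: int, idr_interval: int, idr_bits: int, predicted_bits: int
) -> int:
    """Rough stream size given assumed average sizes for IDR frames
    and predicted frames."""
    idr_count = -(-num_frames // idr_interval)  # ceiling division
    return idr_count * idr_bits + (num_frames - idr_count) * predicted_bits
```

For a 240-frame sequence shown at a step of 8, a single IDR frame at the start forces 233 of the 240 pictures to be decoded, while an IDR frame every 8 frames reduces this to 30. With assumed sizes of 100,000 bits per IDR frame and 20,000 bits per predicted frame, the same change raises the estimated stream size from 4,880,000 to 7,200,000 bits.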
The process of sub-dividing a compressed file suffers from disadvantages similar to those of temporal sub-sampling.
Further, using traditional video compression technologies, spatial sub-sampling is not possible in the compressed domain.
Moreover, although use of a compressed video file eliminates the disadvantage of storing a large file, the need to decode the file several times in real-time introduces significant additional cost and processing complexity to spatial re-sampling.
If the full length video file, be it in raw or compressed format, is not available locally, the problem of video skimming according to the described techniques is further exacerbated by the need to retrieve it in real-time to a local computer over a network like the public Internet.
In particular, if the file is in raw format, the bandwidth requirements are impractically large (e.g., 45 Mbps for a reasonable-speed download of an SDTV resolution sequence).
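The scale of the problem can be estimated with a short calculation. The text does not state which frame size, frame rate, or sampling format lies behind its figure, so the values used in the usage note below (720x480 pixels, 30 frames per second, 12 bits per pixel as in 8-bit 4:2:0 sampling) are assumptions chosen for illustration, as are the function names.

```python
def raw_bitrate_bps(width: int, height: int, fps: int, bits_per_pixel: int) -> int:
    """Bit rate of uncompressed video in bits per second."""
    return width * height * bits_per_pixel * fps


def transfer_seconds(duration_s: float, bitrate_bps: float, link_bps: float) -> float:
    """Time needed to move duration_s seconds of video encoded at
    bitrate_bps over a link that sustains link_bps."""
    return duration_s * bitrate_bps / link_bps
```

Under the assumed parameters, `raw_bitrate_bps(720, 480, 30, 12)` gives 124,416,000 bits per second, and one minute of such video would take roughly 166 seconds to transfer over a 45 Mbps link.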
Accordingly, given the issues of using raw and compressed video, and using temporal and spatial sub-sampling, there has not been an acceptable implementation of a practical real-time video skimmer in the marketplace.