Methods and apparatus for story-anchored content creation

By anchoring media presentations to story elements, the method provides flexible and immersive playback experiences with customizable options for media content, addressing the limitations of fixed timeline-based approaches.

WO2026128738A1 · PCT designated stage · Publication Date: 2026-06-18 · DOLBY LABORATORIES LICENSING CORP

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
DOLBY LABORATORIES LICENSING CORP
Filing Date
2025-12-11
Publication Date
2026-06-18


Abstract

The disclosure relates to a method of generating a media presentation comprising a plurality of audio tracks and metadata including playout conditions for playout of the audio tracks. The method comprises receiving the plurality of audio tracks, generating, for one or more audio tracks, a playout condition for the respective audio track in relation to an anchor in another audio track, and generating the metadata based on the generated playout conditions for the one or more audio tracks. The disclosure further relates to a corresponding method of processing a media presentation, and to corresponding apparatus, computer programs, and computer-readable storage media.

Description

[0001] METHODS AND APPARATUS FOR STORY-ANCHORED CONTENT CREATION

[0002] Cross-Reference to Related Applications

[0003] This application claims the benefit of priority from U.S. Provisional Application No. 63/935,888, filed on December 10, 2025, and U.S. Provisional Application No. 63/733,395, filed on December 12, 2024, each of which is incorporated by reference herein in its entirety.

[0004] Technical Field

[0005] The present disclosure relates to techniques for generating media presentations (e.g., audio experiences or media presentations including audio experiences) and for processing (e.g., decoding or rendering) such media presentations. In particular, the present disclosure relates to story-anchored content creation and to rendering of such content.

[0006] Background

[0007] When creating media presentations, including rich immersive audio content, for example an audiobook or radio drama, the audio is typically authored on a timeline and then rendered to a static format. For instance, it is common for the sound engineer to craft the audio experience in a Digital Audio Workstation (DAW), where sound objects such as voices, sound effects, music, and ambience are added to the timeline. Effects may be applied to each of these tracks, to a subset of tracks, or to the mix as a whole. When the experience is satisfactory it is rendered to a more compact format such as a mono or stereo audio file, or in some cases to an object-based format (for example, Dolby® Atmos®).

[0008] An example of such timeline-based content 100 is schematically illustrated in Fig. 1, where several audio tracks 10 are added to a timeline 110 or, more generally, are provided with a fixed relationship to the timeline 110.

[0009] However, the above approaches confine the listening experience to the final rendered result.

[0010] Thus, there is a need for improved techniques for media content (e.g., media presentation) creation. There is particular need for techniques for more flexible media content creation.

Summary

[0011] In view of this need, the present disclosure provides methods of generating media presentations (e.g., audio presentations, audio experiences) and of processing (e.g., decoding, rendering) media presentations, as well as corresponding apparatus, computer programs, and computer-readable storage media, having the features of respective independent claims.

[0012] An aspect of the disclosure relates to a method of generating a media presentation comprising a plurality of audio tracks and metadata including playout conditions (or playout information / instructions in general) for playout of the audio tracks (e.g., audio files). The media presentation may be an audio presentation (e.g., audio experience, immersive audio experience), such as an audio book or a podcast, for example. The audio presentation may be augmented with images and / or video, for example. The method may include obtaining (e.g., receiving) the plurality of audio tracks. The method may further include generating, for one or more audio tracks, a playout condition for the respective audio track in relation to an anchor in another audio track (e.g., a triggering audio track for the respective audio track) among the plurality of audio tracks. The anchor in the other audio track may relate to a termination of (audible) sound in the other audio track, or to any other story-based instance in the other audio track. The playout conditions may relate to or comprise a trigger for playout of the respective audio track. A playout condition may be generated for each but one audio track among the plurality of audio tracks, which is the first audio track to be played out. The method may yet further include generating the metadata based on the generated playout conditions for the one or more audio tracks. In other words, the method may include generating the metadata to indicate, for the one or more audio tracks, the playout condition for the respective audio track in relation to the anchor in the other audio track.

[0013] Configured as described above, the proposed method can provide a flexible format for media presentations that is not anchored on a fixed timeline, but that instead enables story-anchored playout. Since playout is not relative to a fixed timeline, such story-anchored playout allows for several customization options at the playback device that would not be available for prebaked or timeline-based media content.

[0014] In some embodiments, the method may further include generating, for one or more audio tracks among the plurality of audio tracks that relate to speech, an anchor representing a specific instance within the speech. This anchor may be generated at the line level (e.g., end of a given line of text in a script) or at the word level (e.g., first word or last word), for example. In some cases, multiple anchors may be generated for each of the one or more audio tracks among the plurality of audio tracks that relate to speech.

[0015] Thereby, the proposed playout conditions enable story-anchored playout of the audio tracks of the media presentation. Having available such anchors further enables advanced navigation through the media content at the playback device, such as line-based skipping or rewinding, etc.

[0016] In some embodiments, the playout condition for a given audio track may include an indication of a delay or an advancement of playout of the given audio track relative to the anchor in the other audio track. The delay and advancement may be a positive and negative offset, respectively, for example in units of seconds, relative to the anchor in the other audio track.

[0017] This allows inserting customizable and meaningful gaps for example between speech segments (e.g., lines of different characters) to thereby ensure a more natural flow of narration.

[0018] In some embodiments, the playout condition for a given audio track may include an indication of an anchor in the given audio track at which playout is to be started. In general, playout of an audio track may be limited to a portion of the audio track between onset and termination anchors in the audio track.

[0019] In some embodiments, the playout condition for a given audio track may include an indication of a fade-in for the given audio track with a timing in relation to the anchor in the other audio track.

[0020] In some embodiments, the playout condition for a given audio track may include an indication of a fade-out for the given audio track. A timing of the fade-out may be in relation to the anchor in the other audio track or in relation to another anchor in yet another audio track that is played out after the other audio track. The anchor in the other audio track or the other anchor may relate to a termination of (audible) sound or to a specific instance within the respective other audio track (e.g., a story-based instance).

[0021] In some embodiments, the playout condition for a given audio track may include an indication of a first portion (e.g., foreground portion) of the given audio track that is intended for playout during a gap between first and second audio tracks that are played out in sequence, and an indication of a second portion (background portion) of the given audio track that is intended for playout overlapping with the second audio track. Then, playout of the second portion may be at a lower volume level than playout of the second audio track.

[0022] Playout of the second portion may further be at a lower volume level than playout of the first portion.

[0023] With this configuration, a more immersive experience involving foreground and background audio, for example where sound effects continue to play in the background, can be created in a flexible manner.
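The foreground and background portions described above can be sketched as a small planning step. This is a minimal illustration under stated assumptions, not a format defined by the disclosure: the function name `plan_effect`, the `foreground_end_s` parameter, and the fixed attenuation `background_db` are hypothetical. The disclosure only requires the background portion to be quieter than the overlapping track; a fixed gain reduction is one simple way to approximate that.

```python
def db_to_gain(db):
    """Convert a level change in decibels to a linear amplitude gain."""
    return 10.0 ** (db / 20.0)


def plan_effect(effect_start_s, effect_length_s, foreground_end_s, background_db=-12.0):
    """Split one effect track into a foreground and a background segment.

    Returns the time at which the next (second) track may start, plus a list
    of (name, start_s, end_s, gain) segments for the effect itself.
    All parameter names are hypothetical.
    """
    if not 0.0 <= foreground_end_s <= effect_length_s:
        raise ValueError("foreground portion must lie inside the effect")
    boundary_s = effect_start_s + foreground_end_s
    segments = [
        # Played in the gap between the first and second tracks, at full level.
        ("foreground", effect_start_s, boundary_s, 1.0),
        # Played under the second track, attenuated.
        ("background", boundary_s, effect_start_s + effect_length_s,
         db_to_gain(background_db)),
    ]
    return boundary_s, segments


# A 6 s effect starting at 10 s whose first 2 s are foreground.
next_start_s, segments = plan_effect(10.0, 6.0, 2.0, background_db=-20.0)
```

In this sketch the second track would start at `next_start_s`, so the gap is exactly as long as the foreground portion.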

[0024] In some embodiments, the playout condition for a given audio track may include an indication of a first anchor in a first audio track and a second anchor in a second audio track. Therein, playout of the given audio track is to occur between the first anchor and the second anchor. The playout condition for the given audio track may include a respective indication. Such playout condition may be generated for example for audio tracks that relate to music or ambience.

[0025] Thereby, music and / or ambience, for example, can be played out to cover whole regions in the narrative, independently of user-enabled modifications to the playback, such as increased speed of speech segments, disabling of sound effects, etc.
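As an illustration of why a region bounded by two anchors survives playback changes, the sketch below resolves such a region to playback times for two speech speeds. It is a simplified model under stated assumptions: the function `story_region` and its parameters are hypothetical, speech lines are played strictly in sequence, and the gaps between lines are assumed to keep their authored length when speech is sped up.

```python
def story_region(line_durations_s, gap_s, first_line, last_line, speech_speed=1.0):
    """Resolve a story region to playback times.

    The region runs from the onset of `first_line` to the termination of
    `last_line`; both anchors move when the speech is sped up or slowed down.
    """
    if speech_speed <= 0:
        raise ValueError("speech_speed must be positive")
    t = 0.0
    onsets, terminations = [], []
    for duration_s in line_durations_s:
        onsets.append(t)
        t += duration_s / speech_speed  # only the speech itself is time-scaled
        terminations.append(t)
        t += gap_s                      # assumption: gaps stay as authored
    return onsets[first_line], terminations[last_line]


durations = [4.0, 2.0, 6.0]
# Music anchored from the onset of the second line to the end of the third.
normal = story_region(durations, gap_s=0.5, first_line=1, last_line=2)
fast = story_region(durations, gap_s=0.5, first_line=1, last_line=2, speech_speed=2.0)
```

The music covers the same two lines in both cases, although the absolute start and stop times differ.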

[0026] In some embodiments, the method may further include generating, for one or more audio tracks among the plurality of audio tracks, an anchor representing an onset of sound in the respective audio track. The anchor (onset anchor) may represent the onset of audible (or perceptible) sound.

[0027] In some embodiments, the method may further include generating, for one or more audio tracks among the plurality of audio tracks, an anchor representing a termination of sound in the respective audio track. The anchor (termination anchor) may represent the termination of audible (or perceptible) sound.

[0028] In some embodiments, the anchor in the other audio track may relate to a termination of sound (e.g., audible or perceptible sound) in the other audio track. For example, the anchor may relate to the termination of a last word if the other audio track relates to speech.

[0029] In some embodiments, the playout condition may be further based on a user setting. The user setting may be a setting at the playback device.

[0030] Foreseeing dependence of the playout conditions on a user setting may enable removing for example sound effects and / or music from the mix that is played out at the playback device, altering the density of sound effects, selecting different voices, and / or speeding up the narration while playing other content at normal speed. Further, this allows providing alternative audio tracks that can be selected, at the playback device, based on a user setting.

[0031] In some embodiments, each audio track may include a unique label or identifier.

[0032] Providing a unique label or identifier may enable replacing audio tracks at the user-side, by exchanging an audio track with a new audio track with the same label, for example to change a narrator voice. Moreover, having labels available can allow for selective disabling of audio at playback.

[0033] In some embodiments, each of the plurality of audio tracks may include audio relating to at least one of a speech segment, a music segment, an event-based segment, a non-diegetic element segment, and an ambience segment. The event-based segment may comprise a sound effect, for example.

[0034] In some embodiments, the media presentation may include one or more visual tracks. Then, the method may include generating, for one or more among the visual tracks, a playout condition for the respective visual track in relation to an anchor in another audio track or in another visual track. The visual tracks may be image or video tracks, for example. It is understood that any details of the playout conditions for the audio tracks as described above or elsewhere in the disclosure may likewise apply to the playout conditions for the visual tracks.

[0035] In some embodiments, each of the one or more visual tracks may include visual media relating to at least one of an image segment, a video segment, and a text-based segment.

[0036] In some embodiments, the method may further include providing the plurality of audio tracks and the metadata (and optionally, the one or more visual tracks) as part of the media presentation.

[0037] Another aspect of the disclosure relates to a method of processing (e.g., decoding, rendering) a media presentation for playout at a playback device. The media presentation may include a plurality of audio tracks and metadata that includes, for one or more audio tracks among the plurality of audio tracks, a playout condition for the respective audio track in relation to an anchor in another audio track. The method may include extracting the plurality of audio tracks from the media presentation. The method may further include playing out the one or more audio tracks among the plurality of audio tracks in accordance with their playout conditions. In some embodiments, playing out the one or more audio tracks among the plurality of audio tracks in accordance with their playout conditions may include playing out a first audio track. Said playing out may further include playing out a second audio track in accordance with its playout condition. Therein, the second audio track may have a playout condition in relation to an anchor in the first audio track. In other words, the first audio track may be a triggering audio track for the second audio track.

[0038] In some embodiments, the playout condition for a given audio track may include an indication of a delay or an advancement of playout of the given audio track relative to the anchor in the other audio track. Then, playing out the one or more audio tracks among the plurality of audio tracks in accordance with their playout conditions may include playing out the given audio track with said delay or advancement relative to the anchor in the other audio track.

[0039] In some embodiments, the playout condition for a given audio track may include an indication of an anchor in the given audio track at which playout is to be started.

[0040] In some embodiments, the playout condition for a given audio track may include an indication of a fade-in for the given audio track with a timing in relation to the anchor in the other audio track. Then, playing out the one or more audio tracks among the plurality of audio tracks in accordance with their playout conditions may include fading in the given audio track with the timing in relation to the anchor in the other audio track.

[0041] In some embodiments, the playout condition for a given audio track may include an indication of a fade-out for the given audio track. Then, playing out the one or more audio tracks among the plurality of audio tracks in accordance with their playout conditions may include fading out the given audio track with a timing indicated by the playout condition.

[0042] In some embodiments, the playout condition for a given audio track may include an indication of a first portion of the given audio track that is intended for playout during a gap between first and second audio tracks that are played out in sequence, and an indication of a second portion of the given audio track that is intended for playout overlapping with the second audio track. Then, playing out the one or more audio tracks among the plurality of audio tracks in accordance with their playout conditions may include playing out the first portion of the given audio track during the gap between the first and second audio tracks.

[0043] Said playing out may further include playing out the second portion of the given audio track overlapping with the second audio track, wherein playout of the second portion is at a lower volume level than playout of the second audio track.

[0044] In some embodiments, the playout condition for a given audio track may include an indication of a first anchor in a first audio track and a second anchor in a second audio track. Therein, playout of the given audio track may be intended to occur between the first anchor and the second anchor. Further, playing out the one or more audio tracks among the plurality of audio tracks in accordance with their playout conditions may include starting playout of the given audio track on occurrence of the first anchor. Said playing out may further include ending playout of the given audio track on occurrence of the second anchor.

[0045] In some embodiments, one or more audio tracks among the plurality of audio tracks may include a respective anchor representing an onset of sound in the respective audio track.

[0046] In some embodiments, one or more audio tracks among the plurality of audio tracks may include a respective anchor representing a termination of sound in the respective audio track.

[0047] In some embodiments, one or more audio tracks among the plurality of audio tracks that relate to speech may include a respective anchor representing a specific instance (e.g., story-based instance) within the speech.

[0048] In some embodiments, the anchor in the other audio track may relate to a termination of sound in the other audio track.

[0049] In some embodiments, the playout condition may be further based on a user setting. Then, the method may further include receiving input of the user setting. The method may yet further include playing out the one or more audio tracks among the plurality of audio tracks in accordance with their playout conditions, based on the user setting.

[0050] In some embodiments, each audio track may include a unique label or identifier.

[0051] In some embodiments, each of the plurality of audio tracks may include audio relating to at least one of a speech segment, a music segment, an event-based segment, a non-diegetic element segment, and an ambience segment.

[0052] In some embodiments, the media presentation may include one or more visual tracks. Then, the metadata may include, for one or more among the visual tracks, a playout condition for the respective visual track in relation to an anchor in another audio track or in another visual track. The method may further include playing out visual tracks among the one or more visual tracks in accordance with their playout conditions. In some embodiments, each of the one or more visual tracks may include visual media relating to at least one of an image segment, a video segment, and a text-based segment.

[0053] According to another aspect, an apparatus is provided. The apparatus may include one or more processors and a memory coupled thereto and storing instructions for the one or more processors. The one or more processors may be configured to perform the methods or method steps outlined throughout the present disclosure. This apparatus may relate to an encoder, encoding apparatus, or encoding system, or to a decoder, decoding apparatus, or decoding system, as the case may be.

[0054] According to a further aspect, a computer program is described. The computer program may comprise executable instructions for performing the methods or method steps outlined throughout the present disclosure when executed by a computing device (e.g., one or more processors).

[0055] According to another aspect, a computer-readable storage medium is described. The storage medium may store a computer program adapted for execution on a computing device (e.g., one or more processors) and for performing the methods or method steps outlined throughout the present disclosure when carried out on the computing device.

[0056] It should be noted that the methods and apparatus including their preferred embodiments as outlined in the present disclosure may be used stand-alone or in combination with the other methods and apparatus disclosed in this document. Furthermore, all aspects of the methods and apparatus outlined in the present disclosure may be arbitrarily combined. In particular, the features of the claims may be combined with one another in an arbitrary manner.

[0057] It will be appreciated that apparatus features and method steps may be interchanged in many ways. In particular, the details of the disclosed method(s) can be realized by the corresponding apparatus, and vice versa, as the skilled person will appreciate. Moreover, any of the above statements made with respect to the method(s) (and, e.g., their steps) are understood to likewise apply to the corresponding apparatus (and, e.g., their blocks, stages, units), and vice versa.

[0058] Brief Description of the Drawings

[0059] The invention is explained below in an exemplary manner with reference to the accompanying drawings, wherein

Fig. 1 schematically illustrates an example of a timeline-based audio experience;

[0060] Fig. 2 is a flowchart schematically illustrating an example of a method of generating a media presentation according to embodiments of the disclosure;

[0061] Fig. 3 illustrates an example of a text script including several lines of speech within an audio experience;

[0062] Fig. 4 shows time-frequency spectra of audio of two different lines of speech from Fig. 3;

[0063] Fig. 5 schematically illustrates an example of story-anchored sound effects according to embodiments of the disclosure;

[0064] Fig. 6 schematically illustrates an example of story-anchored music according to embodiments of the disclosure;

[0065] Fig. 7 is a flowchart schematically illustrating an example of a method of processing a media presentation according to embodiments of the disclosure;

[0066] Fig. 8 schematically illustrates an example of an apparatus suitable for implementing techniques according to embodiments of the disclosure; and

[0067] Fig. 9 schematically illustrates another example of an apparatus suitable for implementing techniques according to embodiments of the disclosure.

[0068] Detailed Description

[0069] In the following, example embodiments of the disclosure will be described with reference to the appended figures. Identical elements in the figures may be indicated by identical reference numbers, and repeated description thereof may be omitted.

[0070] Overview

[0071] Generally, when creating rich immersive audio content, for example an audiobook or radio drama, the audio is typically authored on a timeline and then rendered to a static format. For a sound engineer, however, audio tracks are considered relative to one another or relative to a particular point in the ‘story’. This disclosure describes techniques for capturing, retaining, and delivering that intent in an advanced format. This format can then support a varied, more personalized playback experience as it allows for adjustments including but not limited to (a) removing sound effects and / or music from the mix, (b) altering the density of sound effects, (c) selecting different voices, and (d) speeding up the narration while playing other content at normal speed.

[0072] To this end, this disclosure considers the composition of an audio experience as a collection of audio sources (audio tracks) that are anchored not on any specific point in time, but rather on a specific point within the story being told.

[0073] In other words, this disclosure describes techniques for authoring, storing, and transmitting media content in a manner anchored to a representation of the story, as contrasted with anchoring to a specific point in time within the presentation. In an example embodiment, this representation is a line-by-line script that is then broken into scenes, then chapters. This is of course an example and not the only possible representation of a story. Other representations of stories are within the scope of this disclosure.

[0074] Furthermore, this disclosure describes:

[0075] - Anchoring speech audio to the portion of the story the speech represents
    o Doing so at a line level, at a word level, and at a character level
    o Anchoring on the relevant points within the speech audio, not to file boundaries

[0076] - Anchoring event-based audio such as sound effects to a point within the story

[0077] o Supporting a time offset for event-based audio relative to the story-anchor
    o Anchoring on relevant points within the sound effect audio, not to file boundaries

[0078] o Supporting an indication of whether or not speech should be paused to leave space for the presentation of this sound effect
    o Supporting an offset before and after the sound effect to leave more or less space

[0079] - Anchoring continuous audio such as music and ambience to a region of the story rather than a specific point
    o Supporting fade-in and fade-out times
    o Supporting portions of the audio the speech should not cover, to leave time for fade-in and out
- Anchoring other modes of media presentation such as image, video, and text to points and regions within the story

[0080] - Support for alternative but still story-anchored speech tracks (different voices, less dramatic readings, etc.) that can be selected at playback time

[0081] One example method of authoring media content by a media content authoring system may be as follows.

[0082] At a first step, the media content authoring system receives a representation of the media content. For example, the media content authoring system receives media content that includes a text-based script representation of a story.

[0083] At a second step, the media content authoring system receives a plurality of story moments, wherein each story moment of the plurality of story moments comprises at least one of a speech segment, a music segment, an event-based segment, a non-diegetic element segment, an image segment, a video segment, a text-based segment, and an ambience segment.

[0084] At a third step, the media content authoring system anchors each of the plurality of story moments to a corresponding story position in the representation of the media content. For example, the media content authoring system anchors each of the story moments to a specific point or a specific region in the representation of the media content (e.g., the story).
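The three authoring steps above can be sketched as follows. This is only an illustrative model, assuming the story is represented as a list of script lines addressed by index; the `StoryMoment` class, its fields, and `anchor_moments` are hypothetical names that do not appear in the disclosure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoryMoment:
    moment_id: str
    kind: str                       # e.g. "speech", "event", "music", "ambience"
    start_line: int                 # script line the moment is anchored to
    end_line: Optional[int] = None  # None for a point anchor, set for a region


def anchor_moments(script, moments):
    """Map each script line index to the ids of moments anchored at or spanning it."""
    anchored = {i: [] for i in range(len(script))}
    for m in moments:
        last = m.start_line if m.end_line is None else m.end_line
        if not (0 <= m.start_line <= last < len(script)):
            raise ValueError(f"{m.moment_id} is anchored outside the script")
        for i in range(m.start_line, last + 1):
            anchored[i].append(m.moment_id)
    return anchored


# Step 1: a text-based script representation of the story.
script = ["Line one.", "Line two.", "Line three."]
# Step 2: story moments of different kinds.
moments = [
    StoryMoment("voice_1", "speech", 0),
    StoryMoment("door_slam", "event", 1),
    StoryMoment("theme", "music", 0, 2),   # a region covering all three lines
]
# Step 3: anchor each moment to its story position.
anchored = anchor_moments(script, moments)
```

Point anchors attach to one line, while region anchors such as the music track attach to every line they span.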

[0085] Fig. 2 schematically illustrates another example of a method 200 of generating (e.g., creating) a media presentation according to embodiments of the disclosure. The media presentation may be an audio presentation (e.g., audio experience, immersive audio experience), such as an audio book or a podcast, for example. The audio presentation may be augmented with images and / or video, for example, in which case the media presentation could be said to be a multimedia experience including audio. The media presentation is understood to comprise a plurality of audio tracks and metadata including playout conditions (or playout information, playout instructions in general) for playout of the audio tracks. Each of the plurality of audio tracks may comprise audio relating to at least one of a speech segment, a music segment, an event-based segment (e.g., one or more sound effects), a non-diegetic element segment, and an ambience segment. Speech segments may be script-based segments, i.e., may relate to a corresponding line of spoken words, for example. Speech segments may be related to different speakers, such as characters, narrators, etc. To allow for advanced playback options, each audio track may comprise a unique label or identifier, and / or one or more labels indicating characteristics of the audio track, such as type (e.g., speech, sound effects, music, ambience, etc.), character name, speaker identity, etc.

[0086] Method 200 comprises steps S210 through S230.

[0087] At step S210, the plurality of audio tracks is received (or otherwise obtained).

[0088] At step S220, for one or more audio tracks, a playout condition for the respective audio track (e.g., first audio track) in relation to an anchor in another audio track (e.g., second audio track) is generated.

[0089] A respective playout condition in relation to an anchor may be generated for all audio tracks among the plurality of audio tracks, or for all but one audio track among the plurality of audio tracks, which is the first audio track to be played out.

[0090] The playout conditions may relate to or comprise a trigger for playout of the respective audio track, wherein the trigger refers to the respective anchor in the other audio track. In other words, the anchor in the other audio track serves as a trigger for playout of the respective audio track, possibly with delay or advancement relative to the trigger (where the latter may require appropriate look-ahead at playback).

[0091] The anchor in the other audio track may relate to a termination of sound (e.g., audible or perceptible sound) in the other audio track. For example, for a speech-based audio track as the other audio track, the anchor may relate to the termination of a last word (e.g., in the case of multiple lines, the termination of a last line). In general, the anchor in the other audio track may relate to any story-based (or script-based) anchor, such as an end of a line, a specific word, etc.

[0092] Further, the playout condition may also be based on (or depend on) a user setting (e.g., a user-configurable setting at a playback device). Foreseeing playout conditions depending on a user setting may enable, for example, removing sound effects, music, and / or ambience from the mix that is played out at the playback device, altering the density of sound effects, selecting different voices, and / or speeding up the narration while playing other content at normal speed. Further, this allows providing alternative audio tracks that can be selected, at the playback device, based on a user setting. Further details on user-configurable playback based on user settings will be described below in the context of method 700.
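A playback device could apply such user settings by filtering and adjusting tracks according to their labels. The following is a minimal sketch under stated assumptions: the label keys (`type`, `voice`), the setting names, and the function `apply_user_settings` are all hypothetical, and the per-track speed factor merely stands in for whatever time-scaling a real player would use.

```python
def apply_user_settings(tracks, disabled_types=(), voice=None, speech_speed=1.0):
    """Return a playback plan as a list of (track_id, speed) pairs.

    tracks: list of dicts with hypothetical labels 'id', 'type' and,
    for alternative speech tracks, 'voice'.
    """
    plan = []
    for track in tracks:
        # Remove whole categories (e.g. sound effects) from the mix.
        if track["type"] in disabled_types:
            continue
        # Keep only the selected voice among alternative speech tracks.
        if voice is not None and track.get("voice") not in (None, voice):
            continue
        # Speed up narration only; other content stays at normal speed.
        speed = speech_speed if track["type"] == "speech" else 1.0
        plan.append((track["id"], speed))
    return plan


tracks = [
    {"id": "line_1_a", "type": "speech", "voice": "narrator_a"},
    {"id": "line_1_b", "type": "speech", "voice": "narrator_b"},
    {"id": "thunder", "type": "sound_effect"},
    {"id": "theme", "type": "music"},
]
plan = apply_user_settings(
    tracks, disabled_types={"sound_effect"}, voice="narrator_b", speech_speed=1.5
)
```

Because playout is triggered by anchors rather than fixed times, dropping or speeding up a track in this way does not require re-authoring the remaining tracks.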

[0093] At step S230, the metadata is generated based on the generated playout conditions for the one or more audio tracks. In other words, method 200 may comprise generating the metadata to indicate, for the one or more (e.g., all, or all but one) audio tracks, the playout condition for the respective audio track in relation to the anchor in the other audio track.
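Steps S210 through S230 can be sketched as collecting per-track playout conditions into a single metadata record. The disclosure does not prescribe a serialization; the JSON layout, the `PlayoutCondition` fields, and the function `generate_metadata` below are hypothetical and serve only as an illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class PlayoutCondition:
    track_id: str          # track whose playout is being conditioned
    trigger_track: str     # the "other" track that contains the anchor
    trigger_anchor: str    # e.g. "end" or "onset" (hypothetical anchor names)
    offset_s: float = 0.0  # delay (positive) or advancement (negative)


def generate_metadata(track_ids, conditions):
    """S230: build metadata from the playout conditions generated in S220."""
    known = set(track_ids)
    for c in conditions:
        if c.track_id not in known or c.trigger_track not in known:
            raise ValueError(f"unknown track in condition for {c.track_id}")
        if c.track_id == c.trigger_track:
            raise ValueError(f"{c.track_id} cannot trigger itself")
    conditioned = {c.track_id for c in conditions}
    return {
        "tracks": list(track_ids),
        # Tracks without a condition start the presentation.
        "first_tracks": [t for t in track_ids if t not in conditioned],
        "playout_conditions": [asdict(c) for c in conditions],
    }


track_ids = ["line_1", "line_2", "thunder"]  # S210: tracks received
conditions = [                               # S220: conditions generated
    PlayoutCondition("line_2", "line_1", "end", 0.4),
    PlayoutCondition("thunder", "line_1", "end", -0.2),
]
metadata = generate_metadata(track_ids, conditions)
serialized = json.dumps(metadata, indent=2)
```

Here `line_1` has no condition and is therefore the first track to be played out, matching the "all but one" case described above.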

[0094] Method 200 may further comprise, for one or more (e.g., all) audio tracks among the plurality of audio tracks that relate to speech, generating an anchor representing a specific instance within the speech. As described above, this anchor may be generated at the line level (e.g., end of a given line) or at the word level (e.g., first word or last word), for example. Moreover, multiple anchors may be generated for each of the one or more (e.g., all) speech-based audio tracks. These anchors may include, for example, an onset of speech, a termination of speech, and / or a specific word (e.g., key word). In general, these anchors may be referred to as story-based anchors.

[0095] The aforementioned story-based anchors may be embedded into respective audio tracks themselves, or they may be stored separately in metadata for the audio tracks.

[0096] Moreover, each of the aforementioned story-based anchors may act as a trigger for playout of another audio track. Thus, the audio tracks of the media presentation may be said to be chained or linked together, for playout, via the anchors.
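The chaining of tracks via anchors can be pictured as a graph in which each track points to the track that triggers it. The sketch below walks that graph from the first track and lists every track after its trigger; it yields a dependency order, not exact playback times. The function `playout_order`, the anchor names, and the dictionary layout are hypothetical.

```python
from collections import defaultdict, deque


def playout_order(first_track, triggers):
    """Order tracks so that each one appears after the track that triggers it.

    triggers: {track_id: (trigger_track_id, anchor_name)}
    """
    waiting = defaultdict(list)  # trigger track -> tracks it starts
    for track, (trigger_track, _anchor) in triggers.items():
        waiting[trigger_track].append(track)

    order, seen = [], {first_track}
    queue = deque([first_track])
    while queue:
        current = queue.popleft()
        order.append(current)
        for nxt in waiting[current]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)

    # Tracks never reached are part of a cycle or hang off an unknown trigger.
    unreachable = set(triggers) - seen
    if unreachable:
        raise ValueError(f"tracks never triggered: {sorted(unreachable)}")
    return order


triggers = {
    "line_2": ("line_1", "end"),
    "thunder": ("line_1", "word:storm"),
    "line_3": ("line_2", "end"),
}
order = playout_order("line_1", triggers)
```

A check of this kind may be useful at authoring time, since a track whose trigger is never reached would otherwise silently never play.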

[0097] Finally, although not shown in Fig. 2, method 200 may further comprise a step of providing the plurality of audio tracks and the metadata as part of the media presentation.

[0098] Example Embodiments

[0099] In the following, example implementations embodying principles of the present disclosure will be described without intended limitation.

[0100] Speech

[0101] As an example embodiment, it is considered that the media presentation comprises a ‘story’ that is the text of the script being read. An example of such script 300 is illustrated in Fig. 3, which shows an excerpt from ‘Lord of the Rings’ by J.R.R. Tolkien.

[0102] In this example there may be five voice audio files for speech (one for each line 310, each comprising a plurality of words 320), as examples of (speech-based) audio tracks. If played on their own, logically they should be played one after the other. In practice, the delay between each voice track needs to be chosen. For example, there might need to be a longer delay between the end of the first line and the beginning of the second than there is between the end of the 3rd and the beginning of the 4th line in order for the story to flow smoothly. This timing could be manually tuned (e.g., by prescribing a certain gap), but can also be predicted from the semantic meaning of the surrounding text, and from the punctuation.

[0103] Gaps (or correspondingly, overlap) between sound, such as speech, may be achieved by including in the playout condition for a given audio track an indication of a delay (e.g., positive offset) or an advancement (e.g., negative offset) of playout of the given audio track relative to the anchor in the other audio track (i.e., the anchor that triggers playout of the given audio track). The delay or advancement may be given for example in units of seconds, relative to the anchor in the other audio track. Implementing advancement of playout of the given audio track relative to the anchor in the other audio track may require implementing a look-ahead at playback.

[0104] Importantly, according to the present disclosure, any gap between audio tracks would be anchored not on the (somewhat arbitrary) beginning and end of the audio file, but rather on where the speech contained within the audio file begins and ends.

[0105] Accordingly, the playout condition for a given audio track may comprise an indication of an anchor (onset anchor) in the given audio track at which playout is to be started. In general, playout of an audio track may be limited to a portion of the audio track between onset and termination anchors in the audio track.

[0106] Reference is now made to the example of Fig. 4, which shows time-frequency spectra of audio (i.e., speech) of two different lines of the example script in Fig. 3. The generated audio includes some leading silence and some trail-off, but the words associated with the story end at 3.069s in track 1, 410 and begin at 0.222s within track 2, 420. The relevant information that informs the gap between these tracks is when the speech ends in line 1 and begins in line 2, not the end and beginning of the two audio files. The punctuation, flow of the story, or creative choice could make the desired gap longer or shorter, or even foresee partial overlap between speech segments (e.g., for talkers slightly overlapping), but it is more natural to anchor it from where the spoken words begin and end than from where the audio file begins and ends.
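As a worked illustration of the timing described above, the following sketch uses the anchor positions given for Fig. 4 (speech ending at 3.069 s within track 1 and beginning at 0.222 s within track 2). The 0.5 s gap, the assumption that the file of track 1 starts at time zero, and the function names are hypothetical and chosen only for the example.

```python
def speech_start_time(anchor_time_s: float, offset_s: float) -> float:
    """Time at which the next speech becomes audible: the triggering anchor
    plus a delay (positive offset) or advancement (negative offset)."""
    return anchor_time_s + offset_s


def file_start_time(speech_time_s: float, onset_in_file_s: float) -> float:
    """Time at which the next file would have to start if its silent lead-in
    were kept, so that its onset anchor lands on speech_time_s."""
    return speech_time_s - onset_in_file_s


TRACK1_TERMINATION_S = 3.069  # end of speech within track 1 (Fig. 4)
TRACK2_ONSET_S = 0.222        # start of speech within track 2 (Fig. 4)
GAP_S = 0.5                   # hypothetical gap, chosen for illustration

# Track 1's file is assumed to start at t = 0, so its termination anchor
# occurs at 3.069 s on the resulting presentation timeline.
line2_speech = speech_start_time(TRACK1_TERMINATION_S, GAP_S)
line2_file = file_start_time(line2_speech, TRACK2_ONSET_S)

# A negative offset yields partial overlap between the two talkers.
overlap_speech = speech_start_time(TRACK1_TERMINATION_S, -0.2)
```

Under these assumptions the speech of line 2 becomes audible at about 3.569 s, irrespective of how much leading silence the second file happens to contain.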

[0107] In line with this, in general, method 200 may further comprise generating, for one or more audio tracks among the plurality of audio tracks, an anchor (onset anchor) representing an onset of sound (e.g., audible or perceptible sound, such as speech) in the respective audio track and / or an anchor (termination anchor) representing a termination of sound in the respective audio track. Describing the spoken word in this manner also allows for the content to change while maintaining creative choice. For example if the voice for the Nazgul were selected to make use of a different voice actor and this led to a different audio file length and a different amount of leading and trailing silence within that audio file, the audio experience would still flow in the same manner.

[0108] Generating such onset and termination anchors may be applied to audio tracks of other audio types as well, to improve control over gaps between sounds.
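The present disclosure does not prescribe how onset and termination anchors are determined. Purely as an assumption for illustration, the sketch below locates them with a simple amplitude threshold over the samples of an audio track; the function name `find_sound_anchors` and the threshold value are hypothetical, and a practical implementation might instead rely on, e.g., voice activity detection or forced alignment against the script.

```python
from typing import Optional, Sequence, Tuple


def find_sound_anchors(
    samples: Sequence[float],
    sample_rate: int,
    threshold: float = 0.02,  # hypothetical level below which audio counts as silence
) -> Optional[Tuple[float, float]]:
    """Return (onset_s, termination_s) of the span in which |sample| reaches
    the threshold, or None if the whole track stays below it."""
    active = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not active:
        return None
    onset_s = active[0] / sample_rate
    termination_s = (active[-1] + 1) / sample_rate  # end of the last active sample
    return onset_s, termination_s


# Hypothetical track: 1 s of silence, 2 s of sound, 0.5 s of silence at 10 Hz.
samples = [0.0] * 10 + [0.5] * 20 + [0.0] * 5
anchors = find_sound_anchors(samples, sample_rate=10)
```

Because the anchors are derived from the audio content, exchanging the file for a different recording only requires re-running the detection; playout conditions that refer to the anchors remain unchanged.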

[0109] Sound Effects

[0110] The inclusion of sound effects will be described next. Returning to the example script of Fig. 3, to enhance the storytelling it may be decided that a ‘hiss’ sound will be added after ‘they cried with deadly voices’, conveying that the Nazgul are unearthly creatures that invoke terror. An audio file (audio track) containing a ‘hiss’ is sourced, or generated using AI, and included in the mix (i.e., in the plurality of audio tracks of the media presentation).

[0111] Under the traditional audio creation workflow, the sound designer would find the point in the story where ‘deadly voices’ is spoken, move all subsequent audio to a slightly later time to make room for the hissing noise, and then add the new audio track in the created space.

[0112] By contrast, according to the present disclosure the sound effect is linked to the end of the word ‘voices’ and the audio file (audio track) is automatically sequenced at wherever in the resulting audio experience the word ‘voices’ finishes being spoken. As described above, this may involve “nudging” the sound earlier or later relative to this anchor point. There is also the option of indicating that the narration should be paused to make room for the effect, or that it should be played alongside the narration without gap. For example, the difference between these two experiences may be indicated in the output format only by a Boolean flag, as the sequencing of audio will be cemented at a later stage.

[0113] An example of the difference in this ‘leave a narration gap’ option is depicted in Fig. 5, which also serves to illustrate how audio files (audio tracks) are sequenced relative to one another.

[0114] The upper panel 500A of Fig. 5 relates to the case that a dedicated gap in narration is left between playout of a first speech-based audio track 510 (with onset and termination anchors 512, 514) and playout of a second speech-based audio track 520 (with onset and termination anchors 522, 524), for playout of a third audio track 530 (with onset and termination anchors 532, 534) that relates to the “hiss” sound effect. The dedicated gap may be achieved by including a corresponding delay in the playout condition for the second speech-based audio track 520 in relation to the termination anchor 514 of the first speech-based audio track 510 that may serve as trigger for playout of the second speech-based audio track 520.

[0115] To ensure a more natural embedding of the “hiss” sound, playout of the third audio track 530 is advanced in this example by (negative) offset 540 relative to the termination anchor 514 in the first speech-based audio track 510, so that the third audio track 530 partially overlaps with the first speech-based audio track 510, and a (positive) offset 545 (or gap) is foreseen between the termination anchor 534 of the third audio track 530 and the onset anchor 522 of the second speech-based audio track 520.

[0116] The lower panel 500B of Fig. 5 relates to the case that no dedicated gap in narration is left between playout of a first speech-based audio track 550 (with onset and termination anchors 552, 554) and playout of a second speech-based audio track 560 (with onset and termination anchors 562, 564), for playout of a third audio track 570 (with onset and termination anchors 572, 574) that relates to the “hiss” sound effect. Merely a narrative pause is left between termination anchor 554 of the first speech-based audio track 550 and the onset anchor 562 of the second speech-based audio track 560 in accordance with natural flow of the narration.

[0117] Again, to ensure a more natural embedding, playout of the third audio track is advanced by (negative) offset 580 relative to the termination anchor 554 in the first speech-based audio track 550, but there is now also partial overlap between the third audio track 570 and the second speech-based audio track 560.
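The two cases of Fig. 5 may be contrasted with a short sketch in which a single Boolean flag selects between them. The function name, parameter names, and all numeric values are hypothetical; the sketch only assumes, as described above, that the sound effect is advanced relative to the termination anchor of the first speech-based track.

```python
def schedule_effect(
    line1_end_s: float,       # time of the termination anchor of the first track
    effect_len_s: float,      # audible length of the effect (onset to termination)
    effect_offset_s: float,   # negative = advancement relative to the anchor
    leave_gap: bool,          # Boolean flag: pause narration for the effect?
    post_gap_s: float = 0.0,          # gap after the effect (upper panel)
    narrative_pause_s: float = 0.5,   # natural pause between lines (lower panel)
) -> dict:
    effect_start = line1_end_s + effect_offset_s
    effect_end = effect_start + effect_len_s
    if leave_gap:
        # Upper panel: the second line waits until the effect has ended.
        line2_start = effect_end + post_gap_s
    else:
        # Lower panel: narration continues after its natural pause only.
        line2_start = line1_end_s + narrative_pause_s
    return {
        "effect_start": effect_start,
        "effect_end": effect_end,
        "line2_start": line2_start,
        # Delay of line 2 relative to the triggering termination anchor.
        "line2_delay": line2_start - line1_end_s,
    }


with_gap = schedule_effect(3.0, 1.5, -0.25, leave_gap=True, post_gap_s=0.25)
without_gap = schedule_effect(3.0, 1.5, -0.25, leave_gap=False)
```

With the flag set, the second line starts only after the effect has ended; with the flag cleared, the effect overlaps both the end of the first line and the beginning of the second line.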

[0118] A point to note is that sound effects may have useful anchor points within the audio file (audio track) itself that will be utilised, similar to what has been described above for speech. The ‘hiss’ sound may begin and end anywhere within the file, and the story-anchored format may reference these anchor points (e.g., onset and termination anchors) rather than the start or end of the file itself.

[0119] Thus, similarly to the above case for speech, the playout condition for a given (or any) audio track may comprise an indication of an anchor (onset anchor) in the given audio track at which playout is to be started and in general, playout of an audio track may be limited to a portion of the audio track between onset and termination anchors in the audio track.

Foreground vs Background Time

[0120] Foreground vs. background time as described herein may be seen as an extension to the previous discussion on ‘leave a narration gap’. Each sound effect (or other type of sound, such as ambience or music, for example) can have a foreground time, which is the length of narration gap to leave. While occupying the foreground role in the sound design, the sound effect is played prominently in the narration gap and captures the listener’s attention. However, the sound effects may continue to play in the background (i.e., at a lower level while narration continues) after the designated foreground time for that sound effect. The foreground time and background time may be individual controls that a content creator has access to and may be expressed in the audio format (e.g., media presentation, specifically in the playout conditions) in seconds, words, or a combination of the two.

[0121] In general, according to embodiments of the disclosure the playout condition for a given audio track may comprise an indication of a first portion (foreground portion) of the given audio track that is intended for playout during a gap between first and second audio tracks that are played out in sequence, and an indication of a second portion (background portion) of the given audio track that is intended for playout overlapping with the second audio track. Preferably, playout of the second portion is at a lower volume level than playout of the second audio track, and / or at a lower volume level than playout of the first portion.

[0122] A further extension involves having a pre-build time, in which the sound effect starts playing before narration pauses, then occupies the foreground role in the sound design while the narration is paused, and then fades again into the background when narration continues. Accordingly, the playout condition for a given audio track may further comprise an indication of a third portion (background portion) that is to be played out prior to playout of the first portion and that overlaps with the first audio track.
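For illustration only, the pre-build, foreground, and background portions may be thought of as a gain envelope applied to the sound effect. The sketch below uses a stepped envelope with a hypothetical background gain of 0.3; the function name and values are assumptions, and a practical renderer would likely cross-fade between the levels rather than switch abruptly.

```python
def effect_gain(
    t_s: float,            # time since the start of the effect's playout
    pre_build_s: float,    # third portion: overlaps the first audio track
    foreground_s: float,   # first portion: played in the narration gap
    background_s: float,   # second portion: overlaps the second audio track
    background_gain: float = 0.3,  # hypothetical reduced level
) -> float:
    """Return the gain applied to the effect at time t_s."""
    if t_s < 0.0:
        return 0.0
    if t_s < pre_build_s:
        return background_gain  # pre-build, under the first track
    if t_s < pre_build_s + foreground_s:
        return 1.0              # foreground, narration paused
    if t_s < pre_build_s + foreground_s + background_s:
        return background_gain  # background, narration has resumed
    return 0.0


# Hypothetical effect: 1 s pre-build, 2 s foreground, 3 s background.
levels = [effect_gain(t, 1.0, 2.0, 3.0) for t in (0.5, 1.5, 4.0, 7.0)]
```

The foreground duration in this sketch corresponds to the length of the narration gap to leave, so that changing it alters both the envelope and the delay applied to the following speech.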

[0123] Music

[0124] Until now, the presented examples contained speech and sound effects (e.g., diegetic sound effects) to help tell the story. Another category of audio to add is non-diegetic elements, such as music, sounds, or visual elements, which are frequently used to influence and enhance the mood arc of each passage. Non-diegetic elements are elements not part of the story’s world, such as a musical score, voice-over narration, or some types of functional sound effects. Non-diegetic elements can also refer to visual elements, such as graphics, that are outside of the story world. In the context of this disclosure, music surfaces another aspect of story-anchored audio in that rather than being anchored to a point it may be instead anchored across a region.

[0125] In the example of the script in Fig. 3, one may select intense dramatic music to build tension starting from the beginning of line 1. This could continue to play until the end of line 2, at which point the music may transition to dramatic triumphant music to reinforce the determination of the main character, as schematically depicted in the example of Fig. 6.

[0126] In this example, three speech-based segments 610, 620, 630 (or audio tracks) are consecutively played out, with appropriate gaps and sequence determined by their respective playout conditions. The speech-based segments 610, 620, 630 in this example correspond to lines 1 to 3 of the script shown in Fig. 3.

[0127] In a traditional mixing scenario, the sound designer would select an appropriate portion of music, ensuring the length matches the rest of the mix, and mix it in starting at a specific time point and beginning to fade-out toward the end of the speech associated with line 2 (segment 620). By contrast, in the context of the present disclosure, the beginning of the intense, dramatic music 640 would instead be anchored on the beginning of the word ‘the’ at the start of line 1 (i.e., anchor 612 in speech-based segment 610). This may include an optional offset 660 to ensure the music has time to build (e.g., fade-in) before the voice starts, and / or may add a flag to prevent voice from beginning until a specific distance into the track. From anchor 612 onwards, a portion 670 of the story may be covered by intense, dramatic music 640, until playback arrives at anchor 624 in speech-based segment 620. At that point, intense, dramatic music 640 may have a fade-out during portion 680, during which triumphant, dramatic music 650 fades in. Playout of the triumphant, dramatic music 650 may be anchored on the beginning of the word “by” at the start of line 3 (i.e., anchor 632 in speech-based segment 630).

[0128] As the audio format (e.g., media presentation) retains this information, if the audio experience is adjusted and this results in a different length for that passage (for example, sound effects are disabled or a voice actor with a different speaking rate is chosen), the music can be faded out early or let run for longer to ensure it covers the appropriate portion of the story.

[0129] In line with the above, generally, the playout condition for a given audio track (e.g., music, such as intense dramatic music 640 in the above example) may comprise an indication of a first anchor in a first audio track (e.g., anchor 612 in speech-based segment 610 in the above example) and a second anchor in a second audio track (e.g., anchor 624 in speech-based segment 620 in the above example), as well as an indication that playout of the given audio track is to occur between the first anchor and the second anchor, optionally with fade-in before the first anchor and / or fade-out after the second anchor.

[0130] Further, the playout condition for a given audio track (e.g., without intended limitation, for music) may comprise an indication of a fade-in for the given audio track with a timing in relation to the anchor in the other audio track. Providing for such fade-in may require sufficient look-ahead at playback. In the example of Fig. 6, intense, dramatic music 640 has a fade-in with a timing in relation to the anchor 612 in speech-based audio track 610.

[0131] Additionally or alternatively, the playout condition for the given audio track may comprise an indication of a fade-out for the given audio track. A timing of the fade-out may be in relation to the anchor in the other audio track or in relation to another anchor in yet another audio track that is played out after the other audio track. The anchor in the other audio track or the other anchor may relate to a termination of (audible) sound or to a specific instance within the respective other audio track (e.g., a story-based instance). In the example of Fig. 6, intense, dramatic music 640 has a fade-out with a timing in relation to the anchor 624 in speech-based audio track 620.
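A region-anchored track with fade-in before the first anchor and fade-out after the second anchor may be sketched as follows. The linear fade shape, the function name `music_gain`, and the numeric values are assumptions made for the example; the anchor times are taken to be already resolved to times on the resulting presentation timeline.

```python
def music_gain(
    t_s: float,
    start_anchor_s: float,  # resolved time of the first anchor (cf. anchor 612)
    end_anchor_s: float,    # resolved time of the second anchor (cf. anchor 624)
    fade_in_s: float,       # fade-in length before the first anchor
    fade_out_s: float,      # fade-out length after the second anchor
) -> float:
    """Gain of a region-anchored track at time t_s, using linear fades."""
    if t_s < start_anchor_s - fade_in_s or t_s >= end_anchor_s + fade_out_s:
        return 0.0
    if t_s < start_anchor_s:
        return (t_s - (start_anchor_s - fade_in_s)) / fade_in_s
    if t_s <= end_anchor_s:
        return 1.0
    return 1.0 - (t_s - end_anchor_s) / fade_out_s


# Hypothetical passage: region from 2 s to 10 s, 2 s fade-in, 4 s fade-out.
original = [music_gain(t, 2.0, 10.0, 2.0, 4.0) for t in (1.0, 5.0, 12.0, 14.0)]

# If the passage becomes longer (e.g., a slower voice), only the resolved
# time of the second anchor changes; the music simply runs for longer.
longer = music_gain(12.0, 2.0, 13.0, 2.0, 4.0)
```

Since the fades are expressed relative to anchors, no re-authoring is needed when the passage length changes.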

[0132] Ambience

[0133] There are other classes of audio to be added to the story, for example ambience, which are presently defined as diegetic sound representing the acoustic scene the story takes place within. In the example script of Fig. 3, this could include a raging river (as the story takes place while escaping across a river). This class of sound is similar to music in that it covers a region of story rather than being anchored on a specific point. Likewise, this class of sound may have a fade-in and / or fade-out.

[0134] Accordingly, the playout condition for a given audio track relating to ambience may be structured in analogous manner to the above-described playout condition for music.

[0135] Other Media Types

[0136] It should be appreciated that the approach of anchoring presentation elements (e.g., audio tracks) to points or regions of the story may extend to other media types such as images, text, and video. In the context of method 200 described above, this may mean that the media presentation comprises one or more visual tracks (e.g., comprising visual media relating to at least one of an image segment, a video segment, and a text-based segment), and that the method comprises generating, for one or more (e.g., all) visual tracks, a playout condition for the respective visual track in relation to an anchor in another audio track or in another visual track. Therein, it is understood that details of the playout conditions for the audio tracks as described above or elsewhere in the disclosure may likewise apply to the playout conditions for the visual tracks. The audio tracks and visual tracks may be jointly referred to as media tracks in the context of the disclosure.

[0137] For example, one may choose to present an image of character Frodo holding up his sword and for that image to be displayed from the beginning of line 3 in the example script of Fig. 3 until the end of line 5. Alternatively, one may choose to display a short video and / or display text. Importantly, according to the present disclosure the corresponding media presentation is authored, stored, and transmitted with all elements anchored to the story rather than to absolute time.

[0138] User-Side Processing

[0139] Fig. 7 schematically illustrates an example of a method 700 of processing (e.g., decoding, rendering) a media presentation (e.g., a media presentation as generated using method 200 described above) for playout at a playback device. It is understood that the media presentation comprises a plurality of audio tracks and metadata that includes, for one or more audio tracks among the plurality of audio tracks, a playout condition (or generally, playout instruction, playout information) for the respective audio track in relation to an anchor in another audio track. Each of the plurality of audio tracks may comprise audio relating to, for example, at least one of a speech segment, a music segment, an event-based segment, a non-diegetic element segment, and an ambience segment. To allow for advanced playback options, each audio track may also comprise a unique label or identifier, and / or one or more labels indicating characteristics of the audio track, such as type (e.g., speech, sound effects, music, ambience, etc.), character name, speaker name, etc.

[0140] Importantly, as described above in the context of method 200, some or all of the audio tracks may include one or more anchors. These anchors may represent an onset of sound in the respective audio track, a termination of sound in the respective audio track, and / or a specific instance within the audio track (e.g., a specific (story-based) instance within a speech segment). Further details will be described below.

[0141] Method 700 comprises steps S710 and S720. At step S710, the plurality of audio tracks are extracted (or otherwise obtained) from the media presentation.

[0142] At step S720, the one or more audio tracks among the plurality of audio tracks are played out in accordance with their playout conditions. This may comprise playing out a first audio track, and based thereon, playing out a second audio track in accordance with its playout condition, wherein the second audio track has a playout condition in relation to an anchor in the first audio track. That is, the first audio track may be a triggering audio track for the second audio track.

[0143] Therein, the playout condition for a given audio track may comprise an indication of an anchor in the given audio track at which playout is to be started (e.g., an onset of speech or sound in the given audio track). By eliminating playout of silent lead-ins or lead-outs of audio tracks, gaps between consecutive segments (e.g., speech segments) can be controlled with greater accuracy.

[0144] In some cases, as noted above, the anchor in the other audio track may relate to a termination of sound in the other audio track.

[0145] In general, it is understood that one or more audio tracks among the plurality of audio tracks (e.g., all audio tracks) each comprise one or more anchors. These anchors may represent an onset of sound (e.g., speech) in the respective audio track, a termination of sound (e.g., speech) in the respective audio track, and / or a specific instance within the audio track (e.g., within the speech). Having these anchors available in the audio tracks allows playout of the audio tracks to be flexibly linked by means of the playout conditions.

[0146] As noted above, playout of a given audio track at step S720 depends on its playout condition. Non-limiting examples thereof are given below.

[0147] If the playout condition for a given audio track comprises an indication of a delay or an advancement of playout of the given audio track relative to the anchor in the other audio track, playing out the one or more audio tracks at step S720 may involve playing out the given audio track with said delay or advancement relative to the anchor in the other audio track. Playing out with advancement may require sufficient look-ahead.

[0148] Further, if the playout condition for a given audio track comprises an indication of a fade-in for the given audio track with a timing in relation to the anchor in the other audio track, playing out the one or more audio tracks at step S720 may involve fading in the given audio track with a timing in relation to the anchor in the other audio track. Further, if the playout condition for a given audio track comprises an indication of a fade-out for the given audio track, playing out the one or more audio tracks at step S720 may involve fading out the given audio track with a timing indicated by the playout condition.

[0149] Further, if the playout condition for a given audio track comprises an indication of a first portion (foreground portion) of the given audio track that is intended for playout during a gap between first and second audio tracks that are played out in sequence, and an indication of a second portion (background portion) of the given audio track that is intended for playout overlapping with the second audio track, playing out the one or more audio tracks at step S720 may involve the following: playing out the first portion of the given audio track during the gap between the first and second audio tracks, and playing out the second portion of the given audio track overlapping with the second audio track, wherein playout of the second portion is at a lower volume level than playout of the second audio track.

[0150] Further, if the playout condition for a given audio track comprises an indication of a first anchor in a first audio track and a second anchor in a second audio track, as well as an indication that playout of the given audio track is to occur between the first anchor and the second anchor, playing out the one or more audio tracks at step S720 may involve the following: starting playout of the given audio track on occurrence of the first anchor, and ending playout of the given audio track on occurrence of the second anchor.
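The playout behaviour at step S720 may be illustrated by a sketch that resolves anchor-relative playout conditions into start times at the playback device. The dictionary layout, the key names (`onset`, `termination`, `after`), and the function name are hypothetical; the sketch assumes that playout of each track starts at its onset anchor and that the first track starts at time zero.

```python
def resolve_schedule(tracks: dict) -> dict:
    """Map each track id to the time at which its audible playout starts.

    Each track is a dict with anchor positions in seconds within its own
    file ('onset', 'termination') and an optional playout condition
    'after': (other_track_id, anchor_name, offset_s).
    """
    start = {}

    def resolve(track_id, stack=()):
        if track_id in start:
            return start[track_id]
        if track_id in stack:
            raise ValueError("cyclic playout conditions")
        cond = tracks[track_id].get("after")
        if cond is None:
            start[track_id] = 0.0  # no trigger: start of the presentation
        else:
            other, anchor, offset_s = cond
            other_start = resolve(other, stack + (track_id,))
            # Anchor position measured from the other track's onset anchor.
            rel = tracks[other][anchor] - tracks[other]["onset"]
            start[track_id] = other_start + rel + offset_s
        return start[track_id]

    for track_id in tracks:
        resolve(track_id)
    return start


tracks = {
    "line1": {"onset": 0.25, "termination": 3.0},
    "line2": {"onset": 0.25, "termination": 2.25,
              "after": ("line1", "termination", 0.5)},    # delay
    "hiss": {"onset": 0.0, "termination": 1.0,
             "after": ("line1", "termination", -0.25)},   # advancement
}
schedule = resolve_schedule(tracks)

# Replacing line 1 by a longer recording shifts everything chained to it.
recast = dict(tracks, line1={"onset": 0.5, "termination": 4.5})
recast_schedule = resolve_schedule(recast)
```

The advancement of the "hiss" track is resolvable here because the position of the termination anchor in line 1 is known ahead of time, which corresponds to the look-ahead mentioned above.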

[0151] In some implementations, the playout condition may also be based on a user setting for enabling additional flexibility at playout. To account for the user setting in the playout condition(s), method 700 may further comprise a step of receiving input of the user setting. Once or if the user setting is available, the one or more audio tracks among the plurality of audio tracks may be played out in accordance with their playout conditions, based on the user setting.

[0152] For instance, playout of certain audio tracks, such as sound effects, music, and / or ambience, may depend on a respective user setting for enabling or disabling sound effects, music, and / or ambience. Such user settings may relate to Boolean flags, for example. Further, playout volumes of certain audio tracks may be increased or decreased, depending on the user setting.

[0153] As another example, the user setting may allow selection between different alternatives for character or narrator voices. This may be achieved by including redundant versions with different character or narrator voices for each speech segment into the media presentation, and labeling all audio tracks belonging to a certain selectable character voice with an identification of the selectable character voice. At playback, audio tracks with labels (matching labels) corresponding to a selectable character voice selected by the user (as per the user setting) may be played out, while the redundant alternatives with non-matching labels may not be played out.

[0154] As another example, playout of certain audio tracks (e.g., speech-based audio tracks) may be accelerated or decelerated, depending on the user setting, while other audio tracks (e.g., sound effects, music, and / or ambience) are played out at normal speed.
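The user-setting-dependent playout described above may be sketched as a selection step that precedes scheduling. The label names (`type`, `voice`), the setting keys (`disabled_types`, `voice`, `speech_rate`), and the function name are hypothetical and serve only to illustrate the principle.

```python
def select_tracks(tracks: list, settings: dict) -> list:
    """Return (track_id, playback_rate) pairs for tracks that are to be played out."""
    disabled = settings.get("disabled_types", set())
    chosen_voice = settings.get("voice")
    speech_rate = settings.get("speech_rate", 1.0)
    selected = []
    for track in tracks:
        if track["type"] in disabled:
            continue  # e.g., music switched off by a Boolean-style setting
        voice = track.get("voice")
        if voice is not None and voice != chosen_voice:
            continue  # redundant alternative with a non-matching voice label
        # Only speech is accelerated or decelerated; other types keep normal speed.
        rate = speech_rate if track["type"] == "speech" else 1.0
        selected.append((track["id"], rate))
    return selected


tracks = [
    {"id": "line1_voiceA", "type": "speech", "voice": "A"},
    {"id": "line1_voiceB", "type": "speech", "voice": "B"},
    {"id": "hiss", "type": "sound_effect"},
    {"id": "score", "type": "music"},
]
settings = {"disabled_types": {"music"}, "voice": "B", "speech_rate": 1.5}
selection = select_tracks(tracks, settings)
```

Because the remaining tracks are still located via anchors, the selected subset can be scheduled in the same manner as the full presentation.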

[0155] Playback of the media presentation may also comprise anchor-based (and thus, story-based) navigation through the media content. For example, playing out the one or more audio tracks at step S720 may involve pausing at a given anchor, resuming playback at a given anchor, or jumping to a given anchor (e.g., as per user input at the playback device). This given anchor may relate to the beginning of a current line, paragraph, or chapter, per the user’s input.
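Anchor-based navigation may be sketched, for example, as a lookup in a sorted list of resolved anchor times. The function names and the example times are hypothetical; the list is assumed to hold the times of line-onset anchors, although paragraph or chapter anchors could be used in the same manner.

```python
import bisect


def resume_position(line_starts_s: list, paused_at_s: float) -> float:
    """Return the start of the line that was playing when playback paused."""
    i = bisect.bisect_right(line_starts_s, paused_at_s) - 1
    return line_starts_s[max(i, 0)]


def skip(line_starts_s: list, position_s: float, lines: int) -> float:
    """Jump forward (lines > 0) or backward (lines < 0) by whole lines."""
    i = bisect.bisect_right(line_starts_s, position_s) - 1
    j = min(max(i + lines, 0), len(line_starts_s) - 1)
    return line_starts_s[j]


line_starts = [0.0, 3.5, 7.25, 12.0]  # hypothetical resolved onset anchors
resumed = resume_position(line_starts, 8.0)
forward = skip(line_starts, 8.0, 1)
backward = skip(line_starts, 8.0, -1)
```

In this sketch, skipping operates on story units rather than on a fixed number of seconds.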

[0156] In implementations that extend to other media types such as images, text, or video, the media presentation may (further) comprise one or more visual tracks (e.g., comprising visual media relating to at least one of an image segment, a video segment, and a text-based segment). In this case, the metadata may comprise, for one or more (e.g., all) visual tracks, a playout condition for the respective visual track in relation to an anchor in another audio track or in another visual track. Then, step S720 may further involve playing out visual tracks among the one or more visual tracks in accordance with their playout conditions.

[0157] Format and Benefits

[0158] According to the present disclosure, where media content is authored in the manner described above, the format used to store and transmit the experience to the end user retains the information that associates presented elements to the story and to each other, rather than to absolute time on a presentation timescale. Doing so enables the playback experience to be varied in response to controls expressed later, such as user preference, listening environment, or accessibility.

[0159] Some examples include, but are not limited to:

[0160] 1. Disabling all immersive audio (e.g., sound effects, music, ambience) if the user preference is to just have the book read to them. It is an additional advantage that the audiobook creator needed to make only one product to support this preference.

2. Selecting a different character voice, while retaining the other immersive presentation elements and ensuring they are presented at the appropriate point in the story, irrespective of speaking rates.

[0161] 3. Adjusting the presentation speed for speech. The sound effects and music can be played back at the same or a similar rate to avoid distortions, while still being presented at the appropriate point in the now-accelerated story.

[0162] 4. Navigating by the story. For example, after pausing and resuming listening, playback can resume at the beginning of the current line, paragraph, or chapter, per the user preference. Skipping forward and backward in the story can operate similarly, rather than by time (e.g., forward or back 15s).

[0163] 5. Enabling the listener to take control of the listener’s listening experience with fine-grained controls that let the listener adjust the audio and visual elements of these stories exactly to the listener’s liking, by for instance adding, removing, or changing the volume of the music, sound effects, additional character voices, or visual illustrations.

[0164] The above user adjustments can all be made while listening to the presentation, and the playback time can be maintained at the same point in the story, even if the time within the presentation to do so changes significantly.
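One conceivable way of maintaining the same point in the story across such adjustments is to express the playback position relative to the story (e.g., line index and fraction of the line) instead of as a time. The sketch below is an assumption made for illustration; the function names and the representation of lines as (start, end) time spans are hypothetical.

```python
def to_story_position(spans: list, t_s: float) -> tuple:
    """Convert a time into (line_index, fraction of that line already played).

    spans holds one (start_s, end_s) pair per line in playout order. A time
    that falls in a gap before a line maps to the start of that line.
    """
    for i, (start, end) in enumerate(spans):
        if t_s < end:
            return i, max(t_s - start, 0.0) / (end - start)
    return len(spans) - 1, 1.0  # past the end: end of the last line


def to_time(spans: list, position: tuple) -> float:
    """Convert a story position back into a time on a (new) timeline."""
    i, fraction = position
    start, end = spans[i]
    return start + fraction * (end - start)


# Hypothetical timelines before and after doubling the speech rate.
normal = [(0.0, 4.0), (5.0, 9.0)]
fast = [(0.0, 2.0), (2.5, 4.5)]

position = to_story_position(normal, 7.0)  # halfway through the second line
new_time = to_time(fast, position)
```

The listener thus continues from the same point in the second line, although the corresponding time within the presentation has changed from 7.0 s to 3.5 s.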

[0165] Apparatus, Programs, and Storage Media

[0166] While methods of generating media presentations and of processing (e.g., decoding, rendering) media presentations have been described above, it is understood that the present disclosure likewise relates to apparatus (e.g., computer apparatus or apparatus having processing capability in general) for implementing these methods (or techniques in general), and to systems including such apparatus.

[0167] An example of such apparatus 800 is schematically illustrated in Fig. 8. The apparatus 800 comprises a processor 810 (or multiple processors) and a memory 820 coupled to the processor 810. The memory 820 may store instructions for execution by the processor 810. Processor 810 may be adapted to implement the apparatus described throughout the disclosure and / or to perform methods (e.g., methods of generating media presentations or processing media presentations) described throughout the disclosure. The apparatus 800 may receive inputs 830 (e.g., media tracks (audio tracks, visual tracks), user settings, etc.) and generate outputs 840 (e.g., media presentations, rendered audio, etc.) as described throughout the disclosure. Accordingly, the apparatus 800 may relate to any of an encoder-side apparatus or a decoder-side (e.g., user-side) apparatus, as the case may be.

[0168] The apparatus may comprise or may be coupled to a transmission device that may transmit the coded data (e.g., media presentation) in the form of a bitstream to a device or to a digital storage medium or through, for example, a network in the form of a file or streaming. The digital storage medium may include various storage mediums such as USB-C, USB, SD, CD, DVD, Blu-ray, HDD, SSD, and equivalent technologies. The digital storage medium may also be part of the coding unit of the transmitter device.

[0169] The transmission device may include an element for generating the bitstream and / or a media file and may include an element for transmission, e.g., through a variety of mediums (Bluetooth, broadcast / communication networks, Internet technologies and equivalents). The transmission may be implemented using a variety of technologies such as, for example, RF, light waves, infrared, Bluetooth, WiFi, and / or acoustic transmission devices.

[0170] The present disclosure further relates to programs (e.g., computer programs) comprising instructions that, when executed by a processor (or multiple processors), cause the processor (or multiple processors) to carry out any of the methods described throughout the disclosure, and to computer-readable storage media storing such programs.

[0171] Fig. 9 shows a schematic block diagram of another, more detailed example electronic device or architecture 900 (e.g., an apparatus 900) suitable for implementing example embodiments of the present disclosure. Architecture 900 includes but is not limited to servers and client devices, systems, modules and methods as described in reference to Fig. 1 to Fig. 8. As shown, the architecture 900 includes central processing unit (CPU) 901 which is capable of performing various processes in accordance with a program stored in, for example, read only memory (ROM) 902 or a program loaded from, for example, storage unit 908 to random access memory (RAM) 903. The CPU 901 may be, for example, an electronic processor 901, which may include one or more processor cores, and in some examples the processor 901 may be multiple processors. In RAM 903, the data used when CPU 901 performs the various processes is also stored, as required. CPU 901, ROM 902 and RAM 903 are connected to one another via bus 904. Input / output (I / O) interface 905 is also connected to bus 904.

[0172] The following components are connected to I / O interface 905: input unit 906, that may include a keyboard, a mouse, or the like; output unit 907 that may include a display such as a liquid crystal display (LCD) and one or more speakers; storage unit 908 including a hard disk, or another suitable storage device; and communication unit 909 which may include a network interface card such as a network card (e.g., wired or wireless).

[0173] In some implementations, input unit 906 includes one or more microphones in different positions (depending on the host device) enabling capture of audio signals in various formats (e.g., mono, stereo, spatial, immersive, and other suitable formats).

[0174] In some implementations, output unit 907 includes systems with various numbers of speakers. Output unit 907 (depending on the capabilities of the host device) can render audio signals in various formats (e.g., mono, stereo, immersive, binaural, and other suitable formats).

[0175] In some embodiments, communication unit 909 is configured to communicate with other devices (e.g., via a network). Drive 910 is also connected to I / O interface 905, as required. Removable medium 911, such as a magnetic disk, an optical disk, a magneto-optical disk, a flash drive or another suitable removable medium is mounted on drive 910, so that a computer program read therefrom is installed into storage unit 908, as required. A person skilled in the art would understand that although apparatus 900 is described as including the above-described components, in real applications, it is possible to add, remove, and / or replace some of these components, and all such modifications or alterations fall within the scope of the present disclosure.

[0176] In accordance with example embodiments of the present disclosure, the processes described above may be implemented as computer software programs or on a computer-readable storage medium. For example, embodiments of the present disclosure include a computer program product including a computer program tangibly embodied on a machine-readable medium, the computer program including program code for performing methods. In such embodiments, the computer program may be downloaded and mounted from the network via the communication unit 909, and / or installed from the removable medium 911, as shown in Fig. 9.

[0177] Generally, various example embodiments of the present disclosure may be implemented in hardware or special purpose circuits (e.g., control circuitry), software, logic or any combination thereof. For example, the units discussed above can be executed by control circuitry (e.g., CPU 901 in combination with other components of Fig. 9), thus, the control circuitry may be performing the actions described in this disclosure. Some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, a processor and / or other computing device(s), which may include control circuitry.

[0178] While various aspects of the example embodiments of the present disclosure are illustrated and described as block diagrams, flowcharts, or using some other pictorial representation, it will be appreciated that the blocks, apparatus, systems, techniques, or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.

[0179] Additionally, various blocks shown in the flowcharts may be viewed as method steps, and / or as operations that result from operation of computer program code, and / or as a plurality of coupled logic circuit elements constructed to carry out the associated function(s). For example, embodiments of the present disclosure include a computer program product including a computer program tangibly embodied on a machine-readable medium, the computer program containing program codes configured to carry out the methods as described above.

[0180] In the context of the disclosure, a machine-readable medium may be any tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may be non-transitory and may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of the machine-readable storage medium would include an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random-access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

[0181] Computer program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These computer program codes may be provided to one or more processors of a general-purpose computer, special purpose computer, or other programmable data processing apparatus that has control circuitry, such that the program codes, when executed by one or more processors of the computer or other programmable data processing apparatus, cause the functions / operations specified in the flowcharts and / or block diagrams to be implemented. The program code may execute entirely on a computer, partly on the computer, as a stand-alone software package, partly on the computer and partly on a remote computer or entirely on the remote computer or server or distributed over one or more remote computers and / or servers.

[0182] Various aspects and implementations of the invention may also be appreciated from the following enumerated example embodiments (EEEs), which are not claims.

[0183] EEE-A1. A method of generating a media presentation comprising a plurality of audio tracks and metadata including playout conditions (e.g., playout instructions) for playout of the audio tracks, the method comprising: receiving the plurality of audio tracks; generating, for one or more audio tracks, a playout condition for the respective audio track in relation to an anchor in another audio track; and generating the metadata based on the generated playout conditions for the one or more audio tracks.

[0184] EEE-A2. The method according to EEE-A1, further comprising generating, for one or more audio tracks among the plurality of audio tracks that relate to speech, an anchor representing a specific instance within the speech.

[0185] EEE-A3. The method according to EEE-A1 or EEE-A2, wherein the playout condition for a given audio track comprises an indication of a delay or an advancement of playout of the given audio track relative to the anchor in the other audio track.

[0186] EEE-A4. The method according to any one of the preceding EEE-As, wherein the playout condition for a given audio track comprises an indication of an anchor in the given audio track at which playout is to be started.

[0187] EEE-A5. The method according to any one of the preceding EEE-As, wherein the playout condition for a given audio track comprises an indication of a fade-in for the given audio track with a timing in relation to the anchor in the other audio track.

[0188] EEE-A6. The method according to any one of the preceding EEE-As, wherein the playout condition for a given audio track comprises an indication of a fade-out for the given audio track.

EEE-A7. The method according to any one of the preceding EEE-As, wherein the playout condition for a given audio track comprises an indication of a first portion of the given audio track that is intended for playout during a gap between first and second audio tracks that are played out in sequence, and an indication of a second portion of the given audio track that is intended for playout overlapping with the second audio track, wherein playout of the second portion is at a lower volume level than playout of the second audio track.

[0189] EEE-A8. The method according to any one of the preceding EEE-As, wherein the playout condition for a given audio track comprises an indication of a first anchor in a first audio track and a second anchor in a second audio track, wherein playout of the given audio track is to occur between the first anchor and the second anchor.

[0190] EEE-A9. The method according to any one of the preceding EEE-As, further comprising generating, for one or more audio tracks among the plurality of audio tracks, an anchor representing an onset of sound in the respective audio track.

[0191] EEE-A10. The method according to any one of the preceding EEE-As, further comprising generating, for one or more audio tracks among the plurality of audio tracks, an anchor representing a termination of sound in the respective audio track.

[0192] EEE-A11. The method according to any one of the preceding EEE-As, wherein the anchor in the other audio track relates to a termination of sound in the other audio track.

[0193] EEE-A12. The method according to any one of the preceding EEE-As, wherein the playout condition is further based on a user setting.

[0194] EEE-A13. The method according to any one of the preceding EEE-As, wherein each audio track comprises a unique label or identifier.

[0195] EEE-A14. The method according to any one of the preceding EEE-As, wherein each of the plurality of audio tracks comprises audio relating to at least one of a speech segment, a music segment, an event-based segment, a non-diegetic element segment, and an ambience segment.

[0196] EEE-A15. The method according to any one of the preceding EEE-As, wherein the media presentation comprises one or more visual tracks; and the method comprises generating, for one or more visual tracks, a playout condition for the respective visual track in relation to an anchor in another audio track or in another visual track.

EEE-A16. The method according to EEE-A15, wherein each of the one or more visual tracks comprises visual media relating to at least one of an image segment, a video segment, and a text-based segment.

[0197] EEE-A17. The method according to any one of the preceding EEE-As, further comprising providing the plurality of audio tracks and the metadata as part of the media presentation.

[0198] EEE-A18. A method of processing a media presentation for playout at a playback device, wherein the media presentation comprises a plurality of audio tracks and metadata that includes, for one or more audio tracks among the plurality of audio tracks, a playout condition for the respective audio track in relation to an anchor in another audio track, the method comprising: extracting the plurality of audio tracks from the media presentation; and playing out the one or more audio tracks among the plurality of audio tracks in accordance with their playout conditions.

[0199] EEE-A19. The method according to EEE-A18, wherein playing out the one or more audio tracks among the plurality of audio tracks in accordance with their playout conditions comprises: playing out a first audio track; and playing out a second audio track in accordance with its playout condition, wherein the second audio track has a playout condition in relation to an anchor in the first audio track.

[0200] EEE-A20. The method according to EEE-A18 or EEE-A19, wherein the playout condition for a given audio track comprises an indication of a delay or an advancement of playout of the given audio track relative to the anchor in the other audio track; and playing out the one or more audio tracks among the plurality of audio tracks in accordance with their playout conditions comprises playing out the given audio track with said delay or advancement relative to the anchor in the other audio track.

[0201] EEE-A21. The method according to any one of EEE-A18 to EEE-A20, wherein the playout condition for a given audio track comprises an indication of an anchor in the given audio track at which playout is to be started.

[0202] EEE-A22. The method according to any one of EEE-A18 to EEE-A21, wherein the playout condition for a given audio track comprises an indication of a fade-in for the given audio track with a timing in relation to the anchor in the other audio track; and playing out the one or more audio tracks among the plurality of audio tracks in accordance with their playout conditions comprises fading in the given audio track with a timing in relation to the anchor in the other audio track.

[0203] EEE-A23. The method according to any one of EEE-A18 to EEE-A22, wherein the playout condition for a given audio track comprises an indication of a fade-out for the given audio track; and playing out the one or more audio tracks among the plurality of audio tracks in accordance with their playout conditions comprises fading out the given audio track with a timing indicated by the playout condition.

[0204] EEE-A24. The method according to any one of EEE-A18 to EEE-A23, wherein the playout condition for a given audio track comprises an indication of a first portion of the given audio track that is intended for playout during a gap between first and second audio tracks that are played out in sequence, and an indication of a second portion of the given audio track that is intended for playout overlapping with the second audio track; and playing out the one or more audio tracks among the plurality of audio tracks in accordance with their playout conditions comprises: playing out the first portion of the given audio track during the gap between the first and second audio tracks; and playing out the second portion of the given audio track overlapping with the second audio track, wherein playout of the second portion is at a lower volume level than playout of the second audio track.

[0205] EEE-A25. The method according to any one of EEE-A18 to EEE-A24, wherein the playout condition for a given audio track comprises an indication of a first anchor in a first audio track and a second anchor in a second audio track, wherein playout of the given audio track is to occur between the first anchor and the second anchor; and playing out the one or more audio tracks among the plurality of audio tracks in accordance with their playout conditions comprises: starting playout of the given audio track on occurrence of the first anchor; and ending playout of the given audio track on occurrence of the second anchor.

[0206] EEE-A26. The method according to any one of EEE-A18 to EEE-A25, wherein one or more audio tracks among the plurality of audio tracks include a respective anchor representing an onset of sound in the respective audio track.

EEE-A27. The method according to any one of EEE-A18 to EEE-A26, wherein one or more audio tracks among the plurality of audio tracks include a respective anchor representing a termination of sound in the respective audio track.

[0207] EEE-A28. The method according to any one of EEE-A18 to EEE-A27, wherein one or more audio tracks among the plurality of audio tracks that relate to speech include a respective anchor representing a specific instance within the speech.

[0208] EEE-A29. The method according to any one of EEE-A18 to EEE-A28, wherein the anchor in the other audio track relates to a termination of sound in the other audio track.

[0209] EEE-A30. The method according to any one of EEE-A18 to EEE-A29, wherein the playout condition is further based on a user setting; the method further comprises: receiving input of the user setting; and playing out the one or more audio tracks among the plurality of audio tracks in accordance with their playout conditions, based on the user setting.

[0210] EEE-A31. The method according to any one of EEE-A18 to EEE-A30, wherein each audio track comprises a unique label or identifier.

[0211] EEE-A32. The method according to any one of EEE-A18 to EEE-A31, wherein each of the plurality of audio tracks comprises audio relating to at least one of a speech segment, a music segment, an event-based segment, a non-diegetic element segment, and an ambience segment.

[0212] EEE-A33. The method according to any one of EEE-A18 to EEE-A32, wherein the media presentation comprises one or more visual tracks; the metadata comprises, for one or more visual tracks, a playout condition for the respective visual track in relation to an anchor in another audio track or in another visual track; and the method comprises playing out visual tracks among the one or more visual tracks in accordance with their playout conditions.

[0213] EEE-A34. The method according to EEE-A33, wherein each of the one or more visual tracks comprises visual media relating to at least one of an image segment, a video segment, and a text-based segment.

EEE-A35. An apparatus comprising one or more processors and a memory coupled thereto, wherein the one or more processors are configured to perform the method according to any one of EEE-A1 to EEE-A17 or the method according to any one of EEE-A18 to EEE-A34.

[0214] EEE-A36. A computer program including instructions that, when executed by one or more processors, cause the one or more processors to perform the method according to any one of EEE-A1 to EEE-A17 or the method according to any one of EEE-A18 to EEE-A34.

[0215] EEE-A37. A computer-readable storage medium storing the computer program according to EEE-A36.
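
As a non-limiting illustration of EEE-A1, EEE-A3 and EEE-A18 to EEE-A20, the following Python sketch shows one way in which playout conditions anchored to other audio tracks might be represented as metadata and resolved into start times at playback. The names AudioTrack, PlayoutCondition, generate_metadata and resolve_start_times, the dictionary layout of the metadata, and the example tracks and timings are all assumptions made for illustration; none of them is taken from the disclosure.

```python
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class AudioTrack:
    track_id: str  # unique label or identifier of the track
    # Hypothetical anchor table: anchor name -> offset in seconds from track start.
    anchors: Dict[str, float] = field(default_factory=dict)


@dataclass
class PlayoutCondition:
    track_id: str      # track whose playout is being conditioned
    ref_track_id: str  # the other track that holds the anchor
    ref_anchor: str    # name of the anchor in the other track
    offset_s: float = 0.0  # > 0 is a delay, < 0 an advancement, relative to the anchor


def generate_metadata(conditions: List[PlayoutCondition]) -> dict:
    """Generation side: pack the playout conditions into a metadata dictionary."""
    return {"playout_conditions": [asdict(c) for c in conditions]}


def resolve_start_times(tracks: List[AudioTrack], metadata: dict,
                        root_track_id: str) -> Dict[str, float]:
    """Playback side: derive a start time per track, with the root track at 0 s."""
    anchors = {t.track_id: t.anchors for t in tracks}
    starts = {root_track_id: 0.0}
    pending = list(metadata["playout_conditions"])
    while pending:
        # A condition can be resolved once its reference track has been scheduled.
        ready = [c for c in pending if c["ref_track_id"] in starts]
        if not ready:
            raise ValueError("playout conditions reference an unscheduled track")
        for c in ready:
            ref = c["ref_track_id"]
            anchor_time = starts[ref] + anchors[ref][c["ref_anchor"]]
            starts[c["track_id"]] = anchor_time + c["offset_s"]
        pending = [c for c in pending if c not in ready]
    return starts


tracks = [
    AudioTrack("speech_1", {"onset": 0.0, "word_door": 2.5, "end": 4.0}),
    AudioTrack("door_slam", {"onset": 0.0}),
    AudioTrack("speech_2", {"onset": 0.0, "end": 3.0}),
    AudioTrack("music_1", {"onset": 0.0}),
]
metadata = generate_metadata([
    PlayoutCondition("music_1", "speech_2", "onset", -1.0),      # advancement
    PlayoutCondition("door_slam", "speech_1", "word_door", -0.25),
    PlayoutCondition("speech_2", "speech_1", "end", 0.5),        # delay
])
starts = resolve_start_times(tracks, metadata, root_track_id="speech_1")
print(starts)
```

In this sketch, each start time is derived from an anchor rather than from a fixed timeline, so moving the "end" anchor of speech_1 (for example, because a longer recording is substituted) would shift speech_2 and music_1 without any change to the playout conditions themselves.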

[0216] EEE-B1. A method of authoring media content, the method comprising: receiving a representation of the media content; receiving a plurality of story moments, wherein each story moment of the plurality of story moments comprises at least one of a speech segment, a music segment, an event-based segment, a non-diegetic element segment, an image segment, a video segment, a text-based segment, and an ambience segment; and anchoring each of the plurality of story moments to a corresponding story position in the representation of the media content.

[0217] EEE-B2. The method of EEE-B1, wherein the media content comprises story-based media content.

[0218] EEE-B3. The method of EEE-B1 or EEE-B2, wherein when the plurality of story moments comprises a speech segment, the anchoring comprises anchoring the speech segment at a line level story position, a word level story position, or a character level story position in the representation of the media content.

[0219] EEE-B4. The method of any one of EEE-B1 to EEE-B3, wherein the event-based segment comprises a sound effect.

[0220] EEE-B5. The method of any one of EEE-B1 to EEE-B4, wherein the representation of the media content is based on a text-based script of the media content.

[0221] EEE-B6. The method of any one of EEE-B1 to EEE-B5, wherein the plurality of story moments comprises a speech segment, and the speech segment comprises a flag indicating a presence of an event-based segment.

[0222] EEE-B7. The method of EEE-B6, wherein when the flag indicates the presence of an event-based segment, an offset is added to the speech segment and / or to the event-based segment.

EEE-B8. The method of any one of EEE-B1 to EEE-B7, wherein the corresponding story position comprises a specific point in the representation of the media content or a specific region in the representation of the media content.

[0223] EEE-B9. The method of any one of EEE-B1 to EEE-B8, wherein the story moments are editable and / or selectable by a listener of the media content at playback of the media content.

[0224] EEE-B10. An apparatus configured to perform the method of any one of EEE-B1 to EEE-B9.

[0225] EEE-B11. A non-transitory computer readable storage medium comprising a sequence of instructions which, when executed, cause one or more devices to perform the method of any one of EEE-B1 to EEE-B9.
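
As a non-limiting illustration of EEE-B1, EEE-B3 and EEE-B5, the following Python sketch anchors story moments to line level, word level, or character level positions in a text-based script. The names StoryPosition, StoryMoment, anchor and text_at, the use of zero-based indices, the rule that a character index takes precedence over a word index, and the two-line example script are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class StoryPosition:
    line: int                   # line level position (zero-based, assumed)
    word: Optional[int] = None  # optional word level position within the line
    char: Optional[int] = None  # optional character level position within the line


@dataclass
class StoryMoment:
    moment_id: str
    kind: str  # e.g. "speech", "music", "event", "ambience"
    position: Optional[StoryPosition] = None


def anchor(script: List[str], moment: StoryMoment, line: int,
           word: Optional[int] = None, char: Optional[int] = None) -> StoryMoment:
    """Anchor a story moment to a story position, validating it against the script."""
    if not 0 <= line < len(script):
        raise IndexError("line position outside the script")
    if word is not None and not 0 <= word < len(script[line].split()):
        raise IndexError("word position outside the line")
    if char is not None and not 0 <= char < len(script[line]):
        raise IndexError("character position outside the line")
    moment.position = StoryPosition(line, word, char)
    return moment


def text_at(script: List[str], position: StoryPosition) -> str:
    """Return the script text that a story position points at."""
    text = script[position.line]
    if position.char is not None:
        return text[position.char]
    if position.word is not None:
        return text.split()[position.word]
    return text


script = ["The door creaked open.", "Who is there?"]
narration = anchor(script, StoryMoment("narration_1", "speech"), line=0)
creak = anchor(script, StoryMoment("creak_sfx", "event"), line=0, word=2)
print(text_at(script, narration.position))
print(text_at(script, creak.position))
```

Here the sound effect is tied to the word "creaked" in the script rather than to a time code, so in this sketch its placement would follow that word however the line is eventually voiced.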

Claims

1. A method of generating a media presentation comprising a plurality of audio tracks and metadata including playout conditions for playout of the audio tracks, the method comprising: receiving the plurality of audio tracks; generating, for one or more audio tracks, a playout condition for the respective audio track in relation to an anchor in another audio track; and generating the metadata based on the generated playout conditions for the one or more audio tracks.

2. The method according to claim 1, further comprising generating, for one or more audio tracks among the plurality of audio tracks that relate to speech, an anchor representing a specific instance within the speech.

3. The method according to claim 1 or 2, wherein the playout condition for a given audio track comprises an indication of a delay or an advancement of playout of the given audio track relative to the anchor in the other audio track.

4. The method according to any one of the preceding claims, wherein the playout condition for a given audio track comprises an indication of an anchor in the given audio track at which playout is to be started.

5. The method according to any one of the preceding claims, wherein the playout condition for a given audio track comprises an indication of a fade-in for the given audio track with a timing in relation to the anchor in the other audio track.

6. The method according to any one of the preceding claims, wherein the playout condition for a given audio track comprises an indication of a fade-out for the given audio track.

7. The method according to any one of the preceding claims, wherein the playout condition for a given audio track comprises an indication of a first portion of the given audio track that is intended for playout during a gap between first and second audio tracks that are played out in sequence, and an indication of a second portion of the given audio track that is intended for playout overlapping with the second audio track, wherein playout of the second portion is at a lower volume level than playout of the second audio track.

8. The method according to any one of the preceding claims, wherein the playout condition for a given audio track comprises an indication of a first anchor in a first audio track and a second anchor in a second audio track, wherein playout of the given audio track is to occur between the first anchor and the second anchor.

9. The method according to any one of the preceding claims, further comprising generating, for one or more audio tracks among the plurality of audio tracks, an anchor representing an onset of sound in the respective audio track.

10. The method according to any one of the preceding claims, further comprising generating, for one or more audio tracks among the plurality of audio tracks, an anchor representing a termination of sound in the respective audio track.

11. The method according to any one of the preceding claims, wherein the anchor in the other audio track relates to a termination of sound in the other audio track.

12. The method according to any one of the preceding claims, wherein the playout condition is further based on a user setting.

13. The method according to any one of the preceding claims, wherein each audio track comprises a unique label or identifier.

14. The method according to any one of the preceding claims, wherein each of the plurality of audio tracks comprises audio relating to at least one of a speech segment, a music segment, an event-based segment, a non-diegetic element segment, and an ambience segment.

15. The method according to any one of the preceding claims, wherein the media presentation comprises one or more visual tracks; and the method comprises generating, for one or more visual tracks, a playout condition for the respective visual track in relation to an anchor in another audio track or in another visual track.

16. The method according to claim 15, wherein each of the one or more visual tracks comprises visual media relating to at least one of an image segment, a video segment, and a text-based segment.

17. The method according to any one of the preceding claims, further comprising providing the plurality of audio tracks and the metadata as part of the media presentation.

18. A method of processing a media presentation for playout at a playback device, wherein the media presentation comprises a plurality of audio tracks and metadata that includes, for one or more audio tracks among the plurality of audio tracks, a playout condition for the respective audio track in relation to an anchor in another audio track, the method comprising: extracting the plurality of audio tracks from the media presentation; and playing out the one or more audio tracks among the plurality of audio tracks in accordance with their playout conditions.

19. The method according to claim 18, wherein playing out the one or more audio tracks among the plurality of audio tracks in accordance with their playout conditions comprises: playing out a first audio track; and playing out a second audio track in accordance with its playout condition, wherein the second audio track has a playout condition in relation to an anchor in the first audio track.

20. The method according to claim 18 or 19, wherein the playout condition for a given audio track comprises an indication of a delay or an advancement of playout of the given audio track relative to the anchor in the other audio track; and playing out the one or more audio tracks among the plurality of audio tracks in accordance with their playout conditions comprises playing out the given audio track with said delay or advancement relative to the anchor in the other audio track.

21. The method according to any one of claims 18 to 20, wherein the playout condition for a given audio track comprises an indication of an anchor in the given audio track at which playout is to be started.

22. The method according to any one of claims 18 to 21, wherein the playout condition for a given audio track comprises an indication of a fade-in for the given audio track with a timing in relation to the anchor in the other audio track; and playing out the one or more audio tracks among the plurality of audio tracks in accordance with their playout conditions comprises fading in the given audio track with a timing in relation to the anchor in the other audio track.

23. The method according to any one of claims 18 to 22, wherein the playout condition for a given audio track comprises an indication of a fade-out for the given audio track; and playing out the one or more audio tracks among the plurality of audio tracks in accordance with their playout conditions comprises fading out the given audio track with a timing indicated by the playout condition.

24. The method according to any one of claims 18 to 23, wherein the playout condition for a given audio track comprises an indication of a first portion of the given audio track that is intended for playout during a gap between first and second audio tracks that are played out in sequence, and an indication of a second portion of the given audio track that is intended for playout overlapping with the second audio track; and playing out the one or more audio tracks among the plurality of audio tracks in accordance with their playout conditions comprises: playing out the first portion of the given audio track during the gap between the first and second audio tracks; and playing out the second portion of the given audio track overlapping with the second audio track, wherein playout of the second portion is at a lower volume level than playout of the second audio track.

25. The method according to any one of claims 18 to 24, wherein the playout condition for a given audio track comprises an indication of a first anchor in a first audio track and a second anchor in a second audio track, wherein playout of the given audio track is to occur between the first anchor and the second anchor; and playing out the one or more audio tracks among the plurality of audio tracks in accordance with their playout conditions comprises: starting playout of the given audio track on occurrence of the first anchor; and ending playout of the given audio track on occurrence of the second anchor.

26. The method according to any one of claims 18 to 25, wherein one or more audio tracks among the plurality of audio tracks include a respective anchor representing an onset of sound in the respective audio track.

27. The method according to any one of claims 18 to 26, wherein one or more audio tracks among the plurality of audio tracks include a respective anchor representing a termination of sound in the respective audio track.

28. The method according to any one of claims 18 to 27, wherein one or more audio tracks among the plurality of audio tracks that relate to speech include a respective anchor representing a specific instance within the speech.

29. The method according to any one of claims 18 to 28, wherein the anchor in the other audio track relates to a termination of sound in the other audio track.

30. The method according to any one of claims 18 to 29, wherein the playout condition is further based on a user setting; the method further comprises: receiving input of the user setting; and playing out the one or more audio tracks among the plurality of audio tracks in accordance with their playout conditions, based on the user setting.

31. The method according to any one of claims 18 to 30, wherein each audio track comprises a unique label or identifier.

32. The method according to any one of claims 18 to 31, wherein each of the plurality of audio tracks comprises audio relating to at least one of a speech segment, a music segment, an event-based segment, a non-diegetic element segment, and an ambience segment.

33. The method according to any one of claims 18 to 32, wherein the media presentation comprises one or more visual tracks; the metadata comprises, for one or more visual tracks, a playout condition for the respective visual track in relation to an anchor in another audio track or in another visual track; and the method comprises playing out visual tracks among the one or more visual tracks in accordance with their playout conditions.

34. The method according to claim 33, wherein each of the one or more visual tracks comprises visual media relating to at least one of an image segment, a video segment, and a text-based segment.

35. An apparatus comprising one or more processors and a memory coupled thereto, wherein the one or more processors are configured to perform the method according to any one of claims 1 to 17 or the method according to any one of claims 18 to 34.

36. A computer program including instructions that when executed by one or more processors, cause the one or more processors to perform the method according to any one of claims 1 to 17 or the method according to any one of claims 18 to 34.

37. A computer-readable storage medium storing the computer program according to claim 36.