Methods and systems for processing subtitles and performing audio processing based on subtitles

By using dialogue classifiers to align subtitle file timestamps with audio segments, the method corrects timing misalignments, enhancing dialogue intelligibility and accessibility in subtitle files.

WO2026122594A1 · PCT designated stage · Publication Date: 2026-06-11 · DOLBY LABORATORIES LICENSING CORP

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
DOLBY LABORATORIES LICENSING CORP
Filing Date
2025-12-02
Publication Date
2026-06-11

AI Technical Summary

Technical Problem

Existing subtitle files often have timing misalignments between dialogue and text strings, leading to inaccuracies in dialogue intelligibility, especially for viewers with hearing impairments or in low-volume audio environments.

Method used

A method and system that uses dialogue classifiers to accurately identify dialogue and non-dialogue segments in audio signals, adjusting subtitle file timestamps to align with these segments, thereby enhancing dialogue intelligibility and correcting timing errors.

Benefits of technology

Improves dialogue alignment and intelligibility by accurately synchronizing subtitle text with audio content, providing enhanced accessibility for viewers with hearing impairments and improving dialogue clarity in low-volume settings.

✦ Generated by Eureka AI based on patent content.


Abstract

The present disclosure relates to a method and system for modifying a subtitle file. The method comprises obtaining a subtitle file and an associated audio signal, and processing the audio signal with a dialogue classifier to generate a classifier confidence for each audio frame of the audio signal. Each audio frame is identified as a dialogue or non-dialogue audio frame to form a sequence of dialogue and non-dialogue audio segments. Each dialogue audio segment is associated with a start and end time. The method further comprises identifying a respective corresponding dialogue audio segment of the audio signal and modifying the start time and / or the end time of the at least first text string to obtain a target alignment of the start time and / or end time of the at least first text string with the start time and / or end time of the respective corresponding dialogue audio segment.

Description

METHODS AND SYSTEMS FOR PROCESSING SUBTITLES AND PERFORMING AUDIO PROCESSING BASED ON SUBTITLES

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application claims the benefit of priority from US Provisional Application Ser. No. 63/727,976, filed on 4 December 2024, and EP Patent Application No. 25161268.5, filed on 3 March 2025, each of which is incorporated by reference herein in its entirety.

TECHNICAL FIELD

[0002] The present disclosure relates to methods and systems for processing subtitle files, and methods and systems for performing audio processing based on subtitles.

BACKGROUND

[0001] For many types of broadcast or streamed audiovisual content, subtitles are common and often required to meet accessibility requirements. Subtitles provide visual information typically tied to the dialogue of a video, wherein the subtitles are displayed as text overlaying the video so that viewers, such as hearing-impaired viewers, can read the dialogue. Other information tied to the audio content associated with the video may also be described by the subtitles, e.g. by displaying texts such as “[radio playing in the background]”, “[dogs barking]” or “[engine sounds]”, allowing viewers to understand further aspects of the audio content being played back alongside the video, even if the viewer has difficulty hearing the played back audio content or is watching the video with the audio on low volume or muted.

[0002] Subtitles are typically provided as a manually prepared subtitle file carrying the subtitle text strings alongside time stamps for each text string indicating when each text string is to be displayed. One or more subtitle files are often included as metadata alongside the video. For example, multiple subtitle files in different languages may be provided as metadata for a video, allowing viewers to select the subtitle language they desire. Typical subtitle file formats are Timed Text Markup Language (TTML), Video Text Tracks (VTT), Scenarist Closed Caption (SCC) and SubRip Subtitle (SRT).

[0003] Another method for making dialogue more accessible and / or generally improving dialogue intelligibility is by the application of various forms of dialogue enhancement (DE). Existing DE processing methods improve dialogue intelligibility by e.g. application of suitable EQ filters or even time-frequency gain masks generated by a neural network. However, dialogue is typically not active at all times, and to avoid application of DE processing in non-dialogue segments (which otherwise could lead to unwanted audio artifacts), DE processing is typically toggled using a dialogue classifier which determines the confidence of dialogue being active at regular intervals. When the dialogue classifier determines that dialogue is present, the DE processor is activated to improve dialogue intelligibility and when the dialogue classifier determines that dialogue is not present, the DE processor is deactivated to avoid distortion of what is likely non-dialogue audio content.

SUMMARY

[0004] It is a purpose of the present disclosure to provide methods and systems for making dialogue more accessible and / or intelligible.

[0005] According to a first aspect there is provided a computer-implemented method for modifying a subtitle file, comprising obtaining a subtitle file and an associated audio signal comprising a sequence of audio frames, the subtitle file comprising a plurality of text strings and, for each text string, a start time and an end time associated with the audio signal. The method further comprises processing the audio signal with a dialogue classifier to generate a classifier confidence for each audio frame of the sequence of audio frames. The method further comprises, based on the classifier confidence for each audio frame, identifying each audio frame as a dialogue audio frame or non-dialogue audio frame so as to form a sequence of dialogue and non-dialogue audio segments, each dialogue audio segment being associated with a start time and an end time and comprising at least one dialogue audio frame, and, for at least a first text string of the plurality of text strings of the subtitle file, identifying a respective corresponding dialogue audio segment of the audio signal, based on the start and / or end times associated with the at least first text string and the start and / or end times associated with the dialogue audio segments. The method further comprises modifying the start time and / or the end time of the at least first text string to obtain a target alignment of the start time and / or end time of the at least first text string with the start time and / or end time of the respective corresponding dialogue audio segment.

[0006] A text string is herein used to denote a sequence of characters represented in any character coding format. That is, a text string is not limited to the datatype “string” used in some programming languages but rather used to generally refer to text data.

[0007] Hereby, the method of the first aspect enables automatic correction and timing alignment enhancement by modifying a subtitle file. In some subtitle files, the time stamps of text strings are not always accurately aligned with the audio content of the audio signal and the method according to the first aspect corrects timing errors based on a classifier confidence value extracted by a dialogue classifier. For example, in some subtitle files there is a delay between the start / end of the dialogue in the associated audio signal and the start / end time of the associated dialogue text string being displayed, or vice versa, and the method of the first aspect may be used to enhance this alignment and reduce the delay.

[0008] According to a second aspect, there is provided a subtitle processing system configured to perform the method according to the first aspect.

[0009] According to a third aspect there is provided a computer-readable storage media having software stored thereon, the software comprising instructions configured to control a processor to perform the method according to the first aspect.

[0010] According to a fourth aspect there is provided a computer-implemented method for processing audio content, comprising: obtaining a subtitle file and an audio signal comprising a sequence of audio frames, the subtitle file comprising a plurality of text strings and, for each text string, a start time and an end time associated with the audio signal, wherein each text string is further associated with at least one label out of a plurality of labels and for each audio frame of the audio signal, determining a dialogue confidence value, wherein the dialogue confidence value is based on the label of a text string overlapping in time with the audio frame. The method further comprises performing dialogue enhancement processing on the audio frames of the audio signal based on the dialogue confidence value.

[0011] Subtitle files carrying accurately timed text strings and labeled information may hereby be used to control dialogue enhancement processing to enhance dialogue intelligibility.

[0012] In some implementations of the fourth aspect, the method further comprises, for each audio frame of the audio signal, processing the audio frame with a first dialogue classifier to obtain a first classifier confidence value, wherein the dialogue confidence value is further based on the first classifier confidence value.

[0013] Accordingly, the labeled subtitle file may be used in conjunction with a dialogue classifier to enable dialogue enhancement control which is more accurate compared to solutions where only a dialogue classifier is used. Especially, in situations where dialogue classifiers are strained the use of a labeled subtitle file may result in more accurate control of dialogue enhancement.

[0014] According to a fifth aspect, there is provided an audio processing system configured to perform the method according to the fourth aspect.

[0015] According to a sixth aspect, there is provided a computer-readable storage media having software stored thereon, the software comprising instructions configured to control a processor to perform the method according to the fourth aspect.

DESCRIPTION OF THE DRAWINGS

[0016] Aspects of the present disclosure will be described in more detail with reference to the appended drawings, showing exemplary embodiments.

[0017] Figure 1 is a block diagram illustrating a subtitle processing system, according to some implementations.

[0018] Figure 2 is a flowchart, illustrating a method for processing a subtitle file, according to some implementations.

[0019] Figure 3 illustrates the contents of a subtitle file, according to some implementations.

[0020] Figure 4 illustrates grouping of frames into dialogue and non-dialogue audio segments, according to some implementations.

[0021] Figures 5A and 5B illustrate different methods for modifying the start and / or end time of a text string to reach a target alignment with the dialogue audio segments, according to some implementations.

[0022] Figures 6A - 6H illustrate yet another method for modifying the start and / or end time of a text string to reach a target alignment with the dialogue audio segments, according to some implementations.

[0023] Figure 7 illustrates a method for modifying the start and / or end time of a text string to reach a target alignment with a longer dialogue audio segment, according to some implementations.

[0024] Figure 8 illustrates removal of non-dialogue text strings from a subtitle file prior to identification of respective corresponding dialogue audio segments and start and / or end time modification, according to some implementations.

[0025] Figure 9 illustrates the contents of a modified subtitle file, expanded with label information for each text string, according to some implementations.

[0026] Figure 10 is a block diagram illustrating an audio processing system configured to process a plurality of candidate subtitle files, according to some implementations.

[0027] Figure 11 is a block diagram illustrating an audio processing system, according to some implementations.

[0028] Figure 12 is a flowchart, illustrating a method for processing an audio signal, according to some implementations.

[0029] Figure 13 is a block diagram illustrating a dialogue loudness system for determining an accurate dialogue loudness measure, according to some implementations.

[0030] Figure 14 is a block diagram illustrating an apparatus for performing or implementing methods and techniques described throughout the present disclosure.

DETAILED DESCRIPTION

[0031] FIG. 1 is a block diagram schematically illustrating a subtitle processing system 1. With further reference to the flowchart in FIG. 2, the subtitle processing system 1 will now be described.

[0032] The subtitle processing system 1 obtains, at step S1, a subtitle file and an associated audio signal as an input. The subtitle processing system 1 comprises one or more dialogue classifiers 12a, 12b and (if two or more dialogue classifiers 12a, 12b are used) a classifier confidence combiner 13. The subtitle processing system 1 further comprises a dialogue / non-dialogue segmentation module 14 and a start / end time modifier module 15. The subtitle processing system 1 may optionally further comprise a parser module 11.

[0033] The subtitle file may be any type of subtitle file, such as a TTML file, a VTT file, an SCC file or an SRT file. The subtitle file is typically associated with an audio signal comprising a plurality of audio samples and / or a video file comprising a plurality of video frames.

[0034] While the formatting and coding of different subtitle files differs, the subtitle processing system 1 may be configured to work with any type of subtitle file. The information typically carried in a subtitle file is illustrated in FIG. 3, showing the contents of an exemplary subtitle file 100.

[0035] The subtitle file 100 comprises a number of text strings (represented as individual rows in FIG. 3). Each text string comprises a series of characters and is further associated with at least time information indicating a point in time (in the audio signal and / or video file) when the associated text string is to be displayed. For instance, for the subtitle file 100 in FIG. 3 the first text string “String 1” is to be displayed between timestamps 0:00:02 and 0:00:05 (H:MM:SS) of an associated audio or video file. Some subtitle file formats further allow each string to be associated with display style metadata indicating one or more display properties for how the string is to be displayed, e.g. the font, font size, italics, underlining, boldface, letter case, color, character spacing and position on the display (e.g. bottom center, bottom left, bottom right, top right etc.). The display style metadata is indicated in the second column in FIG. 3. The duration of the video and / or audio content associated with the subtitle file may be as short as only a few seconds or (as is the case for e.g. movies) several hours long. Hereby, the time stamps of the subtitle file may span a time range between a few seconds and several hours or even longer.
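As a non-limiting illustration, the information described for each text string could be held in a small record such as the one sketched below. The names `TextString` and `parse_timestamp` are hypothetical and do not appear in the disclosure, and the sketch assumes time stamps written as H:MM:SS as in the FIG. 3 example; real subtitle formats such as TTML, VTT, SCC or SRT each use their own syntax.

```python
from dataclasses import dataclass


@dataclass
class TextString:
    """One subtitle entry: text, display interval in seconds, optional style."""
    text: str
    start: float
    end: float
    style: str = ""  # display style metadata, if the format carries any


def parse_timestamp(stamp):
    """Convert an H:MM:SS time stamp (assumed layout) to seconds."""
    hours, minutes, seconds = stamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# The first row of the exemplary subtitle file: shown from 0:00:02 to 0:00:05.
first = TextString("String 1", parse_timestamp("0:00:02"), parse_timestamp("0:00:05"))
```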

[0036] Turning back to the subtitle processing system 1 of FIG. 1, and the flowchart of FIG. 2, the audio signal comprises a sequence of audio frames. An audio frame may comprise a single sample (e.g. if the audio signal is in the time domain an audio frame may comprise a single audio sample). An audio frame may also comprise a plurality of samples (such as a group of time domain samples). The audio signal may be represented in time domain or in a transform domain, such as a time-frequency domain, a QMF domain or an MDCT domain, wherein each audio frame comprises one or more transform domain samples (e.g. a group of MDCT samples spanning time and frequency). Irrespective of the domain in which the audio signal is represented, a frame represents a part or “snippet” of the audio content of the audio signal.

[0037] At step S2, the audio signal is provided to a first dialogue classifier 12a configured to process the audio signal to generate a first classifier confidence value CCV1 for each frame of the audio signal. The first classifier confidence value CCV1 indicates a degree of confidence for dialogue being present in each frame. For example, the first dialogue classifier 12a processes the audio signal on a frame-by-frame basis and generates the first classifier confidence value CCV1 for one frame at a time. Typically, the classifier confidence value is a value in the range of [0, 1] with 0 indicating low confidence for dialogue being active, 1 indicating high confidence for dialogue being active and values between 0 and 1 indicating various confidence values.

[0038] The first dialogue classifier 12a may be any type of computer implemented dialogue classifier.

[0039] For example, a simple dialogue classifier may be realized by analyzing the energy level in a frequency band associated with human speech (e.g. between 1 kHz - 5 kHz) whereby the classifier confidence value is proportional to the energy level.
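A minimal sketch of such an energy-based classifier is given below for illustration only. The function name `band_energy_confidence` is hypothetical, and dividing the band energy by the total frame energy (so that the result lies in [0, 1]) is an assumption of this sketch; the disclosure only states that the confidence is proportional to the band energy. A naive discrete Fourier transform is used to keep the example self-contained.

```python
import cmath
import math


def band_energy_confidence(frame, sample_rate, lo_hz=1000.0, hi_hz=5000.0):
    """Fraction of the frame's spectral energy inside [lo_hz, hi_hz]."""
    n = len(frame)
    total = 0.0
    band = 0.0
    # Naive DFT over the positive-frequency bins (DC is skipped).
    for k in range(1, n // 2 + 1):
        coeff = sum(frame[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
        energy = abs(coeff) ** 2
        total += energy
        if lo_hz <= k * sample_rate / n <= hi_hz:
            band += energy
    # A silent frame carries no evidence of dialogue.
    return band / total if total > 0.0 else 0.0


# A 2 kHz tone sampled at 16 kHz falls inside the assumed speech band.
rate = 16000
tone_in_band = [math.sin(2 * math.pi * 2000 * i / rate) for i in range(64)]
tone_below_band = [math.sin(2 * math.pi * 250 * i / rate) for i in range(64)]
```

In this sketch a tone inside the band yields a confidence close to one and a tone below the band yields a confidence close to zero.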

[0040] More sophisticated dialogue classifiers utilize machine learning algorithms, such as trained neural networks, to generate more accurate classifier confidence values. To train a neural network to predict classifier confidence values, training data in the form of manually labeled audio frames may be provided, wherein each frame is labeled as a speech or non-speech frame based on the presence of speech. The training frames are provided as input to the neural network and the neural network outputs a predicted classifier confidence value (e.g. a value between 0 and 1). The learnable parameters of the neural network are updated based on the label of each training frame so as to promote the neural network to output a high classifier confidence value for speech frames and a low classifier confidence value for non-speech frames.
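The training principle can be illustrated with a deliberately reduced stand-in. The sketch below trains a single sigmoid unit (logistic regression), not a full neural network, on one hypothetical feature per frame; the feature values, labels, learning rate and epoch count are all assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Predicted classifier confidence in (0, 1) for one feature vector."""
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


def train_classifier(features, labels, epochs=200, learning_rate=0.5):
    """Stochastic gradient descent on the binary cross-entropy loss."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, label in zip(features, labels):
            # Gradient of the loss with respect to the pre-activation.
            error = predict(weights, bias, x) - label
            weights = [w - learning_rate * error * v for w, v in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


# Hypothetical labeled frames: 1 = speech frame, 0 = non-speech frame.
features = [[0.9], [0.8], [0.1], [0.2]]
labels = [1, 1, 0, 0]
weights, bias = train_classifier(features, labels)
```

After training, speech-like frames receive a higher predicted confidence than non-speech frames, which is the behavior the parameter updates are intended to promote.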

[0041] Optionally, the audio signal may be processed with one or more additional dialogue classifiers 12b at step S2. In FIG. 1 two dialogue classifiers, the first dialogue classifier 12a generating the first classifier confidence value CCV1 and a second dialogue classifier 12b generating a second classifier confidence value CCV2 are used. However, it is envisaged that three or more dialogue classifiers could be used in an analogous manner.

[0042] By using two or more, different, dialogue classifiers 12a, 12b their different classification characteristics may be leveraged to form a more accurate combined classifier confidence value, which more accurately describes where the boundary between dialogue and non-dialogue audio content is located. For example, the second dialogue classifier 12b is more conservative for detecting dialogue (leading to few false positives but potentially an increased number of false negatives) and the first dialogue classifier 12a is more liberal (leading to few false negatives but potentially an increased number of false positives). To determine how liberal or conservative a dialogue classifier 12a, 12b is, the dialogue classifiers 12a, 12b may be tested to determine their false positive rate and / or false negative rate. The properties of a dialogue classifier 12a, 12b depend on its structure (e.g. the type of neural network used and its training).

[0043] Continuing the example above, wherein the first dialogue classifier 12a is more liberal and the second dialogue classifier 12b more conservative a singing voice may be detected as dialogue by the first dialogue classifier 12a but not by the second dialogue classifier 12b. In some scenarios, it is desired that a singing voice is not to be classified as dialogue (since subsequent dialogue enhancement processing may introduce artifacts to the singing) and while it would be difficult to distinguish between regular spoken dialogue and a singing voice using only the first dialogue classifier 12a, the different characteristics of the two dialogue classifiers 12a, 12b may be leveraged to distinguish between spoken dialogue and a singing voice, to classify the singing voice as non-dialogue.

[0044] The first and second classifier confidence value CCV1, CCV2 are provided to the classifier confidence combiner 13 which combines the classifier confidence values CCV1, CCV2, into a combined classifier confidence value CCVC. The combination of the two or more classifier confidence values CCV1, CCV2 performed by classifier confidence combiner 13 could be of various types. For example, the classifier confidence combiner 13 determines, for each frame, an average, median, highest or lowest value based on the classifier confidence values CCV1, CCV2 received as an input.
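The per-frame combination options listed above may be sketched as follows. This is a hedged illustration: the function name `combine_confidences` and the mode names are hypothetical, and the sample confidence values are invented.

```python
from statistics import mean, median

# Per-frame combination rules named in the text.
COMBINERS = {"average": mean, "median": median, "highest": max, "lowest": min}


def combine_confidences(per_classifier, mode="average"):
    """Combine several per-frame confidence sequences into one sequence."""
    combine = COMBINERS[mode]
    # zip(*...) yields, for each frame, the values from every classifier.
    return [combine(values) for values in zip(*per_classifier)]


ccv1 = [0.9, 0.8, 0.2]  # hypothetical output of a first classifier
ccv2 = [0.7, 0.2, 0.2]  # hypothetical output of a second classifier
highest = combine_confidences([ccv1, ccv2], "highest")
lowest = combine_confidences([ccv1, ccv2], "lowest")
average = combine_confidences([ccv1, ccv2], "average")
```

Selecting the lowest value favors the more conservative classifier, whereas selecting the highest value favors the more liberal one.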

[0045] As another example, the classifier confidence combiner 13 may be configured to use the first classifier confidence CCV1 as the combined classifier confidence value CCVC as long as the first and second classifier confidence CCV1, CCV2 are in agreement (for example have a difference smaller than a predetermined threshold). The classifier confidence combiner 13 may be further configured to use the second classifier confidence CCV2 as the combined classifier confidence value CCVC when the first and second classifier confidence CCV1, CCV2 are not in agreement (for example have a difference greater than the predetermined threshold). When the first dialogue classifier 12a is comparatively more liberal than the second dialogue classifier 12b as exemplified above, this configuration of the classifier confidence combiner 13 may avoid assigning high confidence values to frames containing a singing voice.
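The agreement-based rule can be sketched as below. The name `combine_by_agreement`, the threshold value 0.3 and the sample values are assumptions; the text leaves open how a difference exactly equal to the threshold is handled, and this sketch treats it as disagreement.

```python
def combine_by_agreement(ccv1, ccv2, max_difference=0.3):
    """Use CCV1 where both classifiers agree, otherwise fall back to CCV2."""
    return [
        c1 if abs(c1 - c2) < max_difference else c2
        for c1, c2 in zip(ccv1, ccv2)
    ]


# Hypothetical frames: spoken dialogue, a singing voice, background noise.
liberal = [0.9, 0.85, 0.1]
conservative = [0.8, 0.2, 0.15]
combined = combine_by_agreement(liberal, conservative)
```

In the second frame the two classifiers disagree strongly, so the conservative value is used and the frame does not receive a high combined confidence.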

[0046] In some implementations, only the first dialogue classifier 12a is used. In such implementations, the classifier confidence combiner 13 may be omitted and the first classifier confidence is used as the combined classifier confidence value CCVC directly.

[0047] At step S3 the combined classifier confidence value CCVC is provided to the dialogue / non-dialogue segmentation module 14 alongside the audio signal, and the dialogue / non-dialogue segmentation module 14 groups the frames of the audio signal into the dialogue and non-dialogue audio segments based on the combined classifier confidence value CCVC, each dialogue and non-dialogue audio segment comprising at least one frame.

[0048] An audio frame (being e.g. less than 20 ms or less than 100 ms long) is typically shorter than the duration of even a single word utterance and / or the time a subtitle string is displayed (typically at least 0.5 s). The dialogue / non-dialogue segmentation module 14 enables the combined classifier confidence values CCVC having a comparatively high time resolution to be compared with the comparatively low time resolution of the text strings by grouping multiple frames into dialogue and non-dialogue audio segments based on the combined classifier confidence value CCVC.

[0049] The grouping of frames into dialogue and non-dialogue audio segments may be performed in various ways. FIG. 4 illustrates an exemplary sequence 105 of ten frames F1 - F10 wherein the combined classifier confidence CCVC is indicated for each frame. For instance, the CCVC of the third frame F3 is 0.9, meaning that it is very likely that dialogue is present in this frame, and for the fourth frame F4 the CCVC has dropped to 0.2, meaning that it is less likely that dialogue is present in this frame.

[0050] The grouping may be performed by comparing the CCVC of each frame F1 - F10 to a first threshold, wherein all frames having a CCVC exceeding the first threshold are labeled as dialogue frames and consecutive dialogue frames form a dialogue audio segment. On the other hand, all frames having a CCVC below the first threshold are labeled as non-dialogue frames and consecutive non-dialogue frames form a non-dialogue audio segment. In FIG. 4, the first threshold is 0.5 and frames F1 - F3 as well as frames F9 - F10 have a CCVC exceeding 0.5, meaning that these are grouped into segments S1 and S3, respectively, wherein segments S1 and S3 are dialogue audio segments. Similarly, frames F4 - F8 have a CCVC below 0.5, meaning that these are grouped into segment S2 being a non-dialogue audio segment.
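The threshold-based grouping may be sketched as follows. Only the values 0.9 for F3 and 0.2 for F4 are taken from the description of FIG. 4; the remaining confidence values are invented for illustration, as is the function name `group_frames`. Frames exactly at the threshold are treated as non-dialogue here, which is an assumption.

```python
from itertools import groupby


def group_frames(ccvc, threshold=0.5):
    """Group consecutive frames into (is_dialogue, first_index, last_index)."""
    labels = [value > threshold for value in ccvc]
    segments = []
    index = 0
    for is_dialogue, run in groupby(labels):
        length = len(list(run))
        segments.append((is_dialogue, index, index + length - 1))
        index += length
    return segments


# Ten frames (zero-based indices 0..9); index 2 is F3 and index 3 is F4.
ccvc = [0.8, 0.7, 0.9, 0.2, 0.1, 0.3, 0.2, 0.4, 0.6, 0.8]
segments = group_frames(ccvc)
```

With these values the sketch yields a dialogue segment covering the first three frames, a non-dialogue segment covering the next five, and a dialogue segment covering the last two, matching the grouping described for FIG. 4.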

[0051] Other ways of grouping the frames F1 - F10 based on the CCVC of each frame are also envisaged. For example, the CCVC values of frames F1 - F10 may be smoothed with a smoothing kernel prior to forming the segments, or consecutive frames may be grouped together to form a dialogue audio segment based on a criterion that their mean or median CCVC exceeds a threshold. Alternatively, the CCVC output by the classifier confidence combiner 13 is already smoothed.

[0052] Additionally or alternatively the dialogue audio segments are smoothed as a separate step after grouping of the frames F1 - F10 to form the segments. The smoothing may e.g. serve to remove isolated dialogue segments or serve to reduce abrupt changes. For instance, the smoothing may ensure that no dialogue audio segment and / or non-dialogue audio segment is shorter than a predetermined minimum time to avoid too rapid or frequent switching between dialogue and non-dialogue audio segments.

[0053] The CCVC (and any one of the CCV1 and CCV2) may be a numerical value. For example, the numerical value may be defined on a numerical range between a first numerical value and a second numerical value, wherein the first numerical value indicates a minimum classifier confidence and the second numerical value indicates a maximum classifier confidence. As an illustrative example, the first numerical value is zero and the second numerical value is one; conventional dialogue classifiers typically output a classifier confidence in this range.

[0054] With numerical values representing CCVC, CCV1 and CCV2 it is possible to perform mathematical operations such as smoothing, averaging and comparison (e.g. to thresholds). With numerical values it is possible to calculate a (cross) correlation between the active time of the text strings and the CCVC, CCV1 and / or CCV2 as described below.

[0055] Since the classifier confidence may be determined with a dialogue classifier for each individual audio frame, which may be shorter than 200 ms, shorter than 100 ms, shorter than 50 ms or shorter than 20 ms, the presence of dialogue in an audio signal can be determined with an accuracy which is greater than the rate of utterances, wherein each utterance (word) is typically longer than 0.5 seconds.

[0056] To identify each audio frame as a dialogue audio frame or non-dialogue audio frame the CCVC, CCV1 or CCV2 may be smoothed. The smoothing may comprise smoothing the CCVC, CCV1 or CCV2 across time with a smoothing kernel. By comparing the, optionally smoothed, CCVC, CCV1 or CCV2 to a threshold defined on the numerical range each frame may be identified as a dialogue audio frame or non-dialogue audio frame.
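A minimal sketch of smoothing followed by threshold comparison is shown below. The three-tap moving-average kernel, the renormalization at the signal edges and the names `smooth_confidences` and `label_frames` are assumptions of this sketch; the disclosure does not prescribe a particular kernel.

```python
def smooth_confidences(values, kernel=(1 / 3, 1 / 3, 1 / 3)):
    """Smooth confidences over time with an odd-length kernel."""
    half = len(kernel) // 2
    smoothed = []
    for i in range(len(values)):
        acc = 0.0
        weight = 0.0
        for j, w in enumerate(kernel):
            k = i + j - half
            if 0 <= k < len(values):  # ignore taps outside the signal
                acc += w * values[k]
                weight += w
        smoothed.append(acc / weight)  # renormalize at the edges
    return smoothed


def label_frames(values, threshold=0.5):
    """True marks a dialogue audio frame, False a non-dialogue audio frame."""
    return [value > threshold for value in values]


# A single low-confidence frame inside otherwise confident dialogue.
raw = [0.9, 0.9, 0.1, 0.9, 0.9]
labels_raw = label_frames(raw)
labels_smoothed = label_frames(smooth_confidences(raw))
```

Without smoothing the middle frame is identified as non-dialogue; after smoothing all five frames are identified as dialogue frames.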

[0057] In some implementations, the first numerical value is smaller than the second numerical value, and the threshold is between the first and second numerical value. That is, a high CCVC, CCV1 or CCV2 represents a high classifier confidence and vice versa. However, it is envisaged that the opposite definition may also be used (a low CCVC, CCV1 or CCV2 represents a high classifier confidence and vice versa).

[0058] With the example of the first numerical value being smaller than the second numerical value, and the threshold being between the first and second numerical value, each audio frame having an, optionally smoothed, CCVC, CCV1 or CCV2 above the threshold is identified as a dialogue audio frame and each audio frame having a CCVC, CCV1 or CCV2 below the threshold is identified as a non-dialogue audio frame.

[0059] The grouping of the dialogue and non-dialogue audio frames into dialogue and non-dialogue segments may be performed under the constraint that each dialogue or non-dialogue segment should have a duration of at least a predetermined minimum duration. This may be accomplished by reassigning one or more specific audio frames, identified as dialogue audio frames using a threshold comparison and surrounded by a sequence of temporally earlier non-dialogue audio frames and a sequence of temporally later non-dialogue audio frames, as non-dialogue frames so as to form a non-dialogue segment spanning the temporally earlier audio frames, the temporally later audio frames and the one or more specific audio frames. The opposite scenario, with one or more specific audio frames identified as non-dialogue audio frames surrounded by dialogue audio frames, is also envisaged wherein a long dialogue segment is formed comprising the specific audio frames and the surrounding audio frames. Hereby, the step of identifying each audio frame as a dialogue audio frame or non-dialogue audio frame so as to form a sequence of dialogue and non-dialogue audio segments may comprise forming dialogue and non-dialogue audio segments having at least the predetermined minimum duration.
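One way of enforcing such a minimum duration is sketched below, under stated assumptions: frames carry boolean labels, the minimum duration is expressed as a number of frames, and a short run at the very start or end of the signal is left untouched because it is not surrounded on both sides. The function name `enforce_min_duration` is hypothetical.

```python
from itertools import groupby


def enforce_min_duration(labels, min_frames):
    """Reassign short runs that are surrounded by runs of the other label."""
    runs = [[label, len(list(group))] for label, group in groupby(labels)]
    changed = True
    while changed:
        changed = False
        for i in range(1, len(runs) - 1):
            if runs[i][1] < min_frames:
                # The neighbours share one label, so the earlier run absorbs
                # both the short run and the later run.
                runs[i - 1][1] += runs[i][1] + runs[i + 1][1]
                del runs[i:i + 2]
                changed = True
                break
    return [label for label, length in runs for _ in range(length)]


# An isolated dialogue frame inside non-dialogue content is reassigned.
labels = [False] * 3 + [True] + [False] * 3 + [True] * 4
cleaned = enforce_min_duration(labels, min_frames=2)
```

The same routine also covers the opposite scenario, in which a short non-dialogue run inside dialogue is absorbed into one long dialogue segment.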

[0060] The predetermined minimum duration may be longer than the audio frame length and e.g. correspond to a plurality of audio frame lengths. Hereby, in some implementations, each dialogue and non-dialogue segment has a minimum duration corresponding to the duration of at least two audio frames.

[0061] The dialogue and non-dialogue audio segments S1, S2, S3 are each associated with a start time and an end time. In the example depicted in FIG. 4, dialogue audio segment S1 starts at time t1 and ends at time t2, non-dialogue audio segment S2 starts at time t2 and ends at time t3, and dialogue audio segment S3 starts at time t3 and ends at time t4, wherein t4 > t3 > t2 > t1.

[0062] Turning back to FIG. 1 and FIG. 2, the dialogue audio segments, or at least the start and end times of the dialogue audio segments, are provided to the start / end time modifier module 15 configured to process the subtitle file based on the start and end times of the dialogue audio segments S1, S3.

[0063] At step S4 the start / end time modifier module 15 identifies a dialogue audio segment which corresponds to a first text string. This dialogue audio segment is referred to as the respective corresponding dialogue audio segment. Optionally, this process is repeated for each text string, meaning that for each text string a respective corresponding dialogue audio segment is identified. A respective corresponding dialogue audio segment may be the dialogue audio segment having a start and / or end time being closest to the start and / or end time of a text string.

[0064] The start / end time modifier module 15 further forms a modified subtitle file by modifying at least one of the start time and the end time of at least one text string of the subtitle file at step S5 to reach a target alignment between at least one of the start time and end time of the first dialogue text string and at least one of the start time and end time of the corresponding respective dialogue audio segment.
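Steps S4 and S5 may be illustrated with the sketch below. Measuring closeness as the sum of the start-time and end-time distances, and snapping both times of the text string onto the segment, are assumptions made for this example; they represent only one possible target alignment. The function names and the sample times are hypothetical.

```python
def find_corresponding_segment(text_start, text_end, dialogue_segments):
    """Return the (start, end) dialogue segment closest to the text timing."""
    return min(
        dialogue_segments,
        key=lambda seg: abs(seg[0] - text_start) + abs(seg[1] - text_end),
    )


def align_text_string(text_start, text_end, dialogue_segments):
    """Replace the text string times with those of its corresponding segment."""
    seg_start, seg_end = find_corresponding_segment(
        text_start, text_end, dialogue_segments
    )
    return seg_start, seg_end


# Hypothetical dialogue segments in seconds and a text string shown too late.
segments = [(1.0, 3.5), (6.0, 8.0), (10.0, 12.5)]
aligned = align_text_string(6.4, 8.3, segments)
```

Here the text string originally displayed from 6.4 s to 8.3 s is moved to 6.0 s to 8.0 s, the interval of the nearest dialogue audio segment.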

[0065] In FIGS. 5A, 5B and 6A-6H, various methods for modifying at least one start and / or end time to reach a target alignment are described.

[0066] In FIG. 5A the start and end times of three dialogue audio segments Sa, Sb, Sc are indicated in the top graph with the high state D (for dialogue) indicating a dialogue audio segment and the low state ND (for non-dialogue) indicating a non-dialogue audio segment. Accordingly, based on the audio signal, the processing of the one or more dialogue classifiers 12a, 12b and the formation of the dialogue audio segments, dialogue is active in segments Sa, Sb, Sc whereas no dialogue is active in between segments Sa, Sb, Sc.

[0067] However, comparing the top graph in FIG. 5A to the middle graph, which illustrates the timeslots in which the text strings Ta, Tb, Tc are to be displayed in accordance with the start and end times indicated in the subtitle file, reveals that there is a time offset between when a text string is displayed and when the dialogue is active. For example, text string Ta is displayed after the corresponding dialogue audio segment Sa starts and text string Ta ends after the corresponding dialogue audio segment Sa ends. Furthermore, there is also a timing misalignment between text string Tb and its corresponding dialogue audio segment Sb and a timing misalignment between text string Tc and its corresponding dialogue audio segment Sc.

[0068] The start / end time modifier module 15 may adjust the start and / or end time of at least one text string to reach a target alignment with enhanced general alignment for all text strings of the entire subtitle file.

[0069] In the bottom graph of FIG. 5A a constant negative time adjustment Δt has been used to adjust all start times and end times of the subtitle file to form the modified subtitle file. This is one example of global time alignment which is applied to the entire subtitle file. The constant time adjustment Δt has been determined by the start / end time modifier module 15 such that there is on average better alignment between dialogue audio segments and the text strings for the whole subtitle file and audio signal. The constant time adjustment Δt may be found by evaluating a cross-correlation between the text strings and the dialogue audio segments for a range of negative and positive time adjustments of the text strings and selecting the time adjustment Δt with the greatest cross-correlation in the evaluated range. This type of alignment works well if a potential misalignment between the audio signal and the subtitle file is constant over time.

[0070] The cross-correlation may be determined using a cross-correlation function. For example, the start and end times of each text string may be used to form a function x(t) which is equal to one (or a predetermined positive high value) at times t when a text string is active and equal to zero (or a predetermined positive low value) when no text string is active. The audio segments may form a function y(t) which is equal to one (or a predetermined positive high value) for dialogue audio segments and equal to zero (or a predetermined positive low value) for non-dialogue audio segments. The functions x(t) and y(t) may be discrete functions where t = t1, t2, t3, ... of a plurality of samples or frames. A cross-correlation function Corr(x, y) may then be calculated in accordance with equation 1.

[0071] Alternatively, the function y(t) may be replaced in equation 1 with a classifier confidence value (e.g. the first classifier confidence value CCV1 or the combined classifier confidence CCVC) which is a continuous valued function varying between e.g. zero and one. As another option, a centered cross-correlation Corrc(x, y) may be used (equation 2), wherein u and v denote the mean value of x(t) and y(t) respectively. Again, y(t) may be replaced in equation 2 with a classifier confidence value (e.g. the first classifier confidence value CCV1 or the combined classifier confidence CCVC) and v replaced with the mean value of CCV1 or CCVC.
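
A plain, unnormalized form of the cross-correlation and of the centered cross-correlation that is consistent with the definitions of x(t), y(t), u and v above would be the following. The exact expressions, and whether any normalization is applied, are assumptions of this sketch.

```latex
\mathrm{Corr}(x, y) = \sum_{t} x(t)\, y(t) \qquad (1)

\mathrm{Corr}_c(x, y) = \sum_{t} \bigl(x(t) - u\bigr)\bigl(y(t) - v\bigr) \qquad (2)
```

With binary x(t) and y(t), the sum in (1) simply counts the frames in which a text string and a dialogue audio segment are active at the same time.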

[0072] The cross-correlation (such as the cross-correlation function of equation 1 or 2) may be evaluated over a range of negative and positive candidate time adjustments Δt1, Δt2, Δt3, ..., ΔtN of the text strings. For example, Corr[x(t + Δti), y(t)] is evaluated for i = 1, 2, 3, ..., N and the candidate time adjustment Δti which results in the maximum cross-correlation is used as the time adjustment Δt for modifying the start and / or end times of the text strings. The candidate time adjustments may be spread over a predetermined interval, such as between -1 second and +1 second or between -5 seconds and +5 seconds, to name a few examples.
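
The search over candidate time adjustments may, purely as an example, be sketched in Python as below. The sketch assumes binary activity functions sampled on a uniform frame grid, an unnormalized sum of products as the cross-correlation, and a sign convention in which the returned adjustment is added to every start and end time (a negative value moves the text strings earlier). The names indicator and best_global_shift, the frame length of 0.1 s and the tie-breaking rule are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def indicator(intervals: List[Interval], n_frames: int, frame_s: float) -> List[int]:
    """Sample a 0/1 activity function (x(t) or y(t)) on a uniform frame grid."""
    active = [0] * n_frames
    for start, end in intervals:
        first = max(0, round(start / frame_s))
        last = min(n_frames, round(end / frame_s))
        for k in range(first, last):
            active[k] = 1
    return active


def best_global_shift(texts: List[Interval], segments: List[Interval],
                      duration_s: float, frame_s: float = 0.1,
                      max_shift_s: float = 5.0) -> float:
    """Return the candidate adjustment with the greatest cross-correlation."""
    n_frames = round(duration_s / frame_s)
    y = indicator(segments, n_frames, frame_s)
    max_k = round(max_shift_s / frame_s)
    best_shift, best_corr = 0.0, -1
    # Candidates are visited from small to large magnitude, so that a tie
    # is resolved in favour of the smaller adjustment (an assumption).
    for k in sorted(range(-max_k, max_k + 1), key=abs):
        shift = k * frame_s
        x = indicator([(s + shift, e + shift) for s, e in texts], n_frames, frame_s)
        corr = sum(a * b for a, b in zip(x, y))  # unnormalized sum of products
        if corr > best_corr:
            best_shift, best_corr = shift, corr
    return best_shift


# Text strings that are displayed one second late relative to the dialogue.
shift = best_global_shift([(11.0, 17.0), (41.0, 45.0)],
                          [(10.0, 16.0), (40.0, 44.0)], duration_s=60.0)
```

In this example the text strings lag the dialogue audio segments by one second, so the selected adjustment is approximately -1 second.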

[0073] In some implementations, the adjustment of the start and end times of the text strings is performed not for the whole subtitle file and audio signal but for blocks of the subtitle file and audio signal. In the two top graphs of FIG. 5B the same dialogue audio segments Sa, Sb, Sc and text strings Ta, Tb, Tc from FIG. 5A are depicted showing some misalignment. The subtitle file and audio signal are partitioned into a series of blocks B1, B2, B3. The blocks B1, B2, B3 may each be of the same length or of varying length. In some implementations, the partitioning into blocks B1, B2, B3 is configured to avoid separating a dialogue audio segment and / or text string into two separate adjacent blocks B1, B2, B3. That is, each block B1, B2, B3 may begin and end between two text strings and / or in a non-dialogue audio segment.

[0074] For each block B1, B2, B3, a block specific constant time adjustment ΔtB1, ΔtB2, ΔtB3 may be determined by e.g. evaluating for each block B1, B2, B3 a cross-correlation between the text strings and the dialogue audio segments within the same block B1, B2, B3 for a range of negative and positive block specific time adjustments of the text strings, and selecting for each block the block specific time adjustment ΔtB1, ΔtB2, ΔtB3 with the greatest cross-correlation in the evaluated range. For example, the cross-correlation function of equation 1 or the centered cross-correlation function of equation 2 may be applied for an interval of candidate time adjustments in each block B1, B2, B3 individually and the candidate time adjustment resulting in the greatest (centered) cross-correlation is selected as the block specific time adjustment ΔtB1, ΔtB2, ΔtB3. The start and end times of text strings in the subtitle file of each block B1, B2, B3 are then modified by the block specific constant time adjustment ΔtB1, ΔtB2, ΔtB3 to achieve the target alignment of the text strings in each block individually. In the example of FIG. 5B block specific time adjustments ΔtB2 and ΔtB3 of blocks B2 and B3 are negative time adjustments of different magnitudes and block specific time adjustment ΔtB1 of block B1 is a positive time adjustment.
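
A block-by-block variant may be sketched as follows. For binary activity functions the unnormalized cross-correlation is proportional to the total overlap duration between text strings and dialogue audio segments, which is what this sketch maximizes per block. The function names, the assignment of text strings and segments to blocks by their start time, and the example values are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def overlap(a: Interval, b: Interval) -> float:
    """Duration in seconds during which two intervals are both active."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def best_shift(texts: List[Interval], segments: List[Interval],
               candidates: Sequence[float]) -> float:
    """Candidate adjustment giving the greatest total overlap."""
    def score(shift: float) -> float:
        return sum(overlap((s + shift, e + shift), seg)
                   for s, e in texts for seg in segments)
    return max(candidates, key=score)


def blockwise_shifts(texts: List[Interval], segments: List[Interval],
                     block_edges: Sequence[float],
                     candidates: Sequence[float]) -> List[float]:
    """One time adjustment per block [edge_i, edge_i+1)."""
    shifts = []
    for lo, hi in zip(block_edges, block_edges[1:]):
        t = [x for x in texts if lo <= x[0] < hi]
        s = [x for x in segments if lo <= x[0] < hi]
        # A block without text strings or dialogue is left unchanged.
        shifts.append(best_shift(t, s, candidates) if t and s else 0.0)
    return shifts


candidates = [k / 10 for k in range(-20, 21)]  # -2.0 s ... +2.0 s in 0.1 s steps
shifts = blockwise_shifts(
    texts=[(9.5, 15.5), (41.0, 45.0), (70.3, 73.3)],
    segments=[(10.0, 16.0), (40.0, 44.0), (70.0, 73.0)],
    block_edges=[0.0, 30.0, 60.0, 90.0],
    candidates=candidates,
)
```

With these example values the first block receives a positive adjustment (about +0.5 s) and the second and third blocks receive negative adjustments of different magnitudes (about -1.0 s and -0.3 s), mirroring the situation described for FIG. 5B.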

[0075] In FIG. 5B each block B1, B2, B3 is for illustrative purposes shown as containing a single dialogue audio segment. It is however understood that in general each block B1, B2, B3 may contain a plurality of dialogue audio segments. For example, in some implementations each block is at least 30 seconds long, at least 1 minute long or at least 5 minutes long, wherein one or more blocks contain two or more, ten or more or even twenty or more dialogue audio segments. Since the subtitle file of e.g. a movie may cover more than 60, 90 or 120 minutes it is envisaged that in some implementations the subtitle file is divided into more than 10, more than 50 or more than 100 blocks.

[0076] The block-by-block alignment may enable more accurate alignment compared to global alignment of the entire subtitle file (which e.g. may be an hour long or longer), especially if the alignment error is not constant but varies over time throughout the duration of the audio signal. However, depending on the size and number of the blocks it is understood that the alignment in each block may achieve average alignment, but it does not necessarily provide the perfect alignment for each text string to its corresponding dialogue audio segment.

[0077] To this end, it is envisaged that in some implementations, each text string is processed and modified individually. For example, the start and end time of each individual text string is adjusted with a text string specific time adjustment to align with its respective corresponding dialogue audio segment. The respective corresponding dialogue audio segment may be the dialogue audio segment which is closest to the text string in time based on the start and / or end time of the text string and the dialogue audio segments. In some implementations, the global time alignment and / or block-by-block time alignment is performed first to form a preliminary modified subtitle file, wherein the preliminary modified subtitle file is used when determining respective corresponding dialogue audio segments and performing the modification of individual text strings.

[0078] Individual modification of text strings is especially useful if the mismatch between audio dialogue and text strings is not constant over time. Furthermore, it is envisaged that misalignment between text strings and their respective corresponding dialogue audio segment is not only due to these being displaced in time relative to each other but because these are of unequal duration. For example, a text string may according to the subtitle file be displayed for 4 seconds, between time 0:00:11 and time 0:00:15 but the associated dialogue audio segment may be 6 seconds long, appearing between time 0:00:10 and time 0:00:16. Hereby, in addition to and / or as an alternative to displacing a text string in time with a constant time adjustment for both the start and end time, the duration of a text string may be modified to better align with the corresponding dialogue audio segment.

[0079] FIGS. 6A-6H show examples of how the start and / or end times of an individual text string may be modified.

[0080] In FIG. 6A, a dialogue audio segment Sa starts at time tc and ends at time td. The corresponding text string Ta starts (according to the subtitle file) at start time ta and ends at time tb, wherein tb > ta, ta > tc and td > tb. The start and end times of the dialogue audio segment Sa, obtained using one or more dialogue classifiers, hereby indicate that there is dialogue which starts before the text string Ta is displayed at time ta, and dialogue remaining even after the text string Ta stops being displayed at time tb. Hereby modifying the start and / or end time of text string Ta may comprise modifying the start and / or end time of the text string Ta such that the start time corresponds to the start time of the corresponding dialogue audio segment Sa and / or such that the end time corresponds to the end time of the corresponding dialogue audio segment Sa. In FIG. 6E a modified text string T’a is shown wherein the start and end times have been modified from ta and tb to tc and td respectively, to correspond to the start and end time of the dialogue audio segment Sa. By modifying the start and / or end time of the text string Ta to form a modified text string T’a, a modified subtitle file is formed in which the time alignment with dialogue of the associated audio signal is enhanced, e.g. more accurate.

[0081] In FIG. 6B, a dialogue audio segment Sa starts at time tc and ends at time td. The corresponding text string Ta starts (according to the subtitle file) at start time ta and ends at time tb, wherein tb > ta, ta > tc and tb > td in this example. In this example, modifying the start and / or end time may again comprise modifying the start and / or end time of the text string Ta such that the start time corresponds to the start time of the corresponding dialogue audio segment Sa and / or such that the end time corresponds to the end time of the corresponding dialogue audio segment Sa. In FIG. 6F a modified text string T’a is shown wherein the start and end times have been modified from ta and tb to tc and tb respectively, so that the start time corresponds to the start of the corresponding dialogue audio segment Sa. However, the end time tb of the text string Ta is retained to provide a modified text string T’a which ends after the dialogue audio segment Sa. By modifying the text string Ta by placing the start time earlier and / or the end time later to align with a dialogue audio segment, a conservative time adjustment is achieved where the text strings will overlap with dialogue audio, and wherein it is avoided that the dialogue audio is presented with delayed text string display and / or that dialogue audio is presented with text strings ending prematurely.

[0082] Analogously, it is envisaged that if the text string starts earlier than, but also ends earlier than, the respective corresponding dialogue audio segment, the start time of the text string is retained and the end time of the text string Ta is modified to coincide with the end time of the dialogue audio segment Sa. This case is exemplified in FIG. 6C where a dialogue audio segment Sa starts at time tc and ends at time td whereas the corresponding text string Ta starts at start time ta and ends at time tb, wherein tb > ta, tc > ta and td > tb. The resulting modified text string T’a is shown in FIG. 6G and, as seen, the modified text string T’a starts at time ta and ends at time td.

[0083] In the examples of FIGS. 6A-6C and 6E-6G it is demonstrated that the start and / or end time of a text string Ta could be modified to correspond to the start and / or end time of the corresponding dialogue audio segment Sa.

[0084] In some implementations, it is envisaged that the start and / or end time of a text string Ta is modified so as to start earlier than and / or end later than the corresponding dialogue audio segment Sa. That is, the modification of the start time of the text string Ta in FIGS. 6A and 6B may involve setting the start time to tc - tmargin, wherein tmargin is a small margin time (e.g. 20 ms or 50 ms). Analogously, the modification of the end time of the text string Ta in FIGS. 6A and 6B may involve setting the end time to td + tmargin. This conservative modification of the start and / or end time of the text string Ta may facilitate future dialogue enhancement processing which relies on the start and end times of the subtitle file to activate and / or deactivate the dialogue processing.

[0085] Additionally or alternatively, conservative start and / or end time modification may involve the processing illustrated in FIGS. 6D and 6H. In FIG. 6D a dialogue audio segment Sa starts at time tc and ends at time td. The corresponding text string Ta starts (according to the subtitle file) at start time ta and ends at time tb, wherein tb > ta, tc > ta and tb > td. That is, both the start time and the end time of the dialogue audio segment Sa lie between the start time and end time of the corresponding text string Ta. Accordingly, the text string Ta already overlaps the entire duration of the corresponding dialogue audio segment Sa.

[0086] If a text string Ta of this type is encountered, the start and / or end time modification may comprise keeping the original start and / or end time for this text string Ta. This is illustrated in FIG. 6H showing the modified text string T’a having retained ta and tb as the start and end times of the (unmodified) text string Ta of FIG. 6D.
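
The four cases of FIGS. 6A-6H share one conservative rule: a start time is only ever moved earlier and an end time is only ever moved later. A minimal Python sketch of that rule, including the optional margin time, is given below. The function name conservative_align and the parameter name margin_s are hypothetical.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def conservative_align(text: Interval, segment: Interval,
                       margin_s: float = 0.0) -> Interval:
    """Extend a text string so that it covers its dialogue audio segment.

    The start is moved earlier only if the dialogue starts before the text
    (FIGS. 6A, 6B); the end is moved later only if the dialogue ends after
    the text (FIGS. 6A, 6C). Otherwise the original time is retained (FIG. 6D).
    """
    t_start, t_end = text
    s_start, s_end = segment
    new_start = min(t_start, s_start - margin_s)
    new_end = max(t_end, s_end + margin_s)
    return new_start, new_end


print(conservative_align((11.0, 15.0), (10.0, 16.0)))  # FIG. 6A case: (10.0, 16.0)
print(conservative_align((11.0, 17.0), (10.0, 16.0)))  # FIG. 6B case: (10.0, 17.0)
print(conservative_align((9.0, 17.0), (10.0, 16.0)))   # FIG. 6D case: (9.0, 17.0)
```

Passing, for example, margin_s=0.05 places the start 50 ms before and the end 50 ms after the dialogue audio segment whenever the respective time is modified.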

[0087] In some implementations, when there is a large mismatch between the start and / or end time of a text string and the start and / or end time of the corresponding dialogue audio segment, the start and / or end time of the text string may be modified in a conservative manner to approach, but not equal, the start and / or end time of the dialogue audio segment. In FIG. 7 the display time of a text string Ta is illustrated. The text string Ta starts at time t = t3 and ends at time t = t4. However, the respective corresponding dialogue audio segment Sa starts at time t = t1 and ends at time t = t6, wherein the dialogue audio segment Sa starts a duration Δt1 = t3 - t1 before the text string Ta is displayed and ends a duration Δt2 = t6 - t4 after the end time of text string Ta. If Δt1 and / or Δt2 exceeds a threshold time, the start / end time modifier may be configured to modify the start and / or end time of the text string Ta to form the modified text string T’a such that the start and / or end time approaches but does not reach the start and / or end time of the dialogue audio segment.

[0088] For example, the start time of the text string Ta is modified from t3 to t2, wherein t1 < t2 < t3, and the end time is modified from t4 to t5, wherein t4 < t5 < t6, to form the modified text string T’a also shown in FIG. 7. The modified start time t2 may be calculated based on a percentage of the Δt1 time difference as t2 = t3 - P * Δt1 or t2 = t1 + (1-P) * Δt1, wherein P is a percentage. Similarly, the modified end time t5 may be calculated based on a percentage of the Δt2 time difference as t5 = t4 + P * Δt2 or t5 = t6 - (1-P) * Δt2. If P = 1 the modification of the start and / or end time achieves alignment of the start and / or end time of the dialogue audio segment and text string (as shown in FIGS. 6A-6C and 6E-6G). However, if 0 < P < 1 the alignment is only partial since the start and / or end time of the modified text string T’a has approached but not reached the start and / or end time of the dialogue audio segment.

[0089] Hereby, it is envisaged that the percentage may be adapted based on the mismatch durations Δt1 and / or Δt2. If Δt1 is below or equal to a predetermined threshold the start time is modified with P = 1. On the other hand, if Δt1 exceeds the predetermined threshold P is set to a value smaller than 1 (e.g. P = 0.5) for calculating the start time t2 of the modified text string T’a. Similarly, if Δt2 is below or equal to another predetermined threshold the end time is modified with P = 1. On the other hand, if Δt2 exceeds the associated predetermined threshold P is set to a value smaller than 1 (e.g. P = 0.5) for calculating the end time t5 of the modified text string T’a.
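
The threshold-dependent partial modification may be illustrated with the following Python sketch. It is written so that P = 1 moves a time all the way to the boundary of the dialogue audio segment and 0 < P < 1 moves it only part of the way. The function name partial_align, the threshold of 2 seconds and the use of a single threshold for both start and end are assumptions; only P = 0.5 is taken from the example above.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def partial_align(text: Interval, segment: Interval,
                  threshold_s: float = 2.0, partial_p: float = 0.5) -> Interval:
    """Move start / end towards the dialogue segment, fully or partially."""
    t_start, t_end = text
    s_start, s_end = segment
    d_start = t_start - s_start  # how long the dialogue leads the text start
    d_end = s_end - t_end        # how long the dialogue outlasts the text end
    new_start, new_end = t_start, t_end
    if d_start > 0:
        p = 1.0 if d_start <= threshold_s else partial_p
        new_start = t_start - p * d_start
    if d_end > 0:
        p = 1.0 if d_end <= threshold_s else partial_p
        new_end = t_end + p * d_end
    return new_start, new_end


# Start mismatch of 6 s exceeds the threshold (P = 0.5); end mismatch of 1 s does not (P = 1).
print(partial_align((20.0, 24.0), (14.0, 25.0)))  # (17.0, 25.0)
```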

[0090] Performing only partial modification of the start and / or end time of the text string in this manner in situations with a large mismatch between the start and / or end times of the dialogue audio segment and text string allows e.g. the creative intent of the subtitle file to be preserved. For example, a large difference in timings between a text string and the dialogue audio segment may mean that the dialogue classifier(s) has misidentified non-dialogue or background dialogue as dialogue.

[0091] Turning back to FIG. 1, the subtitle processing system 1 also comprises a parser 11 configured to analyze each text string of the subtitle file and assign one or more labels to each text string of the subtitle file. The parser 11 may be configured to operate with a predetermined set of labels. In some implementations, at least one label is a dialogue label associated with dialogue, referred to as a dialogue class label. In some implementations, there are at least two dialogue class labels. For example, the set of labels comprises a regular dialogue label and an altered dialogue label wherein both the regular dialogue label and the altered dialogue label are of the dialogue class of labels. The regular dialogue label may e.g. be associated with regular conversational dialogue whereas the altered dialogue label indicates dialogue which e.g. is yelled, whispered or murmured.

[0092] Other dialogue class labels are also envisaged, for example an off-screen dialogue label indicating dialogue which originates from a source not visible in an associated video signal and another language label indicating dialogue of a different language.

[0093] In addition to dialogue class labels there may be one or more non-dialogue class labels. That is, labels indicating that the text string is associated with something besides dialogue in the audio signal. For example, a non-dialogue label could be a music label or a lyrics label, wherein a text string providing context information such as “Music playing in the background” is tagged with the music label and text strings reciting the lyrics of a song are tagged with the lyrics label.

[0094] In one illustrative example, twelve labels indicated using numbers 0, 1, ..., 11 are used:
Label 0 is an unknown label. This label is a non-dialogue class label which is assigned to each text string where no other label can be assigned.
Label 1 is a dialogue label. This label is a dialogue class label assigned to each text string comprising text associated with the dialogue.
Label 2 is an off screen label. Label 2 is a dialogue class label assigned to each text string where the source of the dialogue is not visible in the video signal associated with the subtitle file (e.g. associated with dialogue from a radio, TV or background characters of a video scene).
Label 3 is a foreign language label. Label 3 is a dialogue class label indicating that the text string is associated with dialogue of a language other than a predetermined main language.
Label 4 is an altered dialogue label. Label 4 is a dialogue class label indicating that the text string is associated with dialogue which is altered from regular conversational dialogue, such as dialogue which is yelled, whispered or murmured.
Label 5 is a non-verbal label. Label 5 is a non-dialogue class label which indicates that the text string is associated with human-made, non-verbal sounds such as coughing, groaning, laughing etc.
Label 6 is a human-like label. Label 6 is a non-dialogue class label indicating that the text string is associated with sounds similar to human-made sounds. For example, label 6 is assigned to text strings associated with dogs barking, cows mooing or saxophone sounds.
Label 7 is a new source label. Label 7 is a non-dialogue class label that is assigned to text strings wherein a new or different audio source is being captioned.
Label 8 is a music label. Label 8 is a non-dialogue class label used to tag any text string associated with music being played.
Label 9 is a lyrics label. Label 9 is a non-dialogue class label used to tag each text string associated with song lyrics.
Label 10 is an italics label. Label 10 is a non-dialogue class label used to tag each text string which is displayed in italics. To this end, the parser 11 may obtain as an input not only the text strings as such, but also associated metadata of the subtitle file (which e.g. indicates display color or italics).
Label 11 is a credits label. Label 11 is a non-dialogue class label assigned to text strings associated with credits of e.g. the closed captioner.

[0095] The assignation of labels may be automatic, as is the case in FIG. 1 where the parser 11 assigns one or more labels to each text string automatically. The parser 11 may be a keyword based parser and / or employ a neural network to label each text string with one or more labels. It is also envisaged that the labelling may be performed manually.

[0096] A keyword-based parser analyzes each text string and identifies keywords, key phrases or key symbols to determine an appropriate label. For example, a text string consisting only of the phrase “speaking German” may be labeled with a dialogue label and a foreign language label. As another example, a text string where all words are in capital letters may be indicative of a character screaming or yelling, wherein the text string is labeled with a dialogue label and an altered dialogue label.
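
By way of example only, a very small keyword-based parser using the label numbers of the illustrative twelve-label set could look as follows in Python. The three rules, the regular expressions and the function name label_text are hypothetical and deliberately incomplete; a practical parser would need many more rules, including a rule for ordinary dialogue.

```python
import re
from typing import List

# Label numbers from the illustrative twelve-label set.
UNKNOWN, DIALOGUE, FOREIGN_LANGUAGE, ALTERED_DIALOGUE, MUSIC = 0, 1, 3, 4, 8


def label_text(text: str) -> List[int]:
    """Assign labels to one text string using a few keyword rules."""
    labels = set()
    stripped = text.strip()
    # Rule 1: a string consisting only of "speaking <Language>", optionally bracketed.
    if re.fullmatch(r"[\[(]?\s*speaking [A-Z][a-z]+\s*[\])]?", stripped):
        labels.update({DIALOGUE, FOREIGN_LANGUAGE})
    # Rule 2: all letters are capitals, taken as yelling or screaming.
    letters = [c for c in stripped if c.isalpha()]
    if letters and all(c.isupper() for c in letters):
        labels.update({DIALOGUE, ALTERED_DIALOGUE})
    # Rule 3: the keyword "music" indicates a music caption.
    if re.search(r"\bmusic\b", stripped, re.IGNORECASE):
        labels.add(MUSIC)
    # Fall back to the unknown label when no rule applies.
    return sorted(labels) or [UNKNOWN]


print(label_text("[speaking German]"))                # [1, 3]
print(label_text("WATCH OUT!"))                       # [1, 4]
print(label_text("Music playing in the background"))  # [8]
```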

[0097] A neural network based parser may be realized by training a neural network with a training data set comprising a plurality of text strings wherein each text string is associated with one or more ground truth labels. During training, the text strings are input to the neural network which predicts one or more labels for each text string. The predicted one or more labels are compared to the ground truth one or more labels and, based on the difference between them, the learnable parameters of the neural network are updated.

[0098] In some implementations, the subtitle file is already labeled when it is obtained by the subtitle processing system 1. In such implementations, the parser 11 may be omitted.

[0099] The start / end time modifier 15 may utilize the labeled subtitle file by concentrating any identification of corresponding dialogue audio segments and / or the modification of the start and / or end times to text strings that are provided with one or more dialogue class labels. In practice, this could be done by ignoring text strings not associated with a dialogue class label (e.g. setting x(t) to zero in equation 1 or equation 2 for text strings not associated with a dialogue class label).

[0100] Turning to FIG. 8, the time during which each of five text strings Ta, Tα, Tb, Tβ, Tc is active (A) and non-active (NA) is illustrated in the top graph. Out of these exemplary text strings, text strings Ta, Tb, and Tc have been tagged with at least one dialogue class label and text strings Tα and Tβ have not been tagged with a dialogue class label (but they may be tagged with one or more non-dialogue class labels).

[0101] When performing global, block-by-block or individual alignment, the start / end time modifier 15 may be configured to disregard and / or remove any text string which is not associated with a dialogue class label prior to identifying a corresponding dialogue audio segment and / or making any modifications of the start and / or end time of a text string. This ensures that text strings which are present in the subtitle file, but which are not associated with any dialogue-related audio in the audio signal and / or are associated with audio which will likely not be identified as dialogue by the dialogue classifier(s) 12a, 12b, do not interfere with the identification of corresponding dialogue audio segments and / or the modification of start and / or end times.

[0102] In this example, text strings Ta, Tb, and Tc have each been tagged with a dialogue class label. However, text strings Tα and Tβ may only be tagged with a human-like label since the audio signal contains the sound of dogs barking in the distance and text strings Tα and Tβ comprise the text “[Dogs barking in the distance]”. None of the dialogue classifier(s) 12a, 12b is expected to reliably detect the sound of dogs barking as dialogue, meaning that text strings Tα and Tβ will likely not have a respective corresponding dialogue audio segment to which they may be aligned. If no action is taken, there is a risk of Tα and Tβ impeding the accuracy of identifying corresponding dialogue audio segments and / or modifying the start and / or end times of the other text strings Ta, Tb, and Tc, since a method may e.g. attempt to find a global alignment where Tα and Tβ are aligned with respective dialogue audio segments, which is not desirable. To this end, the start / end time modifier 15 disregards Tα and Tβ and performs the alignment only for text strings Ta, Tb, and Tc having one or more dialogue class labels.

[0103] Text strings associated with one or more dialogue class labels may be referred to as dialogue text strings.
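
Selecting the dialogue text strings prior to alignment may be sketched as a simple filter. The dictionary layout of a labeled text string and the function name dialogue_text_strings are assumptions; the set of dialogue class labels (1 to 4) follows the illustrative twelve-label example.

```python
from typing import Dict, List

# Dialogue class labels of the illustrative set:
# 1 dialogue, 2 off screen, 3 foreign language, 4 altered dialogue.
DIALOGUE_CLASS_LABELS = {1, 2, 3, 4}


def dialogue_text_strings(entries: List[Dict]) -> List[Dict]:
    """Keep only text strings carrying at least one dialogue class label."""
    return [e for e in entries if DIALOGUE_CLASS_LABELS & set(e["labels"])]


entries = [
    {"text": "Where are you going?", "start": 2.0, "end": 5.0, "labels": [1]},
    {"text": "[Dogs barking in the distance]", "start": 6.0, "end": 9.0, "labels": [6]},
    {"text": "Home.", "start": 11.0, "end": 15.0, "labels": [1, 2]},
]
for entry in dialogue_text_strings(entries):
    print(entry["text"])
```

Only the first and third entries are passed on to the alignment; the caption for the barking dogs, tagged with the human-like label 6, is disregarded.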

[0104] In the above, various methods and systems for modifying a subtitle file have been described. The resulting modified subtitle file has been provided with labels and / or subtitles that are more accurately aligned with the dialogue of the associated audio signal. FIG. 9 illustrates a modified subtitle file 101 provided with labels. Comparing the modified subtitle file 101 with the original subtitle file 100 illustrated in FIG. 3 it is seen that some information of the original subtitle file 100 (e.g. the strings and display style metadata) is maintained in the modified subtitle file 101. However, the start and end time of one or more text strings have been modified. For example, String 1 of the modified subtitle file 101 starts at 0:00:02 and ends at 0:00:06 (instead of 0:00:05), String 2 starts at 0:00:10 (instead of 0:00:11) and ends at 0:00:16 (instead of 0:00:15) and String 3 starts at 0:00:43 and ends at 0:00:46 (instead of 0:00:45). That is, the start and / or end times of multiple text strings have been modified to reach a target alignment with the dialogue audio segments.

[0105] Furthermore, labels have been added for each text string. With the twelve label types exemplified above String 1 has been labeled with label 1 indicating dialogue. For example, String 1 is associated with regular spoken dialogue. String 2 is labeled with labels 1, 2 and 3. For example, String 2 is associated with dialogue spoken in a foreign language by a source which is not visible in the video. String 3 has been labeled with label 0. For example, for String 3 no other label has been identified whereby String 3 is tagged with the unknown label.

[0106] In some implementations, it is envisaged that further information is included in the subtitle file 101. For example, for each text string the combined classifier confidence value may be included in the subtitle file as well. Additionally or alternatively, a dialogue density value (indicating the energy proportion of dialogue audio content) may be included in the subtitle file for each text string. This information, together with the labels and / or modified text string start and end times may be leveraged for dialogue enhancement processing. Hereby, the modified subtitle file 101 may be used for dialogue enhancement processing as will be described below.

[0107] In some implementations, the subtitle processing system is configured to process a plurality of candidate subtitle files. FIG. 10 shows an exemplary subtitle processing system 1’ configured to obtain an audio signal and a plurality of candidate subtitle files as the system input. The subtitle processing system 1’ is similar to the subtitle processing system 1 of FIG. 1. A difference is that the subtitle processing system 1’ comprises a candidate subtitle file selector 16 which obtains the dialogue and non-dialogue audio segments from the dialogue / non-dialogue audio segmentation unit and a plurality of candidate subtitle files. Each candidate subtitle file comprises a plurality of text strings wherein each text string is associated with a start time and an end time in the audio signal. The candidate selector 16 is configured to select a subtitle file which best matches timings of the dialogue audio segments and pass it on to the start / end time modifier 15 for further processing.

[0108] The candidate selector 16 may e.g. be configured to determine a correlation between each subtitle file and the dialogue audio segments wherein the candidate subtitle file comprising text strings having the highest degree of (cross-) correlation with the audio segments is selected as the selected subtitle file.

[0109] As a further example, the candidate selector may be configured to perform a global alignment or block-by-block alignment (as described in connection to FIG. 5A) of each candidate subtitle file with respect to the dialogue audio segments and determine the correlation between each candidate subtitle file (after global alignment) and the audio signal. The candidate subtitle file for which the highest degree of (cross-) correlation was achieved after global alignment is selected and passed on to the start / end time modifier 15 for further processing such as block-wise alignment as described in connection to FIG. 5B, or alignment of each text string individually as described in connection to FIGS. 6A-6H and FIG. 7.
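
Candidate selection after global alignment may be sketched as below. Each candidate subtitle file is scored by the greatest total overlap it reaches with the dialogue audio segments over a range of global time adjustments, and the candidate with the highest score is selected. The overlap-based score, the function names and the example data are assumptions; an unnormalized score of this kind favours files with long text strings, so a centered correlation may be preferable in practice.

```python
from typing import Dict, List, Sequence, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def overlap(a: Interval, b: Interval) -> float:
    """Duration in seconds during which two intervals are both active."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def best_aligned_overlap(texts: List[Interval], segments: List[Interval],
                         candidates: Sequence[float]) -> float:
    """Greatest total overlap reachable with one global time adjustment."""
    return max(
        sum(overlap((s + d, e + d), seg) for s, e in texts for seg in segments)
        for d in candidates
    )


def select_candidate(files: Dict[str, List[Interval]], segments: List[Interval],
                     candidates: Sequence[float]) -> str:
    """Name of the candidate subtitle file with the best aligned score."""
    return max(files, key=lambda name: best_aligned_overlap(files[name], segments, candidates))


segments = [(10.0, 16.0), (40.0, 44.0)]
files = {
    "candidate_a": [(11.0, 17.0), (41.0, 45.0)],  # constant 1 s offset
    "candidate_b": [(12.0, 15.0), (47.0, 49.0)],  # inconsistent timings
}
shifts = [k / 10 for k in range(-50, 51)]  # -5.0 s ... +5.0 s
print(select_candidate(files, segments, shifts))  # candidate_a
```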

[0110] Processing a plurality of candidate subtitle files has the advantage of identifying the most suitable subtitle file for the audio signal. For example, in some situations a plurality of different subtitle files of varying quality are available for the same audio signal (or video signal) and it is generally a time consuming task to identify which subtitle file is most accurate in terms of alignment with the audio or video content. Using the global alignment process, it is possible to automatically identify which subtitle file achieves the best alignment and likely offers the best starting point for finer alignment.

[0111] In FIG. 11 an audio signal processing system 2 for processing an audio signal based on a subtitle file with labeled text strings is shown. With further reference to the flowchart of FIG. 12, the processing performed by the audio signal processing system 2 will now be described in detail.

[0112] The audio signal processing system 2 obtains an audio signal, and a subtitle file associated with the audio signal at step S11. The subtitle file comprises a plurality of text strings, a start time for each text string, an end time for each text string and a label for each text string. For example, see the illustrated subtitle file 101 of FIG. 9.

[0113] The audio signal is provided to a dialogue enhancement processor 24 configured to selectively perform dialogue enhancement processing of the audio signal. At step S13, the label(s) of each text string are provided to a dialogue enhancement controller 23 which is configured to, based on the one or more text string labels, determine control data for controlling the dialogue enhancement processor. The control data may be or comprise a dialogue confidence value, such as a binary value (0 or 1) or a continuous value defined on the interval [0, 1], wherein 0 indicates that no dialogue is active and 1 indicates that dialogue is active. There are many ways in which the dialogue confidence value could be extracted based on the label. In one exemplary implementation the dialogue confidence value is 1 if at least one dialogue class label is detected and 0 if no dialogue class label is detected.
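The exemplary binary rule above may be sketched as follows. This is an illustration only: the label names, the tuple layout of a text string and the function name are assumptions and are not taken from the disclosure.

```python
# Hypothetical names for the dialogue class labels.
DIALOGUE_CLASS_LABELS = {"regular_dialogue", "altered_dialogue"}


def dialogue_confidence(frame_time, text_strings):
    """Return 1 if a text string active at frame_time has a dialogue class label.

    text_strings is a list of (start, end, labels) tuples with times in seconds.
    """
    for start, end, labels in text_strings:
        is_active = start <= frame_time < end
        if is_active and DIALOGUE_CLASS_LABELS.intersection(labels):
            return 1
    return 0


# Hypothetical labeled subtitle content.
text_strings = [
    (1.0, 2.0, ["regular_dialogue"]),
    (2.5, 3.0, ["music"]),
]
```

A continuous value on [0, 1] could be produced in the same place, for example by weighting different labels differently.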

[0114] The control data is provided to the dialogue enhancement processor 24 to control the dialogue enhancement processing and at step S14 the dialogue enhancement processor 24 performs the dialogue enhancement processing. The dialogue enhancement processing may involve applying a boosting gain or an attenuating gain based on the control data.

[0115] For example, if the text string is associated with a dialogue class label and a dialogue confidence value of the control data is 1, a first boosting gain is applied in the dialogue enhancement processor 24 in response. If the text string is not associated with any dialogue class label and the dialogue confidence value of the control data is 0, a second (lower) boosting gain is applied or no gain is applied at all. As another example, if the text string is associated with a dialogue class label, then the control data indicates that a first attenuating gain or no attenuating gain is applied. If the text string is not associated with any dialogue class label, then the control data indicates that a second (greater) attenuating gain is applied.

[0116] This type of gain application may make dialogue clearer and more intelligible.

[0117] As another example, the dialogue enhancement processor 24 may be configured to apply different gains for different types of dialogue as signaled by the text string labels. For instance, if a text string is labeled with a regular dialogue label the dialogue enhancement controller 23 extracts control data indicating that the dialogue enhancement processor 24 should apply a first boosting gain (e.g. 9 dB or 12 dB). If a text string is labeled with an altered dialogue label the dialogue enhancement controller 23 generates control data indicating that the dialogue enhancement processor 24 should apply a second, lower, boosting gain (e.g. 3 dB or 6 dB). If a text string is not associated with any dialogue class label the dialogue enhancement controller 23 generates control data indicating that the dialogue enhancement processor 24 should not apply a gain. In this way, regular spoken dialogue is boosted and altered dialogue (e.g. whispering) is also boosted, but to a lesser extent. This enhances dialogue intelligibility while maintaining the original creative intent of the audio signal.
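A minimal sketch of such label-dependent gain selection is given below. The label names and helper functions are hypothetical, the gain values are the example values mentioned above, and picking the largest gain when several labels are present is an assumption made only for this illustration.

```python
# Example boosting gains in dB per (hypothetical) label name.
GAIN_DB_BY_LABEL = {
    "regular_dialogue": 9.0,
    "altered_dialogue": 3.0,
}


def gain_db_for_labels(labels):
    """Return the boosting gain in dB for a text string; 0 dB if no dialogue label."""
    gains = [GAIN_DB_BY_LABEL[label] for label in labels if label in GAIN_DB_BY_LABEL]
    return max(gains, default=0.0)


def db_to_linear(gain_db):
    """Convert an amplitude gain in dB to a linear factor."""
    return 10.0 ** (gain_db / 20.0)


def apply_gain(samples, labels):
    """Scale the samples by the gain selected for the active text string labels."""
    factor = db_to_linear(gain_db_for_labels(labels))
    return [factor * x for x in samples]
```

For example, apply_gain(frame, ["altered_dialogue"]) would boost a whispered passage by 3 dB, whereas a frame whose active text string carries no dialogue class label is returned unchanged.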

[0118] In addition to, or as an alternative to, gain application the dialogue enhancement processor 24 may apply other types of dialogue enhancement. For example, there are dialogue enhancement processors that rely on neural networks to isolate dialogue. Generally, the dialogue enhancement processing is not used for each frame of an audio signal but toggled so as to be active when dialogue is present and deactivated when dialogue is not present. The control data (e.g. indicating a dialogue confidence value) extracted by the dialogue enhancement controller 23 may be used for this type of toggling. In a simple case, control data is generated such that dialogue enhancement is activated for the duration of each text string associated with at least one dialogue class label and deactivated otherwise.

[0119] Yet another example of dialogue enhancement processing is adaptation of panning in spatial audio processing. For example, a text string associated with the off screen label and / or display style metadata indicating an intended captioning position that is non-center (e.g. to the right or to the left) may be useful for more accurately capturing and boosting dialogue using spatial processing. In some implementations, the audio signal is a stereo signal comprising a left channel L and a right channel R. To process the stereo signal the dialogue enhancement processor 24 may convert the left-right stereo signal into a mid-side stereo signal having a mid-channel M and a side-channel S, wherein the mid-channel emphasizes audio content that is center-panned (i.e. in common for both the right and left channels) and the side-channel emphasizes audio content that is different between the left and right channels. Typically, dialogue is center-panned and appears mainly in the mid-channel whereas non-dialogue audio appears mainly in the side-channel. To enhance dialogue, the dialogue enhancement processor 24 may accordingly boost the mid-channel M and / or attenuate the side-channel S prior to reconstructing the left and right channels L, R from the boosted / attenuated mid- and side-channels M, S.

[0120] However, when a text string with the off screen label and / or a text string associated with non-center positioning metadata is active, this may be used to control the dialogue enhancement processor 24 so as to apply a smaller than nominal boosting gain to the mid-channel M and / or a smaller than nominal attenuating gain to the side-channel S, to target the dialogue source and increase dialogue intelligibility.
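The mid-side processing may be sketched as follows, assuming the common convention M = (L + R) / 2 and S = (L - R) / 2 with reconstruction L = M + S and R = M - S. The disclosure does not specify the conversion, so this convention, the function name and the linear gain parameters are assumptions.

```python
def enhance_mid_side(left, right, mid_gain=1.0, side_gain=1.0):
    """Boost / attenuate the mid and side channels and rebuild left and right.

    mid_gain and side_gain are linear factors; 1.0 leaves a channel unchanged.
    """
    out_left, out_right = [], []
    for l, r in zip(left, right):
        mid = 0.5 * (l + r)    # content common to both channels
        side = 0.5 * (l - r)   # content that differs between the channels
        mid *= mid_gain
        side *= side_gain
        out_left.append(mid + side)
        out_right.append(mid - side)
    return out_left, out_right
```

With unity gains the input is reconstructed unchanged. When an off screen or non-center text string is active, a controller could pass a mid_gain closer to 1.0 and a side_gain closer to 1.0 than the nominal values, in line with the smaller than nominal gains described above.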

[0121] It is possible to control the dialogue enhancement processor 24 based on only labels of the text string. In such implementations, the dialogue enhancement controller 23 may be omitted and / or integrated into the dialogue enhancement processor 24, wherein the labels (and optionally display type metadata) are provided directly to the dialogue enhancement processor 24.

[0122] In some implementations, one or more dialogue classifiers 22a, 22b are used alongside the labels of the subtitle file, wherein the dialogue enhancement controller 23 determines the dialogue confidence value based on the classifier confidence value CCV1, CCV2 of each dialogue classifier 22a, 22b. The dialogue classifiers 22a, 22b may be similar or identical to the dialogue classifiers 12a, 12b of FIG. 1. Accordingly, the method of FIG. 12 optionally comprises step S12 involving processing the audio signal with one or more dialogue classifiers 22a, 22b to obtain respective classifier confidence values CCV1, CCV2. Accordingly, the labels of the subtitle file text strings are used together with classifier confidence values CCV1, CCV2 to determine the control data (e.g. a dialogue confidence value) at step S13.

[0123] The dialogue enhancement controller 23 may combine the classifier confidence values CCV1, CCV2 and the text string labels to form control data that is used to control the dialogue enhancement. For example, one or more classifier confidence values may be combined to form a combined classifier confidence value CCVC and the combined classifier confidence value CCVC is penalized when a text string not associated with any dialogue class label is active. The control data may then be based on the penalized combined classifier confidence value CCVC.
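One way to picture this combination is the sketch below. The disclosure does not specify how the values are combined or penalized; averaging the two classifier confidence values and multiplying by a fixed penalty factor are assumptions, as are the function and parameter names.

```python
def penalized_confidence(ccv1, ccv2, non_dialogue_string_active, penalty=0.5):
    """Combine two classifier confidence values and penalize the result.

    The penalty is applied when a text string without any dialogue class
    label is active at the current frame.
    """
    ccvc = 0.5 * (ccv1 + ccv2)          # assumed combination: plain average
    if non_dialogue_string_active:
        ccvc *= penalty                 # assumed penalty: fixed scaling factor
    return ccvc
```

The returned value stays on the interval [0, 1] as long as both inputs and the penalty factor do, so it can be used directly as a dialogue confidence value.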

[0124] The dialogue enhancement processor 24 outputs a dialogue enhanced audio signal. The dialogue enhanced audio signal may be combined with text strings (and optionally the display format information) of the subtitle file into a bitstream B. The bitstream B may subsequently be distributed to devices for playback.

[0125] Optionally, the audio signal processing system 2 is provided in a playback device wherein the audio signal and subtitle may be played back immediately.

[0126] In some implementations, the labels of the text string may be removed from the subtitle file after having been used for the dialogue enhancement processing, reverting the subtitle file to a traditional subtitle format which may easily be converted to a format suitable for a distribution platform.

[0127] FIG. 13 shows a dialogue loudness system 3. The dialogue loudness system 3 comprises a gating controller 33 configured to, based on the label of a text string and optionally the classifier confidence value CCV1, CCV2 of one or more dialogue classifiers 32a, 32b, determine a gating gain which is provided to the gating unit 34. The dialogue classifiers 32a, 32b may be similar or identical to the dialogue classifiers 12a, 12b of FIG. 1 or dialogue classifiers 22a, 22b of FIG. 11.

[0128] The gating gain may be a binary gain wherein the gating gain is 0 dB when dialogue is active (as determined based at least on the text string label) and -∞ dB (i.e. complete silencing) when dialogue is not active. As an alternative to complete silencing, it is envisaged that the gating gain, when dialogue is not active, is a predetermined attenuating gain such as -30 dB or -60 dB which effectively silences the audio signal when no dialogue is active.

[0129] The gating unit 34 applies the gating gain and forms a gated audio signal with isolated dialogue. The gated audio signal is provided to a loudness calculator 35 which calculates the loudness of the gated audio signal (i.e. the loudness of the dialogue). Calculation of the dialogue loudness may be performed using a variety of algorithms, known as loudness algorithms. In one example, the loudness calculator 35 employs loudness algorithm ITU-R BS.1770 to calculate the average dialogue loudness.
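The gating and loudness calculation may be illustrated with the following simplified sketch. It does not implement ITU-R BS.1770 (there is no K-weighting, channel weighting or block gating); it only computes a mean-square level in dB over the frames for which the gate is open. The function names and the frame representation are assumptions.

```python
import math


def gating_gain_linear(dialogue_active, floor_db=float("-inf")):
    """Return the linear gating gain: 0 dB when dialogue is active, else floor_db."""
    if dialogue_active:
        return 1.0
    return 10.0 ** (floor_db / 20.0)    # -inf dB evaluates to 0.0


def apply_gate(frames, active_flags, floor_db=float("-inf")):
    """Apply the per-frame gating gain to a list of frames (lists of samples)."""
    gated = []
    for frame, active in zip(frames, active_flags):
        gain = gating_gain_linear(active, floor_db)
        gated.append([gain * x for x in frame])
    return gated


def dialogue_level_db(frames, active_flags):
    """Mean-square level in dB over dialogue-active frames only (not BS.1770)."""
    energy, count = 0.0, 0
    for frame, active in zip(frames, active_flags):
        if active:
            energy += sum(x * x for x in frame)
            count += len(frame)
    if count == 0 or energy == 0.0:
        return float("-inf")
    return 10.0 * math.log10(energy / count)
```

Averaging only over the frames where the gate is open is a design choice of this sketch, made so that silenced frames do not dilute the estimate.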

[0130] Accurate dialogue loudness estimates are useful for classifying audio content and are often used in delivery specifications. While methods for calculating dialogue loudness from an isolated dialogue audio signal are well established, the challenge lies in accurately extracting isolated dialogue content from an audio signal comprising a mix of dialogue content and other non-dialogue audio content. Since the gating controller 33 relies at least partially on the text string labels, the dialogue loudness system 3 of FIG. 13 facilitates more accurate detection of dialogue, which in turn results in a gating gain which more efficiently extracts dialogue content.

[0131] FIG. 14 shows a schematic block diagram of an example electronic device or architecture 200 (e.g., an apparatus 200) suitable for implementing example embodiments of the present disclosure. Architecture 200 includes but is not limited to servers and client devices, systems, modules and methods as described in reference to FIGS. 1 - 13. As shown, the architecture 200 includes central processing unit (CPU) 201 which is capable of performing various processes in accordance with a program stored in, for example, read only memory (ROM) 202 or a program loaded from, for example, storage unit 208 to random access memory (RAM) 203. The CPU 201 may be, for example, an electronic processor 201, which may include one or more processor cores, and in some examples the processor 201 may be multiple processors. In RAM 203, the data used when CPU 201 performs the various processes is also stored, as required. CPU 201, ROM 202 and RAM 203 are connected to one another via bus 204. Input / output (I / O) interface 205 is also connected to bus 204.

[0132] The following components are connected to I / O interface 205: input unit 206, that may include a keyboard, a mouse, or the like; output unit 207 that may include a display such as a liquid crystal display (LCD) and one or more speakers; storage unit 208 including a hard disk, or another suitable storage device; and communication unit 209 which may include a network interface card such as a network card (e.g., wired or wireless).

[0133] In some implementations, input unit 206 includes one or more microphones in different positions (depending on the host device) enabling capture of audio signals in various formats (e.g., mono, stereo, spatial, immersive, and other suitable formats).

[0134] In some implementations, output unit 207 includes systems with various numbers of speakers. Output unit 207 (depending on the capabilities of the host device) can render audio signals in various formats (e.g., mono, stereo, immersive, binaural, and other suitable formats).

[0135] In some embodiments, communication unit 209 is configured to communicate with other devices (e.g., via a network). Drive 210 is also connected to I / O interface 205, as required. Removable medium 211, such as a magnetic disk, an optical disk, a magneto-optical disk, a flash drive or another suitable removable medium is mounted on drive 210, so that a computer program read therefrom is installed into storage unit 208, as required. A person skilled in the art would understand that although apparatus 200 is described as including the above-described components, in real applications, it is possible to add, remove, and / or replace some of these components, and all such modifications or alterations fall within the scope of the present disclosure.

[0136] In accordance with example embodiments of the present disclosure, the processes described above may be implemented as computer software programs or on a computer-readable storage medium. For example, embodiments of the present disclosure include a computer program product including a computer program tangibly embodied on a machine readable medium, the computer program including program code for performing methods. In such embodiments, the computer program may be downloaded and mounted from the network via the communication unit 209, and / or installed from the removable medium 211, as shown in FIG. 14.

[0137] Generally, various example embodiments of the present disclosure may be implemented in hardware or special purpose circuits (e.g., control circuitry), software, logic or any combination thereof. For example, the units discussed above can be executed by control circuitry (e.g., CPU 201 in combination with other components of FIG. 14), thus, the control circuitry may be performing the actions described in this disclosure. Some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, a processor and / or other computing device(s), which may include control circuitry. While various aspects of the example embodiments of the present disclosure are illustrated and described as block diagrams, flowcharts, or using some other pictorial representation, it will be appreciated that the blocks, apparatus, systems, techniques, or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.

[0138] Additionally, various blocks shown in the flowcharts may be viewed as method steps, and / or as operations that result from operation of computer program code, and / or as a plurality of coupled logic circuit elements constructed to carry out the associated function(s). For example, embodiments of the present disclosure include a computer program product including a computer program tangibly embodied on a machine readable medium, the computer program containing program codes configured to carry out the methods as described above.

[0139] In the context of the disclosure, a machine-readable medium may be any tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may be non-transitory and may include but is not limited to an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of the machine-readable storage medium would include an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random-access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

[0140] Computer program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These computer program codes may be provided to one or more processors of a general-purpose computer, special purpose computer, or other programmable data processing apparatus that has control circuitry, such that the program codes, when executed by one or more processors of the computer or other programmable data processing apparatus, cause the functions / operations specified in the flowcharts and / or block diagrams to be implemented. The program code may execute entirely on a computer, partly on the computer, as a stand-alone software package, partly on the computer and partly on a remote computer or entirely on the remote computer or server or distributed over one or more remote computers and / or servers.

[0141] Various aspects of the present disclosure may be appreciated from the following Enumerated Example Embodiments (EEEs):

EEE 1. A computer-implemented method for modifying a subtitle file, comprising: obtaining a subtitle file and an associated audio signal comprising a sequence of audio frames, the subtitle file comprising a plurality of text strings and, for each text string, a start time and an end time associated with the audio signal; processing the audio signal with a dialogue classifier to generate a classifier confidence for each audio frame of the sequence of audio frames; based on the classifier confidence for each audio frame, identifying each audio frame as a dialogue audio frame or non-dialogue audio frame so as to form a sequence of dialogue and non-dialogue audio segments, each dialogue audio segment being associated with a start time and an end time and comprising at least one dialogue audio frame; for at least a first text string of the plurality of text strings of the subtitle file, identifying a respective corresponding dialogue audio segment of the audio signal, based on the start and / or end times associated with the at least first text string and the start and / or end times associated with the dialogue audio segments; and modifying the start time and / or the end time of the at least first text string to obtain a target alignment of the start time and / or end time of the at least first text string with the start time and / or end time of the respective corresponding dialogue audio segment.

EEE 2. The method according to EEE 1, further comprising: determining that the start time of the at least first text string lies between the start and end time of the respective corresponding dialogue audio segment; and modifying the start time of the at least first text string based on the start time of the respective corresponding dialogue audio segment.

EEE 3. The method according to EEE 2, wherein modifying the start time of the at least first text string comprises modifying the start time of the at least first text string such that it corresponds to the start time of the respective corresponding dialogue audio segment.

EEE 4. The method according to EEE 2, wherein modifying the start time of the at least first text string comprises modifying the start time of the at least first text string such that it is temporally earlier than and closer to the start time of the respective corresponding dialogue audio segment, or temporally later than and closer to the start time of the respective corresponding dialogue audio segment.

EEE 5. The method according to any of the preceding EEEs, further comprising: determining that the end time of the at least first text string lies between the start and end time of the respective corresponding dialogue audio segment; and modifying the end time of the at least first text string based on the end time of the respective corresponding dialogue audio segment.

EEE 6. The method according to EEE 5, wherein modifying the end time of the at least first text string comprises modifying the end time of the at least first text string such that it corresponds to the end time of the respective corresponding dialogue audio segment.

EEE 7. The method according to EEE 5, wherein modifying the end time of the at least first text string comprises modifying the end time of the at least first text string such that it is temporally earlier than and closer to the end time of the respective corresponding dialogue audio segment, or temporally later than and closer to the end time of the respective corresponding dialogue audio segment.

EEE 8. The method according to any of EEEs 2 - 7, further comprising: for at least a second text string of the plurality of text strings of the subtitle file, identifying a respective corresponding dialogue audio segment of the audio signal, based on the start and / or end times associated with the at least second text string and the start and / or end times associated with the dialogue audio segments; determining that the start time and end time of the at least second text string segment of the plurality of text strings lies temporally earlier than, and temporally later than, the start and end time of its respective corresponding dialogue audio segment, respectively; and keeping the start time and end time of the at least second text string such that the start time and end time of the at least second text string is preserved in the modified subtitle file.

EEE 9. The method according to EEE 1, further comprising: modifying the start time and / or the end time of each first text string with a same time adjustment to obtain the target alignment between the first text strings and the respective corresponding dialogue audio segments.

EEE 10. The method according to any of EEEs 2 - 8, wherein identifying a respective corresponding dialogue audio segment of the audio signal comprises: forming global aligned text strings by adjusting the start time and end time of each text string with a same global time adjustment to obtain a target degree of time correlation between the text strings and the dialogue audio segments or classifier confidence; and wherein each first and / or second text string is a global aligned text string.

EEE 11. The method according to EEE 1, further comprising: partitioning the plurality of text strings of the subtitle file to form a sequence of subtitle blocks, each subtitle block comprising at least one text string; partitioning the dialogue audio segments to form a sequence of dialogue audio segment blocks, each dialogue audio segment block comprising at least one dialogue audio segment and being time aligned with a corresponding subtitle block to define a time aligned block pair; the method comprising, for each time aligned block pair: for at least a first text string of the subtitle block of the block pair, identifying a respective corresponding dialogue audio segment of the dialogue audio segment block of the block pair, based on the start and / or end time of the at least first text string and the start and / or end time of the respective corresponding dialogue audio segment of the dialogue audio segment block; and modifying all start times and end times of each text string in the subtitle block with a block-specific time adjustment.

EEE 12. The method according to any of EEEs 2 - 9, wherein identifying a respective corresponding dialogue audio segment of the audio signal comprises: forming block aligned text strings by: partitioning the plurality of text strings of the subtitle file to form a sequence of subtitle blocks, each subtitle block comprising at least one text string; partitioning the dialogue audio segments to form a sequence of dialogue audio segment blocks, each dialogue audio segment block comprising at least one dialogue audio segment and being time aligned with a corresponding subtitle block to define a time aligned block pair; the method comprising, for each time aligned block pair: modifying all start times and end times of each text string in the subtitle block with a block-specific time adjustment to obtain a target degree of time correlation between the text strings and the dialogue audio segments of the time aligned block pair.

EEE 13. The method according to any of the preceding EEEs, wherein the at least first text string is at least two first text strings.

EEE 14. The method according to any of the preceding EEEs, wherein the text strings of the subtitle file are dialogue text strings, and wherein the subtitle file further comprises non-dialogue text strings.

EEE 15. The method according to EEE 14, further comprising: obtaining an initial subtitle file, the initial subtitle file comprising a plurality of initial text strings; and processing the initial subtitle file with a file parser to form the subtitle file, wherein the file parser is configured to label each initial text string of the initial subtitle file with at least one of a plurality of labels, wherein at least one label is a dialogue label, and wherein each initial text string provided with at least one dialogue label is a dialogue text string and each remaining initial text string is a non-dialogue text string.

EEE 16. The method according to any of the preceding EEEs, wherein processing the audio signal with a dialogue classifier to generate a classifier confidence comprises: processing the audio signal with a first dialogue classifier submodule to generate a first classifier confidence value for each audio frame of the audio signal; processing the audio signal with a second dialogue classifier submodule to generate a second classifier confidence value for each audio frame of the audio signal; and combining the first and second classifier confidence value to form the classifier confidence.

EEE 17. The method according to any of the preceding EEEs, wherein identifying each audio frame as a dialogue audio frame or non-dialogue audio frame comprises, for each audio frame: identifying the audio frame as a dialogue audio frame based on the classifier confidence of the audio frame exceeding a predetermined first threshold.

EEE 18. The method according to any of the preceding EEEs, further comprising: obtaining a plurality of candidate subtitle files, each candidate subtitle file comprising a plurality of text strings and, for each text string, a start time and an end time associated with the audio signal; for each candidate subtitle file, determining global aligned text strings by adjusting the start time and end time of each text string with a global time adjustment to obtain a target degree of time correlation between the text strings of the candidate subtitle file and the dialogue audio segments; and selecting a candidate subtitle file as the selected subtitle file, based on the target degree of time correlation obtained for each candidate subtitle file.

EEE 19. The method according to EEE 18, wherein determining global aligned text strings for each candidate subtitle file comprises: calculating the time correlation between the text strings of the candidate subtitle file and the dialogue audio segments for a plurality of sample time adjustments; and selecting a sample adjustment to use as the global time adjustment based on the time correlation associated with each sample time adjustment.

EEE 20. A subtitle processing system configured to perform the method according to any of the preceding EEEs.

EEE 21. A computer-readable storage media having software stored thereon, the software comprising instructions configured to control a processor to perform the method according to any of EEEs 1 - 19.

EEE 22. A computer-implemented method for processing audio content, comprising: obtaining a subtitle file and an audio signal comprising a sequence of audio frames, the subtitle file comprising a plurality of text strings and, for each text string, a start time and an end time associated with the audio signal, wherein each text string is further associated with at least one label out of a plurality of labels; for each audio frame of the audio signal, determining a dialogue confidence value, wherein the dialogue confidence value is based on the label of a text string overlapping in time with the audio frame; and performing dialogue enhancement processing on the audio frames of the audio signal based on the dialogue confidence value.

EEE 23. The method according to EEE 22, further comprising: for each audio frame of the audio signal, processing the audio frame with a first dialogue classifier to obtain a first classifier confidence value, wherein the dialogue confidence value is further based on the first classifier confidence value.

EEE 24. The method according to EEE 23, further comprising: for each audio frame of the audio signal, processing the audio frame with a second dialogue classifier to obtain a second classifier confidence value, wherein the dialogue confidence value is further based on the second classifier confidence value.

EEE 25. The method according to any of EEEs 22 - 24, wherein the dialogue enhancement processing comprises: calculating a respective gain for at least one frequency band of each audio frame of the audio signal based on the dialogue confidence value; and applying the respective gain to the at least one frequency band of each audio frame of the audio signal.

EEE 26. The method according to EEE 25, wherein the dialogue confidence value for an audio frame indicates a degree of confidence for dialogue being present in the audio frame, and wherein the respective gain for an audio frame is an attenuating gain responsive to the degree of confidence for the audio frame being below a predetermined threshold and / or wherein the respective gain for an audio frame is a boosting gain responsive to the degree of confidence for the audio frame being above the predetermined threshold.

EEE 27. The method according to any of EEEs 22 - 26, further comprising: for each audio frame of the audio signal: determining a type of dialogue enhancement processing to be applied based on the label of a text string overlapping in time with the audio frame; and performing the determined type of dialogue enhancement processing on the audio frame based on the dialogue confidence value.

EEE 28. The method according to EEE 27, further comprising: determining that a first audio frame is associated with a first label; performing a first type of dialogue enhancement processing associated with the first label on the first audio frame; determining that a second audio frame is associated with a second label different from the first; and performing a second type of dialogue enhancement processing associated with the second label on the second audio frame.

EEE 29. The method according to EEE 28, wherein the first label indicates dialogue of a first type and wherein the second label indicates dialogue of a second type, and wherein the first type of dialogue enhancement processing comprises applying a first gain and the second type of dialogue enhancement processing comprises applying a second gain, wherein the second gain is different from the first gain.

EEE 30. The method according to EEE 29, wherein the first label indicates regular dialogue and wherein the second label indicates dialogue which is whispered, screamed or muttered, wherein the first label indicates male dialogue and the second label indicates female dialogue, or wherein the first label indicates dialogue from a first speaker and the second label indicates dialogue from a second speaker, and wherein the first gain is different from the second gain.

EEE 31. The method according to EEE 24, wherein the audio signal comprises at least two audio channels, and wherein a first type of dialogue enhancement processing associated with a first label comprises performing a first type of panning processing of the at least two channels and a second type of dialogue enhancement associated with a second label comprises applying a second type of panning processing.

EEE 32. The method according to EEE 31, wherein the first label indicates dialogue of a first type and wherein the second label indicates dialogue of a second type, and wherein the first label is an on-screen dialogue label indicating dialogue associated with on-screen dialogue in a video file associated with the audio signal and wherein the second label is an off-screen dialogue label indicating dialogue associated with off-screen dialogue in the video file associated with the audio signal.

EEE 33. An audio processing system configured to perform the method according to any of EEEs 22 - 32.

EEE 34. A computer-readable storage media having software stored thereon, the software comprising instructions configured to control a processor to perform the method according to any of EEEs 22 - 32.

Claims

1. A computer-implemented method for modifying a subtitle file, comprising: obtaining a subtitle file and an associated audio signal comprising a sequence of audio frames, the subtitle file comprising a plurality of text strings and, for each text string, a start time and an end time associated with the audio signal; processing the audio signal with a dialogue classifier configured to generate a classifier confidence for each audio frame of the sequence of audio frames based on the audio frame, the classifier confidence indicating a degree of confidence for dialogue being present in each audio frame; based on the classifier confidence for each audio frame, identifying each audio frame as a dialogue audio frame or non-dialogue audio frame so as to form a sequence of dialogue and non-dialogue audio segments, each dialogue audio segment being associated with a start time and an end time and comprising at least one dialogue audio frame; for at least a first text string of the plurality of text strings of the subtitle file, identifying a respective corresponding dialogue audio segment of the audio signal, based on the start and / or end times associated with the at least first text string and the start and / or end times associated with the dialogue audio segments; and modifying the start time and / or the end time of the at least first text string to obtain a target alignment of the start time and / or end time of the at least first text string with the start time and / or end time of the respective corresponding dialogue audio segment.
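
As an illustration only, the segment-forming and segment-matching steps of claim 1 might be sketched as follows. The names `Segment`, `dialogue_segments` and `corresponding_segment`, the 0.5 threshold, and the use of largest temporal overlap as the matching criterion are all assumptions made for this sketch; the claim itself only requires that matching be based on start and / or end times.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds


def dialogue_segments(confidences: Sequence[float], frame_s: float,
                      threshold: float = 0.5) -> List[Segment]:
    """Group consecutive frames whose confidence exceeds the threshold."""
    segments: List[Segment] = []
    run_start: Optional[int] = None
    for i, conf in enumerate(confidences):
        is_dialogue = conf > threshold
        if is_dialogue and run_start is None:
            run_start = i  # a dialogue run begins at this frame
        elif not is_dialogue and run_start is not None:
            segments.append(Segment(run_start * frame_s, i * frame_s))
            run_start = None
    if run_start is not None:  # close a run that reaches the last frame
        segments.append(Segment(run_start * frame_s, len(confidences) * frame_s))
    return segments


def corresponding_segment(sub: Segment,
                          segments: List[Segment]) -> Optional[Segment]:
    """Assumed criterion: pick the dialogue segment with the largest overlap."""
    best, best_overlap = None, 0.0
    for seg in segments:
        overlap = min(sub.end, seg.end) - max(sub.start, seg.start)
        if overlap > best_overlap:
            best, best_overlap = seg, overlap
    return best
```

With frames of 0.5 s and confidences `[0.1, 0.9, 0.8, 0.2, 0.7]`, this sketch yields two dialogue segments, and a text string spanning 1.0 s to 2.1 s is matched to the first of them.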

2. The method according to claim 1, further comprising: determining whether the start time of the at least first text string lies between the start and end time of the respective corresponding dialogue audio segment; and in response to determining that the start time of the at least first text string lies between the start and end time of the respective corresponding dialogue audio segment, modifying the start time of the at least first text string based on the start time of the respective corresponding dialogue audio segment.

3. The method according to claim 2, wherein modifying the start time of the at least first text string comprises modifying the start time of the at least first text string such that it corresponds to the start time of the respective corresponding dialogue audio segment.

4. The method according to claim 2, wherein modifying the start time of the at least first text string comprises modifying the start time of the at least first text string such that it is temporally earlier than and closer to the start time of the respective corresponding dialogue audio segment, or temporally later than and closer to the start time of the respective corresponding dialogue audio segment.

5. The method according to any of the preceding claims, further comprising: determining whether the end time of the at least first text string lies between the start and end time of the respective corresponding dialogue audio segment; and in response to determining that the end time of the at least first text string lies between the start and end time of the respective corresponding dialogue audio segment, modifying the end time of the at least first text string based on the end time of the respective corresponding dialogue audio segment.

6. The method according to claim 5, wherein modifying the end time of the at least first text string comprises modifying the end time of the at least first text string such that it corresponds to the end time of the respective corresponding dialogue audio segment.

7. The method according to claim 5, wherein modifying the end time of the at least first text string comprises modifying the end time of the at least first text string such that it is temporally earlier than and closer to the end time of the respective corresponding dialogue audio segment, or temporally later than and closer to the end time of the respective corresponding dialogue audio segment.

8. The method according to any of claims 2 - 7, further comprising: for at least a second text string of the plurality of text strings of the subtitle file, identifying a respective corresponding dialogue audio segment of the audio signal, based on the start and / or end times associated with the at least second text string and the start and / or end times associated with the dialogue audio segments; determining whether the start time and end time of the at least second text string segment of the plurality of text strings lies temporally earlier than, and temporally later than, the start and end time of its respective corresponding dialogue audio segment, respectively; and in response to determining that the start time and end time of the at least second text string segment of the plurality of text strings lies temporally earlier than, and temporally later than, the start and end time of its respective corresponding dialogue audio segment, respectively, keeping the start time and end time of the at least second text string such that the start time and end time of the at least second text string is preserved in the modified subtitle file.
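
The conditional adjustments of claims 2, 3, 5, 6 and 8 can be pictured with a small sketch. The function name `align_text_string` is hypothetical, and treating "lies between" as an inclusive comparison is an assumption; the sketch implements the variant in which a modified time is set equal to the segment boundary (claims 3 and 6), not the "closer to" variants of claims 4 and 7.

```python
from typing import Tuple


def align_text_string(sub_start: float, sub_end: float,
                      seg_start: float, seg_end: float) -> Tuple[float, float]:
    """Snap subtitle times that fall inside the dialogue segment to its edges."""
    new_start, new_end = sub_start, sub_end
    # Start time lies within the segment: move it to the segment start.
    if seg_start <= sub_start <= seg_end:
        new_start = seg_start
    # End time lies within the segment: move it to the segment end.
    if seg_start <= sub_end <= seg_end:
        new_end = seg_end
    # If the text string starts before and ends after the segment, neither
    # condition holds and both times are preserved, as in claim 8.
    return new_start, new_end
```

For a dialogue segment from 1.0 s to 3.0 s, a text string timed 1.2 s to 2.8 s would be widened to 1.0 s to 3.0 s, while a text string timed 0.5 s to 3.5 s would be left unchanged.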

9. The method according to claim 1, further comprising: modifying the start time and / or the end time of each first text string with a same time adjustment to obtain the target alignment between the first text strings and the respective corresponding dialogue audio segments.

10. The method according to any of claims 2 - 8, wherein identifying a respective corresponding dialogue audio segment of the audio signal comprises: forming global aligned text strings by adjusting the start time and end time of each text string with a same global time adjustment to obtain a target degree of time correlation between the text strings and the dialogue audio segments or classifier confidence; and wherein each first and / or second text string is a global aligned text string.

11. The method according to claim 1, further comprising: partitioning the plurality of text strings of the subtitle file to form a sequence of subtitle blocks, each subtitle block comprising at least one text string; partitioning the dialogue audio segments to form a sequence of dialogue audio segment blocks, each dialogue audio segment block comprising at least one dialogue audio segment and being time aligned with a corresponding subtitle block to define a time aligned block pair; the method comprising, for each time aligned block pair: for at least a first text string of the subtitle block of the block pair, identifying a respective corresponding dialogue audio segment of the dialogue audio segment block of the block pair, based on the start and / or end time of the at least first text string and the start and / or end time of the respective corresponding dialogue audio segment of the dialogue audio segment block; and modifying all start times and end times of each text string in the subtitle block with a block-specific time adjustment.

12. The method according to any of claims 2 - 9, wherein identifying a respective corresponding dialogue audio segment of the audio signal comprises: forming block aligned text strings by: partitioning the plurality of text strings of the subtitle file to form a sequence of subtitle blocks, each subtitle block comprising at least one text string; partitioning the dialogue audio segments to form a sequence of dialogue audio segment blocks, each dialogue audio segment block comprising at least one dialogue audio segment and being time aligned with a corresponding subtitle block to define a time aligned block pair; the method comprising, for each time aligned block pair: modifying all start times and end times of each text string in the subtitle block with a block-specific time adjustment to obtain a target degree of time correlation between the text strings and the dialogue audio segments of the time aligned block pair.

13. The method according to any of the preceding claims, wherein the at least first text string is at least two first text strings.

14. The method according to any of the preceding claims, wherein the text strings of the subtitle file are dialogue text strings, and wherein the subtitle file further comprises non-dialogue text strings.

15. The method according to claim 14, further comprising: obtaining an initial subtitle file, the initial subtitle file comprising a plurality of initial text strings; and processing the initial subtitle file with a file parser to form the subtitle file, wherein the file parser is configured to label each initial text string of the initial subtitle file with at least one of a plurality of labels, wherein at least one label is a dialogue label, and wherein each initial text string provided with at least one dialogue label is a dialogue text string and each remaining initial text string is a non-dialogue text string.

16. The method according to any of the preceding claims, wherein processing the audio signal with a dialogue classifier to generate a classifier confidence comprises: processing the audio signal with a first dialogue classifier submodule to generate a first classifier confidence value for each audio frame of the audio signal; processing the audio signal with a second dialogue classifier submodule to generate a second classifier confidence value for each audio frame of the audio signal; and combining the first and second classifier confidence value to form the classifier confidence.
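
Claim 16 leaves the combination rule open. One possible reading, shown purely as a sketch, is a per-frame weighted average; the function name `combine_confidences` and the `weight` parameter are hypothetical and not taken from the claims.

```python
from typing import List, Sequence


def combine_confidences(first: Sequence[float], second: Sequence[float],
                        weight: float = 0.5) -> List[float]:
    """Per-frame weighted average of two classifier confidence sequences."""
    if len(first) != len(second):
        raise ValueError("both submodules must score the same number of frames")
    # weight applies to the first submodule, (1 - weight) to the second.
    return [weight * a + (1.0 - weight) * b for a, b in zip(first, second)]
```

Other combinations, such as taking the per-frame maximum or minimum, would equally satisfy the wording of the claim.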

17. The method according to any of the preceding claims, wherein identifying each audio frame as a dialogue audio frame or non-dialogue audio frame comprises, for each audio frame: identifying the audio frame as a dialogue audio frame based on the classifier confidence of the audio frame exceeding a predetermined first threshold.

18. The method according to any of the preceding claims, further comprising: obtaining a plurality of different candidate subtitle files, each candidate subtitle file comprising a plurality of text strings associated with the audio signal and, for each text string, a start time and an end time; for each candidate subtitle file, determining global aligned text strings by adjusting the start time and end time of each text string with a global time adjustment to obtain a target degree of time correlation between the text strings of the candidate subtitle file and the dialogue audio segments; and selecting a candidate subtitle file as the selected subtitle file, based on the target degree of time correlation obtained for each candidate subtitle file.

19. The method according to claim 18, wherein determining global aligned text strings for each candidate subtitle file comprises: calculating the time correlation between the text strings of the candidate subtitle file and the dialogue audio segments for a plurality of sample time adjustments; and selecting a sample adjustment to use as the global time adjustment based on the time correlation associated with each sample time adjustment.
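
The search described in claims 18 and 19 can be illustrated with the sketch below. It assumes, for illustration only, that the "time correlation" is measured as the total overlap in seconds between the shifted text strings and the dialogue audio segments; the claims do not fix a particular measure. The names `total_overlap`, `best_global_shift` and `select_candidate` are hypothetical.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def total_overlap(subs: Sequence[Interval], segs: Sequence[Interval],
                  shift: float) -> float:
    """Assumed correlation measure: overlap between shifted subtitles and dialogue."""
    total = 0.0
    for s_start, s_end in subs:
        for g_start, g_end in segs:
            total += max(0.0, min(s_end + shift, g_end) - max(s_start + shift, g_start))
    return total


def best_global_shift(subs: Sequence[Interval], segs: Sequence[Interval],
                      sample_shifts: Sequence[float]) -> Tuple[float, float]:
    """Evaluate every sample shift and return (best shift, its correlation)."""
    scored = [(total_overlap(subs, segs, d), d) for d in sample_shifts]
    best_score, best_shift = max(scored, key=lambda pair: pair[0])
    return best_shift, best_score


def select_candidate(candidates: Sequence[Sequence[Interval]],
                     segs: Sequence[Interval],
                     sample_shifts: Sequence[float]) -> int:
    """Return the index of the candidate file with the highest best correlation."""
    scores: List[float] = [best_global_shift(c, segs, sample_shifts)[1]
                           for c in candidates]
    return scores.index(max(scores))
```

Under this measure, a candidate subtitle file whose text strings can be brought into full overlap with the dialogue segments by a single shift would be preferred over one that cannot.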

20. The method according to any of the preceding claims, wherein each audio frame is shorter than 100 ms, or shorter than 20 ms.

21. The method according to any of the preceding claims, wherein the classifier confidence is a numerical value.

22. The method according to claim 21, wherein the classifier confidence is a numerical value on a range defined between a first numerical value and a second numerical value, the first numerical value indicating a minimum classifier confidence value and the second numerical value indicating a maximum classifier confidence value.

23. The method according to claim 22, wherein identifying each audio frame as a dialogue audio frame or non-dialogue audio frame so as to form a sequence of dialogue and non-dialogue audio segments comprises: forming the dialogue and non-dialogue audio segments by comparing the classifier confidence to a threshold.

24. A subtitle processing system configured to perform the method according to any of the preceding claims.

25. A computer-readable storage media having software stored thereon, the software comprising instructions configured to control a processor to perform the method according to any of claims 1 - 19.

26. A computer-implemented method for processing audio content, comprising: obtaining a subtitle file and an audio signal comprising a sequence of audio frames, the subtitle file comprising a plurality of text strings and, for each text string, a start time and an end time associated with the audio signal, wherein each text string is further associated with at least one label out of a plurality of predetermined labels, wherein each label is a dialogue label indicating that the text string is associated with dialogue in the audio signal or a non-dialogue label indicating that the text-string is not associated with dialogue in the audio signal; for each audio frame of the audio signal, determining a dialogue confidence value indicating a degree of confidence for dialogue being present in each frame, wherein the dialogue confidence value is based on the label of a text string overlapping in time with the audio frame; and controlling a dialogue enhancement processing of the audio frames of the audio signal based on the dialogue confidence value.
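
To make the label-based confidence of the claim above concrete, the following sketch derives a per-frame dialogue confidence value from the labels of overlapping text strings. The `TextString` type, the function `label_confidences`, and the choice of 1.0 for dialogue-labelled frames and 0.0 for all other frames (including frames with no overlapping text string) are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class TextString:
    start: float       # seconds
    end: float         # seconds
    is_dialogue: bool  # True for a dialogue label, False for a non-dialogue label


def label_confidences(text_strings: Sequence[TextString], num_frames: int,
                      frame_s: float, dialogue_value: float = 1.0,
                      other_value: float = 0.0) -> List[float]:
    """Assign each frame a confidence from the label of an overlapping text string."""
    confidences = [other_value] * num_frames
    for i in range(num_frames):
        f_start, f_end = i * frame_s, (i + 1) * frame_s
        for t in text_strings:
            overlaps = t.start < f_end and t.end > f_start
            if overlaps and t.is_dialogue:
                confidences[i] = dialogue_value
                break
    return confidences
```

In a fuller system this value could be combined with one or more classifier confidence values, as the dependent claims describe.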

27. The method according to claim 22, further comprising: for each audio frame of the audio signal, processing the audio frame with a first dialogue classifier to obtain a first classifier confidence value, wherein the dialogue confidence value is further based on the first classifier confidence value.

28. The method according to claim 23, further comprising: for each audio frame of the audio signal, processing the audio frame with a second dialogue classifier to obtain a second classifier confidence value, wherein the dialogue confidence value is further based on the second classifier confidence value.

29. The method according to any of claims 22 - 24, wherein the dialogue enhancement processing comprises: calculating a respective gain for at least one frequency band of each audio frame of the audio signal based on the dialogue confidence value; applying the respective gain to the at least one frequency band of each audio frame of the audio signal.

30. The method according to claim 25, wherein the respective gain for an audio frame is an attenuating gain responsive to the degree of confidence for the audio frame being below a predetermined threshold and / or wherein the respective gain for an audio frame is a boosting gain responsive to the degree of confidence for the audio frame being above the predetermined threshold.
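
The threshold-dependent gain of the claim above might look like the following sketch. It applies a single broadband gain per frame rather than a gain per frequency band, and the threshold of 0.5, the boost of 6 dB and the cut of 3 dB are illustrative assumptions, as are the names `frame_gain_db` and `apply_gain`.

```python
from typing import List, Sequence


def frame_gain_db(confidence: float, threshold: float = 0.5,
                  boost_db: float = 6.0, cut_db: float = -3.0) -> float:
    """Boost frames above the threshold, attenuate frames below it."""
    if confidence > threshold:
        return boost_db
    if confidence < threshold:
        return cut_db
    return 0.0  # exactly at the threshold: leave the frame unchanged


def apply_gain(samples: Sequence[float], gain_db: float) -> List[float]:
    """Scale the samples of one frame by a gain given in decibels."""
    scale = 10.0 ** (gain_db / 20.0)
    return [s * scale for s in samples]
```

A frame with confidence 0.9 would thus be boosted and a frame with confidence 0.1 attenuated; a practical implementation would likely smooth the gain across frames to avoid audible steps.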

31. The method according to any of claims 22 - 26, further comprising: for each audio frame of the audio signal: determining a type of dialogue enhancement processing to be applied based on the label of a text string overlapping in time with the audio frame; and performing the determined type of dialogue enhancement processing on the audio frame based on the dialogue confidence value.

32. The method according to claim 27, further comprising: determining that a first audio frame is associated with a first label; performing a first type of dialogue enhancement processing associated with the first label on the first audio frame; determining that a second audio frame is associated with a second label different from the first; and performing a second type of dialogue enhancement processing associated with the second label on the second audio frame.

33. The method according to claim 28, wherein the first label indicates dialogue of a first type and wherein the second label indicates dialogue of a second type, and wherein the first type of dialogue enhancement processing comprises applying a first gain and the second type of dialogue enhancement processing comprises applying a second gain, wherein the second gain is different from the first gain.

34. The method according to claim 29, wherein the first label indicates regular dialogue and wherein the second label indicates dialogue which is whispered, screamed or muttered, wherein the first label indicates male dialogue and the second label indicates female dialogue, or wherein the first label indicates dialogue from a first speaker and the second label indicates dialogue from a second speaker, and wherein the first gain is different from the second gain.

35. The method according to claim 24, wherein the audio signal comprises at least two audio channels, and wherein a first type of dialogue enhancement processing associated with a first label comprises performing a first type of panning processing of the at least two channels and a second type of dialogue enhancement associated with a second label comprises applying a second type of panning processing.

36. The method according to claim 31, wherein the first label indicates dialogue of a first type and wherein the second label indicates dialogue of a second type, and wherein the first label is an on-screen dialogue label indicating dialogue associated with on-screen dialogue in a video file associated with the audio signal and wherein the second label is an off-screen dialogue label indicating dialogue associated with off-screen dialogue in the video file associated with the audio signal.

37. An audio processing system configured to perform the method according to any of claims 22 - 32.

38. A computer-readable storage media having software stored thereon, the software comprising instructions configured to control a processor to perform the method according to any of claims 22 - 32.