Real-time subtitle display method and system

A technology relating to subtitles and screens, applied in the field of real-time subtitle display methods and systems. It addresses problems such as the small amount of information transmitted, the inability to re-view subtitle text on the spot, and the failure to meet application requirements, achieving the effect of improving intelligibility and improving the display effect.

Active Publication Date: 2017-01-11
IFLYTEK CO LTD
6 Cites 30 Cited by

AI-Extracted Technical Summary

Problems solved by technology

[0003] At present, audio and video subtitles are generally produced for pre-recorded audio and video by manually adding subtitle text according to the speaker's content and displaying that text directly on the audio/video screen; in addition, when such subtitles are displayed, only one or two lines of subtitle text are shown on the screen at a time, and the amount of info...

Method used

To address the problems of existing subtitle display methods, the embodiments of the present invention provide a real-time subtitle display method and system: punctuation is added to the recognized subtitle text to be displayed to obtain semantically complete subtitle text clauses; whether the end position of each subtitle text clause needs to be segmented is determined and marked; the subtitle display basic unit is then determined according to the speaker's prosodic features, and the subtitle text clauses are displayed according to that basic unit. This increases the context of the subtitle text display, greatly improves the intelligibility of the speaker's content, and improves the effect of the speaker's information transmission.
Specifically, the subtitle text can be segmented with a method based on model training, such as a conditional random field, a support vector machine, or a neural network; for example, a bidirectional long short-term memory network (BiLSTM) can be used, which effectively memorizes longer contextual information and improves segmentation accuracy. The model's input is the subtitle text clause vector and its output is the segmentation result, i.e., whether the clause can be segmented at its end position; for example, "1" and "0" indicate that the end position does or does not need to be segmented.
The real-time subtitle display method and system of the embodiments of the present invention can be applied to real-time subtitle display for live broadcasts or on-site speeches, adding contextual information to the subtitle text to help users understand the speaker's content and improve its intelligibility. For example, in a conference scene, the speech content of each speaker is displayed on the screen in real time, and the participants can see the...

Abstract

The invention discloses a real-time subtitle display method and system. The real-time subtitle display method comprises the steps of: receiving a speaker's voice data; performing speech recognition on the current voice data to obtain the subtitle text to be displayed; adding punctuation to the subtitle text to obtain subtitle text clauses; determining and marking whether the end position of each subtitle text clause requires segmentation; determining a subtitle display basic unit according to prosodic features of the speaker; and displaying the subtitle text according to the subtitle display basic unit. By using the real-time subtitle display method and system disclosed by the invention, the effect of the speaker's information transmission can be improved.

Examples

  • Experimental program(1)

Example Embodiment

[0082] In order to enable those skilled in the art to better understand the solutions of the embodiments of the present invention, the embodiments of the present invention are further described in detail below with reference to the accompanying drawings and specific implementations.
[0083] In view of the problems existing in existing subtitle display methods, the embodiments of the present invention provide a real-time subtitle display method and system, which add punctuation to the recognized subtitle text to be displayed to obtain semantically complete subtitle text clauses, determine and mark whether the end position of each subtitle text clause needs to be segmented, then determine the subtitle display basic unit according to the speaker's prosodic features, and display the subtitle text clauses according to that basic unit. This increases the context of the subtitle text display, greatly improves the intelligibility of the speaker's content, and further improves the effect of the speaker's information transmission.
[0084] Figure 1 is a flowchart of a real-time subtitle display method according to an embodiment of the present invention, which includes the following steps:
[0085] Step 101: Receive the speaker's voice data.
[0086] The voice data is determined according to actual application requirements. For example, it can be the voice data of each speaker during a meeting, the voice data of the interviewer and interviewee during an interview, or the voice data of the speaker or guests during a speech, and so on.
[0087] Step 102: Perform speech recognition on the current voice data to obtain the subtitle text to be displayed.
[0088] The specific process of performing speech recognition on the current voice data is as follows: first perform endpoint detection on the voice data to obtain the start and end points of each effective speech segment; then perform feature extraction on the detected effective speech segments; finally, decode the extracted features with a pre-trained acoustic model and language model to obtain the recognized text corresponding to the voice data, which is used as the subtitle text to be displayed. The specific speech recognition process is the same as in the prior art and is not detailed here.
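As a small illustration of the endpoint-detection step only, the following is a toy energy-based detector; the frame length, threshold, and sample data are assumptions for illustration. Real systems use trained voice-activity detection and then pass each segment to feature extraction and an acoustic/language-model decoder, which are not sketched here.

```python
import numpy as np

def detect_endpoints(samples, frame_len=400, energy_threshold=0.01):
    """Toy endpoint detection: return (start, end) sample indices of segments
    whose per-frame energy exceeds a fixed threshold (illustrative assumption)."""
    n_frames = len(samples) // frame_len
    energy = np.array([np.mean(samples[i * frame_len:(i + 1) * frame_len] ** 2)
                       for i in range(n_frames)])
    voiced = energy > energy_threshold
    segments, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i                                  # segment starts on first voiced frame
        elif not v and start is not None:
            segments.append((start * frame_len, i * frame_len))
            start = None
    if start is not None:                              # segment runs to the end of the audio
        segments.append((start * frame_len, n_frames * frame_len))
    return segments

# Example: 1 second of silence followed by 1 second of noise-like "speech" at 16 kHz.
audio = np.concatenate([np.zeros(16000), 0.3 * np.random.randn(16000)])
print(detect_endpoints(audio))   # roughly [(16000, 32000)]
```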
[0089] Step 103: Add punctuation to the subtitle text to obtain subtitle text clauses.
[0090] Punctuation can be added to the subtitle text with a model-based method, for example using a conditional random field model to insert punctuation into the recognized text. The specific process is the same as in the prior art and is not detailed here.
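As a hedged sketch of what such model-based punctuation restoration can look like: the patent only names a conditional random field, so the use of the third-party sklearn-crfsuite package, the feature set, and the toy training data below are illustrative assumptions.

```python
import sklearn_crfsuite

def word_features(words, i):
    # Minimal illustrative features: the current word plus its neighbours.
    return {
        "word": words[i],
        "prev": words[i - 1] if i > 0 else "<s>",
        "next": words[i + 1] if i < len(words) - 1 else "</s>",
    }

def to_features(sentences):
    return [[word_features(words, i) for i in range(len(words))] for words in sentences]

# Toy training data: recognized word sequences and, per word, the punctuation that follows it.
train_words = [["hello", "everyone", "welcome", "to", "the", "meeting"]]
train_labels = [["COMMA", "NONE", "NONE", "NONE", "NONE", "PERIOD"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(to_features(train_words), train_labels)

# Inference: predict the punctuation mark (if any) after each recognized word.
print(crf.predict(to_features([["thank", "you", "for", "coming"]])))
```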
[0091] Step 104: Determine and mark whether the end position of each subtitle text clause needs to be segmented.
[0092] Specifically, the subtitle text can be segmented with a method based on model training, such as a conditional random field, a support vector machine, or a neural network; for example, a bidirectional long short-term memory network (BiLSTM) can be used to segment the subtitle text, which effectively memorizes longer context information and improves segmentation accuracy. The input of the model is the subtitle text clause vector, and the output is the segmentation result, i.e., whether the clause can be segmented at its end position; for example, "1" and "0" indicate that the end position does or does not need to be segmented.
[0093] The segmentation model is trained as follows: first, collect a large amount of recognized text data and mark whether the end position of each clause needs to be segmented as the labeling feature; then extract the clause vector of each clause, which can be obtained from the word vectors of the words in the clause in the same way as in the prior art, for example by summing the word vectors of the words in the clause; finally, use the clause vectors and labeling features as training data to train the model parameters and obtain the segmentation model.
[0094] When the segmentation model is used to determine whether the end position of each subtitle text clause needs to be segmented, the clause vector of the subtitle text clause is extracted and input into the segmentation model, which outputs the segmentation mark for the end position of the clause.
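A minimal PyTorch sketch of such a BiLSTM segmentation model follows. The hidden size, clause-vector dimension, and random example inputs are illustrative assumptions; only the overall shape — clause vectors in, a segment/no-segment decision per clause end out, clause vectors built by summing word vectors — follows the description above.

```python
import torch
import torch.nn as nn

class SegmentationBiLSTM(nn.Module):
    def __init__(self, clause_dim=128, hidden_dim=64):
        super().__init__()
        self.bilstm = nn.LSTM(clause_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden_dim, 2)   # 1 = segment here, 0 = do not

    def forward(self, clause_vectors):                    # (batch, num_clauses, clause_dim)
        hidden, _ = self.bilstm(clause_vectors)           # (batch, num_clauses, 2*hidden_dim)
        return self.classifier(hidden)                    # logits for each clause end position

def clause_vector(word_vectors):
    """Clause vector as the sum of its word vectors, as in paragraph [0093]."""
    return torch.stack(word_vectors).sum(dim=0)

# Example: a clause of four words, then three clauses stacked as one batch for the model.
words = [torch.randn(128) for _ in range(4)]
one_clause = clause_vector(words)                                          # (128,)
clauses = torch.stack([one_clause, torch.randn(128), torch.randn(128)]).unsqueeze(0)
model = SegmentationBiLSTM()
segment_flags = model(clauses).argmax(dim=-1)                              # e.g. tensor([[0, 1, 0]])
```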
[0095] Step 105: Determine the basic unit of subtitle display according to the prosodic characteristics of the speaker.
[0096] The speaker's prosodic features refer to the speaker's speech rate and pause duration when speaking. In order to prevent a speech rate that is too fast or a pause duration that is too short from causing a large delay in subtitle display, the embodiment of the present invention displays subtitles in terms of a subtitle display basic unit. The subtitle display basic unit refers to the unit of subtitle text received by the display module at one time.
[0097] When determining the subtitle display basic unit, first calculate the speaker's current speech rate, i.e., the number of words spoken per second; then calculate the speaker's pause duration, which mainly refers to the pause between semantically complete clauses; finally, determine whether the speech rate exceeds a preset speech rate threshold, or whether the pause duration between subtitle text clauses is below a pause duration threshold. If so, a subtitle text clause is used as the subtitle display basic unit; otherwise, the recognized text corresponding to an effective speech segment in speech recognition is used as the subtitle display basic unit, and the recognized text of each effective speech segment generally contains multiple clauses.
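An illustrative implementation of this decision rule is sketched below; the specific threshold values are assumptions, since the patent does not fix them.

```python
def choose_display_unit(words_spoken, speech_seconds, pause_seconds,
                        rate_threshold=5.0, pause_threshold=0.8):
    """Return 'clause' to display clause by clause, or 'segment' to display the
    whole recognized text of an effective speech segment at once."""
    speech_rate = words_spoken / max(speech_seconds, 1e-6)   # words per second
    if speech_rate > rate_threshold or pause_seconds < pause_threshold:
        return "clause"    # fast speech / short pauses: push smaller units to the screen
    return "segment"       # otherwise one effective speech segment (several clauses) at a time

print(choose_display_unit(words_spoken=42, speech_seconds=6.0, pause_seconds=0.3))  # 'clause'
```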
[0098] Step 106: Display the subtitle text according to the basic subtitle display unit.
[0099] During display, the entire screen or part of the screen can be used to show the subtitle text, and the subtitle text on the screen is updated according to the subtitle display basic unit and the segmentation information. Specifically, the on-screen subtitle text is updated based on the number of words in the current subtitle display basic unit, the maximum number of words that can be displayed on the screen, the number of words currently on the screen, and whether the subtitle text in the current basic unit belongs to the same paragraph as the subtitle text on the screen, so that the speaker's content is displayed on the screen in real time.
[0100] The maximum number of words that can be displayed on the screen can be set according to application requirements, for example, the entire screen can display 70 words.
[0101] The specific process of subtitle text display will be described in detail below.
[0102] Figure 2 is a flowchart of subtitle text display in an embodiment of the present invention, where N represents the maximum number of words that can be displayed on the screen. The process is as follows (an illustrative code sketch follows the steps):
[0103] Step 201: Receive the subtitle text of one subtitle display basic unit as the current subtitle text.
[0104] Step 202: Determine whether the sum of the number of words in the current subtitle text and the number of words in the last subtitle display basic unit on the screen exceeds the maximum number of words N that can be displayed on the screen; if so, go to step 203; otherwise, go to step 204.
[0105] Step 203: Clear all subtitle text on the screen and display the current subtitle text on the screen; then return to step 201.
[0106] Step 204: Determine whether the sum of the number of words in the current subtitle text and the number of words of all subtitle text on the screen exceeds the maximum number of words N that can be displayed on the screen; if so, go to step 205; otherwise, go to step 207.
[0107] Step 205: Determine whether the subtitle text of the last subtitle display basic unit on the screen carries a segmentation mark; if so, go to step 203; otherwise, go to step 206.
[0108] Step 206: Clear all text before the subtitle text of the last subtitle display basic unit on the screen, then go to step 207.
[0109] Step 207: Display the current subtitle text directly after the subtitle text of the last subtitle display basic unit; then return to step 201.
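The following is a direct, hedged transcription of steps 201–207 into Python, assuming the on-screen state is kept as a list of (text, has_segmentation_mark) pairs, one per displayed basic unit, and counting characters rather than words for simplicity.

```python
def update_screen(screen_units, current_text, has_segmentation_mark, N=70):
    """Update and return the on-screen units after receiving one subtitle display basic unit."""
    last_len = len(screen_units[-1][0]) if screen_units else 0
    total_len = sum(len(text) for text, _ in screen_units)

    # Steps 202/203: current unit plus the last on-screen unit would not fit -> clear everything.
    if len(current_text) + last_len > N:
        return [(current_text, has_segmentation_mark)]

    # Step 204: does the current unit fit after everything already on screen?
    if len(current_text) + total_len > N:
        # Step 205: if the last on-screen unit ends a paragraph, start a fresh screen.
        if screen_units and screen_units[-1][1]:
            return [(current_text, has_segmentation_mark)]
        # Step 206: otherwise keep only the last unit and drop all text before it.
        screen_units = screen_units[-1:]

    # Step 207: append the current unit directly after the last one.
    return screen_units + [(current_text, has_segmentation_mark)]

# Example: a clause that fits is appended; a long one forces the screen to be cleared.
screen = update_screen([], "Good morning everyone.", False)
screen = update_screen(screen, "Today we discuss the quarterly results.", True)
print(screen)
```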
[0110] The real-time subtitle display method provided by the embodiment of the present invention adds punctuation to the recognized subtitle text to be displayed to obtain semantically complete subtitle text clauses, determines and marks whether the end position of each subtitle text clause needs to be segmented, then determines the subtitle display basic unit according to the speaker's prosodic features, and displays the subtitle text clauses according to that basic unit, thereby increasing the context of the subtitle text display, greatly improving the intelligibility of the speaker's content, and improving the effect of the speaker's information transmission.
[0111] Further, in another embodiment of the method of the present invention, when subtitles are displayed, the named entities and clue words in the subtitle text can be highlighted, for example by displaying them in a different color or a different font, so as to emphasize the key points of the text and improve the display effect.
[0112] Named entities refer to words with key meaning such as person names, place names, and organization names; clue words refer to words that express transition, explanation, causality, and other relationships. Named entities and clue words are important for understanding the subtitle text, and they are also the words users pay more attention to. Therefore, the embodiment of the present invention recognizes them and highlights them. Specifically, in the embodiment of the present invention, the recognition of named entities and clue words is treated as a sequence-to-sequence translation process, and an Encoder-Decoder sequence-to-sequence model is constructed to identify the named entities and clue words in the subtitle text.
[0113] Figure 3 shows the structure of the Encoder-Decoder sequence-to-sequence model in an embodiment of the present invention, which includes the following parts (a simplified code sketch follows the list):
[0114] 1) Input layer: the word vector of each segmented word of the text data;
[0115] 2) Word encoding layer: a unidirectional long short-term memory (LSTM) network encodes each input word vector from left to right;
[0116] 3) Sentence encoding layer: the output of the last word-encoding node of each sentence is used as the input of the sentence encoding layer to model the relationship between sentences;
[0117] 4) Sentence decoding layer: the output of the last node of the sentence encoding layer is used as the input of the sentence decoding layer;
[0118] 5) Word decoding layer: a unidirectional LSTM decodes each word in turn from right to left;
[0119] 6) Output layer: outputs the labeling feature of each word, i.e., whether the word is a named entity or a clue word.
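A simplified PyTorch sketch of this tagger follows. It keeps the layer structure of Figure 3 (word encoder, sentence encoder, sentence decoder, word decoder, per-word labels) but collapses details the patent leaves open, such as dimensions and exactly how the decoder is conditioned; treat it as an illustrative assumption rather than the exact network.

```python
import torch
import torch.nn as nn

class HighlightTagger(nn.Module):
    NUM_LABELS = 3  # 0 = other, 1 = named entity, 2 = clue word

    def __init__(self, word_dim=100, hidden=64):
        super().__init__()
        self.word_encoder = nn.LSTM(word_dim, hidden, batch_first=True)           # left-to-right
        self.sent_encoder = nn.LSTM(hidden, hidden, batch_first=True)             # over sentences
        self.sent_decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.word_decoder = nn.LSTM(word_dim + hidden, hidden, batch_first=True)  # right-to-left
        self.output = nn.Linear(hidden, self.NUM_LABELS)

    def forward(self, word_vecs):                      # (sentences, words, word_dim)
        enc, _ = self.word_encoder(word_vecs)
        sent_repr = enc[:, -1, :].unsqueeze(0)         # last word-encoder state of each sentence
        sent_enc, _ = self.sent_encoder(sent_repr)
        sent_dec, _ = self.sent_decoder(sent_enc)
        # Condition each word on its sentence-level decoder state, then decode right-to-left.
        cond = sent_dec.squeeze(0).unsqueeze(1).expand(-1, word_vecs.size(1), -1)
        reversed_in = torch.flip(torch.cat([word_vecs, cond], dim=-1), dims=[1])
        dec, _ = self.word_decoder(reversed_in)
        return self.output(torch.flip(dec, dims=[1]))  # per-word label logits, original order

# Example: 2 sentences of 5 words each, each word represented by a 100-dim vector.
logits = HighlightTagger()(torch.randn(2, 5, 100))
labels = logits.argmax(dim=-1)   # which words to highlight as named entities / clue words
```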
[0120] The construction process of the Encoder-Decoder sequence-to-sequence model is shown in Figure 4 and includes the following steps:
[0121] Step 401: Collect a large amount of text data.
[0122] Step 402: Mark the named entities and clue words in the text data as labeling features.
[0123] Step 403: Perform word segmentation on the text data, and extract the word vector of each word.
[0124] The specific methods for word segmentation and word vector extraction are the same as those in the prior art, and will not be detailed here.
[0125] Step 404: Train the Encoder-Decoder sequence-to-sequence model using the word vectors of the text data and the labeling features to obtain the model parameters.
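A hedged sketch of the training loop for step 404, reusing the HighlightTagger class from the sketch above; the random tensors only stand in for the word vectors of step 403 and the per-word labels of step 402, and the optimizer and epoch count are illustrative assumptions.

```python
import torch
import torch.nn as nn

model = HighlightTagger()                       # defined in the previous sketch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

word_vecs = torch.randn(8, 5, 100)              # stand-in for step 403: word vectors of segmented text
labels = torch.randint(0, 3, (8, 5))            # stand-in for step 402: per-word annotation features

for epoch in range(10):                         # step 404: fit the model parameters
    optimizer.zero_grad()
    logits = model(word_vecs)
    loss = loss_fn(logits.reshape(-1, 3), labels.reshape(-1))
    loss.backward()
    optimizer.step()
```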
[0126] When this model is used to recognize the named entities and clue words of the subtitle text, the word vectors of the subtitle text are extracted and input into the Encoder-Decoder sequence-to-sequence model, and the recognition result output by the model is obtained.
[0127] Correspondingly, an embodiment of the present invention also provides a real-time subtitle display system, a schematic diagram of which is shown in Figure 5.
[0128] In this embodiment, the system includes:
[0129] The receiving module 501 is used to receive the speaker's voice data;
[0130] The speech recognition module 502 is configured to perform speech recognition on the current voice data to obtain the subtitle text to be displayed;
[0131] The punctuation adding module 503 is used to add punctuation to the subtitle text to obtain subtitle text clauses;
[0132] The segmentation marking module 504 is used to determine and mark whether the end position of each subtitle text clause needs to be segmented;
[0133] The basic unit determining module 505 is configured to determine the basic unit of subtitle display according to the prosodic characteristics of the speaker;
[0134] The display module 506 is configured to display the subtitle text according to the basic subtitle display unit.
[0135] In practical applications, the aforementioned speech recognition module 502 may specifically adopt some existing speech recognition methods to obtain the recognized text, that is, the subtitle text to be displayed.
[0136] The punctuation adding module 503 can use a model-based method, such as a conditional random field model, to add punctuation to the recognized text.
[0137] The segmentation marking module 504 may use a method based on model training to segment the subtitle text. The segmentation model can be trained by a corresponding segmentation model training module, which collects a large amount of recognized text data and marks whether the end position of each clause needs to be segmented as the labeling feature; then extracts the clause vectors of the text data; and finally uses the clause vectors and labeling features as training data to train the model parameters and obtain the segmentation model. The segmentation model training module may be part of the system or independent of it, which is not limited in the embodiment of the present invention. Correspondingly, when the segmentation marking module 504 uses the segmentation model to segment the subtitles, it first extracts the clause vectors of the subtitle text clauses and then inputs them into the segmentation model to obtain the segmentation mark for the end position of each subtitle text clause.
[0138] In the embodiment of the present invention, the speaker's prosodic features include the speaker's speech rate and pause duration when speaking. In order to prevent a speech rate that is too fast or a pause duration that is too short from causing a large delay in subtitle display, the embodiment of the present invention displays subtitles in terms of a subtitle display basic unit, which refers to the unit of subtitle text received by the display module at one time. Correspondingly, the above-mentioned basic unit determining module 505 includes a calculation unit and a determination unit, wherein:
[0139] The calculation unit is used to calculate the current speaking rate of the speaker and the pause time between subtitle text clauses;
[0140] The determination unit is used to determine whether the speech rate exceeds a set speech rate threshold, or whether the pause duration is below a preset pause duration threshold; if so, it determines that a subtitle text clause is used as the subtitle display basic unit; otherwise, it determines that the recognized text corresponding to an effective speech segment in speech recognition is used as the subtitle display basic unit, where the recognized text of each effective speech segment generally contains one or more clauses.
[0141] Correspondingly, the above-mentioned display module 506 updates the subtitle text on the screen according to the subtitle display basic unit and the segmentation information. Specifically, the on-screen subtitle text is updated based on the number of words in the current subtitle display basic unit, the maximum number of words that can be displayed on the screen, the number of words currently on the screen, and whether the subtitle text in the current basic unit belongs to the same paragraph as the subtitle text on the screen, so that the speaker's content is displayed on the screen in real time. A specific structure of the display module 506 may include: a receiving unit, a first judgment unit, a second judgment unit, a third judgment unit, and a display execution unit, wherein:
[0142] The receiving unit is configured to receive the subtitle text of one subtitle display basic unit as the current subtitle text;
[0143] The first judgment unit is used to judge whether the sum of the number of words in the current subtitle text and the number of words in the last subtitle display basic unit on the screen exceeds the maximum number of words that can be displayed on the screen; if so, it triggers the display execution unit to clear all subtitle text on the screen and display the current subtitle text; otherwise, it triggers the second judgment unit;
[0144] The second judgment unit is used to judge whether the sum of the number of words in the current subtitle text and the number of words of all subtitle text on the screen exceeds the maximum number of words that can be displayed on the screen; if so, it triggers the third judgment unit; otherwise, it triggers the display execution unit to display the current subtitle text directly after the subtitle text of the last subtitle display basic unit;
[0145] The third judgment unit is used to judge whether the subtitle text of the last subtitle display basic unit on the screen carries a segmentation mark; if so, it triggers the display execution unit to clear all subtitle text on the screen and display the current subtitle text; otherwise, it triggers the display execution unit to clear all text before the subtitle text of the last subtitle display basic unit on the screen, and then display the current subtitle text directly after that subtitle text.
[0146] The real-time subtitle display system provided by the embodiment of the present invention adds punctuation to the recognized subtitle text to be displayed to obtain semantically complete subtitle text clauses, determines and marks whether the end position of each subtitle text clause needs to be segmented, then determines the subtitle display basic unit according to the speaker's prosodic features, and displays the subtitle text clauses according to that basic unit, thereby increasing the context of the subtitle text display, greatly improving the intelligibility of the speaker's content, and improving the effect of the speaker's information transmission.
[0147] Further, as shown in Figure 6, in another embodiment of the system of the present invention, the system may further include:
[0148] The word recognition module 601 is configured to recognize the named entities and clue words of the subtitle text by using a pre-built Encoder-Decoder sequence-to-sequence model to obtain a recognition result;
[0149] The display processing module 602 is configured to highlight the recognition result when the display module displays the subtitle text.
[0150] The Encoder-Decoder sequence-to-sequence model may be constructed by a corresponding model building module, which may include the following units:
[0151] A data collection unit, used to collect a large amount of text data;
[0152] A labeling unit, used to label the named entities and clue words in the text data as labeling features;
[0153] A data processing unit, used to segment the text data and extract the word vector of each word;
[0154] A parameter training unit, used to train the Encoder-Decoder sequence-to-sequence model using the word vectors of the text data and the labeling features to obtain the model parameters.
[0155] It should be noted that the foregoing model building module may be part of the system or independent of it, which is not limited in the embodiment of the present invention.
[0156] Correspondingly, the aforementioned word recognition module 601 may include the following units:
[0157] A word vector extraction unit, used to extract the word vectors of the subtitle text;
[0158] A recognition unit, used to input the word vectors into the Encoder-Decoder sequence-to-sequence model to obtain the recognition result output by the model.
[0159] The real-time subtitle display system of the embodiment of the present invention can not only display the subtitle text in segments according to the subtitle display basic unit, but can also highlight the named entities and clue words in the subtitle text when the subtitles are displayed, for example by showing them in a different color or a different font, thereby emphasizing the key points of the text and improving the display effect.
[0160] The real-time subtitle display method and system of the embodiments of the present invention can be applied to real-time subtitle display for live broadcasts or on-site speeches, adding contextual information to the subtitle text to help users understand the speaker's content and improve the intelligibility of the subtitle text. For example, in a conference scenario, the content of each speaker's speech is displayed on the screen in real time, and participants can see the corresponding content and its context while hearing the speaker's voice, which helps the other participants understand the current speaker; likewise, when a teacher gives a lecture, the lecture content is displayed on the screen in real time to help students better understand it. The subtitle text can be displayed on the entire screen to increase the amount of information displayed.
[0161] The various embodiments in this specification are described in a progressive manner; the same or similar parts of the embodiments may be referred to each other, and each embodiment focuses on its differences from the other embodiments. In particular, since the system embodiment is basically similar to the method embodiment, its description is relatively brief, and the relevant parts may refer to the description of the method embodiment. The system embodiments described above are merely illustrative: the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments. Those of ordinary skill in the art can understand and implement this without creative work.
[0162] The embodiments of the present invention are described in detail above, and specific implementations are used to illustrate the present invention. The descriptions of the above embodiments are only intended to help understand the method and system of the present invention; meanwhile, for those of ordinary skill in the art, changes may be made to the specific implementation and the scope of application according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.