A text broadcast method, an electronic device and a storage medium
By standardizing the text to be broadcast and predicting the prosody and phonemes of the language text, the problem of inaccurate broadcast duration prediction in existing technologies has been solved, achieving more accurate broadcast duration prediction and a better user experience.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HONOR DEVICE CO LTD
- Filing Date
- 2023-11-30
- Publication Date
- 2026-06-26
AI Technical Summary
In existing technologies, predicting the broadcast duration based on the number of characters in the text to be broadcast is inaccurate, leading to a decline in user experience.
By standardizing the text to be broadcast, numbers and symbols represented in non-target language forms are converted into target language text, punctuation that does not need to be broadcast is removed, and letter abbreviations are converted into their full spellings. The broadcast duration is then predicted based on the prosody and phoneme prediction results of the resulting language text.
It improves the accuracy of broadcast duration prediction, adapts to the broadcast needs of different users, and enhances the user experience.
Figure CN122293784A_ABST
Abstract
Description
[0001] This application is a divisional application. The original application has the application number 202311653641.2 and the original application date is November 30, 2023. The entire contents of the original application are incorporated herein by reference.
Technical Field
[0002] This application relates to the field of data processing technology, and in particular to a text broadcasting method, an electronic device, and a storage medium.
Background Technology
[0003] When the on-demand text-to-speech function is enabled on a mobile phone, if a user opens a news app, the on-demand text-to-speech function can read aloud (i.e., broadcast) the content on the phone's display screen or the content corresponding to the area selected by the user on the display screen. At the same time, it can predict the broadcast duration of the content on the phone's display screen or the content corresponding to the area selected by the user on the display screen.
[0004] In some solutions, the broadcast duration of the text to be broadcast is predicted based on the number of characters in that text. However, different broadcast timbres result in different broadcast durations for the content on the mobile phone's display interface or the content corresponding to the area selected by the user on the display interface. Predicting the broadcast duration solely from the number of characters therefore gives the user inaccurate predictions, which in turn affects the user experience.
Summary of the Invention
[0005] To address the problem of inaccurate prediction of the broadcast duration of the text to be broadcast, embodiments of this application provide a text broadcasting method, an electronic device, and a storage medium, including:
[0006] In a first aspect, this application provides a text broadcasting method applied to an electronic device, comprising: displaying a first broadcasting interface, the first broadcasting interface including a first predicted broadcasting duration of a first text, wherein the first text includes N characters, of which M are first-type characters; and displaying a second broadcasting interface, the second broadcasting interface including a second predicted broadcasting duration of a second text, wherein the second text includes N characters, of which P are first-type characters, wherein M and P are different, and the first predicted broadcasting duration and the second predicted broadcasting duration are different.
[0007] Based on the above scheme, for texts with the same number of characters but different numbers of first-type characters, the broadcast duration of the text can be predicted to be different. In this way, the problem of inaccurate prediction caused by predicting the broadcast duration of the text solely based on the number of characters can be avoided, and the accuracy of predicting the broadcast duration of the text can be improved.
[0008] It is understandable that both the first and second texts can be used as the text to be broadcast. The text to be broadcast can include only one language, such as pure Chinese text, pure English text, etc. It can also include a combination of multiple languages, such as some text being Chinese text and some text being English text.
[0009] It is understandable that in some optional instances, the first predicted broadcast duration and the second predicted broadcast duration can be the same. For example, the number of characters obtained after standardizing the M first-type characters in the first text may equal the number of characters obtained after standardizing the P first-type characters in the second text. When the broadcast duration is predicted based on the number of characters in the two standardized texts, the first predicted duration and the second predicted duration may be equal. However, when the broadcast duration is predicted based on the prosody prediction results and / or phoneme prediction results of the two standardized texts, the first predicted duration and the second predicted duration may not be equal.
[0010] In some optional examples of the first aspect of this application, the broadcast language of the first text includes at least one first-type language, and the first-type characters include numbers represented in a form other than the corresponding broadcast language, symbols to be broadcast, letter abbreviations, and punctuation marks not to be broadcast.
[0011] It is understandable that the broadcast language of the second text may also include at least one first-type language.
[0012] In some optional examples of the first aspect of this application, the first-type languages include Chinese and English, and the letter abbreviations include English abbreviations.
[0013] Understandably, the first category of languages may also include Japanese, Latin, Malay, Catalan, Czech, Danish, German, Estonian, English, Spanish, Basque, Filipino, French, Galician, Croatian, Indonesian, Italian, Latvian, Liwano, Hungarian, Dutch, Norwegian, Polish, Portuguese, Romanian, Finnish, Swedish, Turkish, Greek, Vietnamese, and other languages where one character represents one syllable.
[0014] It is understandable that abbreviations include at least one abbreviation from Japanese, Latin, Malay, Catalan, Czech, Danish, German, Estonian, English, Spanish, Basque, Filipino, French, Galician, Croatian, Indonesian, Italian, Latvian, Liwano, Hungarian, Dutch, Norwegian, Polish, Portuguese, Romanian, Finnish, Swedish, Turkish, Greek, and Vietnamese.
[0015] In some optional examples of the first aspect of this application, the first predicted broadcast duration is determined by: converting a first type of character in a first text to obtain a first language text; and determining the first predicted broadcast duration based on the first language text.
[0016] Understandably, for text to be broadcast that contains multiple languages, numbers represented in a form other than the language of the corresponding text portion, and symbols that need to be broadcast, can be converted into characters of that language. Punctuation marks that do not need to be broadcast can be deleted, and letter abbreviations can be replaced with their full spellings to obtain the language text. For example, if the text to be broadcast includes both Chinese and English, Arabic numerals in the Chinese text portion can be converted to lowercase Chinese numerals, and Arabic numerals in the English text portion can be converted to English words or phrases.
[0017] In some optional instances, the playback duration of the text to be played can be predicted based on the number of characters in the language text and the playback duration of each character.
[0018] In some optional instances, the broadcast duration of the text to be broadcast can be predicted based on the pause duration corresponding to the prosodic levels of complete sentences, short sentences, phrases, and words in the language text.
[0019] In some optional instances, the broadcast duration of the text to be broadcast can be predicted based on the broadcast duration corresponding to the phonemes of the characters and / or words in the language text.
[0020] In some optional instances, the broadcast duration of the text to be broadcast can be predicted based on the pause duration corresponding to the prosodic levels of complete sentences, short sentences, phrases, and words in the language text, and the broadcast duration corresponding to the phonemes of characters and / or words.
[0021] In some optional instances, the broadcast duration of the text to be broadcast can be predicted based on the number of characters in the language text, the broadcast duration of each character, and the pause duration corresponding to the prosodic level of complete sentences, short sentences, phrases, and words.
[0022] In this embodiment of the application, the accuracy of predicting the broadcast duration of the text to be broadcast can be improved by predicting the broadcast duration of the text to be broadcast based on the language text, compared with predicting the broadcast duration of the text to be broadcast based on the number of characters in the text to be broadcast.
[0023] It is understandable that the second predicted broadcast duration can be determined in the same way as the first predicted broadcast duration described above. To avoid repetition, the following text uses the first predicted broadcast duration as an example to introduce the method for predicting the broadcast duration of a text.
[0024] In some optional examples of the first aspect of the present application, converting the first-type characters in the first text includes at least one of the following: for text broadcast in the first language, converting numbers represented in a form other than the first language and symbols to be broadcast into the corresponding language characters of the first language; deleting the punctuation marks in the first text that do not need to be broadcast; and converting the letter abbreviations in the first text into full spellings.
[0025] For example, for a Chinese text, the Arabic numerals and Roman numerals in the text to be broadcast can be converted into Chinese lowercase numerals: the Arabic numeral "360" in "Qihoo 360 Company" is converted into the language characters "three six zero", the Arabic numerals in "5.5" are converted into the language characters "five point five", and so on. The symbols to be broadcast can be converted into the corresponding Chinese language characters: the symbol "~" in "3~4" is converted into the language character "to", "≤" is converted into "less than or equal to", "&" is converted into "and", and so on. The symbols "※", "R", and "¤", which do not need to be broadcast, are not converted. The punctuation marks that do not need to be broadcast can be deleted: for example, the symbol "~" in "Ah~" is deleted, the punctuation marks 《》 and "" in "《The Road to Getting Rich》" are deleted, and so on.
[0026] In the embodiments of the present application, converting the first-type characters in the first text yields a language text that is closer to the form actually broadcast, so that the accuracy of predicting the broadcast duration of the text to be broadcast can be improved.
[0027] In some optional examples of the first aspect of the present application, determining the first predicted broadcast duration based on the first language text includes: determining the first predicted broadcast duration based on the number of characters in the first language text and the corresponding broadcast duration of each character.
[0028] It can be understood that a character can refer to a written character, a word, etc. represented in the form of broadcast speech.
[0029] In the embodiments of the present application, by converting the first type of characters in the first text, a language text more in line with the broadcast form can be obtained, so that based on the number of characters in the language text, the broadcast duration of the first text is predicted, and the accuracy of predicting the broadcast duration of the first text can be improved.
[0030] In some optional examples of the first aspect of the present application, determining the first predicted broadcast duration based on the first language text includes: determining the first predicted broadcast duration based on at least one of the prosody levels corresponding to complete sentences, short sentences, phrases, and words in the first language text, the phonemes of the characters and / or words in the first language text, the broadcast timbre, and the speech rate parameter.
[0031] Under normal circumstances, prosody can be divided into four levels, represented by #4 (SEN), #3 (IP), #2 (PP), and #1 (PW). Among them, #4 can represent the prosodic level corresponding to a complete sentence, and the pause corresponding to this level is generally obvious or very long. #3 can represent the prosodic level corresponding to a short sentence composed of phrases, and the pause corresponding to this level is generally relatively long. #2 can represent the prosodic level corresponding to a phrase composed of words, and the pause corresponding to this level is generally relatively short. #1 can represent the prosodic level corresponding to a word, and the pause corresponding to this level is generally nonexistent or very short.
[0032] It is understandable that the phonemes of a written character can include pronunciation units such as initials, finals, and tones, while the phonemes of a word can include pronunciation units such as vowels and consonants.
[0033] It is understood that the broadcast tone can include a first broadcast tone (e.g., male voice), a second broadcast tone (e.g., female voice), a third broadcast tone (e.g., custom broadcast tone), etc., and this application embodiment does not make specific limitations.
[0034] It is understood that the speech rate parameter may include 1.0, 0.5, 0.9, 2.0, etc., and the embodiments of this application do not make specific limitations.
[0035] In this embodiment of the application, by converting the first type of characters in the first text, a language text that is more in line with the broadcast format can be obtained. Based on the pause duration corresponding to the prosodic level of complete sentences, short sentences, phrases, and words in the language text, as well as the broadcast duration corresponding to the phonemes of characters and / or words, the broadcast duration of the text to be broadcast can be predicted, which can further improve the accuracy of predicting the broadcast duration of the first text.
[0036] In some optional examples of the first aspect of this application, the prosodic levels corresponding to complete sentences, short sentences, phrases, and words in the first language text are determined based on a prosodic prediction model, and the phonemes of characters and / or words in the first language text are determined based on a phoneme prediction model.
[0037] In some optional examples of the first aspect of this application, determining the first predicted broadcast duration based on the first language text includes: determining the first predicted broadcast duration based on the prosodic level corresponding to complete sentences, short sentences, phrases, and words in the first language text, the number of characters in the first language text, and the broadcast duration corresponding to each character.
[0038] In this embodiment of the application, by converting the first type of characters in the first text, a language text that is more in line with the broadcast format can be obtained. Thus, based on the pause duration corresponding to the prosodic level of complete sentences, short sentences, phrases, and words in the language text, as well as the number of characters, the broadcast duration of the text to be broadcast can be predicted, which can further improve the accuracy of predicting the broadcast duration of the first text.
[0039] In some optional instances of the first aspect of this application, the method includes: displaying a third predicted playback duration of the first text in response to a user switching the playback rate of the first text from a first playback rate parameter to a second playback rate parameter, and / or in response to a user switching the playback tone of the first text from a first playback tone to a second playback tone.
[0040] For example, based on the prosodic levels of complete sentences, short sentences, phrases, and words in the first language text, the initials, finals, and tones of the characters in the first language text, the first broadcast timbre, and a speech rate parameter of 1.0, the first predicted broadcast duration of the first text can be 10 seconds. If the first broadcast timbre is switched to the second broadcast timbre, the predicted broadcast duration of the first text can become 20 seconds. If the first broadcast timbre is switched to the second broadcast timbre and the speech rate parameter is also switched from 1.0 to 2.0, the predicted broadcast duration of the first text can still be 10 seconds.
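The three figures in this example (10 s, 20 s, 10 s) are consistent with a simple scaling in which the timbre multiplies the duration and the speech rate divides it. The source does not state such a formula, so the following Python sketch is only an illustration; the function name `adjust_duration` and the timbre factors are assumptions chosen to reproduce the numbers above.

```python
# Hypothetical timbre factors chosen only to reproduce the example figures;
# the source does not specify how timbre affects the predicted duration.
TIMBRE_FACTOR = {"first": 1.0, "second": 2.0}


def adjust_duration(base_seconds: float, timbre: str, rate: float) -> float:
    """Scale a baseline duration (first timbre, rate 1.0) by timbre and rate."""
    if rate <= 0:
        raise ValueError("speech rate must be positive")
    return base_seconds * TIMBRE_FACTOR[timbre] / rate


base = 10.0  # first timbre, speech rate 1.0
print(adjust_duration(base, "first", 1.0))   # first timbre, rate 1.0
print(adjust_duration(base, "second", 1.0))  # second timbre, rate 1.0
print(adjust_duration(base, "second", 2.0))  # second timbre, rate 2.0
```

A learned duration model would normally take timbre and speech rate as inputs rather than apply fixed factors; the fixed factors here only make the arithmetic of the example visible.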
[0041] In this embodiment of the application, by providing users with speech speed switching and timbre switching functions, it is possible to adapt to the different habits and different broadcasting needs of different users, thereby improving the user experience.
[0042] In a second aspect, this application provides an electronic device, including: a memory for storing instructions to be executed by one or more processors of the electronic device; and a processor, which is one of the one or more processors of the electronic device, for executing the text broadcasting method described in this application.
[0043] In a third aspect, this application provides a readable storage medium storing instructions that, when executed on an electronic device, cause the electronic device to perform the text broadcasting method described in this application.
[0044] In a fourth aspect, embodiments of this application provide a computer program product, including: a non-volatile computer-readable storage medium containing computer program code for performing the text broadcasting method described in this application.
Attached Figure Description
[0045] Figure 1 shows a schematic diagram of a text broadcast scenario according to some examples of this application;
[0046] Figure 2 shows a schematic diagram of another text broadcast scenario according to some examples of this application;
[0047] Figure 3 shows a schematic diagram of the structure of a text broadcasting system according to some examples of this application;
[0048] Figure 4 shows a flowchart of a text broadcasting method according to some examples of this application;
[0049] Figure 5 shows a schematic diagram of a first display interface of an electronic device according to some examples of this application;
[0050] Figure 6 shows a schematic diagram of a second display interface of an electronic device according to some examples of this application;
[0051] Figure 7 shows a flowchart of a method for performing regular expression processing on text to be broadcast according to some examples of this application;
[0052] Figure 8 shows a flowchart of a character regular expression according to some examples of this application;
[0053] Figure 9 shows a flowchart of a single-sentence regular expression according to some examples of this application;
[0054] Figure 10 shows a schematic diagram of regular expression processing for date-type numbers according to some examples of this application;
[0055] Figure 11 shows a schematic diagram of the structure of a prosody prediction model according to some examples of this application;
[0056] Figure 12 shows a schematic diagram of the structure of a phoneme prediction model according to some examples of this application;
[0057] Figure 13 shows a schematic diagram of the structure of a duration prediction model according to some examples of this application;
[0058] Figure 14 shows a schematic diagram of the hardware structure of an electronic device according to some examples of this application;
[0059] Figure 15 shows a schematic diagram of the software structure of an electronic device according to some examples of this application;
[0060] Figure 16 shows an interaction diagram of a text broadcasting method based on the software architecture according to some examples of this application.
Detailed Implementation
[0061] The illustrative embodiments of this application include, but are not limited to, a text broadcasting method, an electronic device, and a storage medium.
[0062] It is understood that the text broadcasting method mentioned in the embodiments of this application can provide text broadcasting functionality for ordinary users in special scenarios, such as when users are brushing their teeth, fatigued, or driving and find it inconvenient to view the text on the display interface of an electronic device. For example, news can be broadcast using the text broadcasting function of an electronic device while a user is brushing their teeth. The text broadcasting method mentioned in the embodiments of this application can also provide text broadcasting functionality in ordinary scenarios for special users with cognitive impairments, low levels of education, or non-standard pronunciation. For example, visually impaired people can use the text broadcasting function of electronic devices to receive text messages.
[0063] It is understood that the electronic devices in the embodiments of this application may also be referred to as terminals, user equipment (UE), mobile stations (MS), mobile terminals (MT), etc. Electronic devices can be mobile phones, smart TVs, wearable devices, tablets, computers with wireless transceiver capabilities, virtual reality (VR) electronic devices, augmented reality (AR) electronic devices, wireless terminals in industrial control, wireless terminals in self-driving, wireless terminals in remote medical surgery, wireless terminals in smart grids, wireless terminals in transportation safety, wireless terminals in smart cities, wireless terminals in smart homes, etc. The embodiments of this application do not impose any restrictions on the specific type of electronic device.
[0064] The following uses a mobile phone as an example to illustrate some of the text broadcasting methods mentioned in the embodiments.
[0065] Figure 1 illustrates a text broadcast scenario. As shown in Figure 1, when the phone's on-demand text-to-speech function is enabled, for example, after the user taps the "Second Announcement Tone" in the "Read Aloud" application shown in Figure 1a and then exits and saves, if the user opens the news and information application again, the Read Aloud function can announce (i.e., read aloud) the content in the display interface of the mobile phone or the content corresponding to the area selected by the user in the display interface. At the same time, it can predict the announcement duration of that content. For example, it announces the content "Economic growth of 5.5%" corresponding to the area selected by the user in the display interface shown in Figure 1b and predicts the announcement duration to be 0.6s.
[0066] The following introduces the text announcement methods mentioned in some embodiments.
[0067] In some specific implementations, the announcement duration of the text to be announced (i.e., the content in the display interface of the mobile phone or the content corresponding to the area selected by the user in the display interface) can be predicted according to the number of characters. Specifically, the announcement duration of one character (including text, words, symbols, punctuation, etc.) in the text to be announced can be set to 0.1s, and then the number of characters in the text to be announced is counted to obtain the predicted announcement duration of the text to be announced.
[0068] For example, if the text to be announced is "Economic growth of 5.5%", the number of characters in the text to be announced is 8. Since the announcement duration of one character is 0.1s, in the current prediction scheme, the predicted announcement duration of the text to be announced can be predicted to be 0.8s according to the number of characters in the text to be announced. However, when announcing, the announcement durations of numbers and symbols are different from those of text. Numbers and symbols do not necessarily correspond to one character. For example, the number "5.5" corresponds to three characters "five point five", and the symbol "%" corresponds to three characters "percent". Therefore, the actual announcement duration of the text to be announced is 1s. That is, the method of predicting the announcement duration based on the number of characters in the text to be announced does not consider the differences in the announcement durations of special characters (such as numbers and symbols) and text, resulting in a situation where the predicted announcement duration does not match the actual announcement duration and the accuracy of the predicted duration is relatively poor.
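The arithmetic in this example can be checked with a short Python sketch. The Chinese strings are reconstructed from the pinyin given later in the description and are therefore an assumption, as is the helper name `estimate_by_char_count`; the 0.1 s per-character duration comes from the example itself.

```python
# Per-character duration taken from the example above (0.1 s per character).
SECONDS_PER_CHAR = 0.1


def estimate_by_char_count(text: str, seconds_per_char: float = SECONDS_PER_CHAR) -> float:
    """Naive estimate: every character costs the same fixed duration."""
    return round(len(text) * seconds_per_char, 3)


# Assumed reconstruction of the displayed text and of what is actually read aloud.
raw = "经济增长5.5%"             # 8 characters as displayed
spoken = "经济增长百分之五点五"    # 10 characters as read aloud

print(len(raw), estimate_by_char_count(raw))        # count and estimate for the displayed text
print(len(spoken), estimate_by_char_count(spoken))  # count and estimate for the spoken form
```

The gap between the two estimates comes entirely from "5.5" and "%", which occupy four displayed characters but six spoken ones.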
[0069] For another example, the text to be broadcast is "The Road to Riches - Dr Li", and the number of characters in the text to be broadcast is 12. In the current prediction scheme, based on the number of characters in the text to be broadcast, the predicted broadcast duration of the text to be broadcast can be predicted to be 1.2 s. However, during the broadcast, the punctuation marks 《》, ——, and “” do not need to be broadcast, and the letter abbreviation "Dr" is "doctor" during the broadcast, and the broadcast duration of "doctor" is 0.3 s. Therefore, the actual broadcast duration of the text to be broadcast is 0.9 s. That is, the method of predicting the broadcast duration based on the number of characters in the text to be broadcast treats punctuation marks and letter abbreviations as one character, without considering that punctuation marks do not need to be broadcast and that the broadcast duration of letter abbreviations is different from that of the full spelling, which will result in a situation where the predicted broadcast duration does not match the actual broadcast duration, leading to poor accuracy of the predicted duration.
[0070] To solve the above problems, an embodiment of the present application provides another text broadcast method. In this method, the text to be broadcast can be standardized. For example, for a Chinese text, the numbers represented in a form other than Chinese and the symbols that need to be broadcast in the text to be broadcast can be converted into the corresponding Chinese language characters, and the punctuation marks that do not need to be broadcast can be deleted to obtain a language text. Then, based on the language text, the broadcast duration of the text to be broadcast is predicted. In this way, the accuracy of predicting the broadcast duration of the text to be broadcast can be improved.
[0071] For example, for a Chinese text, the Arabic numerals and Roman numerals in the text to be broadcast can be converted into Chinese lowercase numerals. For example, the Arabic numeral "360" in "Qihoo 360 Company" is converted into "three six zero", the Arabic numerals in "5.5" are converted into "five point five", and so on.
[0072] The symbols that need to be broadcast in the text to be broadcast can be converted into the corresponding Chinese language characters. For example, the symbol "~" that needs to be broadcast in "3~4" is converted into the language character "to", the symbol "≤" is converted into the language character "less than or equal to", the symbol "&" is converted into the language character "and", and so on. The symbols "※", "R", and "¤" that do not need to be broadcast are not converted.
[0073] The punctuation marks that do not need to be broadcast in the text to be broadcast can be deleted. For example, the symbol "~" that does not need to be broadcast in "ah~" is deleted, the symbols "※", "R", and "¤" are deleted, and the punctuation marks "《》" that do not need to be broadcast in "The Road to Riches - Dr Li" are deleted, and so on.
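The three standardization steps described above (number conversion, symbol conversion, and deletion of silent punctuation) can be sketched in Python as follows. This is a minimal illustration, not the method of the application: the function names, the symbol table, the choice of "至" for a range "~", and the digit-by-digit reading of every number are all assumptions made for the sketch.

```python
import re

DIGITS = "零一二三四五六七八九"
# Hypothetical symbol table; the text names "≤" and "&" only as examples.
SPOKEN_SYMBOLS = {"≤": "小于等于", "&": "和"}
# Punctuation and symbols treated as silent in this sketch.
SILENT = "《》“”\"※¤~"


def read_digits(number: str) -> str:
    """Read a number digit by digit, e.g. '360' -> '三六零', '5.5' -> '五点五'."""
    return "".join("点" if ch == "." else DIGITS[int(ch)] for ch in number)


def normalize(text: str) -> str:
    # Percentages are read with the prefix '百分之' before the number.
    text = re.sub(r"(\d+(?:\.\d+)?)%",
                  lambda m: "百分之" + read_digits(m.group(1)), text)
    # '~' between two digits marks a range and is read aloud (assumed as '至').
    text = re.sub(r"(?<=\d)~(?=\d)", "至", text)
    # Remaining numbers are read digit by digit.
    text = re.sub(r"\d+(?:\.\d+)?", lambda m: read_digits(m.group(0)), text)
    for symbol, spoken in SPOKEN_SYMBOLS.items():
        text = text.replace(symbol, spoken)
    # Drop whatever is left that should not be read aloud.
    return "".join(ch for ch in text if ch not in SILENT)


for sample in ["经济增长5.5%", "奇虎360公司", "3~4", "啊~", "《致富之路》"]:
    print(sample, "->", normalize(sample))
```

The same "~" is handled twice on purpose: it is spoken when it sits between digits and deleted otherwise, which mirrors the "3~4" and "ah~" cases in the text. Expansion of letter abbreviations such as "Dr" is left out because it needs a dictionary the source does not provide.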
[0074] For texts in languages such as English, German, and French, numbers represented in a form other than English, German, or French, and symbols that need to be broadcast, can be converted into the corresponding English, German, or French words. Punctuation marks that do not need to be broadcast are removed, and letter abbreviations are converted to their full spellings to obtain the language text. Then, the broadcast duration of the text is predicted based on the language text. This improves the accuracy of predicting the broadcast duration of the text.
[0075] Understandably, for text to be broadcast that contains multiple languages, numbers represented in a form other than the language of the corresponding text portion, and symbols that need to be broadcast, can be converted into characters of that language. Punctuation marks that do not need to be broadcast can be deleted, and letter abbreviations can be replaced with their full spellings to obtain the language text. For example, if the text to be broadcast includes both Chinese and English, Arabic numerals in the Chinese text portion can be converted to lowercase Chinese numerals, and Arabic numerals in the English text portion can be converted to English words or phrases.
[0076] It is understandable that the language used to broadcast the text can be any language where each character has a single syllable, such as Chinese, Japanese, Latin, Malay, Catalan, Czech, Danish, German, Estonian, English, Spanish, Basque, Filipino, French, Galician, Croatian, Indonesian, Italian, Latvian, Liwano, Hungarian, Dutch, Norwegian, Polish, Portuguese, Romanian, Finnish, Swedish, Turkish, Greek, Vietnamese, etc. The text to be broadcast can be any text with a basic phoneme.
[0077] For example, standardizing the text "economic growth 5.5%" yields the language text "economic growth five point five percent," which has 10 characters. Based on the number of target-language characters in the language text, the broadcast duration can be determined to be 1 second. Thus, as shown in Figure 2, the content corresponding to the area selected by the user on the display interface can be broadcast, and the predicted broadcast duration displayed is 1 second.
[0078] In some optional instances, after obtaining the language text, prosodic prediction can be performed. For example, the language text can be segmented into complete sentences, short sentences, phrases, and words, and the prosodic levels corresponding to these segments can be determined to obtain the prosodic prediction results. Furthermore, phoneme prediction can be performed on the language text. For example, the phoneme units (initials, finals, and tones) of the characters in the language text can be identified, and / or the phoneme units (vowels and consonants) of the words in the language text can be identified, thus obtaining the phoneme prediction results. Then, based on the prosodic prediction results and phoneme prediction results, the broadcast duration of the text to be broadcast can be predicted, which can improve the accuracy of predicting the broadcast duration.
[0079] For example, prosodic prediction can be performed on the language text "economic growth of 5.5 percent", yielding the prosodic prediction result "economic growth #2 percent #2 5.5", where "#2" indicates the prosodic level is the second prosodic level. Furthermore, phoneme prediction can be performed on the language text "economic growth of 5.5 percent", yielding the phoneme prediction result "j ing1 j i4 z eng1 zh ang3 b ai3 f en1 zh i1 w u5 d ian3 w u3", which represents multiple phoneme units in the language text. The pause duration corresponding to the second prosodic level can be 0.2 seconds, and the broadcast duration corresponding to each phoneme unit can be shown in Table 1. Therefore, based on the prosodic prediction result and the phoneme prediction result of the language text, the broadcast duration of the text to be broadcast can be predicted to be 4.5 seconds.
[0080] Table 1
[0081]
[0082] In other alternative instances, after obtaining the prosodic prediction results and phoneme prediction results of the language text, the broadcast duration of the text to be broadcast can be predicted based on these results together with the user-selected broadcast timbre (e.g., male voice, female voice, custom timbre, etc.) and speech rate parameter (e.g., 1.0, 0.5, 0.9, 2.0, etc.). This can further improve the accuracy of predicting the broadcast duration of the text to be broadcast.
[0083] In some alternative instances, prosodic prediction can be performed only on the language text to obtain the prosodic prediction results. Then, the broadcast duration of the text to be broadcast can be predicted based on the number of characters in the language text and the pause duration corresponding to each prosodic level in the prosodic prediction results. Compared to directly predicting the broadcast duration of the text to be broadcast based on the number of characters in the target language in the language text, this can improve the accuracy of predicting the broadcast duration of the text to be broadcast.
[0084] For example, prosodic prediction can be performed on the language text "economic growth of 5.5 percent", resulting in the prosodic prediction result "economic growth #2 percent #2 5.5", where "#2" indicates that the prosodic level is the second prosodic level, and the pause duration corresponding to the second prosodic level is 0.2 seconds. Furthermore, based on the number of target-language characters in the language text (10 characters, corresponding to 1 second) and the pause durations corresponding to the prosodic levels (two pauses of 0.2 seconds each), the broadcast duration of the text to be broadcast can be determined to be 1.4 seconds.
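As an illustrative sketch only, the arithmetic of this example can be written out as follows. The rate of 0.1 seconds per character is inferred from the earlier example (10 characters, 1 second) and is an assumption of this sketch; the function and constant names are hypothetical.

```python
# Assumed rate, inferred from the example: 10 characters -> 1 second.
SECONDS_PER_CHARACTER = 0.1
# Pause for the second prosodic level, as stated in the example.
PAUSE_SECONDS = {2: 0.2}


def estimate_duration(char_count, pause_levels):
    """Character-count duration plus one pause per prosodic boundary."""
    speech = char_count * SECONDS_PER_CHARACTER
    pauses = sum(PAUSE_SECONDS[level] for level in pause_levels)
    return round(speech + pauses, 3)


# Ten target-language characters and two "#2" boundaries.
print(estimate_duration(char_count=10, pause_levels=[2, 2]))  # 1.4
```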
[0085] In some alternative instances, phoneme prediction can be performed only on the language text to obtain the phoneme prediction results. Then, the broadcast duration of the text to be broadcast can be predicted based on the broadcast duration of each pronunciation unit (phoneme) in the language text's phoneme prediction results. Compared to directly predicting the broadcast duration of the text to be broadcast based on the number of characters in the language text, this can improve the accuracy of predicting the broadcast duration of the text to be broadcast.
[0086] For example, phoneme prediction can be performed on the language text "economic growth of 5.5 percent" to obtain the phoneme prediction result "j ing1 j i4 z eng1 zh ang3 b ai3 f en1 zh i1 w u5 d ian3 w u3", which represents multiple pronunciation units (phonemes) of the language text. Table 1 above shows the broadcast duration corresponding to each pronunciation unit. Furthermore, based on the broadcast duration of each pronunciation unit (phoneme) in the phoneme prediction result of the language text, the broadcast duration of the text to be broadcast can be determined to be 4.1 seconds.
[0087] Before introducing the text broadcasting method mentioned in the embodiments of this application, we will first introduce the text broadcasting system to which the text broadcasting method is applied.
[0088] Figure 3 shows a schematic diagram of the structure of a text broadcasting system 300. As shown in Figure 3, the text broadcasting system 300 may include a text standardization module 310, a prosody prediction module 320, a pronunciation unit prediction module 330, and a duration prediction module 340.
[0089] The text standardization module 310 can be used to standardize the text to be broadcast, obtaining the corresponding language text. For example, it can convert numbers and symbols in the text to be broadcast that are expressed in a form other than the target language into text of the target language, delete punctuation marks that do not need to be broadcast, and convert letter abbreviations into their full spelling, thus obtaining the corresponding language text.
[0090] In some optional instances, the text standardization module 310 may include a character regularization module and a sentence regularization module. The character regularization module performs character regularization on the input text to be broadcast, that is, it deletes obvious punctuation marks that do not need to be broadcast and replaces obvious letter abbreviations, to obtain the preprocessed text to be broadcast. The sentence regularization module performs number regularization, symbol regularization, and word regularization on each input sentence, that is, it converts numbers and symbols expressed in a form other than the target language into the corresponding characters of the target language, deletes unnecessary punctuation marks, and replaces letter abbreviations with their full spelling.
[0091] It is understandable that the target language can refer to the language in which the text to be broadcast is spoken.
[0092] The prosody prediction module 320 can be used to predict the prosody of the language text corresponding to the text to be broadcast, and obtain the prosody prediction result of the language text. For example, the language text can be divided into complete sentences, short sentences, phrases, and words, and the prosody level corresponding to the divided complete sentences, short sentences, phrases, and words can be determined to obtain the prosody prediction result of the language text.
[0093] In some optional instances, the prosody prediction module 320 may include a prosody prediction model, which can be a model for predicting the prosodic level of complete sentences, short sentences, phrases, and words in the input language text.
[0094] The pronunciation unit prediction module 330 can be used to predict phonemes in the language text corresponding to the text to be broadcast, and obtain phoneme prediction results. For example, it can determine the pronunciation units (phonemes) of the initials, finals and tones of the characters in the language text, and / or determine the pronunciation units (phonemes) of the vowels and consonants of the words in the language text, thereby obtaining the phoneme prediction results of the language text.
[0095] In some optional instances, the pronunciation unit prediction module 330 may include a phoneme prediction model that can be used to predict pronunciation units (phonemes) such as the initials, finals and tones of the input characters and / or pronunciation units (phonemes) such as the vowels and consonants of the input words.
[0096] The duration prediction module 340 can be used to predict the broadcast duration of the text to be broadcast based on the prosody prediction results and phoneme prediction results of the language text, and obtain the predicted broadcast duration of the text to be broadcast.
[0097] In some optional instances, the duration prediction module 340 may include a duration prediction model that can be used to predict the broadcast duration of each articulation unit (phoneme) in the input language text.
[0098] In some optional instances, the duration prediction module 340 can, in response to the user's selection operation, obtain the user-selected broadcast timbre (e.g., the first, second, or third broadcast timbre shown in Figure 1(a)) and speech rate parameter (e.g., 1.0, 0.5, 2.0, etc.), and predict the broadcast duration of the text to be broadcast by combining the broadcast timbre, the speech rate parameter, the prosody prediction results, and the phoneme prediction results.
[0099] For example, based on the prosodic levels of the complete sentences, short sentences, phrases, and words in the text to be broadcast and the initials, finals, and tones of the characters in the text, with the first broadcast timbre and a speech rate parameter of 1.0, the broadcast duration of the text to be broadcast can be predicted to be 10 seconds. If the first broadcast timbre is switched to the second broadcast timbre, the broadcast duration can be predicted to be 20 seconds. If the first broadcast timbre is switched to the second broadcast timbre and the speech rate parameter is switched from 1.0 to 2.0, the broadcast duration of the text to be broadcast can still be predicted to be 10 seconds.
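The text does not state how the broadcast timbre and the speech rate parameter enter the prediction. As one simple model that happens to be consistent with the three figures in this example, the sketch below treats the timbre as a multiplicative factor and divides by the speech rate. The factor table and function name are hypothetical.

```python
# Hypothetical per-timbre factors; only the 1:2 ratio is implied by the example.
TIMBRE_FACTOR = {"first": 1.0, "second": 2.0}


def adjust_duration(base_seconds, timbre, speech_rate):
    """Scale a base duration by an assumed timbre factor and speech rate."""
    if speech_rate <= 0:
        raise ValueError("speech_rate must be positive")
    return base_seconds * TIMBRE_FACTOR[timbre] / speech_rate


print(adjust_duration(10.0, "first", 1.0))   # 10.0
print(adjust_duration(10.0, "second", 1.0))  # 20.0
print(adjust_duration(10.0, "second", 2.0))  # 10.0
```

A trained duration prediction model could of course capture a timbre's effect per phoneme rather than as a single factor; the sketch only illustrates the direction of each adjustment.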
[0100] The following introduces a text broadcasting method based on the text broadcasting system 300 shown in Figure 3. The method can be executed by an electronic device. As shown in Figure 4, the text broadcasting method may include:
[0101] 401: Get the text to be broadcast.
[0102] In some optional instances, the text to be broadcast can be obtained based on the content displayed on the electronic device's screen. For example, Figure 5 shows a schematic diagram of a first display interface of an electronic device. The first display interface includes a first control 501, a second control 502, and a third control 503. The first control 501 displays first display content, the second control 502 displays second display content, and the third control 503 displays a portion of third display content. Since the first display interface includes the first control, the second control, and a portion of the third control, the first and second display content can be used as the text to be broadcast, or the first display content, the second display content, and the displayed portion of the third display content can be used as the text to be broadcast.
[0103] In other alternative instances, the text to be broadcast can be obtained based on the content corresponding to the area selected by the user on the display interface. For example, Figure 6 shows a schematic diagram of a second display interface of an electronic device. This second display interface includes first display content 601 and second display content 602. In response to a user's long-press and drag operation, the first display content corresponding to the operation (such as the content corresponding to the shaded area) can be obtained and used as the text to be broadcast. It is understood that the user can select the text to be broadcast from the second display interface not only through long-press and drag operations, but also through voice or touch. For example, the user can select the first display content as the text to be broadcast by saying "read the title", or by double-tapping the second display interface with two fingers. The embodiments of this application do not specifically limit this.
[0104] 402: Input the text to be broadcast into the text standardization module 310 to obtain the language text corresponding to the text to be broadcast.
[0105] It is understandable that the text standardization module 310 can be used to standardize the text to be broadcast, obtaining the corresponding language text. For example, it can convert numbers and symbols in the text to be broadcast that are expressed in a form other than the target language into text of the target language, delete punctuation marks that do not need to be broadcast, and convert letter abbreviations into their full spelling, thereby obtaining the language text corresponding to the text to be broadcast.
[0106] It is understandable that the target language can refer to the language in which the text to be broadcast is spoken.
[0107] In some optional instances, text regularization can be applied to the text to be broadcast to obtain the corresponding language text. Text regularization converts non-standard words into their spoken form to eliminate ambiguity. Processing the text to be broadcast into language text through text regularization can save the computing resources and storage space of the electronic device. The specific process of performing text regularization on the text to be broadcast is described in detail below with reference to Figures 7-10. For example, based on the method shown in Figure 8, character regularization is applied to the text to be broadcast, that is, obvious punctuation marks that do not need to be broadcast are deleted and obvious letter abbreviations are replaced, to obtain the preprocessed text to be broadcast. Based on the method shown in Figure 9, number regularization, symbol regularization, and word regularization are applied to each single sentence, that is, numbers and symbols in the text to be broadcast that are not expressed in the target language are converted into the corresponding characters of the target language, unnecessary punctuation is deleted, and letter abbreviations are replaced with their full spelling. Based on the method shown in Figure 10, number regularization is applied to date-category numbers that are expressed in a form other than the target language.
[0108] 403: Input the language text into the prosody prediction module 320 to obtain the prosody prediction result of the text to be broadcast.
[0109] It is understandable that the prosody prediction module 320 can be used to predict the prosody of the language text corresponding to the text to be broadcast, and obtain the prosody prediction result of the language text. For example, the language text can be divided into complete sentences, short sentences, phrases, and words, and the prosody level corresponding to the divided complete sentences, short sentences, phrases, and words can be determined to obtain the prosody prediction result of the language text.
[0110] Understandably, prosody can generally be divided into four levels, represented by #4 (SEN), #3 (IP), #2 (PP), and #1 (PW). #4 represents the prosodic level of a complete sentence, where the corresponding pause is usually significant or very long. #3 represents the prosodic level of a short sentence, where the corresponding pause is usually relatively long. #2 represents the prosodic level of a phrase, where the corresponding pause is usually short. #1 represents the prosodic level of a word, where the corresponding pause is usually absent or very short. The specific process of prosodic prediction for the language text is described in detail below with reference to Figure 11.
[0111] 404: Input the language text into the pronunciation unit prediction module 330 to obtain the phoneme prediction results of the text to be broadcast.
[0112] It is understood that the pronunciation unit prediction module 330 can be used to predict phonemes in the language text corresponding to the text to be broadcast, and obtain phoneme prediction results. For example, it can determine the pronunciation units (phonemes) of the initials, finals and tones of the characters in the language text, and / or determine the pronunciation units (phonemes) of the vowels and consonants of the words in the language text, thereby obtaining the phoneme prediction results of the language text.
[0113] 405: Input the prosody prediction results and phoneme prediction results into the duration prediction module 340, and output the broadcast duration of the text to be broadcast based on the duration prediction model.
[0114] In some optional instances, the duration prediction module 340 may include a duration prediction model that can be a model for predicting the broadcast duration of each articulation unit (phoneme) in a language text.
[0115] In some optional examples, the prosodic level corresponding to a word can be preset as the first prosodic level, with a pause duration of 0.1s; the prosodic level corresponding to a phrase can be preset as the second prosodic level, with a pause duration of 0.2s; the prosodic level corresponding to a short sentence can be preset as the third prosodic level, with a pause duration of 0.3s; and the prosodic level corresponding to a complete sentence can be preset as the fourth prosodic level, with a pause duration of 0.4s. It is understood that other prosodic levels and corresponding pause durations for each prosodic level can also be preset, and this application does not specifically limit the specific settings.
[0116] It is understandable that after obtaining the prosodic prediction results and phoneme prediction results of the language text, the broadcast duration of the text to be broadcast can be predicted based on the pause duration corresponding to each prosodic level in the prosodic prediction results and the broadcast duration of each articulation unit (phoneme) in the phoneme prediction results, thus obtaining the broadcast duration of the text to be broadcast.
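A minimal sketch of this combination is given below, under stated assumptions. The pause table uses the preset values given above for the four prosodic levels. Because the per-unit values of Table 1 are not reproduced here, the example uses a uniform placeholder of 0.205 seconds per pronunciation unit, chosen only so that the 20 units sum to the 4.1 seconds stated in the phoneme-only example. All function names are hypothetical.

```python
import re

# Pause per prosodic level, as preset in the text (levels 1 to 4).
PAUSE_SECONDS = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}


def prosody_levels(annotated_text):
    """Return the levels of all '#n' boundary markers, in order."""
    return [int(level) for level in re.findall(r"#([1-4])", annotated_text)]


def predict_duration(annotated_text, phoneme_string, phoneme_seconds):
    """Sum the pause durations and the per-unit broadcast durations."""
    pauses = sum(PAUSE_SECONDS[level] for level in prosody_levels(annotated_text))
    speech = sum(phoneme_seconds[unit] for unit in phoneme_string.split())
    return pauses + speech


annotated = "economic growth #2 percent #2 5.5"
phonemes = "j ing1 j i4 z eng1 zh ang3 b ai3 f en1 zh i1 w u5 d ian3 w u3"
# Placeholder: a uniform 0.205 s per unit, NOT the values of Table 1.
uniform = {unit: 0.205 for unit in phonemes.split()}
total = predict_duration(annotated, phonemes, uniform)
print(round(total, 2))  # 4.5
```

With these inputs the two 0.2-second pauses add 0.4 seconds to the 4.1 seconds of speech, which agrees with the 4.5 seconds given in the combined example.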
[0117] It is understandable that Chinese texts contain both polyphonic and non-polyphonic characters. For non-polyphonic characters, the pronunciation units (phonemes), such as the initial, final, and tone, can be determined by consulting a pinyin mapping table. For polyphonic characters, the phonemes can be determined using a phoneme prediction model. The specific implementation of determining the pronunciation units (phonemes) using a phoneme prediction model is described in detail below with reference to Figure 12.
[0118] It is understandable that in English text, the phoneme of a word can be determined by consulting a word pronunciation map, such as determining the phonemes of vowels and consonants. However, due to the limited storage space of electronic devices, word pronunciation maps cannot be expanded indefinitely, and they cannot cover the phonemes of newly appearing words. Therefore, phoneme prediction models can also be used to predict the phonemes of words.
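The two cases above (table lookup when the pronunciation is unambiguous, model prediction otherwise) can be sketched as follows. This is an illustration only: the table contents, the function names, and the stub standing in for a trained phoneme prediction model are all assumptions of the sketch.

```python
def lookup_phonemes(token, table, model):
    """Use the table when it holds exactly one pronunciation.

    Polyphonic entries (several candidates) and tokens missing from
    the table are passed to the phoneme prediction model instead.
    """
    candidates = table.get(token, [])
    if len(candidates) == 1:
        return candidates[0]
    return model(token, candidates)


def stub_model(token, candidates):
    """Stand-in for a trained model: first candidate, else one unit per letter."""
    return candidates[0] if candidates else list(token)


# Illustrative mapping table; a real table would be far larger.
TABLE = {
    "增": [["z", "eng1"]],                    # single pronunciation
    "长": [["zh", "ang3"], ["ch", "ang2"]],   # polyphonic character
}

print(lookup_phonemes("增", TABLE, stub_model))    # ['z', 'eng1']
print(lookup_phonemes("长", TABLE, stub_model))    # ['zh', 'ang3']
print(lookup_phonemes("blog", TABLE, stub_model))  # ['b', 'l', 'o', 'g']
```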
[0119] In some optional instances, the pronunciation duration of each initial consonant, final vowel, and tone can be obtained by repeatedly testing the pronunciation duration of each initial consonant, final vowel, and tone, and then stored in tabular form to obtain a Chinese phoneme pronunciation duration reference table. Similarly, the pronunciation duration of each vowel and consonant can be obtained by repeatedly testing the pronunciation duration of each vowel and consonant, and then stored in tabular form to obtain an English phoneme pronunciation duration reference table. Understandably, for other languages, the same method of obtaining Chinese and English phoneme pronunciation duration reference tables can be used to obtain a pronunciation duration reference table for the smallest pronunciation unit representing that language.
[0120] In some optional instances, the broadcast duration of each target language character in the text can be determined based on a Chinese phoneme broadcast duration reference table, an English phoneme broadcast duration reference table, or a minimum pronunciation unit broadcast duration reference table for other languages. For example, the broadcast duration of the initials, finals, and tones of the characters can be determined, and / or the broadcast duration of the vowels and consonants of the words can be determined.
[0121] In some alternative instances, a duration prediction model can be used to determine the broadcast duration of the text corresponding to each target language. The specific implementation of determining this broadcast duration based on the duration prediction model is described in detail below with reference to Figure 13.
[0122] The following details the specific process of performing regularization on the text to be broadcast. Figure 7 shows a method for performing regularization on the text to be broadcast. This method can be executed by an electronic device. As shown in Figure 7, the method for performing regularization on the text to be broadcast can include:
[0123] 701: Get the text to be broadcast.
[0124] Understandably, the text to be broadcast can be obtained using the method described in step 401 of Figure 4, which will not be repeated here.
[0125] 702: Perform character regularization on the text to be broadcast.
[0126] It is understandable that a regularization system can be used to process the text to be broadcast. This system can primarily include a character regularization module and a sentence regularization module. In the character regularization module, character regularization can be applied to the text to be broadcast, that is, obvious punctuation marks that should not be broadcast are deleted and obvious letter abbreviations are replaced, resulting in the preprocessed text to be broadcast. The specific method for preprocessing the text to be broadcast is described in detail below with reference to Figure 8.
[0127] 703: Sentence segmentation of the text to be broadcast.
[0128] Understandably, the text to be broadcast can be segmented into multiple single sentences based on punctuation marks that represent a complete sentence but do not need to be broadcast. For example, the text can be segmented into sentences based on punctuation marks that do not need to be broadcast, such as ".", "?", "!", "...", etc.
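As a rough sketch of this segmentation step, the text can be split on the sentence-ending marks listed above. Not splitting on a period that is directly followed by a digit (so that "5.5" stays intact) is an assumption of this sketch, not something the text specifies; the function name is hypothetical.

```python
import re

# Sentence-ending marks: "...", "?", "!", or "." not followed by a digit.
_SENTENCE_END = re.compile(r"\.\.\.|[?!]|\.(?!\d)")


def split_sentences(text):
    """Split text into single sentences, dropping the ending marks."""
    parts = _SENTENCE_END.split(text)
    return [part.strip() for part in parts if part.strip()]


print(split_sentences("Growth was 5.5%. Is that good? Yes!"))
# ['Growth was 5.5%', 'Is that good', 'Yes']
```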
[0129] In some optional instances, steps 703 and 702 can be processed in parallel to obtain the sentence-segmented text to be broadcast.
[0130] 704: Extract a single sentence from the text to be broadcast after sentence segmentation.
[0131] It is understandable that a single sentence can be randomly selected from the text to be broadcast after sentence segmentation, or a single sentence can be selected sequentially from the text to be broadcast based on the order of multiple sentences in the text to be broadcast.
[0132] 705: Perform sentence regularization on each sentence to obtain the language text corresponding to the text to be broadcast.
[0133] It is understandable that for each sentence in the text to be broadcast after sentence segmentation, the sentence regularization module can perform number regularization, symbol regularization, and word regularization on the sentence. This means converting numbers and symbols expressed in a form other than the target language into the corresponding characters of the target language, deleting unnecessary punctuation, and replacing letter abbreviations with their full spelling. The specific method for performing sentence regularization is described in detail below with reference to Figure 9.
[0134] Figure 8 shows a flowchart of a character regularization method. This method can be executed by an electronic device, specifically by a character regularization module within the electronic device. As shown in Figure 8, the character regularization method can include:
[0135] 801: Remove punctuation from the text to be broadcast.
[0136] It is understandable that punctuation marks that are obviously unnecessary to be broadcast in the text to be broadcast can be deleted. For example, the unnecessary punctuation mark ":)" can be deleted from the text to be broadcast.
[0137] 802: Replace letter abbreviations in the text to be broadcast.
[0138] It is understandable that obvious letter abbreviations in the text to be broadcast can be replaced with their full spelling. For example, the letter abbreviation "Dr" in the text to be broadcast can be replaced with the full spelling "doctor".
[0139] 803: Perform clause regularization on the text to be broadcast.
[0140] It is understandable that punctuation marks that separate clauses in the text to be broadcast and do not need to be broadcast, such as ",", ".", ";", and "...", can be deleted.
[0141] It is understandable that steps 801-803 above can be executed in the order shown in Figure 8 or in any other order, such as 802, 801, 803, or 803, 802, 801. They can also be executed in parallel. This application does not limit the specific execution order of the above steps.
[0142] 804: Obtain the text to be broadcast after character regularization.
[0143] It is understandable that the text to be broadcast, after undergoing punctuation removal, letter abbreviation replacement, and clause regularization, is treated as the preprocessed text to be broadcast after character regularization, thus achieving the preprocessing of the text to be broadcast.
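Steps 801 to 803 can be sketched as three small text transformations. The tables below contain only the examples given in the text (":)" and "Dr") and are otherwise hypothetical. As a simplification, this sketch removes only "," and ";" as clause punctuation and leaves "." and "..." in place, so that decimals such as "5.5" and any later sentence segmentation are not affected; that choice is an assumption of the sketch.

```python
import re

EMOTICONS = [":)"]                 # punctuation that is obviously not broadcast
ABBREVIATIONS = {"Dr": "doctor"}   # abbreviation -> full spelling

_ABBREVIATION = re.compile(
    r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b\.?"
)


def remove_emoticons(text):
    """Step 801: delete punctuation sequences that are never broadcast."""
    for emoticon in EMOTICONS:
        text = text.replace(emoticon, "")
    return text


def replace_abbreviations(text):
    """Step 802: replace known abbreviations with their full spelling."""
    return _ABBREVIATION.sub(lambda m: ABBREVIATIONS[m.group(1)], text)


def remove_clause_punctuation(text):
    """Step 803 (simplified): drop ',' and ';' between clauses."""
    return re.sub(r"[,;]", " ", text)


def character_regularize(text):
    """Apply steps 801-803, then collapse repeated whitespace."""
    text = remove_emoticons(text)
    text = replace_abbreviations(text)
    text = remove_clause_punctuation(text)
    return " ".join(text.split())


print(character_regularize("Dr. Wang said hello, then left :)"))
# doctor Wang said hello then left
```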
[0144] Figure 9 shows a flowchart of a sentence regularization method. This method can be executed by an electronic device, specifically by a sentence regularization module within the electronic device. As shown in Figure 9, the sentence regularization method can include:
[0145] 901: Input a single sentence into the sentence regularization module.
[0146] In some optional instances, a single sentence can be obtained randomly from the text to be broadcast after sentence segmentation and input into the sentence regularization module, or single sentences can be obtained sequentially, based on the order of the sentences in the text to be broadcast, and input into the sentence regularization module.
[0147] 902: Determine whether a sentence contains numbers expressed in a form other than the target language.
[0148] It is understandable that for each sentence obtained from the text to be broadcast after sentence segmentation, the sentence regular expression module can determine whether the sentence contains numbers expressed in a form other than the target language. If the sentence contains numbers expressed in a form other than the target language, the process can proceed to step 904; otherwise, the process can proceed to step 906.
[0149] 903: Determine whether a sentence contains words.
[0150] It is understandable that for each sentence obtained from the text to be broadcast after sentence segmentation, the sentence regular expression module can determine whether the sentence contains words. If the sentence contains words, it can proceed to step 905; otherwise, it can proceed to step 906.
[0151] 904: Performs numeric regularization on numbers in a sentence that are expressed in a form other than the target language.
[0152] It is understandable that, in Chinese text, numbers expressed in a form other than the target language can be read as numerical values or as characters (digit by digit), and in English text they can be read as numerical values, as characters, or as ordinal numbers. Therefore, when a sentence contains numbers expressed in a form other than the target language, number regularization can be applied to these numbers based on their context within the sentence. Table 2 compares numbers expressed in a form other than the target language in different sentences before and after number regularization.
[0153] Table 2
[0154]
[0155] In some optional instances, numbers expressed in a form other than the target language can be categorized based on their meaning in a single sentence. For example, such numbers can be categorized into time, date, telephone, amount, unit, and symbol categories, and number regularization can be applied to them according to their category.
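One way to picture this categorization is a list of patterns tried in order, as in the sketch below. The six category names come from the text; the patterns themselves are hypothetical and cover only a few common written forms. A real implementation would also use the surrounding sentence, since the text says the category depends on the number's meaning in context.

```python
import re

# (category, pattern) pairs, tried in order; patterns are illustrative only.
CATEGORY_PATTERNS = [
    ("time", r"\d{1,2}:\d{2}"),
    ("date", r"\d{4}-\d{1,2}-\d{1,2}"),
    ("telephone", r"\d{3}-\d{4}-\d{4}"),
    ("amount", r"[$¥]\d+(?:\.\d+)?"),
    ("unit", r"\d+(?:\.\d+)?(?:kg|km|cm|m)"),
    ("symbol", r"\d+(?:\.\d+)?%"),
]


def categorize_number(token):
    """Return the first category whose pattern matches the whole token."""
    for name, pattern in CATEGORY_PATTERNS:
        if re.fullmatch(pattern, token):
            return name
    return "plain"


for token in ["12:30", "2023-11-30", "$5.5", "5.5%", "3kg", "42"]:
    print(token, categorize_number(token))
```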
[0156] The following uses date-category numbers as an example to illustrate number regularization. Figure 10 shows a regularization method for date-category numbers. As shown in Figure 10, date-category numbers include the forms "xx year xx month xx day", "xx month xx day", "xx year", "xx month", and "xx day".
[0157] Among them, xx year xx month xx day can be divided into xx year xx month and xx day.
[0158] Year and month can be divided into integer year and integer month, decimal year and decimal month, integer year and decimal month, and decimal year and integer month.
[0159] Integer years and integer months can be divided into years and months.
[0160] Years can be divided into years starting with 0 and years not starting with 0.
[0161] Years starting with 0 can include two-digit years and others. Among two-digit years, when followed by months from 01-09 or 1-12, the year is read digit-by-digit. For example, 1 is read as "one". In others, the year is read digit-by-digit. For example, 1 is read as "yao".
[0162] Years not starting with 0 can include one-digit years, two-digit years, three-digit years, four-digit years and others. In one-digit years, the year is read by its value. For example, "two years", 1 is read as "one". In two-digit years, when followed by months from 01-12 or 1-12, the year is read digit-by-digit. For example, 1 is read as "one". When followed by other months, the year is read by its value, and the thousands place and above are pronounced as "liang". In three-digit years, the year is read by its value, and the thousands place and above are pronounced as "liang". In four-digit years, 2000 is read by its value, and the thousands place and above are pronounced as "liang". When non-2000 is followed by months from 0,1-12 or 1-12 and is between 1000 and 2000, the year is read digit-by-digit. For example, 1 is read as "one". Otherwise, the year is read by its value, and the thousands place and above are pronounced as "liang". When non-2000 is followed by other months and is between 1500 and 2000, 200-2099, the year is read digit-by-digit. For example, 1 is read as "one". Otherwise, the year is read by its value, and the thousands place and above are pronounced as "liang". Others can include years starting with non-0 and the rest being 0. Among them, for years with 17 digits (including 17 digits) or less, the year is read by its value, and the thousands place and above are pronounced as "liang". For years with more than 17 digits, the year is read digit-by-digit. For example, 1 is read as "yao". The remaining years are pronounced digit-by-digit. For example, 1 is read as "yao".
[0163] Months can be divided into months from 01-12, months from 1-12, other months not starting with 0 and other months starting with 0.
[0164] Among them, for months from 01-12 and months from 1-12, the month is read by its value. For example, "February". For other months not starting with 0, the month is read by its value, and the thousands place and above are pronounced as "liang". For other months starting with 0, the month is read digit-by-digit. For example, 1 is read as "yao".
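The month rules above reduce to a small decision: a month written as 01-12 or 1-12 is read by value, any other month starting with 0 is read digit by digit, and any other month is read by value. The sketch below encodes only that decision plus a digit-by-digit reading in which 1 is read as "yao". The pinyin digit names other than "yao" and "er" do not appear in the text and are supplied here as an assumption; function names are hypothetical.

```python
# Digit names for digit-by-digit reading; 1 is read as "yao" in this mode.
DIGIT_NAMES = ["ling", "yao", "er", "san", "si", "wu", "liu", "qi", "ba", "jiu"]


def month_reading_mode(month):
    """Return 'value' or 'digits' for a month written as a digit string."""
    if not month.isdigit():
        raise ValueError("month must consist of digits only")
    if len(month) <= 2 and 1 <= int(month) <= 12:
        return "value"   # 01-12 or 1-12
    if month.startswith("0"):
        return "digits"  # other months starting with 0
    return "value"       # other months not starting with 0


def read_digits(number):
    """Read a digit string one digit at a time."""
    return " ".join(DIGIT_NAMES[int(digit)] for digit in number)


print(month_reading_mode("02"))   # value
print(month_reading_mode("013"))  # digits
print(read_digits("013"))         # ling yao san
```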
[0165] Decimal years and decimal months, integer years and decimal months, decimal years and integer months can all be split into "xx years" and "xx months".
[0166] xx [day] can be divided into days from 01-09, days from 01-09, other days not starting with 0 and other days starting with 0. Among them, for days from 01-09, the day is read digit-by-digit. For example, "the second day". For days from 01-09, the day is read by its value. For example, "the second day". For other days not starting with 0, the day is read by its value, and the thousands place and above are pronounced as "liang". For other days starting with 0, the day is read digit-by-digit. For example, 1 is read as "yao".
[0167] xx years can be divided into integer years and decimal years.
[0168] Among them, integer years can be divided into positive integers and negative integers.
[0169] Positive integer years can include combinations of digits starting with 0 and the year, where the year is read digit by digit, e.g., 1 is pronounced "yao", 2 is pronounced "er". Positive integer years can also include combinations of digits not starting with 0 and the year, where years between 1500 and 2000 are read digit by digit, e.g., 1 is pronounced "yao", and other years are read by value, e.g., "two years", with the thousands place and above pronounced as "liang". Negative integer years can include combinations of digits starting with 0 and the year, where the year is read digit by digit, and the minus sign is not translated, e.g., 1 is pronounced "yao", 2 is pronounced "er". Negative integer years can also include combinations of digits not starting with 0 and the year, where the year is read by value, and the minus sign is translated as "bar", e.g., "two years", with the thousands place and above pronounced as "liang".
[0170] Decimal years can be divided into positive decimals and negative decimals. In positive decimals, the year is read as its value, and in negative decimals, the year is read as its value, and the sign is translated as "bar".
[0171] 905: Perform word regularization on words in a single sentence.
[0172] It is understandable that when a sentence contains letter abbreviations, these abbreviations can be further converted based on their context within the sentence, transforming them into full spelling. This avoids inaccurate predictions of the broadcast duration of the text due to failure to recognize letter abbreviations or incorrect conversion during character regularization. Table 3 shows a comparison of word regularization before and after applying it to words in different sentences.
[0173] Table 3
[0174]
[0175] 906: Perform symbolic regularization on the symbols that need to be broadcast in a single sentence.
[0176] It is understandable that when a sentence contains symbols that need to be broadcast, these symbols can be converted into the corresponding language text of the target language, that is, the symbols needing to be broadcast can be converted into broadcast forms. Table 4 shows a comparison before and after applying symbolic regularization to the symbols that need to be broadcast in a sentence.
[0177] Table 4
[0178]
[0179] 907: Perform punctuation regularization on punctuation marks in a single sentence that do not need to be announced.
[0180] It is understandable that when a sentence contains punctuation marks that do not need to be announced, the repeated, unnecessary punctuation marks in the sentence can be deleted. Table 5 shows a comparison before and after applying punctuation regularization to unnecessary punctuation marks in a sentence.
[0181] Table 5
[0182]
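As a minimal sketch of step 907, repeated punctuation marks can be collapsed with a regular expression. The set of marks treated as not needing to be announced, and the rule of keeping exactly one mark from each run, are assumptions made for illustration; the actual rules are those of Table 5.

```python
import re

# Assumed set of punctuation marks that are not announced
# (ASCII and full-width forms). Hypothetical, for illustration only.
SILENT_PUNCT = "!?,;:!?,。;:、"

# A mark from the set followed by one or more repeats of the same mark.
_REPEATED = re.compile(r"([" + re.escape(SILENT_PUNCT) + r"])\1+")


def regularize_punctuation(sentence: str) -> str:
    """Collapse each run of an identical punctuation mark into a single mark."""
    return _REPEATED.sub(r"\1", sentence)
```

Only runs of the same mark are collapsed in this sketch, so a mixed sequence such as "?!" is left unchanged.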
[0183] 908: Get the language text corresponding to the text to be broadcast.
[0184] It can be understood that the text to be broadcast after performing number regularization, word regularization, symbol regularization, and punctuation regularization on each sentence in the text to be broadcast after sentence segmentation is taken as the text to be broadcast after sentence regularization, thus obtaining the language text corresponding to the text to be broadcast.
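The order of steps 904 to 908 can be sketched as a pipeline of per-sentence regularizers. The sketch below only shows how the steps are chained; the four stand-in functions and their replacement rules are hypothetical placeholders, not the regularization rules of the application.

```python
from typing import Callable, Iterable, List

Regularizer = Callable[[str], str]


def regularize_sentence(sentence: str, steps: Iterable[Regularizer]) -> str:
    """Apply each regularization step to one sentence, in order."""
    for step in steps:
        sentence = step(sentence)
    return sentence


def regularize_text(sentences: List[str], steps: List[Regularizer]) -> str:
    """Regularize every segmented sentence and join them into the language text."""
    return "".join(regularize_sentence(s, steps) for s in sentences)


# Hypothetical stand-ins for number, word, symbol and punctuation regularization.
def number_step(s: str) -> str:
    return s.replace("2", "two")


def word_step(s: str) -> str:
    return s.replace("TTS", "text to speech")


def symbol_step(s: str) -> str:
    return s.replace("%", " percent")


def punctuation_step(s: str) -> str:
    while "!!" in s:
        s = s.replace("!!", "!")
    return s


STEPS = [number_step, word_step, symbol_step, punctuation_step]
```

With these placeholders, `regularize_text(["TTS is 2% faster!!! ", "Done."], STEPS)` yields "text to speech is two percent faster! Done.".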
[0185] The following section describes the specific process of prosodic prediction of language text, based on the structure of the prosodic prediction model in the prosodic prediction module 320.
[0186] Figure 11 shows a schematic diagram of a prosodic prediction model. This prosodic prediction model can be used to predict the prosodic level of complete sentences, short sentences, phrases, and words in input language text.
[0187] As shown in Figure 11, the prosody prediction model includes a word segmentation module 1101, a text encoder (character embedding, Char embedding) 1102, a first feedforward neural network (FNN) 1103, a part-of-speech encoder (Pos embedding) 1104, a fusion network (Add) 1105, a multi-layer convolutional neural network (Multi CNN) 1106, a bidirectional long short-term memory (BiLSTM) network 1107, a second feedforward neural network (FNN) 1108, and a regression network 1109.
[0188] It is understood that the structure of the above prosody prediction model is only an example of a model in some embodiments, and may include more or fewer network layers, or may split, replace or merge some network layers, or may use other network structures.
[0189] Specifically, the text words output by the word segmentation module 1101 can be used as input to the text encoder 1102. The multiple text word vectors output by the text encoder 1102 can be used as input to the first feedforward neural network 1103. The part-of-speech tag of each text word output by the word segmentation module 1101 can be used as input to the part-of-speech encoder 1104. The physical features of each text word vector output by the first feedforward neural network 1103 and the multiple part-of-speech features output by the part-of-speech encoder 1104 can be used as input to the fusion network 1105. The fusion features output by the fusion network 1105 can be used as input to the multi-layer convolutional neural network 1106 and the bidirectional long short-term memory network 1107. The global features of each text word vector output by the multi-layer convolutional neural network 1106 and the local features of each text word vector output by the bidirectional long short-term memory network 1107 can be used as input to the second feedforward neural network 1108. The semantic features of the word vectors output by the second feedforward neural network 1108 can be used as input to the regression network 1109.
[0190] In some specific implementations, the word segmentation module 1101 can be used to segment the input language text to obtain text words and determine the part of speech of each text word. In other words, the language text can be segmented to obtain multiple text words. The part of speech of a text word can include nouns, verbs, pronouns, adjectives, etc., and can also include subject, predicate, object, attributive, adverbial, etc.
[0191] The text encoder 1102 can be used to encode each text word in the input to obtain multiple text word vectors.
[0192] The first feedforward neural network 1103 is used to perform feature extraction processing on each text word vector in multiple text word vectors to obtain the physical features of each text word vector. It can be understood that the physical features can be features that represent the order and left-right relationship between text words and other words in the text.
[0193] The part-of-speech encoder 1104 can be used to encode the part of speech of each text word to obtain multiple part-of-speech features.
[0194] The fusion network 1105 can be used to fuse the physical features and part-of-speech features of each text word vector to obtain fused features.
[0195] The multi-layer convolutional neural network 1106 can be used to extract features from the fused features, obtaining global features for each text word vector. These global features can be understood as representing the prosodic level of the text word vector corresponding to the fused features.
[0196] The bidirectional long short-term memory network 1107 can be used to extract features from the fused features, obtaining local features for each text word vector. These local features can represent the prosodic level corresponding to the preceding text word vector.
[0197] The second feedforward neural network 1108 can be used to determine the semantic features of word vectors based on global and local features.
[0198] The regression network 1109 can be used to determine the probability values of the different prosodic levels corresponding to each text word vector based on the semantic features of the text word vector, and to take the prosodic level with the highest probability value as the prosodic level of that text word vector, thus obtaining the prosodic prediction result of the language text.
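The final selection step of the regression network 1109 can be illustrated as follows. This sketch assumes that the network produces one raw score per prosodic level and that the scores are normalized with a softmax; the label names and function names are hypothetical and chosen only to mirror the levels mentioned above.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical label set mirroring the levels named in the description.
PROSODIC_LEVELS = ["word", "phrase", "short_sentence", "complete_sentence"]


def softmax(scores: Sequence[float]) -> List[float]:
    """Convert raw scores into probability values that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pick_prosodic_level(scores: Sequence[float]) -> Tuple[str, float]:
    """Return the prosodic level with the highest probability and that probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return PROSODIC_LEVELS[best], probs[best]
```

Applying `pick_prosodic_level` to the scores of each text word vector in turn would give the prosodic prediction result for the whole language text.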
[0199] The following section describes the specific process of phoneme prediction in language text, based on the structure of the phoneme prediction model in the phoneme prediction module 330.
[0200] Figure 12 shows a schematic diagram of a phoneme prediction model. This phoneme prediction model can be used to predict the pronunciation units (phonemes), such as initials, finals and tones, of input text.
[0201] As shown in Figure 12, the phoneme prediction model may include an encoder 1201, a long short-term memory (LSTM) network 1202, and a fully connected (FC) network 1203.
[0202] The encoded text output by encoder 1201 can be used as input to long short-term memory network 1202, and the multidimensional Gaussian distribution of phonemes corresponding to word vectors in the encoded text output by long short-term memory network 1202 can be used as input to fully connected network 1203.
[0203] It is understood that the structure of the above phoneme prediction model is only an example of a model in some embodiments, and may include more or fewer network layers, or may split, replace or merge some network layers, or may use other network structures.
[0204] In some specific implementations, encoder 1201 can be used to encode characters in spoken text to obtain encoded text. It can be understood that encoder 1201 can encode characters in spoken text into computer-recognizable character vectors to obtain encoded text.
[0205] The long short-term memory (LSTM) network 1202 can be used to perform prediction on the encoded text, obtaining a multidimensional Gaussian distribution of phonemes corresponding to word vectors in the encoded text. In essence, the LSTM network 1202 can calculate the multidimensional Gaussian distribution of phonemes corresponding to word vectors in the encoded text based on the contextual encoding values of the word vectors.
[0206] The fully connected network 1203 can be used to obtain phoneme prediction results for language text based on the multidimensional Gaussian distribution of phonemes corresponding to character vectors in encoded text. In other words, the fully connected network 1203 can use the multidimensional Gaussian distribution of phonemes corresponding to character vectors in encoded text, taking the phoneme corresponding to the highest probability in the multidimensional Gaussian distribution as the phoneme corresponding to the character vector, to obtain the pronunciation units (phonemes) such as initials, finals, and tones of the corresponding characters in the language text, thus obtaining the phoneme prediction results for the language text.
[0207] It is understandable that the method for determining the phonemes of words based on the phoneme prediction model is the same as the method for determining the phonemes of characters based on the phoneme prediction model. To avoid repetition, it will not be elaborated here.
[0208] Table 6 shows the phoneme prediction results obtained by using a phoneme prediction model to predict phonemes in different language texts, that is, multiple pronunciation units (phonemes) of characters and / or words in the language text.
[0209] Table 6
[0210]
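To illustrate what a phoneme prediction result of this kind may look like, the sketch below splits one pinyin syllable into an initial, a final and a tone. It assumes syllables are written as lowercase pinyin followed by a tone digit from 1 to 5 (for example "zhong1"), and it treats "y" and "w" as initials, as some speech systems do; both choices are assumptions, and the function name is hypothetical.

```python
import re
from typing import Tuple

# Two-letter initials are listed first so that "zh" wins over "z".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")


def split_syllable(syllable: str) -> Tuple[str, str, str]:
    """Split e.g. "zhong1" into ("zh", "ong", "1"); the initial may be empty."""
    match = re.fullmatch(r"([a-zü]+)([1-5])", syllable)
    if match is None:
        raise ValueError(f"unexpected syllable format: {syllable!r}")
    letters, tone = match.groups()
    for initial in INITIALS:
        # Require a non-empty final after the initial.
        if letters.startswith(initial) and len(letters) > len(initial):
            return initial, letters[len(initial):], tone
    return "", letters, tone
```

A syllable without an initial, such as "an1", is returned with an empty initial.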
[0211] The following section describes the specific process of determining the broadcast duration of each phoneme, based on the structure of the duration prediction model in the duration prediction module 340.
[0212] Figure 13 shows a schematic diagram of a duration prediction model. This duration prediction model can be used to predict the broadcast duration of each pronunciation unit (phoneme) in the input language text.
[0213] As shown in Figure 13, the duration prediction model may include an encoder 1301, a multi-layer convolutional neural network (Multi CNN) 1302, a bidirectional long short-term memory (BiLSTM) network 1303, a feedforward neural network (FNN) 1304, and a softmax regression network 1305.
[0214] Specifically, the encoded text output by encoder 1301 can be used as input to multi-layer convolutional neural network 1302 and bidirectional long short-term memory network 1303. The global features of phoneme vectors in the encoded text output by multi-layer convolutional neural network 1302 and the local features of phoneme vectors in the encoded text output by bidirectional long short-term memory network 1303 can be used as input to feedforward neural network 1304. The semantic features of phoneme vectors output by feedforward neural network 1304 can be used as input to regression network 1305.
[0215] It is understood that the structure of the above duration prediction model is only an example of a model in some embodiments. It may include more or fewer network layers, and may also split, replace or merge some network layers, and may also use other network structures.
[0216] In some specific implementations, encoder 1301 can be used to encode pronunciation units (phonemes) to obtain encoded text. It can be understood that encoder 1301 can encode pronunciation units (phonemes) into computer-recognizable characters, for example, encoding pronunciation units (phonemes) into phoneme vectors to obtain encoded text.
[0217] A multi-layer convolutional neural network 1302 can be used to extract features from encoded text, specifically the global features of phoneme vectors within the encoded text. These global features can be understood as representing the duration of the phoneme vectors' playback in the encoded text.
[0218] The bidirectional long short-term memory network 1303 can be used to extract features from encoded text, obtaining local features of phoneme vectors in the encoded text. These local features can be understood as representing the playback duration of a phoneme vector relative to the previous phoneme vector.
[0219] The feedforward neural network 1304 can be used to determine the semantic features of phoneme vectors based on global and local features.
[0220] The regression network 1305 can be used to determine the probability values of the different broadcast durations corresponding to each phoneme vector based on the semantic features of the phoneme vector, and to take the broadcast duration with the highest probability value as the broadcast duration of that phoneme vector, thus obtaining the broadcast duration prediction result of the language text. For example, Table 1 above shows the broadcast duration of each pronunciation unit (phoneme) in the language text determined by a duration prediction model.
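Once a broadcast duration is available for each phoneme, an overall duration can be derived from the per-phoneme values. The sketch below assumes that the total is simply the sum of the per-phoneme durations in milliseconds and ignores any pauses at prosodic boundaries; this aggregation rule, the function names and the sample values are assumptions for illustration and are not taken from Table 1.

```python
from typing import Iterable, Tuple


def total_broadcast_duration_ms(phoneme_durations: Iterable[Tuple[str, float]]) -> float:
    """Sum the predicted duration (in ms) over all (phoneme, duration) pairs."""
    return sum(duration for _, duration in phoneme_durations)


def format_duration(milliseconds: float) -> str:
    """Format a duration as minutes:seconds, rounded to the nearest second."""
    total_seconds = int(round(milliseconds / 1000))
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes:d}:{seconds:02d}"


# Hypothetical per-phoneme predictions, in milliseconds.
sample = [("n", 60.0), ("i", 120.0), ("h", 55.0), ("ao", 180.0)]
total_ms = total_broadcast_duration_ms(sample)
```

A display module could then show `format_duration(total_ms)` to the user as the predicted broadcast duration.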
[0221] The hardware structure of the electronic device is described below. For example, as shown in Figure 14, the electronic device 1400 may include a processor 1410, an external memory interface 1420, an internal memory 1421, a universal serial bus (USB) interface 1430, a charging management module 1440, a power management module 1441, a battery 1442, an audio module 1450, a speaker 1450A, a receiver 1450B, a microphone 1450C, a headphone jack 1450D, a sensor module 1460, a display screen 1470, etc.
[0222] It is understood that the structures illustrated in the embodiments of the present invention do not constitute a specific limitation on the electronic device. In other embodiments of this application, the electronic device 1400 may include more or fewer components than illustrated, or combine some components, or split some components, or have different component arrangements. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
[0223] Processor 1410 may include one or more processing units, such as an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and / or a neural network processing unit (NPU). These different processing units may be independent devices or integrated into one or more processors. In some optional instances, processor 1410 may execute the text broadcasting method mentioned in the embodiments of this application.
[0224] The processor 1410 may also include a memory for storing instructions and data. In some embodiments, the memory in the processor 1410 is a cache memory. This memory can store instructions or data that the processor 1410 has just used or is recurring. If the processor 1410 needs to use the instruction or data again, it can directly retrieve it from the memory. This avoids repeated accesses, reduces the waiting time of the processor 1410, and thus improves system efficiency. In some optional instances, the memory can store instructions or data for the text broadcasting method mentioned in the embodiments of this application.
[0225] The USB interface 1430 conforms to the USB standard specification and may specifically be a Mini USB interface, a Micro USB interface, a USB Type-C interface, etc. The USB interface 1430 can be used to connect a charger to charge the electronic device, and also to transfer data between the electronic device and peripherals. It can also be used to connect headphones for audio playback. Furthermore, this interface can be used to connect other electronic devices, such as AR devices.
[0226] The charging management module 1440 receives charging input from a charger. The charger can be a wireless charger or a wired charger. In some wired charging embodiments, the charging management module 1440 receives charging input from the wired charger via a USB interface 1430. In some wireless charging embodiments, the charging management module 1440 receives wireless charging input via the wireless charging coil of the electronic device. While charging the battery 1442, the charging management module 1440 can also supply power to the electronic device via the power management module 1441.
[0227] The power management module 1441 connects the battery 1442, the charging management module 1440, and the processor 1410. The power management module 1441 receives input from the battery 1442 and / or the charging management module 1440, supplying power to the processor 1410, internal memory 1421, display screen 1470, etc. The power management module 1441 can also monitor parameters such as battery capacity, battery cycle count, and battery health status (leakage current, impedance). In some other embodiments, the power management module 1441 may also be located within the processor 1410. In other embodiments, the power management module 1441 and the charging management module 1440 may be housed in the same device.
[0228] Electronic device 1400 can realize display functions through GPU, display screen 1470, and application processor.
[0229] The display screen 1470 includes a display panel. The display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini LED, a Micro LED, a quantum dot light-emitting diode (QLED), etc. In some embodiments, the electronic device 1400 may include one or N display screens 1470, where N is a positive integer greater than 1. In some embodiments, the display screen 1470 may be a touchscreen.
[0230] The external memory interface 1420 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the electronic device 1400. The external memory card communicates with the processor 1410 through the external memory interface 1420 to perform data storage functions. For example, music, video, and other files can be saved on the external memory card.
[0231] Internal memory 1421 can be used to store computer executable program code, which includes instructions. Internal memory 1421 may include a program storage area and a data storage area. The program storage area may store the operating system, at least one application program required for a function (such as sound playback, image playback, etc.), etc. The data storage area may store data created during the use of electronic device 1400 (such as audio data, phonebook, etc.). Furthermore, internal memory 1421 may include high-speed random access memory, and may also include non-volatile memory, such as at least one disk storage device, flash memory device, universal flash storage (UFS), etc. Processor 1410 executes various functional applications and data processing of electronic device 1400 by running instructions stored in internal memory 1421 and / or instructions stored in memory located in the processor.
[0232] Electronic device 1400 can implement audio functions through audio module 1450, speaker 1450A, receiver 1450B, microphone 1450C, headphone jack 1450D, and application processor. Examples include music playback, recording, and on-demand text-to-speech.
[0233] The audio module 1450 is used to convert digital audio information into analog audio signal output, and also to convert analog audio input into digital audio signal. The audio module 1450 can also be used for encoding and decoding audio signals. In some embodiments, the audio module 1450 may be located in the processor 1410, or some functional modules of the audio module 1450 may be located in the processor 1410.
[0234] The speaker 1450A, also known as a "loudspeaker," is used to convert audio electrical signals into sound signals. The electronic device 1400 can play music or hands-free calls through the speaker 1450A.
[0235] The receiver 1450B, also known as the "earpiece," is used to convert audio electrical signals into sound signals. When an electronic device answers a phone call or voice message, the receiver 1450B can be brought close to the ear to hear the voice.
[0236] Microphone 1450C, also known as a "mic" or "voice transducer," is used to convert sound signals into electrical signals. When making a phone call or sending a voice message, the user can speak with their mouth close to microphone 1450C, inputting the sound signal into microphone 1450C. The electronic device 1400 can have at least one microphone 1450C. In some embodiments, the electronic device 1400 can have two microphones 1450C, which, in addition to collecting sound signals, can also perform noise reduction. In other embodiments, the electronic device 1400 can have three, four, or more microphones 1450C, enabling sound signal collection, noise reduction, sound source identification, and directional recording, among other functions.
[0237] The headphone jack 1450D is used to connect wired headphones. The headphone jack 1450D can be the USB interface 1430, a 3.5 mm Open Mobile Terminal Platform (OMTP) standard interface, or a Cellular Telecommunications Industry Association of the USA (CTIA) standard interface.
[0238] The sensor module 1460 may include a pressure sensor 1460A, a touch sensor 1460B, etc.
[0239] Pressure sensor 1460A is used to sense pressure signals and convert them into electrical signals. In some embodiments, pressure sensor 1460A can be disposed on display screen 1470. Pressure sensor 1460A can be of many types, such as resistive pressure sensors, inductive pressure sensors, and capacitive pressure sensors. A capacitive pressure sensor may include at least two parallel plates with conductive material. When force is applied to pressure sensor 1460A, the capacitance between the electrodes changes. Electronic device 1400 determines the pressure intensity based on the change in capacitance. When a touch operation is applied to display screen 1470, electronic device 1400 detects the intensity of the touch operation based on pressure sensor 1460A. Electronic device 1400 can also calculate the touch position based on the detection signal from pressure sensor 1460A. In some embodiments, touch operations applied to the same touch position but with different touch operation intensities can correspond to different operation commands. For example, when a touch operation with a touch operation intensity less than a first pressure threshold is applied to the SMS application icon, a command to view SMS messages is executed. When a touch operation with a strength greater than or equal to the first pressure threshold is applied to the SMS application icon, the instruction to create a new SMS message is executed.
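The threshold rule in the SMS example above can be sketched as a simple dispatch on touch intensity. The threshold value, the normalized intensity scale, and the command names are hypothetical and serve only to show the comparison against the first pressure threshold.

```python
# Hypothetical first pressure threshold on a normalized 0.0-1.0 scale.
FIRST_PRESSURE_THRESHOLD = 0.5


def sms_icon_command(touch_intensity: float,
                     threshold: float = FIRST_PRESSURE_THRESHOLD) -> str:
    """Map the intensity of a touch on the SMS application icon to a command."""
    if touch_intensity < threshold:
        # Lighter than the first pressure threshold: view SMS messages.
        return "view_sms"
    # Greater than or equal to the first pressure threshold: create a new SMS.
    return "create_sms"
```

Note that an intensity exactly equal to the threshold falls into the "create a new SMS message" branch, matching the "greater than or equal to" wording above.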
[0240] Touch sensor 1460B, also known as a "touch device," can be located on display screen 1470. The touch sensor 1460B and display screen 1470 together form a touchscreen, also known as a "touch-control screen." Touch sensor 1460B is used to detect touch operations applied to or near it. The touch sensor can transmit the detected touch operation to the application processor to determine the type of touch event. Visual output related to the touch operation can be provided through display screen 1470. In other embodiments, touch sensor 1460B may also be located on the surface of electronic device 1400, in a different position than display screen 1470.
[0241] It is understood that the structures illustrated in the embodiments of this application do not constitute a specific limitation on the electronic device 1400. In other embodiments of this application, the electronic device 1400 may include more or fewer components than illustrated, or combine some components, or split some components, or have different component arrangements. The illustrated components may be implemented in hardware, software, or a combination of software and hardware, and the embodiments of this application do not limit this.
[0242] The software architecture of the electronic device is described below. For example, the software operating system of an electronic device can adopt a layered architecture, an event-driven architecture, a microkernel architecture, a cloud architecture, etc. This application embodiment uses the Android® system with a layered architecture as an example to illustrate the software architecture of the electronic device. Please refer to Figure 15, which is a schematic diagram of the software architecture of an electronic device provided in an embodiment of this application. The layered architecture divides the software into several layers, each with a clear role and function. Layers communicate with each other through software interfaces. In some embodiments, the operating system of the electronic device is divided into four layers, from top to bottom: the application layer, the application framework layer, the Android runtime and system libraries, and the kernel layer.
[0243] The application layer can include a series of application packages. For example, as shown in Figure 15, the application packages may include camera, gallery, calendar, call, news, SMS, WLAN, on-demand reading, a user experience (UX) display module, etc.
[0244] Optionally, on-demand reading can be a service (or function) that comes with the software operating system of an electronic device. That is, on-demand reading can be a service developed by the software operating system developer and configured in the software operating system.
[0245] On-demand reading is used to provide voice feedback to users. For example, it can read aloud the content on the display screen of an electronic device or the content corresponding to the area selected by the user on the display screen, so that the user can know the content on the display screen of the electronic device without reading it.
[0246] The application framework layer provides application programming interfaces (APIs) and a programming framework for applications in the application layer. The application framework layer includes some predefined functions.
[0247] As shown in Figure 15, the application framework layer may include a window manager, content provider, notification manager, view system, text-to-speech (TTS) service, etc.
[0248] The window manager is used to manage windowed applications. It can retrieve screen size, determine the presence of a status bar, lock the screen, and capture screenshots, among other things.
[0249] Content providers store and retrieve data, making that data accessible to applications. This data may include videos, images, audio, made and received phone calls, browsing history and bookmarks, phone books, etc.
[0250] The notification manager allows applications to display notifications in the status bar. These notifications can be used to deliver informational messages and can disappear automatically after a short pause, requiring no user interaction. For example, the notification manager can be used to notify users of completed downloads or message alerts. The notification manager can also display notifications as icons or scrolling text in the top status bar, such as notifications from background applications, or as dialog boxes on the screen. Examples include displaying text messages in the status bar, emitting sounds, vibrating electronic devices, and flashing indicator lights.
[0251] A view system includes visual controls, such as controls for displaying text and controls for displaying images. View systems can be used to build applications. A display interface can consist of one or more views. For example, a display interface including a text notification icon could include views for displaying text and views for displaying images.
[0252] It is understood that in some embodiments, TTS service may also be referred to as text-to-speech service.
[0253] Optionally, the TTS service can be a service built into the software operating system of the electronic device. That is, the TTS service can be developed by the software operating system developer and configured within the software operating system. In this case, the TTS service can reside in the application framework layer of the electronic device's software operating system.
[0254] Optionally, the TTS service can be developed by a third-party software developer and installed by the user on the electronic device. In this case, the TTS service can reside in the application layer of the electronic device's software operating system.
[0255] It is understood that, for ease of description, the following embodiments of this application take as an example the case in which the TTS service is located in the application framework layer of the software operating system of the electronic device.
[0256] The Android Runtime consists of core libraries and a virtual machine. The Android runtime is responsible for the scheduling and management of the Android system.
[0257] The core library consists of two parts: one part is the functionalities that need to be called by the Java language, and the other part is the Android core library.
[0258] The application layer and application framework layer run in a virtual machine. The virtual machine executes the Java files of the application layer and application framework layer as binary files. The virtual machine is used to perform functions such as object lifecycle management, stack management, thread management, security and exception management, and garbage collection.
[0259] System libraries can include multiple functional modules. For example, a surface manager, media libraries, 3D graphics processing libraries (such as OpenGL ES), and 2D graphics engines (such as SGL).
[0260] The Surface Manager is used to manage the display subsystem and to provide the blending of 2D and 3D layers for multiple applications.
[0261] The media library supports playback and recording of various common audio and video formats, as well as still image files. It supports multiple audio and video encoding formats, such as MPEG4, H.264, MP3, AAC, AMR, JPG, and PNG.
[0262] The 3D graphics processing library is used to implement 3D graphics drawing, image rendering, compositing, layer processing, etc.
[0263] A 2D graphics engine is a graphics engine for 2D drawing.
[0264] The kernel layer is the layer between hardware and software. The kernel layer includes at least display drivers, camera drivers, audio drivers, and sensor drivers.
[0265] The text broadcasting method provided in this application embodiment can be applied to scenarios where users need to obtain content from the display interface by listening or the content corresponding to the area selected by the user on the display interface. That is, when users need to obtain content from the display interface of an electronic device and the broadcasting duration of the content by listening, the text broadcasting method provided in this application embodiment can improve the accuracy of predicting the broadcasting duration of the content.
[0266] In one specific embodiment of this application, the text broadcasting method can be implemented by multiple services (i.e., software modules) in the software operating system of an electronic device. For example, the method can be implemented by an on-demand reading service, a text-to-speech (TTS) service, and a UX display module. As shown in Figure 15, the on-demand reading service and the UX display module can reside in the application layer of the software operating system, while the TTS service can reside in either the application layer or the application framework layer. This embodiment takes the case in which the TTS service resides in the application framework layer as an example.
[0267] Based on this, as shown in Figure 16, the text broadcasting method may include:
[0268] 1601: The on-demand reading service detects a click operation on a voice tone control of the on-demand reading function and sends a message to the UX display module indicating that the on-demand reading function is enabled.
[0269] It can be understood that the voice tone controls of the on-demand reading function may be, for example, the controls corresponding to "First Broadcast Tone", "Second Broadcast Tone", and "Custom Broadcast Tone" shown in part a of Figure 1.
[0270] 1602: When the on-demand reading function is enabled, the UX display module obtains the text to be read and sends the text to be read to the on-demand reading service.
[0271] It is understandable that the text to be read can be the content displayed on the screen, or it can be the content corresponding to the area selected by the user on the screen when the user's long press and drag operation is received.
[0272] In some optional examples, for the method of obtaining the text to be broadcast, reference can be made to step 401 in Figure 4 described above, which is not repeated here.
[0273] 1603: The on-demand reading service predicts the broadcast duration of the text to be read and sends the corresponding language text to the TTS (speech synthesis) service.
[0274] It can be understood that, for the method by which the on-demand reading service predicts the broadcast duration of the text to be read, reference can be made to the steps shown in Figures 4-13, which are not repeated here.
[0275] 1604: The TTS service converts the language text into corresponding audio information and then plays the audio information.
[0276] 1605: The on-demand reading service sends the predicted broadcast duration of the text to be read to the UX display module.
[0277] 1606: The UX display module displays the broadcast duration.
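The interaction in steps 1601 to 1606 can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation of this application: the class names, the method names, and the placeholder duration rule (a fixed 0.25 seconds per character) are all hypothetical, and speech synthesis and playback are stubbed out.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TTSService:
    """Hypothetical TTS service; synthesis and playback are stubbed out."""
    played: List[str] = field(default_factory=list)

    def synthesize_and_play(self, language_text: str) -> None:
        # Step 1604: convert the language text to audio and play it (stub).
        self.played.append(language_text)


@dataclass
class UXDisplayModule:
    """Hypothetical UX display module holding the on-screen text."""
    screen_text: str
    shown_duration_s: Optional[float] = None

    def on_reading_enabled(self, service: "OnDemandReadingService") -> None:
        # Step 1602: obtain the text to be read and send it to the service.
        service.on_text_received(self.screen_text)

    def show_duration(self, seconds: float) -> None:
        # Step 1606: display the predicted broadcast duration.
        self.shown_duration_s = seconds


class OnDemandReadingService:
    """Hypothetical on-demand reading service coordinating the other two."""

    # Assumed placeholder value, not taken from this application.
    SECONDS_PER_CHAR = 0.25

    def __init__(self, ux: UXDisplayModule, tts: TTSService) -> None:
        self.ux = ux
        self.tts = tts

    def on_tone_control_clicked(self) -> None:
        # Step 1601: tell the UX display module that reading is enabled.
        self.ux.on_reading_enabled(self)

    def on_text_received(self, text: str) -> None:
        # Step 1603: derive the language text (here only whitespace cleanup),
        # predict its duration, and hand the language text to the TTS service.
        language_text = " ".join(text.split())
        duration = len(language_text) * self.SECONDS_PER_CHAR
        self.tts.synthesize_and_play(language_text)
        # Step 1605: send the predicted duration to the UX display module.
        self.ux.show_duration(duration)
```

Calling `on_tone_control_clicked()` on a service built from a `UXDisplayModule` and a `TTSService` walks through the six steps in order; a real system would pass these messages between processes or layers rather than through direct method calls.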
[0278] In this embodiment of the application, predicting the broadcast duration of the text to be broadcast based on language text can improve the accuracy of predicting the broadcast duration of the text to be broadcast.
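To illustrate why predicting from the language text rather than from the raw character count changes the result, the following minimal sketch standardizes a text and then multiplies the number of spoken characters by a per-character duration. Everything specific in it is an assumption made for illustration: the English word tables, the symbol mapping, the list of punctuation treated as not broadcast, and the value of 0.08 seconds per character. Numbers of 100 or more are simply read digit by digit.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

# Assumed mapping of symbols that are read aloud to their spoken form.
SYMBOLS = {"%": " percent", "&": " and ", "+": " plus "}
# Assumed set of punctuation marks that are not read aloud.
SILENT_PUNCTUATION = "\"'()[]"
# Assumed per-character duration in seconds.
SECONDS_PER_CHAR = 0.08


def number_to_words(n: int) -> str:
    """Spell out 0-99; larger numbers are read digit by digit (simplification)."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])
    return " ".join(ONES[int(digit)] for digit in str(n))


def standardize(text: str) -> str:
    """Turn raw text into language text: numbers and symbols become words,
    silent punctuation is removed, and whitespace is collapsed."""
    text = re.sub(r"\d+", lambda m: number_to_words(int(m.group())), text)
    for symbol, spoken in SYMBOLS.items():
        text = text.replace(symbol, spoken)
    text = text.translate(str.maketrans("", "", SILENT_PUNCTUATION))
    return " ".join(text.split())


def predict_duration(language_text: str,
                     seconds_per_char: float = SECONDS_PER_CHAR) -> float:
    """Number of non-space characters times the per-character duration."""
    spoken_chars = sum(1 for ch in language_text if not ch.isspace())
    return spoken_chars * seconds_per_char
```

For example, "It is 25% off" and "It is way off" both contain 13 characters, so a prediction based on the raw character count treats them as equal. After standardization the first becomes "It is twenty five percent off", which has more spoken characters and therefore a longer predicted duration than the second.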
[0279] Various embodiments of the mechanisms disclosed in this application can be implemented in hardware, software, firmware, or combinations of these implementation methods. Embodiments of this application can be implemented as computer programs or program code executable on a programmable system, the programmable system including at least one processor, a storage system (including volatile and non-volatile memory and / or storage elements), at least one input device, and at least one output device.
[0280] Program code can be applied to input instructions to execute the functions described in this application and generate output information. The output information can be applied to one or more output devices in a known manner. For the purposes of this application, the processing system includes any system having a processor such as, for example, a digital signal processor (DSP), a microcontroller, an application-specific integrated circuit (ASIC), or a microprocessor.
[0281] The program code can be implemented using a high-level procedural language or an object-oriented programming language to communicate with the processing system. Assembly language or machine language can also be used when needed. In fact, the mechanisms described in this application are not limited to any particular programming language. In either case, the language can be a compiled language or an interpreted language.
[0282] In some cases, the disclosed embodiments may be implemented in hardware, firmware, software, or any combination thereof. The disclosed embodiments may also be implemented as instructions carried by or stored on one or more transitory or non-transitory machine-readable (e.g., computer-readable) storage media, which may be read and executed by one or more processors. For example, the instructions may be distributed via a network or through other computer-readable media. Therefore, machine-readable media may include any mechanism for storing or transmitting information in a machine-readable (e.g., computer-readable) form, including but not limited to floppy disks, optical disks, CD-ROMs, magneto-optical disks, read-only memory (ROM), random access memory (RAM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), magnetic cards or optical cards, flash memory, or tangible machine-readable storage used to transmit information over the Internet by means of electrical, optical, acoustic, or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.). Therefore, machine-readable media include any type of machine-readable medium suitable for storing or transmitting electronic instructions or information in a machine-readable (e.g., computer-readable) form.
[0283] It should be noted that all units / modules mentioned in the device embodiments of this application are logical units / modules. Physically, a logical unit / module can be a physical unit / module, a part of a physical unit / module, or a combination of multiple physical units / modules. The physical implementation of these logical units / modules themselves is not the most important factor; the combination of functions implemented by these logical units / modules is the key to solving the technical problems proposed in this application. Furthermore, to highlight the innovative aspects of this application, the above-described device embodiments of this application have not introduced units / modules that are not closely related to solving the technical problems proposed in this application. This does not mean that the above-described device embodiments do not contain other units / modules.
[0284] It should be noted that in the examples and description of this application, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.
[0285] Although this application has been illustrated and described with reference to certain embodiments thereof, those skilled in the art will understand that various changes in form and detail may be made thereto without departing from the scope of this application.
Claims
1. A text broadcasting method, applied to an electronic device, characterized in that the method comprises: displaying a first broadcast interface, the first broadcast interface including a first predicted broadcast duration of a first text, wherein the first text includes N characters, and the N characters include M first-type characters; and displaying a second broadcast interface, the second broadcast interface including a second predicted broadcast duration of a second text, wherein the second text includes N characters, and the N characters include P first-type characters; wherein M and P are different, and the first predicted broadcast duration and the second predicted broadcast duration are different.
2. The method according to claim 1, characterized in that the broadcasting language of the first text includes at least one of the first type of languages, and the first-type characters include numbers represented in a form other than the corresponding broadcasting language, symbols to be broadcast, letter abbreviations, and punctuation marks that do not need to be broadcast.
3. The method according to claim 2, characterized in that the first type of languages includes Chinese and English, and the letter abbreviations include English letter abbreviations.
4. The method according to claim 2, characterized in that the first predicted broadcast duration is determined in the following manner: converting the first-type characters in the first text to obtain a first language text; and determining the first predicted broadcast duration based on the first language text.
5. The method according to claim 4, characterized in that converting the first-type characters in the first text includes at least one of the following: converting numbers and to-be-broadcast symbols in the first text that are represented in a form other than the first language into text in the first language; deleting the punctuation marks in the first text that do not need to be broadcast; converting the letter abbreviations in the first text into full pinyin.
6. The method according to claim 4, characterized in that determining the first predicted broadcast duration based on the first language text includes: determining the first predicted broadcast duration based on the number of characters in the first language text and the broadcast duration corresponding to each character.
7. The method according to claim 4, characterized in that determining the first predicted broadcast duration based on the first language text includes: determining the first predicted broadcast duration based on the prosodic levels corresponding to complete sentences, short sentences, phrases, and words in the first language text, the phonemes of characters and / or words in the first language text, and at least one of a broadcast timbre parameter and a speech rate parameter.
8. The method according to claim 7, characterized in that the prosodic levels corresponding to complete sentences, short sentences, phrases, and words in the first language text are determined based on a prosodic prediction model, and the phonemes of characters and / or words in the first language text are determined based on a phoneme prediction model.
9. The method according to claim 4, characterized in that determining the first predicted broadcast duration based on the first language text includes: determining the first predicted broadcast duration based on the prosodic levels of complete sentences, short sentences, phrases, and words in the first language text, the number of characters in the first language text, and the broadcast duration corresponding to each character.
10. The method according to any one of claims 1-9, characterized in that the method further comprises: in response to the user switching the speech rate of the first text from a first speech rate parameter to a second speech rate parameter, and / or in response to the user switching the timbre of the first text from a first timbre to a second timbre, displaying a third predicted broadcast duration of the first text.
11. An electronic device, characterized in that it comprises: a memory for storing instructions to be executed by one or more processors of the electronic device; and a processor, being one of the one or more processors of the electronic device, for executing the text broadcasting method according to any one of claims 1-10.
12. A readable storage medium, characterized in that the readable storage medium stores instructions that, when executed on an electronic device, cause the electronic device to perform the text broadcasting method according to any one of claims 1-10.
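As a purely illustrative sketch of the kind of calculation described in claims 9 and 10, the following code combines a character count, a per-character duration, and a pause for each prosodic boundary, and then rescales the result when the speech rate changes. It is not the claimed implementation. The pause values, the per-character duration, the rule that pauses scale with speech rate, and all names are assumptions; in the application the prosodic levels come from a prosodic prediction model, whereas here they are labeled by hand.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class ProsodicLevel(Enum):
    WORD = "word"
    PHRASE = "phrase"
    SHORT_SENTENCE = "short_sentence"
    COMPLETE_SENTENCE = "complete_sentence"


# Assumed pause, in seconds, after a unit ending at each prosodic level.
PAUSE_SECONDS = {
    ProsodicLevel.WORD: 0.0,
    ProsodicLevel.PHRASE: 0.1,
    ProsodicLevel.SHORT_SENTENCE: 0.25,
    ProsodicLevel.COMPLETE_SENTENCE: 0.5,
}


@dataclass(frozen=True)
class Segment:
    """A piece of language text and the prosodic boundary that follows it."""
    text: str
    boundary: ProsodicLevel


def predict_duration(segments: Iterable[Segment],
                     seconds_per_char: float = 0.2,
                     speech_rate: float = 1.0) -> float:
    """Sum per-character durations and boundary pauses, then divide by the
    speech rate (assumption: a rate of 2.0 halves both speech and pauses)."""
    if speech_rate <= 0:
        raise ValueError("speech_rate must be positive")
    total = 0.0
    for segment in segments:
        spoken_chars = sum(1 for ch in segment.text if not ch.isspace())
        total += spoken_chars * seconds_per_char
        total += PAUSE_SECONDS[segment.boundary]
    return total / speech_rate
```

Calling `predict_duration` again with a different `speech_rate` corresponds to recomputing the displayed duration after the user switches the speech rate; a timbre switch could be modeled in the same way by selecting a different `seconds_per_char`.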