[0043] The implementation of the present invention will be described in detail below with reference to the accompanying drawings and embodiments, so that those implementing the present invention can fully understand how the present invention applies technical means to solve technical problems, how the technical effects are achieved through the implementation process, and how the present invention is specifically implemented on the basis of that process. It should be noted that, as long as there is no conflict, the embodiments of the present invention and the features of the embodiments may be combined with one another, and the technical solutions so formed all fall within the protection scope of the present invention.
[0044] In traditional daily life, reading text is the main way people enjoy literary works. In certain specific scenarios, however, people also enjoy literary works through sound, for example by listening to storytelling or to recitations. The most common case is children whose reading ability is still insufficient, who usually experience literary works through the narration of others (listening to stories told by someone else).
[0045] With the continuous development of multimedia technology, more and more multimedia devices are used in daily life. Supported by multimedia technology, the audio form of literary works, storytelling in particular, has gradually shifted onto multimedia devices.
[0046] Generally, using a multimedia device to tell a story means that the story is told manually in advance and recorded as an audio file, and the multimedia device simply plays back the recorded file. With the development of computer technology, and in order to obtain the sound source simply and conveniently, the prior art also converts text data into audio data. In this way there is no need to recite and record the text manually; providing the story text is sufficient for a multimedia device to tell the story. However, directly converting text to speech by computer only guarantees a literal rendering of the text content; it cannot reproduce the voice and emotion of a real person telling a story. As a result, storytelling based on existing text-to-speech technology is dry and monotonous, conveys only the literal meaning of the text, and provides a very poor user experience.
[0047] To solve the above problems, the present invention proposes a story data processing method for intelligent robots. In the method of the present invention, the story in the form of text is converted into multi-modal data that can be displayed in multiple modalities, thereby improving the expressiveness of the story content.
[0048] Further, in actual application scenarios, when human beings communicate by voice, different people sound different; each voice carries the speaker's vocal characteristics. The text of a story usually contains dialogue and narration, which can be regarded as utterances of the characters in the story. Therefore, in one embodiment, matching sound effects are added specifically to the dialogue and the narration in the story, so that their voice presentation is more realistic and vivid, thereby improving the vividness of the storytelling and optimizing the user experience.
[0049] Compared with the prior art, according to the method and system of the present invention, a story in text form can be transformed into multi-modal data that can be presented in multiple modalities, and the presentation of dialogue and narration in the story can be optimized, thereby greatly improving the user experience of the listener when the story is told.
[0050] Next, the detailed flow of the method according to an embodiment of the present invention will be described based on the accompanying drawings. The steps shown in the flowcharts of the accompanying drawings can be executed in a computer system containing, for example, a set of computer-executable instructions. Although a logical sequence of the steps is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from the one given here.
[0051] As shown in Figure 1, in one embodiment, the method includes the following processes:
[0052] S110: Acquire story text data;
[0053] S120: Parse the story text data, and identify dialogue and narration in the story text;
[0054] S131: Call the story data processing model;
[0055] S132: Perform sound effect processing on the dialogue and narration in the story text to generate dialogue and narration data with sound effects;
[0056] S140: Generate and output multi-modal data matching the story text, the multi-modal data including the dialogue and narration data with sound effects generated in step S132.
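By way of non-limiting illustration only, the following Python sketch mirrors the flow of steps S110 to S140 listed above. The quotation-mark heuristic and the sound-effect labels are assumptions introduced for the example and are not part of the claimed story data processing model.

```python
import re

def parse_story_text(story_text):
    # S120: a naive split used only for this example -- quoted sentences are
    # treated as dialogue, the remaining text as narration.
    dialogue = re.findall(r'"([^"]+)"', story_text)
    narration = re.sub(r'"[^"]+"', "", story_text).strip()
    return dialogue, narration

def apply_sound_effects(sentences, effect):
    # Stand-in for steps S131/S132: tag each sentence with a sound-effect label.
    return [{"text": s, "sound_effect": effect} for s in sentences]

def process_story(story_text):
    dialogue, narration = parse_story_text(story_text)            # S120
    dialogue_fx = apply_sound_effects(dialogue, "lively_voice")   # S131/S132
    narration_fx = apply_sound_effects([narration], "calm_voice")
    # S140: assemble the multi-modal data (voice, text and action slots are
    # filled in by the later steps of the method).
    return {"dialogue": dialogue_fx, "narration": narration_fx}

print(process_story('The rabbit said "Let us race to the old oak tree."'))
```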
[0057] Further, in one embodiment, the dialogue and narration are output mainly by voice through TTS; therefore, the final output multi-modal data includes dialogue and narration data with sound effects that have been converted into voice. Specifically, in one embodiment, text-to-speech conversion is performed on the dialogue and narration in the story text in combination with the dialogue and narration data with sound effects, to generate dialogue and narration voice data with sound effects.
[0058] Further, in order to further improve the vividness of the story presentation, in one embodiment the story is not only told by voice; the dialogue and narration are also displayed in text form. Specifically, in one embodiment, the multi-modal data includes dialogue and narration text data with sound effects.
[0059] Further, in order to further improve the vividness of the story presentation, in one embodiment the story is not told only by voice and/or text. Specifically, in an embodiment, the multi-modal data generated in step S130 further includes intelligent robot action data, where:
[0060] Generate corresponding intelligent robot action data for the dialogue and narration in the story text.
[0061] In this way, when the intelligent robot tells a story, it can output the dialogue and narration data with sound effects while performing corresponding accompanying actions, thereby greatly improving the vividness of the storytelling.
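By way of non-limiting illustration, the following sketch shows one possible way of deriving intelligent robot action data for each dialogue or narration sentence; the action names and the mapping table are hypothetical examples only.

```python
# Mapping from sound-effect label to a robot action; purely illustrative.
ACTION_BY_EFFECT = {
    "lively_voice": "tilt_head",
    "calm_voice": "slow_nod",
}

def generate_action_data(sentences_with_effects):
    # One action entry per dialogue/narration sentence.
    return [
        {"text": item["text"],
         "action": ACTION_BY_EFFECT.get(item["sound_effect"], "idle")}
        for item in sentences_with_effects
    ]

print(generate_action_data([{"text": "Hello!", "sound_effect": "lively_voice"}]))
```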
[0062] Furthermore, in addition to dialogue and narration, the story text may also contain other content. In one embodiment, the text other than the dialogue and narration in the story text data is also converted into voice data and merged with the dialogue and narration data with sound effects. Specifically, the method also includes:
[0063] Perform text-to-speech conversion on the dialogue and narration in the story text in combination with the dialogue and narration data with sound effects, to generate dialogue and narration voice data with sound effects;
[0064] Convert the text other than the dialogue and narration in the story text data into first voice data;
[0065] Merge the dialogue and narration voice data with sound effects and the first voice data to generate story voice data.
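By way of non-limiting illustration, the merging described above could be sketched as follows; audio segments are represented abstractly by their position in the text and a label, not by real waveforms.

```python
def merge_story_voice(dialogue_narration_voice, first_voice):
    # Each entry is (position_in_text, voice_segment); merging simply orders
    # the segments as they appear in the story text.
    merged = sorted(dialogue_narration_voice + first_voice, key=lambda seg: seg[0])
    return [segment for _, segment in merged]

story_voice = merge_story_voice(
    [(10, "dialogue_with_effects.wav"), (55, "narration_with_effects.wav")],
    [(0, "intro.wav"), (90, "ending.wav")],
)
print(story_voice)  # segments in story order
```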
[0066] Further, in order to ensure that the sound effects added to the dialogue and narration data increase the vividness of the story presentation, rather than reducing the story's expressiveness through wrong sound effects, in one embodiment the story text data is parsed to determine the story content, and the sound effects corresponding to the dialogue and narration are determined based on the specific content of the story.
[0067] Specifically, in one embodiment, the story text data is analyzed based on text recognition technology. Specifically, in an embodiment, parsing the story text data includes: performing text recognition on the story text data to determine the story content.
[0068] Further, considering the characteristics of computer analysis, in one embodiment the story text data is analyzed by means of element decomposition. Specifically, in an embodiment, the content elements of the story are disassembled based on the text recognition result and the story elements are extracted, where the story elements include the style, characters and/or dialogue of the story.
[0069] Specifically, in one embodiment, calling the story data processing model to perform sound effect processing on the dialogue and narration in the story text includes:
[0070] Perform text recognition on the story text, disassemble the content elements of the story based on the text recognition results, and extract the story elements;
[0071] Determine the sound effect characteristics matching the dialogue and narration based on the story elements corresponding to the dialogue and narration;
[0072] Convert the dialogue and narration into dialogue and narration data with sound effects matching the sound effect characteristics.
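By way of non-limiting illustration, the following sketch shows one possible way of determining sound effect characteristics from the story elements corresponding to a dialogue sentence; the lookup table, roles and contexts are hypothetical examples only.

```python
# Sound-effect characteristics keyed by (dialogue role, dialogue context);
# the table entries are hypothetical examples.
EFFECT_TABLE = {
    ("wolf", "threatening"): {"pitch": "low", "tempo": "slow", "timbre": "rough"},
    ("rabbit", "cheerful"):  {"pitch": "high", "tempo": "fast", "timbre": "soft"},
}

def sound_effect_for(role, context):
    # Fall back to a neutral voice when no matching entry is found.
    return EFFECT_TABLE.get((role, context),
                            {"pitch": "mid", "tempo": "normal", "timbre": "plain"})

print(sound_effect_for("wolf", "threatening"))
```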
[0073] Specifically, in one embodiment, as shown in Figure 2, the method includes the following processes:
[0074] S210: Acquire story text data;
[0075] S220: Parse the story text data;
[0076] S221: Disassemble the content elements of the story based on the text recognition result, and extract the story elements;
[0077] S222: Identify the dialogue and narration in the story text;
[0078] S230: Call the story data processing model;
[0079] S231: Determine the sound effect characteristics matching the dialogue and the narration according to the story elements corresponding to the dialogue and the narration;
[0080] S232: Convert the dialogue and narration into dialogue and narration data with sound effects matching the sound effect characteristics.
[0081] Specifically, in one embodiment, the analysis targets are divided into specific categories (the story elements), keywords are extracted for each story element, and the extracted keywords together with their story element tags are saved as the analysis result.
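By way of non-limiting illustration, the following sketch shows one possible way of saving extracted keywords together with their story element tags as the analysis result; the keyword lists are hypothetical examples only.

```python
# Keyword lists per story element category; hypothetical examples only.
STYLE_KEYWORDS = {"fairy tale", "fable", "adventure"}
CHARACTER_KEYWORDS = {"wolf", "rabbit", "princess"}

def extract_story_elements(text):
    lowered = text.lower()
    result = []
    for keyword in STYLE_KEYWORDS:
        if keyword in lowered:
            result.append({"keyword": keyword, "element": "style"})
    for keyword in CHARACTER_KEYWORDS:
        if keyword in lowered:
            result.append({"keyword": keyword, "element": "character"})
    return result  # saved as the analysis result

print(extract_story_elements("A fable about a wolf and a rabbit"))
```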
[0082] Further, in the story text, as the story content progresses, the manner of description, the described content and/or the background of the dialogue and narration may differ. Therefore, in one embodiment, the corresponding sound effects are determined separately according to the story elements corresponding to the dialogue and the narration. Specifically, in an embodiment, the sound effects are determined for each sentence of dialogue and each sentence of narration.
[0083] Specifically, in an embodiment, the story element corresponding to the dialogue includes a dialogue role, dialogue content, dialogue environment and/or dialogue context.
[0084] Specifically, in an embodiment, the story element corresponding to the narration includes the narration content, the narration environment, and/or the narration context.
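By way of non-limiting illustration, the story elements associated with a dialogue sentence and a narration sentence may be represented by data structures such as the following; the field names are assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DialogueElements:
    role: str                          # dialogue role (the speaking character)
    content: str                       # dialogue content
    environment: Optional[str] = None  # dialogue environment
    context: Optional[str] = None      # dialogue context

@dataclass
class NarrationElements:
    content: str                       # narration content
    environment: Optional[str] = None  # narration environment
    context: Optional[str] = None      # narration context
```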
[0085] Further, based on the method of the present invention, the present invention also proposes a storage medium storing program code capable of implementing the method according to the present invention.
[0086] Further, based on the method of the present invention, the present invention also proposes a story data processing system for intelligent robots.
[0087] Specifically, as shown in Figure 3, in one embodiment, the system includes:
[0088] The text acquisition module 310 is configured to acquire story text data;
[0089] The text analysis module 320 is configured to parse the story text data and identify the dialogue and narration in the story text;
[0090] The story data processing model library 341 is configured to store story data processing models;
[0091] The sound effect processing module 340 is configured to call the story data processing model, perform sound effect processing on the dialogue and narration in the story text, and generate dialogue and narration data with sound effects;
[0092] The multi-modal story data generating module 330 is configured to generate and output multi-modal data matching the story text, the multi-modal data including the dialogue and narration data with sound effects.
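By way of non-limiting illustration, the modules of Figure 3 could be wired together as sketched below; the class and method names are assumptions introduced for the example, and the module internals are placeholders.

```python
class StoryDataProcessingSystem:
    """Illustrative wiring of the modules of Figure 3; internals are placeholders."""

    def __init__(self, text_acquisition, text_analysis, sound_effect, multimodal):
        self.text_acquisition = text_acquisition  # text acquisition module 310
        self.text_analysis = text_analysis        # text analysis module 320
        self.sound_effect = sound_effect          # sound effect processing module 340
        self.multimodal = multimodal              # multi-modal story data generating module 330

    def run(self, request):
        text = self.text_acquisition.acquire(request)
        dialogue, narration = self.text_analysis.parse(text)
        fx_data = self.sound_effect.process(dialogue, narration)  # uses model library 341
        return self.multimodal.generate(text, fx_data)
```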
[0093] Further, in an embodiment, the multi-modal story data generation module 330 is further configured to generate corresponding intelligent robot action data for the dialogue and narration in the story text.
[0094] Further, in one embodiment, as shown in Figure 4, the multi-modal story data generation module 430 further includes:
[0095] The voice conversion unit 431 is configured to convert the text other than the dialogue and narration in the story text data into first voice data;
[0096] The voice conversion unit 432 is configured to perform text-to-speech conversion on the dialogue and narration in the story text in combination with the dialogue and narration data with sound effects, to generate the dialogue and narration voice data with sound effects;
[0097] The voice synthesis unit 433 is configured to merge the dialogue and narration voice data with sound effects and the first voice data to generate story voice data.
[0098] Further, based on the story data processing system proposed by the present invention, the present invention also proposes an intelligent story machine. Specifically, as shown in Figure 5, in one embodiment, the story machine includes:
[0099] The input acquisition module 510 is configured to collect user multi-modal input and receive user story requirements;
[0100] Story data processing system 520, which is configured to obtain corresponding story text data according to user story requirements, and generate multi-modal data;
[0101] The output module 530 is configured to output multi-modal data to the user.
[0102] Specifically, in one embodiment, the output module 530 includes a playback unit configured to play dialogue and narration voice data with sound effects.
[0103] Specifically, as shown in Figure 6, in one embodiment, the story machine includes a smart device 610 and a cloud server 620, where:
[0104] The cloud server 620 includes a story data processing system 630. The story data processing system 630 is configured to call the capability interfaces of the cloud server 620 to obtain and parse the story text data, and to generate and output multi-modal data including dialogue and narration data with sound effects. Specifically, during data parsing, the story data processing system 630 calls the corresponding logic processing through each capability interface.
[0105] Specifically, in an embodiment, the capability interface of the cloud server 620 includes a text recognition interface 621, a text/speech conversion interface 622, and a sound effect synthesis interface 623.
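By way of non-limiting illustration, the following sketch shows how the story data processing system 630 could chain the capability interfaces 621 to 623; the function names, signatures and return values are assumptions introduced for the example only.

```python
def text_recognition_interface(story_text):
    # Capability interface 621: stubbed text recognition result.
    return {"dialogue": [], "narration": story_text, "elements": []}

def tts_interface(text, voice_profile):
    # Capability interface 622: stubbed text/speech conversion.
    return f"<speech of '{text}' with profile {voice_profile}>"

def sound_effect_synthesis_interface(speech, effect):
    # Capability interface 623: stubbed sound effect synthesis.
    return f"{speech} + {effect}"

# The story data processing system 630 chains the interfaces:
parsed = text_recognition_interface('The wolf growled "Who dares cross my bridge?"')
speech = tts_interface(parsed["narration"], voice_profile="narrator")
story_audio = sound_effect_synthesis_interface(speech, effect="forest_ambience")
print(story_audio)
```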
[0106] The smart device 610 includes a human-computer interaction input and output module 611, a communication module 612, a playback module 613, and an action module 614.
[0107] The human-computer interaction input and output module 611 is configured to obtain the user's control instructions and determine the user's story listening needs.
[0108] The communication module 612 is configured to output the user story listening needs obtained by the human-computer interaction input and output module 611 to the cloud server 620, and receive multi-modal data from the cloud server 620.
[0109] The playing module 613 is configured to play dialogue and narration voice data or story voice data with sound effects in the multi-modal data.
[0110] The action module 614 is configured to make corresponding actions according to the action data of the intelligent robot in the multi-modal data.
[0111] Specifically, in a specific application scenario, the human-computer interaction input and output module 611 obtains the user's control instruction, and determines the user's story listening needs.
[0112] The communication module 612 sends the user story listening request to the cloud server 620.
[0113] The cloud server 620 selects corresponding story text data based on the user's story listening needs. The story data processing system in the cloud server 620 obtains and analyzes the story text data, and generates and outputs multi-modal data. The multi-modal data includes intelligent robot action data and story voice data. The story voice data includes dialogue and narration voice data with sound effects.
[0114] The communication module 612 receives the multi-modal data sent by the cloud server 620.
[0115] The playing module 613 plays the story voice data in the multi-modal data received by the communication module 612.
[0116] The action module 614 makes corresponding actions according to the action data of the intelligent robot in the multi-modal data.
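By way of non-limiting illustration, the end-to-end scenario of paragraphs [0111] to [0116] could be sketched as follows; the module behaviour is stubbed out and the data fields are illustrative assumptions only.

```python
def run_story_session(user_instruction, cloud_server):
    # Module 611: obtain the control instruction / story listening need.
    request = {"listening_need": user_instruction}
    # Module 612: send the need to the cloud server 620 and receive multi-modal data.
    multimodal = cloud_server.process(request)
    # Module 613: play the story voice data.
    for segment in multimodal["story_voice"]:
        print("play:", segment)
    # Module 614: perform the intelligent robot actions.
    for action in multimodal["robot_actions"]:
        print("perform:", action)
```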
[0117] It should be understood that the embodiments disclosed in the present invention are not limited to the specific structures, processing steps, or materials disclosed herein, but should be extended to equivalent substitutions of these features understood by those of ordinary skill in the relevant fields. It should also be understood that the terms used herein are only used for the purpose of describing specific embodiments, and are not meant to be limiting.
[0118] The "an embodiment" mentioned in the specification means that a specific feature, structure, or characteristic described in conjunction with the embodiment is included in at least one embodiment of the present invention. Therefore, the phrases "one embodiment" appearing in various places throughout the specification do not necessarily all refer to the same embodiment.
[0119] Although the disclosed embodiments of the present invention are as described above, the content described is only the embodiments adopted to facilitate the understanding of the present invention, and is not intended to limit the present invention. The method of the present invention can also have other various embodiments. Without departing from the essence of the present invention, those skilled in the art can make various corresponding changes or modifications according to the present invention, but these corresponding changes or modifications should fall within the protection scope of the claims of the present invention.