Audio generation method and apparatus, electronic device, and storage medium
By combining voice input with vehicle sensor data, and using a large audio model to generate audio and video explanations of vehicle operation, the problem of low retrieval efficiency in traditional vehicle operation manuals is solved, enabling convenient and efficient acquisition of vehicle operation solutions and rich information display.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ANHUI WEIDU HLDG CO LTD
- Filing Date
- 2023-05-12
- Publication Date
- 2026-06-19
AI Technical Summary
Traditional vehicle instruction manuals suffer from low search efficiency, poor information display accuracy and universality, and existing mobile device search methods are cumbersome and offer limited image content.
By acquiring user voice input and combining it with vehicle sensor data, a large audio generation model is used to generate vehicle operation audio, including video and narration, based on the vehicle operation manual and images.
It improves the ease of searching and versatility of vehicle operation plans, enriches information display, and enhances user experience.
Figure CN116645950B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of computer application technology, and in particular to an audio generation method, apparatus, electronic device, and storage medium.
Background Technology
[0002] Users often encounter various problems while driving, thus requiring them to consult the vehicle's owner's manual for necessary operating solutions. However, with the rapid development of automotive technology, car functions are becoming increasingly sophisticated, leading to more complex owner's manuals. Traditional paper and electronic manuals rely on categorization and retrieval to find the information users need, but search efficiency is low, and there are instances where the required information cannot be located.
[0003] In existing technologies, vehicle components are typically scanned using mobile devices such as smartphones or tablets so that the system can retrieve and display relevant operating instructions through image information in real time. However, the operation method is relatively cumbersome, resulting in low retrieval efficiency, and the displayed image content is relatively simple, leading to poor accuracy and universality of information display.
Summary of the Invention
[0004] This invention provides an audio generation method, apparatus, electronic device, and storage medium to solve the technical problems of low retrieval efficiency and poor accuracy and universality of information display.
[0005] According to one aspect of the present invention, an audio generation method is provided, wherein the method includes:
[0006] Acquire the first input speech and determine the first text information corresponding to the first input speech;
[0007] Acquire target sensing data, and determine target text features based on the first text information and the target sensing data, wherein the target text features include requirements, scenarios, and fault information;
[0008] The audio generation model generates audio from the input target text features to obtain the target audio corresponding to the target text features. The audio generation model is trained on the artificial intelligence content generation model based on the vehicle operation manual and vehicle operation images. The vehicle operation manual includes vehicle operation solutions for various needs, scenarios, and fault information.
[0009] According to another aspect of the present invention, an audio generation apparatus is provided, wherein the apparatus comprises:
[0010] The voice processing module is used to acquire the first input voice and determine the first text information corresponding to the first input voice;
[0011] The feature determination module is used to acquire target sensing data and determine target text features based on the first text information and the target sensing data, wherein the target text features include requirements, scenarios, and fault information;
[0012] The audio generation module is used to generate audio from the input target text features using an audio generation big model, thereby obtaining the target audio corresponding to the target text features. The audio generation big model is trained on an artificial intelligence content generation big model based on the vehicle operation manual and vehicle operation images. The vehicle operation manual includes vehicle operation solutions for various needs, scenarios, and fault information.
[0013] According to another aspect of the present invention, an electronic device is provided, the electronic device comprising:
[0014] At least one processor; and
[0015] A memory communicatively connected to the at least one processor; wherein,
[0016] The memory stores a computer program that can be executed by the at least one processor, the computer program being executed by the at least one processor to enable the at least one processor to perform the audio generation method according to any embodiment of the present invention.
[0017] According to another aspect of the present invention, a computer-readable storage medium is provided, the computer-readable storage medium storing computer instructions for causing a processor to execute and implement the audio generation method according to any embodiment of the present invention.
[0018] The technical solution of this invention involves acquiring a first input voice and determining the first text information corresponding to that voice; acquiring target sensor data and determining target text features based on the first text information and the target sensor data, wherein the target text features include requirements, scenarios, and fault information. Converting text information into text features improves the efficiency of vehicle operation plan retrieval; and generating audio from the input target text features using an audio generation model to obtain target audio corresponding to the target text features. The audio generation model is trained on an AI-generated content model based on vehicle operation manuals and vehicle operation images. The vehicle operation manual includes vehicle operation plans under various requirements, scenarios, and fault information conditions. This achieves the effect of obtaining target audio corresponding to vehicle operation plans without requiring users to use specific equipment to scan vehicle parts, simply through voice input, thus improving the convenience of vehicle operation plan retrieval; it achieves the effect of generating targeted target audio based on different user descriptions, improving the versatility of the audio generation method; and it achieves the effect of simultaneously displaying vehicle operation plans based on video and narration, improving the richness of target audio and ensuring a good user experience.
[0019] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of the present invention, nor is it intended to limit the scope of the invention. Other features of the invention will become readily apparent from the following description.
Attached Figure Description
[0020] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0021] Figure 1 is a flowchart of an audio generation method provided according to Embodiment 1 of the present invention;
[0022] Figure 2 is a flowchart of an audio generation method provided according to Embodiment 2 of the present invention;
[0023] Figure 3 is an overall flowchart of an audio generation method provided according to an embodiment of the present invention;
[0024] Figure 4 is a schematic diagram of the structure of an audio generation device according to Embodiment 3 of the present invention;
[0025] Figure 5 is a schematic diagram of the structure of an electronic device that implements the audio generation method of this invention.
Detailed Implementation
[0026] To enable those skilled in the art to better understand the present invention, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present invention.
[0027] It should be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings of this invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of the invention described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.
[0028] Embodiment 1
[0029] Figure 1 is a flowchart of an audio generation method provided in Embodiment 1 of the present invention. This embodiment is applicable to situations where content is generated by artificial intelligence. The method can be executed by an audio generation device, which can be implemented in hardware and/or software and can be configured within computer software. As shown in Figure 1, the method includes:
[0030] S110. Obtain the first input voice and determine the first text information corresponding to the first input voice.
[0031] The first input voice can be understood as the user's voice. Optionally, the first input voice can be voice representing the user's intention. In this embodiment of the invention, the first input voice is related to the application scenario and is not specifically limited here. For example, the first input voice can be voice input by the user such as "how to remove fog from the car", "tire damage", or "fuel leak".
[0032] The first text information can be understood as text information obtained by performing speech recognition on the first input speech. In this embodiment of the invention, the first text information is related to the first input speech, and is not specifically limited here. For example, the first text information may be text information such as "how to remove fog from the car", "damaged tire", or "fuel tank leak".
[0033] S120. Obtain target sensing data, and determine target text features based on the first text information and the target sensing data.
[0034] The target sensing data can be understood as data acquired by sensors installed on the target vehicle. The sensors installed on the target vehicle can be preset according to scenario requirements and are not specifically limited here. For example, the sensors may be temperature sensors, pressure sensors, rotational speed sensors, velocity sensors, and/or acceleration sensors.
[0035] Optionally, the target sensing data may be data characterizing the scene in the target text features. In this embodiment of the invention, the target sensing data is related to the sensor and the application scenario, and is not specifically limited here. For example, the target sensing data may be temperature data, pressure data, rotational speed data, velocity data, and / or acceleration data, etc.
[0036] The target text features can be understood as the features required to generate the target audio. Optionally, the target text features include requirements, scenarios, and fault information. For example, when the first input voice is "how to remove car fog," the first text information determined from it is "how to remove car fog." Natural language processing is then performed on the first text information to determine that the requirement is to remove car fog and the fault information is car fog, while the scenario is a seasonal feature (spring, summer, autumn, or winter) determined from the temperature sensor data.
[0037] S130. Use the large audio generation model to generate audio from the input target text features, obtaining the target audio corresponding to the target text features.
[0038] The audio generation model can be understood as a large model that can intelligently generate target audio based on the features of the target text. Specifically, the audio generation model is trained on an AI-generated content model based on vehicle operation manuals and vehicle operation images. The vehicle operation manuals include vehicle operation solutions for various needs, scenarios, and fault information.
[0039] It should be understood that the vehicle operation manual includes vehicle operation solutions for various needs, scenarios, and fault information conditions. In this embodiment of the invention, the audio generation model can intelligently generate the target audio corresponding to the target text features based on the vehicle operation manual and the vehicle operation images.
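Steps S110 to S130 can be sketched as a small pipeline. This is only a minimal illustration under stated assumptions, not the implementation described by the invention: the names `TargetTextFeatures` and `generate_target_audio`, the `outside_temp_c` key, the 5 °C threshold, and the three stub components are all hypothetical stand-ins for the speech recognition, feature determination, and large audio generation model described above.

```python
from dataclasses import dataclass
from typing import Callable, Mapping


@dataclass(frozen=True)
class TargetTextFeatures:
    """Hypothetical container for the three features named in the text."""
    requirement: str
    scenario: str
    fault: str


def generate_target_audio(
    first_input_speech: bytes,
    target_sensing_data: Mapping[str, float],
    recognize: Callable[[bytes], str],
    determine_features: Callable[[str, Mapping[str, float]], TargetTextFeatures],
    audio_model: Callable[[TargetTextFeatures], str],
) -> str:
    # S110: first input speech -> first text information
    first_text = recognize(first_input_speech)
    # S120: first text information + sensing data -> target text features
    features = determine_features(first_text, target_sensing_data)
    # S130: target text features -> target audio (a string handle in this sketch)
    return audio_model(features)


# Stub components, for illustration only (all assumptions).
def fake_recognize(speech: bytes) -> str:
    return speech.decode("utf-8")


def fake_determine(text: str, sensing: Mapping[str, float]) -> TargetTextFeatures:
    # Assumed rule: a cold outside temperature is treated as a winter scenario.
    scenario = "winter" if sensing.get("outside_temp_c", 20.0) < 5.0 else "other"
    return TargetTextFeatures("remove car fog", scenario, "car fog")


def fake_model(features: TargetTextFeatures) -> str:
    return f"video+narration: {features.requirement} ({features.scenario})"


result = generate_target_audio(
    b"how to remove fog from the car",
    {"outside_temp_c": -1.0},
    fake_recognize,
    fake_determine,
    fake_model,
)
```

Passing the recognizer, feature extractor, and model in as callables keeps the three steps separable, so each stub could be replaced by a real component without changing the pipeline itself.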
[0040] Optionally, the target audio includes the target video and the target narration corresponding to the target video. After generating audio from the input target text features using a large audio generation model to obtain the target audio corresponding to the target text features, the method further includes:
[0041] The target video is displayed on the target display of the target vehicle, and the target narration is played through the target player of the target vehicle.
[0042] In this invention, the target vehicle can be understood as the vehicle to which the audio generation is directed. The target display can be understood as a device installed on the target vehicle that has video display capabilities. For example, the target display can be a liquid crystal display (LCD). The target player can be understood as a device installed on the target vehicle that has audio playback capabilities. For example, the target player can be a speaker. In this embodiment of the invention, the target display and the target player can be preset according to scenario requirements, and are not specifically limited here.
[0043] The target video can be understood as a video generated by the audio generation model based on the target text features. The target narration can be understood as a narration corresponding to the target video generated by the audio generation model based on the target text features. For example, the target video can be an operation video for removing car fog, and the target narration can be a narration of the operation video for removing car fog. In this embodiment of the invention, the display of the target video is synchronized with the playback of the target narration.
[0044] Optionally, the large model for AI-generated content includes at least one of deep variational autoencoders, generative adversarial neural networks, diffusion models, language models, and visual models.
[0045] The deep variational autoencoder (VAE) comprises an encoder and a decoder. The encoder maps the original high-dimensional input data to a probability distribution over the latent space, and the decoder generates new data from samples drawn from that distribution.
[0046] The generative adversarial network (GAN) consists of a generator and a discriminator. The generator learns to produce plausible data, which serve as negative samples for the discriminator; the discriminator in turn determines whether its input is generated data or real data.
[0047] The diffusion model comprises a forward diffusion process and a reverse diffusion process. In the forward process, noise is gradually added to the image until it becomes completely random noise. In the reverse process, a Markov chain progressively removes the predicted noise at each time step, thereby recovering the data from Gaussian noise.
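For reference, one widely used formulation of these two processes (the denoising diffusion probabilistic model) can be written as below. The source does not state which formulation is used, so the noise schedule \beta_t and the learned mean \mu_\theta and covariance \Sigma_\theta are assumptions introduced here for illustration.

```latex
% Forward process: add Gaussian noise at each step t = 1, ..., T
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)

% Reverse process: a learned Markov chain that removes noise step by step
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)
```

Here x_0 is the original data and x_T is approximately pure Gaussian noise, which matches the description of the forward stage ending in completely random noise.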
[0048] The language model can be understood as a model capable of tasks such as speech recognition and machine translation.
[0049] The visual model, for example a Vision Transformer (ViT), is a model that can perceive and understand visual data and that supports artificial intelligence generated content (AIGC) technology.
[0050] The technical solution of this invention involves acquiring a first input voice and determining the first text information corresponding to that voice; acquiring target sensor data and determining target text features based on the first text information and the target sensor data, wherein the target text features include requirements, scenarios, and fault information. Converting text information into text features improves the efficiency of vehicle operation plan retrieval; and generating audio from the input target text features using an audio generation model to obtain target audio corresponding to the target text features. The audio generation model is trained on an AI-generated content model based on vehicle operation manuals and vehicle operation images. The vehicle operation manual includes vehicle operation plans under various requirements, scenarios, and fault information conditions. This achieves the effect of obtaining target audio corresponding to vehicle operation plans without requiring users to use specific equipment to scan vehicle parts, simply through voice input, thus improving the convenience of vehicle operation plan retrieval; it achieves the effect of generating targeted target audio based on different user descriptions, improving the versatility of the audio generation method; and it achieves the effect of simultaneously displaying vehicle operation plans based on video and narration, improving the richness of target audio and ensuring a good user experience.
[0051] Embodiment 2
[0052] Figure 2 is a flowchart of an audio generation method provided in Embodiment 2 of the present invention. This embodiment refines the method described in the above embodiment for determining target text features based on the first text information and the target sensing data. As shown in Figure 2, the method includes:
[0053] S210. Obtain the first input voice and determine the first text information corresponding to the first input voice.
[0054] S220. Acquire target sensing data, perform natural language processing on the first text information to obtain the first text feature corresponding to the first text information, and determine the target sensing feature based on the target sensing data.
[0055] The first text feature can be understood as the feature obtained by performing natural language processing on the first text information. Optionally, the first text feature may include requirements, scenarios, fault information, and/or other information. Optionally, the first text feature is obtained by inputting the first text information into a trained natural language processing model.
[0056] The target sensing feature can be understood as the feature corresponding to the target sensing data. Optionally, the target sensing feature may include scene and / or other information. Specifically, feature extraction is performed on the target sensing data to obtain the target sensing feature corresponding to the target sensing data. For example, if the target sensing data is an outside temperature of -1°C obtained by a temperature sensor, then the determined target sensing feature could be that the scene is winter, and the generated target audio could be the audio corresponding to the vehicle operation scheme of turning on the air conditioning to remove fog from the car.
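The feature extraction on sensing data described above can be sketched as a simple threshold mapping. The thresholds, the `outside_temp_c` key, and the function names are assumptions made for illustration; the source only gives the single example that an outside temperature of -1°C corresponds to a winter scene.

```python
def scene_from_outside_temperature(temp_c: float) -> str:
    """Map an outside temperature reading to a coarse seasonal scene.

    The 5 and 25 degree thresholds are illustrative assumptions,
    not values taken from the source.
    """
    if temp_c < 5.0:
        return "winter"
    if temp_c >= 25.0:
        return "summer"
    return "spring_or_autumn"


def extract_sensing_features(sensing_data: dict) -> dict:
    """Build the target sensing feature from raw target sensing data."""
    features = {}
    if "outside_temp_c" in sensing_data:
        features["scenario"] = scene_from_outside_temperature(
            sensing_data["outside_temp_c"]
        )
    return features


winter_features = extract_sensing_features({"outside_temp_c": -1.0})
```

When no temperature reading is available, the sketch returns an empty feature set, so the scenario would be treated as missing by any later completeness check.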
[0057] S230. If the first text feature and the target sensing feature satisfy the first feature condition, the first text feature and the target sensing feature are used as the target text feature.
[0058] The first feature condition can be understood as the condition that must be met for the target text features to be determined. Optionally, the first feature condition may be that the first text feature and the target sensing feature together include requirements, scenarios, and fault information.
[0059] S240. If the first text feature and the target sensing feature do not satisfy the first feature condition, generate target output speech based on the first text feature and the target sensing feature, and determine the target text features based on the target output speech.
[0060] Optionally, generating the target output speech based on the first text features and the target sensing features includes:
[0061] Determine the missing feature relative to the first text feature and the target sensing feature, and generate the target output speech based on the missing feature, wherein the missing feature includes at least one of requirement, scenario, and fault information.
[0062] The target output speech can be understood as the speech generated when the first text feature and the target sensing feature do not meet the first feature condition. Optionally, the target output speech may be output speech intended to obtain the missing feature. The missing feature can be understood as the feature absent from both the first text feature and the target sensing feature. For example, if the first input speech is "tire damage", the determined first text information is "tire damage", and the target sensing feature determined by the speed sensor is a speed of 0, then it can be determined that the first text feature and the target sensing feature together include the scenario and the fault information, i.e., the vehicle has stopped and the tire is damaged, while the requirement is absent; the requirement is therefore the missing feature. The target output speech is then generated based on the missing requirement, for example "Should I call for search and rescue or change the tire?".
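The check of the first feature condition and the selection of a follow-up question can be sketched as follows. The dictionary keys, the function names, and the generic prompt texts are hypothetical; the source gives only one context-specific prompt for the tire example and does not specify how prompts are produced.

```python
REQUIRED_FEATURES = ("requirement", "scenario", "fault")

# Generic follow-up questions, one per feature (assumed wording).
PROMPTS = {
    "requirement": "What would you like to do?",
    "scenario": "What situation is the vehicle in?",
    "fault": "What problem are you seeing?",
}


def missing_features(text_features: dict, sensing_features: dict) -> list:
    """Return the required features absent from both feature sets."""
    combined = {**text_features, **sensing_features}
    return [name for name in REQUIRED_FEATURES if not combined.get(name)]


def satisfies_first_condition(text_features: dict, sensing_features: dict) -> bool:
    """First feature condition: requirement, scenario, and fault are all present."""
    return not missing_features(text_features, sensing_features)


def build_target_output_text(missing: list) -> str:
    """Text of the target output speech that asks for the missing features."""
    return " ".join(PROMPTS[name] for name in missing)


# Tire example from the text: fault from speech, scenario from the speed sensor.
text_features = {"fault": "tire damage"}
sensing_features = {"scenario": "vehicle stopped"}
missing = missing_features(text_features, sensing_features)
prompt = build_target_output_text(missing)
```

In this example only the requirement is absent, so the sketch asks a single question aimed at obtaining it.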
[0063] Optionally, determining the target text features based on the target output speech includes:
[0064] Output the target output speech to obtain the second input speech, determine the second text information corresponding to the second input speech, and determine the second text feature corresponding to the second text information;
[0065] If the second text feature satisfies the second feature condition, the first text feature, the target sensing feature, and the second text feature are taken as the target text feature.
[0066] If the second text feature does not meet the second feature condition, return to the operation of outputting the target output speech to obtain the second input speech, determine the second text information corresponding to the second input speech, and determine the second text feature corresponding to the second text information, until the second text feature meets the second feature condition, and obtain the target text feature.
[0067] Optionally, the second feature condition is that the second text feature includes the missing feature.
[0068] The second input voice can be understood as the voice input by the user in response to the target output voice. Optionally, the second input voice may contain the missing feature. In this embodiment of the invention, the second input voice is related to the application scenario and is not specifically limited here. For example, if the target output voice is "Should I call for search and rescue or change the tire?", the second input voice could be "Change the tire" or "Please repeat".
[0069] The second text information is obtained by performing speech recognition on the second input speech. In this embodiment of the invention, the second text information is related to the second input speech, and is not specifically limited here. For example, the second text information may be text information such as "change tire" or "please repeat".
[0070] The second text feature can be understood as the feature obtained by performing natural language processing on the second text information. Optionally, the second text feature may include requirements, scenarios, fault information, and/or other information. Optionally, the second text feature is obtained by inputting the second text information into a trained natural language processing model.
[0071] For example, when the target output voice is "Should I call for search and rescue or change the tire?", the second input voice is "change the tire", and the second text information is "change the tire", the second text feature is determined to be the requirement of changing the tire. The requirement (change the tire), the scenario (the vehicle has stopped), and the fault information (the tire is damaged) are then taken together as the target text features to generate the target audio.
[0072] Alternatively, if the target output voice is "Should I call for search and rescue or change a tire?", and the second input voice is "Please repeat", then the second text information is "Please repeat". In this case, the second text feature is determined to be the other information, that is, the second text feature does not include the missing feature. In other words, the second text feature does not meet the second feature condition. Furthermore, the target output voice, "Should I call for search and rescue or change a tire?", continues to be output until the second text feature includes the missing feature, thus obtaining the target text feature to generate the target audio.
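The repeat-until-satisfied behaviour described in the two cases above can be sketched as a short loop. The function names, the keyword-based extractor, and the scripted user are hypothetical stand-ins for speech output, speech recognition, and the trained natural language processing model. The `max_rounds` limit is an added assumption: the source repeats the question until the missing feature is obtained and does not mention a limit.

```python
from typing import Callable, Dict


def resolve_missing_feature(
    missing: str,
    prompt: str,
    ask_user: Callable[[str], str],
    extract: Callable[[str], Dict[str, str]],
    max_rounds: int = 3,  # assumption: the source states no upper bound
) -> Dict[str, str]:
    """Repeat the prompt until the reply contains the missing feature."""
    for _ in range(max_rounds):
        # Output the target output speech and obtain the second text information.
        second_text = ask_user(prompt)
        second_features = extract(second_text)
        # Second feature condition: the reply includes the missing feature.
        if second_features.get(missing):
            return second_features
    raise RuntimeError(f"no '{missing}' obtained after {max_rounds} rounds")


# Scripted stand-ins for the user and the NLP model (illustration only).
replies = iter(["Please repeat", "Change the tire"])


def scripted_user(prompt: str) -> str:
    return next(replies)


def keyword_extract(text: str) -> Dict[str, str]:
    if "change" in text.lower():
        return {"requirement": "change tire"}
    return {"other": text}


second_features = resolve_missing_feature(
    "requirement",
    "Should I call for search and rescue or change the tire?",
    scripted_user,
    keyword_extract,
)
target_text_features = {
    "scenario": "vehicle stopped",
    "fault": "tire damage",
    **second_features,
}
```

The first scripted reply, "Please repeat", yields only other information, so the prompt is issued again; the second reply supplies the requirement and completes the target text features.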
[0073] S250. The audio generation model generates audio from the input target text features to obtain the target audio corresponding to the target text features. The audio generation model is trained on the artificial intelligence content generation model based on the vehicle operation manual and vehicle operation images. The vehicle operation manual includes vehicle operation solutions under various needs, scenarios and fault information.
[0074] The technical solution of this invention involves performing natural language processing on the first text information to obtain a first text feature corresponding to the first text information, and determining a target sensing feature based on the target sensing data. If the first text feature and the target sensing feature satisfy a first feature condition, the first text feature and the target sensing feature are used as target text features. If the first text feature and the target sensing feature do not satisfy the first feature condition, target output speech is generated based on the first text feature and the target sensing feature, and target text features are determined based on the target output speech. This ensures the comprehensiveness of the determined target text features, thereby improving the accuracy of the target audio generated based on the target text features.
[0075] Figure 3 is an overall flowchart of an audio generation method provided according to an embodiment of the present invention. As shown in Figure 3, the overall process of the audio generation method can be as follows:
[0076] 1. Speech Recognition. Convert the user's voice input into text information.
[0077] 2. Natural Language Processing. Analyze the text information and identify the target text features, namely requirements, scenarios, and fault information.
[0078] 3. AIGC Technology Engine. Based on the target text features, use AIGC technology to generate a video of the vehicle operation solution relevant to the user's needs.
[0079] 4. Video Demonstration. Present the generated video to the user to demonstrate intuitively how to operate the vehicle.
[0080] 5. Speech Synthesis. Add narration to the video to provide richer information.
[0081] Based on the technical solution of this invention, no specific equipment is required; that is, users do not need to use specific devices to scan vehicle parts. They can obtain target audio corresponding to vehicle operation solutions simply by voice input, improving the convenience of vehicle operation solution retrieval. It can improve retrieval efficiency, as voice input and natural language processing technology can more quickly retrieve vehicle operation solutions corresponding to target text features for users. It has greater versatility, as AIGC-based video generation can generate targeted animations based on user descriptions, adapting to various user needs. It is more intuitive, as displaying vehicle operation solutions in video format allows users to more intuitively understand how to operate the vehicle. Furthermore, voice synthesis technology adds narration to the video, enriching the information provided.
[0082] Embodiment 3
[0083] Figure 4 is a schematic diagram of an audio generation device provided in Embodiment 3 of the present invention. As shown in Figure 4, the device includes: a speech processing module 310, a feature determination module 320, and an audio generation module 330; wherein,
[0084] The speech processing module 310 is used to acquire a first input speech and determine the first text information corresponding to the first input speech; the feature determination module 320 is used to acquire target sensing data and determine target text features based on the first text information and the target sensing data, wherein the target text features include requirements, scenarios, and fault information; the audio generation module 330 is used to generate audio from the input target text features using an audio generation big model to obtain the target audio corresponding to the target text features, wherein the audio generation big model is trained on an artificial intelligence content generation big model based on the vehicle operation manual and vehicle operation images, and the vehicle operation manual includes vehicle operation schemes under various requirements, scenarios, and fault information conditions.
[0085] The technical solution of this invention involves acquiring a first input voice and determining the first text information corresponding to that voice; acquiring target sensor data and determining target text features based on the first text information and the target sensor data, wherein the target text features include requirements, scenarios, and fault information. Converting text information into text features improves the efficiency of vehicle operation plan retrieval; and generating audio from the input target text features using an audio generation model to obtain target audio corresponding to the target text features. The audio generation model is trained on an AI-generated content model based on vehicle operation manuals and vehicle operation images. The vehicle operation manual includes vehicle operation plans under various requirements, scenarios, and fault information conditions. This achieves the effect of obtaining target audio corresponding to vehicle operation plans without requiring users to use specific equipment to scan vehicle parts, simply through voice input, thus improving the convenience of vehicle operation plan retrieval; it achieves the effect of generating targeted target audio based on different user descriptions, improving the versatility of the audio generation method; and it achieves the effect of simultaneously displaying vehicle operation plans based on video and narration, improving the richness of target audio and ensuring a good user experience.
[0086] Optionally, the target audio includes the target video and the target narration corresponding to the target video, and the audio generation device further includes a display and playback module, used for:
[0087] After generating audio from the input target text features using a large audio generation model to obtain the target audio corresponding to the target text features, the target video is displayed on the target display of the target vehicle, and the target narration is played through the target player of the target vehicle.
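As a rough illustration of the display and playback module, the sketch below splits the target audio into its video and narration parts and routes each to the matching vehicle output. The `TargetAudio`, `Display`, `Player`, and `RecordingDevice` names are assumptions made for this example; the patent does not specify these interfaces.

```python
from dataclasses import dataclass
from typing import List, Protocol, Tuple


@dataclass
class TargetAudio:
    video: bytes      # target video
    narration: bytes  # target narration corresponding to the target video


class Display(Protocol):
    def show(self, video: bytes) -> None: ...


class Player(Protocol):
    def play(self, narration: bytes) -> None: ...


def display_and_play(target_audio: TargetAudio, display: Display, player: Player) -> None:
    """Show the video on the target display and play the narration on the target player."""
    display.show(target_audio.video)
    player.play(target_audio.narration)


class RecordingDevice:
    """Test double that records calls instead of driving real hardware."""

    def __init__(self) -> None:
        self.events: List[Tuple[str, bytes]] = []

    def show(self, video: bytes) -> None:
        self.events.append(("show", video))

    def play(self, narration: bytes) -> None:
        self.events.append(("play", narration))


device = RecordingDevice()
display_and_play(TargetAudio(b"video", b"narration"), device, device)
```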
[0088] Optionally, the feature determination module 320 includes: a first feature processing unit, a first feature determination unit, and a second feature determination unit; wherein,
[0089] The first feature processing unit is used to perform natural language processing on the first text information to obtain the first text feature corresponding to the first text information, and to determine the target sensing feature based on the target sensing data.
[0090] The first feature determination unit is configured to use the first text feature and the target sensing feature as target text features when the first text feature and the target sensing feature satisfy the first feature condition;
[0091] The second feature determination unit is used to generate target output speech based on the first text feature and the target sensing feature when the first text feature and the target sensing feature do not meet the first feature condition, and to determine target text features based on the target output speech.
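The division of work among these three units can be sketched as follows. This is only an illustration under stated assumptions: the keyword table stands in for natural language processing, the tire pressure threshold is invented for the example, and every function and key name is hypothetical rather than taken from the patent.

```python
from typing import Dict, Optional

CATEGORIES = ("requirement", "scenario", "fault")

# Hypothetical keyword table standing in for natural language processing.
KEYWORDS = {
    "requirement": {"change the tire": "tire replacement"},
    "scenario": {"highway": "on the highway"},
}


def extract_text_features(text: str) -> Dict[str, str]:
    """First feature processing unit, text side: map phrases to feature labels."""
    lowered = text.lower()
    features: Dict[str, str] = {}
    for category, table in KEYWORDS.items():
        for phrase, label in table.items():
            if phrase in lowered:
                features[category] = label
    return features


def extract_sensing_features(sensing: Dict[str, float]) -> Dict[str, str]:
    """First feature processing unit, sensor side. The 180 kPa threshold is an assumption."""
    features: Dict[str, str] = {}
    pressure = sensing.get("tire_pressure_kpa")
    if pressure is not None and pressure < 180.0:
        features["fault"] = "low tire pressure"
    return features


def meets_first_condition(text_features: Dict[str, str], sensing_features: Dict[str, str]) -> bool:
    """First feature condition: all three categories are covered between the two sources."""
    combined = {**text_features, **sensing_features}
    return all(category in combined for category in CATEGORIES)


def first_feature_determination(text: str, sensing: Dict[str, float]) -> Optional[Dict[str, str]]:
    """Return the target text features, or None to hand over to the second unit."""
    text_features = extract_text_features(text)
    sensing_features = extract_sensing_features(sensing)
    if meets_first_condition(text_features, sensing_features):
        return {**text_features, **sensing_features}
    return None


complete = first_feature_determination(
    "How do I change the tire on the highway?", {"tire_pressure_kpa": 150.0}
)
incomplete = first_feature_determination("How do I change the tire?", {})
```

Returning `None` in the incomplete case marks the point where the second feature determination unit would take over and ask the user for more detail.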
[0092] Optionally, the second feature determining unit is used for:
[0093] Output the target output speech to obtain the second input speech, determine the second text information corresponding to the second input speech, and determine the second text feature corresponding to the second text information;
[0094] If the second text feature satisfies the second feature condition, the first text feature, the target sensing feature, and the second text feature are taken as the target text feature.
[0095] If the second text feature does not meet the second feature condition, return to the operation of outputting the target output speech to obtain the second input speech, determine the second text information corresponding to the second input speech, and determine the second text feature corresponding to the second text information, until the second text feature meets the second feature condition, and obtain the target text feature.
[0096] Optionally, the first feature condition is that the first text feature and the target sensing feature include demand, scenario, and fault information; the second feature determination unit is used for:
[0097] Determine the missing features corresponding to the first text feature and the target sensing feature, and generate the target output speech based on the missing features, wherein the missing features include at least one of demand, scenario, and fault information.
[0098] Optionally, the second feature condition is that the second text feature includes the missing feature.
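The follow-up dialogue carried out by the second feature determination unit can be sketched as a loop that asks for the missing features until the reply supplies them. The names `determine_target_features`, `ask_user`, `parse_reply`, and the `max_rounds` cap are assumptions added for this example; the patent itself describes repeating the prompt until the second feature condition is met and does not state an upper bound.

```python
from typing import Callable, Dict, List

CATEGORIES = ("requirement", "scenario", "fault")


def missing_features(features: Dict[str, str]) -> List[str]:
    """Categories not yet covered by the first text feature and the sensing feature."""
    return [c for c in CATEGORIES if c not in features]


def build_target_output_speech(missing: List[str]) -> str:
    """Text of the prompt that would be spoken to the user (wording is hypothetical)."""
    return "Please describe the " + ", ".join(missing) + "."


def determine_target_features(
    known: Dict[str, str],
    ask_user: Callable[[str], str],
    extract: Callable[[str], Dict[str, str]],
    max_rounds: int = 3,  # assumption: a safety cap that the patent does not mention
) -> Dict[str, str]:
    missing = missing_features(known)
    if not missing:
        return dict(known)
    prompt = build_target_output_speech(missing)
    for _ in range(max_rounds):
        second_text = ask_user(prompt)          # output the prompt, receive the reply as text
        second_features = extract(second_text)
        # Second feature condition: the reply contains every missing feature.
        if all(c in second_features for c in missing):
            return {**known, **second_features}
    raise RuntimeError("missing features were not provided")


def parse_reply(text: str) -> Dict[str, str]:
    """Toy parser for replies such as 'scenario: parked at home'."""
    features: Dict[str, str] = {}
    for part in text.split(";"):
        if ":" in part:
            key, value = part.split(":", 1)
            if key.strip() in CATEGORIES:
                features[key.strip()] = value.strip()
    return features


answers = iter(["um", "scenario: parked at home"])
known = {"requirement": "tire replacement", "fault": "low tire pressure"}
result = determine_target_features(known, lambda prompt: next(answers), parse_reply)
```

In this run the first reply carries no usable feature, so the same prompt is issued again, and the second reply satisfies the second feature condition.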
[0099] Optionally, the large AI-generated content model includes at least one of a deep variational autoencoder, a generative adversarial network, a diffusion model, a language model, and a visual model.
[0100] The audio generation apparatus provided in the embodiments of the present invention can execute the audio generation method provided in any embodiment of the present invention, and has the corresponding functional modules and beneficial effects of executing the method.
[0101] Example 4
[0102] Figure 5 shows a schematic diagram of an electronic device 10 that can be used to implement embodiments of the present invention. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device can also represent various forms of mobile devices, such as personal digital processors, cellular phones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementation of the invention described and/or claimed herein.
[0103] As shown in Figure 5, the electronic device 10 includes at least one processor 11 and a memory, such as a read-only memory (ROM) 12 or a random access memory (RAM) 13, communicatively connected to the at least one processor 11. The memory stores computer programs executable by the at least one processor. The processor 11 can perform various appropriate actions and processes based on the computer program stored in the ROM 12 or loaded from the storage unit 18 into the RAM 13. The RAM 13 may also store various programs and data required for the operation of the electronic device 10. The processor 11, ROM 12, and RAM 13 are interconnected via a bus 14. An input/output (I/O) interface 15 is also connected to the bus 14.
[0104] Multiple components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16, such as a keyboard or mouse; an output unit 17, such as various types of displays and speakers; a storage unit 18, such as a magnetic disk or optical disk; and a communication unit 19, such as a network card, modem, or wireless transceiver. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices through computer networks such as the Internet and/or various telecommunications networks.
[0105] The processor 11 can be any of a variety of general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the processor 11 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various special-purpose artificial intelligence (AI) computing chips, various processors running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, or microcontroller. The processor 11 performs the various methods and processes described above, such as the audio generation method.
[0106] In some embodiments, the audio generation method may be implemented as a computer program tangibly contained in a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and / or installed on electronic device 10 via ROM 12 and / or communication unit 19. When the computer program is loaded into RAM 13 and executed by processor 11, one or more steps of the audio generation method described above may be performed. Alternatively, in other embodiments, processor 11 may be configured to perform the audio generation method by any other suitable means (e.g., by means of firmware).
[0107] Various embodiments of the systems and techniques described herein can be implemented in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementations in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a dedicated or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.
[0108] Computer programs used to implement the methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing device, such that when executed by the processor, the computer programs cause the functions/operations specified in the flowcharts and/or block diagrams to be performed. The computer programs may be executed entirely on a machine, partially on a machine, as a standalone software package partially on a machine and partially on a remote machine, or entirely on a remote machine or server.
[0109] In the context of this invention, a computer-readable storage medium can be a tangible medium that may contain or store a computer program for use by or in conjunction with an instruction execution system, apparatus, or device. A computer-readable storage medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination thereof. Alternatively, a computer-readable storage medium may be a machine-readable signal medium. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fibers, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof.
[0110] To provide interaction with a user, the systems and techniques described herein can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the electronic device. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including sound input, voice input, or tactile input).
[0111] The systems and technologies described herein can be implemented in computing systems that include backend components (e.g., as data servers), or computing systems that include middleware components (e.g., application servers), or computing systems that include frontend components (e.g., user computers with graphical user interfaces or web browsers through which users can interact with implementations of the systems and technologies described herein), or any combination of such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication of any form or medium (e.g., communication networks). Examples of communication networks include local area networks (LANs), wide area networks (WANs), blockchain networks, and the Internet.
[0112] A computing system can include clients and servers. Clients and servers are generally located far apart and typically interact through communication networks. The client-server relationship is created by computer programs running on the respective computers and having a client-server relationship with each other. The server can be a cloud server, also known as a cloud computing server or cloud host, which is a hosting product within the cloud computing service system to address the shortcomings of traditional physical hosts and VPS services, such as high management difficulty and weak business scalability.
[0113] It should be understood that the various forms of processes shown above can be used, with steps reordered, added, or deleted. For example, the steps described in this invention can be executed in parallel, sequentially, or in different orders, as long as the desired result of the technical solution of this invention can be achieved, and this is not limited herein.
[0114] The specific embodiments described above do not constitute a limitation on the scope of protection of this invention. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this invention should be included within the scope of protection of this invention.
Claims
1. An audio generation method, characterized by comprising: acquiring a first input speech and determining first text information corresponding to the first input speech; acquiring target sensing data, and determining target text features based on the first text information and the target sensing data, wherein the target text features include requirements, scenarios, and fault information; and generating audio from the input target text features using a large audio generation model to obtain target audio corresponding to the target text features, wherein the large audio generation model is obtained by training a large AI-generated content model on a vehicle operation manual and vehicle operation images, and the vehicle operation manual includes vehicle operation schemes under various requirements, scenarios, and fault information conditions; wherein determining the target text features based on the first text information and the target sensing data includes: performing natural language processing on the first text information to obtain a first text feature corresponding to the first text information, and determining a target sensing feature based on the target sensing data; if the first text feature and the target sensing feature satisfy a first feature condition, using the first text feature and the target sensing feature as the target text features, wherein the first feature condition is that the first text feature and the target sensing feature include demand, scenario, and fault information; and if the first text feature and the target sensing feature do not satisfy the first feature condition, generating a target output speech based on the first text feature and the target sensing feature, and determining the target text features based on the target output speech, wherein the target output speech is an output speech used to obtain the missing features.
2. The method of claim 1, wherein the target audio includes a target video and a target narration corresponding to the target video; and after generating audio from the input target text features using the large audio generation model to obtain the target audio corresponding to the target text features, the method further includes: displaying the target video on a target display of a target vehicle, and playing the target narration through a target player of the target vehicle.
3. The method of claim 1, wherein determining the target text features based on the target output speech includes: outputting the target output speech to obtain a second input speech, determining second text information corresponding to the second input speech, and determining a second text feature corresponding to the second text information; if the second text feature satisfies a second feature condition, using the first text feature, the target sensing feature, and the second text feature as the target text features; and if the second text feature does not satisfy the second feature condition, returning to the operation of outputting the target output speech to obtain the second input speech, determining the second text information corresponding to the second input speech, and determining the second text feature corresponding to the second text information, until the second text feature satisfies the second feature condition, to obtain the target text features.
4. The method of claim 1, wherein the first feature condition is that the first text feature and the target sensing feature include demand, scenario, and fault information; and generating the target output speech based on the first text feature and the target sensing feature includes: determining the missing features corresponding to the first text feature and the target sensing feature, and generating the target output speech based on the missing features, wherein the missing features include at least one of demand, scenario, and fault information.
5. The method of claim 3, wherein the second feature condition is that the second text feature includes the missing feature.
6. The method of claim 1, wherein the large AI-generated content model includes at least one of the following: a deep variational autoencoder, a generative adversarial neural network, a diffusion model, a language model, and a visual model.
7. An audio generation apparatus, characterized by comprising: a speech processing module, used to acquire a first input speech and determine first text information corresponding to the first input speech; a feature determination module, used to acquire target sensing data and determine target text features based on the first text information and the target sensing data, wherein the target text features include requirements, scenarios, and fault information; and an audio generation module, used to generate audio from the input target text features using a large audio generation model to obtain target audio corresponding to the target text features, wherein the large audio generation model is obtained by training a large AI-generated content model on a vehicle operation manual and vehicle operation images, and the vehicle operation manual includes vehicle operation schemes under various requirements, scenarios, and fault information conditions; wherein the feature determination module includes a first feature processing unit, a first feature determination unit, and a second feature determination unit; the first feature processing unit is used to perform natural language processing on the first text information to obtain a first text feature corresponding to the first text information, and to determine a target sensing feature based on the target sensing data; the first feature determination unit is configured to, when the first text feature and the target sensing feature satisfy a first feature condition, use the first text feature and the target sensing feature as the target text features, wherein the first feature condition is that the first text feature and the target sensing feature include demand, scenario, and fault information; and the second feature determination unit is configured to generate a target output speech based on the first text feature and the target sensing feature when the first text feature and the target sensing feature do not satisfy the first feature condition, and to determine the target text features based on the target output speech, wherein the target output speech is an output speech used to obtain the missing features.
8. An electronic device, characterized by comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor to enable the at least one processor to perform the audio generation method according to any one of claims 1-6.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions that, when executed, cause a processor to perform the audio generation method of any one of claims 1-6.