Audio generation method and apparatus

Music information retrieval (MIR) is used to generate accompaniment audio that closely matches the humming audio, addressing the poor accompaniment adaptability of existing technologies and improving the quality of music generated from humming.

WO2026056286A9 · PCT designated stage · Publication Date: 2026-06-11 · HUAWEI TECH CO LTD

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: HUAWEI TECH CO LTD
Filing Date: 2025-04-28
Publication Date: 2026-06-11

AI Technical Summary

Technical Problem

Existing technology for generating music from humming suffers from poor accompaniment adaptability, so the quality of the generated music is not robust.

Method used

After the humming audio is acquired, music information retrieval (MIR) is used to extract information such as beat, pitch, and tone, and accompaniment audio that closely matches the humming audio is generated from it. Combined with multi-track MIDI scores and virtual instrument rendering, this yields highly adaptable target audio.

Benefits of technology

It improves the adaptability and robustness of music generation from humming, and the generated accompaniment audio closely matches the humming audio, making the approach suitable for both ordinary users and professional singer-songwriters.



Abstract

An audio generation method and apparatus. The audio generation method (300) comprises: acquiring a hummed audio (301), the hummed audio being inputted by a user by means of a human-machine interaction interface; performing MIR on the hummed audio to obtain musical information of the hummed audio (302), the musical information comprising meter information, pitch information, and tone information; and generating a target audio on the basis of the musical information of the hummed audio (303), the target audio at least comprising an accompaniment audio corresponding to the hummed audio. A highly adaptable accompaniment or song can be obtained, thereby improving the robustness of humming-based accompaniment generation.

Description

Audio generation method and apparatus

Technical Field

[0001] This application relates to audio processing technology, and more particularly to an audio generation method and apparatus.

Background Technology

[0002] The Hum-to-Generate-Music task aims to automatically generate music from a user's humming such that the key, rhythm, and style of the generated music closely match the humming. This task can be applied in various scenarios, such as music composition, karaoke, and personalized customization.

[0003] Related technologies can generate accompaniment tracks by using the semantic information of the humming, restricting templates, and adjusting audio tracks, thereby generating music from humming. However, these technologies have poor accompaniment adaptability, cannot achieve automatic recognition and adaptation, and produce results with low robustness.

Summary of the Invention

[0004] This application provides an audio generation method and apparatus to obtain a highly adaptable accompaniment or song and to improve the robustness of music generation from humming.

[0005] In a first aspect, this application provides an audio generation method, comprising: acquiring a humming audio, wherein the humming audio is input by a user through a human-computer interaction interface; performing music information retrieval (MIR) on the humming audio to obtain music information of the humming audio, wherein the music information includes beat information, pitch information, and tone information; and generating a target audio based on the music information of the humming audio, wherein the target audio includes at least an accompaniment audio corresponding to the humming audio.

[0006] In this embodiment, musical information such as the beat, pitch, and tone of the humming audio is obtained through MIR (music information retrieval). This information is then used to calculate matching information between the accompaniment audio and the user's humming, including vocal timing, the role and mood of the accompaniment in different musical sections, and chord progressions. The corresponding target audio is then obtained based on this musical information. Both ordinary users and professional singer-songwriters can therefore obtain a highly adaptable accompaniment or song through this embodiment, which improves the robustness of music generation from humming.

[0007] The humming audio can be input by the user through a human-computer interaction interface.

[0008] Musical information can include meter, pitch, and tone. Optionally, it can also include phrase information. Meter (also known as time) refers to the regularly recurring pattern of accents; pitch refers to how high or low a sound is; tone refers to whether a sound can be perceived as the central note (tonic); and a phrase is the smallest structure in a musical work that expresses a complete or relatively complete musical idea, that is, the smallest structure that constitutes an independent section.
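As an illustration of how pitch can be expressed numerically, the following minimal sketch converts a fundamental frequency in hertz to the nearest MIDI note number using the standard equal-temperament relation (A4 = 440 Hz = MIDI note 69). The function names are hypothetical and are not taken from this application, which does not specify a pitch representation.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def hz_to_midi(freq_hz: float) -> int:
    """Round a frequency to the nearest equal-tempered MIDI note number."""
    if freq_hz <= 0:
        raise ValueError("frequency must be positive")
    # 12 semitones per octave, anchored at A4 = 440 Hz = note 69.
    return round(69 + 12 * math.log2(freq_hz / 440.0))


def midi_to_name(note: int) -> str:
    """Name a MIDI note using the convention in which note 60 is C4."""
    return f"{NOTE_NAMES[note % 12]}{note // 12 - 1}"
```

For example, `midi_to_name(hz_to_midi(440.0))` yields `"A4"`.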

[0009] In this embodiment, music information can be obtained by performing music information retrieval (MIR) on the humming audio. After MIR, the beat information, pitch information, tone information, and musical phrase information are obtained from the music information. In this embodiment, the music information can be displayed on the human-computer interaction interface, so that users can intuitively see the beat, pitch, tone, and musical phrase of the audio they are humming.

[0010] In one possible implementation, MIR can be implemented as an AI model, that is, the humming audio is input into the AI model and the musical information of the humming audio is output.

[0011] In one possible implementation, MIR can be achieved by using multiple AI models in combination. Specifically, the humming audio is input into the first AI model, which outputs beat information; the beat information and the humming audio are input into the second AI model, which outputs pitch information; the pitch information of the humming audio is input into the third AI model, which outputs tone information; and the humming audio is input into the fourth AI model, which outputs musical phrase information.
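The data flow between the four models described above can be sketched as follows. This is only an illustration of the wiring: the models are passed in as plain callables, and the names `MusicInfo`, `run_mir`, and the field names are assumptions made for this sketch rather than anything defined in this application.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Audio = Sequence[float]  # mono samples; a placeholder type for this sketch


@dataclass
class MusicInfo:
    beats: List[float]                  # beat times in seconds
    pitches: List[int]                  # e.g. one MIDI note number per note
    tone: str                           # tone (key) label, e.g. "C major"
    phrases: List[Tuple[float, float]]  # (start, end) of each phrase in seconds


def run_mir(
    audio: Audio,
    beat_model: Callable[[Audio], List[float]],
    pitch_model: Callable[[List[float], Audio], List[int]],
    tone_model: Callable[[List[int]], str],
    phrase_model: Callable[[Audio], List[Tuple[float, float]]],
) -> MusicInfo:
    beats = beat_model(audio)             # first model: audio -> beat
    pitches = pitch_model(beats, audio)   # second model: beat + audio -> pitch
    tone = tone_model(pitches)            # third model: pitch -> tone
    phrases = phrase_model(audio)         # fourth model: audio -> phrases
    return MusicInfo(beats, pitches, tone, phrases)
```

Passing the models as arguments keeps the sketch independent of any particular model architecture; in practice each callable would wrap a pre-trained model.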

[0012] In one possible implementation, MIR can be achieved by matching the original score with multiple AI models. Specifically, Automatic Speech Recognition (ASR) is performed on the humming audio to obtain the lyrics; the lyrics of the humming audio are matched with the original score to obtain the beat information, where the original score is the score of the original piece of music corresponding to the humming audio; the beat information and the humming audio are input into a second AI model, which outputs pitch information; and the pitch information of the humming audio is input into a third AI model, which outputs tone information.
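One way the lyric-to-score matching step might look is sketched below. It assumes that the ASR output is a list of words and that the original score is available as (word, beat position) pairs; both representations, and the function name `beats_from_score`, are assumptions for illustration. The alignment uses Python's standard `difflib.SequenceMatcher`, which is a stand-in and not necessarily the matching method used in this application.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def beats_from_score(
    asr_words: List[str], score: List[Tuple[str, float]]
) -> List[Tuple[str, float]]:
    """Give each recognized word the beat position of its matching score word."""
    score_words = [word for word, _ in score]
    matcher = SequenceMatcher(None, asr_words, score_words, autojunk=False)
    aligned: List[Tuple[str, float]] = []
    # Each matching block is a run of identical words in both sequences.
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            aligned.append((asr_words[block.a + k], score[block.b + k][1]))
    return aligned
```

Words that the user skipped or that ASR misrecognized simply produce no alignment entry, so the resulting beat information may be sparser than the score.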

[0013] Whether using one AI model or multiple AI models, these AI models can be pre-trained to achieve the aforementioned functions required by MIR.

[0014] In one possible implementation, the target audio includes at least the accompaniment audio corresponding to the humming audio.

[0015] An accompaniment score can be generated based on the musical information of the humming audio. This accompaniment score includes Musical Instrument Digital Interface (MIDI) scores corresponding to multiple orchestrations. In this step, chord progressions can be generated based on the pitch and tone information in the musical information. The smallest chord unit is two beats. The main logic is to calculate, for each chord unit, the chord that matches the most pitches and functions best (e.g., according to cadences and other common rules in popular music). Optionally, to make different chord functions easier to represent and reuse, all chord progressions can be represented using Roman numerals. The accompaniment score is then generated based on the musical information and chord progressions of the humming audio: the style input by the user through the human-computer interaction interface (see the description above) is acquired; multiple orchestrations are determined based on the style and chord progressions; and the accompaniment score is generated based on the multiple orchestrations and the musical information of the humming audio.
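The chord selection logic can be illustrated with a minimal sketch. It assumes a major key, considers only the seven diatonic triads, scores each triad by how many melody notes in the two-beat unit are chord tones, and breaks ties with a fixed preference order. The preference order and all names below are assumptions for illustration; this application does not specify the functional rules in this form.

```python
from typing import Dict, List, Set

# Diatonic triads of a major key as pitch classes relative to the tonic (0).
TRIADS: Dict[str, Set[int]] = {
    "I": {0, 4, 7}, "ii": {2, 5, 9}, "iii": {4, 7, 11}, "IV": {5, 9, 0},
    "V": {7, 11, 2}, "vi": {9, 0, 4}, "vii°": {11, 2, 5},
}
# Assumed tie-break order favouring common pop functions (not from the source).
PREFERENCE: List[str] = ["I", "V", "IV", "vi", "ii", "iii", "vii°"]


def chord_for_unit(midi_notes: List[int], tonic_pc: int) -> str:
    """Pick the Roman-numeral chord matching the most notes in one two-beat unit."""
    relative = [(note - tonic_pc) % 12 for note in midi_notes]

    def score(name: str):
        hits = sum(1 for pc in relative if pc in TRIADS[name])
        return (hits, -PREFERENCE.index(name))  # more hits first, then preference

    return max(PREFERENCE, key=score)


def progression(units: List[List[int]], tonic_pc: int) -> List[str]:
    """Map each two-beat unit of melody notes to a Roman-numeral chord."""
    return [chord_for_unit(unit, tonic_pc) for unit in units]
```

Because the result is expressed in Roman numerals, the same progression can be reused in any key by changing only `tonic_pc`.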

[0016] 1. Piano / Guitar / Bass: The main function of these parts is to provide chord support for the song and serve as the main accompaniment instruments. Therefore, the generation method is usually the same as chord progressions, following a two-beat unit. The system then refers to the chord tone and rhythm pattern library to generate the corresponding accompaniment pattern for each unit.

[0017] 2. Drums: Drums are generated in two- or four-beat units. However, they rely more on the timing of measures within the structure; for example, a fill-in pattern may occur at the end of a section. Furthermore, because drum instruments are identified by number in the MIDI format, each sound source needs to be specifically mapped to its corresponding drum instrument.

[0018] 3. Strings (Cello and other string instruments): Strings follow a non-linear generation method, which is more complex than the rhythmic logic of other instruments. Strings provide more contrapuntal melodies and layered roles, generating contrapuntal melodies based on major scale patterns while referencing chord progressions to ensure that the generated variable-length notes fit the chord progression within the time frame.

[0019] 4. Other: Finally, there is other miscellaneous generation logic, such as motifs, doubled melodies, and fill patterns for a specific number of measures. It also includes configuring the note velocity of the piano part so that the overall arrangement sounds more like a live performance.
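The per-part generation described in the list above can be sketched for two of the parts: a chordal part built from a rhythm pattern library, and a drum part with a fill in the final unit. The pattern library, the chord voicings, and the function names are hypothetical and deliberately tiny. The drum numbers are the General MIDI percussion key numbers, which are conventionally played on MIDI channel 10.

```python
from typing import Dict, List, Tuple

# (start_beat, length_beats, midi_note, velocity)
Note = Tuple[float, float, int, int]

# General MIDI percussion key numbers.
GM_DRUMS: Dict[str, int] = {"kick": 36, "snare": 38, "closed_hat": 42, "low_tom": 45}

# Hypothetical rhythm pattern library: onsets (in beats) inside a two-beat unit.
RHYTHMS: Dict[str, List[float]] = {
    "quarters": [0.0, 1.0],
    "eighths": [0.0, 0.5, 1.0, 1.5],
}

# Illustrative C major voicings only; a real system would derive these from the key.
CHORD_TONES: Dict[str, List[int]] = {
    "I": [60, 64, 67], "IV": [65, 69, 72], "V": [67, 71, 74],
}


def chord_track(progression: List[str], rhythm: str, velocity: int = 80) -> List[Note]:
    """Play each unit's chord tones on the chosen rhythm (two beats per unit)."""
    notes: List[Note] = []
    for i, chord in enumerate(progression):
        for offset in RHYTHMS[rhythm]:
            for pitch in CHORD_TONES[chord]:
                notes.append((2 * i + offset, 0.5, pitch, velocity))
    return notes


def drum_track(num_units: int, fill_last: bool = True) -> List[Note]:
    """Kick, snare and hi-hat per two-beat unit, with a tom fill in the last unit."""
    notes: List[Note] = []
    for i in range(num_units):
        start = 2.0 * i
        if fill_last and i == num_units - 1:
            # End of section: replace the groove with a simple tom fill.
            notes += [(start + k * 0.5, 0.25, GM_DRUMS["low_tom"], 100) for k in range(4)]
            continue
        notes.append((start, 0.25, GM_DRUMS["kick"], 100))
        notes.append((start + 1.0, 0.25, GM_DRUMS["snare"], 100))
        notes += [(start + k * 0.5, 0.25, GM_DRUMS["closed_hat"], 70) for k in range(4)]
    return notes
```

Varying the `velocity` argument per note (for example, slightly louder on downbeats) is one simple way to approximate the live-performance feel mentioned in item 4.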

[0020] The target audio can be obtained by mixing based on the aforementioned accompaniment score. Specifically, virtual instrument rendering and single-track sound effect processing are performed on the MIDI scores corresponding to the multiple orchestrations to obtain multiple single-track audio tracks. These single-track audio tracks are then mixed into multi-track stereo to obtain the target audio. In this way, the target audio can incorporate the timbres of the various orchestrations, playing the melody in a chordal manner, resulting in an accompaniment track that is highly compatible with the user's humming audio.
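A minimal sketch of the final multi-track stereo mixing step follows. It assumes each rendered single-track audio is a list of mono samples with a gain and a pan position, applies a constant-power pan law, sums the tracks, and scales the result down only if it would exceed full scale. The pan law, the normalization, and the function name are assumptions for illustration; this application does not specify the mixing algorithm.

```python
import math
from typing import List, Sequence, Tuple

# (mono samples, gain, pan) where pan runs from -1.0 (left) to +1.0 (right)
Track = Tuple[Sequence[float], float, float]


def mix_stereo(tracks: Sequence[Track]) -> List[Tuple[float, float]]:
    """Mix mono tracks into a list of (left, right) stereo frames."""
    length = max(len(samples) for samples, _, _ in tracks)
    left = [0.0] * length
    right = [0.0] * length
    for samples, gain, pan in tracks:
        angle = (pan + 1.0) * math.pi / 4.0  # constant-power pan law
        gain_l, gain_r = gain * math.cos(angle), gain * math.sin(angle)
        for i, sample in enumerate(samples):
            left[i] += gain_l * sample
            right[i] += gain_r * sample
    # Scale down only when the sum would clip beyond the [-1, 1] range.
    peak = max(max(abs(x) for x in left), max(abs(x) for x in right), 1.0)
    return [(l / peak, r / peak) for l, r in zip(left, right)]
```

The function expects at least one non-empty track; a production mixer would also handle sample-rate differences and per-track effects before summing.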

[0021] In one possible implementation, the target audio includes the accompaniment audio corresponding to the humming audio and the vocal audio within the humming audio.

[0022] The system can preprocess the humming audio to obtain a single-track vocal audio; generate an accompaniment score based on the musical information of the humming audio, which includes multi-track MIDI scores corresponding to multiple instruments; render virtual instruments and perform single-track sound effects processing on the corresponding instruments based on the multi-track MIDI scores to obtain multiple single-track accompaniment audio; obtain multiple single-track audio from the single-track vocal audio and the multiple single-track accompaniment audio; and perform multi-track stereo mixing on the multiple single-track audio to obtain the target audio. Compared to the target audio generated in the previous method, this method adds the user's voice to the target audio, meaning the target audio includes both vocals and accompaniment, resulting in a song with a high degree of integration between vocals and accompaniment.

[0023] In a second aspect, this application provides an audio generation device, comprising: a human-computer interaction module for acquiring humming audio, wherein the humming audio is input by a user through a human-computer interaction interface; a music information retrieval (MIR) module for performing MIR on the humming audio to obtain music information of the humming audio, wherein the music information includes beat information, pitch information, and tone information; and an audio generation module for generating target audio based on the music information of the humming audio, wherein the target audio includes at least an accompaniment audio corresponding to the humming audio.

[0024] In one possible implementation, the MIR module is specifically used to input the humming audio into an artificial intelligence (AI) model and output the musical information of the humming audio.

[0025] In one possible implementation, the MIR module is specifically used to input the humming audio into a first AI model and output the beat information; input the beat information and the humming audio into a second AI model and output the pitch information; and input the pitch information of the humming audio into a third AI model and output the tone information.

[0026] In one possible implementation, the MIR module is specifically used to perform Automatic Speech Recognition (ASR) on the humming audio to obtain the lyrics of the humming audio; match the lyrics of the humming audio with the original musical score to obtain the beat information, wherein the original musical score is the score of the original piece of music corresponding to the humming audio; input the beat information and the humming audio into a second AI model to output the pitch information; and input the pitch information of the humming audio into a third AI model to output the tone information.

[0027] In one possible implementation, the musical information of the humming audio also includes musical phrase information; the MIR module is further used to input the humming audio into a fourth AI model and output the musical phrase information.

[0028] In one possible implementation, the audio generation module is specifically used to generate an accompaniment score based on the musical information of the humming audio, the accompaniment score including multi-track Musical Instrument Digital Interface (MIDI) scores corresponding to multiple orchestrations; and to perform mixing processing based on the accompaniment score to obtain the target audio.

[0029] In one possible implementation, the audio generation module is specifically used to generate chord progressions based on the pitch information and the tone information; and to generate the accompaniment score based on the musical information of the humming audio and the chord progressions.

[0030] In one possible implementation, the audio generation module is specifically used to acquire the style input by the user through a human-computer interaction interface; determine the plurality of orchestrations based on the style and the chord progressions; and generate the accompaniment score based on the plurality of orchestrations and the musical information of the humming audio.

[0031] In one possible implementation, the audio generation module is specifically used to render virtual instruments and perform single-track sound effects processing on the corresponding multitrack MIDI scores of the multiple orchestrations to obtain multiple single-track audio; and to perform multitrack stereo mixing processing on the multiple single-track audio to obtain the target audio.

[0032] In one possible implementation, the target audio further includes the vocal audio from the humming audio; the audio generation module is further configured to perform vocal preprocessing on the humming audio to obtain a single-track vocal audio; to perform virtual instrument rendering and single-track sound effect processing on the corresponding multi-track MIDI scores for multiple orchestrations to obtain multiple single-track accompaniment audio; to obtain multiple single-track audio based on the single-track vocal audio and the multiple single-track accompaniment audio; and to perform multi-track stereo mixing on the multiple single-track audio to obtain the target audio.

[0033] In a third aspect, this application provides a terminal device, comprising: one or more processors; and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method described in any one of the implementations of the first aspect above.

[0034] In a fourth aspect, this application provides a computer-readable storage medium, characterized in that it includes a computer program which, when executed on a computer, causes the computer to perform the method described in any one of the implementations of the first aspect above.

[0035] In a fifth aspect, this application provides a computer program product, characterized in that the computer program product includes computer program code which, when run on a computer, causes the computer to perform the method described in any one of the implementations of the first aspect above.

Description of the Drawings

[0036] Figure 1 shows a schematic diagram of the structure of the terminal device 100;

[0037] Figure 2 is a software structure block diagram of the terminal device 100 of this application;

[0038] Figure 3 is a flowchart of process 300 of the audio generation method provided in an embodiment of this application;

[0039] Figure 4 is a schematic diagram of the human-computer interaction interface according to an embodiment of this application;

[0040] Figure 5 is a schematic diagram of music information according to an embodiment of this application;

[0041] Figure 6 is a schematic diagram of the architecture of the MIR system 600 according to an embodiment of this application;

[0042] Figure 7 is a schematic diagram of the structure of a CNN according to an embodiment of this application;

[0043] Figure 8 is a schematic diagram of the accompaniment score of an embodiment of this application;

[0044] Figure 9 is an architecture diagram of the audio generation system according to an embodiment of this application;

[0045] Figure 10 is a flowchart of the audio generation method according to an embodiment of this application;

[0046] Figure 11 is a visualization of the MIR module according to an embodiment of this application;

[0047] Figure 12 is a flowchart of the generation process of accompaniment MIDI score according to an embodiment of this application;

[0048] Figure 13 shows the mixing process of the automatic mixing according to an embodiment of this application;

[0049] Figure 14 is a flowchart of the audio generation method according to an embodiment of this application;

[0050] Figure 15 is a structural schematic diagram of the audio generation device 1500 of this application;

[0051] Figure 16 shows a schematic block diagram of an apparatus 1600 according to an embodiment of this application.

Detailed Implementation

[0052] To make the objectives, technical solutions, and advantages of this application clearer, the technical solutions of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0053] The terms "first," "second," etc., used in the specification, embodiments, claims, and drawings of this application are for distinguishing purposes only and should not be construed as indicating or implying relative importance or order. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or apparatus that includes a series of steps or units is not necessarily limited to the steps or units explicitly listed, but may include other steps or units that are not explicitly listed or that are inherent to the process, method, product, or apparatus.

[0054] It should be understood that in this application, "at least one (item)" means one or more, and "multiple" means two or more. "And / or" is used to describe the relationship between related objects and indicates that three relationships can exist. For example, "A and / or B" can represent three cases: only A exists, only B exists, and both A and B exist, where A and B can be singular or plural. The character " / " generally indicates that the preceding and following related objects are in an "or" relationship. "At least one (item) of the following" or similar expressions refer to any combination of these items, including any combination of single or plural items. For example, at least one (item) of a, b, or c can represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c can each be single or multiple.

[0055] Before the technical solutions of the embodiments of this application are described, the application scenarios are first described with reference to the accompanying drawings. This application can be widely applied to scenarios that require generating music from humming, such as music creation scenarios (including recording by professional singer-songwriters or capturing inspiration), karaoke scenarios (including remixing existing songs), and personalized customization scenarios (including recording by ordinary users). These scenarios can be implemented as system services on terminal devices with recording functions, such as mobile phones, tablets, and headphones, optionally together with a cloud server; alternatively, they can be added as an application programming interface (API) to an application (APP) with recording functions, such as a virtual assistant or a hum-to-music APP.

[0056] Figure 1 shows a schematic diagram of the terminal device 100. It should be understood that the terminal device 100 shown in Figure 1 is merely an example, and the terminal device 100 may have more or fewer components than shown in the figure, may combine two or more components, or may have different component configurations. The various components shown in Figure 1 can be implemented in hardware, software, or a combination of hardware and software, including one or more signal processing and / or application-specific integrated circuits.

[0057] Terminal device 100 may include: processor 110, external memory interface 120, internal memory 121, universal serial bus (USB) interface 130, charging management module 140, power management module 141, battery 142, antenna 1, antenna 2, mobile communication module 150, wireless communication module 160, audio module 170, speaker 170A, receiver 170B, microphone 170C, headphone jack 170D, sensor module 180, button 190, motor 191, indicator 192, camera 193, display screen 194, and subscriber identification module (SIM) card interface 195, etc. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, a barometric pressure sensor 180C, a magnetic sensor 180D, an accelerometer sensor 180E, a distance sensor 180F, a proximity sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, etc.

[0058] Processor 110 may include one or more processing units, such as: application processor (AP), modem processor, graphics processing unit (GPU), image signal processor (ISP), controller, memory, video codec, digital signal processor (DSP), baseband processor, and / or neural network processing unit (NPU), etc. Different processing units may be independent devices or integrated into one or more processors.

[0059] The controller can serve as the nerve center and command center of the terminal device 100. The controller can generate operation control signals based on the instruction opcode and timing signals to control the fetching and execution of instructions.

[0060] The processor 110 may also include a memory for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. This memory can store instructions or data that the processor 110 has just used or that are used repeatedly. If the processor 110 needs to use the instruction or data again, it can retrieve it directly from the memory. This avoids repeated accesses, reduces the waiting time of the processor 110, and thus improves the efficiency of the system.

[0061] In some embodiments, the processor 110 may include one or more interfaces. Interfaces may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver / transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input / output (GPIO) interface, a subscriber identity module (SIM) interface, and / or a universal serial bus (USB) interface, etc.

[0062] The I2C interface is a bidirectional synchronous serial bus, including a serial data line (SDA) and a serial clock line (SCL). In some embodiments, the processor 110 may include multiple I2C buses. The processor 110 can couple to the touch sensor 180K, charger, flash, camera 193, etc., through different I2C bus interfaces. For example, the processor 110 can couple to the touch sensor 180K through the I2C interface, enabling the processor 110 and the touch sensor 180K to communicate through the I2C bus interface, thereby realizing the touch function of the terminal device 100.

[0063] The I2S interface can be used for audio communication. In some embodiments, the processor 110 may include multiple I2S buses. The processor 110 can be coupled to the audio module 170 via the I2S bus to enable communication between the processor 110 and the audio module 170. In some embodiments, the audio module 170 can transmit audio signals to the wireless communication module 160 via the I2S interface to enable the function of answering phone calls through a Bluetooth headset.

[0064] The PCM interface can also be used for audio communication, sampling, quantizing, and encoding analog signals. In some embodiments, the audio module 170 and the wireless communication module 160 can be coupled via the PCM bus interface. In some embodiments, the audio module 170 can also transmit audio signals to the wireless communication module 160 via the PCM interface, enabling the function of answering phone calls through a Bluetooth headset. Both the I2S interface and the PCM interface can be used for audio communication.
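The three PCM steps named above (sampling, quantizing, and encoding) can be illustrated in software with a short sketch. It samples a sine wave, quantizes each sample to a signed 16-bit integer, and packs the result as little-endian bytes. The bit depth, byte order, default sample rate, and function name are assumptions for illustration and are not specified in this application.

```python
import math
import struct


def pcm16_encode(freq_hz: float, duration_s: float, sample_rate: int = 8000) -> bytes:
    """Sample a sine wave, quantize it to 16 bits, and pack it little-endian."""
    n = int(duration_s * sample_rate)
    samples = []
    for i in range(n):
        x = math.sin(2 * math.pi * freq_hz * i / sample_rate)  # sampling
        q = max(-32768, min(32767, round(x * 32767)))           # quantizing
        samples.append(q)
    return struct.pack(f"<{n}h", *samples)                      # encoding
```

Each sample occupies two bytes, so one second of mono audio at 8000 Hz produces 16000 bytes.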

[0065] The UART interface is a universal serial data bus used for asynchronous communication. This bus can be a bidirectional communication bus. It converts the data to be transmitted between serial and parallel communication. In some embodiments, the UART interface is typically used to connect the processor 110 and the wireless communication module 160. For example, the processor 110 communicates with the Bluetooth module in the wireless communication module 160 via the UART interface to implement Bluetooth functionality. In some embodiments, the audio module 170 can transmit audio signals to the wireless communication module 160 via the UART interface to enable music playback through Bluetooth headphones.
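The serial/parallel conversion mentioned above can be sketched as follows, assuming the common 8N1 frame format (one start bit, eight data bits sent least significant bit first, no parity, one stop bit). The frame format and function names are assumptions for illustration; this application does not state which UART configuration is used.

```python
from typing import List


def uart_frame(byte: int) -> List[int]:
    """Serialize one byte as an 8N1 frame: start bit, 8 data bits (LSB first), stop bit."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte out of range")
    data_bits = [(byte >> i) & 1 for i in range(8)]
    return [0] + data_bits + [1]


def uart_unframe(bits: List[int]) -> int:
    """Recover the byte from a 10-bit 8N1 frame, checking the start and stop bits."""
    if len(bits) != 10 or bits[0] != 0 or bits[9] != 1:
        raise ValueError("framing error")
    return sum(bit << i for i, bit in enumerate(bits[1:9]))
```

A round trip such as `uart_unframe(uart_frame(0x41))` returns the original byte.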

[0066] The MIPI interface can be used to connect the processor 110 to peripheral devices such as the display screen 194 and the camera 193. The MIPI interface includes a camera serial interface (CSI) and a display serial interface (DSI). In some embodiments, the processor 110 and the camera 193 communicate via the CSI interface to enable the shooting function of the terminal device 100. The processor 110 and the display screen 194 communicate via the DSI interface to enable the display function of the terminal device 100.

[0067] The GPIO interface can be configured via software. It can be configured as a control signal or a data signal. In some embodiments, the GPIO interface can be used to connect the processor 110 to a camera 193, a display screen 194, a wireless communication module 160, an audio module 170, a sensor module 180, etc. The GPIO interface can also be configured as an I2C interface, an I2S interface, a UART interface, a MIPI interface, etc.

[0068] The USB interface 130 is an interface that complies with the USB standard and may specifically be a Mini USB port, a Micro USB port, a USB Type-C port, etc. The USB interface 130 can be used to connect a charger to charge the terminal device 100, and can also be used for data transfer between the terminal device 100 and peripheral devices. It can also be used to connect headphones for audio playback. This interface can also be used to connect other terminal devices, such as AR devices.

[0069] It is understood that the interface connection relationships between the modules illustrated in the embodiments of this application are merely illustrative and do not constitute a structural limitation on the terminal device 100. In other embodiments of this application, the terminal device 100 may also adopt different interface connection methods or a combination of multiple interface connection methods as described in the above embodiments.

[0070] The charging management module 140 receives charging input from a charger. The charger can be a wireless charger or a wired charger. In some wired charging embodiments, the charging management module 140 receives charging input from the wired charger via the USB interface 130. In some wireless charging embodiments, the charging management module 140 receives wireless charging input via the wireless charging coil of the terminal device 100. While charging the battery 142, the charging management module 140 can also supply power to the terminal device via the power management module 141.

[0071] The power management module 141 connects the battery 142, the charging management module 140, and the processor 110. The power management module 141 receives input from the battery 142 and / or the charging management module 140, providing power to the processor 110, internal memory 121, external memory, display screen 194, camera 193, and wireless communication module 160, etc. The power management module 141 can also monitor parameters such as battery capacity, battery cycle count, and battery health status (leakage current, impedance). In some other embodiments, the power management module 141 may also be located within the processor 110. In other embodiments, the power management module 141 and the charging management module 140 may be located in the same device.

[0072] The wireless communication function of the terminal device 100 can be implemented through antenna 1, antenna 2, mobile communication module 150, wireless communication module 160, modem processor and baseband processor, etc.

[0073] Antennas 1 and 2 are used to transmit and receive electromagnetic wave signals. Each antenna in terminal device 100 can be used to cover one or more communication frequency bands. Different antennas can also be multiplexed to improve antenna utilization. For example, antenna 1 can be multiplexed as a diversity antenna for a wireless local area network. In some other embodiments, the antennas can be used in conjunction with a tuning switch.

[0074] The mobile communication module 150 can provide solutions for wireless communication, including 2G / 3G / 4G / 5G, applied to the terminal device 100. The mobile communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (LNA), etc. The mobile communication module 150 can receive electromagnetic waves via antenna 1, and perform filtering, amplification, and other processing on the received electromagnetic waves before transmitting them to a modem processor for demodulation. The mobile communication module 150 can also amplify the signal modulated by the modem processor and convert it into electromagnetic waves for radiation via antenna 1. In some embodiments, at least some functional modules of the mobile communication module 150 may be housed in the processor 110. In some embodiments, at least some functional modules of the mobile communication module 150 and at least some modules of the processor 110 may be housed in the same device.

[0075] The modem processor may include a modulator and a demodulator. The modulator modulates the low-frequency baseband signal to be transmitted into a mid-to-high frequency signal. The demodulator demodulates the received electromagnetic wave signal into a low-frequency baseband signal. The demodulator then transmits the demodulated low-frequency baseband signal to the baseband processor for processing. After processing by the baseband processor, the low-frequency baseband signal is transmitted to the application processor. The application processor outputs sound signals through an audio device (not limited to speaker 170A, receiver 170B, etc.) or displays images or videos through the display screen 194. In some embodiments, the modem processor may be a separate device. In other embodiments, the modem processor may be independent of the processor 110 and may be housed in the same device as the mobile communication module 150 or other functional modules.

[0076] The wireless communication module 160 can provide solutions for wireless communication applications on the terminal device 100, including wireless local area networks (WLAN) (such as wireless fidelity (Wi-Fi) networks), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), and infrared (IR) technologies. The wireless communication module 160 can be one or more devices integrating at least one communication processing module. The wireless communication module 160 receives electromagnetic waves via antenna 2, performs frequency modulation and filtering of the electromagnetic wave signals, and sends the processed signal to processor 110. The wireless communication module 160 can also receive signals to be transmitted from processor 110, perform frequency modulation and amplification, and convert them into electromagnetic waves for radiation via antenna 2.

[0077] In some embodiments, antenna 1 of terminal device 100 is coupled to mobile communication module 150, and antenna 2 is coupled to wireless communication module 160, enabling terminal device 100 to communicate with networks and other devices via wireless communication technology. The wireless communication technology may include Global System for Mobile Communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Time-Division Synchronous Code Division Multiple Access (TD-SCDMA), Long Term Evolution (LTE), BT, GNSS, WLAN, NFC, FM, and / or IR technologies, etc. The GNSS may include the Global Positioning System (GPS), the Global Navigation Satellite System (GLONASS), the BeiDou Navigation Satellite System (BDS), the Quasi-Zenith Satellite System (QZSS), and / or satellite-based augmentation systems (SBAS).

[0078] Terminal device 100 implements display functions through a GPU, display screen 194, and application processor. The GPU is a microprocessor for image processing, connected to the display screen 194 and the application processor. The GPU is used to perform mathematical and geometric calculations and for graphics rendering. Processor 110 may include one or more GPUs, which execute program instructions to generate or modify display information.

[0079] The display screen 194 is used to display images, videos, etc. The display screen 194 includes a display panel. The display panel can be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a mini-LED, a micro-LED, a quantum dot light-emitting diode (QLED), etc. In some embodiments, the terminal device 100 may include one or N display screens 194, where N is a positive integer greater than 1.

[0080] Terminal device 100 can perform shooting functions through ISP, camera 193, video codec, GPU, display 194 and application processor.

[0081] The ISP (Image Signal Processor) is used to process data fed back from the camera 193. For example, when taking a picture, the shutter is opened, and light is transmitted through the lens to the camera's photosensitive element. The light signal is converted into an electrical signal, and the camera's photosensitive element transmits the electrical signal to the ISP for processing, transforming it into an image visible to the naked eye. The ISP can also perform algorithmic optimization of image noise, brightness, and skin tone. The ISP can also optimize parameters such as exposure and color temperature of the shooting scene. In some embodiments, the ISP can be set in the camera 193.

[0082] Camera 193 is used to capture still images or videos. An object generates an optical image through the lens, and the optical image is projected onto a photosensitive element. The photosensitive element can be a charge-coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The photosensitive element converts the light signal into an electrical signal, which is then passed to an ISP for conversion into a digital image signal. The ISP outputs the digital image signal to a DSP for processing. The DSP converts the digital image signal into image signals in standard RGB, YUV, or other formats. In some embodiments, the terminal device 100 may include one or N cameras 193, where N is a positive integer greater than 1.

[0083] A digital signal processor (DSP) is used to process digital signals. Besides digital image signals, it can also process other digital signals. For example, when terminal device 100 selects a frequency, the DSP can perform Fourier transforms on the frequency energy.

[0084] Video codecs are used to compress or decompress digital video. Terminal device 100 may support one or more video codecs. Thus, terminal device 100 can play or record videos in various encoding formats, such as Moving Picture Experts Group (MPEG) 1, MPEG 2, MPEG 3, MPEG 4, etc.

[0085] The NPU is a neural network (NN) computing processor. By borrowing the structure of biological neural networks, such as the transmission patterns between neurons in the human brain, it can rapidly process input information and can also continuously learn on its own. The NPU enables intelligent cognitive applications in terminal device 100, such as image recognition, facial recognition, speech recognition, and text understanding.

[0086] The external storage interface 120 can be used to connect an external storage card, such as a Micro SD card, to expand the storage capacity of the terminal device 100. The external storage card communicates with the processor 110 through the external storage interface 120 to perform data storage functions. For example, music, video, and other files can be saved on the external storage card.

[0087] Internal memory 121 can be used to store computer executable program code, which includes instructions. Processor 110 executes various functional applications and data processing of terminal device 100 by running the instructions stored in internal memory 121. Internal memory 121 may include a program storage area and a data storage area. The program storage area may store the operating system, at least one application program required for a function (such as sound playback, image playback, etc.), etc. The data storage area may store data created during the use of terminal device 100 (such as audio data, phonebook, etc.). Furthermore, internal memory 121 may include high-speed random access memory and may also include non-volatile memory, such as at least one disk storage device, flash memory device, universal flash storage (UFS), etc.

[0088] Terminal device 100 can implement audio functions, such as music playback and recording, through audio module 170, speaker 170A, receiver 170B, microphone 170C, headphone jack 170D, and application processor.

[0089] The audio module 170 is used to convert digital audio information into analog audio signals for output, and also to convert analog audio input into digital audio signals. The audio module 170 can also be used for encoding and decoding audio signals. In some embodiments, the audio module 170 may be located in the processor 110, or some functional modules of the audio module 170 may be located in the processor 110.

[0090] The speaker 170A, also known as a "loudspeaker," is used to convert audio electrical signals into sound signals. The terminal device 100 can listen to music or make hands-free calls through the speaker 170A.

[0091] The receiver 170B, also known as the "earpiece," is used to convert audio electrical signals into sound signals. When the terminal device 100 answers a phone call or voice message, the receiver 170B can be brought close to the listener's ear to receive the voice message.

[0092] Microphone 170C, also known as a "mic" or "sound transducer," is used to convert sound signals into electrical signals. When making a phone call or sending a voice message, the user can speak with their mouth close to microphone 170C, inputting the sound signal into microphone 170C. Terminal device 100 may be equipped with at least one microphone 170C. In some embodiments, terminal device 100 may be equipped with two microphones 170C, which, in addition to collecting sound signals, can also perform noise reduction. In other embodiments, terminal device 100 may be equipped with three, four, or more microphones 170C, which can collect sound signals, reduce noise, identify the sound source, and perform directional recording, etc.

[0093] The headphone jack 170D is used to connect wired headphones. The headphone jack 170D can be the USB interface 130, a 3.5mm Open Mobile Terminal Platform (OMTP) standard interface, or a CTIA (Cellular Telecommunications Industry Association of the USA) standard interface.

[0094] Pressure sensor 180A is used to sense pressure signals and convert them into electrical signals. In some embodiments, pressure sensor 180A can be disposed on display screen 194. There are many types of pressure sensors 180A, such as resistive pressure sensors, inductive pressure sensors, and capacitive pressure sensors. A capacitive pressure sensor may include at least two parallel plates with conductive material. When force is applied to pressure sensor 180A, the capacitance between the electrodes changes. Terminal device 100 determines the pressure intensity based on the change in capacitance. When a touch operation is applied to display screen 194, terminal device 100 detects the intensity of the touch operation based on pressure sensor 180A. Terminal device 100 can also calculate the touch position based on the detection signal from pressure sensor 180A. In some embodiments, touch operations applied to the same touch position but with different touch operation intensities can correspond to different operation commands. For example: when a touch operation with an intensity less than a first pressure threshold is applied to the SMS application icon, a command to view an SMS is executed. When a touch operation with an intensity greater than or equal to the first pressure threshold is applied to the SMS application icon, a command to create a new SMS is executed.
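The SMS example above can be sketched as a small dispatch function. This is a minimal illustration only: the `TouchEvent` type, the command strings, and the threshold value of 0.5 are hypothetical, since the text names a "first pressure threshold" without giving its value or units.

```python
from dataclasses import dataclass

# Hypothetical value in arbitrary normalized units; the source gives no number.
FIRST_PRESSURE_THRESHOLD = 0.5


@dataclass
class TouchEvent:
    target: str       # icon under the touch position, e.g. "sms_icon"
    intensity: float  # touch intensity derived from the capacitance change


def dispatch_touch(event: TouchEvent,
                   threshold: float = FIRST_PRESSURE_THRESHOLD) -> str:
    """Map a touch on the SMS icon to a command based on its intensity."""
    if event.target != "sms_icon":
        return "ignore"
    # Intensity below the threshold views an SMS; at or above it creates one.
    if event.intensity < threshold:
        return "view_sms"
    return "create_sms"
```

A light press such as `dispatch_touch(TouchEvent("sms_icon", 0.2))` selects the view command, while a press at or above the threshold selects the create command.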

[0095] The gyroscope sensor 180B can be used to determine the motion attitude of the terminal device 100. In some embodiments, the gyroscope sensor 180B can determine the angular velocity of the terminal device 100 around three axes (i.e., the x, y, and z axes). The gyroscope sensor 180B can be used for image stabilization. For example, when the shutter is pressed, the gyroscope sensor 180B detects the angle of the terminal device 100's shake, calculates the distance that the lens module needs to compensate based on the angle, and allows the lens to counteract the shake of the terminal device 100 through reverse movement, thus achieving image stabilization. The gyroscope sensor 180B can also be used in navigation and motion-sensing game scenarios.

[0096] The barometric pressure sensor 180C is used to measure air pressure. In some embodiments, the terminal device 100 calculates altitude using the air pressure value measured by the barometric pressure sensor 180C to assist in positioning and navigation.
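As an illustration of how altitude might be derived from a pressure reading, the sketch below uses the widely used international barometric formula. The text does not say which formula the terminal device 100 applies, so the formula, the function name, and the sea-level reference of 1013.25 hPa are all assumptions.

```python
# Assumed standard-atmosphere sea-level reference pressure, in hPa.
SEA_LEVEL_PRESSURE_HPA = 1013.25


def altitude_from_pressure(pressure_hpa: float,
                           sea_level_hpa: float = SEA_LEVEL_PRESSURE_HPA) -> float:
    """Estimate altitude in metres from air pressure in hPa.

    Uses the international barometric formula, which is only an
    approximation that ignores local weather and temperature.
    """
    ratio = pressure_hpa / sea_level_hpa
    return 44330.0 * (1.0 - ratio ** (1.0 / 5.255))
```

In practice the sea-level reference would need calibration against local conditions, which is one reason the barometric value only assists positioning rather than replacing it.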

[0097] The magnetic sensor 180D includes a Hall sensor. The terminal device 100 can use the magnetic sensor 180D to detect the opening and closing of the flip cover. In some embodiments, when the terminal device 100 is a flip phone, the terminal device 100 can detect the opening and closing of the flip cover using the magnetic sensor 180D. Then, based on the detected opening and closing state of the cover or the flip cover, features such as automatic flip unlocking can be set.

[0098] The acceleration sensor 180E can detect the magnitude of acceleration of the terminal device 100 in various directions (typically three axes). When the terminal device 100 is stationary, it can detect the magnitude and direction of gravity. It can also be used to identify the attitude of the terminal device, and can be applied to applications such as landscape / portrait switching and pedometers.

[0099] A distance sensor 180F is used to measure distance. The terminal device 100 can measure distance via infrared or laser. In some embodiments, during a shooting scene, the terminal device 100 can utilize the distance sensor 180F to measure distance for rapid focusing.

[0100] The proximity sensor 180G may include, for example, a light-emitting diode (LED) and a light detector, such as a photodiode. The LED may be an infrared LED. The terminal device 100 emits infrared light outward through the LED. The terminal device 100 uses the photodiode to detect infrared reflected light from nearby objects. When sufficient reflected light is detected, it can be determined that there is an object near the terminal device 100. When insufficient reflected light is detected, the terminal device 100 can determine that there is no object near the terminal device 100. The terminal device 100 may use the proximity sensor 180G to detect when a user holds the terminal device 100 close to their ear for a call, so as to automatically turn off the screen to save power. The proximity sensor 180G can also be used in holster mode and pocket mode for automatic unlocking and screen locking.

[0101] The ambient light sensor 180L is used to sense the ambient light intensity. The terminal device 100 can adaptively adjust the brightness of the display screen 194 based on the sensed ambient light intensity. The ambient light sensor 180L can also be used to automatically adjust the white balance when taking pictures. The ambient light sensor 180L can also work with the proximity sensor 180G to detect whether the terminal device 100 is in a pocket to prevent accidental touches.

[0102] The fingerprint sensor 180H is used to collect fingerprints. The terminal device 100 can use the collected fingerprint characteristics to achieve fingerprint unlocking, accessing application locks, taking photos with fingerprints, answering calls with fingerprints, etc.

[0103] Temperature sensor 180J is used to detect temperature. In some embodiments, terminal device 100 uses the temperature detected by temperature sensor 180J to execute a temperature handling strategy. For example, when the temperature reported by temperature sensor 180J exceeds a threshold, terminal device 100 reduces the performance of the processor located near temperature sensor 180J to reduce power consumption and implement thermal protection. In other embodiments, when the temperature is below another threshold, terminal device 100 heats battery 142 to prevent abnormal shutdown of terminal device 100 due to low temperature. In still other embodiments, when the temperature is below yet another threshold, terminal device 100 boosts the output voltage of battery 142 to prevent abnormal shutdown due to low temperature.
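The temperature handling strategy can be sketched as a simple threshold check. The three threshold values and the action names below are hypothetical, since the text gives none, and combining the three embodiments in one function is an assumption made for illustration.

```python
from typing import List


def thermal_actions(temp_c: float,
                    high_threshold: float = 45.0,
                    heat_threshold: float = 0.0,
                    boost_threshold: float = -10.0) -> List[str]:
    """Return the protective actions for a reported temperature.

    All default thresholds are illustrative placeholders in degrees Celsius.
    """
    actions = []
    if temp_c > high_threshold:
        # Overheating: reduce performance of the nearby processor.
        actions.append("throttle_processor")
    if temp_c < heat_threshold:
        # Cold: heat the battery to avoid an abnormal shutdown.
        actions.append("heat_battery")
    if temp_c < boost_threshold:
        # Colder still: also boost the battery output voltage.
        actions.append("boost_battery_voltage")
    return actions
```

A normal reading yields an empty list, so no action is taken between the low and high thresholds.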

[0104] Touch sensor 180K, also known as a "touch panel," can be located on display screen 194. The touch sensor 180K and display screen 194 together form a touchscreen, also called a "touch-control screen." Touch sensor 180K detects touch operations applied to or near it. The touch sensor can transmit the detected touch operation to the application processor to determine the type of touch event. Visual output related to the touch operation can be provided through display screen 194. In other embodiments, touch sensor 180K may also be located on the surface of terminal device 100, in a different position than display screen 194.

[0105] The bone conduction sensor 180M can acquire vibration signals. In some embodiments, the bone conduction sensor 180M can acquire the vibration signal of the bone that vibrates when a person speaks. The bone conduction sensor 180M can also contact the human pulse to receive blood pressure pulse signals. In some embodiments, the bone conduction sensor 180M can also be incorporated into headphones to form bone conduction headphones. The audio module 170 can parse the voice signal from the bone vibration signal acquired by the bone conduction sensor 180M to realize voice functionality. The application processor can parse heart rate information from the blood pressure pulse signals acquired by the bone conduction sensor 180M to realize heart rate detection functionality.

[0106] Buttons 190 include a power button, volume buttons, etc. Buttons 190 can be mechanical buttons or touch-sensitive buttons. Terminal device 100 can receive button input and generate key signal inputs related to user settings and function control of terminal device 100.

[0107] Motor 191 can generate vibration alerts. Motor 191 can be used for incoming call vibration alerts or for touch vibration feedback. For example, different vibration feedback effects can correspond to touch operations performed on different applications (such as taking photos, playing audio, etc.). Motor 191 can also correspond to different vibration feedback effects for touch operations performed on different areas of the display screen 194. Different application scenarios (such as time reminders, receiving messages, alarm clocks, games, etc.) can also correspond to different vibration feedback effects. The touch vibration feedback effect can also be customized.

[0108] Indicator 192 can be an indicator light, used to indicate charging status, power changes, or to indicate messages, missed calls, notifications, etc.

[0109] The SIM card interface 195 is used to connect a SIM card. The SIM card can be inserted into or removed from the SIM card interface 195 to make contact with and separate from the terminal device 100. The terminal device 100 can support one or N SIM card interfaces, where N is a positive integer greater than 1. The SIM card interface 195 can support Nano SIM cards, Micro SIM cards, SIM cards, etc. Multiple cards can be inserted into the same SIM card interface 195 simultaneously. The multiple cards can be of the same or different types. The SIM card interface 195 is also compatible with different types of SIM cards. The SIM card interface 195 is also compatible with external memory cards. The terminal device 100 interacts with the network through the SIM card to realize functions such as calls and data communication. In some embodiments, the terminal device 100 uses an eSIM, i.e., an embedded SIM card. The eSIM card can be embedded in the terminal device 100 and cannot be separated from the terminal device 100.

[0110] The software system of terminal device 100 can adopt a layered architecture, event-driven architecture, microkernel architecture, microservice architecture, or cloud architecture. The embodiments of this application take an Android system with a layered architecture as an example to illustrate the software structure of terminal device 100.

[0111] Figure 2 is a software structure block diagram of the terminal device 100 of this application. As shown in Figure 2, the layered architecture of the terminal device 100 divides the software into several layers, each with a clear role and division of labor. The layers communicate with each other through software interfaces. In some embodiments, the Android system is divided into four layers, from top to bottom: the application layer, the application framework layer, the Android runtime and system libraries, and the kernel layer.

[0112] The application layer can include a series of application packages.

[0113] As shown in Figure 2, the application package may include applications such as camera, gallery, calendar, call, map, navigation, WLAN, Bluetooth, music, video, SMS, and humming.

[0114] The application framework layer provides application programming interfaces (APIs) and a programming framework for applications in the application layer. The application framework layer includes some predefined functions.

[0115] As shown in Figure 2, the application framework layer may include a window manager, content provider, view system, phone manager, resource manager, notification manager, etc.

[0116] The window manager is used to manage windowed applications. It can retrieve screen size, determine the presence of a status bar, lock the screen, and capture screenshots, among other things.

[0117] Content providers store and retrieve data, making that data accessible to applications. This data may include videos, images, audio, made and received phone calls, browsing history and bookmarks, phone books, etc.

[0118] A view system includes visual controls, such as controls for displaying text and controls for displaying images. View systems can be used to build applications. A display interface can consist of one or more views. For example, a display interface including a text notification icon could include views for displaying text and views for displaying images.

[0119] The phone manager is used to provide communication functions for terminal device 100. For example, it manages call status (including connection, hang-up, etc.).

[0120] The resource manager provides applications with various resources, such as localized strings, icons, images, layout files, video files, and more.

[0121] The notification manager allows applications to display notifications in the status bar. These notifications can be used to deliver informational messages and can disappear automatically after a short pause, requiring no user interaction. For example, the notification manager can be used to notify users of download completion or message alerts. The notification manager can also display notifications as icons or scrolling text in the top status bar, such as notifications from background applications, or as dialog boxes on the screen. Examples include displaying text messages in the status bar, emitting sounds, vibrating the device, and flashing indicator lights.

[0122] The Android Runtime consists of core libraries and a virtual machine. The Android runtime is responsible for the scheduling and management of the Android system.

[0123] The core library consists of two parts: one part is the functions that the Java language needs to call, and the other part is the Android core library.

[0124] The application layer and application framework layer run in a virtual machine. The virtual machine executes the Java files of the application layer and application framework layer as binary files. The virtual machine is used to perform functions such as object lifecycle management, stack management, thread management, security and exception management, and garbage collection.

[0125] System libraries can include multiple functional modules. For example: surface manager, media libraries, 3D graphics processing libraries (e.g., OpenGL ES), 2D graphics engines (e.g., SGL), etc.

[0126] The Surface Manager is used to manage the display subsystem and provides the blending of 2D and 3D layers for multiple applications.

[0127] The media library supports playback and recording of various common audio and video formats, as well as still image files. It supports multiple audio and video encoding formats, such as MPEG4, H.264, MP3, AAC, AMR, JPG, and PNG.

[0128] The 3D graphics processing library is used to implement 3D graphics drawing, image rendering, compositing, and layer processing.

[0129] A 2D graphics engine is a graphics engine for 2D drawing.

[0130] The kernel layer is the layer between hardware and software. The kernel layer contains at least the display driver, camera driver, audio driver, and sensor driver.

[0131] It is understood that the components included in the system framework layer, system library, and runtime layer shown in Figure 2 do not constitute a specific limitation on the terminal device 100. In other embodiments of this application, the terminal device 100 may include more or fewer components than shown in the figure, or combine some components, or split some components, or have different component arrangements.

[0132] Since this application involves the application of artificial intelligence (AI), some related terms or concepts are explained below for ease of understanding.

[0133] (1) Neural Network

[0134] Neural networks (NNs) are machine learning models. A neural network can be composed of neural units, which are computational units that take $x_s$ and an intercept of 1 as input. The output of such a computational unit can be:

$$h_{W,b}(x) = f\left(W^{T}x\right) = f\left(\sum_{s=1}^{n} W_s x_s + b\right)$$

[0135] where s = 1, 2, ..., n, n is a natural number greater than 1, $W_s$ is the weight of $x_s$, and b is the bias of the neural unit. f is the activation function of the neural unit, used to introduce nonlinear characteristics into the neural network so as to convert the input signal of the neural unit into an output signal. The output signal of this activation function can be used as the input of the next convolutional layer. The activation function can be the sigmoid function. A neural network is a network formed by connecting many of the above-mentioned individual neural units together; that is, the output of one neural unit can be the input of another neural unit. The input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of the local receptive field, which can be a region composed of several neural units.
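The computation of a single neural unit (a weighted sum of the inputs plus a bias, passed through an activation function) can be sketched as follows. The function names are illustrative, and the sigmoid is used only because the text names it as one possible activation function.

```python
import math
from typing import Callable, Sequence


def sigmoid(z: float) -> float:
    """Sigmoid activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neural_unit(xs: Sequence[float],
                ws: Sequence[float],
                b: float,
                f: Callable[[float], float] = sigmoid) -> float:
    """Return f(sum_s W_s * x_s + b) for one neural unit."""
    if len(xs) != len(ws):
        raise ValueError("xs and ws must have the same length")
    weighted_sum = sum(w * x for w, x in zip(ws, xs)) + b
    return f(weighted_sum)
```

With inputs `[1.0, 2.0]`, weights `[0.5, -0.25]`, and bias `0.0`, the weighted sum is zero, so the sigmoid output is 0.5.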

[0136] (2) Deep Neural Networks

[0137] Deep neural networks (DNNs), also known as multilayer neural networks, can be understood as neural networks with many hidden layers, though there is no specific metric for "many." Based on their position, the layers of a DNN fall into three categories: the input layer, hidden layers, and the output layer. Generally, the first layer is the input layer, the last layer is the output layer, and the layers in between are hidden layers. All layers are fully connected, meaning that any neuron in the i-th layer is connected to any neuron in the (i+1)-th layer. Although DNNs appear complex, the operation of each layer is actually quite simple and can be expressed by the following linear relationship:

$$\vec{y} = \alpha\left(W\vec{x} + \vec{b}\right)$$

where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is the offset vector, $W$ is the weight matrix (also called coefficients), and $\alpha()$ is the activation function. Each layer simply applies this operation to the input vector $\vec{x}$ to obtain the output vector $\vec{y}$. Because DNNs have many layers, the number of coefficients $W$ and offset vectors $\vec{b}$ is also quite large. These parameters are defined in a DNN as follows, taking the coefficient $W$ as an example. Assuming a three-layer DNN, the linear coefficient from the 4th neuron in the second layer to the 2nd neuron in the third layer is defined as $W^{3}_{24}$. The superscript 3 represents the layer in which the coefficient $W$ resides, while the subscripts correspond to the output index 2 in the third layer and the input index 4 in the second layer. In summary, the coefficient from the k-th neuron in layer L-1 to the j-th neuron in layer L is defined as $W^{L}_{jk}$. Note that the input layer does not have a $W$ parameter. In deep neural networks, more hidden layers allow the network to better represent complex real-world situations. Theoretically, the more parameters a model has, the higher its complexity and "capacity," meaning it can perform more complex learning tasks. Training a deep neural network is essentially the process of learning the weight matrices, with the ultimate goal of obtaining the weight matrices of all layers in the trained deep neural network (the weight matrices formed by the coefficients W of many layers).
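The per-layer operation of a fully connected network (multiply by a weight matrix, add an offset vector, apply an activation) can be sketched as below. This is a minimal sketch with illustrative function names and a tanh activation chosen as an assumption. The nested list `W[j][k]` holds the coefficient from neuron k of the previous layer to neuron j of the current layer, using 0-based indices, so the 1-based coefficient with output index 2 and input index 4 would be stored at `W[1][3]`.

```python
import math
from typing import Callable, List, Sequence, Tuple

Matrix = List[List[float]]
Layer = Tuple[Matrix, List[float]]  # (weight matrix W, offset vector b)


def dense_layer(W: Matrix, b: Sequence[float], x: Sequence[float],
                act: Callable[[float], float]) -> List[float]:
    """Compute y = act(W x + b) for one fully connected layer."""
    return [
        act(sum(W[j][k] * x[k] for k in range(len(x))) + b[j])
        for j in range(len(W))
    ]


def forward(layers: Sequence[Layer], x: Sequence[float],
            act: Callable[[float], float] = math.tanh) -> List[float]:
    """Feed x through every layer in order; the input layer has no W."""
    out = list(x)
    for W, b in layers:
        out = dense_layer(W, b, out, act)
    return out
```

Training would then adjust every `W` and `b` in `layers`; this sketch covers only the forward computation.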

[0138] (3) Convolutional Neural Network

[0139] A convolutional neural network (CNN) is a deep neural network with a convolutional structure. It is a deep learning architecture, which refers to learning at multiple levels of abstraction using machine learning algorithms. As a deep learning architecture, a CNN is a feed-forward artificial neural network in which each neuron can respond to the image input to it. A CNN contains a feature extractor consisting of convolutional layers and pooling layers. This feature extractor can be viewed as a filter, and the convolution process can be seen as convolving a trainable filter with an input image or a convolutional feature map.

[0140] A convolutional layer is a layer of neurons in a convolutional neural network that performs convolution processing on the input signal. A convolutional layer can contain multiple convolution operators, also called kernels. In image processing, these operators act as filters, extracting specific information from the input image matrix. Essentially, a convolution operator can be a weight matrix, which is usually predefined. During the convolution operation, the weight matrix typically processes the input image pixel by pixel (or two pixels by two pixels, depending on the stride) along the horizontal direction, thus extracting specific features from the image. The size of the weight matrix should be related to the image size. It is important to note that the depth dimension of the weight matrix is the same as the depth dimension of the input image; during convolution, the weight matrix extends to the entire depth of the input image. Therefore, convolution with a single weight matrix produces a single-depth convolutional output. However, in most cases, multiple weight matrices of the same size (rows × columns) are used instead of a single weight matrix. The outputs of each weight matrix are stacked to form the depth dimension of the convolutional image. This dimension can be understood as being determined by the "multiple" factors mentioned above. Different weight matrices can be used to extract different features from the image. For example, one weight matrix can be used to extract edge information, another to extract specific colors, and yet another to blur unwanted noise. These multiple weight matrices have the same size (rows × columns), and the feature maps extracted by these weight matrices also have the same size. These extracted feature maps are then merged to form the output of the convolution operation. The weight values in these weight matrices need to be obtained through extensive training in practical applications. The weight matrices formed by these trained weight values can be used to extract information from the input image, enabling the convolutional neural network to make correct predictions. When a convolutional neural network has multiple convolutional layers, the initial convolutional layers often extract more general features, which can also be called low-level features. As the depth of the convolutional neural network increases, the features extracted by later convolutional layers become increasingly complex, such as high-level semantic features. Features with higher semantic levels are more suitable for the problem being solved.
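The sliding of several same-sized weight matrices over an image, with their outputs stacked along the depth dimension, can be sketched as follows. This is a minimal, unoptimized sketch on nested lists with no padding. The function name and data layout (height × width × depth) are assumptions, and, as is common in CNNs, the kernel is applied without flipping.

```python
from typing import List

Image = List[List[List[float]]]   # height x width x depth
Kernel = List[List[List[float]]]  # rows x columns x depth (same depth as image)


def conv2d(image: Image, kernels: List[Kernel], stride: int = 1) -> Image:
    """Slide each kernel over the image and stack the results along depth."""
    height, width, depth = len(image), len(image[0]), len(image[0][0])
    k_rows, k_cols = len(kernels[0]), len(kernels[0][0])
    output = []
    for i in range(0, height - k_rows + 1, stride):
        out_row = []
        for j in range(0, width - k_cols + 1, stride):
            pixel = []  # one value per kernel: the output depth dimension
            for kernel in kernels:
                acc = 0.0
                for di in range(k_rows):
                    for dj in range(k_cols):
                        for c in range(depth):
                            # Each kernel spans the full depth of the input.
                            acc += kernel[di][dj][c] * image[i + di][j + dj][c]
                pixel.append(acc)
            out_row.append(pixel)
        output.append(out_row)
    return output
```

Passing two kernels produces an output of depth 2, which matches the statement that the number of weight matrices determines the depth of the convolutional output.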

[0141] Because it is often necessary to reduce the number of training parameters, pooling layers are often introduced periodically after convolutional layers. This can be a single convolutional layer followed by a pooling layer, or multiple convolutional layers followed by one or more pooling layers. In image processing, the sole purpose of pooling layers is to reduce the spatial size of the image. Pooling layers can include average pooling and / or max pooling operators that sample the input image to obtain a smaller image. Average pooling calculates the average of the pixel values within a specific range as the result of average pooling. Max pooling takes the pixel with the largest value within a specific range as the result of max pooling. Furthermore, just as the size of the weight matrix in a convolutional layer should be related to the image size, the operators in a pooling layer should also be related to the image size. The size of the output image after pooling can be smaller than the size of the input image of the pooling layer. Each pixel in the output image represents the average or maximum value of the corresponding sub-region of the input image of the pooling layer.
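Average and max pooling over non-overlapping windows can be sketched as follows. The function name, the single-channel nested-list layout, and the default 2 × 2 window are assumptions made for illustration.

```python
from typing import List


def pool2d(image: List[List[float]], size: int = 2,
           mode: str = "max") -> List[List[float]]:
    """Downsample a single-channel image with non-overlapping windows."""
    if mode not in ("max", "avg"):
        raise ValueError("mode must be 'max' or 'avg'")
    height, width = len(image), len(image[0])
    output = []
    for i in range(0, height - size + 1, size):
        out_row = []
        for j in range(0, width - size + 1, size):
            # Collect the pixel values of the current sub-region.
            window = [image[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            if mode == "max":
                out_row.append(max(window))
            else:
                out_row.append(sum(window) / len(window))
        output.append(out_row)
    return output
```

A 4 × 4 input with a 2 × 2 window yields a 2 × 2 output, so each output pixel summarizes one sub-region of the input.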

[0142] After processing by convolutional / pooling layers, a convolutional neural network (CNN) is still insufficient to output the required information. As mentioned earlier, convolutional / pooling layers only extract features and reduce the parameters introduced by the input image. However, to generate the final output information (the required class information or other relevant information), the CNN needs to utilize neural network layers to generate one or a set of desired classes in the output. Therefore, the neural network can include multiple hidden layers, the parameters of which can be pre-trained based on relevant data for a specific task type, such as image recognition, image classification, image super-resolution reconstruction, etc.

[0143] Optionally, after the multiple hidden layers in the neural network, there is also an output layer of the entire convolutional neural network. This output layer has a loss function similar to the categorical cross-entropy, which is specifically used to calculate the prediction error. Once the forward propagation of the entire convolutional neural network is completed, backpropagation begins to update the weight values and biases of the aforementioned layers, in order to reduce the loss of the convolutional neural network and the error between the result output by the convolutional neural network through the output layer and the ideal result.

[0144] (4) Recurrent Neural Network

[0145] Recurrent neural networks (RNNs) are used to process sequential data. In traditional neural network models, the layers from the input layer to the hidden layer and then to the output layer are fully connected, but the nodes within each layer are unconnected. While this type of neural network has solved many difficult problems, it remains inadequate for many others. For example, predicting the next word in a sentence generally requires using the preceding words because words in a sentence are not independent. RNNs are called recurrent neural networks because the current output for a sequence also depends on the previous outputs. Specifically, the network memorizes previous information and applies it to the calculation of the current output; that is, nodes within the same hidden layer are no longer unconnected but connected, and the input to a hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous time step. Theoretically, RNNs can process sequential data of any length. Training an RNN is similar to training a traditional CNN or DNN: it also uses the error backpropagation algorithm, but with one key difference. When an RNN is unrolled over time, its parameters, such as W, are shared across all time steps, which is not the case with the traditional neural networks described above. Furthermore, in gradient descent, the output at each step depends not only on the network at the current step but also on the states of the network in previous steps. This learning algorithm is called Backpropagation Through Time (BPTT).
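
The recurrence and parameter sharing described above can be sketched as follows. This is a minimal forward pass of a simple (Elman-style) recurrent layer, assuming NumPy; the names `rnn_forward`, `W_xh`, `W_hh`, and `b_h` are hypothetical, and training via BPTT is not shown.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run a simple recurrent layer over a sequence.

    The same W_xh, W_hh and b_h are reused at every time step.
    """
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in inputs:
        # The new hidden state depends on the current input and the previous state.
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
W_xh = rng.normal(size=(4, 3))
W_hh = rng.normal(size=(4, 4))
b_h = np.zeros(4)
sequence = rng.normal(size=(6, 3))    # 6 time steps, 3 features per step
states = rnn_forward(sequence, W_xh, W_hh, b_h)
print(states.shape)                   # (6, 4)
```

Because the hidden state is carried forward, changing the first input also changes the hidden states of later steps.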

[0146] Since convolutional neural networks (CNNs) already exist, why are recurrent neural networks (RNNs) needed? The reason is simple. CNNs rely on the fundamental assumption that elements are independent of each other, and that inputs and outputs are also independent, such as a cat and a dog. However, in the real world, many elements are interconnected. For example, stock prices fluctuate over time. Or consider someone saying, "I love traveling, and my favorite place is Yunnan. I definitely want to go there someday." A human knows that "there" refers to "Yunnan" because humans can infer from context, but how can machines do the same? This is where RNNs come in. RNNs aim to give machines the ability to remember, just like humans. Therefore, the output of an RNN depends on both the current input information and historical memory information.

[0147] (5) Loss Function

[0148] In training a deep neural network, to ensure the output closely approximates the desired predicted value, we compare the network's prediction with the target value. Based on the difference, we update the weight vector of each layer (usually pre-configuring parameters before the initial update). For example, if the prediction is too high, the weight vector is adjusted to predict a lower value. This adjustment continues until the deep neural network predicts the target value or a value very close to it. Therefore, we need to predefine "how to compare the difference between the predicted and target values," which is the loss function or objective function. These are important equations used to measure the difference between the predicted and target values. Taking the loss function as an example, a higher output value (loss) indicates a greater difference, and training the deep neural network becomes a process of minimizing this loss.
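
As a minimal sketch of the loop described above, the following uses a squared-error loss and plain gradient descent on a single weight. The loss choice, learning rate, and variable names are assumptions made for illustration; the application does not fix any of them.

```python
def mse_loss(prediction, target):
    """Squared error: a larger value means a larger difference."""
    return (prediction - target) ** 2

w = 3.0                    # initial (pre-configured) weight
x, target = 2.0, 4.0       # one training example
learning_rate = 0.05
losses = []
for _ in range(50):
    prediction = w * x
    losses.append(mse_loss(prediction, target))
    grad = 2.0 * (prediction - target) * x   # d(loss)/dw
    w -= learning_rate * grad                # too-high prediction lowers w

print(round(w, 6))         # 2.0
```

The first prediction (6.0) is higher than the target (4.0), so the weight is lowered on each update until the prediction approaches the target and the loss approaches zero.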

[0149] (6) Backpropagation algorithm

[0150] Convolutional neural networks can employ backpropagation (BP) to correct the parameters in the initial super-resolution model during training, thereby reducing the reconstruction error loss. Specifically, forward propagation of the input signal to the output generates an error loss; this error loss information is then propagated back to update the parameters in the initial super-resolution model, leading to convergence of the error loss. The backpropagation algorithm is an error-loss-driven backpropagation process aimed at obtaining the optimal parameters of the super-resolution model, such as the weight matrix.

[0151] (7) Generative Adversarial Networks

[0152] Generative adversarial networks (GANs) are a type of deep learning model. This model comprises at least two modules: a generative model and a discriminative model. These two modules learn from each other through a game-like interaction, resulting in better outputs. Both the generative and discriminative models can be neural networks, specifically deep neural networks or convolutional neural networks. The basic principle of GANs is as follows. Taking an image-generating GAN as an example, suppose there are two networks, G (Generator) and D (Discriminator). G is a network that generates images: it receives random noise z and generates an image from this noise, denoted as G(z). D is a discriminative network used to determine whether an image is "real." Its input parameter is x, representing an image, and its output D(x) represents the probability that x is a real image. A value of 1 indicates that the image is certainly real, while a value of 0 indicates that the image cannot be real. During the training of this GAN, the goal of the generative network G is to generate images realistic enough to deceive the discriminative network D, while the goal of the discriminative network D is to distinguish the images generated by G from real images as far as possible. Thus, G and D constitute a dynamic "game," which is the "adversarial" aspect of the GAN. Ideally, the game results in G generating images G(z) that are realistic enough to pass as real, while D can no longer determine whether the images generated by G are real, i.e., D(G(z)) = 0.5. This yields a superior generative model G that can be used to generate images.
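
The game between G and D can be made concrete with a small numerical sketch. It assumes the standard GAN value function, V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))], which this application does not spell out; the function name `gan_value` and the sample probabilities are hypothetical.

```python
import numpy as np

def gan_value(d_real, d_fake):
    """Standard GAN value V(D, G): D tries to maximize it, G to minimize it.

    d_real: D(x) on real samples; d_fake: D(G(z)) on generated samples.
    """
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# D separates real from generated images confidently (early in training).
confident = gan_value([0.9, 0.95], [0.1, 0.05])
# D cannot tell them apart: D(x) = D(G(z)) = 0.5 (the ideal end state).
balanced = gan_value([0.5, 0.5], [0.5, 0.5])
```

Under this assumption, the balanced state evaluates to -log 4, which is lower than the value obtained when D separates the two kinds of images confidently.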

[0153] Based on this, this application provides an audio generation method to improve the adaptability of humming compositions.

[0154] Figure 3 is a flowchart of process 300 of the audio generation method provided in an embodiment of this application. Process 300 can be executed by the terminal device 100 described above, especially the system service provided by the terminal device 100 or the humming and song generation APP installed thereon. Process 300 is described as a series of steps or operations. It should be understood that process 300 can be executed in various orders and / or occur simultaneously, and is not limited to the execution order shown in Figure 3.

[0155] Process 300 may include:

[0156] Step 301: Obtain the humming audio.

[0157] The humming audio can be input by the user through a human-computer interaction interface.

[0158] For example, Figure 4 is a schematic diagram of the human-computer interaction interface of an embodiment of this application. As shown in Figure 4, the main interface of the Humming Song APP includes two pages: Start Creating and All Works. The Start Creating page includes style selection controls (e.g., pure piano, pure guitar, band, etc.) and a recording control. Users can first click the control for the desired style, then click the recording control and start humming to record the humming audio.

[0159] It should be noted that Figure 4 is only an example showing one possible implementation of the human-computer interaction interface, but this does not constitute a limitation on the human-computer interaction interface. The human-computer interaction interface can be implemented in any way and can contain more or less content. This application embodiment does not make any specific limitation in this regard.

[0160] Step 302: Perform MIR on the humming audio to obtain the musical information of the humming audio.

[0161] Musical information can include meter, pitch, and tone. Optionally, it can also include phrase information. Meter (also known as time) refers to the regularly repeating pattern of accents; pitch refers to how high or low a sound is; tone (tonality) refers to whether a note can be perceived as the central note; and a phrase is the smallest structure in a musical work that expresses a complete or relatively complete musical idea, that is, the smallest structure that constitutes an independent section.
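
To make the notion of pitch concrete, the sketch below converts a fundamental frequency to a MIDI note number and a note name. It assumes equal temperament with A4 = 440 Hz (MIDI note 69) and the convention that MIDI note 60 is C4; the function names are hypothetical and the application does not prescribe this conversion.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def frequency_to_midi(freq_hz, a4_hz=440.0):
    """Nearest MIDI note number for a frequency (A4 = MIDI note 69)."""
    return round(69 + 12 * math.log2(freq_hz / a4_hz))

def midi_to_name(note):
    """Note name with octave, using the convention that MIDI 60 is C4."""
    return f"{NOTE_NAMES[note % 12]}{note // 12 - 1}"

print(midi_to_name(frequency_to_midi(261.63)))  # C4
print(midi_to_name(frequency_to_midi(440.0)))   # A4
```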

[0162] In this embodiment, the music information can be obtained by performing Music Information Retrieval (MIR) on the humming audio. For example, as shown in Figure 5 (a schematic diagram of the music information in this embodiment), the music information obtained after MIR includes beat information, pitch information, tone information, and musical phrase information. In this embodiment, Figure 5 can be displayed on a human-computer interaction interface, allowing users to intuitively see the beat, pitch, tone, and musical phrases of the humming audio.

[0163] In one possible implementation, MIR can be implemented as a single AI model; that is, the humming audio is input into the AI model, which outputs the musical information of the humming audio.

[0164] In one possible implementation, MIR can be achieved by using multiple AI models in combination. Specifically, the humming audio is input into the first AI model, which outputs beat information; the beat information and the humming audio are input into the second AI model, which outputs pitch information; the pitch information of the humming audio is input into the third AI model, which outputs tone information; and the humming audio is input into the fourth AI model, which outputs musical phrase information.
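
The data flow between the four models described above can be sketched as follows. The class `MusicInfo`, the function `run_mir`, and the stand-in lambdas are hypothetical; real models would be trained networks, and only the wiring of inputs and outputs follows the text.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class MusicInfo:
    beat: Any
    pitch: Any
    tone: Any
    phrase: Any

def run_mir(audio: Any,
            beat_model: Callable[[Any], Any],
            pitch_model: Callable[[Any, Any], Any],
            tone_model: Callable[[Any], Any],
            phrase_model: Callable[[Any], Any]) -> MusicInfo:
    beat = beat_model(audio)            # first model: audio -> beat
    pitch = pitch_model(beat, audio)    # second model: beat + audio -> pitch
    tone = tone_model(pitch)            # third model: pitch -> tone
    phrase = phrase_model(audio)        # fourth model: audio -> phrases
    return MusicInfo(beat, pitch, tone, phrase)

# Stand-in callables with made-up outputs, used only to show the wiring.
info = run_mir(
    audio=[0.0, 0.1, -0.1],
    beat_model=lambda a: {"bpm": 90},
    pitch_model=lambda b, a: [60, 62, 64],
    tone_model=lambda p: "C major",
    phrase_model=lambda a: [(0.0, 2.0)],
)
```

Passing the models in as callables keeps the wiring independent of how each model is implemented, so one stage can be replaced without changing the others.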

[0165] In one possible implementation, MIR can be achieved by combining original-score matching with multiple AI models. Specifically, Automatic Speech Recognition (ASR) is performed on the humming audio to obtain the lyrics; the lyrics of the humming audio are matched with the original score to obtain the beat information, where the original score is the score of the original piece of music corresponding to the humming audio; the beat information and the humming audio are input into a second AI model, which outputs pitch information; and the pitch information of the humming audio is input into a third AI model, which outputs tone information.

[0166] Whether using one or multiple AI models, these models can be pre-trained to achieve the aforementioned functions required for MIR. For example, Figure 6 is a schematic diagram of the architecture of the MIR system 600 according to an embodiment of this application. As shown in Figure 6, the data acquisition device 660 is used to collect data (e.g., multiple sets of humming audio) and store it in the database 630. The training device 620 generates a target model / rule 601 based on the data maintained in the database 630. The following will describe in more detail how the training device 620 obtains the target model / rule 601 based on the data. The target model / rule 601 can achieve the functions of one or more AI models and output corresponding information.

[0167] The work of each layer in a deep neural network can be described by the mathematical expression y = a(W · x + b). From a physical perspective, the work of each layer in a deep neural network can be understood as transforming the input space (the set of input vectors) to the output space (i.e., from the row space to the column space of a matrix) through five operations on the input space. These five operations include: 1. Dimensionality increase / decrease; 2. Magnification / scaling; 3. Rotation; 4. Translation; 5. "Bending". Operations 1, 2, and 3 are completed by W · x, operation 4 is completed by + b, and operation 5 is implemented by a(). The term "space" is used here because the objects being classified are not individual things, but a class of things; space refers to the set of all individuals within this class of things. Here, W is the weight vector, where each value represents the weight of a neuron in that layer of the neural network. This vector W determines the spatial transformation from the input space to the output space, as described above; that is, the weights W of each layer control how the space is transformed. The purpose of training a deep neural network is to ultimately obtain the weight matrix of all layers of the trained neural network (a weight matrix formed by the vectors W from many layers). Therefore, the training process of a neural network is essentially learning how to control spatial transformation, more specifically, learning the weight matrix.
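
A single layer of this kind can be sketched numerically. The example assumes the common form y = a(W · x + b), which is consistent with the + b and a() terms mentioned above, and uses NumPy; the concrete values and the choice of ReLU for a() are illustrative assumptions.

```python
import numpy as np

def layer(x, W, b, a=np.tanh):
    """One layer: linear map W @ x, translation + b, then nonlinearity a()."""
    return a(W @ x + b)

x = np.array([1.0, 2.0])
W = np.array([[2.0, 0.0],
              [0.0, 0.5],
              [1.0, 1.0]])    # maps a 2-D input to a 3-D output
b = np.array([0.0, 1.0, -3.0])
y = layer(x, W, b, a=lambda v: np.maximum(v, 0.0))  # ReLU as the "bending"
print(y)                      # [2. 2. 0.]
```

Here W changes the dimensionality and scale of the input, b shifts the result, and the nonlinearity bends the space.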

[0168] Because the goal is for the output of a deep neural network to be as close as possible to the value that is actually desired, the current network's prediction can be compared with the desired target value, and the weight vector of each layer can be updated based on the difference. (There is usually an initialization process before the first update, in which parameters are pre-configured for each layer in the deep neural network.) For example, if the network's prediction is too high, the weight vector is adjusted to predict a lower value, and this adjustment continues until the neural network can predict the desired target value. Therefore, it is necessary to predefine "how to compare the difference between the predicted value and the target value," which is the loss function or objective function. These are important equations used to measure the difference between the predicted and target values. Taking the loss function as an example, a higher output value (loss) indicates a greater difference, so training a deep neural network becomes a process of minimizing this loss as much as possible.

[0169] The target model / rule 601 obtained from training device 620 can be applied to different systems or devices.

[0170] The execution device 610 is equipped with an I / O interface 612 for data interaction with external devices. The "user" can input data to the I / O interface 612 through the terminal device 640.

[0171] The execution device 610 can call data, code, etc. in the data storage system 650, and can also store data, instructions, etc. in the data storage system 650.

[0172] The calculation module 611 processes the input data using the target model / rule 601 to implement the MIR of this application embodiment.

[0173] The associated function module 613 and associated function module 614 can respectively implement related functions in the training process, such as preprocessing and filtering.

[0174] Finally, the I / O interface 612 returns the processing result to the terminal device 640 for the user.

[0175] At a deeper level, the training device 620 can generate corresponding target models / rules 601 based on different data for different objectives, in order to provide users with better results.

[0176] In the scenario shown in Figure 6, the user can manually specify the data to be input into the execution device 610, for example, by operating through the interface provided by the I / O interface 612. Alternatively, the terminal device 640 can automatically input data into the I / O interface 612 and obtain the results. If the terminal device 640 requires user authorization to automatically input data, the user can set the corresponding permissions in the terminal device 640. The user can view the results output by the execution device 610 on the terminal device 640, which can be presented in various ways such as display, sound, or animation. The terminal device 640 can also act as a data acquisition terminal to store the acquired data into the database 630.

[0177] It is worth noting that Figure 6 is merely a schematic diagram of an MIR system provided by an embodiment of the present invention. The positional relationships between the devices, components, modules, etc. shown in the figure do not constitute any limitation. For example, in Figure 6, the data storage system 650 is an external memory relative to the execution device 610. In other cases, the data storage system 650 can also be placed in the execution device 610. As another example, in Figure 6, the terminal device 640 and the execution device 610 are two devices. In other cases, the terminal device 640 and the execution device 610 can also be integrated into one device.

[0178] A convolutional neural network (CNN) is a deep neural network with convolutional structures. It is a deep learning architecture, which refers to learning at multiple levels of abstraction using machine learning algorithms. As a deep learning architecture, CNN is a feed-forward artificial neural network where each neuron responds to overlapping regions in the input image.

[0179] For example, Figure 7 is a schematic diagram of the structure of a CNN according to an embodiment of the present application. As shown in Figure 7, the CNN 700 may include an input layer 710, a convolutional layer / pooling layer 720 (in which the pooling layer is optional), and a neural network layer 730.

[0180] Convolutional / pooling layers 720:

[0181] Convolutional layers:

[0182] As shown in Figure 7, the convolutional / pooling layer 720 may include layers 721-726 as in the examples. In one implementation, layer 721 is a convolutional layer, layer 722 is a pooling layer, layer 723 is a convolutional layer, layer 724 is a pooling layer, layer 725 is a convolutional layer, and layer 726 is a pooling layer. In another implementation, layers 721 and 722 are convolutional layers, layer 723 is a pooling layer, layers 724 and 725 are convolutional layers, and layer 726 is a pooling layer. That is, the output of the convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.

[0183] Taking convolutional layer 721 as an example, it can include multiple convolution operators, also known as kernels. In image processing, a convolution operator acts as a filter, extracting specific information from the input image matrix. Essentially, a convolution operator can be a weight matrix, which is usually predefined. During the convolution operation, the weight matrix processes the input image pixel by pixel (or two pixels by two pixels, depending on the stride) along the horizontal direction, thus extracting specific features. The size of the weight matrix should be related to the image size. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image; during convolution, the weight matrix extends to the entire depth of the input image. Therefore, convolution with a single weight matrix produces a convolutional output with a depth of one. However, in most cases, multiple weight matrices of the same dimension are applied instead of a single weight matrix. The outputs of the weight matrices are stacked to form the depth dimension of the convolutional image. Different weight matrices can be used to extract different features from an image. For example, one weight matrix can be used to extract image edge information, another weight matrix can be used to extract specific colors from the image, and yet another weight matrix can be used to blur unwanted noise in the image. These multiple weight matrices have the same dimension, so the feature maps they extract also have the same dimension. The extracted feature maps are then merged to form the output of the convolution operation.

[0184] The weight values in these weight matrices need to be obtained through extensive training in practical applications. The weight matrices formed by the weight values obtained through training can extract information from the input image, thereby helping CNN 700 to make correct predictions.

[0185] When a CNN 700 has multiple convolutional layers, the initial convolutional layers (e.g., 721) tend to extract more general features, which can also be called low-level features. As the depth of the CNN 700 increases, the features extracted by later convolutional layers (e.g., 726) become more and more complex, such as high-level semantic features. The higher the semantic level of the features, the more suitable they are for the problem to be solved.

[0186] Pooling layer:

[0187] Because it is often necessary to reduce the number of training parameters, pooling layers are frequently introduced periodically after convolutional layers, as illustrated by layers 721-726 shown at 720 in Figure 7. This can be a convolutional layer followed by a pooling layer, or multiple convolutional layers followed by one or more pooling layers. In image processing, the sole purpose of pooling layers is to reduce the spatial size of the image. Pooling layers can include average pooling and / or max pooling operators to sample the input image to obtain a smaller image size. Average pooling calculates the average value of the pixel values within a specific range as the result of average pooling. Max pooling takes the pixel with the largest value within a specific range as the result of max pooling. Furthermore, just as the size of the weight matrix in a convolutional layer should be related to the image size, the operators in a pooling layer should also be related to the image size. The size of the output image after pooling can be smaller than the size of the input image of the pooling layer. Each pixel in the output image represents the average or maximum value of the corresponding sub-region of the input image of the pooling layer.

[0188] Neural network layer 730:

[0189] After processing by the convolutional / pooling layers 720, the CNN 700 is still insufficient to output the required information. As mentioned earlier, the convolutional / pooling layers 720 only extract features and reduce the parameters introduced by the input image. However, to generate the final output information (the required class information or other relevant information), the CNN 700 needs to utilize neural network layers 730 to generate one or a set of required class numbers of output. Therefore, neural network layer 730 can include multiple hidden layers (731, 732 to 73n as shown in Figure 7) and an output layer 740. The parameters contained in these hidden layers can be pre-trained based on relevant data for a specific task type, such as image recognition, image classification, or image super-resolution reconstruction.

[0190] The output layer 740 follows the multiple hidden layers in neural network layer 730 and is the last layer of the entire CNN 700. The output layer 740 has a loss function similar to the categorical cross-entropy, which is used to calculate the prediction error. Once the forward propagation of the entire CNN 700 is completed (in Figure 7, propagation from 710 to 740 is forward propagation), backpropagation (in Figure 7, propagation from 740 to 710 is backpropagation) begins to update the weight values and biases of the aforementioned layers, in order to reduce the loss of CNN 700 and the error between the result output by CNN 700 through the output layer and the ideal result.

[0191] It should be noted that the CNN 700 shown in Figure 7 is only an example of a convolutional neural network. In specific applications, convolutional neural networks can also exist in the form of other network models. For example, as shown in Figure 7, multiple convolutional / pooling layers can be parallelized, and the extracted features can be input into the full neural network layer 730 for processing.

[0192] Step 303: Generate the target audio based on the musical information of the humming audio.

[0193] In one possible implementation, the target audio includes at least the accompaniment audio corresponding to the humming audio.

[0194] An accompaniment score can be generated based on the musical information of the humming audio. This accompaniment score includes Musical Instrument Digital Interface (MIDI) scores corresponding to multiple orchestrations. In this step, chord progressions can first be generated based on the pitch and tone information in the musical information, with two beats as the smallest chord unit. The main logic is to calculate, for each chord unit, the chord that matches the most pitches and functions best (e.g., according to cadences and other common rules in popular music). Optionally, to facilitate the representation of different chord functions and reusability, all chord progressions can be represented using Roman numerals. Then, the accompaniment score is generated based on the musical information and chord progressions of the humming audio. That is, the style input by the user through the human-computer interaction interface is obtained (see the description above); multiple orchestrations are determined based on the style and the chord progressions; and the accompaniment score is then generated based on the multiple orchestrations and the musical information of the humming audio.
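
The chord selection logic described above can be sketched as follows, under simplifying assumptions: only the seven diatonic triads of a major key are considered, each two-beat unit is given as a list of MIDI pitches, and a fixed preference order stands in for the "functions best" rules. All names (`TRIADS`, `PREFERENCE`, `choose_chord`, `chord_progression`) are hypothetical.

```python
# Diatonic triads of a major key as pitch-class offsets from the tonic.
TRIADS = {
    "I": {0, 4, 7}, "ii": {2, 5, 9}, "iii": {4, 7, 11}, "IV": {5, 9, 0},
    "V": {7, 11, 2}, "vi": {9, 0, 4}, "vii°": {11, 2, 5},
}
# Hypothetical tie-break order standing in for the "functions best" rules.
PREFERENCE = ["I", "V", "IV", "vi", "ii", "iii", "vii°"]

def choose_chord(unit_pitches, tonic_pc):
    """Pick the triad that covers the most notes of one two-beat unit."""
    degrees = [(p - tonic_pc) % 12 for p in unit_pitches]

    def score(name):
        matches = sum(d in TRIADS[name] for d in degrees)
        return (matches, -PREFERENCE.index(name))

    return max(TRIADS, key=score)

def chord_progression(units, tonic_pc=0):
    """Roman-numeral chord for each two-beat unit of the melody."""
    return [choose_chord(unit, tonic_pc) for unit in units]

# Melody in C major (tonic pitch class 0), one list of MIDI notes per unit.
units = [[60, 64, 67], [65, 69], [67, 71, 62], [60, 64]]
print(chord_progression(units))   # ['I', 'IV', 'V', 'I']
```

Because the chords are expressed as Roman numerals relative to the tonic, the same progression can be reused in any key by changing `tonic_pc`.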

[0195] 1. Piano / Guitar / Bass: The main function of these parts is to provide chord support for the song and serve as the main accompaniment instruments. Therefore, the generation method is usually the same as chord progressions, following a two-beat unit. The system then refers to the chord tone and rhythm pattern library to generate the corresponding accompaniment pattern for each unit.

[0196] 2. Drums: Drums are generated in two- or four-beat units. However, they rely more on the position of measures within the song structure; for example, a fill-in pattern may occur at the end of a section. Furthermore, because drum instruments are identified by number in the MIDI format, each sound source needs to be specifically mapped to its corresponding drum instrument.

[0197] 3. Strings (Cello and other string instruments): Strings follow a non-linear generation method, which is more complex than the rhythmic logic of other instruments. Strings provide more contrapuntal melodies and layered roles, generating contrapuntal melodies based on major scale patterns while referencing chord progressions to ensure that the generated variable-length notes fit the chord progression within the time frame.

[0198] 4. Other: Finally, there is other miscellaneous generation logic, such as motifs, doubled melodies, and fill patterns for a specific number of measures. It also includes the configuration of the note velocity of the piano part, making the overall arrangement sound more like a live performance.
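
The drum item in the list above can be illustrated with a small sketch. The key numbers follow the General MIDI percussion map (36 bass drum, 38 acoustic snare, 42 closed hi-hat, 49 crash cymbal); the pattern library, the one-bar unit, and the rule of placing a fill in the last bar are assumptions made for the example, not the patterns used by this application.

```python
# General MIDI percussion key numbers (played on MIDI channel 10).
GM_DRUMS = {"kick": 36, "snare": 38, "closed_hihat": 42, "crash": 49}

# Hypothetical one-bar pattern library: (beat offset, drum name).
DRUM_PATTERNS = {
    "basic": [(0.0, "kick"), (0.0, "closed_hihat"), (1.0, "snare"),
              (1.0, "closed_hihat"), (2.0, "kick"), (2.0, "closed_hihat"),
              (3.0, "snare"), (3.0, "closed_hihat")],
    "fill": [(0.0, "kick"), (2.0, "snare"), (2.5, "snare"),
             (3.0, "snare"), (3.5, "snare")],
}

def drum_track(num_bars, beats_per_bar=4):
    """Return (start_beat, midi_note) events; the last bar uses a fill."""
    events = []
    for bar in range(num_bars):
        name = "fill" if bar == num_bars - 1 else "basic"
        for offset, drum in DRUM_PATTERNS[name]:
            events.append((bar * beats_per_bar + offset, GM_DRUMS[drum]))
    return events

events = drum_track(num_bars=4)
print(events[0], events[-1])   # (0.0, 36) (15.5, 38)
```

Keeping the name-to-number mapping in one table means the patterns can be written with readable drum names and mapped to MIDI numbers only when events are emitted.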

[0199] For example, as shown in Figure 8 (Figure 8 is a schematic diagram of the accompaniment score in an embodiment of this application), the accompaniment score generated based on the aforementioned music information can include various music styles, such as pure guitar or pure piano styles, or multi-instrument band styles, including piano, electric guitar, bass, cello, violin, drums, synthesizers, etc. In this embodiment of the application, Figure 8 can be displayed on a human-computer interaction interface, allowing users to intuitively see the accompaniment score generated for their hummed audio, and then use the score for more professional arrangement and production.

[0200] The target audio can be obtained by mixing the aforementioned accompaniment score. Specifically, virtual instrument rendering and single-track sound effect processing are performed on the corresponding MIDI scores for multiple orchestrations to obtain multiple single-track audio tracks. These single-track audio tracks are then mixed in multi-track stereo to obtain the target audio. In this way, the target audio can incorporate the timbres of various orchestrations, playing a melody in a chordal manner, resulting in an accompaniment track that is highly compatible with the user's humming audio.
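
The final mixing step can be sketched as follows, assuming the single-track audio has already been rendered to mono NumPy arrays at a common sample rate. The function `mix_stereo`, the constant-power panning law, and the peak normalization are illustrative choices; virtual instrument rendering and per-track sound effects are outside the scope of the sketch.

```python
import numpy as np

def mix_stereo(tracks, gains, pans):
    """Mix mono tracks into one stereo signal.

    tracks: list of 1-D arrays (mono, same sample rate)
    gains:  linear gain per track
    pans:   pan per track in [-1, 1] (-1 = left, 0 = center, 1 = right)
    """
    length = max(len(t) for t in tracks)
    mix = np.zeros((length, 2))
    for track, gain, pan in zip(tracks, gains, pans):
        # Constant-power panning: left**2 + right**2 == 1 for any pan.
        angle = (pan + 1.0) * np.pi / 4.0
        left, right = np.cos(angle), np.sin(angle)
        mix[:len(track), 0] += gain * left * track
        mix[:len(track), 1] += gain * right * track
    # Scale down only if the sum would clip outside [-1, 1].
    peak = np.max(np.abs(mix))
    return mix / peak if peak > 1.0 else mix

sr = 8000
t = np.arange(sr) / sr
piano = np.sin(2 * np.pi * 261.63 * t)   # stand-in for a rendered piano track
bass = np.sin(2 * np.pi * 65.41 * t)     # stand-in for a rendered bass track
stereo = mix_stereo([piano, bass], gains=[0.8, 0.6], pans=[-0.5, 0.0])
print(stereo.shape)                       # (8000, 2)
```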

[0201] In one possible implementation, the target audio includes the accompaniment audio corresponding to the humming audio and the vocal audio within the humming audio.

[0202] The system can preprocess the humming audio to obtain a single-track vocal audio; generate an accompaniment score based on the musical information of the humming audio, which includes multi-track MIDI scores corresponding to multiple instruments; render virtual instruments and perform single-track sound effects processing on the corresponding instruments based on the multi-track MIDI scores to obtain multiple single-track accompaniment audio; obtain multiple single-track audio from the single-track vocal audio and the multiple single-track accompaniment audio; and perform multi-track stereo mixing on the multiple single-track audio to obtain the target audio. Compared to the target audio generated in the previous method, this method adds the user's voice to the target audio, meaning the target audio includes both vocals and accompaniment, resulting in a song with a high degree of integration between vocals and accompaniment.

[0203] This application embodiment obtains musical information such as rhythm, pitch, and tone from the humming audio via MIR, thereby calculating the information needed to match the accompaniment audio to the user's humming, such as vocal timing, the role and emotion of the accompaniment in different musical sections, and chord progressions, and then obtains the corresponding target audio based on the musical information. As a result, even ordinary users without professional singing skills can obtain complete and reasonable accompaniment or songs through this application embodiment, achieving singing and composition effects that would otherwise require a professional music production team. Professional singer-songwriters can also use this application embodiment to quickly turn their inspiration into complete music that serves as a valuable reference. This application embodiment can also generate accompaniment in new styles for existing songs, which can serve as secondary creation or help avoid potential copyright infringement issues related to original soundtracks.

[0204] The technical solutions of the method embodiments shown in Figure 3 will be described in detail below using several specific examples.

[0205] Figure 9 is an architecture diagram of the audio generation system according to an embodiment of this application. As shown in Figure 9, the system includes four functional modules: a user interface (UI) module, a MIR module, an automatic accompaniment module, and an automatic mixing module.

[0206] The purpose of the UI module is to provide users with simple and clear guidance to select song styles and record. The UI module offers multiple recording modes, allowing novice, intermediate, and professional users to record songs at specific beats per minute (BPM) according to their individual needs. Based on the user-friendly interface provided by the UI module, the user's humming audio can be obtained.

[0207] The MIR module includes vocal transcription, beat recognition, tonality recognition, and musical phrase recognition. It can use AI technology to extract musical information from humming audio, such as beat, pitch, tonality, and musical phrase, and convert the musical information into digital JSON information for use by other modules.
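As an illustration of the JSON hand-off described above, the following is a minimal sketch of how the extracted musical information might be serialized. The field names (`beats_s`, `notes`, `key`, `segments`) and the example values are assumptions made for this sketch; the application does not specify a schema.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class Note:
    onset_s: float     # note start time in seconds
    duration_s: float  # note length in seconds
    midi_pitch: int    # MIDI note number, e.g. 60 = middle C


@dataclass
class Segment:
    label: str         # e.g. "verse" or "chorus"
    start_s: float
    end_s: float


@dataclass
class MusicInfo:
    # Hypothetical container for the MIR output; not a schema from the application.
    beats_s: list[float] = field(default_factory=list)
    notes: list[Note] = field(default_factory=list)
    key: str = ""
    segments: list[Segment] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() converts nested dataclasses to plain dicts and lists
        return json.dumps(asdict(self), indent=2)


info = MusicInfo(
    beats_s=[0.0, 0.5, 1.0, 1.5],
    notes=[Note(0.0, 0.5, 60), Note(0.5, 0.5, 64)],
    key="C major",
    segments=[Segment("chorus", 0.0, 2.0)],
)
payload = info.to_json()
```

Downstream modules could then parse `payload` with `json.loads` and read only the fields they need.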

[0208] The automatic accompaniment module includes chord arrangement and MIDI generation. Based on the music information extracted by the MIR module, it generates MIDI scores for the accompaniment in the symbolic domain. It can also configure different numbers of tracks, different types of instruments (instrumentation), and chords according to the song's length and style to create different effects. Finally, it aligns the accompaniment with the humming audio so that the two match without any disharmony. Specifically, it first selects the chord with the highest matching degree based on the pitch of the humming audio, and then generates multi-track MIDI scores with different instrumentation for each musical segment based on the user-specified style and the identified musical structure. Optionally, the melody of the chorus can be extracted and used as the intro and outro of the song to create a sense of progression and structure.

[0209] The automatic mixing module includes sound source rendering, vocal mixing, stereo accompaniment mixing, and overall mastering. It can render high-quality target audio as the final product based on the accompaniment MIDI score generated by the automatic accompaniment module. During the process, appropriate sound source files are selected based on the song style and the instruments used in the track to generate corresponding dry instrument audio, which is then mixed with the humming audio to ensure complete blending of instrument sounds and vocals, achieving a commercially viable song quality. Optionally, in addition to the complete song, the system can also generate pure accompaniment audio for various scenarios (such as karaoke use).

[0210] Figure 10 is a flowchart of the audio generation method according to an embodiment of this application. As shown in Figure 10, the following functions are implemented by various modules in the system shown in Figure 9:

[0211] UI module

[0212] Recording is done through the user interface provided by the UI module. Users can choose between a professional mode with a metronome or a normal mode without a metronome, depending on their musical skill level. After recording, users can select different music styles, including band, pure piano, and pure guitar. The user's humming audio and the aforementioned information can be packaged and sent to subsequent modules in the form of an HTTP request.
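The packaging step above could look roughly like the following sketch, which builds (but does not send) an HTTP POST request carrying the recording, the chosen style, and the recording mode. The URL, the JSON field names, and the base64 encoding of the audio are all assumptions for illustration; the application only states that the data is sent as an HTTP request.

```python
import base64
import json
import urllib.request


def build_upload_request(audio_bytes: bytes, style: str, use_metronome: bool,
                         url: str = "http://localhost:8000/generate") -> urllib.request.Request:
    """Package the humming audio and user choices into a POST request (hypothetical format)."""
    body = {
        "audio_b64": base64.b64encode(audio_bytes).decode("ascii"),
        "style": style,  # e.g. "band", "piano", "guitar"
        "mode": "professional" if use_metronome else "normal",
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Build a request for a recording made in normal mode (no metronome).
request = build_upload_request(b"\x00\x01fake-pcm", style="band", use_metronome=False)
```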

[0213] MIR module

[0214] The humming audio is subjected to MIR (Musical Information Retrieval) to obtain musical information such as beat, musical phrase, pitch transcription, and tone, and then post-processed, as shown in Figure 11 (Figure 11 is a visualization of the MIR module in an embodiment of this application). The processing procedure is as follows:

[0215] 1. Beat Information: The beat points of the humming audio, including the timestamps of the accents and sub-beats, are stored as timestamps. If the user selects the normal recording mode, i.e. without a metronome, the humming will not follow a completely fixed BPM, but will have a certain degree of time fluctuation. In this case, the beat points will become the most crucial information for the humming and accompaniment to be perfectly aligned.

[0216] 2. Musical Segment Structure: The system analyzes the humming audio and automatically divides the music into six common pop music segments based on time points (segment start and end): intro, verse, chorus, interlude, bridge, and outro. The subsequent accompaniment module provides different energy levels for each segment; for example, the verse has lower energy and a calmer mood than the chorus, while the intro and outro reproduce the melody and motifs. While a complete pop song typically consists of multiple verses and choruses, in practice, user recordings are often short, sometimes not even forming a complete segment. In such cases, the post-processing algorithm treats it as an independent segment.

[0217] 3. Vocal Pitch Transcription and Tonality: Based on the beat information, the system will segment the humming audio into 16-beat units and identify the pitch of each 16-beat unit. The tonality will then be determined based on the pitch, and chords will be arranged in subsequent modules. Pitch is also a key component for extracting melodic motifs and may be used in the arrangement.
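Items 1 and 3 above can be illustrated with a small sketch. The first function derives the local tempo between consecutive beat timestamps, which shows how a recording made without a metronome drifts in BPM. The second estimates a major key from transcribed pitches with a simple scale-membership heuristic. This heuristic is a stand-in chosen for this sketch, not the AI-based recognition described in this application, and it only considers major keys.

```python
from collections import Counter

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
MAJOR_SCALE = {0, 2, 4, 5, 7, 9, 11}  # semitone offsets of a major scale from its tonic


def local_bpm(beats_s: list[float]) -> list[float]:
    """Tempo implied by each pair of consecutive beat timestamps (in seconds)."""
    return [60.0 / (b - a) for a, b in zip(beats_s, beats_s[1:])]


def estimate_major_key(midi_pitches: list[int]) -> str:
    """Pick the major key whose scale covers the most notes (assumed heuristic)."""
    counts = Counter(p % 12 for p in midi_pitches)

    def score(tonic: int) -> float:
        in_scale = sum(n for pc, n in counts.items() if (pc - tonic) % 12 in MAJOR_SCALE)
        # Small bonus for notes on the tonic itself, to separate closely related keys
        return in_scale + 0.5 * counts[tonic]

    best = max(range(12), key=score)
    return f"{NOTE_NAMES[best]} major"


tempos = local_bpm([0.0, 0.5, 1.0, 1.6])  # the last beat arrives late, so tempo drops
key = estimate_major_key([60, 62, 64, 65, 67, 69, 71, 72, 60])
```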

[0218] Automatic accompaniment module

[0219] The system generates a MIDI score for accompaniment based on the extracted music information. It also arranges different numbers of tracks, different types of instruments, and chords according to the length and style of the song to create different effects. Finally, it aligns the user's song input to ensure that it can perfectly match the accompaniment audio without any sense of incongruity.

[0220] For example, as shown in Figure 12 (Figure 12 is a flowchart of the generation process of accompaniment MIDI score according to an embodiment of this application), the corresponding chord progression is first generated based on the pitch and tone. The smallest chord unit is two beats. The main logic is to find, for each chord unit, the chord that matches the most pitches and best fits the harmonic function (e.g., cadences and other common rules in popular music). Optionally, to make the function of each chord explicit and to allow chords to be reused across keys, all chord progressions can be represented by Roman numerals. Pitch information is a necessary condition for generating chords. Based on the pitch information, chords that conform to the pitch can be generated, and based on the chords, the parts of all instruments can be generated.
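A minimal sketch of the "most matching pitches" rule might look as follows. It assumes a major key, considers only the seven diatonic triads, and uses a fixed preference order (I, V, IV, ...) as a crude stand-in for the functional rules mentioned above; the table `ROMAN_TRIADS` and both function names are hypothetical and not taken from the application.

```python
# Diatonic triads of a major key as semitone offsets from the tonic.
# Dict order doubles as the tie-break preference (an assumption of this sketch).
ROMAN_TRIADS = {
    "I": {0, 4, 7}, "V": {7, 11, 2}, "IV": {5, 9, 0}, "vi": {9, 0, 4},
    "ii": {2, 5, 9}, "iii": {4, 7, 11}, "vii°": {11, 2, 5},
}


def pick_chord(unit_pitches: list[int], tonic_pc: int) -> str:
    """Return the Roman numeral whose chord tones cover the most pitches in the unit."""
    def matches(numeral: str) -> int:
        tones = ROMAN_TRIADS[numeral]
        return sum(1 for p in unit_pitches if (p - tonic_pc) % 12 in tones)

    # max() returns the first maximal item, so earlier numerals win ties
    return max(ROMAN_TRIADS, key=matches)


def chord_progression(units: list[list[int]], tonic_pc: int) -> list[str]:
    """One chord per two-beat unit; each unit is the list of MIDI pitches sung in it."""
    return [pick_chord(unit, tonic_pc) for unit in units]


# Four two-beat units of a melody in C major (tonic pitch class 0).
progression = chord_progression(
    [[60, 64, 67], [65, 69, 72], [67, 71, 74], [60, 64]], tonic_pc=0
)
```

Because the result is expressed in Roman numerals, the same progression can be transposed to any key by changing `tonic_pc` alone.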

[0221] Then, based on pitch, chord progression, and other corresponding musical information, the following MIDI scores with different orchestrations are generated:

[0222] 1. Piano / Guitar / Bass: The main function of these parts is to provide chord support for the song and serve as the main accompaniment instruments. Therefore, the generation method is usually the same as chord progressions, following a two-beat unit. The system then refers to the chord tone and rhythm pattern library to generate the corresponding accompaniment pattern for each unit.

[0223] 2. Drums: Drum parts are generated in two- or four-beat units. However, they depend more on the position of each measure within the structure; for example, a fill-in pattern may occur at the end of a section. Furthermore, because drum instruments are identified by number in the MIDI format, each sound source needs to be specifically mapped to its corresponding drum instrument.

[0224] 3. Strings (Cello and other string instruments): Strings follow a non-linear generation method, which is more complex than the rhythmic logic of other instruments. Strings provide more contrapuntal melodies and layered roles, generating contrapuntal melodies based on major scale patterns while referencing chord progressions to ensure that the generated variable-length notes fit the chord progression within the time frame.

[0225] 4. Other: Finally, the module also includes other miscellaneous generation logic, such as melodic motifs, doubled melodies, and fill patterns for a specific number of measures. It also includes configuration of the piano's note velocity, making the overall arrangement sound more like a live performance.
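To make the drum item above more concrete, the following sketch generates a basic four-beat pattern and switches to a snare fill in the last two beats of the final measure of a section. The note numbers are the standard General MIDI percussion keys (36 bass drum, 38 acoustic snare, 42 closed hi-hat); the pattern itself and the function name `drum_bar` are assumptions for illustration, not the pattern library of this application.

```python
GM_DRUMS = {"kick": 36, "snare": 38, "closed_hat": 42}  # General MIDI percussion keys


def drum_bar(bar_index: int, bars_in_section: int, beats_per_bar: int = 4):
    """Return (beat_position, midi_note) events for one measure of a section."""
    events = []
    start = bar_index * beats_per_bar
    is_last_bar = bar_index == bars_in_section - 1
    for beat in range(beats_per_bar):
        pos = start + beat
        # Hi-hat on every eighth note
        events.append((pos, GM_DRUMS["closed_hat"]))
        events.append((pos + 0.5, GM_DRUMS["closed_hat"]))
        if is_last_bar and beat >= beats_per_bar - 2:
            # Fill-in: snare on every eighth note of the final two beats
            events.append((pos, GM_DRUMS["snare"]))
            events.append((pos + 0.5, GM_DRUMS["snare"]))
        elif beat % 2 == 0:
            events.append((pos, GM_DRUMS["kick"]))   # beats 1 and 3
        else:
            events.append((pos, GM_DRUMS["snare"]))  # backbeat on beats 2 and 4
    return events


# A two-measure section: a plain measure followed by a measure ending in a fill.
section = [event for bar in range(2) for event in drum_bar(bar, bars_in_section=2)]
```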

[0226] Automatic mixing module

[0227] Based on the MIDI scores with different orchestrations, they are rendered into high-quality target audio as the final product. During the process, appropriate sound source files are selected to generate the corresponding dry audio of the instruments according to the song style and the instruments used in the track. This dry audio is then mixed with the user's vocal input to ensure that the instrument sounds and vocals are completely integrated, achieving the level of a commercially viable song.

[0228] For example, as shown in Figure 13 (Figure 13 is the mixing flow of automatic mixing according to an embodiment of this application), the processing procedure is as follows:

[0229] 1. Based on the information of each track in the MIDI score, the "virtual instrument rendering module" first selects the sound source of the virtual instrument according to the genre and renders it into audio;

[0230] 2. After the human voice undergoes noise reduction and sound balance repair in the "human voice preprocessing module", it is optimized for single-track sound effects, just like the rendered audio of each instrument, including adjusting its equalization, compression ratio, etc.

[0231] 3. Add reverb, delay, and other effects through the "Reverb Space Rendering Module";

[0232] 4. Adjust the overall dry / wet sound ratio using the "dry / wet sound balance module";

[0233] 5. The "multi-track mixing module" analyzes the vocals, drum instruments, and non-drum instruments separately and calls the ducking adaptive volume control module, which is more conducive to the blending of accompaniment and vocals.

[0234] 6. Call the "Mastering Module" to adaptively process the final mix using multi-band compression, ensuring that the output audio meets the objective signal indicators of commercial music.
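The ducking idea in step 5 can be sketched as follows: whenever the vocal signal exceeds a threshold, the accompaniment gain is pulled toward a reduced level, and it recovers when the vocal falls silent. This is a deliberately simple sample-wise version; the threshold, reduction, and smoothing values are assumed for illustration and are not parameters given in this application.

```python
def duck(accomp: list[float], vocal: list[float], threshold: float = 0.1,
         reduction: float = 0.5, smooth: float = 0.5) -> list[float]:
    """Lower the accompaniment while the vocal is active (simplified ducking)."""
    out, gain = [], 1.0
    for a, v in zip(accomp, vocal):
        target = reduction if abs(v) > threshold else 1.0
        # One-pole smoothing so the gain glides toward the target instead of jumping
        gain += smooth * (target - gain)
        out.append(a * gain)
    return out


# The vocal is active on the two middle samples only.
ducked = duck([1.0, 1.0, 1.0, 1.0], [0.0, 0.5, 0.5, 0.0])
```

In practice the detector would work on a smoothed envelope over many samples rather than on single sample values, but the gain logic is the same.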

[0235] Once the final product is obtained, the audio can be sent to the UI module via an HTTP request for users to listen to and share.

[0236] In this embodiment, the user recorded humming the chorus of the popular song "Love Story" using a mobile phone. The music information was accurately captured by the MIR module. Therefore, the subsequent accompaniment mixing module can generate accompaniment audio that perfectly matches the humming audio while retaining the user's humming information.

[0237] Regarding the generation of accompaniment MIDI scores, following standard music production practices, chord progressions that match vocals are sufficient for the needs of most users, avoiding any conflict between vocal and instrument pitches. In terms of orchestration, this embodiment employs a band style, including most instrumental parts, thus better simulating the completeness of a real pop song.

[0238] In terms of automatic mixing, due to the diverse range of instruments in the band, sound image distribution was implemented for different instruments: the piano has a wide stereo width, positioned slightly to the left of center; the violin and cello are positioned on the left and right respectively; the celesta has a greater sense of depth; and the vocals are centered and have a sense of space. In terms of spectral energy, the bass is highly elastic; the drums have a natural energy distribution across the entire frequency range with ample space; and the string instruments highlight their characteristic high frequencies. Compared to MicroVoice, the self-developed algorithm achieves better balance and integration between vocals and accompaniment; compared to Ripple, the final product is also louder, with fuller energy across the entire frequency range, resulting in a more professional listening experience.
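The sound image distribution described above amounts to giving each track its own position in the stereo field. A minimal sketch using the common constant-power pan law is shown below; the pan law, the function names, and the example pan values are assumptions for illustration, since the application does not state how positions are computed.

```python
import math


def pan_gains(pan: float) -> tuple[float, float]:
    """Constant-power pan law: pan = -1 is hard left, 0 is centre, +1 is hard right."""
    theta = (pan + 1.0) * math.pi / 4.0
    return math.cos(theta), math.sin(theta)


def mix_to_stereo(tracks: list[tuple[list[float], float]]) -> tuple[list[float], list[float]]:
    """Sum mono tracks, each given as (samples, pan), into left and right channels."""
    length = max(len(samples) for samples, _ in tracks)
    left, right = [0.0] * length, [0.0] * length
    for samples, pan in tracks:
        gain_l, gain_r = pan_gains(pan)
        for i, x in enumerate(samples):
            left[i] += gain_l * x
            right[i] += gain_r * x
    return left, right


# Hypothetical placement: vocals centred, piano slightly left, violin left, cello right.
left, right = mix_to_stereo([
    ([0.2, 0.2], 0.0),    # vocals
    ([0.1, 0.1], -0.2),   # piano
    ([0.1, 0.1], -0.7),   # violin
    ([0.1, 0.1], 0.7),    # cello
])
```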

[0239] Figure 14 is a flowchart of the audio generation method according to an embodiment of this application. As shown in Figure 14, both this embodiment and the embodiment shown in Figure 10 can be applied to recording humming and background music on general mobile phones and tablets. The process is basically the same. The difference is that the humming audio recorded in this embodiment has the original score (i.e., an existing song). Therefore, it can be compared with the original song using Automatic Speech Recognition (ASR) to obtain more accurate music information and achieve a better background music effect. The following functions are implemented by the various modules in the system shown in Figure 9. The input, automatic accompaniment, and automatic mixing parts of this embodiment are basically the same as those in the embodiment shown in Figure 10, but the MIR module is integrated with ASR:

[0240] 1. Prepare a set of song templates: For the use of this application in different scenarios (such as holiday scenarios, customized scenarios), we can find corresponding existing songs to transcribe, obtain sheet music templates, lyric templates, and chord templates, and organize them into a set of templates so as to align with the user's singing input.

[0241] 2. User humming lyrics recognition: Lyrics are extracted from the humming audio using ASR technology and matched against songs in the template set to find the original song corresponding to the humming audio. If no song is matched, the MIR module shown in the embodiment of Figure 10 will be used.

[0242] 3. Lyrics Timestamp Matching: After matching the original song, the system matches the word-level timestamps of the lyrics in the humming audio with the original song to find the beat length of each lyric, thereby calculating the beat information to ensure that the accompaniment is perfectly in sync with the lyrics. Considering that the humming audio will have a certain degree of time fluctuation, the system will normalize the final beat to ensure that the beat fluctuation is within a controllable range to avoid excessive fluctuation.

[0243] 4. Vocal pitch transcription and tonality: When acquiring pitch, the pitch of the original song is considered at the same time to increase the accuracy of tonality recognition. At the same time, it ensures that even if the user is slightly off-key, the closest tonality can be found to maintain listenability.
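Steps 2 and 3 above can be sketched as follows. The first function matches recognized lyrics against a template set and returns `None` when nothing is similar enough, which corresponds to falling back to the plain MIR path. The second converts word-level timestamps into a seconds-per-beat value for each word and clamps outliers around the median. The similarity measure (`difflib.SequenceMatcher`), the 0.6 threshold, and the 15% clamp are assumptions made for this sketch, not values from this application.

```python
from difflib import SequenceMatcher
from typing import Optional


def match_template(lyrics: str, templates: dict[str, str],
                   min_ratio: float = 0.6) -> Optional[str]:
    """Return the best-matching template title, or None to fall back to plain MIR."""
    best_title, best_ratio = None, 0.0
    for title, template_lyrics in templates.items():
        ratio = SequenceMatcher(None, lyrics.lower(), template_lyrics.lower()).ratio()
        if ratio > best_ratio:
            best_title, best_ratio = title, ratio
    return best_title if best_ratio >= min_ratio else None


def normalized_beat_lengths(word_onsets_s: list[float], word_beats: list[float],
                            max_dev: float = 0.15) -> list[float]:
    """Seconds per beat for each word, clamped to within max_dev of the median.

    word_onsets_s: onset time of each sung word plus the end time of the last word.
    word_beats: length in beats of each word according to the template score.
    """
    raw = [(t1 - t0) / beats
           for t0, t1, beats in zip(word_onsets_s, word_onsets_s[1:], word_beats)]
    median = sorted(raw)[len(raw) // 2]
    low, high = median * (1 - max_dev), median * (1 + max_dev)
    return [min(max(x, low), high) for x in raw]


templates = {"song_a": "happy birthday to you", "song_b": "twinkle twinkle little star"}
matched = match_template("happy birthday to yu", templates)
beat_lengths = normalized_beat_lengths([0.0, 0.5, 1.0, 2.0], [1, 1, 1])
```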

[0244] After passing through the ASR and MIR modules, the system will run the automatic accompaniment module to generate the MIDI score for the accompaniment and perform mixing. The chord progressions of the MIDI accompaniment will use the original chord templates in the song template to match the original song's effect and ensure that the user can recognize the similarity to the original song.

[0245] Building upon the technical effects of the embodiment shown in Figure 10, this embodiment further enhances the accuracy of music information extraction by using the original song's sheet music as a reference. It also allows for the direct reuse of the original song's chord progressions, ensuring that users can discern a similarity to the original song and generate emotional resonance in specific scenarios. For example, in the Mother's Day scenario, we have already incorporated nine Mother's Day song templates into the process, enabling rapid alignment of music information and the generation of accurate emotional and musical theory information when humming audio corresponds to a song.

[0246] In addition, considering the potential copyright issues that this embodiment may cause, the system will only reuse the chord progressions of the original song, without reusing its instrumentation or other mixing materials. This also ensures that the musical style is completely different from the original song, thereby avoiding copyright issues.

[0247] Figure 15 is a structural schematic diagram of the audio generation device 1500 of this application. As shown in Figure 15, the audio generation device 1500 of this embodiment can be applied to the aforementioned terminal device. The audio generation device 1500 may include: a human-computer interaction module 1501, an MIR module 1502, and an audio generation module 1503.

[0248] The human-computer interaction module 1501 is used to acquire humming audio, which is input by the user through the human-computer interaction interface; the MIR module 1502 is used to perform music information extraction (MIR) on the humming audio to obtain the music information of the humming audio, which includes beat information, pitch information and tone information; the audio generation module 1503 is used to generate target audio based on the music information of the humming audio, which includes at least the accompaniment audio corresponding to the humming audio.

[0249] In one possible implementation, the MIR module 1502 is specifically used to input the humming audio into an artificial intelligence (AI) model and output the music information of the humming audio.

[0250] In one possible implementation, the MIR module 1502 is specifically used to input the humming audio into a first AI model and output the beat information; input the beat information and the humming audio into a second AI model and output the pitch information; and input the pitch information of the humming audio into a third AI model and output the tone information.
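The data flow of this cascaded implementation can be sketched as below, where each model is passed in as a plain callable. The three stub models are placeholders invented for this sketch so that the example runs; they do not represent the actual AI models.

```python
from typing import Callable, Sequence

Audio = Sequence[float]


def run_mir_cascade(audio: Audio,
                    beat_model: Callable[[Audio], list],
                    pitch_model: Callable[[list, Audio], list],
                    tone_model: Callable[[list], str]) -> dict:
    """Chain the three models: beat -> pitch (needs beat and audio) -> tone (needs pitch)."""
    beats = beat_model(audio)
    pitches = pitch_model(beats, audio)
    tone = tone_model(pitches)
    return {"beat": beats, "pitch": pitches, "tone": tone}


# Placeholder models (hypothetical), used only to show the order of the calls.
def stub_beats(audio: Audio) -> list:
    return [0.0, 0.5, 1.0]


def stub_pitches(beats: list, audio: Audio) -> list:
    return [60 for _ in beats]  # one pitch per beat-aligned unit


def stub_tone(pitches: list) -> str:
    return "C major"


music_info = run_mir_cascade([0.0] * 16, stub_beats, stub_pitches, stub_tone)
```

Passing the models as arguments keeps the cascade independent of any particular model framework, so the first model could be swapped for the ASR-based beat extraction without changing the later stages.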

[0251] In one possible implementation, the MIR module 1502 is specifically used to perform Automatic Speech Recognition (ASR) on the humming audio to obtain the lyrics of the humming audio; match the lyrics of the humming audio with the original musical score to obtain the beat information, wherein the original musical score is the score of the original piece of music corresponding to the humming audio; input the beat information and the humming audio into a second AI model to output the pitch information; and input the pitch information of the humming audio into a third AI model to output the tone information.

[0252] In one possible implementation, the musical information of the humming audio also includes musical segment information; the MIR module 1502 is further used to input the humming audio into the fourth AI model and output the musical segment information.

[0253] In one possible implementation, the audio generation module 1503 is specifically used to generate an accompaniment score based on the musical information of the humming audio, the accompaniment score including a multi-track instrument digital interface (MIDI) score corresponding to multiple orchestrations; and to perform mixing processing based on the accompaniment score to obtain the target audio.

[0254] In one possible implementation, the audio generation module 1503 is specifically used to generate chord progressions based on the pitch information and the tone information; and to generate the accompaniment score based on the musical information of the humming audio and the chord progressions.

[0255] In one possible implementation, the audio generation module 1503 is specifically used to acquire the style input by the user through the human-computer interaction interface; determine the plurality of orchestrations based on the style and the chords; and generate the accompaniment score based on the plurality of orchestrations and the musical information of the humming audio.

[0256] In one possible implementation, the audio generation module 1503 is specifically used to render virtual instruments and perform single-track sound effect processing on the corresponding multi-track MIDI scores of the multiple orchestrations to obtain multiple single-track audio; and to perform multi-track stereo mixing processing on the multiple single-track audio to obtain the target audio.

[0257] In one possible implementation, the target audio further includes the vocal audio from the humming audio; the audio generation module 1503 is further configured to perform vocal preprocessing on the humming audio to obtain a single-track vocal audio; to perform virtual instrument rendering and single-track sound effect processing on the corresponding multi-track MIDI scores for multiple orchestrations to obtain multiple single-track accompaniment audio; to obtain multiple single-track audio based on the single-track vocal audio and the multiple single-track accompaniment audio; and to perform multi-track stereo mixing processing on the multiple single-track audio to obtain the target audio.

[0258] The apparatus in this embodiment can be used to execute the technical solution of the method embodiment shown in Figure 3. Its implementation principle and technical effect are similar, and will not be described again here.

[0259] It is understood that, in order to achieve the above-mentioned functions, the terminal device includes hardware and / or software modules that perform the respective functions. Based on the algorithm steps of the examples described in the embodiments disclosed herein, this application can be implemented in hardware or a combination of hardware and computer software. Whether a function is executed by hardware or by computer software driving hardware depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application in conjunction with the embodiments, but such implementation should not be considered beyond the scope of this application.

[0260] In one example, Figure 16 shows a schematic block diagram of an apparatus 1600 according to an embodiment of the present application. The apparatus 1600 may include a processor 1601 and a transceiver / transceiver pin 1602, and optionally, a memory 1603.

[0261] The various components of device 1600 are coupled together via bus 1604, which includes a data bus, a power bus, a control bus, and a status signal bus. However, for clarity, all buses are referred to as bus 1604 in the figure.

[0262] Optionally, the memory 1603 can be used to store the instructions in the foregoing method embodiments. The processor 1601 can be used to execute the instructions in the memory 1603, control the receive pin to receive signals, and control the transmit pin to transmit signals.

[0263] Device 1600 may be the terminal device or the chip of the terminal device in the above method embodiments.

[0264] All relevant content of each step involved in the above method embodiments can be referenced from the functional description of the corresponding functional module, and will not be repeated here.

[0265] This embodiment also provides a computer storage medium storing computer instructions. When the computer instructions are executed on a terminal device, the terminal device performs the aforementioned method steps to implement the audio generation method in the above embodiment.

[0266] This embodiment also provides a computer program product that, when run on a computer, causes the computer to perform the aforementioned steps to implement the audio generation method in the above embodiment.

[0267] In addition, embodiments of this application also provide an apparatus, which may specifically be a chip, component, or module. The apparatus may include a connected processor and a memory; wherein the memory is used to store computer execution instructions, and when the apparatus is running, the processor may execute the computer execution instructions stored in the memory to cause the chip to execute the audio generation method in the above-described method embodiments.

[0268] In this embodiment, the terminal device, computer storage medium, computer program product or chip are all used to execute the corresponding methods provided above. Therefore, the beneficial effects they can achieve can be referred to the beneficial effects in the corresponding methods provided above, and will not be repeated here.

[0269] Through the above description of the embodiments, those skilled in the art will understand that, for the sake of convenience and brevity, only the division of the above functional modules is used as an example. In actual applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above.

[0270] In the several embodiments provided in this application, it should be understood that the disclosed apparatus and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of modules or units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another apparatus, or some features may be ignored or not executed. Furthermore, the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between apparatuses or units may be electrical, mechanical, or other forms.

[0271] The units described as separate components may or may not be physically separate. A component shown as a unit can be one or more physical units; that is, it can be located in one place or distributed in multiple different locations. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0272] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.

[0273] Any content in the various embodiments of this application, as well as any content in the same embodiment, can be freely combined. Any combination of the above content is within the scope of this application.

[0274] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a readable storage medium. Based on this understanding, the technical solutions of the embodiments of this application, in essence, or the parts that contribute to the prior art, or all or part of the technical solutions, can be embodied in the form of a software product. This software product is stored in a storage medium and includes several instructions to cause a device (which may be a microcontroller, chip, etc.) or processor to execute all or part of the steps of the methods of the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0275] The embodiments of this application have been described above with reference to the accompanying drawings. However, this application is not limited to the specific embodiments described above. The specific embodiments described above are merely illustrative and not restrictive. Those skilled in the art can make many other forms under the guidance of this application without departing from the spirit and scope of the claims, and all of these forms are within the protection scope of this application.

[0276] The steps of the methods or algorithms described in conjunction with the embodiments of this application can be implemented in hardware or by a processor executing software instructions. The software instructions can consist of corresponding software modules, which can be stored in random access memory (RAM), flash memory, read-only memory (ROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, hard disks, portable hard disks, CD-ROMs, or any other form of storage medium well known in the art. An exemplary storage medium is coupled to a processor, enabling the processor to read information from and write information to the storage medium. Of course, the storage medium can also be a component of the processor. The processor and the storage medium can reside in an ASIC.

[0277] Those skilled in the art will recognize that the functions described in the embodiments of this application in one or more of the above examples can be implemented using hardware, software, firmware, or any combination thereof. When implemented using software, these functions can be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium. Computer-readable media include computer storage media and communication media, wherein communication media include any medium that facilitates the transfer of a computer program from one place to another. Storage media can be any available medium that can be accessed by a general-purpose or special-purpose computer.

[0278] The embodiments of this application have been described above with reference to the accompanying drawings. However, this application is not limited to the specific embodiments described above. The specific embodiments described above are merely illustrative and not restrictive. Those skilled in the art can make many other forms under the guidance of this application without departing from the spirit and scope of the claims, and all of these forms are within the protection scope of this application.

Claims

1. An audio generation method, characterized in that it comprises: acquiring humming audio, which is input by the user through a human-computer interaction interface; performing music information extraction (MIR) on the humming audio to obtain the music information of the humming audio, the music information including beat information, pitch information and tone information; and generating a target audio based on the music information of the humming audio, the target audio including at least the accompaniment audio corresponding to the humming audio.

2. The method according to claim 1, characterized in that, The step of performing MIR on the humming audio to obtain the musical information of the humming audio includes: The humming audio is input into an artificial intelligence (AI) model, which outputs the musical information of the humming audio.

3. The method according to claim 1 or 2, characterized in that, The step of performing MIR on the humming audio to obtain the musical information of the humming audio includes: The humming audio is input into the first AI model, and the beat information is output. The beat information and the humming audio are input into the second AI model, and the pitch information is output. The pitch information of the humming audio is input into the third AI model, and the tone information is output.

4. The method according to claim 1 or 2, characterized in that, The step of performing MIR on the humming audio to obtain the musical information of the humming audio includes: Automatic speech recognition (ASR) is performed on the humming audio to obtain the lyrics of the humming audio; The lyrics of the humming audio are matched with the original musical score to obtain the beat information. The original musical score is the score of the original music corresponding to the humming audio. The beat information and the humming audio are input into the second AI model, and the pitch information is output. The pitch information of the humming audio is input into the third AI model, and the tone information is output.

5. The method according to any one of claims 1-4, characterized in that the musical information of the humming audio also includes musical segment information; and the step of performing MIR on the humming audio to obtain the musical information of the humming audio includes: inputting the humming audio into a fourth AI model, which outputs the musical segment information.

6. The method according to any one of claims 1-5, characterized in that, The step of generating the target audio based on the musical information of the humming audio includes: An accompaniment score is generated based on the musical information of the humming audio, and the accompaniment score includes a multi-track instrument digital interface (MIDI) score corresponding to multiple orchestrations; The target audio is obtained by mixing the accompaniment score.

7. The method according to claim 6, characterized in that, The step of generating the accompaniment score based on the musical information of the humming audio includes: A chord progression is generated based on the pitch information and the tone information; The accompaniment score is generated based on the musical information of the humming audio and the chords.

8. The method according to claim 7, characterized in that the step of generating the accompaniment score based on the musical information of the humming audio and the chord progression includes: acquiring a style input by the user through the human-computer interaction interface; determining the multiple instrumentations according to the style and the chord progression; and generating the accompaniment score based on the multiple instrumentations and the musical information of the humming audio.

9. The method according to any one of claims 6-8, characterized in that the step of performing mixing based on the accompaniment score to obtain the target audio includes: performing, based on the multi-track MIDI score corresponding to the multiple instrumentations, virtual instrument rendering and single-track sound effect processing for each corresponding instrumentation, to obtain multiple single-track audios; and performing multi-track stereo mixing on the multiple single-track audios to obtain the target audio.
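
The render-then-mix flow of claim 9 can be sketched with the standard library alone. The example below is a toy illustration under stated assumptions, not the rendering or mixing chain of the application: a sine oscillator stands in for a virtual instrument, a per-track gain stands in for single-track sound effect processing, and constant-power panning followed by peak normalization stands in for multi-track stereo mixing. The names `render_track`, `pan_gains`, and `mix_to_stereo` are hypothetical.

```python
import math
from typing import List, Tuple

Note = Tuple[int, float, float]  # (MIDI pitch, start in seconds, duration in seconds)


def render_track(notes: List[Note], sample_rate: int = 8000) -> List[float]:
    """Render one MIDI-like track to mono samples with a sine oscillator."""
    total = max((start + dur for _, start, dur in notes), default=0.0)
    out = [0.0] * int(round(total * sample_rate))
    for pitch, start, dur in notes:
        freq = 440.0 * 2.0 ** ((pitch - 69) / 12.0)  # MIDI note 69 is A4 at 440 Hz
        first = int(round(start * sample_rate))
        for n in range(int(round(dur * sample_rate))):
            if first + n < len(out):
                out[first + n] += math.sin(2.0 * math.pi * freq * n / sample_rate)
    return out


def pan_gains(pan: float) -> Tuple[float, float]:
    """Constant-power pan law: pan = -1 is hard left, +1 is hard right."""
    angle = (pan + 1.0) * math.pi / 4.0
    return math.cos(angle), math.sin(angle)


def mix_to_stereo(
    tracks: List[Tuple[List[float], float, float]],
) -> Tuple[List[float], List[float]]:
    """Mix (samples, gain, pan) tracks into left and right channels."""
    length = max((len(samples) for samples, _, _ in tracks), default=0)
    if length == 0:
        return [], []
    left = [0.0] * length
    right = [0.0] * length
    for samples, gain, pan in tracks:
        gain_left, gain_right = pan_gains(pan)
        for i, sample in enumerate(samples):
            left[i] += sample * gain * gain_left
            right[i] += sample * gain * gain_right
    # Scale down only if the summed signal would clip.
    peak = max(max(abs(x) for x in left), max(abs(x) for x in right), 1.0)
    return [x / peak for x in left], [x / peak for x in right]


# Two single-track audios, one per instrumentation, panned apart.
bass = render_track([(48, 0.0, 0.5)])
lead = render_track([(72, 0.0, 0.25), (76, 0.25, 0.25)])
left, right = mix_to_stereo([(bass, 0.8, -0.5), (lead, 0.6, 0.5)])
```

Under the same assumptions, a preprocessed vocal track could be appended to the `tracks` list as one more `(samples, gain, pan)` entry before mixing.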

10. The method according to any one of claims 6-8, characterized in that the target audio also includes the human voice audio in the humming audio; and the method includes: performing human voice preprocessing on the humming audio to obtain human voice single-track audio; performing, based on the multi-track MIDI score corresponding to the multiple instrumentations, virtual instrument rendering and single-track sound effect processing for each corresponding instrumentation, to obtain multiple accompaniment single-track audios; obtaining multiple single-track audios based on the human voice single-track audio and the multiple accompaniment single-track audios; and performing multi-track stereo mixing on the multiple single-track audios to obtain the target audio.

11. An audio generation apparatus, characterized in that it comprises: a human-computer interaction module, used to acquire humming audio, the humming audio being input by a user through a human-computer interaction interface; an MIR module, used to perform music information extraction (MIR) on the humming audio to obtain the musical information of the humming audio, the musical information including beat information, pitch information and tone information; and an audio generation module, used to generate a target audio based on the musical information of the humming audio, wherein the target audio includes at least the accompaniment audio corresponding to the humming audio.

12. The apparatus according to claim 11, characterized in that the MIR module is specifically used to input the humming audio into an artificial intelligence (AI) model to obtain, as output, the musical information of the humming audio.

13. The apparatus according to claim 11 or 12, characterized in that the MIR module is specifically used to: input the humming audio into a first AI model to output the beat information; input the beat information and the humming audio into a second AI model to output the pitch information; and input the pitch information of the humming audio into a third AI model to output the tone information.

14. The apparatus according to claim 11 or 12, characterized in that the MIR module is specifically used to: perform automatic speech recognition (ASR) on the humming audio to obtain the lyrics of the humming audio; match the lyrics of the humming audio with an original musical score to obtain the beat information, wherein the original musical score is the score of the original piece of music corresponding to the humming audio; input the beat information and the humming audio into a second AI model to output the pitch information; and input the pitch information of the humming audio into a third AI model to output the tone information.

15. The apparatus according to any one of claims 11-14, characterized in that the musical information of the humming audio also includes musical segment information; and the MIR module is also used to input the humming audio into a fourth AI model to output the musical segment information.

16. The apparatus according to any one of claims 11-15, characterized in that the audio generation module is specifically used to: generate an accompaniment score based on the musical information of the humming audio, wherein the accompaniment score includes a multi-track Musical Instrument Digital Interface (MIDI) score corresponding to multiple instrumentations; and perform mixing based on the accompaniment score to obtain the target audio.

17. The apparatus according to claim 16, characterized in that the audio generation module is specifically used to: generate a chord progression based on the pitch information and the tone information; and generate the accompaniment score based on the musical information of the humming audio and the chord progression.

18. The apparatus according to claim 17, characterized in that the audio generation module is specifically used to: acquire a style input by the user through the human-computer interaction interface; determine the multiple instrumentations based on the style and the chord progression; and generate the accompaniment score based on the multiple instrumentations and the musical information of the humming audio.

19. The apparatus according to any one of claims 16-18, characterized in that the audio generation module is specifically used to: perform, based on the multi-track MIDI score corresponding to the multiple instrumentations, virtual instrument rendering and single-track sound effect processing for each corresponding instrumentation, to obtain multiple single-track audios; and perform multi-track stereo mixing on the multiple single-track audios to obtain the target audio.

20. The apparatus according to any one of claims 16-18, characterized in that the target audio also includes the human voice audio in the humming audio; and the audio generation module is further used to: perform human voice preprocessing on the humming audio to obtain human voice single-track audio; perform, based on the multi-track MIDI score corresponding to the multiple instrumentations, virtual instrument rendering and single-track sound effect processing for each corresponding instrumentation, to obtain multiple accompaniment single-track audios; obtain multiple single-track audios based on the human voice single-track audio and the multiple accompaniment single-track audios; and perform multi-track stereo mixing on the multiple single-track audios to obtain the target audio.

21. A terminal device, characterized in that it comprises: one or more processors; and a memory, used to store one or more programs; wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method according to any one of claims 1-10.

22. A computer-readable storage medium, characterized in that it includes a computer program that, when executed on a computer, causes the computer to perform the method according to any one of claims 1-10.

23. A computer program product, characterized in that it includes computer program code that, when run on a computer, causes the computer to perform the method according to any one of claims 1-10.