Audio playing method and related apparatus

By introducing microphone middleware into the playback device for data synthesis, the problem of large latency in cross-process transmission of synthesized audio data is solved, resulting in shorter audio data transmission time and a better user experience.

CN117215518B · Active · Publication Date: 2026-06-16 · TENCENT MUSIC ENTERTAINMENT TECH (SHENZHEN) CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: TENCENT MUSIC ENTERTAINMENT TECH (SHENZHEN) CO LTD
Filing Date: 2023-10-09
Publication Date: 2026-06-16

AI Technical Summary

Technical Problem

In smart devices, the cross-process transmission of synthesized audio data results in significant latency, causing users to hear a noticeable echo and leading to a poor user experience.

Method used

Introducing a microphone middleware into the playback device allows for data synthesis through the application layer and system service layer of the microphone middleware, reducing cross-process transmission and sending synthesized audio data directly to the kernel driver layer.

Benefits of technology

It effectively reduces the transmission latency of synthesized audio data, improves the user experience, and reduces echo.



Abstract

Embodiments of the present application disclose an audio playback method and a related apparatus. In these embodiments, a microphone middleware receives dry audio data sent by an acquisition device; the application layer of the microphone middleware receives song accompaniment data sent by a target application in the application layer of a playback device and sends the song accompaniment data to a system service layer; the system service layer synthesizes the dry audio data and the song accompaniment data to obtain synthesized audio data and sends the synthesized audio data to a kernel driver layer; and the kernel driver layer drives a speaker to play the synthesized audio data. The synthesized audio data is therefore transmitted within the microphone middleware without passing through an application process or a system process, so no cross-process transmission occurs, the transmission latency of the synthesized audio data in the playback device is reduced, and the user experience is improved.

Description

Technical Field

[0001] This application relates to the field of audio playback, specifically to an audio playback method and related apparatus.

Background Technology

[0002] After a smartphone connects to a playback device, the smartphone acts as a microphone to collect the user's dry audio data. This collected dry audio data is then sent to the playback device, which combines the dry audio data with the accompaniment data of the song and plays the resulting audio. This playback device can be a smart TV, a car infotainment system, or other smart devices with sound capabilities.

[0003] The playback device receives dry audio data sent by the smartphone through its APP layer. The APP layer synthesizes the dry audio data with the accompaniment data of the song to obtain synthesized audio data, and then sends this synthesized audio data to the OpenSL service of the playback device. The OpenSL service sends the synthesized audio data to the Audio Flinger service in the Audio Framework layer. The Audio Flinger service then sends the synthesized audio data to the Audio driver in the kernel driver layer so that the Audio driver can drive the speaker to play the synthesized audio data.

[0004] In the above process, the OpenSL service and the Audio Framework layer belong to different processes (application processes and system processes), so the synthesized audio data must be transmitted across processes within the playback device. This transmission takes a relatively long time, which results in a large delay between acquisition of the dry audio data and playback on the playback device. Users can clearly hear an echo, resulting in a poor user experience.

Summary of the Invention

[0005] This application provides an audio playback method and related apparatus to reduce the latency of synthesized audio data transmission in playback devices.

[0006] The first aspect of this application provides an audio playback method, which is applied to a playback device. The playback device includes an application layer, a native audio framework, and a kernel driver layer. The playback device is also configured with a microphone middleware, which includes an application layer and a system service layer. The application layer of the microphone middleware is deployed on the application layer of the playback device, and the system service layer of the microphone middleware is deployed on the native audio framework. The playback device is connected to an acquisition device for acquiring human voice.

[0007] The method includes:

[0008] The microphone middleware receives dry audio data sent by the acquisition device;

[0009] The application layer of the microphone middleware receives the song accompaniment data sent by the target application in the application layer of the playback device;

[0010] The application layer of the microphone middleware sends the song accompaniment data to the system service layer;

[0011] The system service layer synthesizes the dry audio data and the song accompaniment data to obtain synthesized audio data;

[0012] The system service layer sends the synthesized audio data to the kernel driver layer, so that the kernel driver layer drives the speaker to play the synthesized audio data.

[0013] A second aspect of this application provides a playback device, which includes an application layer, a native audio framework, and a kernel driver layer; the playback device is also configured with a microphone middleware, which includes an application layer and a system service layer, wherein the application layer of the microphone middleware is deployed on the application layer of the playback device, and the system service layer of the microphone middleware is deployed on the native audio framework; the playback device is connected to an acquisition device for acquiring human voice.

[0014] The microphone middleware is used to receive dry sound data sent by the acquisition device;

[0015] The application layer of the microphone middleware is used to receive song accompaniment data sent by the target application in the application layer of the playback device;

[0016] The application layer of the microphone middleware is also used to send the song accompaniment data to the system service layer;

[0017] The system service layer is used to synthesize the dry audio data and the song accompaniment data to obtain synthesized audio data;

[0018] The system service layer is also used to send the synthesized audio data to the kernel driver layer, so that the kernel driver layer drives the speaker to play the synthesized audio data.

[0019] A third aspect of this application provides a playback device, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the method of the first aspect described above.

[0020] A fourth aspect of this application provides a computer storage medium storing instructions that, when executed on a playback device, cause the playback device to perform the method described in the first aspect.

[0021] As can be seen from the above technical solutions, the embodiments of this application have the following advantages:

[0022] The microphone middleware receives dry audio data from the acquisition device. Its application layer receives accompaniment data from the target application in the playback device's application layer. The application layer then sends the accompaniment data to the system service layer. The system service layer synthesizes the dry audio data and the accompaniment data to obtain synthesized audio data, which is then sent to the kernel driver layer. The kernel driver layer drives the speaker to play the synthesized audio data. Therefore, synthesized audio data can be transmitted within the microphone middleware without going through application or system processes, avoiding cross-process transmission and reducing latency in the playback device, thus improving the user experience.

Attached Figure Description

[0023] Figure 1 is a diagram illustrating the interaction between components within the mobile phone terminal and the TV terminal in the related solution.

[0024] Figure 2 is a schematic diagram of the network framework in an embodiment of this application.

[0025] Figure 3 is a schematic diagram of the playback device in the embodiments of this application.

[0026] Figure 4 is a flowchart illustrating an audio playback method in an embodiment of this application.

[0027] Figure 5 is another flowchart illustrating the audio playback method in this application.

[0028] Figure 6 is a schematic diagram illustrating the interaction between components within the mobile phone terminal and the TV terminal in this embodiment of the application.

[0029] Figure 7 is a schematic diagram illustrating the results of multiple latency tests, measured from the mobile phone terminal collecting dry audio data to the TV terminal playing synthesized audio data, in an embodiment of this application.

[0030] Figure 8 is a schematic diagram of a playback device in an embodiment of this application.

Detailed Implementation

[0031] In the related solution, the APP layer in the playback device synthesizes the dry audio data with the accompaniment data of the song to obtain synthesized audio data, and then sends the synthesized audio data to the OpenSL service of the playback device. The OpenSL service sends the synthesized audio data to the Audio Flinger service in the Native Framework layer. The Audio Flinger service then sends the synthesized audio data to the Audio driver in the kernel driver layer so that the Audio driver can drive the speaker to play the synthesized audio data.

[0032] As shown in Figure 1, the mobile phone and the TV are connected to the same local area network and establish a UDP communication connection. The mobile phone's microphone hardware collects human voice data, which passes through the kernel driver layer and the Native Framework layer to reach the karaoke app process. The karaoke app sends the data over the local area network via UDP to the network receiver of the karaoke app on the TV. After receiving the human voice data, the karaoke app on the TV plays it through its OpenSL service, passing through the Native Framework and the kernel driver layer along the way, and the human voice data and accompaniment data are finally played through the speaker.

[0033] As the diagram shows, the OpenSL service and the Audio Framework layer belong to different processes (application processes and system processes), so the synthesized audio data must be transmitted across processes within the playback device. This transmission takes a long time, resulting in significant latency between dry audio data acquisition and playback on the playback device. The process shown in Figure 1 takes 360-370 ms, and the user can clearly hear the echo, resulting in a poor user experience.

[0034] To address this technical problem, embodiments of this application propose an audio playback method and related apparatus to reduce the latency of synthesized audio data transmission in playback devices.

[0035] Referring to Figure 2, the network framework in this embodiment includes:

[0036] An acquisition device and a playback device communicatively connected to the acquisition device. The acquisition device is used to acquire dry audio data of the user, such as the user's singing voice data.

[0037] The communication connection between the acquisition device and the playback device can be based on any communication protocol, such as a communication connection based on the User Datagram Protocol (UDP).

[0038] The acquisition device can be any device with sound acquisition and communication capabilities, such as a smartphone. The playback device can be any device with data processing, audio playback, and communication capabilities, such as a smart TV, in-vehicle infotainment system, etc.

[0039] In this embodiment of the application, the user can use a smartphone or other acquisition device to capture their singing voice. The smartphone then sends the captured singing voice data to the playback device, which integrates the user's singing voice data and the accompaniment data of the song and outputs it to the speaker for playback.

[0040] Figure 3 illustrates the structure of a playback device. The playback device comprises a multi-layered structure, including an Application layer, a Native Framework, and a Kernel Driver layer. Microphone middleware, used to process the dry audio data from user singing, can be installed within the playback device. The application layer of the microphone middleware can be deployed within the playback device's application layer, while the system service layer of the microphone middleware can be deployed within the native audio framework. The playback device's kernel driver layer can be configured with the Audio driver service of the Hardware Abstraction Layer (HAL) to drive the speakers to play audio data.

[0041] With reference to the network framework shown in Figure 2 and the structure of the playback device shown in Figure 3, the audio playback method in the embodiments of this application is described below:

[0042] Referring to Figure 4, one embodiment of the audio playback method in this application includes:

[0043] 401. The microphone middleware receives dry audio data sent by the acquisition device;

[0044] Users can use an acquisition device (such as a mobile phone) to capture their singing voice. The acquisition device collects the user's dry audio data and sends this dry audio data to the playback device. The playback device receives the dry audio data sent by the acquisition device through the microphone middleware.

[0045] 402. The application layer of the microphone middleware receives the song accompaniment data sent by the target application in the application layer of the playback device;

[0046] The playback device also has a target application installed, which is deployed at the application layer of the playback device. The target application can provide accompaniment data for multiple songs. When the user sings, the target application can send the accompaniment data of the song being sung to the application layer of the microphone middleware. This accompaniment data can be used as accompaniment for the dry audio data received by the microphone middleware.

[0047] 403. The application layer of the microphone middleware sends the song accompaniment data to the system service layer;

[0048] After receiving the accompaniment data for the song, the application layer of the microphone middleware sends the accompaniment data to the system service layer of the microphone middleware so that the system service layer of the microphone middleware can synthesize the dry audio data and the accompaniment data.

[0049] 404. Based on the system service layer, the dry audio data and the song accompaniment data are synthesized to obtain synthesized audio data;

[0050] After receiving the song accompaniment data, the system service layer of the microphone middleware synthesizes the song accompaniment data and the dry audio data sent by the acquisition device to obtain the synthesized audio data.

[0051] 405. Based on the system service layer, the synthesized audio data is sent to the kernel driver layer, so that the kernel driver layer drives the speaker to play the synthesized audio data;

[0052] After the system service layer obtains the synthesized audio data, it sends the synthesized audio data to the kernel driver layer of the playback device, which can then drive the speaker to play the synthesized audio data.

[0053] In this embodiment, the microphone middleware receives dry audio data sent by the acquisition device, and its application layer receives song accompaniment data sent by the target application in the playback device's application layer. The application layer of the microphone middleware then sends the song accompaniment data to the system service layer. The system service layer synthesizes the dry audio data and the song accompaniment data to obtain synthesized audio data, which is then sent to the kernel driver layer. The kernel driver layer drives the speaker to play the synthesized audio data. Therefore, the synthesized audio data can be transmitted within the microphone middleware without going through application or system processes, avoiding cross-process transmission and reducing latency in the transmission of synthesized audio data in the playback device, thus improving the user experience.
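The flow of steps 401 to 405 can be pictured with a small model. The following is only an illustrative sketch: the class names, the method names, and the sample-wise addition used for synthesis are assumptions made for this example and are not taken from the application, which does not specify an implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class KernelDriverLayer:
    """Stand-in for the kernel driver layer; records what it was asked to play."""
    played: List[List[int]] = field(default_factory=list)

    def play(self, frames: List[int]) -> None:
        self.played.append(frames)


@dataclass
class SystemServiceLayer:
    """Hypothetical model of the middleware's system service layer."""
    driver: KernelDriverLayer
    dry: List[int] = field(default_factory=list)

    def on_dry_audio(self, frames: List[int]) -> None:
        # Step 401: dry audio arrives from the acquisition device.
        self.dry = frames

    def on_accompaniment(self, frames: List[int]) -> None:
        # Step 404: synthesize (naive sample-wise sum, an assumption).
        mixed = [d + a for d, a in zip(self.dry, frames)]
        # Step 405: hand the result directly to the kernel driver layer.
        self.driver.play(mixed)


@dataclass
class MiddlewareAppLayer:
    """Hypothetical model of the middleware's application layer."""
    service: SystemServiceLayer

    def on_accompaniment_from_app(self, frames: List[int]) -> None:
        # Steps 402-403: receive accompaniment from the target application
        # and forward it to the system service layer.
        self.service.on_accompaniment(frames)


driver = KernelDriverLayer()
service = SystemServiceLayer(driver)
app_layer = MiddlewareAppLayer(service)

service.on_dry_audio([1, 2, 3])
app_layer.on_accompaniment_from_app([10, 20, 30])
print(driver.played)  # [[11, 22, 33]]
```

In this model the synthesized frames go from the system service layer straight to the driver object, mirroring the point of the embodiment that the synthesized audio data is not routed back through an application process before playback.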

[0054] Based on the embodiment shown in Figure 4 above, embodiments of this application are described in further detail below. Referring to Figure 5, another embodiment of the audio playback method in this application includes:

[0055] 501. The microphone middleware receives dry audio data sent by the acquisition device;

[0056] In this embodiment, the system service layer of the microphone middleware includes a voice receiving module, which can be connected to the acquisition device. Therefore, the dry voice data of the user acquired by the acquisition device can be received by the voice receiving module, and then the voice receiving module sends the dry voice data to the system service layer. The system service layer receives the dry voice data sent by the voice receiving module, and then the system service layer of the microphone middleware processes the dry voice data.

[0057] The acquisition device and the voice receiving module can be connected based on any communication protocol. For example, if the communication connection is based on the UDP protocol, then the voice receiving module is a communication module based on the User Datagram Protocol (UDP), which receives the dry sound data sent by the acquisition device based on the UDP protocol.

[0058] 502. The application layer of the microphone middleware receives the song accompaniment data sent by the target application in the application layer of the playback device;

[0059] The playback device can install a target application that provides the accompaniment data for a song; for example, the target application could be a karaoke app. The target application can send the accompaniment data of the song the user is currently singing to the application layer of the microphone middleware, and the application layer of the microphone middleware then receives the accompaniment data.

[0060] In this embodiment, the application layer of the microphone middleware can be the Activity component of the Android system. The Activity component of the Android system serves as the carrier of interaction, and its functions can be defined and extended. For example, the functions of the Activity component can be declared in the AndroidManifest.xml configuration file, thereby customizing its functions. For example, the information and interaction methods between the Activity component and other carriers can be defined, thereby realizing the functions implemented by the application layer of the microphone middleware in this embodiment.

[0061] The system service layer of the microphone middleware can be a Service component of the Android system. Android System Service components typically run in the background to perform user-specified operations, and their functionality can be declared and customized in the AndroidManifest.xml configuration file. Therefore, by customizing the functionality of the Android System Service component, the operations and functions performed by the system service layer of the microphone middleware in this embodiment can be implemented.

[0062] 503. The application layer of the microphone middleware sends the song accompaniment data to the system service layer;

[0063] After receiving the song accompaniment data, the application layer of the microphone middleware can send the song accompaniment data to the system service layer of the microphone middleware so that the system service layer can process the song accompaniment data.

[0064] 504. Based on the system service layer, the dry audio data and the song accompaniment data are synthesized to obtain synthesized audio data;

[0065] The system service layer of the microphone middleware can synthesize dry vocal data and song accompaniment data to obtain synthesized audio data. This synthesized audio data combines the user's dry vocals with the song's accompaniment, which can enhance the listening experience of the user's singing.

[0066] 505. Based on the system service layer, the synthesized audio data is sent to the kernel driver layer, so that the kernel driver layer drives the speaker to play the synthesized audio data;

[0067] After synthesizing the dry audio data and the song accompaniment data to obtain the synthesized audio data, the system service layer sends the synthesized audio data to the kernel driver layer of the playback device, so that the kernel driver layer can drive the speaker to play the synthesized audio data.
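The application does not state how the system service layer synthesizes the two streams in step 504. One common approach, shown here purely as an assumption, is to add corresponding samples of two 16-bit little-endian PCM buffers and clip the result to the valid range. The function name `mix_pcm16` and the gain parameters are hypothetical.

```python
import struct

INT16_MIN, INT16_MAX = -32768, 32767


def mix_pcm16(dry: bytes, accompaniment: bytes,
              dry_gain: float = 1.0, acc_gain: float = 1.0) -> bytes:
    """Mix two mono 16-bit little-endian PCM buffers sample by sample.

    Assumed format: both buffers share the same sample rate and channel
    layout. The shorter buffer determines the output length.
    """
    n = min(len(dry), len(accompaniment)) // 2  # number of whole samples
    fmt = f"<{n}h"
    dry_samples = struct.unpack(fmt, dry[: n * 2])
    acc_samples = struct.unpack(fmt, accompaniment[: n * 2])
    mixed = [
        # Clip to the int16 range so loud passages saturate instead of wrapping.
        max(INT16_MIN, min(INT16_MAX, int(d * dry_gain + a * acc_gain)))
        for d, a in zip(dry_samples, acc_samples)
    ]
    return struct.pack(fmt, *mixed)


dry = struct.pack("<3h", 1000, 30000, -30000)
acc = struct.pack("<3h", 2000, 10000, -10000)
print(struct.unpack("<3h", mix_pcm16(dry, acc)))  # (3000, 32767, -32768)
```

Clipping is the simplest way to keep the sum in range; a real mixer might instead scale the gains or apply a limiter, which this sketch does not attempt.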

[0068] As shown in Figure 6, in this embodiment the acquisition device can be a smartphone and the playback device can be a smart TV. The smartphone and the smart TV are connected to the same local area network and establish a UDP communication connection. The smartphone's microphone hardware acquires the dry audio data of the user singing, which passes through the kernel driver layer and the Native Framework layer to reach the karaoke app process. The karaoke app sends the data over the local area network via UDP to the human voice receiving module of the microphone middleware on the TV. After the TV receives the dry audio data sent by the mobile phone, the system service layer of the microphone middleware combines the song accompaniment data sent by the application layer of the microphone middleware with the dry audio data to obtain synthesized audio data, and then sends the synthesized audio data to the kernel driver layer, which drives the speaker to play the synthesized audio data.
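As a rough illustration of a UDP-based voice receiving module, the sketch below sends one frame of dry audio over the loopback interface and receives it on a bound datagram socket. The function names, the loopback address, the two-second timeout, and the 4096-byte receive buffer are all assumptions chosen for the example; the application only states that the connection may be based on UDP.

```python
import socket


def open_voice_receiver(host: str = "127.0.0.1", port: int = 0) -> socket.socket:
    """Hypothetical voice receiving module: a bound UDP socket.

    Port 0 asks the operating system to pick a free port.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    sock.settimeout(2.0)  # avoid blocking forever if nothing arrives
    return sock


def send_dry_audio(frame: bytes, address) -> None:
    """Hypothetical acquisition-device side: send one frame as a datagram."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
        sender.sendto(frame, address)


receiver = open_voice_receiver()
address = receiver.getsockname()  # (host, port) actually bound

send_dry_audio(b"\x01\x00\x02\x00", address)  # two 16-bit samples
frame, _sender = receiver.recvfrom(4096)
receiver.close()

print(frame)
```

UDP delivers datagrams without retransmission or ordering guarantees, which keeps per-packet delay low but means a real receiver would have to tolerate lost or reordered frames; this sketch does not handle either case.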

[0069] Figure 7 shows the results of multiple latency tests covering the process from the mobile phone collecting dry audio data to the TV playing the synthesized audio data. With the related solution, before the optimization of this embodiment, the average latency was as high as 369 ms; after the optimization, the average latency dropped to 138 ms, a significant reduction.

[0070] As can be seen, since the synthesized audio data can be transmitted in the microphone middleware without going through application or system processes, there is no cross-process transmission, which greatly reduces latency. The latency of transmitting dry audio data from the mobile phone to the TV is 10-20ms, while the latency of processing the dry audio data and playing the synthesized audio data within the TV is 50-60ms. The latency of the entire process from mobile phone acquisition of dry audio to speaker playback of synthesized audio data is only 130-140ms, which is a significant reduction compared to related solutions. Users will hardly feel the delay between their singing and the speaker playing the accompaniment, thus improving the user experience of karaoke software applications.
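The size of the improvement follows directly from the two averages reported for Figure 7 (369 ms before optimization and 138 ms after). The short calculation below only restates those figures; the variable names are illustrative.

```python
before_ms = 369  # average latency with the related solution
after_ms = 138   # average latency with the microphone middleware

saved_ms = before_ms - after_ms
reduction = saved_ms / before_ms

print(f"{saved_ms} ms saved, {reduction:.1%} lower")  # 231 ms saved, 62.6% lower
```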

[0071] In this embodiment, after the system service layer synthesizes the dry vocal data and the song accompaniment data to obtain the synthesized audio data, the system service layer can also send the synthesized audio data to the application layer of the microphone middleware. The application layer of the microphone middleware forwards the synthesized audio data to the target application, so that the target application can process the synthesized audio data, such as scoring the synthesized audio data, evaluating whether each note of the user's dry vocals matches the note of the accompaniment, or evaluating whether the pitch of the user's dry vocals matches the pitch of the accompaniment; or generating a karaoke song based on the synthesized audio data, or performing other processing on the synthesized audio data.
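The application does not describe how the target application scores a performance. As one hedged illustration of checking whether the sung pitch matches the reference pitch, the sketch below assumes that pitch has already been extracted into per-frame MIDI note numbers and counts the frames that fall within a tolerance. The function name, the half-semitone tolerance, and the 0 to 100 scale are assumptions.

```python
from typing import Sequence


def pitch_match_score(sung: Sequence[float], reference: Sequence[float],
                      tolerance: float = 0.5) -> float:
    """Return the percentage of frames whose sung pitch is within
    `tolerance` semitones of the reference pitch.

    Both inputs are assumed to be per-frame MIDI note numbers that are
    already time-aligned; pitch extraction itself is out of scope here.
    """
    pairs = list(zip(sung, reference))
    if not pairs:
        return 0.0
    hits = sum(1 for s, r in pairs if abs(s - r) <= tolerance)
    return 100.0 * hits / len(pairs)


score = pitch_match_score([60.0, 62.2, 64.9, 65.0], [60, 62, 64, 65])
print(score)  # 75.0
```

Here three of the four frames are within half a semitone of the reference, so the score is 75.0; the third frame is 0.9 semitones sharp and counts as a miss.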

[0072] The playback device in the embodiments of this application is described below. Referring to Figure 8, one embodiment of the playback device in this application includes:

[0073] The playback device 800 may include one or more central processing units (CPUs) 801 and a memory 805, in which one or more applications or data are stored.

[0074] The memory 805 can be volatile or persistent storage. The program stored in the memory 805 can include one or more modules, each module including a series of instruction operations on the playback device. Furthermore, the central processing unit 801 can be configured to communicate with the memory 805 and execute the series of instruction operations in the memory 805 on the playback device 800.

[0075] The playback device 800 may also include one or more power supplies 802, one or more wired or wireless network interfaces 803, one or more input/output interfaces 804, and/or one or more operating systems, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.

[0076] The central processing unit 801 can perform the operations performed by the playback device in the embodiments shown in Figures 4 to 5, which will not be described in detail here.

[0077] This application also provides a computer storage medium, one embodiment of which includes: the computer storage medium storing instructions which, when executed on a computer, cause the computer to perform the operations performed by the playback device in the embodiments shown in Figures 4 to 5.

[0078] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working processes of the systems, devices, and units described above can be referred to the corresponding processes in the foregoing method embodiments, and will not be repeated here.

[0079] In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection between apparatuses or units through some interfaces, and may be electrical, mechanical, or other forms.

[0080] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0081] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.

[0082] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

Claims

1. An audio playback method, characterized in that, The method is applied to a playback device, which includes an application layer, a native audio framework, and a kernel driver layer. The playback device is also configured with a microphone middleware, which includes an application layer and a system service layer. The application layer of the microphone middleware is deployed on the application layer of the playback device, and the system service layer of the microphone middleware is deployed on the native audio framework. The playback device is connected to an acquisition device used to collect human voices; The method includes: The microphone middleware receives dry audio data sent by the acquisition device; The application layer of the microphone middleware receives the song accompaniment data sent by the target application in the application layer of the playback device; The application layer of the microphone middleware sends the song accompaniment data to the system service layer; Based on the system service layer, the dry audio data and the song accompaniment data are synthesized to obtain synthesized audio data; The system service layer sends the synthesized audio data to the kernel driver layer, so that the kernel driver layer drives the speaker to play the synthesized audio data.

2. The method according to claim 1, characterized in that, The system service layer includes a human voice receiver module; The step of receiving dry audio data sent by the acquisition device based on the microphone middleware includes: The human voice receiving module receives the dry sound data sent by the acquisition device; The synthesis of the dry vocal data and the song accompaniment data based on the system service layer includes: The system service layer receives the dry sound data sent by the human voice receiving module. The system service layer synthesizes the dry audio data and the song accompaniment data to obtain the synthesized audio data.

3. The method according to claim 2, characterized in that, The human voice receiving module is a communication module based on the User Datagram Protocol.

4. The method according to claim 1, characterized in that, The application layer of the microphone middleware is the Activity component of the Android system, and the system service layer is the Service component of the Android system.

5. The method according to claim 1, characterized in that, The method further includes: The system service layer sends the synthesized audio data to the application layer of the microphone middleware; The application layer of the microphone middleware sends the synthesized audio data to the target application, so that the target application can process the synthesized audio data.

6. A playback device, characterized in that, The playback device includes an application layer, a native audio framework, and a kernel driver layer; the playback device is also configured with a microphone middleware, which includes an application layer and a system service layer. The application layer of the microphone middleware is deployed on the application layer of the playback device, and the system service layer of the microphone middleware is deployed on the native audio framework; the playback device is connected to an acquisition device for acquiring human voice. The microphone middleware is used to receive dry audio data sent by the acquisition device; The application layer of the microphone middleware is used to receive song accompaniment data sent by the target application in the application layer of the playback device; The application layer of the microphone middleware is also used to send the song accompaniment data to the system service layer; The system service layer is used to synthesize the dry audio data and the song accompaniment data to obtain synthesized audio data; The system service layer is also used to send the synthesized audio data to the kernel driver layer, so that the kernel driver layer drives the speaker to play the synthesized audio data.

7. The playback device according to claim 6, characterized in that, The system service layer includes a human voice receiver module; The human voice receiving module is used to receive the dry sound data sent by the acquisition device; The system service layer is specifically used to receive the dry audio data sent by the human voice receiving module, and to synthesize the dry audio data and the song accompaniment data to obtain the synthesized audio data.

8. The playback device according to claim 7, characterized in that, The human voice receiving module is a communication module based on the User Datagram Protocol.

9. A playback device, comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, When the processor executes the computer program, it implements the method as described in any one of claims 1 to 5.

10. A computer storage medium, characterized in that, The computer storage medium stores instructions that, when executed on a playback device, cause the playback device to perform the method as described in any one of claims 1 to 5.