Audio control method and apparatus, and medium and device

By detecting the sound-emitting movements of target objects in the TV screen, determining the location information of the sound-emitting part, and generating audio output parameters, the problem of mismatch between the sound source location and the screen movement is solved, thus improving the user's audiovisual experience.

WO2026124088A1 · PCT designated stage · Publication Date: 2026-06-18 · SHENZHEN TCL NEW-TECH CO LTD


Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: SHENZHEN TCL NEW-TECH CO LTD
Filing Date: 2025-11-10
Publication Date: 2026-06-18

Application Information

Patent Timeline

10 Nov 2025: Application
18 Jun 2026: Publication

WO2026124088A1

IPC: H04N21/485; H04N21/439; G10L21/003

AI Tagging

Application Domain

Speech analysis; Selective content distribution

Technology Topics

Sound source location; Sound sources


Similar Technology Patents

Method, device and storage medium for audio processing based on multi-loudspeaker scene
CN122248317A · Frequency/directions obtaining arrangements; Sound source location; Sound sources
Smart microphone stand orienting toward sound using acoustic direction finding
WO2026127939A2 · Microcontroller; Sound sources
A volume feedback sound control method and system
CN122248321A · Transducer circuits; Sound sources; Engineering
A target depth discrimination method based on negative eikonal waveguide horizontal array wave number difference domain feature extraction
CN122241203A · Acoustic wave reradiation; Distribution matrix; Sound sources


AI Technical Summary

Technical Problem

The increase in TV screen size has caused a mismatch between the location of the sound source and the on-screen action, affecting the user's audiovisual experience.

Method used

By detecting the sound-emitting action of the target object in the playback screen, the target position information of the sound-emitting part is determined, audio output parameters are generated, and the audio output device is controlled to output the audio at the corresponding position, so as to achieve synchronization between the sound source position and the screen action.

Benefits of technology

It enhances the user's audiovisual experience by synchronizing the sound source location with the visual movement, thus improving the matching of audio control.

✦ Generated by Eureka AI based on patent content.


Patent Text Reader

Abstract

Provided in the embodiments of the present application are an audio control method and apparatus, and a medium and a device. The method comprises: when it is detected that there is a target subject in a currently playing frame and the target subject executes a sound production action, determining target position information of a sound production part corresponding to the sound production action; on the basis of the target position information, generating a target audio output parameter required by an audio output device to output audio; and on the basis of the target audio output parameter, controlling the audio output device to output audio corresponding to the target position information. By using the audio control method provided in the embodiments of the present application, when it is detected that there is a target subject in the currently playing frame and the target subject executes the sound production action, the target audio output parameter required by the audio output device is generated on the basis of the target position information of the sound production part corresponding to the sound production action, and then the audio output device is controlled to output the audio corresponding to the target position information, thereby synchronizing a sound source position with an action in a frame, and improving the audio-visual experience of a user.


Description

Audio control methods, devices, media and equipment

[0001] This application claims priority to Chinese Patent Application No. 202411844297.X, filed on December 13, 2024, entitled "Audio Control Method, Apparatus, Medium and Device", the entire contents of which are incorporated herein by reference.

Technical Field

[0002] This application relates to the field of smart device technology, and more particularly to the field of audio control technology, and especially to an audio control method, device, medium and equipment.

Background Technology

[0003] As television screens grow larger, the mismatch between sound and picture becomes more pronounced. This is especially noticeable in dynamic scenes with a lot of character movement: when a character speaks while moving from the left side of the screen to the right, the sound emitted by the television remains locked at the center of the screen. This mismatch between the sound source location and the on-screen movement degrades the user's audiovisual experience.

Technical solutions

[0004] This application provides an audio control method, apparatus, medium, and device. The audio control method provided in this application addresses the problem in current audio control technologies where the location of the sound source does not match the on-screen action, thus affecting the user's audiovisual experience.

[0005] One embodiment of this application provides an audio control method applied to an audio output device, including:

[0006] When a target object is detected in the current playback screen and the target object performs a sound-emitting action, the target position information of the sound-emitting part corresponding to the sound-emitting action is determined;

[0007] Based on the target location information, generate the target audio output parameters required for the audio output device to output audio;

[0008] Based on the target audio output parameters, the audio output device is controlled to output audio corresponding to the target location information.

[0009] Furthermore, in the audio control method described in this application embodiment, when a target object is detected in the current playback screen and the target object performs a vocalization action, determining the target position information of the vocalization part corresponding to the vocalization action includes:

[0010] Obtain a screenshot of the currently playing screen, input the screenshot into a pre-trained target detection model to perform target detection, and output the first detection result;

[0011] If the first detection result indicates that a target object exists in the screenshot, then further detect whether the target object performs a sound-making action and output a second detection result;

[0012] If the second detection result indicates that the target object is performing a vocalization action, then the target location information of the vocalization part corresponding to the vocalization action is determined.

[0013] Furthermore, in the audio control method described in the embodiments of this application, the further detection of whether the target object performs a sound-emitting action includes:

[0014] Using the moment when the target object is detected as a reference moment, at least two screenshots containing the target object are obtained from the currently playing screen after the reference moment;

[0015] Extract the position coordinates of each sound-emitting part from each of the at least two screenshots;

[0016] The system detects whether the relative position coordinates of each sound-emitting part in the at least two screenshots have changed, thereby determining whether the target object has performed a sound-emitting action.

[0017] Furthermore, in the audio control method described in the embodiments of this application, after outputting the first detection result, the method further includes:

[0018] If the first detection result indicates that there is no target object in the screenshot, then the target audio output parameters required for the audio output device to output audio are generated based on the center coordinates of the currently playing screen.

[0019] Based on the target audio output parameters, the audio output device is controlled to output audio at a position corresponding to the center position coordinates.

[0020] Furthermore, in the audio control method described in the embodiments of this application, determining the target position information of the sound-producing part corresponding to the sound-producing action includes:

[0021] Based on the coordinates of all key points of the target object, an array structure corresponding to the target object is constructed, and target position information corresponding to the vocalization part is selected from the array structure;

[0022] The step of generating the target audio output parameters required for the audio output device to output audio based on the target location information includes:

[0023] The target location information is numerically converted to obtain a target value within a preset range.

[0024] Based on the pre-created mapping relationship between the numerical value and the volume ratio of each channel in the audio output device, the target channel volume ratio corresponding to the target numerical value is determined;

[0025] Adjust the target audio output parameters corresponding to each channel of the audio output device according to the target channel volume ratio.

[0026] Furthermore, in the audio control method described in this application embodiment, before determining the target position information of the sound-producing part corresponding to the sound-producing action when a target object is detected in the current playback screen and the target object performs a sound-producing action, the method further includes:

[0027] In response to the activation of a preset function, the system detects whether a target object exists in the current playback screen and whether the target object performs a sound-emitting action.

[0028] When a target object is detected for the first time in the current playback screen and the target object performs a sound-emitting action, the target position information of the sound-emitting part corresponding to the sound-emitting action is determined;

[0029] A preset identifier is invoked and displayed at the location corresponding to the target location information, and the preset identifier is hidden after a preset duration.

[0030] Furthermore, in the audio control method described in the embodiments of this application, after controlling the audio output device to output audio corresponding to the target location information, the method further includes:

[0031] Detect the distance between the current user and the audio output device;

[0032] If the distance value is greater than a first preset threshold, then the output volume of the audio output device is increased;

[0033] If the distance value is less than the second preset threshold, the output volume of the audio output device is reduced, where the first preset threshold is greater than the second preset threshold.
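As a minimal illustration of the two-threshold rule in [0031] to [0033], the sketch below adjusts a volume level from a measured viewer distance. The function name, the threshold values, the step size, and the 0 to 100 volume scale are all assumptions chosen for illustration; the application does not specify them.

```python
def adjust_volume(volume: int, distance_m: float,
                  first_threshold_m: float = 4.0,   # hypothetical "far" threshold
                  second_threshold_m: float = 1.5,  # hypothetical "near" threshold
                  step: int = 2,                    # hypothetical adjustment step
                  min_volume: int = 0, max_volume: int = 100) -> int:
    """Return the new volume for a viewer at distance_m from the device."""
    # The description requires the first threshold to exceed the second.
    if first_threshold_m <= second_threshold_m:
        raise ValueError("first threshold must be greater than second threshold")
    if distance_m > first_threshold_m:
        volume += step      # viewer is far away: increase the output volume
    elif distance_m < second_threshold_m:
        volume -= step      # viewer is close: reduce the output volume
    # Between the two thresholds the volume is left unchanged.
    return max(min_volume, min(max_volume, volume))
```

Because the first threshold is greater than the second, distances between the two form a band in which the volume is not changed, which would avoid repeated adjustments when the viewer sits near a single cut-off.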

[0034] Accordingly, another aspect of this application embodiment also provides an audio control device, applied to an audio output device, including:

[0035] The position determination module is used to determine the target position information of the sound-emitting part corresponding to the sound-emitting action when a target object is detected in the current playback screen and the target object performs a sound-emitting action.

[0036] The parameter generation module is used to generate target audio output parameters required for the audio output device to output audio based on the target location information.

[0037] An audio control module is used to control the audio output device to output audio corresponding to the target location information based on the target audio output parameters.

[0038] Accordingly, another aspect of this application embodiment also provides a computer-readable storage medium storing a plurality of instructions adapted for loading by a processor to execute the audio control method described above.

[0039] Accordingly, another aspect of this application provides an electronic device, including a processor and a memory, wherein the memory stores a plurality of instructions, and the processor loads the instructions to execute the audio control method described above.

[0040] This application provides an audio control method, apparatus, medium, and device. The method determines the target position information of the sound-emitting part corresponding to the sound-emitting action when a target object is detected in the currently playing screen and the target object performs a sound-emitting action; generates target audio output parameters required for the audio output device to output audio based on the target position information; and controls the audio output device to output audio at the position corresponding to the target position information based on the target audio output parameters. By using the audio control method provided in this application, when a target object is detected in the currently playing screen and the target object performs a sound-emitting action, the target audio output parameters required for the audio output device are generated based on the target position information of the sound-emitting part corresponding to the sound-emitting action, thereby controlling the audio output device to output audio corresponding to the target position information, achieving synchronization between the sound source position and the screen action, and improving the user's audiovisual experience.

Attached Figure Description

[0041] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0042] Figure 1 is a flowchart illustrating the audio control method provided in an embodiment of this application.

[0043] Figure 2 is a schematic diagram of the specific process of S102 in the audio control method provided in the embodiment of this application.

[0044] Figure 3 is a schematic diagram of the specific process of S202 in the audio control method provided in the embodiment of this application.

[0045] Figure 4 is a schematic diagram of the processing flow corresponding to the first detection result indicating that there is no target object in the screenshot of the audio control method provided in the embodiment of this application.

[0046] Figure 5 is a schematic diagram of the specific process of S103 in the audio control method provided in the embodiment of this application.

[0047] Figure 6 is a schematic diagram of the processing flow corresponding to the first activation of the preset function in the audio control method provided in the embodiment of this application.

[0048] Figure 7 is a schematic diagram of the overall flow of the audio control method provided in the embodiments of this application.

[0049] Figure 8 is a schematic flowchart of the audio control method provided in the embodiments of this application.

[0050] Figure 9 is a schematic diagram of the structure of the audio control device provided in the embodiment of this application.

[0051] Figure 10 is another structural schematic diagram of the audio control device provided in the embodiment of this application.

[0052] Figure 11 is a schematic diagram of the structure of the electronic device provided in the embodiment of this application.

[0053] Implementation methods of this application

[0054] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this application, and not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of this application without creative effort are within the protection scope of this application.

[0055] This application provides an audio control method. By detecting the presence of a target object in the currently playing screen and the target object performing a sound-emitting action, the method generates target audio output parameters required by the audio output device based on the target position information of the sound-emitting part corresponding to the sound-emitting action. This controls the audio output device to output audio corresponding to the target position information, thereby synchronizing the sound source position with the screen action and improving the user's audiovisual experience.

[0056] The term "and / or" appearing in this application can describe the relationship between related objects, indicating that there can be three relationships. For example, A and / or B can represent three cases: A alone, A and B simultaneously, and B alone. Additionally, the character " / " in this application generally indicates that the preceding and following related objects have an "or" relationship.

[0057] The terms "first," "second," etc., used in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments described herein can be implemented in a sequence other than that illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or television that includes a series of steps or modules is not necessarily limited to those explicitly listed steps or modules, but may include other steps or modules not explicitly listed or inherent to these processes, methods, products, or televisions. The naming or numbering of steps appearing in this application does not imply that the steps in the method flow must be performed in the chronological / logical order indicated by the naming or numbering. The execution order of named or numbered process steps can be changed according to the desired technical purpose, as long as the same or similar technical effect is achieved. The module division described in this application is a logical division. In practical applications, there may be other division methods. For example, multiple modules may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the coupling or direct coupling or communication connection between modules shown or discussed may be through some interfaces, and the indirect coupling or communication connection between modules may be electrical or other similar forms, none of which are limited in this application. 
Furthermore, the modules or sub-modules described as separate components may or may not be physically separated, may or may not be physical modules, or may be distributed in multiple circuit modules. Some or all of the modules may be selected to achieve the purpose of the solution in this application according to actual needs.

[0058] The audio control method described in this application is mainly applied to electronic devices, such as televisions, smartphones, tablets, laptops, desktop computers, and servers, etc., without limitation. Optionally, the server can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or an IoT cloud (Internet of Things cloud) that provides the ability to store, process, and manage data generated by IoT televisions, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms, without limitation.

[0059] For ease of understanding, the specific process in the embodiments of this application is described below. Please refer to Figure 1, which is a schematic flowchart of an embodiment of the audio control method provided in this application.

[0060] In the embodiment shown in Figure 1, the method specifically includes the following steps:

[0061] S101, when a target object is detected in the current playback screen and the target object performs a sound-emitting action, the target position information of the sound-emitting part corresponding to the sound-emitting action is determined.

[0062] It should be clarified that the target object specifically refers to an object capable of making vocalizations, such as a person or animal, or even any item imbued with the ability to make sounds (including virtual characters), without limitation. Vocalization specifically refers to the act of speaking or singing, or even any act that represents making a sound, without limitation. The vocal part specifically refers to the part of the body that performs the vocalization, such as the mouth, or other artificially defined vocal parts, without limitation.

[0063] For example, when character A in the current playback screen is detected speaking (specifically, his lips are opening and closing), a target object exists in the current playback screen and is performing a vocalization action. The target position information (x, y) of the vocalization part is then determined and serves as the key parameter for controlling the audio output device (such as a TV) to output the audio of that vocalization action.

[0064] S102, Based on the target location information, generate the target audio output parameters required for the audio output device to output audio.

[0065] It should be explained that the target audio output parameters refer to the key parameters required by the audio output device when outputting audio, such as volume and frequency response (frequency response refers to the frequency range and gain that the audio device can output). In this embodiment, the target audio output parameters are generated based on the target location information to highlight the sound effects at the location of the target location information, so as to keep the sound source position of the target object in the current playback screen synchronized with the screen action, thereby improving the user's audiovisual experience. This will be described in detail below.

[0066] S103, based on the target audio output parameters, control the audio output device to output audio corresponding to the target location information.

[0067] In this embodiment, based on the target audio output parameters corresponding to the target location information, the audio output device is controlled to output the audio corresponding to the target location information, so as to realize the synchronization of the sound source location and the screen action and improve the user's audiovisual experience.
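The three steps S101 to S103 can be read as a small pipeline. The sketch below is one hypothetical way to wire them together; `run_audio_control` and its three callable parameters are illustrative names, not part of the application, and the actual detection, parameter generation, and output logic are left to the caller.

```python
from typing import Callable, Optional, Tuple, TypeVar

Position = Tuple[float, float]   # (x, y) of the sound-emitting part
Params = TypeVar("Params")       # whatever the output device consumes


def run_audio_control(frame: object,
                      locate: Callable[[object], Optional[Position]],
                      generate_params: Callable[[Position], Params],
                      output: Callable[[Params], None]) -> Optional[Params]:
    """Run S101 -> S102 -> S103 once for a single frame."""
    # S101: position of the sound-emitting part, or None when no target
    # object is performing a sound-emitting action in this frame.
    position = locate(frame)
    if position is None:
        return None
    # S102: derive the audio output parameters from that position.
    params = generate_params(position)
    # S103: hand the parameters to the audio output device.
    output(params)
    return params
```

Passing the three stages in as callables keeps the sketch independent of any particular detection model or audio driver.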

[0068] In some embodiments, as shown in Figure 2, when a target object is detected in the current playback screen and the target object performs a vocalization action, determining the target position information of the vocalization part corresponding to the vocalization action includes:

[0069] S201, Obtain a screenshot of the current playback screen, input the screenshot into a pre-trained target detection model for target detection, and output the first detection result;

[0070] In this embodiment, a screenshot of the currently playing screen is acquired, and a pre-trained object detection model is used to perform object detection on the screenshot. The presence of a target object in the currently playing screen is determined based on the first detection result. To ensure the real-time performance and accuracy of the object detection operation, a screenshot of the currently playing screen can be acquired every 60ms.

[0071] It should be noted that the object detection model can be trained based on conventional neural networks, such as R-CNN (Regions with CNN features) or YOLO (You Only Look Once). Since this solution does not make substantial improvements to existing models, the specific training process will not be described in detail here.

[0072] S202, if the first detection result indicates that there is a target object in the screenshot, then further detect whether the target object performs a sound-making action, and output the second detection result;

[0073] In this embodiment, detection of the target location information is triggered only when two conditions hold at once: a target object exists in the current playback screen, and that target object performs a sound action. Therefore, after the first detection result confirms that a target object exists in the screenshot, it is necessary to further detect whether the target object performs a sound action and output a second detection result. The second detection result then determines whether both conditions are met simultaneously in the current playback screen.

[0074] In some embodiments, it is also possible to detect whether the target object performs a vocalization action within a preset time period. The main purpose of detecting whether the target object performs a vocalization action within a preset time period is to determine the vocalization action, because vocalization is a dynamic effect from a visual perspective. For example, the opening and closing of the mouth is usually understood as a speaking action. Therefore, the limiting condition of "preset time period" is added.

[0075] S203, if the second detection result indicates that the target object is performing a vocalization action, then the target position information of the vocalization part corresponding to the vocalization action is determined.

[0076] In this embodiment, since the second detection result is based on the first detection result, when the second detection result indicates that the target object is performing a vocalization action, an instruction to obtain the target position information of the vocalization part corresponding to the vocalization action can be triggered.
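The gating between the first and second detection results in S201 to S203 might be sketched as follows. The `Detection` record and the three callables are hypothetical stand-ins: `find_target` represents the pre-trained target detection model, `check_vocalizing` the sound-action check, and `vocal_part_position` the key-point lookup; none of these names come from the application.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

Position = Tuple[float, float]


@dataclass
class Detection:
    has_target: bool               # first detection result (S201)
    is_vocalizing: bool            # second detection result (S202)
    position: Optional[Position]   # target position information (S203)


def detect(screenshot: object,
           find_target: Callable[[object], Optional[object]],
           check_vocalizing: Callable[[object], bool],
           vocal_part_position: Callable[[object], Position]) -> Detection:
    """Apply the two-stage check to one screenshot."""
    target = find_target(screenshot)          # S201: is there a target object?
    if target is None:
        return Detection(False, False, None)
    if not check_vocalizing(target):          # S202: does it perform a sound action?
        return Detection(True, False, None)
    # S203: both conditions hold, so the position lookup is triggered.
    return Detection(True, True, vocal_part_position(target))
```

Returning both detection results, rather than only the position, lets a caller distinguish "no target object" from "target object present but silent".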

[0077] In some embodiments, as shown in Figure 3, further detecting whether the target object performs a vocalization action includes:

[0078] S2021, using the time when the target object was detected as the reference time, obtain from the current playback screen at least two screenshots that are captured after the reference time and contain the target object;

[0079] S2022, extract the position coordinates of each sound-emitting part from the at least two screenshots;

[0080] S2023, detect whether the relative position coordinates of each sound-emitting part in the at least two screenshots have changed, and then determine whether the target object has performed a sound-emitting action.

[0081] In this embodiment, by determining whether there are differences between at least two screenshots captured after the target object is detected, specifically by comparing whether the position coordinates of each sound-emitting part in the at least two screenshots have changed, it is determined whether the target object performs a sound-emitting action within a preset time period.

[0082] For example, $y_{\text{top}}$ and $y_{\text{bottom}}$ represent the y coordinates of the top and bottom edges of the mouth, respectively, $d_{\text{open}}$ represents the mouth opening degree within a preset time period $t$, and $d_{\text{open}-1}$ represents the mouth opening degree of the previous frame.

[0083] The formula for lip opening is: $d_{\text{open}} = |y_{\text{bottom}} - y_{\text{top}}|$

[0084] Calculate the inter-frame difference: $\Delta d = |d_{\text{open}} - d_{\text{open}-1}|$

[0085] When Δd exceeds the threshold T, it is considered that the mouth has performed a vocalization. In some embodiments, vocalizations can also be automatically identified from the currently playing video using a convolutional neural network (CNN) or other deep learning models. Furthermore, speech recognition technology can be combined to analyze the sound waveforms in the currently playing video, synchronized with video frames, and thus detect the occurrence of an action.
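The lip-opening test in [0082] to [0085] can be illustrated with a short sketch. It assumes that the top and bottom mouth-edge y coordinates have already been extracted for each screenshot; the function names and the example threshold are hypothetical.

```python
from typing import Sequence, Tuple


def mouth_opening(y_top: float, y_bottom: float) -> float:
    """d_open = |y_bottom - y_top| for a single frame."""
    return abs(y_bottom - y_top)


def is_vocalizing(mouth_edges: Sequence[Tuple[float, float]],
                  threshold: float) -> bool:
    """Return True if the opening changes by more than the threshold T
    between any two consecutive frames.

    mouth_edges holds one (y_top, y_bottom) pair per screenshot, in
    capture order.
    """
    openings = [mouth_opening(top, bottom) for top, bottom in mouth_edges]
    # Inter-frame difference: delta_d = |d_open - d_open_prev|
    return any(abs(cur - prev) > threshold
               for prev, cur in zip(openings, openings[1:]))
```

For instance, `is_vocalizing([(100, 104), (100, 112)], threshold=5.0)` compares openings of 4 and 12, so the difference of 8 exceeds the threshold. With fewer than two frames there is nothing to compare and the function returns `False`.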

[0086] In some embodiments, as shown in Figure 4, after S201, the method further includes the following steps:

[0087] S2011, if the first detection result indicates that there is no target object in the screenshot, then the target audio output parameters required for the audio output device to output audio are generated according to the center position coordinates of the currently playing screen;

[0088] In this embodiment, if no target object is detected in the current playback screen, the target audio output parameters can be generated from default values. Here, the center position coordinates of the current playback screen are used as the default, and the target audio output parameters required for the audio output device to output audio are generated based on them.

[0089] S2012, Based on the target audio output parameters, control the audio output device to output audio at a position corresponding to the center position coordinates.

[0090] In this embodiment, based on the target audio output parameters, the audio output device is controlled to output audio at the center of the current playback screen, ensuring that the audio is played in the center unless there are special circumstances.

[0091] In some embodiments, S102 specifically includes:

[0092] Based on the coordinates of all key points of the target object, an array structure corresponding to the target object is constructed, and target position information corresponding to the vocalization part is selected from the array structure;

[0093] In some embodiments, as shown in Figure 5, S103 specifically includes:

[0094] S1031, The target location information is subjected to numerical conversion processing to obtain a target value within a preset range;

[0095] S1032, Based on the pre-created mapping relationship between the numerical value and the volume ratio of each channel in the audio output device, determine the target channel volume ratio corresponding to the target numerical value;

[0096] S1033, adjust the target audio output parameters corresponding to each channel of the audio output device according to the target channel volume ratio.

[0097] In this embodiment, the acquired facial key points (e.g., the vertices corresponding to the eyes, nose, and mouth) can be constructed into an array structure containing the coordinate information of the key points. First, the coordinates of each key point are normalized to fit a specified coordinate range, ensuring that all coordinates use the same scale for easy comparison and calculation. Next, the coordinate information of the sound-producing part (e.g., the mouth) is extracted, and the target position information is numerically transformed (e.g., linear mapping) to obtain a target value within a preset range (e.g., 4-22). After obtaining the target value, based on the pre-created mapping relationship between the value and the volume ratio of each channel in the audio output device, the target channel volume ratio corresponding to the target value is determined. According to the target channel volume ratio, the target audio output parameters corresponding to each channel of the audio output device are adjusted. This is achieved by controlling the gain of different channels, so that the final output audio effect appears to be emanating from the sound-producing part of the target object, achieving synchronization between the sound source position and the on-screen action, and enhancing the user's audiovisual experience.
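A minimal sketch of the conversion described in [0097] follows, assuming a two-channel (left/right) device and using only the horizontal coordinate of the mouth. The 4 to 22 range comes from the description; the linear left/right ratio table is an assumption, since the application only says that the mapping between values and channel volume ratios is created in advance.

```python
from typing import Dict, Tuple


def to_target_value(x_px: float, screen_width_px: float,
                    lo: int = 4, hi: int = 22) -> int:
    """Normalize x to [0, 1], then map it linearly onto the range [lo, hi]."""
    x_norm = min(1.0, max(0.0, x_px / screen_width_px))
    return round(lo + x_norm * (hi - lo))


def build_ratio_table(lo: int = 4, hi: int = 22) -> Dict[int, Tuple[float, float]]:
    """Hypothetical pre-created mapping: value -> (left ratio, right ratio)."""
    table = {}
    for value in range(lo, hi + 1):
        right = (value - lo) / (hi - lo)   # 0.0 at the left edge, 1.0 at the right
        table[value] = (1.0 - right, right)
    return table


def channel_ratio(x_px: float, screen_width_px: float) -> Tuple[float, float]:
    """Look up the channel volume ratio for a mouth x coordinate."""
    return build_ratio_table()[to_target_value(x_px, screen_width_px)]
```

With this table, a mouth at the horizontal center of a 1920-pixel-wide frame maps to the value 13 and an equal split between the two channels. A real device with more channels, or one using a perceptual panning law, would need a different table.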

[0098] In some embodiments, as shown in Figures 6-7, prior to step S101, the method further includes:

[0099] S301, in response to the activation operation of the preset function, detect whether there is a target object in the current playback screen and whether the target object performs a sound-emitting action;

[0100] In this embodiment, the preset function refers to the function of controlling the audio output so that the perceived sound source position is as consistent as possible with the position of the sound-emitting part of the target object, so as to realize the synchronization of the sound source position and the on-screen action, thereby improving the user's audiovisual experience. This function can be encapsulated into a control that lets the user choose whether to enable it. When the preset function is detected to be enabled, a detection command is triggered to determine whether there is a target object in the currently playing screen and whether the target object is performing a sound-emitting action.

[0101] S302, when a target object is detected for the first time in the current playback screen and the target object performs a sound-emitting action, the target position information of the sound-emitting part corresponding to the sound-emitting action is determined;

[0102] S303, invoke a preset identifier to display at the position corresponding to the target location information, and hide the preset identifier after a preset duration.

[0103] In this embodiment, when the preset function is first activated after the audio output device is powered on, the user is prompted that the function is active. When a target object is detected in the current playback screen and the target object performs a sound-emitting action, the target position information of the sound-emitting part corresponding to that action is determined, and a preset identifier (such as an edit box, annotation, or animation) is displayed at the position corresponding to the target position information. After a preset duration (such as 3 seconds), the preset identifier is hidden to avoid affecting the user's viewing of the playback screen.
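The show-then-hide behavior of steps S302 and S303 can be sketched as a small state holder. This is an illustrative sketch only: the class name, its fields, and the use of caller-supplied timestamps are assumptions, and actual rendering of the identifier is left out. The 3-second default comes from the example above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ActivationHint:
    """Tracks a one-time on-screen marker shown at the sound-emitting part."""

    duration_s: float = 3.0  # preset duration from the example in the text
    shown_at: Optional[float] = None
    position: Optional[Tuple[float, float]] = None

    def on_detection(self, position: Tuple[float, float], now: float) -> None:
        """Record the marker only for the first qualifying detection."""
        if self.shown_at is None:
            self.shown_at = now
            self.position = position

    def is_visible(self, now: float) -> bool:
        """The marker is visible until the preset duration has elapsed."""
        return self.shown_at is not None and now - self.shown_at < self.duration_s
```

A rendering loop could call `is_visible` on every frame and draw the identifier at `position` only while it returns true; later detections do not move or re-show the marker.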

[0104] In some embodiments, after controlling the audio output device to output audio corresponding to the target location information, the method further includes:

[0105] Detect the distance between the current user and the audio output device;

[0106] If the distance value is greater than a first preset threshold, then the output volume of the audio output device is increased;

[0107] If the distance value is less than the second preset threshold, the output volume of the audio output device is reduced, where the first preset threshold is greater than the second preset threshold.

[0108] In this embodiment, to provide a more comfortable listening experience and avoid audio that is too loud or too quiet because the user is too close to or too far from the audio output device, the distance between the user and the audio output device is detected periodically (e.g., every 1 second). Assume, for example, that the optimal viewing distance is between 3 and 5 meters, so that the second preset threshold is 3 meters and the first preset threshold is 5 meters. If the distance is less than 3 meters, the gain is reduced to decrease the output volume of the audio output device so that the sound is not too loud. If the distance is greater than 5 meters, the gain is increased to raise the output volume of the audio output device so that the sound is not too quiet.
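The distance-based adjustment can be sketched as follows. The 3-meter and 5-meter thresholds come from the example above; the step size, the gain limits, and the function name are hypothetical, and the way the distance is measured is outside the scope of this sketch.

```python
NEAR_THRESHOLD_M = 3.0  # second preset threshold (example value from the text)
FAR_THRESHOLD_M = 5.0   # first preset threshold (example value from the text)
GAIN_STEP_DB = 1.0      # hypothetical adjustment applied per detection cycle


def adjust_gain_db(current_gain_db: float, distance_m: float,
                   min_db: float = -20.0, max_db: float = 6.0) -> float:
    """Raise the gain when the user is far away, lower it when too close."""
    if distance_m > FAR_THRESHOLD_M:
        new_gain = current_gain_db + GAIN_STEP_DB
    elif distance_m < NEAR_THRESHOLD_M:
        new_gain = current_gain_db - GAIN_STEP_DB
    else:
        new_gain = current_gain_db  # inside the comfortable range: no change
    # Keep the result inside the (hypothetical) limits of the output stage.
    return min(max(new_gain, min_db), max_db)
```

Calling this function once per detection cycle changes the volume gradually rather than in one jump, which is one possible design choice; the text does not specify how large each adjustment is.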

[0109] In some embodiments, when a target object with its back turned to the viewer is detected in the currently playing screen, and the currently output audio belongs to that target object, the position of the sound-emitting part of that target object can be predicted. Based on this position, target audio output parameters required for the audio output device to output audio are generated. Based on the target audio output parameters, the audio output device is controlled to output audio corresponding to the target position information. This ensures that even when the target object has its back turned, the user can clearly perceive that the output audio is emanating from the sound-emitting part of the target object.

[0110] In some embodiments, when the area where the target object is located exceeds the maximum threshold (e.g., 90%) of the current playback screen, it means that the target object has been enlarged and fills the current playback screen. At this time, the operation of locating the sound-emitting part is not very meaningful. In order to reduce the data processing pressure of the system, the current operation of obtaining the position information of the sound-emitting part can be skipped, and the default centering sound output strategy can be directly used to output audio.
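The screen-fill check can be sketched as below, assuming the target object is described by an axis-aligned bounding box in pixel coordinates. The 90% threshold is the example value above; the function names and the string return values are illustrative only.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def area_fraction(bbox: Box, frame_w: float, frame_h: float) -> float:
    """Fraction of the frame covered by the part of the box inside the frame."""
    x0, y0, x1, y1 = bbox
    w = max(0.0, min(x1, frame_w) - max(x0, 0.0))
    h = max(0.0, min(y1, frame_h) - max(y0, 0.0))
    return (w * h) / (frame_w * frame_h)


def choose_strategy(bbox: Box, frame_w: float, frame_h: float,
                    max_fraction: float = 0.9) -> str:
    """Return "center" when the object fills the screen, else "localize"."""
    if area_fraction(bbox, frame_w, frame_h) > max_fraction:
        return "center"
    return "localize"
```

When `choose_strategy` returns "center", the key-point extraction for the sound-emitting part can be skipped for that frame, which is where the processing saving comes from.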

[0111] In some embodiments, for ease of understanding, please refer to Figure 8, which is a schematic flowchart of the audio control method provided in the embodiments of this application. The following is an explanation of the terms appearing in the figure:

[0112] UI (User Interface) layer: In the system architecture, it is mainly responsible for the interface design that interacts with users and is the direct interface for users to interact with the system.

[0113] Framework Layer: In the Android system, the Framework Layer is the core of the system architecture, located between the Application Layer and the operating system (Linux Kernel). It provides a series of services, libraries, and management functions to help developers create efficient and feature-rich applications.

[0114] Audio HAL (Audio Hardware Abstraction Layer): Audio HAL is a software layer used to handle interaction with audio hardware. It provides a standardized interface that allows operating systems and applications to communicate with audio hardware without directly dealing with hardware details. This abstraction layer helps simplify the development and maintenance of audio device drivers and promotes cross-platform compatibility.

[0115] SOC (System on Chip): SOC is an integrated circuit (IC) technology that integrates all or most of the functions of a computer or electronic system onto a single chip. An SOC typically integrates multiple functional modules, such as a central processing unit (CPU), graphics processing unit (GPU), memory controller, input / output ports, various interfaces (such as USB, HDMI, Ethernet), audio and video processing units, DSP modules, and power management units. Here, it refers to a TV processing chip.

[0116] All of the above-mentioned optional technical solutions can be combined in any way to form the optional embodiments of this application, and will not be described in detail here.

[0117] In practice, this application is not limited by the execution order of the described steps. Without causing conflicts, some steps may be performed in other orders or simultaneously.

[0118] As can be seen from the above, the audio control method provided in this application determines the target position information of the sound-emitting part corresponding to the sound-emitting action when a target object is detected in the current playback screen and the target object performs a sound-emitting action; generates target audio output parameters required for the audio output device to output audio based on the target position information; and controls the audio output device to output audio corresponding to the target position information based on the target audio output parameters. By using the audio control method provided in this application, when a target object is detected in the current playback screen and the target object performs a sound-emitting action, target audio output parameters required for the audio output device are generated based on the target position information of the sound-emitting part corresponding to the sound-emitting action, thereby controlling the audio output device to output audio corresponding to the target position information, achieving synchronization between the sound source position and the screen action, and improving the user's audiovisual experience.

[0119] This application also provides an audio control device that can be integrated into an electronic device.

[0120] Please refer to Figure 9, which is a schematic diagram of the structure of the audio control device provided in an embodiment of this application. The audio control device 30 may include:

[0121] The position determination module 31 is used to determine the target position information of the sound-emitting part corresponding to the sound-emitting action when a target object is detected in the current playback screen and the target object performs a sound-emitting action.

[0122] The parameter generation module 32 is used to generate target audio output parameters required for the audio output device to output audio based on the target location information.

[0123] The audio control module 33 is used to control the audio output device to output audio corresponding to the target location information based on the target audio output parameters.

[0124] In some embodiments, the position determination module 31 is used to obtain a screenshot of the currently playing screen, input the screenshot into a pre-trained target detection model for target detection, and output a first detection result; if the first detection result indicates that there is a target object in the screenshot, then it further detects whether the target object performs a vocalization action and outputs a second detection result; if the second detection result indicates that the target object performs a vocalization action, then it determines the target position information of the vocalization part corresponding to the vocalization action.

[0125] In some embodiments, the position determination module 31 is used to obtain at least two screenshots of the current playback screen that are located after the reference time and contain the target object, using the time when the target object is detected as the reference time; extract the position coordinates of each sound-emitting part in the at least two screenshots respectively; detect whether the relative position coordinates of each sound-emitting part in the at least two screenshots have changed, and then determine whether the target object has performed a sound-emitting action.
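One way to read "relative position coordinates have changed" is sketched below for a face: the gap between the lips is divided by the distance between the eyes, so that zooming or moving the whole face does not count as a change. The key-point names, the choice of lip and eye points, and the tolerance are assumptions for illustration and are not specified by this application.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def relative_mouth_opening(kp: Dict[str, Point]) -> float:
    """Lip gap divided by inter-eye distance, so that scale changes cancel out."""
    lip_gap = math.dist(kp["upper_lip"], kp["lower_lip"])
    eye_span = math.dist(kp["left_eye"], kp["right_eye"])
    return lip_gap / eye_span if eye_span > 0 else 0.0


def is_vocalizing(frames: List[Dict[str, Point]], tolerance: float = 0.05) -> bool:
    """Compare the relative opening across at least two screenshots."""
    if len(frames) < 2:
        raise ValueError("at least two screenshots are required")
    openings = [relative_mouth_opening(kp) for kp in frames]
    # A change larger than the tolerance is treated as a sound-emitting action.
    return max(openings) - min(openings) > tolerance
```

Each element of `frames` holds the key points extracted from one screenshot taken after the reference time; identical key points across screenshots yield no detected action.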

[0126] In some embodiments, the device further includes a control module, configured to, if the first detection result indicates that there is no target object in the screenshot, generate target audio output parameters required for the audio output device to output audio based on the center position coordinates of the currently playing screen; and control the audio output device to output audio at a position corresponding to the center position coordinates based on the target audio output parameters.

[0127] In some embodiments, the position determination module 31 is used to construct an array structure corresponding to the target object based on the coordinates of all key points of the target object, and select target position information corresponding to the sound-producing part from the array structure.

[0128] In some embodiments, the parameter generation module 32 is used to perform numerical conversion processing on the target location information to obtain a target value within a preset range; determine the target channel volume ratio corresponding to the target value based on the pre-created mapping relationship between the value and the volume ratio of each channel in the audio output device; and adjust the target audio output parameters corresponding to each channel of the audio output device according to the target channel volume ratio.

[0129] In some embodiments, the device further includes an effect display module, configured to, in response to an operation to enable a preset function, detect whether there is a target object in the current playback screen and whether the target object performs a vocal action; when the target object is detected for the first time in the current playback screen and the target object performs a vocal action, determine the target position information of the vocal part corresponding to the vocal action; call a preset identifier to display at the position corresponding to the target position information, and hide the preset identifier after a preset duration.

[0130] In some embodiments, the device further includes a volume adjustment module for detecting the distance between the current user and the audio output device; if the distance is greater than a first preset threshold, increasing the output volume of the audio output device; if the distance is less than a second preset threshold, decreasing the output volume of the audio output device, wherein the first preset threshold is greater than the second preset threshold.

[0131] In practice, the above modules can be implemented as independent entities or combined in any way to be implemented as the same or several entities.

[0132] As can be seen from the above, the audio control device 30 provided in this application embodiment includes a position determination module 31, which is used to determine the target position information of the sound-emitting part corresponding to the sound-emitting action when a target object is detected in the current playback screen and the target object performs a sound-emitting action; a parameter generation module 32, which is used to generate target audio output parameters required for the audio output device to output audio based on the target position information; and an audio control module 33, which is used to control the audio output device to output audio corresponding to the target position information based on the target audio output parameters. By generating target audio output parameters required for the audio output device based on the target position information of the sound-emitting part corresponding to the sound-emitting action when a target object is detected in the current playback screen and the target object performs a sound-emitting action, the audio output device is controlled to output audio corresponding to the target position information, thereby synchronizing the sound source position with the screen action and improving the user's audiovisual experience.

[0133] This application also provides an audio control device that can be integrated into an electronic device.

[0134] In practice, the above modules can be implemented as independent entities or combined in any way to be implemented as the same or several entities.

[0135] Please refer to Figure 10, which is another structural schematic diagram of the audio control device provided in this application embodiment. The audio control device 30 includes a memory 120, one or more processors 180, and one or more application programs, wherein the one or more application programs are stored in the memory 120 and configured to be executed by the processor 180; the processor 180 may include a position determination module 31, a parameter generation module 32, and an audio control module 33. For example, the structure and connection relationship of the above components can be as follows:

[0136] Memory 120 can be used to store applications and data. The applications stored in memory 120 contain executable code. Applications can be composed of various functional modules. Processor 180 executes various functional applications and data processing by running the applications stored in memory 120. Furthermore, memory 120 may include high-speed random access memory and may also include non-volatile memory, such as at least one disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, memory 120 may also include a memory controller to provide processor 180 with access to memory 120.

[0137] The processor 180 is the control center of the device, connecting all parts of the device through various interfaces and lines. It performs various functions and processes data by running or executing applications stored in the memory 120 and calling data stored in the memory 120, thereby providing overall monitoring of the device. Optionally, the processor 180 may include one or more processing cores; preferably, the processor 180 may integrate an application processor and a modem processor, wherein the application processor mainly handles the operating system, user interface, and applications.

[0138] Specifically, in this embodiment, the processor 180 loads the executable code corresponding to the processes of one or more applications into the memory 120 according to the following instructions, and the processor 180 runs the applications stored in the memory 120 to achieve various functions:

[0139] The position determination instruction is used to determine the target position information of the sound-emitting part corresponding to the sound-emitting action when a target object is detected in the current playback screen and the target object performs a sound-emitting action.

[0140] The parameter generation instruction is used to generate the target audio output parameters required for the audio output device to output audio based on the target location information.

[0141] The audio control instruction is used to control the audio output device to output audio corresponding to the target location information based on the target audio output parameters.

[0142] In some embodiments, the location determination instruction is used to obtain a screenshot corresponding to the currently playing screen, input the screenshot into a pre-trained target detection model for target detection, and output a first detection result; if the first detection result indicates that there is a target object in the screenshot, then it is further detected whether the target object performs a vocalization action, and a second detection result is output; if the second detection result indicates that the target object performs a vocalization action, then the target location information of the vocalization part corresponding to the vocalization action is determined.

[0143] In some embodiments, the position determination instruction is used to obtain at least two screenshots from the currently playing screen that are located after the reference time and contain the target object, using the time when the target object is detected as a reference time; extract the position coordinates of each sound-emitting part in the at least two screenshots respectively; detect whether the relative position coordinates of each sound-emitting part in the at least two screenshots have changed, and then determine whether the target object has performed a sound-emitting action.

[0144] In some embodiments, the program further includes control instructions for generating target audio output parameters required for the audio output device to output audio based on the center position coordinates of the currently playing screen if the first detection result indicates that there is no target object in the screenshot; and controlling the audio output device to output audio at the position corresponding to the center position coordinates based on the target audio output parameters.

[0145] In some embodiments, the location determination instruction is used to construct an array structure corresponding to the target object based on the coordinates of all key points of the target object, and to select target location information corresponding to the vocalization part from the array structure.

[0146] In some embodiments, the parameter generation instruction is used to perform numerical conversion processing on the target location information to obtain a target value within a preset range; determine the target channel volume ratio corresponding to the target value based on the pre-created mapping relationship between the value and the volume ratio of each channel in the audio output device; and adjust the target audio output parameters corresponding to each channel of the audio output device according to the target channel volume ratio.

[0147] In some embodiments, the program further includes an effect display instruction, which, in response to an operation to enable a preset function, detects whether a target object exists in the current playback screen and whether the target object performs a vocal action; when the target object is detected for the first time in the current playback screen and the target object performs a vocal action, determines the target position information of the vocal part corresponding to the vocal action; calls a preset identifier to display at the position corresponding to the target position information, and hides the preset identifier after a preset duration.

[0148] In some embodiments, the program further includes a volume adjustment instruction for detecting the distance between the current user and the audio output device; if the distance is greater than a first preset threshold, increasing the output volume of the audio output device; if the distance is less than a second preset threshold, decreasing the output volume of the audio output device, wherein the first preset threshold is greater than the second preset threshold.

[0149] This application also provides an electronic device. Please refer to Figure 11, which shows a schematic diagram of the structure of the electronic device provided in this application embodiment. This electronic device can be used to implement the audio control method provided in the above embodiments. The electronic device 1200 can be a television set, a smartphone, or a tablet computer.

[0150] As shown in Figure 11, the electronic device 1200 may include an RF (Radio Frequency) circuit 110, a memory 120 including one or more (only one is shown in the figure) computer-readable storage media, an input unit 130, a display unit 140, a sensor 150, an audio circuit 160, a transmission module 170, a processor 180 including one or more (only one is shown in the figure) processing cores, and a power supply 190, etc. Those skilled in the art will understand that the structure of the electronic device 1200 shown in Figure 11 does not constitute a limitation on the electronic device 1200, and may include more or fewer components than shown, or combine certain components, or have different component arrangements. Wherein:

[0151] RF circuit 110 is used to receive and transmit electromagnetic waves, converting electromagnetic waves into electrical signals and vice versa, thereby enabling communication with communication networks or other devices. RF circuit 110 may include various existing circuit elements used to perform these functions, such as antennas, radio frequency transceivers, digital signal processors, encryption / decryption chips, subscriber identity module (SIM) cards, memory, etc. RF circuit 110 can communicate with various networks such as the Internet, corporate intranets, and wireless networks, or communicate with other devices via wireless networks.

[0152] The memory 120 can be used to store software programs and modules, such as the program instructions / modules corresponding to the audio control method in the above embodiment. The processor 180 executes various functional applications and data processing by running the software programs and modules stored in the memory 120. In this way, the audio output is controlled according to the position of the sound-emitting part in the playback screen, synchronizing the sound source position with the on-screen action. The memory 120 may include high-speed random access memory and non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memories. In some instances, the memory 120 may further include memories remotely located relative to the processor 180, which can be connected to the electronic device 1200 via a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.

[0153] Input unit 130 can be used to receive input digital or character information, and generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control. Specifically, input unit 130 may include touch-sensitive surface 131 and other input devices 132. Touch-sensitive surface 131, also known as a touch display screen or touchpad, can collect user touch operations on or near it (such as user operations using fingers, styluses, or any suitable object or accessory on or near touch-sensitive surface 131), and drive corresponding connection devices according to a preset program. Optionally, touch-sensitive surface 131 may include two parts: a touch detection device and a touch controller. The touch detection device detects the user's touch position and the signal generated by the touch operation, and transmits the signal to the touch controller; the touch controller receives touch information from the touch detection device, converts it into touch point coordinates, sends it to the processor 180, and can receive and execute commands from the processor 180. Furthermore, the touch-sensitive surface 131 can be implemented using various methods such as resistive, capacitive, infrared, and surface acoustic wave. The other input devices 132 may include, but are not limited to, one or more of the following: a physical keyboard, function keys (such as volume control buttons, power buttons, etc.), a trackball, a mouse, and a joystick.

[0154] Display unit 140 can be used to display information input by the user or information provided to the user, as well as various graphical user interfaces of electronic device 1200. These graphical user interfaces can be composed of graphics, text, icons, video, and any combination thereof. Display unit 140 may include display panel 141, which may optionally be configured as an LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), or similar form. Further, touch-sensitive surface 131 may cover display panel 141. When touch-sensitive surface 131 detects a touch operation on or near it, it transmits the information to processor 180 to determine the type of touch event. Subsequently, processor 180 provides corresponding visual output on display panel 141 according to the type of touch event. Although in FIG. 11, touch-sensitive surface 131 and display panel 141 are implemented as two separate components to realize input and output functions, in some embodiments, touch-sensitive surface 131 and display panel 141 can be integrated to realize input and output functions.

[0155] The electronic device 1200 may also include at least one sensor 150, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor. The ambient light sensor can adjust the brightness of the display panel 141 according to the ambient light level, and the proximity sensor can turn off the display panel 141 and / or backlight when the electronic device 1200 is moved to the ear. As a type of motion sensor, a gravity acceleration sensor can detect the magnitude of acceleration in various directions (generally three axes). When stationary, it can detect the magnitude and direction of gravity and can be used for applications that recognize the device's posture (such as landscape / portrait switching, related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer, tapping), etc. Other sensors that may be configured in the electronic device 1200, such as gyroscopes, barometers, hygrometers, thermometers, and infrared sensors, will not be described in detail here.

[0156] Audio circuitry 160, speaker 161, and microphone 162 provide an audio interface between the user and electronic device 1200. Audio circuitry 160 converts received audio data into electrical signals and transmits them to speaker 161, where speaker 161 converts them into sound signals for output. Conversely, microphone 162 converts collected sound signals into electrical signals, which are then received by audio circuitry 160, converted back into audio data, and processed by processor 180. The audio data is then transmitted via RF circuitry 110 to another terminal, or output to memory 120 for further processing. Audio circuitry 160 may also include an earphone jack to facilitate communication between external headphones and electronic device 1200.

[0157] Electronic device 1200, through transmission module 170 (e.g., Wi-Fi module), enables users to send and receive emails, browse web pages, and access streaming media, providing users with wireless broadband internet access. Although Figure 11 shows transmission module 170, it is understood that it is not an essential component of electronic device 1200 and can be omitted as needed without altering the essence of the invention.

[0158] The processor 180 is the control center of the electronic device 1200. It connects to various parts of the electronic device 1200 via various interfaces and lines, and performs various functions and processes data of the electronic device 1200 by running or executing software programs and / or modules stored in the memory 120, and by calling data stored in the memory 120, thereby providing overall monitoring of the electronic device 1200. Optionally, the processor 180 may include one or more processing cores; in some embodiments, the processor 180 may integrate an application processor and a modem processor, wherein the application processor mainly handles the operating system, user interface, and applications, and the modem processor mainly handles wireless communication. It is understood that the modem processor may also not be integrated into the processor 180.

[0159] The electronic device 1200 also includes a power supply 190 that supplies power to the various components. In some embodiments, the power supply can be logically connected to the processor 180 through a power management system, thereby enabling functions such as discharge management and power consumption management through the power management system. The power supply 190 may also include one or more DC or AC power supplies, recharging systems, power fault detection circuits, power converters or inverters, power status indicators, and other arbitrary components.

[0160] Although not shown, the electronic device 1200 may also include a camera (such as a front-facing camera and a rear-facing camera), a Bluetooth module, etc., which will not be described in detail here. Specifically, in this embodiment, the display unit 140 of the electronic device 1200 is a touch screen display, and the electronic device 1200 also includes a memory 120 and one or more programs, one or more of which are stored in the memory 120 and configured to be executed by one or more processors 180. One or more programs contain instructions for performing the following operations:

[0161] The position determination instruction is used to determine the target position information of the sound-emitting part corresponding to the sound-emitting action when a target object is detected in the current playback screen and the target object performs a sound-emitting action.

[0162] The parameter generation instruction is used to generate the target audio output parameters required for the audio output device to output audio based on the target location information.

[0163] The audio control instruction is used to control the audio output device to output audio corresponding to the target location information based on the target audio output parameters.

[0164] In some embodiments, the position determination instruction is used to obtain a screenshot of the currently playing screen, input the screenshot into a pre-trained target detection model for target detection, and output a first detection result; if the first detection result indicates that a target object is present in the screenshot, further detect whether the target object performs a vocalization action and output a second detection result; and, if the second detection result indicates that the target object performs a vocalization action, determine the target location information of the vocalization part corresponding to the vocalization action.

[0165] In some embodiments, the position determination instruction is used to take the time at which the target object is detected as a reference time and obtain, from the currently playing screen, at least two screenshots that are captured after the reference time and contain the target object; extract the position coordinates of each sound-emitting part in each of the at least two screenshots; and detect whether the relative position coordinates of each sound-emitting part have changed between the at least two screenshots, thereby determining whether the target object has performed a sound-emitting action.
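
As an illustration only, the check described in paragraph [0165] can be sketched as follows, using the opening measure d_open = |y_bottom - y_top| that claim 4 gives for the sound-emitting part. The names `MouthBox`, `opening`, and `is_vocalizing`, and the default threshold of 2.0 pixels, are assumptions made for this sketch and do not appear in the application.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class MouthBox:
    """Hypothetical container for the top and bottom edge y coordinates
    of the sound-emitting part in one screenshot."""
    y_top: float
    y_bottom: float


def opening(box: MouthBox) -> float:
    # d_open = |y_bottom - y_top|
    return abs(box.y_bottom - box.y_top)


def is_vocalizing(boxes: Sequence[MouthBox], threshold: float = 2.0) -> bool:
    """Return True if the opening changes by more than `threshold`
    between any two consecutive screenshots (the inter-frame difference).

    The threshold value is an assumption; the application only says
    that a threshold is used.
    """
    if len(boxes) < 2:
        raise ValueError("at least two screenshots are required")
    openings = [opening(b) for b in boxes]
    return any(abs(cur - prev) > threshold
               for prev, cur in zip(openings, openings[1:]))
```

Comparing the opening rather than the raw coordinates means that a target object that merely moves across the screen with a closed mouth does not register as performing a sound-emitting action.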

[0166] In some embodiments, the program further includes control instructions for generating target audio output parameters required for the audio output device to output audio based on the center position coordinates of the currently playing screen if the first detection result indicates that there is no target object in the screenshot; and controlling the audio output device to output audio at the position corresponding to the center position coordinates based on the target audio output parameters.
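
The branching in paragraphs [0164] and [0166] can be sketched as below. This is a minimal illustration, not the implementation of the application: `detect_target` and `detect_vocalizing_part` are hypothetical callables standing in for the pre-trained target detection model and the vocalization check, and `resolve_sound_position` is a name chosen for this example.

```python
from typing import Callable, Optional, Tuple

Point = Tuple[float, float]


def resolve_sound_position(
    frame: object,
    frame_size: Tuple[int, int],
    detect_target: Callable[[object], bool],
    detect_vocalizing_part: Callable[[object], Optional[Point]],
) -> Optional[Point]:
    """Pick the screen position from which the audio should appear to come.

    Both callables are hypothetical stand-ins supplied by the caller.
    """
    width, height = frame_size
    if not detect_target(frame):
        # First detection result is negative: fall back to the center
        # position coordinates of the playback screen.
        return (width / 2, height / 2)
    # Target present: return the position of the vocalizing part, or None
    # when no vocalization is detected. The application does not say what
    # happens in that case, so it is left to the caller here.
    return detect_vocalizing_part(frame)
```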

[0167] In some embodiments, the position determination instruction is used to construct an array structure corresponding to the target object based on the coordinates of all key points of the target object, and to select target location information corresponding to the vocalization part from the array structure.
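
One possible reading of the array structure in paragraph [0167] is a fixed-order list of key point coordinates from which the entry for the vocalization part is taken by index. The key point names and their order below are assumptions for illustration; the application does not enumerate the key points.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]

# Hypothetical key point order; not taken from the application.
KEYPOINT_ORDER = ("left_eye", "right_eye", "nose", "mouth")
MOUTH_INDEX = KEYPOINT_ORDER.index("mouth")


def build_keypoint_array(keypoints: Dict[str, Point]) -> List[Point]:
    """Arrange the named key points of one target object in a fixed order."""
    return [keypoints[name] for name in KEYPOINT_ORDER]


def select_vocal_part(array: List[Point]) -> Point:
    """Select the coordinates of the vocalization part from the array."""
    return array[MOUTH_INDEX]
```

For example, `select_vocal_part(build_keypoint_array(detected))` returns the `(x, y)` pair stored under `"mouth"` in a hypothetical `detected` dictionary.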

[0168] In some embodiments, the parameter generation instruction is used to perform numerical conversion processing on the target location information to obtain a target value within a preset range; determine the target channel volume ratio corresponding to the target value based on the pre-created mapping relationship between the value and the volume ratio of each channel in the audio output device; and adjust the target audio output parameters corresponding to each channel of the audio output device according to the target channel volume ratio.
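
A minimal sketch of the conversion in paragraph [0168], under stated assumptions: the preset range is taken to be [0, 1], the audio output device is taken to have two channels, and the mapping from value to channel volume ratio is taken to be linear. None of these choices is specified by the application, and the function names are illustrative.

```python
from typing import Dict


def normalize_x(x: float, screen_width: float) -> float:
    """Convert a horizontal pixel coordinate into a value in the
    assumed preset range [0, 1], clamping out-of-screen coordinates."""
    if screen_width <= 0:
        raise ValueError("screen_width must be positive")
    return min(max(x / screen_width, 0.0), 1.0)


def channel_ratio(value: float) -> Dict[str, float]:
    """Assumed linear mapping: 0 is fully left, 1 is fully right."""
    return {"left": 1.0 - value, "right": value}


def channel_gains(x: float, screen_width: float,
                  master_volume: float) -> Dict[str, float]:
    """Scale the master volume by the ratio of each channel."""
    ratio = channel_ratio(normalize_x(x, screen_width))
    return {channel: master_volume * r for channel, r in ratio.items()}
```

With these assumptions, a sound-emitting part at x = 480 on a 1920-pixel-wide screen gives a value of 0.25, so the left channel receives three quarters of the master volume and the right channel one quarter. A lookup table could replace `channel_ratio` where the mapping relationship is pre-created in discrete steps.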

[0169] In some embodiments, the program further includes an effect display instruction, which, in response to an operation to enable a preset function, detects whether a target object exists in the current playback screen and whether the target object performs a vocal action; when the target object is detected for the first time in the current playback screen and the target object performs a vocal action, determines the target position information of the vocal part corresponding to the vocal action; calls a preset identifier to display at the position corresponding to the target position information, and hides the preset identifier after a preset duration.

[0170] In some embodiments, the program further includes a volume adjustment command for detecting the distance between the current user and the audio output device; if the distance is greater than a first preset threshold, increasing the output volume of the audio output device; if the distance is less than a second preset threshold, decreasing the output volume of the audio output device, wherein the first preset threshold is greater than the second preset threshold.
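
The two-threshold rule in paragraph [0170] can be sketched as follows. The step size, the volume limits, and the function name `adjust_volume` are assumptions for illustration; the application specifies only the two thresholds and the direction of the adjustment.

```python
def adjust_volume(volume: int, distance: float,
                  first_threshold: float, second_threshold: float,
                  step: int = 1, min_volume: int = 0,
                  max_volume: int = 100) -> int:
    """Return the new output volume for the measured user distance.

    step, min_volume and max_volume are assumed values, not taken
    from the application.
    """
    if first_threshold <= second_threshold:
        raise ValueError("first threshold must be greater than second threshold")
    if distance > first_threshold:
        volume += step   # user is far away: raise the volume
    elif distance < second_threshold:
        volume -= step   # user is close: lower the volume
    # Between the two thresholds the volume is left unchanged.
    return max(min_volume, min(max_volume, volume))
```

Because the first threshold is greater than the second, distances between the two form a band in which the volume is not changed, which avoids repeated adjustments when the user stays near a single boundary.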

[0171] This application also provides an electronic device. The electronic device may be a smartphone, television, computer, or other similar device.

[0172] As can be seen from the above, this application embodiment provides an electronic device 1200, which performs the following steps:

[0173] When a target object is detected in the current playback screen and the target object performs a sound-emitting action, the target position information of the sound-emitting part corresponding to the sound-emitting action is determined;

[0174] Based on the target location information, generate the target audio output parameters required for the audio output device to output audio;

[0175] Based on the target audio output parameters, the audio output device is controlled to output audio corresponding to the target location information.

[0176] This application also provides a storage medium storing a computer program. When the computer program is run on a computer, the computer executes the audio control method described in any of the above embodiments.

[0177] It should be noted that, for the audio control method described in this application, those skilled in the art will understand that all or part of the processes of the audio control method described in the embodiments of this application can be implemented by a computer program controlling related hardware. The computer program can be stored in a computer-readable storage medium, such as in the memory of an electronic device, and executed by at least one processor within the electronic device. During execution, it can include the processes of the embodiments of the audio control method described. The storage medium can be a magnetic disk, optical disk, read-only memory (ROM), random access memory (RAM), etc.

[0178] For the audio control device described in this application embodiment, its functional modules can be integrated into a single processing chip, or each module can exist physically separately, or two or more modules can be integrated into one module. The integrated module can be implemented in hardware or as a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it can also be stored in a computer-readable storage medium, such as a read-only memory, a disk, or an optical disk.

[0179] The audio control method, apparatus, medium, and device provided in the embodiments of this application have been described in detail above. Specific examples have been used to illustrate the principles and implementation methods of this application. The descriptions of the embodiments above are only for the purpose of helping to understand the methods and core ideas of this application; at the same time, those skilled in the art will recognize that, based on the ideas of this application, there will be changes in the specific implementation methods and application scope. Therefore, the content of this specification should not be construed as a limitation of this application.

Claims

1. An audio control method applied to an audio output device, comprising: when a target object is detected in the current playback screen and the target object performs a sound-emitting action, determining the target position information of the sound-emitting part corresponding to the sound-emitting action; generating, based on the target location information, the target audio output parameters required for the audio output device to output audio; and controlling, based on the target audio output parameters, the audio output device to output audio corresponding to the target location information.

2. The audio control method as described in claim 1, wherein, When a target object is detected in the currently playing screen and the target object performs a vocalization action, the target position information of the vocalization part corresponding to the vocalization action is determined, including: Obtain a screenshot of the currently playing screen, input the screenshot into a pre-trained target detection model to perform target detection, and output the first detection result; If the first detection result indicates that a target object exists in the screenshot, then further detect whether the target object performs a sound-making action and output a second detection result; If the second detection result indicates that the target object is performing a vocalization action, then the target location information of the vocalization part corresponding to the vocalization action is determined.

3. The audio control method as described in claim 2, wherein, The further detection of whether the target object performs a vocalization includes: Using the moment when the target object is detected as a reference moment, at least two screenshots containing the target object are obtained from the currently playing screen after the reference moment; Extract the position coordinates of each sound-emitting part from each of the at least two screenshots; The system detects whether the relative position coordinates of each sound-emitting part in the at least two screenshots have changed, thereby determining whether the target object has performed a sound-emitting action.

4. The audio control method according to claim 3, wherein, Determining whether the target object performs a sound-emitting action is based on calculating the inter-frame difference using the position coordinates of each sound-emitting part in two screenshots. The formula for calculating the inter-frame difference is: $d_{\text{open}} = |y_{\text{bottom}} - y_{\text{top}}|$ and $\Delta d = |d_{\text{open}} - d_{\text{open-1}}|$, where $\Delta d$ is the inter-frame difference between the two screenshots, representing whether the relative position coordinates of each sound-emitting part in the two screenshots have changed, $y_{\text{top}}$ and $y_{\text{bottom}}$ represent the y coordinates of the top and bottom edges of the sound-emitting part, respectively, $d_{\text{open}}$ represents the opening degree of the sound-emitting part within the preset time period t, and $d_{\text{open-1}}$ represents the opening degree of the sound-emitting part in the previous frame. If the inter-frame difference exceeds a threshold, then the target object is determined to perform a vocalization action.

5. The audio control method as described in claim 2, wherein, After outputting the first detection result, the method further includes: If the first detection result indicates that there is no target object in the screenshot, then the target audio output parameters required for the audio output device to output audio are generated based on the center coordinates of the currently playing screen. Based on the target audio output parameters, the audio output device is controlled to output audio at a position corresponding to the center position coordinates.

6. The audio control method as described in claim 1, wherein, Determining the target location information of the vocalization part corresponding to the vocalization action includes: Based on the coordinates of all key points of the target object, an array structure corresponding to the target object is constructed, and target position information corresponding to the vocalization part is selected from the array structure; The step of generating the target audio output parameters required for the audio output device to output audio based on the target location information includes: The target location information is numerically converted to obtain a target value within a preset range. Based on the pre-created mapping relationship between the numerical value and the volume ratio of each channel in the audio output device, the target channel volume ratio corresponding to the target numerical value is determined; Adjust the target audio output parameters corresponding to each channel of the audio output device according to the target channel volume ratio.

7. The audio control method as described in claim 1, wherein, Before determining the target position information of the sound-emitting part corresponding to the sound-emitting action when a target object is detected in the currently playing screen and the target object performs a sound-emitting action, the method further includes: In response to the activation of a preset function, the system detects whether a target object exists in the current playback screen and whether the target object performs a sound-emitting action. When a target object is detected for the first time in the current playback screen and the target object performs a sound-emitting action, the target position information of the sound-emitting part corresponding to the sound-emitting action is determined; A preset identifier is invoked and displayed at the location corresponding to the target location information, and the preset identifier is hidden after a preset duration.

8. The audio control method as described in claim 1, wherein, After controlling the audio output device to output audio corresponding to the target location information, the method further includes: Detect the distance between the current user and the audio output device; If the distance value is greater than a first preset threshold, then the output volume of the audio output device is increased; If the distance value is less than the second preset threshold, the output volume of the audio output device is reduced, where the first preset threshold is greater than the second preset threshold.

9. The audio control method as described in claim 1, wherein, The audio control method further includes: When a target object corresponding to the currently output audio is detected in the currently playing screen and the target object is in a back-facing state, the position of the sound-emitting part of the target object in the back-facing state is predicted. Based on the location of the sound-producing part, the target audio output parameters required for the output audio are generated; Based on the target audio output parameters, output the audio corresponding to the target location information.

10. An audio control device, applied to an audio output device, wherein, The audio control device includes: The position determination module is used to determine the target position information of the sound-emitting part corresponding to the sound-emitting action when a target object is detected in the current playback screen and the target object performs a sound-emitting action. The parameter generation module is used to generate target audio output parameters required for the audio output device to output audio based on the target location information. An audio control module is used to control the audio output device to output audio corresponding to the target location information based on the target audio output parameters.

11. The audio control device as claimed in claim 10, wherein, The position determination module is further configured to, when detecting the presence of a target object in the currently playing screen and the target object performing a vocalization action, determine the target position information of the vocalization part corresponding to the vocalization action, including: Obtain a screenshot of the currently playing screen, input the screenshot into a pre-trained target detection model to perform target detection, and output the first detection result; If the first detection result indicates that a target object exists in the screenshot, then further detect whether the target object performs a sound-making action and output a second detection result; If the second detection result indicates that the target object is performing a vocalization action, then the target location information of the vocalization part corresponding to the vocalization action is determined.

12. The audio control device as claimed in claim 11, wherein, The location determination module is further used to detect whether the target object performs a vocalization action, including: Using the moment when the target object is detected as a reference moment, at least two screenshots containing the target object are obtained from the currently playing screen after the reference moment; Extract the position coordinates of each sound-emitting part from each of the at least two screenshots; The system detects whether the relative position coordinates of each sound-emitting part in the at least two screenshots have changed, thereby determining whether the target object has performed a sound-emitting action.

13. The audio control device as claimed in claim 12, wherein, The position determination module is also used to determine whether the target object performs a sound-emitting action based on the position coordinates of each sound-emitting part in two screenshots, calculating the inter-frame difference. The formula for calculating the inter-frame difference is: $d_{\text{open}} = |y_{\text{bottom}} - y_{\text{top}}|$ and $\Delta d = |d_{\text{open}} - d_{\text{open-1}}|$, where $\Delta d$ is the inter-frame difference between the two screenshots, representing whether the relative position coordinates of each sound-emitting part in the two screenshots have changed, $y_{\text{top}}$ and $y_{\text{bottom}}$ represent the y coordinates of the top and bottom edges of the sound-emitting part, respectively, $d_{\text{open}}$ represents the opening degree of the sound-emitting part within the preset time period t, and $d_{\text{open-1}}$ represents the opening degree of the sound-emitting part in the previous frame. If the inter-frame difference exceeds a threshold, then the target object is determined to perform a vocalization action.

14. The audio control device as claimed in claim 11, wherein, After outputting the first detection result, the location determination module is further configured to: If the first detection result indicates that there is no target object in the screenshot, then the target audio output parameters required for the audio output device to output audio are generated based on the center coordinates of the currently playing screen. Based on the target audio output parameters, the audio output device is controlled to output audio at a position corresponding to the center position coordinates.

15. The audio control device as claimed in claim 10, wherein, The parameter generation module is also used to determine the target location information of the vocalization part corresponding to the vocalization action, including: Based on the coordinates of all key points of the target object, an array structure corresponding to the target object is constructed, and target position information corresponding to the vocalization part is selected from the array structure; The step of generating the target audio output parameters required for the audio output device to output audio based on the target location information includes: The target location information is numerically converted to obtain a target value within a preset range. Based on the pre-created mapping relationship between the numerical value and the volume ratio of each channel in the audio output device, the target channel volume ratio corresponding to the target numerical value is determined; Adjust the target audio output parameters corresponding to each channel of the audio output device according to the target channel volume ratio.

16. The audio control device as claimed in claim 10, wherein, Before determining the target position information of the vocal part corresponding to the vocal action when the position determination module detects the presence of a target object in the currently playing screen and the target object performs a vocal action, it is further configured to: In response to the activation of a preset function, the system detects whether a target object exists in the current playback screen and whether the target object performs a sound-emitting action. When a target object is detected for the first time in the current playback screen and the target object performs a sound-emitting action, the target position information of the sound-emitting part corresponding to the sound-emitting action is determined; A preset identifier is invoked and displayed at the location corresponding to the target location information, and the preset identifier is hidden after a preset duration.

17. The audio control device as claimed in claim 10, wherein, After controlling the audio output device to output audio corresponding to the target location information, the audio control module is further configured to: Detect the distance between the current user and the audio output device; If the distance value is greater than a first preset threshold, then the output volume of the audio output device is increased; If the distance value is less than the second preset threshold, the output volume of the audio output device is reduced, where the first preset threshold is greater than the second preset threshold.

18. The audio control device as claimed in claim 10, wherein, The audio control module is also used for: When a target object corresponding to the currently output audio is detected in the currently playing screen and the target object is in a back-facing state, the position of the sound-emitting part of the target object in the back-facing state is predicted. Based on the location of the sound-producing part, the target audio output parameters required for the output audio are generated; Based on the target audio output parameters, output the audio corresponding to the target location information.

19. A computer-readable storage medium, wherein, The computer-readable storage medium stores a plurality of instructions adapted for loading by a processor to perform the audio control method according to any one of claims 1 to 9.

20. An electronic device, wherein, The device includes a processor and a memory, the memory storing a plurality of instructions, the processor loading the instructions to execute the audio control method according to any one of claims 1 to 9.