Method, apparatus, device and medium for mixing sounds of a remote interventional procedure
By filtering and processing audio data in the remote interventional surgery system, the problem of transmitting human voice and background noise together was solved, achieving balanced human voice volume and improving the quality of voice calls during remote interventional surgery.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Patents(China)
- Current Assignee / Owner: SHENZHEN INST OF ADVANCED BIOMEDICAL ROBOT CO LTD
- Filing Date: 2023-04-28
- Publication Date: 2026-06-19
AI Technical Summary
Existing technologies have the problem of transmitting human voice and background noise together during remote interventional surgery, making it difficult for doctors to hear what is being said, and the volume of human voice in the mixed audio may be too low.
The media server performs voice detection on audio data from multiple remote clients, filters out target audio signals containing only human voices, performs superposition processing, and calculates mixed data packets based on attenuation factors and preset adjustment factors to ensure balanced volume of different human voices.
It effectively removes background noise interference, ensures balanced human voice volume in the mixed audio data packet, and improves the quality of voice calls and user experience during remote interventional surgery.
Figure CN116453534B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of speech processing technology, and in particular to a mixing method, apparatus, device, and medium for remote interventional surgery.
Background Technology
[0002] With the continuous development of information technology, remote surgery has been increasingly widely used. Remote surgery offers advantages such as speed, timely processing, and the elimination of the need for patients to travel. Remote surgery requires multi-person voice communication. Typically, one surgeon operates a remote client, while multiple doctors in different regions (e.g., different provinces / cities) operate their respective remote clients to conduct multi-person voice communication.
[0003] Existing technologies, such as CN109863553A, disclose acquiring sound commands through a voice sensor, converting the sound commands into speech signals through a signal transmitter, and sending one or more speech signals to a processor. The signal transmitter acts as a media server, the voice sensor as a remote client, and the processor as another remote client. This prior art cannot distinguish between human voices and background noise, so it sends both from one remote client to another. Other existing technologies, such as CN109510905B, disclose selecting a centralized or distributed mixing strategy and mixing audio data streams based on an adaptive normalization mixing algorithm. However, current adaptive normalization mixing algorithms update the attenuation factor by directly adding the iteration step size to it, which may leave human voices too quiet in the mixed result.
[0004] In summary, existing audio mixing technologies used in remote interventional surgery have several drawbacks. Firstly, they may transmit both human voice and background noise from one remote client to another. Secondly, directly adding the attenuation factor and iteration step size to update the attenuation factor can result in an under-represented human voice in the mixed audio. Furthermore, with existing mixing solutions, too many people speaking simultaneously on each remote client during remote interventional surgery makes it difficult for doctors to hear clearly, hindering the smooth execution of the procedure.
Summary of the Invention
[0005] The purpose of this application is to provide a mixing method, device, equipment, and medium for remote interventional surgery that solve the following problems in the prior art: a user on one remote client cannot hear the voice of a user on another remote client; the voice and background noise of one remote client are sent to another remote client together; and, when each remote client plays the mixed audio, too many people are talking at the same time, making it difficult for doctors to hear the content clearly.
[0006] To achieve the above objectives, this application provides a mixing method for remote interventional surgery, applied to a remote interventional surgery system. The remote interventional surgery system includes a media server and multiple remote clients connected to the media server. The method includes the following steps performed by the media server:
[0007] Acquire audio data from multiple remote clients, perform voice detection on the audio data, and filter out multiple target audio signals that contain only human voices. The audio data is generated during multi-person calls in remote interventional surgery.
[0008] All the target audio signals are superimposed to obtain a superimposed audio signal;
[0009] The attenuation signal value of the i-th frame is obtained based on the signal value of the i-th frame of the superimposed audio signal and the preset attenuation factor;
[0010] If the attenuation signal value of the i-th frame is greater than the maximum value of the audio signal, then the signal ratio between the maximum value of the audio signal and the attenuation signal value of the i-th frame is calculated, and the largest integer value is selected from the range less than the signal ratio to obtain the attenuation factor of the i-th frame;
[0011] The output value of the i-th frame is determined based on the attenuation factor of the i-th frame, and the average value of the attenuation factor of the i-th frame and the preset adjustment factor is used as the attenuation factor of the (i+1)-th frame.
[0012] Based on the attenuation factor of the (i+1)th frame and the output value of the ith frame, a mixed audio data packet is obtained; the mixed audio data packet is distributed to each of the remote clients to broadcast the mixed audio data packet, wherein the volume of different human voices in the mixed audio data packet is equalized.
[0013] Preferably, the step of performing speech detection on the audio data and filtering out multiple target audio signals containing only human voices includes:
[0014] The audio data is subjected to speech detection using the VAD method to obtain the audio signal corresponding to each audio data; or, the speech features of the audio data are extracted and input into a trained speech detection model to obtain the audio signal corresponding to each audio data.
[0015] Detect whether the total number of audio signals is greater than the signal threshold. If so, calculate the volume threshold corresponding to each audio signal.
[0016] Select N target thresholds from all the volume thresholds;
[0017] Multiple target audio signals are selected based on N target thresholds.
[0018] Preferably, the step of filtering N target thresholds from all the volume thresholds includes:
[0019] All the volume thresholds are sorted in ascending order using the bubble sort method to obtain the threshold sequence;
[0020] The last N volume thresholds of the threshold sequence are used as the N target thresholds.
[0021] Preferably, after obtaining the attenuation signal value of the i-th frame, the method further includes:
[0022] If the attenuation signal value of the i-th frame is less than the minimum value of the audio signal, then the signal ratio between the minimum value of the audio signal and the attenuation signal value of the i-th frame is calculated, and the smallest integer value is selected from the range greater than the signal ratio to obtain the attenuation factor of the i-th frame;
[0023] The output value of the i-th frame is set to the minimum value of the audio signal based on the attenuation factor of the i-th frame.
[0024] Preferably, obtaining the mixed audio data packet based on the attenuation factor of the (i+1)th frame and the output value of the i-th frame includes:
[0025] Multiply the (i+1)th frame signal value of the superimposed audio signal by the (i+1)th frame attenuation factor to obtain the (i+1)th frame attenuation signal value;
[0026] If the attenuation signal value of the (i+1)th frame is greater than the maximum value of the audio signal, then the difference between the attenuation factor of the (i+1)th frame and the preset step size is used as the attenuation factor of the (i+2)th frame.
[0027] If the attenuation signal value of the (i+1)th frame is less than the minimum value of the audio signal, then the signal ratio between the minimum value of the audio signal and the attenuation signal value of the (i+1)th frame is calculated, and the smallest integer value from the range greater than the signal ratio is selected as the attenuation factor of the (i+2)th frame.
[0028] Detect whether the (i+1)th frame of the superimposed audio signal is the last frame of the superimposed audio signal. If so, calculate the output value of the (i+1)th frame based on the attenuation signal value of the (i+1)th frame, and combine the output value of the first frame to the output value of the (i+1)th frame to form the mixed audio data packet.
[0029] If the (i+1)th frame of the superimposed audio signal is not the last frame of the superimposed audio signal, then the output value of the (i+2)th frame is calculated based on the signal value of the (i+2)th frame of the superimposed audio signal and the attenuation factor of the (i+2)th frame.
[0030] Preferably, distributing the mixed audio data packet to each of the remote clients for broadcasting the mixed audio data packet includes:
[0031] The system detects whether the mixed audio data packet contains the voice of the target remote client. If it does, the target remote client is designated as the first target remote client. The voice of the first target remote client is removed to obtain a de-echo data packet. The de-echo data packet is distributed to the first target remote client, and the first target remote client is controlled to broadcast the de-echo data packet.
[0032] If the mixed audio data packet does not contain the human voice corresponding to the target remote client, then the target remote client is designated as the second target remote client; the mixed audio data packet is distributed to the second target remote client, and the second target remote client is controlled to broadcast the mixed audio data packet.
[0033] Preferably, determining the output value of the i-th frame based on the attenuation factor of the i-th frame includes:
[0034] The maximum value of the audio signal is used as the output value of the i-th frame.
[0035] This application provides a mixing device for remote interventional surgery, applied to a remote interventional surgery system. The remote interventional surgery system includes a media server and multiple remote clients connected to the media server. The device is located on the media server and includes:
[0036] The target audio signal filtering module is used to acquire audio data from multiple remote clients, perform voice detection on the audio data, and filter out multiple target audio signals that contain only human voices. The audio data is generated during multi-person calls in remote interventional surgery.
[0037] A signal superposition module is used to superimpose all the target audio signals to obtain a superimposed audio signal;
[0038] The i-th frame attenuation signal value calculation module is used to obtain the i-th frame attenuation signal value based on the i-th frame signal value of the superimposed audio signal and a preset attenuation factor.
[0039] The attenuation factor selection module for the i-th frame is used to calculate the signal ratio between the maximum value of the audio signal and the attenuation signal value of the i-th frame if the attenuation signal value of the i-th frame is greater than the maximum value of the audio signal, and select the largest integer value from the range less than the signal ratio to obtain the attenuation factor of the i-th frame.
[0040] The attenuation factor calculation module for the (i+1)th frame is used to determine the output value of the i-th frame based on the attenuation factor of the i-th frame, and to use the average value of the attenuation factor of the i-th frame and the preset adjustment factor as the attenuation factor of the (i+1)th frame.
[0041] The data packet broadcasting module is used to obtain the mixed audio data packet based on the attenuation factor of the (i+1)th frame and the output value of the i-th frame; and to distribute the mixed audio data packet to each of the remote clients to broadcast the mixed audio data packet, wherein the volume of different human voices in the mixed audio data packet is equalized.
[0042] This application also provides a computer device, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps of the mixing method for remote interventional surgery described in any of the above claims.
[0043] This application also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the audio mixing method for remote interventional surgery described in any of the preceding claims.
[0044] This application discloses a mixing method for remote interventional surgery, applied to a remote interventional surgery system. The system includes a media server and multiple remote clients connected to the media server, and the method is executed by the media server. By using voice detection to filter out target audio signals containing only human voices from audio data, noise can be removed, preventing users from being unable to identify the voice content in the mixed audio data packet due to noise interference. If the attenuation signal value of the i-th frame is greater than the maximum value of the audio signal, the largest integer value within the range less than the signal ratio is selected as the attenuation factor for the i-th frame, reducing the signal value of the i-th frame to the maximum value within a preset range. Using the average of the attenuation factor of the i-th frame and a preset adjustment factor as the attenuation factor of the (i+1)-th frame, compared to directly adding the attenuation factor of the i-th frame to the iteration step size, avoids the human voice volume in the mixed audio data packet being too low. Furthermore, selecting the largest integer value within the range less than the signal ratio as the attenuation factor of the i-th frame, compared to selecting a decimal as the attenuation factor of the i-th frame, reduces the computational load of the attenuation factor of the i-th frame, improves mixing efficiency, and reduces the data storage space occupied.
Attached Figure Description
[0045] Figure 1 is a flowchart of an audio mixing method for remote interventional surgery according to one embodiment;
[0046] Figure 2 is a schematic diagram of the process of filtering multiple target audio signals according to one embodiment;
[0047] Figure 3 is a schematic diagram of the process of setting the output value of the i-th frame according to one embodiment;
[0048] Figure 4 is a schematic diagram of the process of obtaining a mixed audio data packet according to one embodiment;
[0049] Figure 5 is a schematic diagram of the process of broadcasting mixed audio data packets according to one embodiment;
[0050] Figure 6 is a schematic block diagram of a mixing device for remote interventional surgery according to one embodiment;
[0051] Figure 7 is a schematic block diagram of the structure of a computer device according to one embodiment.
[0052] The realization of the purpose, functional features, and advantages of this application will be further explained in conjunction with the embodiments and with reference to the accompanying drawings.
Detailed Implementation
[0053] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.
[0054] Those skilled in the art will understand that, unless specifically stated otherwise, the singular forms “a,” “an,” and “the” used herein may also include the plural forms. It should be further understood that the term “comprising” as used in this specification means the presence of features, integers, steps, operations, elements, modules, and / or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, modules, components, and / or groups thereof. It should be understood that when an element is said to be “connected” or “coupled” to another element, it can be directly connected or coupled to the other element, or there may be intermediate elements. Furthermore, “connected” or “coupled” as used herein can include wireless connections or wireless coupling. The term “and / or” as used herein includes any and all combinations of one or more of the associated listed items.
[0055] It will be understood by those skilled in the art that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains. It should also be understood that terms such as those defined in general dictionaries should be understood to have the same meaning as in the context of the prior art, and should not be interpreted in an idealized or overly formal sense unless specifically defined as herein.
[0056] In one embodiment, refer to Figure 1, which is a flowchart illustrating the audio mixing method for remote interventional surgery disclosed in this application. The method is applied to a remote interventional surgery system, which includes a media server and multiple remote clients connected to the media server. The method includes the following steps performed by the media server:
[0057] S1: Acquire audio data from multiple remote clients, perform voice detection on the audio data, and filter out multiple target audio signals that contain only human voices, wherein the audio data is data generated during multi-person calls in remote interventional surgery.
[0058] The remote client can be a mobile app, computer software, or a device with the ability to receive and send remote voice messages; there are no restrictions here.
[0059] To conduct remote multi-person calls during remote interventional surgery, a media server and multiple remote clients are required. The media server is used to acquire and process audio data, while the remote clients are used to receive and send audio data.
[0060] By filtering out multiple target audio signals that contain only human voices from audio data, interference from background noise can be avoided, ensuring that the mixed audio data packets obtained based on the target audio signals contain only human voices, thus improving the user experience of users participating in remote multi-person calls.
[0061] S2: Superimpose all the target audio signals to obtain a superimposed audio signal.
[0062] One remote client corresponds to one target audio signal. By superimposing all target audio signals, the content of speech of all people currently participating in a remote multi-person call can be reflected.
[0063] As an example, there are 10 remote clients in total. There are users talking near 6 of the remote clients. The media server superimposes the target audio signals corresponding to the 6 remote clients to obtain the superimposed audio signal.
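The superposition in step S2 can be sketched as a frame-by-frame sum. This is a minimal illustration, assuming each target audio signal is a list of numeric frame values and that all signals have the same number of frames; the function name `superimpose` is hypothetical and does not come from this application.

```python
def superimpose(target_signals):
    """Add equal-length target audio signals frame by frame (sketch of step S2)."""
    if not target_signals:
        return []
    length = len(target_signals[0])
    if any(len(signal) != length for signal in target_signals):
        raise ValueError("all target audio signals must have the same number of frames")
    # zip(*...) groups the values that belong to the same frame index.
    return [sum(frame_values) for frame_values in zip(*target_signals)]


# Two hypothetical target audio signals with three frames each.
mixing = superimpose([[1, 2, 3], [10, 20, 30]])
```

The sum is left unclamped on purpose, because steps S3 to S6 are what bring out-of-range values back into the valid range.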
[0064] S3: Obtain the attenuation signal value of the i-th frame based on the signal value of the i-th frame of the superimposed audio signal and the preset attenuation factor.
[0065] The superimposed audio signal contains multiple frames, all arranged chronologically, and each frame has a signal value. The attenuated signal value of the i-th frame is obtained by multiplying the signal value of the i-th frame by a preset attenuation factor, calculated using the following formula:
[0066] d[i] = mixing[i] × f[0];
[0067] Where mixing[i] is the signal value of the i-th frame, f[0] is the preset attenuation factor, d[i] is the attenuation signal value of the i-th frame, i≤M, M is the total number of frames of the superimposed audio signal, where i and M are both positive integers.
[0068] The preset attenuation factor has a value range of 0.8-1.2, and preferably, the preset attenuation factor is set to 1.
[0069] S4: If the attenuation signal value of the i-th frame is greater than the maximum value of the audio signal, then calculate the signal ratio between the maximum value of the audio signal and the attenuation signal value of the i-th frame, and select the largest integer value from the range less than the signal ratio to obtain the attenuation factor of the i-th frame.
[0070] If the attenuation signal value of the i-th frame is greater than the maximum value of the audio signal, i.e., d[i] > MAX, an attenuation factor f[i] must be calculated for the i-th frame such that d[i] × f[i] < MAX, thereby avoiding signal overflow and preventing mixing errors. The signal ratio of the maximum value of the audio signal to the attenuation signal value of the i-th frame is calculated, i.e., MAX / d[i].
[0071] The attenuation factor for the i-th frame is obtained by selecting the largest integer value from the range smaller than the signal ratio. For example, if the signal ratio is 0.95, then 0 is used as the attenuation factor for the i-th frame.
[0072] The target audio signal is a 16-bit signal with a maximum value of 32767.
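Steps S3 and S4 can be illustrated with the short sketch below. It assumes that "the largest integer value from the range less than the signal ratio" can be modeled with `math.floor`; the function names and the sample value 34492 are hypothetical and chosen only so that the signal ratio is about 0.95, as in the example above.

```python
import math

MAX_VALUE = 32767  # maximum value of the audio signal given in the text


def attenuated_value(mixing_i, f0=1.0):
    """d[i] = mixing[i] x f[0], where f[0] is the preset attenuation factor."""
    return mixing_i * f0


def overflow_factor(d_i, max_value=MAX_VALUE):
    """Attenuation factor of frame i when d[i] exceeds the maximum (sketch of step S4)."""
    if d_i <= max_value:
        raise ValueError("step S4 only applies when d[i] > MAX")
    ratio = max_value / d_i   # signal ratio, strictly between 0 and 1 here
    return math.floor(ratio)  # assumption: floor models the integer selection


d = attenuated_value(34492)   # preferred preset factor f[0] = 1
f_i = overflow_factor(d)      # signal ratio is about 0.95
```

Under this reading, the signal ratio of any overflowing frame lies between 0 and 1, so the selected integer is always 0, which matches the 0.95 example.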
[0073] S5: Determine the output value of the i-th frame based on the attenuation factor of the i-th frame, and use the average value of the attenuation factor of the i-th frame and the preset adjustment factor as the attenuation factor of the (i+1)-th frame.
[0074] The attenuation factor for the (i+1)th frame is calculated using the following formula:
[0075] f[i+1] = (f[i] + a) / 2;
[0076] Where f[i+1] is the attenuation factor of the (i+1)th frame, a is the preset adjustment factor, and f[i] is the attenuation factor of the ith frame.
[0077] The preset adjustment factor is greater than or equal to 1. Preferably, the preset adjustment factor is set to 1.
[0078] After step S4, the attenuation factor of the i-th frame is less than 1. The average of the attenuation factor of the i-th frame and the preset adjustment factor is used as the attenuation factor of the (i+1)-th frame. Because the preset adjustment factor is at least 1, this average is larger than the attenuation factor of the i-th frame, which prevents the output value of the (i+1)-th frame from being too small and avoids the volume of human voices in the mixed audio data packet being too low. If the attenuation factor of the i-th frame is less than 1, the maximum value of the audio signal is used as the output value of the i-th frame. If the attenuation factor of the i-th frame is greater than 1, the minimum value of the audio signal is used as the output value of the i-th frame.
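The update in step S5 can be sketched as follows. The function names are hypothetical, the minimum value of -32768 is the one stated later in this description, and the behavior for an attenuation factor exactly equal to 1 is not defined in the text, so the sketch raises an error in that case rather than guessing.

```python
MAX_VALUE = 32767    # maximum value of the audio signal
MIN_VALUE = -32768   # minimum value of the audio signal


def next_attenuation_factor(f_i, a=1.0):
    """Average of the frame-i factor and the preset adjustment factor a."""
    return (f_i + a) / 2


def output_value(f_i, max_value=MAX_VALUE, min_value=MIN_VALUE):
    """Output value of frame i as described for factors below or above 1."""
    if f_i < 1:
        return max_value
    if f_i > 1:
        return min_value
    raise ValueError("the text does not define the output for a factor equal to 1")


f_next = next_attenuation_factor(0)  # factor 0 from the 0.95 example, a = 1
out_i = output_value(0)
```

With the preferred adjustment factor of 1, a frame-i factor of 0 gives a next factor of 0.5, so the following frame is halved rather than silenced.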
[0079] S6: Based on the attenuation factor of the (i+1)th frame and the output value of the ith frame, obtain the mixed audio data packet; distribute the mixed audio data packet to each of the remote clients to broadcast the mixed audio data packet, wherein the volume of different human voices in the mixed audio data packet is equalized.
[0080] Multiply the (i+1)th frame signal value of the superimposed audio signal by the (i+1)th frame attenuation factor to obtain the (i+1)th frame attenuation signal value;
[0081] If the attenuation signal value of the (i+1)th frame is greater than the maximum value of the audio signal, then the difference between the attenuation factor of the (i+1)th frame and the preset step size is used as the attenuation factor of the (i+2)th frame.
[0082] If the attenuation signal value of the (i+1)th frame is less than the minimum value of the audio signal, then the signal ratio between the minimum value of the audio signal and the attenuation signal value of the (i+1)th frame is calculated, and the smallest integer value from the range greater than the signal ratio is selected as the attenuation factor of the (i+2)th frame.
[0083] If the (i+1)th frame of the superimposed audio signal is the last frame of the superimposed audio signal, then the output value of the (i+1)th frame is calculated based on the attenuation signal value of the (i+1)th frame, and the output values of the first frame to the (i+1)th frame are combined to form the mixed audio data packet.
[0084] When frame i is the first frame, the output value of frame i is the output value of the first frame; when frame i is the last frame, the output value of frame i is the output value of the last frame. The output values from the first frame to the last frame are combined into a mixed audio data packet, that is, the output values of all frames are combined into the mixed audio data packet.
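The per-frame update described for the (i+1)-th frame in step S6 can be sketched as a single function. This is a hedged illustration only: the function name, the step size of 0.1, and the sample values are hypothetical, `math.ceil` is assumed to model "the smallest integer value from the range greater than the signal ratio", and keeping the factor unchanged for an in-range value is an assumption, since the text does not state that case.

```python
import math

MAX_VALUE = 32767
MIN_VALUE = -32768


def update_for_next_frame(mixing_next, f_next, step,
                          max_value=MAX_VALUE, min_value=MIN_VALUE):
    """Return (d[i+1], f[i+2]) from the (i+1)-th frame value and factor."""
    d_next = mixing_next * f_next
    if d_next > max_value:
        # Overflow: reduce the factor by the preset step size.
        f_after = f_next - step
    elif d_next < min_value:
        # Below the minimum: smallest integer above MIN / d[i+1] (assumed ceil).
        f_after = math.ceil(min_value / d_next)
    else:
        # Assumption: the factor is carried over when the value is in range.
        f_after = f_next
    return d_next, f_after


d_over, f_over = update_for_next_frame(40000, 1.0, step=0.1)
d_under, f_under = update_for_next_frame(-40000, 1.0, step=0.1)
d_ok, f_ok = update_for_next_frame(1000, 0.5, step=0.1)
```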
[0085] The audio mixing data packet is a data packet containing the voices of all remote clients. The transmission of the audio mixing data packet is divided into two different cases. The first case is that the audio mixing data packet does not contain the voices of the remote clients to be transmitted. The second case is that the audio mixing data packet contains the voices of the remote clients to be transmitted. The embodiments of this application adopt different processing methods for the above two cases.
[0086] This application discloses a mixing method for remote interventional surgery, applied to a remote interventional surgery system. The system includes a media server and multiple remote clients connected to the media server, and the method is executed by the media server. By using voice detection to filter out target audio signals containing only human voices from audio data, noise can be removed, preventing users from being unable to identify the voice content in the mixed audio data packet due to noise interference. If the attenuation signal value of the i-th frame is greater than the maximum value of the audio signal, the largest integer value within the range less than the signal ratio is selected as the attenuation factor of the i-th frame, reducing the signal value of the i-th frame to the maximum value within a preset range. Using the average of the attenuation factor of the i-th frame and a preset adjustment factor as the attenuation factor of the (i+1)-th frame, compared to directly adding the attenuation factor of the i-th frame to the iteration step size, avoids the human voice volume in the mixed audio data packet being too low. Furthermore, selecting the largest integer value within the range less than the signal ratio as the attenuation factor of the i-th frame, compared to selecting a decimal as the attenuation factor of the i-th frame, reduces the computational load of the attenuation factor of the i-th frame, improves mixing efficiency, and reduces the data storage space occupied.
[0087] In one embodiment, referring to Figure 2, the step of performing speech detection on the audio data and filtering out multiple target audio signals containing only human voices includes:
[0088] S12: Perform speech detection on the audio data using the VAD method to obtain the audio signal corresponding to each audio data; or, extract the speech features of the audio data and input the speech features into a trained speech detection model to obtain the audio signal corresponding to each audio data.
[0089] Before step S12, step S11 is also included: acquiring audio data from multiple remote clients.
[0090] The number of frames in the audio data of each remote client is the same, and the frames with the same sequence number in the audio data of different remote clients correspond to the same time.
[0091] The VAD (Voice Activity Detection) method is used to detect human voices in audio data. Specifically, it uses VAD to detect gaps in the audio data, segments the audio data into multiple short audio segments based on these gaps, and then performs human voice recognition using short-time energy and short-time zero-crossing rate. Short-time energy is the energy of a single frame of speech in a short audio segment, and short-time zero-crossing rate is the number of times the time-domain signal of a single frame of speech crosses zero. The VAD method is used to filter out multiple audio signals containing only human voices from each audio data segment.
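The two features named above can be computed as in the sketch below. It assumes a frame is a list of numeric samples and treats a zero sample as non-negative when counting crossings. The decision rule in `looks_like_speech` and both thresholds are hypothetical, since the text does not say how the two features are combined.

```python
def short_time_energy(frame):
    """Energy of a single frame: the sum of squared sample values."""
    return sum(sample * sample for sample in frame)


def zero_crossing_count(frame):
    """Number of sign changes between consecutive samples in a frame."""
    return sum(
        1
        for previous, current in zip(frame, frame[1:])
        if (previous >= 0) != (current >= 0)
    )


def looks_like_speech(frame, energy_threshold, crossing_threshold):
    """Hypothetical rule: enough energy and not too many zero crossings."""
    return (short_time_energy(frame) >= energy_threshold
            and zero_crossing_count(frame) <= crossing_threshold)


frame = [1, -2, 3]
energy = short_time_energy(frame)
crossings = zero_crossing_count(frame)
```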
[0092] As another implementation, one or more of the Mel-frequency cepstral coefficients, fundamental frequency, and loudness of each audio data can be extracted to obtain speech features. The speech features are then input into a trained speech detection model, such as WaveNet, to filter out multiple audio signals containing only human voices corresponding to each audio data.
[0093] S13: Detect whether the total number of audio signals is greater than the signal threshold number. If so, calculate the volume threshold corresponding to each audio signal.
[0094] The signal threshold number can be set to 3, or it can be set to other values. There is no limitation here. In this embodiment, the signal threshold number is set to 3.
[0095] As an example, the total number of audio signals is 6, which is greater than the signal threshold number of 3, so the volume thresholds of all 6 audio signals are calculated.
[0096] The average volume of each audio signal can be used as the corresponding volume threshold, or the maximum volume of each audio signal can be used as the corresponding volume threshold. There is no limitation here. In this embodiment, the average volume of each audio signal is used as the corresponding volume threshold.
[0097] S14: Select N target thresholds from all the volume thresholds.
[0098] All the volume thresholds are sorted in ascending order using the bubble sort method to obtain the threshold sequence;
[0099] The last N volume thresholds of the threshold sequence are used as the N target thresholds.
[0100] Initially, the volume thresholds are in an arbitrary order, for example following the order or number of the remote clients. They are then sorted in ascending order using the bubble sort method. Bubble sort compares adjacent volume thresholds and swaps them if the earlier one is larger than the later one, continuing until the last two volume thresholds have been compared; this pass is repeated until a pass makes no swaps.
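The ascending bubble sort and the selection of the last N entries can be sketched as follows. The function name and the six sample volume thresholds are hypothetical.

```python
def bubble_sort_ascending(volume_thresholds):
    """Return a new list sorted in ascending order using bubble sort."""
    result = list(volume_thresholds)
    for end in range(len(result) - 1, 0, -1):
        swapped = False
        for j in range(end):
            # Swap adjacent entries when the earlier one is larger.
            if result[j] > result[j + 1]:
                result[j], result[j + 1] = result[j + 1], result[j]
                swapped = True
        if not swapped:
            break  # a pass without swaps means the list is sorted
    return result


threshold_sequence = bubble_sort_ascending([0.4, 0.9, 0.1, 0.7, 0.3, 0.6])
target_thresholds = threshold_sequence[-3:]  # last N entries, with N = 3
```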
[0101] The volume thresholds in the threshold sequence are arranged in ascending order. The last N volume thresholds in the sequence are larger than the others. A larger volume threshold indicates a louder voice, resulting in better multi-person voice communication during remote surgery. Therefore, using the last N volume thresholds of the threshold sequence as N target thresholds achieves the best multi-person voice communication effect, solving the problem of chaotic sound and inability to obtain effective information during surgery.
[0102] S15: Select multiple target audio signals based on N target thresholds.
[0103] The N audio signals corresponding to the N target thresholds are taken as the N target audio signals.
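Steps S13 to S15 can be combined into one selection routine, sketched below under several assumptions: the volume threshold is taken as the mean absolute sample value (one possible reading of "average volume"), Python's built-in `sorted` stands in for the bubble sort for brevity, and all signals are kept when their number does not exceed the signal threshold number, a case the text does not spell out. All names are hypothetical.

```python
def average_volume(signal):
    """Assumed volume threshold: mean absolute sample value of the signal."""
    return sum(abs(sample) for sample in signal) / len(signal)


def select_target_signals(audio_signals, n, signal_threshold_number=3):
    """Keep the N loudest audio signals when there are more than the threshold."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if len(audio_signals) <= signal_threshold_number:
        return list(audio_signals)  # assumption: nothing is dropped in this case
    # Rank signal indices by ascending volume threshold.
    ranked = sorted(range(len(audio_signals)),
                    key=lambda index: average_volume(audio_signals[index]))
    keep = sorted(ranked[-n:])  # last N indices, restored to client order
    return [audio_signals[index] for index in keep]


signals = [[1, -1], [5, -5], [3, -3], [9, -9]]
targets = select_target_signals(signals, n=2)
```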
[0104] As described above, speech detection is performed on audio data to filter out multiple target audio signals containing only human voices. This includes using the VAD method to perform speech detection on the audio data to obtain the audio signal corresponding to each audio data point; or extracting speech features from the audio data and inputting these features into a trained speech detection model to obtain the audio signal corresponding to each audio data point. The total number of audio signals is checked to see if it exceeds a signal threshold. If so, a volume threshold corresponding to each audio signal is calculated. N target thresholds are selected from all volume thresholds, and multiple target audio signals are then selected based on these N target thresholds. The last N volume thresholds of the threshold sequence are larger than the other volume thresholds. Using the last N volume thresholds of the threshold sequence as the N target thresholds achieves the best multi-person voice communication effect, solving the problem of chaotic sound and inability to obtain effective information during surgery.
[0105] In one embodiment, referring to Figure 3, after obtaining the attenuation signal value of the i-th frame, the method further includes:
[0106] S41': If the attenuation signal value of the i-th frame is less than the minimum value of the audio signal, then calculate the signal ratio between the minimum value of the audio signal and the attenuation signal value of the i-th frame, and select the smallest integer value from the range greater than the signal ratio to obtain the attenuation factor of the i-th frame.
[0107] If the attenuation signal value of the i-th frame is less than the minimum value of the audio signal, the i-th frame has been attenuated too strongly, and the user of the remote client receiving the attenuation signal value of the i-th frame may be unable to hear the voice content clearly.
[0108] Calculate the signal ratio between the minimum value of the audio signal and the attenuation signal value of the i-th frame. The smallest integer value within the range greater than the signal ratio is then selected to obtain the attenuation factor of the i-th frame. As an example, if the signal ratio between the minimum audio signal value and the attenuation signal value of the i-th frame is calculated to be 1.65, the attenuation factor of the i-th frame is 2.
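The rounding rule in this step can be illustrated with a short sketch. The function name is hypothetical, and reading "the smallest integer greater than the signal ratio" as strictly greater is an assumption; under that reading an exact integer ratio moves up to the next integer, whereas a plain ceiling would leave it unchanged.

```python
import math


def smallest_integer_above(ratio):
    """Smallest integer strictly greater than the given signal ratio."""
    return math.floor(ratio) + 1


factor = smallest_integer_above(1.65)  # the example ratio from the description
```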
[0109] S42': Set the output value of the i-th frame to the minimum value of the audio signal according to the attenuation factor of the i-th frame.
[0110] At this point, the attenuation factor of the i-th frame is greater than 1, so the output value of the i-th frame is set to the minimum value of the audio signal, where the minimum value of the audio signal is -32768. The attenuation factor of the i-th frame is multiplied by the signal value of the (i+1)-th frame of the superimposed audio signal to obtain the attenuated signal value of the (i+1)-th frame.
[0111] As described above, after obtaining the attenuation signal value of the i-th frame, the process further includes calculating the signal ratio between the minimum audio signal value and the attenuation signal value of the i-th frame if the attenuation signal value of the i-th frame is less than the minimum audio signal value. The smallest integer value greater than this signal ratio is then selected to obtain the attenuation factor of the i-th frame. In this case, the attenuation factor of the i-th frame is greater than 1, and the output value of the i-th frame is set to the minimum audio signal value. When the attenuation signal value of the i-th frame is less than the minimum audio signal value, the selected attenuation factor of the i-th frame is greater than 1. Generally, the signal value of the i-th frame and the signal value of the (i+1)-th frame are relatively close. Using the product of the attenuation factor of the i-th frame and the signal value of the (i+1)-th frame superimposed on the audio signal as the attenuation signal value of the (i+1)-th frame allows for better processing of the (i+1)-th frame signal value, resulting in a smoother processing outcome.
[0112] In one embodiment, referring to Figure 4, the step of obtaining the mixed audio data packet based on the attenuation factor of the (i+1)th frame and the output value of the i-th frame includes:
[0113] S61: Multiply the signal value of the (i+1)th frame of the superimposed audio signal by the attenuation factor of the (i+1)th frame to obtain the attenuation signal value of the (i+1)th frame.
[0114] The formula for calculating the attenuation signal value of the (i+1)th frame is:
[0115] d[i+1]=mixing[i+1]×f[i+1];
[0116] Where d[i+1] is the attenuation signal value of the (i+1)th frame, mixing[i+1] is the signal value of the (i+1)th frame of the superimposed audio signal, and f[i+1] is the attenuation factor of the (i+1)th frame.
[0117] When the attenuation signal value of the (i+1)th frame is less than or equal to the maximum value of the audio signal and greater than or equal to the minimum value of the audio signal, the attenuation signal value of the (i+1)th frame is used as the output value of the (i+1)th frame.
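Step S61 and the in-range check can be sketched as below. The minimum of -32768 is taken from the description; the maximum of 32767 is an assumption based on 16-bit signed PCM, and the function names are illustrative only.

```python
AUDIO_MIN = -32768  # minimum value of the audio signal, as stated in the description
AUDIO_MAX = 32767   # assumed maximum for 16-bit signed PCM


def attenuation_signal_value(mixing_value, factor):
    """d[i+1] = mixing[i+1] x f[i+1]."""
    return mixing_value * factor


def is_in_range(d):
    """True when d can be used directly as the frame's output value."""
    return AUDIO_MIN <= d <= AUDIO_MAX


d_next = attenuation_signal_value(20000, 0.5)
usable = is_in_range(d_next)
```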
[0118] S62: If the attenuation signal value of the (i+1)th frame is greater than the maximum value of the audio signal, then the difference between the attenuation factor of the (i+1)th frame and the preset step size is used as the attenuation factor of the (i+2)th frame.
[0119] The preset step size is a fixed value that is set in advance to adjust the attenuation factor of the (i+1)th frame.
[0120] The formula for updating the attenuation factor in frame i+1 is as follows:
[0121] f[i+2] = f[i+1] - stepsize;
[0122] Where f[i+2] is the attenuation factor of the (i+2)th frame, f[i+1] is the attenuation factor of the (i+1)th frame, and stepsize is the preset step size.
[0123] The preset step size ranges from 0.01 to 0.2, and preferably, the preset step size is set to 0.05.
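The update in step S62 amounts to a single subtraction. The sketch below uses the preferred step size of 0.05 as the default and rejects values outside the stated 0.01 to 0.2 range; the function name and the range check are assumptions added for illustration.

```python
def decrease_factor(factor, stepsize=0.05):
    """f[i+2] = f[i+1] - stepsize, with stepsize limited to the range 0.01 to 0.2."""
    if not 0.01 <= stepsize <= 0.2:
        raise ValueError("stepsize must lie between 0.01 and 0.2")
    return factor - stepsize


next_factor = decrease_factor(1.0)
```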
[0124] After calculating the attenuation factor for the (i+2)th frame, calculate the product of the attenuation factor and the signal value of the (i+2)th frame. If the product is less than or equal to the maximum value of the audio signal and greater than or equal to the minimum value of the audio signal, then the product is used as the output value of the (i+2)th frame. If the product is greater than the maximum value of the audio signal or less than the minimum value of the audio signal, then the maximum value of the audio signal is used as the output value of the (i+2)th frame.
[0125] S63: If the attenuation signal value of the (i+1)th frame is less than the minimum value of the audio signal, then calculate the signal ratio between the minimum value of the audio signal and the attenuation signal value of the (i+1)th frame, and select the smallest integer value from the range greater than the signal ratio as the attenuation factor of the (i+2)th frame.
[0126] Calculate the signal ratio between the minimum value of the audio signal and the attenuation signal value of the (i+1)th frame. As an example, with a signal ratio of 1.7, the attenuation factor f[i+2] for the (i+2)th frame is set to 2.
[0127] S64: Detect whether the (i+1)th frame of the superimposed audio signal is the last frame of the superimposed audio signal. If so, calculate the output value of the (i+1)th frame based on the attenuation signal value of the (i+1)th frame, and combine the output value of the first frame to the output value of the (i+1)th frame to form the mixed audio data packet.
[0128] If the attenuation signal value in the (i+1)th frame is greater than the maximum value of the audio signal, the output value in the (i+1)th frame is set to the maximum value of the audio signal.
[0129] If the attenuation signal value of the (i+1)th frame is less than the minimum value of the audio signal, the output value of the (i+1)th frame is set to the minimum value of the audio signal.
[0130] When the (i+1)th frame of the superimposed audio signal is the last frame of the superimposed audio signal, it means that all frames of the superimposed audio signal have been processed. The output values of the first frame to the output value of the i-th frame, and the output value of the (i+1)th frame are combined to form a mixed audio data packet.
[0131] If the (i+1)th frame of the superimposed audio signal is not the last frame of the superimposed audio signal, continue to calculate the attenuation signal values of subsequent frames in the superimposed audio signal and update the corresponding attenuation factors.
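Steps S61 to S64 can be combined into one loop over the frames of the superimposed audio signal, sketched below under several assumptions that the description does not fix. The maximum of 32767 assumes 16-bit signed PCM. The signal ratio in the under-range branch is computed as d / AUDIO_MIN so that it exceeds 1, which matches the 1.65 and 1.7 examples. The factor is left unchanged when a frame is in range, and output values are rounded to integers. All names are illustrative.

```python
import math

AUDIO_MIN = -32768  # from the description
AUDIO_MAX = 32767   # assumed: 16-bit signed PCM


def mix_frames(mixing, first_factor, stepsize=0.05):
    """Apply per-frame attenuation and return the output values of the mixed packet."""
    outputs = []
    factor = first_factor
    for value in mixing:
        d = value * factor                      # S61: attenuation signal value
        if d > AUDIO_MAX:
            outputs.append(AUDIO_MAX)           # clamp to the maximum value
            factor = factor - stepsize          # S62: factor for the next frame
        elif d < AUDIO_MIN:
            outputs.append(AUDIO_MIN)           # clamp to the minimum value
            ratio = d / AUDIO_MIN               # assumed ordering of the ratio
            factor = math.floor(ratio) + 1      # S63: smallest integer above the ratio
        else:
            outputs.append(round(d))            # in range: use d as the output value
    return outputs                              # S64: all frames form the packet


packet = mix_frames([10000, 40000, 30000], first_factor=1.0)
```

In this run the second frame overflows, is clamped to 32767, and lowers the factor to 0.95 for the third frame.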
[0132] As described above, the mixed audio data packet is obtained based on the attenuation factor of the (i+1)th frame and the output value of the (i)th frame, including multiplying the signal value of the (i+1)th frame of the superimposed audio signal with the attenuation factor of the (i+1)th frame to obtain the attenuated signal value of the (i+1)th frame. If the attenuation signal value of the (i+1)th frame is greater than the maximum value of the audio signal, the difference between the attenuation factor of the (i+1)th frame and the preset step size is used as the attenuation factor of the (i+2)th frame. If the attenuation signal value of the (i+1)th frame is less than the minimum value of the audio signal, the signal ratio between the minimum value of the audio signal and the attenuation signal value of the (i+1)th frame is calculated, and the smallest integer value greater than the signal ratio is selected as the attenuation factor of the (i+2)th frame. It is then checked whether the (i+1)th frame of the superimposed audio signal is the last frame of the superimposed audio signal. If it is, the output value of the (i+1)th frame is calculated based on the attenuation signal value of the (i+1)th frame, and the output values of the first frame to the (i+1)th frame are combined to form a mixing data packet. If the (i+1)th frame of the superimposed audio signal is not the last frame of the superimposed audio signal, the output value of the (i+2)th frame is calculated based on the signal value of the (i+2)th frame and the attenuation factor of the (i+2)th frame. Using the difference between the attenuation factor of the (i+1)th frame and the preset step size as the attenuation factor of the (i+2)th frame can prevent the attenuation factor of the (i+2)th frame from being too small, which would lead to smaller output values for subsequent frames.
[0133] In one embodiment, referring to Figure 5, the step of distributing the mixed audio data packet to each of the remote clients for broadcasting the mixed audio data packet includes:
[0134] S65: Detect whether there is a human voice corresponding to the target remote client in the mixed audio data packet. If there is, then take the target remote client as the first target remote client; remove the human voice corresponding to the first target remote client to obtain a de-echo data packet; distribute the de-echo data packet to the first target remote client, and control the first target remote client to broadcast the de-echo data packet.
[0135] As an example, there are 10 remote clients in total. The first three remote clients are all target remote clients. If the mixed audio data packets contain the voices of the first three remote clients, then the first three remote clients are all designated as the first target remote clients. The voices corresponding to the first three remote clients are removed to obtain de-echo data packets. The de-echo data packets are then distributed to the first three remote clients, and the first three remote clients are controlled to broadcast the de-echo data packets.
[0136] S66: If the mixed audio data packet does not contain the human voice corresponding to the target remote client, then the target remote client is designated as the second target remote client; the mixed audio data packet is distributed to the second target remote client, and the second target remote client is controlled to broadcast the mixed audio data packet.
[0137] As an example, the 4th to 10th remote clients are all target remote clients. The voice of the 4th to 10th remote clients is not present in the mixed audio data packet. The 4th to 10th remote clients are used as the second target remote clients. The mixed audio data packet is distributed to the 4th to 10th remote clients, and the 4th to 10th remote clients are controlled to broadcast the mixed audio data packet.
[0138] As described above, the target remote client is first selected, and then the mixed audio data packet is checked for the human voice corresponding to that target remote client. Different strategies are then used to distribute the data packets. Sending a de-echo data packet to the first target remote client and a mixed audio data packet to the second target remote client effectively prevents echoes while still playing the human voices of the other remote clients.
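The distribution strategy of S65 and S66 can be sketched as follows. The description does not say how a client's own voice is removed, so this example assumes a plain additive mix without attenuation and removes the voice by sample-wise subtraction; the function name, client identifiers, and sample values are hypothetical.

```python
def build_packets(voices, all_clients):
    """Return the packet to send to each client.

    voices maps a client id to the samples of its voice present in the mix.
    """
    length = len(next(iter(voices.values()))) if voices else 0
    # Assumed plain additive superposition of all voices in the mix.
    mixed = [sum(sig[k] for sig in voices.values()) for k in range(length)]
    packets = {}
    for cid in all_clients:
        if cid in voices:
            # First target remote client: remove its own voice (de-echo packet).
            packets[cid] = [m - v for m, v in zip(mixed, voices[cid])]
        else:
            # Second target remote client: receives the mixed packet unchanged.
            packets[cid] = list(mixed)
    return packets


voices = {"a": [1, 2], "b": [10, 20]}
packets = build_packets(voices, ["a", "b", "c"])
```

Clients "a" and "b" each hear only the other speaker, while client "c", who is not in the mix, hears both.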
[0139] Referring to Figure 6, which is a schematic block diagram of a mixing device for remote interventional surgery disclosed in this application, the mixing device for remote interventional surgery is applied to a remote interventional surgery system, which includes a media server and multiple remote clients connected to the media server. The device is located on the media server and includes:
[0140] The target audio signal filtering module 10 is used to acquire audio data from multiple remote clients, perform voice detection on the audio data, and filter out multiple target audio signals that contain only human voices. The audio data is generated during multi-person calls in remote interventional surgery.
[0141] The signal superposition module 20 is used to superimpose all the target audio signals to obtain a superimposed audio signal;
[0142] The i-th frame attenuation signal value calculation module 30 is used to obtain the i-th frame attenuation signal value based on the i-th frame signal value of the superimposed audio signal and a preset attenuation factor.
[0143] The i-th frame attenuation factor selection module 40 is used to calculate the signal ratio between the maximum value of the audio signal and the attenuation signal value of the i-th frame if the attenuation signal value of the i-th frame is greater than the maximum value of the audio signal, and select the largest integer value from the range less than the signal ratio to obtain the attenuation factor of the i-th frame.
[0144] The attenuation factor calculation module 50 for the (i+1)th frame is used to determine the output value of the i-th frame based on the attenuation factor of the i-th frame, and to use the average value of the attenuation factor of the i-th frame and the preset adjustment factor as the attenuation factor of the (i+1)th frame.
[0145] The data packet broadcasting module 60 is used to obtain a mixed audio data packet based on the attenuation factor of the (i+1)th frame and the output value of the i-th frame; and to distribute the mixed audio data packet to each of the remote clients to broadcast the mixed audio data packet, wherein the volume of different human voices in the mixed audio data packet is equalized.
[0146] As described above, the mixing device for remote interventional surgery enables the mixing method for remote interventional surgery.
[0147] In one embodiment, the target audio signal filtering module 10 further includes:
[0148] An audio signal extraction unit is used to perform speech detection on the audio data using the VAD method to obtain an audio signal corresponding to each audio data; or, to extract the speech features of the audio data and input the speech features into a trained speech detection model to obtain an audio signal corresponding to each audio data.
[0149] The volume threshold calculation unit is used to detect whether the total number of audio signals is greater than the signal threshold. If so, it calculates the volume threshold corresponding to each audio signal.
[0150] A target threshold filtering unit is used to filter out N target thresholds from all the volume thresholds;
[0151] A target audio signal filtering unit is used to filter out multiple target audio signals based on N target thresholds.
[0152] In one embodiment, the target threshold filtering unit further includes:
[0153] The sorting subunit is used to sort all the volume thresholds in ascending order using the bubble sort method to obtain a threshold sequence.
[0154] The target threshold definition subunit is used to take the last N volume thresholds of the threshold sequence as N target thresholds.
[0155] In one embodiment, the audio mixing device for remote interventional surgery further includes:
[0156] The attenuation factor calculation module for the i-th frame is used to calculate the signal ratio between the minimum value of the audio signal and the attenuation signal value of the i-th frame if the attenuation signal value of the i-th frame is less than the minimum value of the audio signal, and select the smallest integer value from the range greater than the signal ratio to obtain the attenuation factor of the i-th frame.
[0157] The i-th frame output value setting module is used to set the i-th frame output value to the minimum value of the audio signal according to the i-th frame attenuation factor.
[0158] In one embodiment, the data packet broadcasting module 60 further includes:
[0159] The attenuation signal value calculation unit for the (i+1)th frame is used to multiply the signal value of the (i+1)th frame of the superimposed audio signal and the attenuation factor of the (i+1)th frame to obtain the attenuation signal value of the (i+1)th frame.
[0160] The first calculation unit for the attenuation factor of the (i+2)th frame is used to use the difference between the attenuation factor of the (i+1)th frame and the preset step size as the attenuation factor of the (i+2)th frame if the attenuation signal value of the (i+1)th frame is greater than the maximum value of the audio signal.
[0161] The second calculation unit for the attenuation factor of the (i+2)th frame is used to calculate the signal ratio between the minimum value of the audio signal and the attenuation signal value of the (i+1)th frame if the attenuation signal value of the (i+1)th frame is less than the minimum value of the audio signal, and select the smallest integer value from the range greater than the signal ratio as the attenuation factor of the (i+2)th frame.
[0162] The audio mixing data packet composition unit is used to detect whether the (i+1)th frame of the superimposed audio signal is the last frame of the superimposed audio signal. If so, the output value of the (i+1)th frame is calculated based on the attenuation signal value of the (i+1)th frame, and the output values of the first frame to the (i+1)th frame are used to compose the audio mixing data packet.
[0163] In one embodiment, the data packet broadcasting module 60 further includes:
[0164] The de-echo data packet broadcasting unit is used to detect whether there is a human voice corresponding to the target remote client in the mixed audio data packet. If there is, the target remote client is designated as the first target remote client. The human voice corresponding to the first target remote client is removed to obtain the de-echo data packet. The de-echo data packet is distributed to the first target remote client, and the first target remote client is controlled to broadcast the de-echo data packet.
[0165] The audio mixing data packet broadcasting unit is configured to, if the audio mixing data packet does not contain the human voice corresponding to the target remote client, designate the target remote client as the second target remote client; distribute the audio mixing data packet to the second target remote client; and control the second target remote client to broadcast the audio mixing data packet.
[0166] In one embodiment, the attenuation factor calculation module 50 for the (i+1)th frame further includes:
[0167] The i-th frame output value determination unit uses the maximum value of the audio signal as the i-th frame output value.
[0168] Referring to Figure 7, this application also provides a computer device whose internal structure can be as shown in Figure 7. The computer device includes a processor, memory, network interface, and database connected via a system bus. The processor provides computing and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs stored in the non-volatile storage media. The database stores information such as the attenuation factor of the i-th frame. The network interface is used for communication with external terminals via a network connection. Furthermore, the computer device may also include input devices and a display screen. The computer program, when executed by the processor, implements a mixing method for remote interventional surgery. The method is applied to a remote interventional surgery system, which includes a media server and multiple remote clients connected to the media server. The method includes the following steps performed by the media server: acquiring audio data from multiple remote clients; performing voice detection on the audio data; filtering out multiple target audio signals containing only human voices, wherein the audio data is generated during multi-person calls in remote interventional surgery; superimposing all the target audio signals to obtain a superimposed audio signal; and obtaining a... 
The attenuation signal value of the i-th frame; if the attenuation signal value of the i-th frame is greater than the maximum value of the audio signal, then the signal ratio of the maximum value of the audio signal to the attenuation signal value of the i-th frame is calculated, and the largest integer value is selected from the range less than the signal ratio to obtain the attenuation factor of the i-th frame; the output value of the i-th frame is determined according to the attenuation factor of the i-th frame, and the average value of the attenuation factor of the i-th frame and the preset adjustment factor is used as the attenuation factor of the (i+1)-th frame; based on the attenuation factor of the (i+1)-th frame and the output value of the i-th frame, a mixed audio data packet is obtained; the mixed audio data packet is distributed to each of the remote clients to broadcast the mixed audio data packet, wherein the volume of different human voices in the mixed audio data packet is equalized.
[0169] Those skilled in the art will understand that the structure shown in Figure 7 is merely a block diagram of a portion of the structure related to the present application and does not constitute a limitation on the computer equipment to which the present application is applied.
[0170] One embodiment of this application also provides a computer-readable storage medium storing a computer program thereon. When executed by a processor, the computer program implements a mixing method for remote interventional surgery, applied to a remote interventional surgery system. The remote interventional surgery system includes a media server and multiple remote clients connected to the media server. The method includes the following steps performed by the media server: acquiring audio data from multiple remote clients; performing voice detection on the audio data; filtering out multiple target audio signals containing only human voices, wherein the audio data is data generated during multi-person calls in remote interventional surgery; superimposing all the target audio signals to obtain a superimposed audio signal; and according to the first... The attenuation signal value of frame i is obtained by comparing the signal value of frame i with a preset attenuation factor. If the attenuation signal value of frame i is greater than the maximum value of the audio signal, the signal ratio between the maximum value of the audio signal and the attenuation signal value of frame i is calculated. The largest integer value is selected from the range less than the signal ratio to obtain the attenuation factor of frame i. The output value of frame i is determined according to the attenuation factor of frame i. The average value of the attenuation factor of frame i and the preset adjustment factor is used as the attenuation factor of frame i+1. Based on the attenuation factor of frame i+1 and the output value of frame i, a mixed audio data packet is obtained. The mixed audio data packet is distributed to each of the remote clients to broadcast the mixed audio data packet, wherein the volume of different human voices in the mixed audio data packet is equalized.
[0171] It is understood that the computer-readable storage medium in this embodiment can be a volatile readable storage medium or a non-volatile readable storage medium.
[0172] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium. When executed, the computer program can include the processes of the embodiments of the above methods. Any references to memory, storage, databases, or other media provided in this application and in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
[0173] It should be noted that, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such process, apparatus, article, or method. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, apparatus, article, or method that includes that element.
[0174] The above description is merely a preferred embodiment of the present invention and does not limit the patent scope of the present invention. Any equivalent structural or procedural transformations made based on the content of the present invention's specification and drawings, or direct or indirect applications in other related technical fields, are similarly included within the patent protection scope of the present invention.
Claims
1. A mixing method for remote interventional surgery, characterized in that it is applied to a remote interventional surgery system, the remote interventional surgery system including a media server and multiple remote clients connected to the media server, the method comprising the following steps performed by the media server: acquiring audio data from multiple remote clients, performing voice detection on the audio data, and filtering out multiple target audio signals that contain only human voices, wherein the audio data is generated during multi-person calls in remote interventional surgery; superimposing all the target audio signals to obtain a superimposed audio signal; obtaining the attenuation signal value of the i-th frame based on the signal value of the i-th frame of the superimposed audio signal and the preset attenuation factor; if the attenuation signal value of the i-th frame is greater than the maximum value of the audio signal, calculating the signal ratio between the maximum value of the audio signal and the attenuation signal value of the i-th frame, and selecting the largest integer value from the range less than the signal ratio to obtain the attenuation factor of the i-th frame; determining the output value of the i-th frame based on the attenuation factor of the i-th frame, and using the average value of the attenuation factor of the i-th frame and the preset adjustment factor as the attenuation factor of the (i+1)th frame, wherein the preset adjustment factor is greater than or equal to 1; obtaining a mixed audio data packet based on the attenuation factor of the (i+1)th frame and the output value of the i-th frame; and distributing the mixed audio data packet to each of the remote clients to broadcast the mixed audio data packet, wherein the volume of different human voices in the mixed audio data packet is equalized.
2. The mixing method for remote interventional surgery according to claim 1, characterized in that the step of performing speech detection on the audio data and filtering out multiple target audio signals containing only human voices includes: performing speech detection on the audio data using the VAD method to obtain the audio signal corresponding to each audio data, or extracting the speech features of the audio data and inputting the speech features into a trained speech detection model to obtain the audio signal corresponding to each audio data; detecting whether the total number of audio signals is greater than the signal threshold and, if so, calculating the volume threshold corresponding to each audio signal; selecting N target thresholds from all the volume thresholds; and selecting multiple target audio signals based on the N target thresholds.
3. The audio mixing method for remote interventional surgery according to claim 2, characterized in that, The step of filtering N target thresholds from all the volume thresholds includes: All the volume thresholds are sorted in ascending order using the bubble sort method to obtain the threshold sequence; The last N volume thresholds of the threshold sequence are used as the N target thresholds.
4. The audio mixing method for remote interventional surgery according to claim 1, characterized in that, After obtaining the attenuation signal value of the i-th frame, the process further includes: If the attenuation signal value of the i-th frame is less than the minimum value of the audio signal, then the signal ratio between the minimum value of the audio signal and the attenuation signal value of the i-th frame is calculated, and the smallest integer value is selected from the range greater than the signal ratio to obtain the attenuation factor of the i-th frame; The output value of the i-th frame is set to the minimum value of the audio signal based on the attenuation factor of the i-th frame.
5. The mixing method of remote intervention surgery according to claim 4, wherein, The step of obtaining the mixed audio data packet based on the attenuation factor of the (i+1)th frame and the output value of the i-th frame includes: Multiply the (i+1)th frame signal value of the superimposed audio signal by the (i+1)th frame attenuation factor to obtain the (i+1)th frame attenuation signal value; If the attenuation signal value of the (i+1)th frame is greater than the maximum value of the audio signal, then the difference between the attenuation factor of the (i+1)th frame and the preset step size is used as the attenuation factor of the (i+2)th frame. If the attenuation signal value of the (i+1)th frame is less than the minimum value of the audio signal, then the signal ratio between the minimum value of the audio signal and the attenuation signal value of the (i+1)th frame is calculated, and the smallest integer value from the range greater than the signal ratio is selected as the attenuation factor of the (i+2)th frame. Detect whether the (i+1)th frame of the superimposed audio signal is the last frame of the superimposed audio signal. If so, calculate the output value of the (i+1)th frame based on the attenuation signal value of the (i+1)th frame, and combine the output value of the first frame to the output value of the (i+1)th frame to form the mixed audio data packet. If the (i+1)th frame of the superimposed audio signal is not the last frame of the superimposed audio signal, then the output value of the (i+2)th frame is calculated based on the signal value of the (i+2)th frame of the superimposed audio signal and the attenuation factor of the (i+2)th frame.
6. The mixing method of remote intervention surgery according to claim 1, wherein, The step of distributing the mixed audio data packet to each of the remote clients for broadcasting the mixed audio data packet includes: The system detects whether the mixed audio data packet contains the voice of the target remote client. If it does, the target remote client is designated as the first target remote client. The voice of the first target remote client is removed to obtain a de-echo data packet. The de-echo data packet is distributed to the first target remote client, and the first target remote client is controlled to broadcast the de-echo data packet. If the mixed audio data packet does not contain the human voice corresponding to the target remote client, then the target remote client is designated as the second target remote client; the mixed audio data packet is distributed to the second target remote client, and the second target remote client is controlled to broadcast the mixed audio data packet.
7. The mixing method for remote interventional surgery according to claim 1, wherein the step of determining the output value of the i-th frame based on the attenuation factor of the i-th frame includes: using the maximum value of the audio signal as the output value of the i-th frame.
8. A mixing apparatus for remote interventional surgery, characterized in that the apparatus is used in a remote interventional surgery system, the remote interventional surgery system comprising a media server and multiple remote clients connected to the media server, the apparatus being located on the media server and comprising: a target audio signal filtering module, used to acquire audio data from the multiple remote clients, perform voice detection on the audio data, and filter out multiple target audio signals that contain only human voices, wherein the audio data is generated during multi-person calls in remote interventional surgery; a signal superposition module, used to superimpose all the target audio signals to obtain a superimposed audio signal; an i-th frame attenuation signal value calculation module, used to obtain the attenuation signal value of the i-th frame based on the i-th frame signal value of the superimposed audio signal and a preset attenuation factor; an i-th frame attenuation factor selection module, used to calculate, if the attenuation signal value of the i-th frame is greater than the maximum value of the audio signal, the signal ratio between the maximum value of the audio signal and the attenuation signal value of the i-th frame, and to select the largest integer less than the signal ratio as the attenuation factor of the i-th frame; an (i+1)th frame attenuation factor calculation module, used to determine the output value of the i-th frame based on the attenuation factor of the i-th frame, and to take the average of the attenuation factor of the i-th frame and a preset adjustment factor as the attenuation factor of the (i+1)th frame, wherein the preset adjustment factor is greater than or equal to 1; and a data packet broadcasting module, used to obtain the mixed audio data packet based on the attenuation factor of the (i+1)th frame and the output value of the i-th frame, and to distribute the mixed audio data packet to each of the remote clients to broadcast the mixed audio data packet, wherein the volume of different human voices in the mixed audio data packet is equalized.
9. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the processor executes the computer program, the processor implements the steps of the mixing method for remote interventional surgery according to any one of claims 1 to 7.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the mixing method for remote interventional surgery according to any one of claims 1 to 7.