Chat audio violation detection method and apparatus, device, and medium
By delegating speech recognition to the clients and using terminal attribute information to screen and assign them, the method addresses the high server resource consumption of group audio violation detection. It locates the violating user accurately, detects violations at low cost, and improves system stability and user experience.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- GUANGZHOU FANGSI INFORMATION TECH CO LTD
- Filing Date
- 2026-03-25
- Publication Date
- 2026-06-19
AI Technical Summary
In existing technologies for detecting violations in group audio, server resource consumption is directly proportional to the number of users speaking in the channel, leading to a sharp increase in hardware costs, making it difficult for enterprises to bear the economic burden of large-scale review.
The speech recognition and conversion task is offloaded to the client. The server obtains terminal attribute information, uses it to select clients, and assigns review tasks to them. The server only makes a preliminary violation judgment and conducts a secondary review when necessary, which lets it accurately locate the violating user.
The approach significantly reduces server bandwidth usage, machine computing power consumption, and graphics card resource requirements, while still locating the violating user precisely and keeping the chat environment healthy and orderly. This reduces enterprise hardware costs and improves system stability and user experience.
[Abstract figure: CN122247961A_ABST]
Description
Technical Field
[0001] This application relates to the field of Internet technology, and in particular to a method for detecting chat audio violations and corresponding devices, electronic devices, and computer-readable storage media.
Background Technology
[0002] With the rapid development of internet communication technology, real-time voice-based interactive scenarios are becoming increasingly popular, especially in group chat scenarios such as game voice chat and voice-based social networking. Real-time voice communication through chat channels has become one of the core functions. While these applications enrich the user experience, they also bring significant challenges to the compliance review of audio content. To create a healthy and orderly online environment, platforms need to monitor the massive amounts of audio content generated within chat channels in real time, promptly identifying and addressing any potentially illegal information.
[0003] Current technologies for detecting audio violations in chat groups typically employ a centralized, server-centric architecture. The server needs to fetch the upstream audio streams from every user speaking in the chat channel in real time, perform speech recognition on each stream, and then match the text information against a pre-defined list of prohibited terms. While this approach enables user-level violation detection, its resource consumption is directly proportional to the number of users speaking in the channel. When the number of users is large, the server needs to process dozens of audio streams simultaneously, placing immense pressure on server bandwidth, computing power, and the GPU resources required for the speech recognition model. This leads to a sharp increase in hardware costs, making it difficult for enterprises to afford the economic burden of large-scale review.
[0004] Therefore, how to significantly reduce server resource consumption while ensuring review coverage and accurately locating the users responsible for violating content has become a pressing technical problem in this field.
Summary of the Invention
[0005] The primary objective of this application is to provide a method for detecting chat audio violations and a corresponding device, electronic device, or computer program product to solve one of the aforementioned problems.
[0006] To achieve the objectives of this application, the following technical solution is adopted: A chat audio violation detection method provided for one of the purposes of this application includes the following steps:
Obtaining terminal attribute information of each client in the target chat channel, wherein the terminal attribute information includes hardware resource parameters and operating status characteristics;
Determining, based on the terminal attribute information, at least one assigned client from the clients, and issuing an audit task assignment instruction to the assigned client;
Receiving review content information reported by the assigned client, which the assigned client generates by performing speech recognition and conversion on the mixed audio in the target chat channel;
Conducting a violation review based on the review content information and, if a violation is determined, obtaining independent audio data of the relevant clients within the target chat channel for a second review to identify the violating clients within the target chat channel.
[0007] In another aspect, a chat audio violation detection device provided for one of the purposes of this application includes:
A terminal attribute acquisition module, used to acquire terminal attribute information of each client in the target chat channel, wherein the terminal attribute information includes hardware resource parameters and operating status characteristics;
An audit task assignment module, used to determine at least one assigned client from the clients based on the terminal attribute information and to issue an audit task assignment instruction to the assigned client;
A review content receiving module, used to receive review content information reported by the assigned client, wherein the review content information is generated by the assigned client after performing speech recognition and conversion on the mixed audio in the target chat channel;
A violating terminal identification module, used to conduct a violation review based on the review content information and, if a violation is determined, to obtain the independent audio data of the relevant clients in the target chat channel for a second review to identify the violating clients in the target chat channel.
[0008] In another aspect, an electronic device provided for one of the purposes of this application includes a central processing unit and a memory, the central processing unit being used to invoke and run a computer program stored in the memory to perform the steps of the chat audio violation detection method described in this application.
[0009] In another aspect, a computer program product provided for another purpose of this application includes a computer program / instructions that, when executed by a processor, implement the steps of the method described in any embodiment of this application.
[0010] The technical solution of this application has many advantages, including but not limited to the following aspects: First, this application fundamentally changes the resource consumption pattern of the traditional pure server-centralized processing model by offloading the speech recognition conversion task to the client. The client performs speech recognition conversion on the channel's mixed audio locally and only reports the content to be reviewed. The server no longer needs to deploy a speech recognition model for each audio stream, nor does it need to fetch the original audio streams of all speaking users in real time. This significantly reduces server bandwidth usage, machine computing power consumption, and graphics card resource requirements. Especially in high-concurrency scenarios such as game voice chat and voice-based social networking, this application can transfer the vast majority of audio processing load to the client, enabling the server to support large-scale audio review operations at extremely low cost. This effectively solves the problem of rapidly rising hardware costs and the difficulty for enterprises to bear the economic burden of large-scale review in existing technologies.
[0011] Secondly, this application employs a two-stage review mechanism to pinpoint the users responsible for violating content, ensuring comprehensive review coverage while avoiding wrongful penalties for innocent users. The server first makes an initial violation determination based on the mixed audio recognition results reported by the client. Only after confirming the presence of violating content does it retrieve individual audio data from the relevant clients for a second review. Through a second round of speech recognition and keyword matching, the violation is attributed to a specific client. This mechanism avoids the shortcoming of traditional mixed audio submission solutions, which can only detect that a violation occurred somewhere in the channel but cannot pinpoint the individual. While ensuring accuracy, it further reduces server overhead, enabling the platform to penalize violating users precisely and maintain a healthy and orderly chat environment.
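The two-stage mechanism described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of this application: the term list, the function names, and the `fetch_client_transcripts` callback (which stands in for retrieving each client's independent audio and running speech recognition on it) are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical prohibited-term list; a real deployment would load its own.
PROHIBITED_TERMS = ("forbidden-term", "banned phrase")


def contains_violation(text: str, terms: Sequence[str] = PROHIBITED_TERMS) -> bool:
    """Return True if any prohibited term occurs in the text (case-insensitive)."""
    lowered = text.lower()
    return any(term.lower() in lowered for term in terms)


def two_stage_review(
    mixed_transcript: str,
    fetch_client_transcripts: Callable[[], Dict[str, str]],
) -> List[str]:
    """Stage 1 checks the mixed-audio transcript; stage 2 runs only on a hit.

    fetch_client_transcripts is an assumed callback returning a mapping of
    client id to the transcript of that client's independent audio.
    """
    if not contains_violation(mixed_transcript):
        return []  # no violation: per-client audio is never fetched
    transcripts = fetch_client_transcripts()
    return sorted(
        client_id
        for client_id, text in transcripts.items()
        if contains_violation(text)
    )
```

The callback is only invoked after the first stage finds a match, which mirrors the point that individual audio is retrieved only when the mixed-audio result indicates a violation.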
[0012] Furthermore, this application achieves intelligent allocation of review tasks by introducing a terminal attribute information collection and dynamic scheduling mechanism, significantly improving system stability and user experience. The server dynamically selects the most suitable client for the review task based on the client's hardware resource parameters and operating status characteristics, ensuring that review tasks are only initiated on clients with sufficient hardware computing power and suitable operating conditions. This avoids review delays or failures due to insufficient client performance or poor operating environment, and also prevents interference with normal user voice communication. Simultaneously, the differentiated scheduling strategy based on factors such as operating status characteristics makes the allocation of review tasks more scientific and reasonable, achieving load balancing among different clients and ensuring the smooth and efficient operation of the entire review system under large-scale concurrent scenarios.
Attached Figure Description
[0013] The above and / or additional aspects and advantages of this application will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, wherein:
Figure 1 is a typical network deployment architecture diagram related to the implementation of the technical solution of this application;
Figure 2 is a flowchart illustrating a typical embodiment of the chat audio violation detection method of this application;
Figure 3 is a schematic diagram of the chat audio violation detection device of this application;
Figure 4 is a schematic diagram of the structure of an electronic device used in this application.
Detailed Implementation
[0014] The embodiments of this application are described in detail below. Examples of these embodiments are shown in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary and are only used to explain this application, and should not be construed as limiting this application.
[0015] Those skilled in the art will understand that, unless specifically stated otherwise, the singular forms “a,” “an,” and “the” used herein may also include the plural forms. It should be further understood that the term “comprising” as used in this application means the presence of the stated features, integers, steps, operations, elements, and / or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and / or groups thereof. It should be understood that when an element is said to be “connected” or “coupled” to another element, it can be directly connected or coupled to the other element, or there may be intermediate elements. Furthermore, “connected” or “coupled” as used herein can include wireless connection or wireless coupling. The term “and / or” as used herein includes all or any units and all combinations of one or more associated listed items.
[0016] Those skilled in the art will understand that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by one of ordinary skill in the art to which this application pertains. It should also be understood that terms such as those defined in general dictionaries should be understood to have the same meaning as in the context of the prior art, and should not be interpreted in an idealized or overly formal sense unless specifically defined as herein.
[0017] Those skilled in the art will understand that the terms "client," "terminal," and "terminal device" as used herein include both devices that have only a wireless signal receiver and no transmission capability, and devices that have receiving and transmitting hardware capable of bidirectional communication over a bidirectional communication link. Such devices may include: cellular or other communication devices, such as personal computers or tablets, with a single-line display, a multi-line display, or no multi-line display; PCS (Personal Communications Service) devices that can combine voice, data processing, fax, and / or data communication capabilities; PDAs (Personal Digital Assistants) that may include radio frequency receivers, pagers, internet / intranet access, web browsers, notebooks, calendars, and / or GPS (Global Positioning System) receivers; and conventional laptops and / or handheld computers or other devices that have and / or include radio frequency receivers. As used herein, "client," "terminal," and "terminal device" can be portable, transportable, installed in a means of transportation (air, sea, and / or land), or suitable and / or configured to operate locally and / or in a distributed manner, operating in any other location on Earth and / or in space. "Client," "terminal," and "terminal device" as used herein can also be a communication terminal, an internet access terminal, or a music / video playback terminal, such as a PDA, a MID (Mobile Internet Device), and / or a mobile phone with music / video playback capabilities, or a smart TV, set-top box, etc.
[0018] The hardware referred to by the names "server," "client," and "service node" in this application is essentially an electronic device with capabilities equivalent to those of a personal computer. It is a hardware device with the necessary components defined by the von Neumann architecture, such as a central processing unit (including an arithmetic logic unit and a control unit), memory, input devices, and output devices. The computer program is stored in its memory; the central processing unit loads the program from secondary storage into main memory, executes the instructions in the program, and interacts with the input and output devices to complete specific functions.
[0019] It should be noted that the concept of "server" used in this application can also be extended to the case of server clusters. Based on the network deployment principles understood by those skilled in the art, the servers should be logically divided. Physically, these servers can be independent of each other but accessible through interfaces, or they can be integrated into a single physical computer or a computer cluster. Those skilled in the art should understand this flexibility and should not use it to constrain the implementation of the network deployment method described in this application.
[0020] Referring to Figure 1, the hardware infrastructure required for implementing the technical solutions of this application can be deployed according to the architecture shown in the figure. The server 80 mentioned in this application is deployed in the cloud and acts as a business server. It can further connect to relevant data servers and other servers providing related support, thereby forming a logically related service cluster to provide services to relevant terminal devices such as the smartphone 81 and personal computer 82 shown in the figure, or third-party servers (not shown). Both the smartphone and personal computer can access the Internet through known network access methods and establish a data communication link with the cloud server 80 to run terminal applications related to the services provided by the server.
[0021] For servers, the application is usually built as a service process, with corresponding program interfaces exposed for remote calls by applications running on various terminal devices. The relevant technical solutions in this application that are suitable for running on servers can be implemented in servers in this way.
[0022] The application mentioned refers to an application running on a server or terminal device. This application implements the relevant technical solutions of this application in a programmed manner. Its program code can be stored in a non-volatile storage medium that can be recognized by a computer in the form of computer-executable instructions, and is loaded into memory by the central processing unit for execution. The relevant device of this application is constructed by the operation of the application on the computer.
[0024] Popular terminal devices, especially mobile devices such as tablets and mobile phones, are usually equipped with built-in camera devices, and personal computers can also be connected to such camera devices. In principle, the terminal device application in this application can call the camera devices in these situations.
[0025] Unless otherwise expressly stated, the various embodiments disclosed in this application can be combined in a cross-cutting manner to flexibly construct new embodiments, as long as such combination does not depart from the inventive spirit of this application and can meet the needs of the prior art or solve a certain deficiency in the prior art. Those skilled in the art should be aware of such modifications.
[0026] Those skilled in the art will understand that although the various methods in this application are described based on the same concept and thus present commonality among them, they can be performed independently unless otherwise specified. Similarly, the various embodiments disclosed in this application are all based on the same inventive concept; therefore, concepts expressed in the same way, as well as concepts that are appropriately changed for convenience but are expressed differently, should be understood equivalently.
[0027] Referring to Figure 2, the present application discloses a method for detecting chat audio violations, which, in a typical embodiment, includes the following steps:
Step S11: Obtain the terminal attribute information of each client within the target chat channel. The terminal attribute information includes hardware resource parameters and operating status characteristics.
A chat channel is a logical space that hosts real-time voice interaction among multiple users, and it serves as the basic unit for executing the audio violation detection method. In the field of internet technology, a chat channel can be understood as a virtual voice room or session group. Multiple users can connect to the same channel through a client to achieve real-time voice communication. These channels are widely present in various voice interaction scenarios, including but not limited to game voice rooms, voice social chat rooms, online conference groups, and multi-person call sessions.
[0028] Taking a gaming scenario as an example, when players team up for a battle, the system assigns each team a separate voice chat room. Teammates can then communicate tactically via voice once inside the room. This voice chat room is a typical chat channel. Similarly, in voice-based social applications, users can create or join chat rooms on different themes, such as music lovers' discussion rooms or language learning discussion rooms. Each chat room is an independent chat channel, where users can freely speak or listen to others. Furthermore, in remote work scenarios, temporary voice conference groups created for each project team also fall under the category of chat channels.
[0029] The core feature of a chat channel is its ability to accommodate multiple clients online simultaneously and support real-time audio data distribution between clients. From a technical implementation perspective, once a client successfully joins a chat channel, it can receive audio streams uploaded by other clients within the channel, and simultaneously send its own audio stream to other clients within the channel. Because of this many-to-many real-time communication architecture, the audio data within a chat channel is inherently mixed: multiple users may speak simultaneously, and the audio streams of different users overlap during transmission, ultimately appearing as a single mixed stream at the receiving end.
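To make the mixed stream concrete, the sketch below sums 16-bit PCM samples from several speakers and clips the result. The source does not specify how mixing is performed, so sample-wise summation with clipping is an assumption chosen for illustration, and `mix_streams` is a hypothetical name.

```python
from typing import List, Sequence

INT16_MIN, INT16_MAX = -32768, 32767


def mix_streams(streams: Sequence[Sequence[int]]) -> List[int]:
    """Sum 16-bit PCM samples across streams and clip to the int16 range.

    Shorter streams are treated as silent once they run out of samples.
    """
    if not streams:
        return []
    length = max(len(stream) for stream in streams)
    mixed: List[int] = []
    for i in range(length):
        total = sum(stream[i] for stream in streams if i < len(stream))
        mixed.append(max(INT16_MIN, min(INT16_MAX, total)))
    return mixed
```

Once samples are summed, the per-speaker contributions are no longer separable from the mixed output alone, which is why a transcript of the mixed stream can reveal that a violation occurred but not which client produced it.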
[0030] Terminal attribute information is a data set describing the client's hardware capabilities and current operating environment status. Its specific composition includes at least two dimensions: hardware resource parameters and operating status characteristics. Hardware resource parameters characterize the physical hardware performance of the client device and are the fundamental basis for determining whether the client can handle audio processing tasks. These parameters may include, but are not limited to, the number of central processing unit (CPU) cores and the CPU clock speed, the size of the RAM, the device's system version information, and graphics processor performance indicators. For example, the server can learn from the hardware resource parameters reported by the client that one device is equipped with an octa-core processor, 8GB of RAM, and the Android 12 system version, while another device only has a quad-core processor, 4GB of RAM, and the Android 9 system version. This information directly reflects the client's computing power and processing efficiency, providing quantitative indicators for subsequently selecting clients suitable for performing speech recognition tasks.
[0031] The operational status characteristics describe the dynamic operating conditions and environmental attributes of the client at the current moment, covering a broader range than hardware resource parameters. These characteristics include at least the type of operating scenario the client is in, such as public locations (e.g., internet cafes, coffee shops, school computer labs), private locations (e.g., homes, personal rooms), or office areas. The operating scenario type can be determined by combining several signals, including the client's network address information, physical location information, and wireless network service set identifier. For example, by querying a pre-built public location IP address database, the server can identify that a client's public IP address belongs to the address range of a chain of internet cafes, thus determining that the client is currently in an internet cafe scenario. Similarly, if the wireless network service set identifier reported by the client is "Starbucks_WiFi," the server can infer that it is in a coffee shop scenario. The operating scenario type directly affects how contended and how stable the client's network resources are. For example, clients in internet cafe scenarios typically share the same outbound bandwidth, and multiple devices downloading resources simultaneously can easily cause network congestion; while clients in home scenarios have a relatively independent network environment with more abundant bandwidth resources.
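The scenario inference described above can be sketched as a lookup against an address-range table followed by an SSID keyword check. This is only an illustrative sketch: the table contents are hypothetical (the address range is taken from the reserved documentation block 203.0.113.0/24), and the returned labels and the `classify_scenario` name are assumptions, not part of this application.

```python
import ipaddress
from typing import Optional

# Hypothetical stand-in for a pre-built public-location IP database.
PUBLIC_VENUE_NETWORKS = {
    ipaddress.ip_network("203.0.113.0/24"): "internet_cafe",
}
# Hypothetical SSID keywords that hint at a public venue.
PUBLIC_VENUE_SSID_KEYWORDS = {"starbucks": "coffee_shop"}


def classify_scenario(public_ip: str, ssid: Optional[str] = None) -> str:
    """Infer an operating scenario label from the public IP, then the SSID."""
    addr = ipaddress.ip_address(public_ip)
    for network, venue in PUBLIC_VENUE_NETWORKS.items():
        if addr in network:
            return f"public:{venue}"
    if ssid:
        lowered = ssid.lower()
        for keyword, venue in PUBLIC_VENUE_SSID_KEYWORDS.items():
            if keyword in lowered:
                return f"public:{venue}"
    # Neither signal matched; the scenario cannot be inferred from these inputs.
    return "unknown"
```

For instance, `classify_scenario("198.51.100.1", "Starbucks_WiFi")` falls through the address table and matches on the SSID keyword. Physical location information, which the text also mentions, is left out of this sketch.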
[0032] Operational status characteristics can further include the client's network connection method and download speed. Network connection method refers to the specific form the client uses to access the internet, including but not limited to wireless network connections, wired network connections, and mobile data network connections. Download speed quantifies the client's currently available network download bandwidth, which can be obtained through real-time speed testing or historical data statistics. For example, a client accessing the internet via home fiber broadband may have a download speed of 100Mbps, while another client accessing via a 4G mobile network may only have a download speed of 20Mbps. This information provides important reference for the server to subsequently formulate differentiated resource scheduling strategies. Furthermore, operational status characteristics can be expanded to include other dimensions as needed, such as the client's battery status and current real-time CPU and memory utilization, to more comprehensively assess the suitability of the client for handling auditing tasks.
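One way to picture the terminal attribute information is as a small record with a hardware part and an operating-status part. The sketch below assumes hypothetical field names and units; only the example values (octa-core, 8GB of RAM, Android 12, 100Mbps) come from the text, and the connection type is chosen arbitrarily for the example.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class HardwareResources:
    cpu_cores: int
    ram_gb: float
    os_version: str
    cpu_clock_ghz: Optional[float] = None  # not always reported


@dataclass
class OperatingStatus:
    scenario: str            # e.g. "public", "private", "office"
    connection: str          # e.g. "wifi", "wired", "mobile"
    download_mbps: float
    cpu_utilization: Optional[float] = None     # fraction in [0, 1]
    memory_utilization: Optional[float] = None  # fraction in [0, 1]
    battery_percent: Optional[int] = None


@dataclass
class TerminalAttributes:
    client_id: str
    hardware: HardwareResources
    status: OperatingStatus


# Example record built from the values mentioned in the text; the client id
# and the connection type are placeholders.
example = TerminalAttributes(
    client_id="client-1",
    hardware=HardwareResources(cpu_cores=8, ram_gb=8.0, os_version="Android 12"),
    status=OperatingStatus(scenario="private", connection="wifi", download_mbps=100.0),
)
report = asdict(example)  # nested dict, ready to serialize for reporting
```

Keeping the optional, fast-changing metrics separate from the static hardware fields makes it straightforward to refresh only the status part on later reports.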
[0033] The server can acquire terminal attribute information using a combination of active collection and passive reception strategies. One approach is for the client to proactively report its terminal attribute information to the server when it first joins the target chat channel. Another approach is for the server to periodically send attribute query commands to all clients within the channel, and the clients, upon receiving the commands, collect and report their current attribute information in real time. For rapidly changing metrics in the operational status characteristics, such as download speed and CPU utilization, periodic reporting or threshold-triggered reporting can be used to ensure the real-time nature of the information held by the server. Through these mechanisms, the server can establish a dynamic information view of each client within the target chat channel, laying the data foundation for subsequent operations such as client selection and assignment based on terminal attribute information and the formulation of resource package download strategies.
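The combination of periodic and threshold-triggered reporting for fast-changing metrics can be sketched on the client side as follows. The class name, the relative-change criterion, and the parameter values are assumptions for illustration; the source only states that periodic or threshold-triggered reporting may be used.

```python
from typing import Dict


class ThresholdReporter:
    """Decides when a client should re-report a fast-changing metric.

    A metric is reported when it has never been reported, when the reporting
    period has elapsed, or when it moved by at least relative_change
    compared with the last reported value.
    """

    def __init__(self, period_s: float, relative_change: float) -> None:
        self.period_s = period_s
        self.relative_change = relative_change
        self._last_value: Dict[str, float] = {}
        self._last_time: Dict[str, float] = {}

    def should_report(self, metric: str, value: float, now: float) -> bool:
        last_value = self._last_value.get(metric)
        if last_value is None:
            due = True  # first sample is always reported
        else:
            elapsed = now - self._last_time[metric]
            baseline = max(abs(last_value), 1e-9)  # avoid division by zero
            changed = abs(value - last_value) / baseline >= self.relative_change
            due = elapsed >= self.period_s or changed
        if due:
            # Remember what was reported so later samples compare against it.
            self._last_value[metric] = value
            self._last_time[metric] = now
        return due
```

With `ThresholdReporter(period_s=30.0, relative_change=0.2)`, a download speed that drifts from 100 to 105 Mbps within ten seconds is not re-reported, while a drop to 70 Mbps is reported immediately.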
[0034] Step S12: Based on the terminal attribute information, determine at least one assigned client from the clients, and issue an audit task assignment instruction to the assigned client.
Based on the acquired terminal attribute information, the server selects at least one assigned client from all clients within the target chat channel and issues an audit task assignment instruction to each selected client. Assigned clients are the specific clients selected by the server to undertake audio review tasks. These clients will subsequently perform speech recognition and conversion on the mixed audio within the channel and report the review content information.
[0035] The process of determining the assigned client is essentially a multi-dimensional screening and decision-making mechanism. Its core objective is to select the most suitable device from among numerous clients to perform real-time speech recognition tasks. This process requires comprehensive consideration of multiple factors, such as the client's hardware computing resources, the availability of speech recognition resource packages, available storage space, and current operating status characteristics, to ensure that the selected client can stably complete the review task without negatively impacting its normal communication experience.
[0036] The server first performs an initial screening of clients based on hardware resource parameters. These parameters include at least the number of processor cores, the processor clock speed, and the amount of RAM. Based on preset processing power requirements, the server selects one or more target clients from all clients whose hardware computing resources meet these requirements. These preset processing power requirements can be flexibly set according to actual business scenarios; for example, requiring the client to have at least four CPU cores, at least 4GB of RAM, or a certain processor clock speed. Clients that do not meet the hardware resource requirements are excluded from the candidate pool, as these devices may not be able to run the speech recognition model smoothly, and forcibly assigning them review tasks could lead to device lag or recognition failure.
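The initial hardware screening can be sketched as a simple filter. The thresholds below reuse the example values from the text (four CPU cores, 4GB of RAM); requiring both conditions at once is an assumption, since the text lists them as alternative examples, and the type and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ClientHardware:
    client_id: str
    cpu_cores: int
    ram_gb: float


# Example thresholds from the text; real values depend on the business scenario.
MIN_CPU_CORES = 4
MIN_RAM_GB = 4.0


def screen_by_hardware(
    clients: Iterable[ClientHardware],
    min_cores: int = MIN_CPU_CORES,
    min_ram_gb: float = MIN_RAM_GB,
) -> List[ClientHardware]:
    """Keep only clients whose hardware meets every preset requirement."""
    return [
        client
        for client in clients
        if client.cpu_cores >= min_cores and client.ram_gb >= min_ram_gb
    ]
```

A clock-speed requirement could be added as one more condition in the same comprehension.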
[0037] After initially identifying target clients with sufficient hardware computing power, the server further evaluates the readiness of these clients' speech recognition resource packages. Speech recognition resource packages consist of model files and libraries necessary for clients to perform speech recognition conversion; they are typically large and need to be pre-downloaded to the client's local storage. The server categorizes target clients based on whether they have successfully downloaded the resource package. For target clients that have already downloaded the speech recognition resource package, the server directly designates them as candidate clients; these devices already possess the software capabilities to perform the review task and can undertake the task without additional processing.
[0038] For target clients that have not yet downloaded the speech recognition resource package, the server needs to check whether their available storage space meets the requirements for storing the resource package. Available storage space refers to the client's current remaining local storage capacity, which must be no less than the storage space required by the speech recognition resource package for the package to be downloaded and decompressed successfully. For example, if the speech recognition resource package requires 200MB of storage, the target client's available storage space must be no less than 200MB. The server obtains the available storage space data by checking the hardware resource parameters reported by the client or by querying in real time, and also adds devices that meet the storage requirements to the list of candidate clients. Target clients with insufficient storage space cannot hold the resource package even if their hardware computing power meets the requirements, and are therefore temporarily excluded from the candidate pool.
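The resource-package check from the two paragraphs above can be sketched as a partition of the target clients. The 200MB figure is the example from the text; the type, field, and function names are hypothetical, and returning a separate list of clients that still need the download is a choice made for this sketch.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class TargetClient:
    client_id: str
    package_downloaded: bool
    free_storage_mb: int


PACKAGE_SIZE_MB = 200  # example size from the text


def select_candidates(
    targets: Iterable[TargetClient],
    package_size_mb: int = PACKAGE_SIZE_MB,
) -> Tuple[List[TargetClient], List[TargetClient]]:
    """Split targets into (ready, needs_download); others are excluded.

    ready: the resource package is already on the device.
    needs_download: no package yet, but enough free storage to hold it.
    """
    ready: List[TargetClient] = []
    needs_download: List[TargetClient] = []
    for target in targets:
        if target.package_downloaded:
            ready.append(target)
        elif target.free_storage_mb >= package_size_mb:
            needs_download.append(target)
        # Otherwise storage is insufficient and the target is skipped for now.
    return ready, needs_download
```

Both returned lists together form the candidate set; keeping them separate lets the server know which candidates still need the package delivered.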
[0039] After the above screening, the server obtains a set of candidate clients. These devices all meet the hardware computing power requirements and have the software capabilities to perform speech recognition tasks. Next, the server needs to determine the final assigned client based on the operational status characteristics of each candidate client. Operational status characteristics describe the client's current dynamic operating conditions, including at least the client's operational scenario type. Operational scenario types can be categorized as public places, private places, office areas, etc., with different types of scenarios corresponding to different levels of network resource competition and device stability. The server prioritizes devices with better operational status as assigned clients. For example, it prioritizes devices in private places because these devices typically have a more stable network environment and less resource competition, enabling them to complete the review task more reliably. Operational status characteristics can also include indicators such as the client's real-time CPU utilization, memory utilization, and battery status. For instance, the server can select devices with CPU utilization below 70% and memory utilization below 80% as assigned clients to ensure that the review task is not affected by device resource constraints, thus ensuring normal user experience.
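The final selection by operating status can be sketched as a filter on utilization followed by a sort on scenario priority. The 70% and 80% limits are the examples from the text. The priority order below only encodes that private places come first; placing office areas ahead of public places, and breaking ties by lower CPU utilization, are assumptions made for this sketch, as are all names.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Candidate:
    client_id: str
    scenario: str              # "private", "office", or "public"
    cpu_utilization: float     # fraction in [0, 1]
    memory_utilization: float  # fraction in [0, 1]


# Lower number means higher priority; only "private first" comes from the text.
SCENARIO_PRIORITY = {"private": 0, "office": 1, "public": 2}
MAX_CPU_UTILIZATION = 0.70
MAX_MEMORY_UTILIZATION = 0.80


def rank_candidates(candidates: Iterable[Candidate]) -> List[Candidate]:
    """Drop overloaded devices, then order the rest by scenario and CPU load."""
    eligible = [
        c
        for c in candidates
        if c.cpu_utilization < MAX_CPU_UTILIZATION
        and c.memory_utilization < MAX_MEMORY_UTILIZATION
    ]
    return sorted(
        eligible,
        key=lambda c: (
            # Unknown scenario labels sort after all known ones.
            SCENARIO_PRIORITY.get(c.scenario, len(SCENARIO_PRIORITY)),
            c.cpu_utilization,
        ),
    )
```

The first element of the returned list is the preferred assigned client under these assumed rules.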
[0040] In one embodiment, the server can determine one or more assigned clients based on the actual needs of the target chat channel. A typical implementation involves determining two assigned clients within each channel, serving as the executing assigned client and the backup assigned client, respectively. The executing assigned client immediately undertakes the review task, performing speech recognition and conversion on the channel's mixed audio in real time; the backup assigned client remains in standby mode, not actually performing the recognition task, but keeping its resource packages ready. If the executing assigned client is detected to have exited abnormally, the backup assigned client can quickly take over the task, ensuring uninterrupted channel review capabilities. During the process of determining the assigned client, the server can select from candidate clients according to preset priority rules, such as prioritizing the device with the strongest hardware computing power and optimal operating status as the executing assigned client, and selecting the next best device as the backup assigned client.
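The executing/backup arrangement described above can be sketched as follows. This is a minimal illustration only: the `Candidate` fields, the two numeric scores, and the ranking order are hypothetical stand-ins for the "preset priority rules", which the application does not specify.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Candidate:
    client_id: str
    compute_score: float  # hypothetical hardware ranking, higher is better
    status_score: float   # hypothetical operating-status ranking, higher is better


def pick_executing_and_backup(
    candidates: List[Candidate],
) -> Tuple[Optional[str], Optional[str]]:
    """Return (executing, backup) client ids, best candidate first."""
    ranked = sorted(
        candidates,
        key=lambda c: (c.compute_score, c.status_score),
        reverse=True,
    )
    executing = ranked[0].client_id if ranked else None
    backup = ranked[1].client_id if len(ranked) > 1 else None
    return executing, backup


def on_executing_exit(
    executing: Optional[str], backup: Optional[str]
) -> Tuple[Optional[str], Optional[str]]:
    """Promote the backup when the executing client exits abnormally.

    The backup slot becomes empty and would be refilled by a new selection.
    """
    return backup, None
```

Keeping the backup's resource package ready means the takeover only changes which client runs recognition; no download is needed at failover time.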
[0041] The review task assignment instruction is a control signal issued by the server to the selected assigned client. The instruction at least notifies the client to start the review task and may carry parameter information required for task execution, such as a task start identifier, the data reporting method to be used during task execution, the reporting cycle, or the triggering conditions. After receiving the instruction, the assigned client enters the review task execution state, begins listening to audio data in the channel, and performs speech recognition and conversion.
[0042] Through the above mechanism, the server achieves dynamic scheduling and precise allocation of review tasks, ensuring that review tasks are only started on clients that meet the requirements in terms of hardware conditions and operating status. This not only guarantees the execution quality of review tasks but also avoids causing additional burden to low-end devices or devices with poor network environments.
[0043] Step S13: Receive the review content information reported by the assigned client. The review content information is generated by the assigned client after performing speech recognition and conversion on the mixed audio in the target chat channel. This step is a key connection point for the collaboration between the client and the server: the assigned client completes the audio processing task locally and sends the processing result to the server in the form of structured data for subsequent violation review.
[0044] Upon receiving the task assignment instruction from the server, the assigned client enters the audio acquisition and processing state. This client continuously monitors all audio data within the target chat channel, acquiring audio in a manner identical to its communication mechanism as a regular channel member. Specifically, the assigned client receives audio streams uploaded by other speaking clients within the channel via a real-time communication protocol, while simultaneously capturing its own user's voice input through its local microphone. These two or more audio streams are mixed locally on the assigned client to generate a single, complete channel-mixed audio stream. The mixing operation refers to the process of aligning and superimposing multiple audio signals according to time, merging them into a single audio stream. For example, when user A and user B speak simultaneously in the channel, assigned client C mixes the received audio streams from user A, user B, and its own microphone, generating a mixed audio stream containing all three. This mixing method fully utilizes the fact that the client already needs to receive audio within the channel for normal voice communication, eliminating the need to fetch additional audio streams and thus avoiding bandwidth waste.
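The mixing operation described above (time alignment followed by superposition) can be sketched as below. The sketch assumes the streams are already decoded to signed 16-bit PCM samples at a common sample rate and aligned to a common start time; the function name `mix_streams` and the clipping behaviour are illustrative choices, not taken from the application.

```python
from typing import List


def mix_streams(streams: List[List[int]]) -> List[int]:
    """Mix time-aligned 16-bit PCM streams by sample-wise summation.

    A stream that ends early contributes silence afterwards, and each
    summed sample is clipped to the signed 16-bit range to avoid wrap-around.
    """
    length = max((len(s) for s in streams), default=0)
    mixed = []
    for i in range(length):
        total = sum(s[i] for s in streams if i < len(s))
        mixed.append(max(-32768, min(32767, total)))
    return mixed
```

For the example above, client C would pass the decoded streams of user A, user B, and its own microphone as the three entries of `streams`.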
[0045] After mixing is complete, the assigned client invokes the locally deployed speech recognition engine to perform speech recognition conversion on the mixed audio. The speech recognition engine combines an acoustic model and a language model based on deep neural networks, and is capable of converting the input audio signal frame by frame into the corresponding text sequence. During recognition, the engine first performs preprocessing operations such as pre-emphasis, framing, and windowing on the audio, and extracts acoustic features such as Mel-frequency cepstral coefficients. Then, it calculates the phoneme probability corresponding to each frame using the acoustic model, and finally combines it with the language model to decode the most probable text result. To adapt to different accents and background noise environments, the speech recognition engine can use a general model or a dedicated model optimized for specific scenarios. The recognition conversion result includes not only the final text information but also timestamps for each word or sentence in the audio and a recognition confidence score. The timestamps record the start and end times of the text content in the mixed audio, providing a basis for subsequently locating the specific position of inappropriate content. The recognition confidence score reflects the reliability of the recognition result, usually represented by a value between 0 and 1, with higher values indicating more reliable recognition.
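The three preprocessing operations named above can be illustrated with a short sketch. The pre-emphasis coefficient 0.97, the Hamming window, and the 400-sample frame with a 160-sample hop (25 ms and 10 ms at 16 kHz) are common textbook choices assumed here for illustration; the application does not state which values or window its engine uses.

```python
import math
from typing import List


def pre_emphasis(signal: List[float], alpha: float = 0.97) -> List[float]:
    """Apply y[n] = x[n] - alpha * x[n-1] to boost high frequencies."""
    if not signal:
        return []
    return [signal[0]] + [
        signal[n] - alpha * signal[n - 1] for n in range(1, len(signal))
    ]


def frame_signal(signal: List[float], frame_len: int, hop: int) -> List[List[float]]:
    """Split into overlapping frames, dropping any trailing partial frame."""
    return [
        signal[start:start + frame_len]
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]


def hamming(frame_len: int) -> List[float]:
    """Hamming window coefficients for one frame."""
    if frame_len == 1:
        return [1.0]
    return [
        0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
        for n in range(frame_len)
    ]


def preprocess(signal: List[float], frame_len: int = 400, hop: int = 160) -> List[List[float]]:
    """Pre-emphasize, frame, and window a mono signal."""
    window = hamming(frame_len)
    frames = frame_signal(pre_emphasis(signal), frame_len, hop)
    return [[x * w for x, w in zip(frame, window)] for frame in frames]
```

Feature extraction such as Mel-frequency cepstral coefficients would then operate on each windowed frame; that stage is omitted here.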
[0046] The review content information refers to the data set reported by the client to the server. Its specific components include at least the text information generated by the speech recognition conversion, timestamp information, and recognition confidence level. In one implementation, the review content information may further include alternative recognition results generated during the speech recognition process, i.e., a list of candidate texts output by the recognition engine. These alternative results correspond to recognition options with similar pronunciations but slightly lower confidence levels, which can be used to assist the server in detecting homophonic variants of prohibited words. The organization format of the review content information can adopt a structured data format, such as JSON or Protocol Buffers, to facilitate server parsing and processing. The reporting timing can be real-time streaming or batch reporting at fixed intervals, such as packaging and sending accumulated recognition results every few seconds, or reporting immediately whenever a complete utterance is recognized. The choice of reporting method needs to balance real-time performance and server processing pressure. For violation detection scenarios requiring rapid response, streaming reporting can be used to shorten latency.
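One possible JSON layout for the review content information is sketched below. All field names (`channel_id`, `reporter_client_id`, `words`, `start_ms`, and so on) are hypothetical, and attaching a confidence score to each word rather than to the whole utterance is an assumption made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class WordSpan:
    text: str
    start_ms: int       # start time within the mixed audio
    end_ms: int         # end time within the mixed audio
    confidence: float   # recognition confidence in [0, 1]
    alternatives: List[str] = field(default_factory=list)  # optional candidate texts


@dataclass
class ReviewContent:
    channel_id: str
    reporter_client_id: str
    words: List[WordSpan]

    def to_json(self) -> str:
        """Serialize to JSON for reporting; nested dataclasses become dicts."""
        return json.dumps(asdict(self), ensure_ascii=False)
```

A client using batch reporting would accumulate `WordSpan` entries for a few seconds and send one `ReviewContent` per interval, while streaming reporting would send one per recognized utterance.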
[0047] For example, in a game voice channel, users A and B are communicating during a match, and client C is the assigned client for the channel. User A says "Attack the enemy base quickly," and user B responds "Received." Client C's local microphone does not pick up any audio. Client C mixes the two audio streams, and its speech recognition engine converts them into the text "Attack the enemy base quickly, received," generating timestamps for each word, such as "Attack quickly" from millisecond 100 to 300, and "enemy" from millisecond 310 to 450, with an overall confidence level of 0.92. Client C then encapsulates this information into review content information and reports it to the server via a long connection. The server can then use this data for subsequent keyword matching and violation determination. Through this mechanism, the speech recognition task, which originally needed to be performed by the server, is transferred to the client, significantly reducing the server's computational and bandwidth overhead.
[0048] Step S14: Perform a violation review based on the review content information. If a violation is determined, obtain the independent audio data of the relevant clients within the target chat channel for a second review to identify the violating clients within the target chat channel. The server performs violation reviews based on the information reported by the assigned client. If violations are detected, it further obtains independent audio data from relevant clients within the target chat channel for secondary review, ultimately identifying the specific violating client. This step is the core of the overall detection method. Through a two-stage review mechanism, it achieves precise positioning from "whether there is a violation at the channel level" to "who specifically violated the rule at the user level," ensuring broad review coverage while minimizing server resource consumption.
[0049] After receiving the review content information reported by the assigned client, the server first performs a preliminary review of the information for violations. The review content information includes at least the text information generated by the assigned client through speech recognition and conversion of the channel's mixed audio, as well as the corresponding timestamp information of the text information. The server matches the text information against a pre-set list of prohibited words. This list of prohibited words is a set of prohibited words pre-built according to business needs, covering various types of inappropriate content such as insults, pornography, political content, and advertisements, and can be dynamically updated according to the evolution of internet slang and variant words. During the matching process, the server can perform exact matching or fuzzy matching, such as detecting whether the text completely contains a prohibited word, or identifying homophonic variants through pinyin conversion, near-homophone expansion, etc. Once a match is successful, it indicates that there is prohibited content at the current time and location, and the server determines that the chat channel is suspected of violating regulations.
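A minimal sketch of the exact-matching branch of the preliminary review follows. It assumes the reported text arrives as a list of words with timestamps; `RecognizedWord` and `match_prohibited` are hypothetical names, and fuzzy matching of homophonic variants is not shown.

```python
from typing import List, NamedTuple, Set


class RecognizedWord(NamedTuple):
    text: str
    start_ms: int
    end_ms: int


def match_prohibited(
    words: List[RecognizedWord], prohibited: Set[str]
) -> List[RecognizedWord]:
    """Return the recognized words that appear in the prohibited word list.

    Each returned entry keeps its timestamps, so a hit also gives the time
    position of the suspected violation in the mixed audio.
    """
    return [w for w in words if w.text.lower() in prohibited]
```

An empty result means the channel is not flagged for this report; any hit marks the channel as suspected at the returned time positions.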
[0050] In one embodiment, to reduce false triggers caused by speech recognition errors or contextual ambiguity, the server can introduce a confidence threshold mechanism to filter matching results. The review content information carries the recognition confidence level generated by the client during the speech recognition conversion process. This confidence level quantifies the reliability of the recognition result and is typically represented by a value between 0 and 1. After matching the text information with a preset violation word library, the server obtains the recognition confidence level corresponding to the matched violation word and compares it with a preset confidence threshold. If the recognition confidence level is lower than the preset confidence threshold, the current recognition result may contain a large error, and a violation word appearing in the text may be a misrecognition. In this case, the server determines that the matching result is invalid, does not treat it as valid violation content, and does not trigger the subsequent secondary review. If the recognition confidence level reaches or exceeds the preset confidence threshold, the matching result is determined to be valid, violation content is identified, and the time position of the violation word in the mixed audio is determined based on the timestamp information, thereby initiating the secondary review process. The preset confidence threshold can be flexibly set according to business needs. For example, it can be set to 0.6 for scenarios with low tolerance for violations, where more matches should proceed to secondary review, and to 0.8 for scenarios that require higher accuracy.
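The confidence filter can be sketched as a single comparison per match. The `Hit` structure and the default threshold of 0.8 are illustrative assumptions; 0.8 is one of the two example values given above.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    word: str
    start_ms: int
    end_ms: int
    confidence: float  # recognition confidence reported by the client


def valid_hits(hits: List[Hit], threshold: float = 0.8) -> List[Hit]:
    """Keep only matches whose confidence reaches or exceeds the threshold.

    Matches below the threshold are treated as likely misrecognitions and
    do not trigger a secondary review.
    """
    return [h for h in hits if h.confidence >= threshold]
```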
[0051] In another embodiment, the server can employ a tiered strategy for handling violations to achieve a differentiated response mechanism. A pre-defined violation keyword database can be divided into a high-risk database and a medium-risk database based on the severity of the violation. The high-risk database includes serious violations that clearly violate laws, regulations, or platform guidelines, such as inappropriate content involving violence or terrorism. The medium-risk database includes relatively minor violations such as general insults and advertising promotions. For violations that hit the high-risk database, the server directly determines that there is a violation and triggers a secondary review without assessing the recognition confidence level, ensuring the fastest possible response to serious violations. For violations that hit the medium-risk database, the server makes a comprehensive judgment based on the recognition confidence level. Only when the confidence level reaches or exceeds a pre-defined medium-risk keyword confidence threshold is a violation determined and a secondary review triggered. If the confidence level is below this threshold, no action is taken or only a log entry is recorded for later analysis. This tiered strategy can ensure zero tolerance for high-risk content while effectively filtering out false triggers of medium-risk keywords due to recognition errors, avoiding unnecessary secondary review overhead.
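The tiered strategy can be sketched as a small decision function. The `Action` names, the default medium-risk threshold, and the choice to log sub-threshold medium-risk hits (rather than ignore them) are assumptions made for illustration; the application permits either.

```python
from enum import Enum
from typing import Set


class Action(Enum):
    SECOND_REVIEW = "second_review"
    LOG_ONLY = "log_only"
    NONE = "none"


def decide(
    word: str,
    confidence: float,
    high_risk: Set[str],
    medium_risk: Set[str],
    medium_threshold: float = 0.8,
) -> Action:
    """Choose the response for one matched word."""
    if word in high_risk:
        # High-risk hits bypass the confidence check entirely.
        return Action.SECOND_REVIEW
    if word in medium_risk:
        if confidence >= medium_threshold:
            return Action.SECOND_REVIEW
        return Action.LOG_ONLY
    return Action.NONE
```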
[0052] In another embodiment, the server can perform contextual judgment on the matching results of prohibited words by combining contextual information. Some words may have completely different meanings in different contexts. For example, the word "garbage" is often used as an insult in the context of online games, but may only refer to waste in discussions on environmental protection topics. The server can pre-build an exemption word library, which includes words that, when combined with prohibited words, may form a non-prohibited context. When a prohibited word is matched in the text information, the server further obtains the contextual words before and after the prohibited word and matches these contextual words with the exemption word library. If the contextual words match the exemption word library, for example, if words such as "classification" and "recycling" appear before the prohibited word "garbage," the current matching result is determined to be context-exempt and is not considered valid prohibited content, thus not triggering a second review. If the contextual words do not match the exemption word library, it is normally determined to be suspected of being prohibited. This contextual judgment mechanism can effectively reduce misjudgments caused by the polysemy of words and further improve the accuracy of the review.
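The context exemption check can be sketched as a lookup over the words surrounding a match. The window of three words on each side is an assumed parameter; the application does not say how many contextual words are examined.

```python
from typing import List, Set


def is_context_exempt(
    tokens: List[str], index: int, exempt_words: Set[str], window: int = 3
) -> bool:
    """Return True if any word near tokens[index] is in the exemption library.

    `index` is the position of the matched prohibited word; up to `window`
    words before and after it are inspected.
    """
    lo = max(0, index - window)
    hi = min(len(tokens), index + window + 1)
    context = tokens[lo:index] + tokens[index + 1:hi]
    return any(t in exempt_words for t in context)
```

With the exemption library {"classification", "recycling"}, a match on "garbage" in "please sort garbage for recycling" would be exempted, while the same match in "you are garbage" would remain a suspected violation.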
[0053] In another embodiment, the server can also employ a frequency statistics and time window limitation mechanism to cumulatively determine suspected violations. If the same client repeatedly matches suspected violations with low confidence levels within a preset time window, the server can accumulate these matching events. When the cumulative number reaches a preset threshold, the server determines that the client is suspected of violating regulations and triggers a secondary review. For example, the time window is set to 30 seconds, and the cumulative threshold is 3 times. If a client is detected with three suspected violations with confidence levels between 0.5 and 0.7 within 30 seconds, then even though none of the individual matches meets the triggering standard, the server determines after the third match that the client is suspected of violating regulations and initiates a secondary review for further investigation. This mechanism can capture violations in which a user attempts to evade detection by speaking quietly or by slurring quickly over prohibited words, while avoiding overreaction to a single low-confidence match.
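The cumulative mechanism can be sketched with a sliding window of event times. The class name, the `key` argument used to identify the source being tracked, and the choice to reset the count after a trigger are assumptions; the 30-second window and threshold of 3 are the example values given above.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class LowConfidenceAccumulator:
    """Count low-confidence matches per key inside a sliding time window."""

    def __init__(self, window_s: float = 30.0, threshold: int = 3) -> None:
        self.window_s = window_s
        self.threshold = threshold
        self._events: Dict[str, Deque[float]] = defaultdict(deque)

    def record(self, key: str, timestamp_s: float) -> bool:
        """Record one match; return True if a secondary review should start."""
        events = self._events[key]
        events.append(timestamp_s)
        # Drop events that have fallen out of the window.
        while events and timestamp_s - events[0] > self.window_s:
            events.popleft()
        if len(events) >= self.threshold:
            events.clear()  # assumed: start counting afresh after a trigger
            return True
        return False
```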
[0054] The above embodiments can be implemented individually or in combination according to actual business needs. For example, multiple mechanisms such as direct triggering of high-risk words, combining medium-risk words with confidence thresholds, and contextual exemption judgments can be used simultaneously to construct a multi-layered review and filtering system. Through these optimization strategies, the server can further reduce invalid secondary reviews caused by accidental triggering while ensuring review coverage, concentrating limited computing resources on high-risk or high-confidence violation content that truly requires in-depth analysis, thereby improving the overall efficiency and economy of the review system.
[0055] The initial review results include not only a binary judgment of whether a violation exists, but also the specific temporal position of the violating content in the mixed audio. This information is provided by the timestamp information in the reviewed content information. The timestamp information accurately records the start and end times of each word or sentence in the mixed audio, allowing the server to accurately pinpoint the moment the violation occurred. For example, if the word "garbage" is detected in the text information and matches the violation word list, and the timestamp corresponding to this word is between 15 and 17 seconds, then the server can know that there is violating content in the mixed audio during the period from 15 to 17 seconds. This temporal position information forms the basis for subsequent secondary reviews.
[0056] After the initial review determines that there is inappropriate content, the server immediately initiates a second review process. The goal of the second review is to pinpoint the specific user responsible for the violation, that is, to determine which client in the channel uttered the inappropriate content. Since the mixed audio used in the initial review is a superposition of multiple users' voices, it is impossible to distinguish the source of the sound based solely on the mixed audio. Therefore, it is necessary to obtain the independent audio data of each speaking client for in-depth analysis. Independent audio data refers to the raw audio stream uploaded separately by each client. These audio streams are uploaded to the server during normal client communication or are temporarily retrieved by the server as needed. Based on the time position determined in the initial review, the server retrieves the independent audio data of each speaking client within that time period from the target chat channel. This independent audio data retains each user's individual voice signal and has not undergone mixing processing, thus clearly reflecting the content of each user's speech.
[0057] After obtaining the independent audio data, the server performs speech recognition and conversion on each audio stream to obtain the independent text information corresponding to each speaking client. The recognition and conversion method is similar to the local recognition performed by the assigned client, but because it is executed on the server, a more powerful acoustic model or a more refined post-processing algorithm can be used. Subsequently, the server performs keyword matching on each piece of independent text information against the preset violation word library, identifying speaking clients that match the keywords as violating clients. For example, continuing the above example, the server retrieves the independent audio data of users A, B, and C from seconds 15 to 17. After recognition, only user A's independent text information is found to contain the word "garbage," while the texts of user B and user C do not contain any violating words. Therefore, user A's client can be identified as the violating client. In this way, the server refines the violation from the channel level to the user level, providing an accurate basis for subsequent penalty measures.
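The attribution step of the secondary review can be sketched as below. The `transcribe` callable stands in for the server-side speech recognition engine and is an assumption, as is the use of substring matching; the audio passed in is taken to be already cut to the time window found in the initial review.

```python
from typing import Callable, Dict, List, Set


def find_violating_clients(
    independent_audio: Dict[str, bytes],
    transcribe: Callable[[bytes], str],
    prohibited: Set[str],
) -> List[str]:
    """Return ids of clients whose own audio contains a prohibited word.

    `independent_audio` maps each speaking client's id to its unmixed audio
    for the suspected time window.
    """
    violators = []
    for client_id, audio in independent_audio.items():
        text = transcribe(audio).lower()
        if any(word in text for word in prohibited):
            violators.append(client_id)
    return violators
```

Because only the flagged time window of each speaking client is transcribed, the server-side recognition cost scales with the number of suspected violations rather than with total speaking time.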
[0058] The secondary review process fully demonstrates the technical advantages of this application. On the one hand, since only a small amount of audio data needs to be processed after a violation is determined, continuous full recognition of all audio streams by the server is avoided, significantly reducing the consumption of computing resources. On the other hand, the secondary review is based on independent audio data for re-recognition, avoiding crosstalk interference that may be caused by mixed audio, and significantly improving the accuracy of localization. For complex cases with low recognition confidence or involving homophonic variants, the secondary review can also combine alternative recognition results, voiceprint features, and other auxiliary information for comprehensive judgment, further ensuring the reliability of the localization results.
[0059] The identified violating client information can be used to trigger subsequent manual review or automated penalty processes. For example, the server can package the violating client's independent audio data, preliminary review text information, and secondary review results into a review case, which can then be pushed to a manual review platform for further confirmation by reviewers, or directly take measures such as muting or banning the violating client according to preset rules. Through this complete review chain, this application achieves precise targeting of violations while ensuring high coverage, effectively maintaining a healthy chat channel environment.
[0060] It is easy to understand from the above embodiments that, compared with the prior art, this application has many advantages, including at least: First, this application fundamentally changes the resource consumption pattern of the traditional pure server-centralized processing model by offloading the speech recognition conversion task to the client. The client performs speech recognition conversion on the channel's mixed audio locally and only reports the content to be reviewed. The server no longer needs to deploy a speech recognition model for each audio stream, nor does it need to fetch the original audio streams of all speaking users in real time. This significantly reduces server bandwidth usage, machine computing power consumption, and graphics card resource requirements. Especially in high-concurrency scenarios such as game voice chat and voice-based social networking, this application can transfer the vast majority of audio processing load to the client, enabling the server to support large-scale audio review operations at extremely low cost. This effectively solves the problem of rapidly rising hardware costs and the difficulty for enterprises to bear the economic burden of large-scale review in existing technologies.
[0061] Secondly, this application employs a two-stage review mechanism to accurately pinpoint the users responsible for violating content, ensuring comprehensive review coverage while avoiding wrongful penalties for innocent users. The server first makes an initial violation determination based on the mixed audio recognition results reported by the client. Only after confirming the presence of violating content does it retrieve independent audio data from the relevant clients for a secondary review. Through secondary speech recognition and keyword matching, the violation is located to the specific client. This mechanism avoids the shortcoming of traditional mixed audio submission solutions, which can only identify violations at the level of the entire channel and cannot pinpoint individuals. While ensuring accuracy, it further reduces server overhead, enabling the platform to implement precise penalties for violating users and maintain a healthy and orderly chat environment.
[0062] Furthermore, this application achieves intelligent allocation of review tasks by introducing a terminal attribute information collection and dynamic scheduling mechanism, significantly improving system stability and user experience. The server dynamically selects the most suitable client for the review task based on the client's hardware resource parameters and operating status characteristics, ensuring that review tasks are only initiated on clients with sufficient hardware computing power and suitable operating conditions. This avoids review delays or failures due to insufficient client performance or poor operating environment, and also prevents interference with normal user voice communication. Simultaneously, the differentiated scheduling strategy based on factors such as operating status characteristics makes the allocation of review tasks more scientific and reasonable, achieving load balancing among different clients and ensuring the smooth and efficient operation of the entire review system under large-scale concurrent scenarios.
[0063] In a further embodiment, the step of determining at least one assigned client from the clients based on the terminal attribute information includes the following steps: Step S121: Based on the hardware resource parameters, determine one or more target clients whose hardware computing power resources meet the preset processing capability requirements. The hardware resource parameters include at least the number of central processing unit (CPU) cores, the CPU clock speed, and the amount of RAM, which directly reflect the client's computing power. The preset processing capability requirements can be flexibly set according to actual business needs; for example, requiring the client to have at least four CPU cores and at least 4GB of RAM, or requiring a certain CPU clock speed. Through this screening, the server can exclude devices whose hardware performance is insufficient to smoothly run the speech recognition model, avoiding review delays or recognition failures due to insufficient client computing power. For example, in a large-scale game battle, dozens of clients may exist simultaneously in the channel, some of which may be older models equipped with only a dual-core processor and 2GB of RAM; these devices will be excluded from the target client pool.
[0064] Step S122: The target client that has downloaded the speech recognition resource package is identified as a candidate client. After identifying target clients with sufficient hardware computing power, the server further evaluates the readiness of their speech recognition resource packages. Speech recognition resource packages consist of model files and libraries necessary for the client to perform speech recognition conversion; they are typically large and need to be pre-downloaded to the client's local storage. For target clients that have already downloaded the speech recognition resource package, the server directly identifies them as candidate clients. These candidate clients already possess the software capabilities to perform the review task and can undertake the task without additional processing. For example, a high-spec gaming phone may have already downloaded and installed the resource package in the background, allowing the server to include it in the candidate pool during the screening process.
[0065] Step S123: Based on the hardware resource parameters, detect whether the available storage space of the target client that has not downloaded the speech recognition resource package meets the requirements for storing the speech recognition resource package, and determine the target client that meets the requirements as a candidate client. Available storage space refers to the client's current remaining local storage capacity, which must be no less than the storage space required to download and decompress the speech recognition resource package. For example, if the speech recognition resource package is 200MB in size, the target client's available storage space must be no less than 200MB. The server obtains the available storage space data from the hardware resource parameters reported by the target client or by querying the target client in real time, and also confirms target clients that meet the storage requirements as candidate clients. Target clients with insufficient storage space cannot store the resource package even if their hardware computing power meets the requirements, and are therefore temporarily excluded from the candidate range. This step ensures that the selected clients have the basic conditions to complete the deployment of the resource package, laying the foundation for the subsequent review task.
[0066] Step S124: Based on the operating status characteristics, determine, from the candidate clients, the client whose operating status meets the review task requirements as the assigned client. After obtaining a set of candidate clients that meet both the hardware computing power requirements and the resource package storage conditions, the server determines the final assigned client based on the operating status characteristics of each candidate client. Operating status characteristics describe the client's dynamic operating conditions and environmental attributes at the current moment, and their specific components may include, but are not limited to, the client's operating scenario type, real-time CPU utilization, memory utilization, network connection method, download speed, and battery status. The server prioritizes candidate clients with better operating status as assigned clients. For example, it prioritizes devices in private places, as these candidate clients typically have a more stable network environment and less resource contention, enabling them to complete the review task more reliably; or it selects candidate clients with CPU utilization below 70% and memory utilization below 80%, so that the review task does not degrade the normal user experience through resource constraints on the device. Through this comprehensive evaluation, the server ultimately selects the most suitable assigned client from the candidate clients to undertake the real-time speech recognition task, providing reliable assurance for the subsequent processing of the channel's mixed audio and violation review.
[0067] In this embodiment, a multi-level screening mechanism is used to ensure that the selected client has the hardware conditions, software preparation and suitable operating status to perform the speech recognition task, thereby ensuring the stable and efficient execution of the review task.
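Steps S121 to S124 can be sketched as a chain of filters followed by a ranking. The `Terminal` fields are hypothetical names for the reported terminal attribute information, and the numeric limits (four cores, 4GB RAM, 200MB package, 70% CPU, 80% memory) are the example values from the description rather than required settings. Ranking private-place devices first and then by lower CPU utilization is an assumed tie-breaking rule.

```python
from dataclasses import dataclass
from typing import List

PACK_SIZE_MB = 200  # example resource package size from the description


@dataclass
class Terminal:
    client_id: str
    cpu_cores: int
    ram_gb: float
    has_resource_pack: bool
    free_storage_mb: float
    cpu_util: float  # 0.0 to 1.0
    mem_util: float  # 0.0 to 1.0
    scene: str       # "private", "public", or "office"


def select_assigned(terminals: List[Terminal]) -> List[str]:
    """Return client ids eligible for assignment, most suitable first."""
    # S121: hardware computing power screening.
    targets = [t for t in terminals if t.cpu_cores >= 4 and t.ram_gb >= 4]
    # S122 and S123: package already downloaded, or enough space to store it.
    candidates = [
        t for t in targets
        if t.has_resource_pack or t.free_storage_mb >= PACK_SIZE_MB
    ]
    # S124: operating status screening, then ranking.
    eligible = [t for t in candidates if t.cpu_util < 0.70 and t.mem_util < 0.80]
    eligible.sort(key=lambda t: (t.scene != "private", t.cpu_util))
    return [t.client_id for t in eligible]
```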
[0068] In a further embodiment, the step of issuing a review task assignment instruction to the assigned client includes the following steps: Step S121': Determine the number of assigned clients already in the target chat channel, and obtain the operating status characteristics of each client in the target chat channel, including the client's operating scenario type. Before issuing the task assignment instruction to the client, the server needs to ensure that the target chat channel already has the software foundation to perform the speech recognition task, that is, each client needs to download and install the speech recognition resource package in advance.
[0069] The server first determines the number of assigned clients already existing in the target chat channel. Assigned clients refer to clients that have been selected by the server and are currently undertaking audio moderation tasks; their number reflects the current channel's moderation coverage at the user level.
[0070] In one embodiment, the server can obtain this quantity information in real time through a maintained chat channel-client mapping table. For example, it can record the identification information of the currently assigned client and the backup assigned client for each chat channel, and update the list synchronously when a client is assigned or leaves. The number of assigned clients is an important basis for the server to determine whether to allocate resource packages to other clients. If there are no assigned clients in the channel, it means that the current channel is in a review blind spot, and resource packages need to be allocated to other clients first to establish review capabilities as soon as possible. If there are already one or more assigned clients in the channel, it indicates that the review needs of the current channel have been partially met, and the server can appropriately slow down the download pace of resource packages for other clients to avoid resource redundancy and unnecessary bandwidth consumption.
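A chat channel-client mapping table of the kind described above might look like the following sketch. The class and method names, and the fixed pair of roles per channel, are assumptions for illustration.

```python
from typing import Dict, Optional


class ChannelAssignments:
    """Track the executing and backup assigned clients of each chat channel."""

    ROLES = ("executing", "backup")

    def __init__(self) -> None:
        self._table: Dict[str, Dict[str, Optional[str]]] = {}

    def assign(self, channel_id: str, client_id: str, role: str) -> None:
        if role not in self.ROLES:
            raise ValueError(f"unknown role: {role}")
        slots = self._table.setdefault(
            channel_id, {"executing": None, "backup": None}
        )
        slots[role] = client_id

    def remove_client(self, channel_id: str, client_id: str) -> None:
        """Clear any role held by a client that left the channel."""
        slots = self._table.get(channel_id)
        if slots is None:
            return
        for role in self.ROLES:
            if slots[role] == client_id:
                slots[role] = None

    def assigned_count(self, channel_id: str) -> int:
        """Number of assigned clients currently recorded for the channel."""
        slots = self._table.get(channel_id, {})
        return sum(1 for cid in slots.values() if cid is not None)
```

A count of zero corresponds to the review blind spot described above, in which resource package downloads for other clients should be prioritized.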
[0071] While determining the number of assigned clients, the server acquires the operational status characteristics of each client within the target chat channel. These characteristics include at least the client's operational scenario type. The operational scenario type characterizes the environment in which the client is located, and may include, but is not limited to, public places, private places, and office areas. Public places can encompass shared network environments such as internet cafes, coffee shops, and school computer labs, which typically feature intense competition for network resources and complex bandwidth allocation. Private places refer to relatively independent network environments such as homes and individual rooms, where network stability and bandwidth availability are usually high. Office areas may involve various network types such as corporate intranets and guest networks. The operational scenario type can be determined by jointly considering the client's network address information, physical location information, and device usage patterns. For example, the server can query a pre-configured public place IP address database to identify whether the client is in an internet cafe address range. The client can also obtain the service set identifier (SSID) of the currently connected wireless network through the operating system interface and report it to the server. Based on the prefix of this identifier or a pre-configured list of public place wireless networks, the server determines whether the client is in a coffee shop or hotel that provides a shared wireless network.
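The scenario identification described above can be illustrated with a minimal Python sketch. It assumes the server holds a list of public-place address ranges and a list of SSID prefixes. The range 203.0.113.0/24 (a documentation-only range), the prefixes, and the function name classify_scenario are hypothetical and are not taken from this application.

```python
import ipaddress
from typing import Optional

# Hypothetical, illustrative data: a documentation-only address range
# and made-up SSID prefixes standing in for the pre-configured lists.
PUBLIC_IP_RANGES = [ipaddress.ip_network("203.0.113.0/24")]
PUBLIC_SSID_PREFIXES = ("CafeGuest-", "Hotel-")


def classify_scenario(ip: str, ssid: Optional[str] = None) -> str:
    """Return "public" if the address or the reported SSID matches a public-place entry."""
    addr = ipaddress.ip_address(ip)
    if any(addr in net for net in PUBLIC_IP_RANGES):
        return "public"
    # str.startswith accepts a tuple of candidate prefixes.
    if ssid is not None and ssid.startswith(PUBLIC_SSID_PREFIXES):
        return "public"
    return "non_public"
```

The sketch only separates public from non-public environments, which is the distinction the backoff strategy relies on; distinguishing private places from office areas would need further signals such as location or usage patterns.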
[0072] Step S122': Based on the number of assigned clients and the type of operating scenario, determine the download backoff time for the speech recognition resource package for each client, and determine the download priority of each client in ascending order of the download backoff time. The server determines the download backoff time for the speech recognition resource package for each client based on the number of assigned clients and the type of operating scenario, and then prioritizes each client according to the order of download backoff time from shortest to longest. Download backoff time refers to the length of time a client needs to wait after triggering the download condition; the shorter the backoff time, the higher the download priority. The specific strategy for determining the backoff time can be flexibly configured according to actual business needs. As a specific implementation method, a tiered backoff mechanism can be adopted. For example, if there are no assigned clients in the current channel, indicating an urgent need for review, a shorter download backoff time (e.g., 1 second) is set for clients in non-public areas within the chat channel; a slightly longer backoff time (e.g., 5 seconds) is set for clients in public areas to reduce the impact on the shared network. If there is already one assigned client in the chat channel, a 5-second backoff time is set for clients in non-public areas, and a 7-second backoff time is set for clients in public areas. If there are already two assigned clients in the channel, a 7-second backoff time is set for clients in non-public areas, and a 9-second backoff time is set for clients in public areas. This tiered strategy allows the server to prioritize download resources for clients with good network environments while ensuring comprehensive review coverage, thus avoiding bandwidth spikes caused by a large number of clients downloading simultaneously.
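As a non-authoritative illustration, the tiered backoff example can be written as a lookup table followed by a sort. The backoff values are the example values given above. The Client type, the function names, and the choice to reuse the last tier when more than two assigned clients exist are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import List

# Backoff in seconds as (non-public, public), keyed by the number of
# assigned clients already in the channel (example values from the text).
BACKOFF_TIERS = {0: (1, 5), 1: (5, 7), 2: (7, 9)}


@dataclass
class Client:
    client_id: str
    scenario: str  # "public" or "non_public"


def backoff_seconds(assigned_count: int, scenario: str) -> int:
    # Assumption: counts above 2 reuse the last tier; the text gives no value for them.
    non_public, public = BACKOFF_TIERS[min(assigned_count, 2)]
    return public if scenario == "public" else non_public


def download_order(clients: List[Client], assigned_count: int) -> List[Client]:
    """Order clients by ascending backoff time, i.e. descending download priority."""
    return sorted(clients, key=lambda c: backoff_seconds(assigned_count, c.scenario))
```

Because Python's sort is stable, clients with equal backoff times keep their original relative order; a real scheduler might break ties by hardware capability or another criterion.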
[0073] Step S123': Distribute the speech recognition resource package to clients that meet the download conditions according to the download priority: After determining the download backoff time for each client, the server creates a download priority list in ascending order of backoff time. The client with the shortest backoff time receives the highest download priority and will be prioritized for inclusion in the download scheduling queue. Finally, the server distributes the speech recognition resource package to clients that meet the download criteria based on their download priority. Clients that meet the download criteria are those whose downloads are not currently paused and whose backoff time has expired.
[0074] In this embodiment, dynamic scheduling of the resource package download process is implemented, which not only ensures the timely deployment of the review task, but also effectively controls the instantaneous bandwidth pressure, so as to achieve reasonable allocation and utilization of bandwidth resources, laying a solid foundation for the subsequent client to perform speech recognition tasks.
[0075] In a further embodiment, the step of sending the speech recognition resource package to a client that meets the download conditions includes the following steps: Step S1231': Determine the upper limit of the download rate for each client based on the client's network connection method or downlink speed, so as to control the actual download speed of the client when downloading the speech recognition resource package.
[0076] While distributing speech recognition resource packages to clients that meet the download conditions according to download priority, the server can further refine its control of the download process, so that the bandwidth consumed by the resource package download does not degrade the client user's normal voice communication.
[0077] The server first obtains the network connection method or downlink speed information of each client. This information is part of the operational status characteristics and can be obtained through client reporting or server-initiated queries. The network connection method refers to the specific form in which the client accesses the internet, including but not limited to wireless network connections, wired network connections, and mobile data network connections. Different types of network connection methods have different bandwidth characteristics and stability features. For example, clients accessing the internet via home fiber broadband or wired networks typically have high bandwidth capacity and stable connection quality; clients accessing via Wi-Fi wireless networks may experience bandwidth limitations due to signal strength and channel interference; while clients accessing via 4G or 5G mobile data networks may face traffic restrictions, signal fluctuations, and bandwidth control policies from operators. Downlink speed quantifies the current available downlink bandwidth level for the client and can be obtained through real-time speed testing, historical data statistics, or network interface queries. It is usually expressed in bits per second, such as 100Mbps, 50Mbps, or 20Mbps.
[0078] The server determines the download speed limit for each client based on the client's network connection method or downlink speed. The download speed limit refers to the maximum network bandwidth the server allows a client to use while downloading the speech recognition resource package. The client is constrained by this limit during the download, and its actual download speed will not exceed the set value. The specific strategy for determining the download speed limit can be flexibly configured according to actual business needs. Its core objective is to minimize interference with normal voice communication on the client while ensuring that the resource package can be downloaded successfully.
[0079] In one implementation, the server can directly set the corresponding download speed limit based on the client's network connection method. For example, for clients connected via a wired network, a higher download speed limit, such as 10 MB/s, can be set because wired networks typically have ample bandwidth resources, and download tasks have minimal impact on communication. For clients connected via Wi-Fi, differentiated speed limits can be set based on Wi-Fi signal strength or network type; for example, 5 MB/s for 5 GHz Wi-Fi with strong signals, and 2 MB/s for 2.4 GHz or weaker Wi-Fi signals. For clients connected via mobile data networks, a lower download speed limit, such as 1 MB/s, is set to avoid consuming excessive mobile data or occupying valuable wireless channel resources. Mobile data networks can be further subdivided into 4G and 5G networks, with different speed limits set for each; for example, 5G networks could be set to 2 MB/s, and 4G networks to 1 MB/s.
[0080] In another implementation, the server can dynamically adjust the download speed limit based on the client's downlink speed. The downlink speed reflects the client's currently available network bandwidth, which the server can use as a direct basis for determining the download speed limit. For example, for a client with a downlink speed of 100 Mbps, the download speed limit can be set to 8 MB/s; for a client with a downlink speed of 50 Mbps, the limit can be set to 4 MB/s; and for a client with a downlink speed of only 20 Mbps, the limit can be set to 1.5 MB/s. In practice, a proportional coefficient method can be used, such as setting the download speed limit to a percentage of the downlink speed, like 50% or 30%, to reserve sufficient bandwidth for normal voice communication. Alternatively, a segmented threshold method can be used, pre-setting multiple downlink speed ranges and their corresponding download speed limits, with the server directly matching the limit that corresponds to the range the client currently falls into.
[0081] The two methods described above can also be combined to form a more refined control strategy. For example, the server first determines a basic speed limit based on the network connection method, and then fine-tunes this basic limit based on the downlink speed. Suppose a client connects via Wi-Fi, with a preset basic speed limit of 5 MB/s, but the measured downlink speed is only 30 Mbps. In this case, the server can lower the actual download speed limit to 3 MB/s to adapt to the current network conditions. Alternatively, if the client connects via mobile data, with a preset basic speed limit of 1 MB/s, and the downlink speed reaches 100 Mbps and the user is on an unlimited data plan, the server can appropriately raise the speed limit to 2 MB/s to accelerate the download of the resource package.
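The combined strategy can be sketched as follows, purely for illustration. The base limits are the example values given for each connection method. The connection-type keys, the function name rate_cap_mb_s, and the default coefficient of 0.5 are assumptions. The sketch only lowers the base limit according to the measured downlink speed; raising the limit (for instance for an unlimited data plan) is not modeled.

```python
from typing import Optional

# Base limits in MB/s per connection method (example values from the text).
# The dictionary keys are hypothetical labels chosen for this sketch.
BASE_CAP_MB_S = {
    "wired": 10.0,
    "wifi_5ghz": 5.0,
    "wifi_2_4ghz": 2.0,
    "mobile_5g": 2.0,
    "mobile_4g": 1.0,
}


def rate_cap_mb_s(connection: str,
                  downlink_mbps: Optional[float] = None,
                  fraction: float = 0.5) -> float:
    """Base limit by connection method, reduced to a fraction of the measured downlink."""
    base = BASE_CAP_MB_S[connection]
    if downlink_mbps is None:
        return base  # no measurement available: use the base limit only
    measured_mb_s = downlink_mbps / 8.0  # megabits per second -> megabytes per second
    return min(base, measured_mb_s * fraction)
```

With a coefficient of 0.5, a 5 GHz Wi-Fi client measured at 30 Mbps would be limited to 1.875 MB/s. The text's example arrives at 3 MB/s for the same client, which corresponds to a larger coefficient that the text does not state.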
[0082] After determining the download speed limit for each client, the server sends this control parameter along with the command to distribute the speech recognition resource package to the client, or sends it via independent control signaling. Upon receiving the download command, the client initiates the download process and self-regulates its actual download speed according to the server-specified download speed limit. Clients can limit download speed by adjusting the TCP window size, controlling the data request frequency, or using a token bucket algorithm.
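Of the client-side options mentioned, the token bucket algorithm is straightforward to sketch. The version below is a minimal, single-threaded illustration. The class name, the byte-based units, and the injectable clock parameter are choices made for this sketch and are not specified by this application.

```python
import time


class TokenBucket:
    """Allows bursts up to capacity_bytes and a sustained rate of rate_bytes_per_s."""

    def __init__(self, rate_bytes_per_s: float, capacity_bytes: float,
                 clock=time.monotonic):
        self.rate = rate_bytes_per_s
        self.capacity = capacity_bytes
        self.tokens = capacity_bytes  # start with a full bucket
        self.clock = clock
        self.last = clock()

    def try_consume(self, nbytes: int) -> bool:
        """Return True and spend tokens if nbytes may be downloaded now."""
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False
```

A download loop would call try_consume before requesting each chunk and wait briefly when it returns False, so the average throughput stays at or below the server-specified limit.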
[0083] In this embodiment, the download process of the speech recognition resource package is incorporated into the server's unified scheduling system. This ensures that the resource package can be downloaded on time to support subsequent review tasks, while effectively preventing download traffic from impacting the client's normal voice communication. Thus, the deployment of review capabilities is ensured while the real-time interactive experience of users is maintained.
[0084] In a further embodiment, before the step of receiving the review content information reported by the assignment client, the following steps are included: Step S131: Select one execution client from among the assigned clients, and use the remaining assigned clients as backup assigned clients. Before receiving the review content information reported by the assigned client, the server needs to ensure that the client currently undertaking the audio review task can run continuously and stably. Considering that the assigned client may exit the review task due to network fluctuations, resource exhaustion, or program abnormalities, this application further introduces a primary / backup switching mechanism. By setting up an execution assigned client and a backup assigned client, and switching promptly when an abnormality is detected, the review capability of the chat channel is guaranteed to be uninterrupted.
[0085] The server first selects one client from among the assigned clients as the execution assigned client, while the remaining assigned clients serve as backup assigned clients. The execution assigned client is the one chosen by the server to assume the audio review task immediately. This client performs speech recognition and conversion on the channel's mixed audio in real time and reports the review content information. Backup assigned clients remain in standby mode: they do not actually perform recognition tasks, but keep their speech recognition resource packages ready and can take over the task at any time. This primary / backup relationship can be determined based on the ranking results during the client selection process; for example, the client with the strongest hardware computing power and optimal operating status is prioritized as the execution assigned client, and the next best client is selected as a backup assigned client. In a typical implementation, the server can maintain a list of assigned clients for each chat channel, containing multiple entries, one marked "execution" and the rest marked "backup," and dynamically adjust it according to the real-time status of the clients.
[0086] Step S132: Monitor the running status of the execution assigned client in real time. The running status includes the client's online status, resource utilization, and task execution status. After determining the execution assigned client and the backup assigned clients, the server monitors the running status of the execution assigned client in real time. Running status is a set of dynamic indicators used to assess whether the client is currently suitable to continue undertaking the review task. Its specific components may include, but are not limited to, the client's online status, resource utilization, and task execution status. Online status indicates whether the client maintains a valid network connection with the server, and can be determined, for example, through periodic heartbeat checks: if no heartbeat response is received from the client within a preset time, the client is considered offline. Resource utilization includes CPU usage and RAM usage, reflecting the client's current computational load. Excessive resource utilization may cause speech recognition tasks to lag or fail. Task execution status reflects the client's specific performance while executing the review task, such as whether review content information is reported on time, whether the reported data format is complete, and whether the speech recognition engine is running normally.
[0087] Step S133: When it is detected that the execution assigned client is offline, that its resource utilization exceeds a preset threshold, or that the audit task execution is abnormal, determine that the execution assigned client exits the audit task. When an anomaly is detected in the execution assigned client, the server determines that the client exits the review task. The specific criteria for determining anomalies can be flexibly set according to business needs, including but not limited to the following situations. If the execution assigned client is detected to be offline (i.e., heartbeat timeout or network connection loss), the server immediately determines that it exits the review task. If the resource utilization of the execution assigned client exceeds a preset threshold, such as a CPU utilization rate consistently above 90% or a RAM utilization rate above 85%, the client is under excessive load and may not be able to guarantee the real-time performance and accuracy of speech recognition; in this case, the server also determines that it exits the review task. If task execution anomalies are detected, such as the client repeatedly failing to report review content information on time, or reporting obviously incorrect data, the server likewise determines that it exits the review task. For example, in a game voice channel, if user A's client, acting as the execution assigned client, reaches a CPU utilization rate of 95% because a large game is running in the background, exceeding the preset CPU threshold of 90%, the server determines that user A's client exits the review task as soon as it detects this state.
[0088] Step S134: Activate one of the backup assigned clients to take over the audit task from the execution assigned client. After determining that the execution assigned client has exited the review task, the server immediately activates one of the backup assigned clients to take over the task. The activation process involves issuing a command to the backup assigned client, instructing it to switch from standby to execution mode, begin speech recognition and conversion of the channel's mixed audio, and report the review content information. If multiple backup assigned clients exist, the server can select a successor according to preset priority rules, such as prioritizing clients with higher hardware configurations and better network conditions. Once the new execution assigned client starts up and is running normally, the original execution assigned client is removed from the assigned client list or marked as invalid. The server can then replenish the list with new backup assigned clients based on the same rules used to select assigned clients, ensuring that other assigned clients remain available to take over if any client exits, thus maintaining the continuity of the review task.
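Steps S132 to S134 can be illustrated with a small sketch, offered only as one possible reading. The CPU and RAM thresholds are the example values of 90% and 85%. The Status and ChannelAssignment types, the field names, and the assumption that the backup list is already ordered by priority are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

CPU_LIMIT = 0.90  # example CPU threshold from the text
RAM_LIMIT = 0.85  # example RAM threshold from the text


@dataclass
class Status:
    online: bool   # False on heartbeat timeout or connection loss
    cpu: float     # utilization in [0, 1]
    ram: float     # utilization in [0, 1]
    task_ok: bool  # False if reports are late or malformed


def must_exit(s: Status) -> bool:
    """True if the execution assigned client should leave the review task."""
    return (not s.online) or s.cpu > CPU_LIMIT or s.ram > RAM_LIMIT or not s.task_ok


@dataclass
class ChannelAssignment:
    executing: Optional[str]
    backups: List[str]  # assumed to be ordered by priority, best first

    def failover(self) -> Optional[str]:
        """Promote the highest-priority backup; None means no reviewer is left."""
        self.executing = self.backups.pop(0) if self.backups else None
        return self.executing


def check(assignment: ChannelAssignment, status: Status) -> Optional[str]:
    """Return the client that should be executing after evaluating the latest status."""
    if assignment.executing is not None and must_exit(status):
        return assignment.failover()
    return assignment.executing
```

Replenishing the backup list after a switch is left out of the sketch; it would reuse whatever selection rules chose the assigned clients in the first place.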
[0089] In this embodiment, through the primary / backup assignment mechanism, even if the execution assigned client currently performing the review task exits for any reason, the review task of the chat channel can still be taken over seamlessly by a backup assigned client, ensuring that the review process is not interrupted and providing a continuous and reliable data source for subsequent violation detection.
[0090] In a further embodiment, the step of reviewing violations based on the review content information includes: Step S141: Obtain the text information, timestamp information, and recognition confidence from the review content information; match the text information with a preset violation word database to obtain the matching result. During the violation review process based on the review content information, the server first extracts text information, timestamp information, and recognition confidence from the review content information reported by the assigned client. Text information refers to the text content generated after the assigned client performs speech recognition and conversion on the channel's mixed audio. Timestamp information records the start and end times of each word or sentence in the text information within the mixed audio. Recognition confidence quantifies the reliability of the speech recognition results, typically represented by a value between 0 and 1, with higher values indicating more reliable recognition results.
[0091] The server extracts text information and matches it against a pre-defined list of prohibited words. This list is a collection of prohibited words pre-built according to business needs, covering various types of inappropriate content such as insults and advertisements, and can be dynamically updated as internet slang and variant words evolve. During the matching process, precise matching can be performed, i.e., checking whether the text completely contains a word from the prohibited word list, or fuzzy matching strategies can be used, such as identifying homophonic variants through pinyin conversion and near-homophone expansion. Once a match is successful, a matching result is obtained, which includes at least the matched prohibited word and its position in the text.
[0092] Step S142: Determine the time position corresponding to the matched violation word in the mixed audio according to the timestamp information. The server determines the time position corresponding to the matched violation word in the mixed audio according to the timestamp information. The timestamp information accurately records the start and end times of each word. Therefore, the server can map the violation words hit in the text to the specific time period in the audio stream. For example, if the timestamp corresponding to the word "garbage" in the text information is from the 15th second to the 17th second, the server can know that there is violation content in the mixed audio during this time period.
[0093] Step S143: Evaluate the credibility of the matching result according to the recognition confidence. If the recognition confidence reaches or exceeds the preset confidence threshold, directly determine that the matching result is valid. After obtaining the matching result and the time position, the server further evaluates the credibility of the matching result according to the recognition confidence. The recognition confidence reflects the accuracy of the speech recognition result. For the matched violation word, the server obtains its corresponding recognition confidence and compares it with the preset confidence threshold. The preset confidence threshold can be flexibly set according to the actual business scenario. For example, it can be set to 0.6 for scenarios with low tolerance, and 0.8 for scenarios that require higher accuracy. If the recognition confidence reaches or exceeds this threshold, it indicates that the current recognition result is relatively reliable, and the server directly determines that the matching result is valid without additional processing.
[0094] Step S144: When the recognition confidence is lower than the preset confidence threshold, obtain the alternative recognition result carried in the review content information. The alternative recognition result is the candidate recognition text generated by the assigned client during the speech recognition conversion. Perform a secondary match between the alternative recognition result and the preset violation word library to confirm whether there are variant violation words that are similar in pronunciation to the violation word. When the recognition confidence is lower than the preset confidence threshold, it indicates that there may be a large error in the current recognition result, such as inaccurate recognition due to accent, background noise or unclear speech. At this time, the server needs to further obtain the alternative recognition result carried in the review content information. The alternative recognition result is a list of candidate recognition texts generated by the assigned client during the speech recognition conversion. These candidate texts correspond to recognition options that are similar in pronunciation to the main recognition result but have a slightly lower confidence. The server performs a secondary match between the alternative recognition result and the preset violation word library to confirm whether there are variant violation words that are similar in pronunciation to the violation word. For example, the main recognition result is "garbage" with a confidence of 0.65, and the alternative results include "laji" or "laki". These words may belong to Internet slang or homophonic variants and also constitute violations. If a violation word is hit in the alternative result, the server also determines that there is violation content and includes the variant word in the matching result.
[0095] Step S145: Based on the time location, matching results, and credibility assessment, a violation determination result is generated. The violation determination result includes the matched violation words and their corresponding time locations. Through the aforementioned credibility assessment and secondary matching, the server ultimately generates a violation determination result based on the time location, matching results, and credibility assessment. This determination result includes at least the matched violation word and its corresponding time location in the mixed audio, providing accurate evidence for subsequent secondary review and user location. This mechanism effectively reduces false judgments caused by speech recognition errors, while also identifying variant violation words, improving the accuracy and coverage of the review process.
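A minimal sketch of steps S141 to S145 follows, under several stated assumptions. Recognition output is modeled per word with start and end times, a confidence, and a list of alternatives. Matching is exact rather than fuzzy. A low-confidence match with no matching alternative is discarded, a case the text leaves open. The Word and Hit types and the function name review are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Word:
    text: str
    start_s: float
    end_s: float
    confidence: float
    alternatives: List[str] = field(default_factory=list)


@dataclass
class Hit:
    word: str               # matched violation word from the main result
    start_s: float          # time position in the mixed audio
    end_s: float
    variant: Optional[str]  # variant confirmed by the secondary match, if any


def review(words: List[Word], lexicon: Set[str], threshold: float = 0.8) -> List[Hit]:
    hits: List[Hit] = []
    for w in words:
        if w.text not in lexicon:
            continue  # no primary match (exact matching only in this sketch)
        if w.confidence >= threshold:
            hits.append(Hit(w.text, w.start_s, w.end_s, None))
            continue
        # Low confidence: secondary match against the alternative results.
        variant = next((alt for alt in w.alternatives if alt in lexicon), None)
        if variant is not None:
            hits.append(Hit(w.text, w.start_s, w.end_s, variant))
        # Assumption: with no matching alternative, the low-confidence hit is dropped.
    return hits
```

Applied to the example above (the word "garbage" at seconds 15 to 17 with confidence 0.65 and alternatives "laji" and "laki"), a lexicon containing "garbage" and "laji" yields one hit for that time range with "laji" recorded as the variant.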
[0096] In this embodiment, filtering by the confidence threshold and performing a secondary match on the alternative recognition results effectively reduces misjudgments caused by speech recognition errors while still capturing violation words such as homophonic variants, which significantly improves the accuracy and reliability of violation review.
[0097] In a further embodiment, the step of obtaining independent audio data of relevant clients within the target chat channel for secondary review to determine the violating clients within the target chat channel includes: Step S141': Based on the time period corresponding to the violation content, determine the target time period for which independent audio data needs to be obtained. The target time period includes the start time to the end time of the violation content appearing in the mixed audio. After the server determines that there is inappropriate content based on the review information, it needs to further pinpoint the violation to the specific client, that is, to determine which user in the channel uttered the inappropriate content. The server obtains independent audio data for the time period corresponding to the inappropriate content and re-identifies and matches it, thereby achieving user-level violation location.
[0098] The server first determines the target time period for retrieving independent audio data based on the time period corresponding to the violation content. This time period originates from the violation judgment result generated during the initial review, which includes the detected violation words and their corresponding time positions in the mixed audio. Timestamp information precisely records the start and end times of the violation words, allowing the server to determine the start and end range of the target time period. For example, if the initial review determines that the violation word "garbage" exists in the mixed audio from second 15 to second 17, the server will determine second 15 to second 17 as the target time period. This time period accurately defines the time window in which the violation occurred, providing a clear basis for subsequently retrieving independent audio data.
[0099] Step S142': Retrieve independent audio data from each client within the target time period from the target chat channel. The independent audio data is the original audio stream uploaded separately by each client. After determining the target time period, the server will retrieve the independent audio data of each client within that time period from the target chat channel. Independent audio data refers to the raw audio stream uploaded separately by each client. These audio streams are either uploaded to the server during normal client communication or retrieved by the server on demand. Unlike mixed audio, independent audio data retains each user's individual voice signal without mixing, thus clearly reflecting each user's speech. The server needs to obtain the independent audio data of all clients that may speak within the target time period, including the client executing the assignment and other clients speaking in the channel. For example, in a chat channel containing users A, B, and C, if preliminary review determines that there is inappropriate content between seconds 15 and 17, the server will retrieve the independent audio data of users A, B, and C for that time period.
[0100] Step S143': Perform speech recognition conversion on each of the independent audio data to obtain independent text information corresponding to each client. After obtaining the independent audio data, the server performs speech recognition and conversion on each audio stream to obtain the corresponding independent text information for each client. The speech recognition and conversion process is similar to the operation performed locally by the assigned client, but because it runs on the server side, a more powerful acoustic model or a more sophisticated post-processing algorithm can be used to improve recognition accuracy. The server sequentially performs preprocessing, feature extraction, acoustic model decoding, and language model decoding on each audio stream, ultimately outputting the text information. For example, recognizing user A's independent audio data yields the text "You're such garbage," user B's result is "The weather is nice today," and user C's result is "Received." This independent text information accurately reflects each user's speech content within the target time period.
[0101] Step S144': Perform keyword matching on the independent text information, and identify the client whose text matches keywords in the preset violation keyword database as the violating client. The server performs keyword matching on each piece of independent text information against the preset violation word database. This database is identical to the one used in the initial review and covers various violation words and homophonic variants. During the matching process, the server checks whether each piece of independent text information contains keywords from the database. Clients whose text matches are identified as violating clients. Using the example above, user A's independent text information contains the word "garbage," which matches the violation word database, so the server identifies user A's client as a violating client; the text of users B and C does not contain any violation words, so they are not considered violators. In this way, the server narrows the violation from the channel level to the user level, providing an accurate basis for subsequent penalties.
[0102] In one embodiment, during keyword matching, a fuzzy matching strategy can be used to identify homophonic variants. In addition, the server can take into account the recognition confidence generated during the speech recognition conversion process to make a comprehensive judgment. The recognition confidence quantifies the reliability of the speech recognition result, typically represented by a value between 0 and 1. The server obtains the recognition confidence corresponding to the matched violation word in each piece of independent text information and compares it with a preset confidence threshold. If the recognition confidence reaches or exceeds the threshold, the matching result is confirmed as valid, and the client is identified as a violating client. If the recognition confidence is lower than the threshold, the current recognition result may contain an error; even if a violation word appears in the text, it may be a misrecognition. In this case, the server does not directly determine that the client has committed a violation but marks the relevant data as pending review and pushes it with priority to the manual review platform for further confirmation. By introducing this confidence-based judgment, the secondary review can effectively reduce misjudgments caused by speech recognition errors, further improving the accuracy and reliability of locating violating users.
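The user-level localization of steps S143' and S144', together with the confidence check described above, can be sketched as follows. This is an illustration only. It assumes each client's recognition result is available as a pair of text and a single confidence value (the text describes a confidence per matched word), uses case-insensitive substring matching instead of fuzzy matching, and uses hypothetical names and sample transcripts.

```python
from typing import Dict, List, Set, Tuple


def locate_violators(results: Dict[str, Tuple[str, float]],
                     lexicon: Set[str],
                     threshold: float = 0.8) -> Tuple[List[str], List[str]]:
    """Split clients into confirmed violators and cases pending manual review.

    results maps a client id to (recognized text, recognition confidence).
    Assumption: one confidence per utterance stands in for the per-word value.
    """
    violators: List[str] = []
    pending: List[str] = []
    for client_id, (text, confidence) in results.items():
        lowered = text.lower()
        if not any(term in lowered for term in lexicon):
            continue  # no violation word in this client's independent text
        if confidence >= threshold:
            violators.append(client_id)
        else:
            pending.append(client_id)  # pushed to the manual review platform
    return violators, pending
```

For instance, with results for users A, B, and C of "You're such garbage" (0.93), "The weather is nice today" (0.97), and "Received" (0.99), and a lexicon containing "garbage", only A is returned as a violator and nothing is pending.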
[0103] The secondary review process fully demonstrates the technical advantages of this application. On the one hand, since only a small amount of audio data needs to be processed after a violation is determined, continuous full recognition of all audio streams by the server is avoided, significantly reducing the consumption of computing resources. On the other hand, the secondary review is based on independent audio data for re-recognition, avoiding crosstalk interference that may be caused by mixed audio, and significantly improving the accuracy of localization. For complex cases with low recognition confidence or involving homophonic variants, the secondary review can also combine alternative recognition results, voiceprint features, and other auxiliary information for comprehensive judgment, further ensuring the reliability of the localization results.
[0104] This embodiment ensures review coverage while precisely locating the users who violate the rules, effectively maintaining a healthy environment in the chat channel.
[0105] Referring to Figure 3, a chat audio violation detection device is provided to meet one of the purposes of this application. It is a functional embodiment of the chat audio violation detection method of this application and includes: a terminal attribute acquisition module 11, used to acquire terminal attribute information of each client within a target chat channel, the terminal attribute information including hardware resource parameters and operating status characteristics; an audit task assignment module 12, used to determine at least one assignment client from the clients based on the terminal attribute information and issue an audit task assignment instruction to the assignment client; an audit content receiving module 13, used to receive audit content information reported by the assignment client, the audit content information being generated by the assignment client after performing speech recognition conversion on the mixed audio within the target chat channel; and a violation terminal determination module 14, used to perform violation auditing based on the audit content information and, if violation content is determined, to acquire independent audio data of the relevant clients within the target chat channel for secondary auditing to determine the violation client within the target chat channel.
[0106] To address the aforementioned technical problems, embodiments of this application also provide an electronic device. Figure 4 shows the internal structure of the electronic device. The electronic device includes a processor, a computer-readable storage medium, a memory, and a network interface connected via a system bus. The computer-readable storage medium stores an operating system, a database, and computer-readable instructions. The database may store a sequence of control information. When the computer-readable instructions are executed by the processor, they enable the processor to implement the chat audio violation detection method described in this application. The processor of the electronic device provides computing and control capabilities to support the operation of the entire electronic device. The memory of the electronic device may store computer-readable instructions which, when executed by the processor, enable the processor to execute the chat audio violation detection method described in this application. The network interface of the electronic device is used for communication with a terminal. Those skilled in the art will understand that the structure shown in Figure 4 is merely a block diagram of the part of the structure related to the present application and does not limit the electronic device to which the present application is applied. A specific electronic device may include more or fewer components than those shown in the figure, combine certain components, or have a different component arrangement.
[0107] In this embodiment, the processor is used to execute the specific functions of each module and its submodules shown in Figure 3. The memory stores the program code and various data required to execute these modules or submodules. The network interface is used for data transmission between the user terminal and the server. In this embodiment, the memory stores the program code and data required to execute all modules/submodules in the chat audio violation detection device of this application, and the server can call this program code and data to execute the functions of all modules/submodules.
[0108] This application also provides a storage medium storing computer-readable instructions, which, when executed by one or more processors, cause the one or more processors to perform the steps of the method described in any embodiment of this application.
[0109] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments of this application can be implemented by a computer program instructing related hardware. This computer program can be stored in a computer-readable storage medium, and when executed, it can include the processes of the embodiments of the methods described above. The aforementioned storage medium can be a magnetic disk, optical disk, read-only memory (ROM), or random access memory (RAM), etc.
[0110] In summary, this application significantly reduces server resource consumption while precisely locating the users responsible for violating content.
[0111] Those skilled in the art will understand that the steps, measures, and solutions in the various operations, methods, and processes discussed in this application can be interchanged, modified, combined, or deleted. Furthermore, other steps, measures, and solutions in the various operations, methods, and processes discussed in this application can also be interchanged, modified, rearranged, decomposed, combined, or deleted. Furthermore, steps, measures, and solutions in the prior art that are similar to those in the various operations, methods, and processes disclosed in this application can also be interchanged, modified, rearranged, decomposed, combined, or deleted.
[0112] The above describes only some of the embodiments of this application. It should be noted that those skilled in the art can make several improvements and modifications without departing from the principles of this application, and these improvements and modifications should also be considered within the scope of protection of this application.
Claims
1. A chat audio violation detection method, characterized in that it comprises the following steps: obtaining terminal attribute information of each client in a target chat channel, wherein the terminal attribute information includes hardware resource parameters and operating status characteristics; determining at least one assignment client from the clients based on the terminal attribute information, and issuing an audit task assignment instruction to the assignment client; receiving review content information reported by the assignment client, wherein the review content information is generated by the assignment client after performing speech recognition conversion on the mixed audio in the target chat channel; and conducting a violation review based on the review content information and, if violation content is determined, obtaining independent audio data of the relevant clients in the target chat channel for a secondary review to identify the violating client in the target chat channel.
2. The chat audio violation detection method according to claim 1, characterized in that the step of determining at least one assignment client from the clients based on the terminal attribute information includes the following steps: identifying, based on the hardware resource parameters, one or more target clients whose hardware computing power resources meet preset processing capability requirements; determining each target client that has already downloaded the speech recognition resource package as a candidate client; detecting, based on the hardware resource parameters, whether the available storage space of each target client that has not downloaded the speech recognition resource package meets the requirements for storing the speech recognition resource package, and also determining each target client that meets the requirements as a candidate client; and selecting, based on the operating status characteristics, the clients whose operating status meets the requirements of the audit task from the candidate clients as the assignment clients.
3. The chat audio violation detection method according to claim 1, characterized in that the step of issuing an audit task assignment instruction to the assignment client includes the following steps: determining the number of assignment clients already present in the target chat channel, and obtaining the operating status characteristics of each client in the target chat channel, the operating status characteristics including the operating scenario type of the client; determining, based on the number of assignment clients and the operating scenario type, a download backoff time of the speech recognition resource package for each client, and determining the download priority of each client in order of download backoff time from shortest to longest; and sending the speech recognition resource package to the clients that meet the download conditions according to the download priority.
4. The chat audio violation detection method according to claim 3, characterized in that the step of sending the speech recognition resource package to the clients that meet the download conditions includes the following step: determining, based on the network connection method or download speed of each client, an upper limit of the download rate for that client, so as to control the actual download speed of the client when downloading the speech recognition resource package.
5. The chat audio violation detection method according to claim 1, characterized in that, before receiving the review content information reported by the assignment client, the method includes the following steps: selecting one of the assignment clients as the executing assignment client, with the remaining assignment clients serving as backup assignment clients; monitoring the operating status of the executing assignment client in real time, including the client's online status, resource utilization, and task execution status; when it is detected that the executing assignment client is offline, that its resource usage exceeds a preset threshold, or that the audit task is being executed abnormally, determining that the executing assignment client exits the audit task; and activating one of the backup assignment clients to take over the audit task from the executing assignment client.
6. The chat audio violation detection method according to claim 1, characterized in that the step of conducting a violation review based on the review content information includes: obtaining the text information, timestamp information, and recognition confidence in the review content information, and matching the text information against a preset violation word database to obtain a matching result; determining, based on the timestamp information, the time position of the matched violation word in the mixed audio; evaluating the credibility of the matching result based on the recognition confidence, wherein, when the recognition confidence reaches or exceeds a preset confidence threshold, the matching result is directly determined to be valid, and, when the recognition confidence is lower than the preset confidence threshold, the alternative recognition results carried in the review content information are obtained, the alternative recognition results being candidate recognition texts generated by the assignment client during speech recognition conversion, and the alternative recognition results are matched against the preset violation word database a second time to confirm whether there are variant violation words with pronunciation similar to the violation words; and generating, based on the time position, the matching result, and the credibility evaluation, a violation determination result that includes the matched violation words and their corresponding time positions.
7. The chat audio violation detection method according to claim 1, characterized in that the step of obtaining independent audio data of the relevant clients in the target chat channel for a secondary review to identify the violating client in the target chat channel includes: determining, based on the time position corresponding to the violation content, the target time period for which independent audio data needs to be obtained, the target time period covering the start time to the end time of the violation content appearing in the mixed audio; retrieving the independent audio data of each client within the target time period from the target chat channel, the independent audio data being the original audio stream uploaded separately by each client; performing speech recognition conversion on each piece of independent audio data to obtain the independent text information corresponding to each client; and performing keyword matching on the independent text information, and identifying each client whose text matches a keyword in the preset violation word database as a violating client.
8. A chat audio violation detection device, characterized in that it includes: a terminal attribute acquisition module, used to acquire terminal attribute information of each client in a target chat channel, the terminal attribute information including hardware resource parameters and operating status characteristics; an audit task assignment module, used to determine at least one assignment client from the clients based on the terminal attribute information and to issue an audit task assignment instruction to the assignment client; a review content receiving module, used to receive review content information reported by the assignment client, the review content information being generated by the assignment client after performing speech recognition conversion on the mixed audio in the target chat channel; and a violation terminal determination module, used to conduct a violation review based on the review content information and, if violation content is determined, to obtain independent audio data of the relevant clients in the target chat channel for a secondary review to identify the violating client in the target chat channel.
9. An electronic device comprising a central processing unit and a memory, characterized in that the central processing unit is used to invoke and run a computer program stored in the memory to perform the steps of the method according to any one of claims 1 to 7.
10. A non-volatile storage medium, characterized in that it stores, in the form of computer-readable instructions, a computer program implementing the method according to any one of claims 1 to 7, which, when invoked and run by a computer, executes the steps included in the corresponding method.