Artificial delay for managing voice communications
By introducing artificial delays into the audio stream and using machine learning models to identify harmful instances, the challenges of audio communication management in the metaverse environment are addressed, enabling effective control of harmful content without affecting the smooth flow of communication.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ROBLOX CORP
- Filing Date
- 2023-09-06
- Publication Date
- 2026-06-16
AI Technical Summary
Existing technologies struggle to effectively manage harmful content in audio communications, especially in metaverse environments, leading to excessive delays that impair smooth communication.
By introducing an artificial delay mechanism, a trained machine learning model is used to identify harmful instances in the audio stream and insert time delays or replace them with noise or silence when necessary. At the same time, combined with speech and text analysis, a harmfulness score is generated to control the transmission of the audio stream.
It enables effective management of harmful content in audio streams without significantly increasing latency, maintaining the naturalness and continuity of communication, and reducing conversation interruptions.
Figure CN119895879B_ABST
Abstract
Description
[0001] Cross-reference to related applications
[0002] This application is an international application and claims priority to U.S. Patent Application No. 17/940,749, filed on September 8, 2022, entitled “ARTIFICIAL LATENCY FOR MODERATING VOICE COMMUNICATION”, pursuant to 35 U.S.C. § 119(e), the entire contents of which are incorporated herein by reference.
Background Technology
[0003] Online platforms need a way to provide a secure and civilized environment for communication between user devices. Text communication is easier to manage than audio communication because users are more tolerant of latency in text messaging. Furthermore, managing text communication is easier than managing audio streams because text can be compared to lists of banned or problematic words. Conversely, audio streams are more difficult to manage and analyze due to variations in accents, intonation, volume, and the use of sarcasm.
[0004] The background description provided herein is intended to present the context of this disclosure. Work of the present inventors, to the extent it is described in this background section, as well as aspects of the specification that may not otherwise qualify as prior art at the time of filing, is neither expressly nor impliedly admitted as prior art against this disclosure.
Summary of the Invention
[0005] The embodiments generally relate to a system and method for managing an audio stream by introducing an artificial delay. According to one aspect, a computer-implemented method includes receiving an audio stream from a transmitting device. The method further includes providing the audio stream and speech analysis scores, information about one or more voice emotion parameters, and one or more voice emotion scores of a first user associated with the transmitting device as input to a trained machine learning model, wherein the trained machine learning model is iteratively applied to the audio stream, and wherein each iteration corresponds to a corresponding portion of the audio stream. The method further includes generating a harmfulness level of the audio stream as output using the trained machine learning model. The method further includes transmitting the audio stream to a receiving device, wherein the transmission is performed to introduce a time delay in the audio stream based on the harmfulness level.
[0006] In some embodiments, the method further includes identifying harmful instances in the audio stream and replacing harmful instances in the audio stream with noise or silence before sending the audio stream to a receiving device. In some embodiments, the method further includes identifying silences or pauses between words in the audio stream, the silences or pauses corresponding to specific timestamps in the audio stream, wherein a time delay is introduced as a gap in the audio stream at the specific timestamp of the silence or pause between words. In some embodiments, the method further includes updating a speech analysis score based on the identified harmful instances in the audio stream. In some embodiments, the method further includes receiving text from a text channel associated with a transmitting device, wherein the text channel is separate from the audio stream, and generating a text score indicating the harmfulness level of the text, wherein the input to a trained machine learning model also includes the text score. In some embodiments, the input to the trained machine learning model also includes a harmful history of a first user, speaker history and metadata associated with the first user, and listener history and metadata associated with a second user, who is associated with the receiving device. In some embodiments, one or more voice emotion parameters include tone, pitch, and vocal intensity levels determined based on one or more previous audio streams from the transmitting device. In some embodiments, the audio stream is provided along with a visual signal, and the method further includes synchronizing the visual signal with the audio stream by introducing a time delay in the visual signal that is the same as the time delay of the audio stream. 
In some embodiments, the audio stream is part of a video stream, and the method further includes analyzing the audio stream to identify harmful instances, detecting, in response to identifying a harmful instance, a portion of the video stream depicting an offensive gesture, wherein the offensive gesture occurs within a predetermined time period of the harmful instance, and modifying at least that portion of the video stream by one or more of the following in response to detecting the offensive gesture: blurring the portion, or replacing the portion with pixels matching a background area. In some embodiments, the audio stream is part of a video stream, and the method further includes performing motion detection on the video stream to detect an offensive gesture, and modifying at least a portion of the video stream by blurring the portion or replacing the portion with pixels matching a background area in response to detecting the offensive gesture. In some embodiments, the time delay is zero seconds if the harmfulness level is below a minimum threshold.
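The relationship between the harmfulness level and the time delay, including the zero-second delay below a minimum threshold, can be illustrated with a minimal sketch. The normalized level in [0, 1], the threshold of 0.2, the 250 ms ceiling, and the linear ramp are all assumptions chosen for illustration; the disclosure does not specify a particular mapping, and `delay_for_harmfulness` is a hypothetical name.

```python
def delay_for_harmfulness(level: float,
                          min_threshold: float = 0.2,
                          max_delay_ms: float = 250.0) -> float:
    """Map a harmfulness level in [0, 1] to an artificial delay in milliseconds.

    The threshold, ceiling, and linear ramp are illustrative assumptions.
    """
    if not 0.0 <= level <= 1.0:
        raise ValueError("level must be in [0, 1]")
    # Below the minimum threshold no delay is introduced at all.
    if level < min_threshold:
        return 0.0
    # Otherwise ramp linearly from 0 ms at the threshold to the ceiling at 1.0.
    return max_delay_ms * (level - min_threshold) / (1.0 - min_threshold)


if __name__ == "__main__":
    for level in (0.1, 0.6, 1.0):
        print(level, delay_for_harmfulness(level))
```

A step function or a learned mapping would fit the same description equally well; the linear ramp is only the simplest choice.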
[0007] According to one aspect, a device includes a processor and a memory coupled to the processor, the memory storing instructions that, when executed by the processor, cause the processor to perform operations including: receiving an audio stream from a transmitting device; providing the audio stream and speech analysis scores, information about one or more voice emotion parameters, and one or more voice emotion scores of a first user associated with the transmitting device as input to a trained machine learning model, wherein the trained machine learning model is iteratively applied to the audio stream, and wherein each iteration corresponds to a corresponding portion of the audio stream; generating a harmfulness level of the audio stream as output using the trained machine learning model; and transmitting the audio stream to a receiving device, wherein the transmission is performed to introduce a time delay in the audio stream based on the harmfulness level.
[0008] In some embodiments, the above operations further include identifying harmful instances in the audio stream and replacing these harmful instances with noise or silence before sending the audio stream to a receiving device. In some embodiments, the above operations further include identifying silences or pauses between words in the audio stream, where each silence or pause corresponds to a specific timestamp in the audio stream, wherein a time delay is introduced as a gap in the audio stream at the specific timestamp of the silence or pause between words. In some embodiments, the above operations further include updating the speech analysis score based on the identified harmful instances in the audio stream.
[0009] According to one aspect, a non-transitory computer-readable medium stores instructions that, when executed by one or more computers, cause the one or more computers to perform operations including: receiving an audio stream from a transmitting device; providing the audio stream and speech analysis scores, information about one or more voice emotion parameters, and one or more voice emotion scores of a first user associated with the transmitting device as input to a trained machine learning model, wherein the trained machine learning model is iteratively applied to the audio stream, and wherein each iteration corresponds to a corresponding portion of the audio stream; generating a harmfulness level of the audio stream as output using the trained machine learning model; and transmitting the audio stream to a receiving device, wherein the transmission is performed to introduce a time delay in the audio stream based on the harmfulness level.
[0010] In some embodiments, the above operations further include identifying harmful instances in the audio stream and replacing harmful instances in the audio stream with noise or silence before sending the audio stream to the receiving device. In some embodiments, the above operations further include identifying silences or pauses between words in the audio stream, the silences or pauses corresponding to specific timestamps in the audio stream, wherein a time delay is introduced as a gap in the audio stream at the specific timestamps of the silences or pauses between words. In some embodiments, the above operations further include updating the speech analysis score based on the identified harmful instances in the audio stream. In some embodiments, the above operations further include receiving text from a text channel associated with the transmitting device, wherein the text channel is separate from the audio stream, and generating a text score indicating the harmfulness level of the text, wherein the input to the trained machine learning model also includes the text score.
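Introducing the time delay as a gap at a silence or pause between words can be sketched as follows. The sketch assumes the audio is a list of float samples, treats a pause as a run of samples whose magnitude stays under an amplitude threshold, and inserts zero-valued samples at the start of the first pause. The function names and the threshold values are hypothetical.

```python
from typing import List, Sequence, Tuple


def find_pauses(samples: Sequence[float],
                silence_threshold: float = 0.01,
                min_pause_len: int = 4) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs for runs of near-silent samples."""
    pauses = []
    run_start = None
    for i, s in enumerate(samples):
        if abs(s) < silence_threshold:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_pause_len:
                pauses.append((run_start, i))
            run_start = None
    # Handle a pause that runs to the end of the buffer.
    if run_start is not None and len(samples) - run_start >= min_pause_len:
        pauses.append((run_start, len(samples)))
    return pauses


def insert_gap_at_pause(samples: Sequence[float], gap_len: int) -> List[float]:
    """Lengthen the first detected pause by gap_len silent samples."""
    data = list(samples)
    pauses = find_pauses(data)
    if not pauses:
        return data  # no pause found, stream is left untouched
    start, _ = pauses[0]
    return data[:start] + [0.0] * gap_len + data[start:]


if __name__ == "__main__":
    audio = [0.5, 0.4, 0.0, 0.0, 0.0, 0.0, 0.3, 0.2]
    print(find_pauses(audio))
    print(insert_gap_at_pause(audio, 3))
```

Because the extra samples land inside an existing pause, the spoken words on either side are not cut.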
[0011] One method to prevent harmful content from appearing in audio streams is to buffer the audio stream, identify harmful instances before the audio stream is sent from the sending device to the receiving device, and remove these harmful instances from the audio stream. However, managing audio streams introduces delays of several seconds. Audio delays exceeding 50 milliseconds cause unnatural delays that interfere with the conversation, while delays exceeding 250 milliseconds can cause the conversation to break down.
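The two latency limits stated above can be written as a small check. The category labels and the treatment of the exact boundary values (50 ms and 250 ms counted as not exceeding the limit) are assumptions of this sketch.

```python
def classify_latency(delay_ms: float) -> str:
    """Classify an audio delay against the 50 ms and 250 ms limits."""
    if delay_ms < 0:
        raise ValueError("delay_ms must be non-negative")
    if delay_ms <= 50:
        return "natural"      # not perceived as a delay
    if delay_ms <= 250:
        return "unnatural"    # interferes with the conversation
    return "breakdown"        # conversation can break down


if __name__ == "__main__":
    for ms in (20, 120, 3000):
        print(ms, classify_latency(ms))
```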
[0012] This application advantageously describes a metaverse engine and/or metaverse application that provides a method for identifying harmful instances while selectively inserting intervals or pauses into an audio stream to perform management without perceptible delay.
Attached Figure Description
[0013] Figure 1 is a block diagram of an example network environment for identifying harmful instances in communications, according to some embodiments described herein.
[0014] Figure 2 is a block diagram of an example computing device for identifying harmful instances in communications, according to some embodiments described herein.
[0015] Figure 3 is an example user interface for identifying harmful instances in a video stream, according to some embodiments described herein.
[0016] Figure 4 is an example user interface for identifying offensive behavior in a video stream, according to some embodiments described herein.
[0017] Figure 5 is an example flowchart for identifying harmful instances in communications, according to some embodiments described herein.
[0018] Figure 6 is another example flowchart for identifying harmful instances in communications, according to some embodiments described herein.
Detailed Implementation
[0019] Network environment 100
[0020] Figure 1 shows a block diagram of an example environment 100 for identifying harmful instances in communications. In some embodiments, environment 100 includes a server 101, user devices 115a, ..., 115n, and a network 105. Users 125a, ..., 125n may be associated with corresponding user devices 115a, ..., 115n. In Figure 1 and the other accompanying drawings, a letter following a reference numeral, such as "115a," indicates a reference to the element having that particular reference numeral. A reference numeral in the text without a following letter, such as "115," indicates a general reference to embodiments of the element bearing that reference numeral. In some embodiments, environment 100 may include other servers or devices not shown in Figure 1. For example, server 101 may be multiple servers 101.
[0021] Server 101 includes one or more servers, each including a processor, memory, and network communication hardware. In some embodiments, server 101 is a hardware server. Server 101 is communicatively coupled to network 105. In some embodiments, server 101 sends data to user device 115 and receives data from user device 115. Server 101 may include a metaverse engine 103 and a database 199.
[0022] In some embodiments, the metaverse engine 103 includes code and programs for receiving communications between two or more users in a virtual metaverse, such as between friends in the same location within the metaverse, within the same metaverse experience, or within a metaverse application. Users interact across different groups of people (e.g., different ages, regions, languages, etc.) within the metaverse.
[0023] In some embodiments, the metaverse engine 103 receives an audio stream from user device 115a, the destination of which is user device 115n. The metaverse engine 103 provides the audio stream, along with speech analysis scores, information about one or more voice emotion parameters, and one or more voice emotion scores of user 125a associated with user device 115a, as input to a trained machine learning model. The trained machine learning model is iteratively applied to relevant portions of the audio stream, such as within a few seconds of receiving the audio stream.
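Iterative application of a model to successive portions of the audio stream can be sketched as below. The fixed-length portions, the `score_stream` and `iter_portions` helpers, and the peak-amplitude stand-in for the trained model are all assumptions. A real model would also take the speech analysis scores and voice emotion inputs described above, which this sketch omits.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def iter_portions(samples: Sequence[float], sample_rate: int,
                  portion_seconds: float) -> Iterator[Tuple[float, Sequence[float]]]:
    """Yield (start_time_seconds, portion) pairs covering the whole stream."""
    size = int(sample_rate * portion_seconds)
    if size <= 0:
        raise ValueError("portion must contain at least one sample")
    for start in range(0, len(samples), size):
        yield start / sample_rate, samples[start:start + size]


def score_stream(samples: Sequence[float], sample_rate: int, portion_seconds: float,
                 model: Callable[[Sequence[float]], float]) -> List[Tuple[float, float]]:
    """Apply the model once per portion; each iteration covers one portion."""
    return [(t, model(p)) for t, p in iter_portions(samples, sample_rate, portion_seconds)]


def peak_amplitude_model(portion: Sequence[float]) -> float:
    """Stand-in for a trained model: peak amplitude clipped to 1.0 (assumption)."""
    return min(1.0, max((abs(s) for s in portion), default=0.0))


if __name__ == "__main__":
    stream = [0.1] * 4 + [0.9] * 4 + [0.2] * 2
    print(score_stream(stream, sample_rate=4, portion_seconds=1.0,
                       model=peak_amplitude_model))
```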
[0024] The metaverse engine 103 uses a trained machine learning model to generate the harmfulness level of the audio stream as output. The metaverse engine 103 sends the audio stream to one or more other user devices 115n, wherein the transmission is performed to introduce a time delay into the audio stream based on the harmfulness level. In some embodiments, the metaverse engine 103 uses the time delay to identify harmful instances and replaces these instances with noise or silence before sending the audio stream to one or more other user devices 115n.
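Replacing identified harmful instances with noise or silence before the stream is forwarded can be sketched as follows. The sketch assumes harmful instances arrive as (start, end) times in seconds and that the audio is a list of float samples. The `mask_spans` name, the noise amplitude, and the fixed random seed are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def mask_spans(samples: Sequence[float], spans: Sequence[Tuple[float, float]],
               sample_rate: int, mode: str = "silence",
               noise_amplitude: float = 0.05, seed: int = 0) -> List[float]:
    """Return a copy of samples with each (start_s, end_s) span masked."""
    if mode not in ("silence", "noise"):
        raise ValueError("mode must be 'silence' or 'noise'")
    out = list(samples)
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    for start_s, end_s in spans:
        lo = max(0, round(start_s * sample_rate))
        hi = min(len(out), round(end_s * sample_rate))
        for i in range(lo, hi):
            if mode == "silence":
                out[i] = 0.0
            else:
                out[i] = rng.uniform(-noise_amplitude, noise_amplitude)
    return out


if __name__ == "__main__":
    audio = [0.5] * 10
    print(mask_spans(audio, [(0.2, 0.5)], sample_rate=10))
```

The masked copy has the same length as the input, so masking by itself adds no delay; the delay comes from the time needed to identify the spans.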
[0025] In some embodiments, the metaverse engine 103 is implemented using hardware including a central processing unit (CPU), a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), any other type of processor, or a combination thereof. In some embodiments, the metaverse engine 103 is implemented using a combination of hardware and software.
[0026] Database 199 may be a non-transitory computer-readable storage device (e.g., random access memory), a cache, a drive (e.g., a hard disk drive), a flash drive, a database system, or another type of component or device capable of storing data. Database 199 may also include multiple storage components (e.g., multiple drives or multiple databases) that may span multiple computing devices (e.g., multiple server computers). Database 199 may store data associated with the metaverse engine 103, such as training datasets for trained machine learning models, history and metadata associated with each user 125, etc.
[0027] User device 115 may be a computing device that includes memory, a hardware processor, and a camera. For example, user device 115 may include a mobile device, tablet computer, mobile phone, wearable device, head-mounted display, mobile email device, portable game console, portable music player, e-reader device, or other electronic device capable of accessing network 105 and capturing images with a camera.
[0028] User device 115a includes metaverse application 104a, and user device 115n includes metaverse application 104b. In some embodiments, user device 115a is a sending device, and user device 115n is a receiving device. In some embodiments, user 125a uses metaverse application 104a on the sending device to generate communication, such as an audio or video stream, and the communication is sent to metaverse engine 103. Once the communication is approved for transmission, metaverse engine 103 sends the communication to metaverse application 104b on the receiving device for user 125n to access.
[0029] In the illustrated embodiment, entities of environment 100 are communicatively coupled via network 105. Network 105 may include public networks (e.g., the Internet), private networks (e.g., local area networks (LANs) or wide area networks (WANs)), wired networks (e.g., Ethernet), wireless networks (e.g., 802.11 networks or wireless LANs (WLANs)), cellular networks (e.g., long-term evolution (LTE) networks), routers, hubs, switches, server computers, or combinations thereof. Although Figure 1 shows one network 105 coupled to server 101 and user devices 115, in practice one or more networks 105 may be coupled to these entities.
[0030] Example computing device 200
[0031] Figure 2 is a block diagram of an example computing device 200 that can be used to implement one or more features described herein. The computing device 200 can be any suitable computer system, server, or other electronic or hardware device. In some embodiments, the computing device 200 is server 101. In some embodiments, the computing device 200 is user device 115.
[0032] In some embodiments, computing device 200 includes a processor 235, memory 237, input/output (I/O) interface 239, microphone 241, speaker 243, display 245, and storage device 247. Depending on whether computing device 200 is server 101 or user device 115, some components of computing device 200 may be absent. For example, where computing device 200 is server 101, the computing device may not include microphone 241 and speaker 243. In some embodiments, computing device 200 includes additional components not shown in Figure 2.
[0033] Processor 235 can be coupled to bus 218 via signal line 222, memory 237 can be coupled to bus 218 via signal line 224, I/O interface 239 can be coupled to bus 218 via signal line 226, microphone 241 can be coupled to bus 218 via signal line 228, speaker 243 can be coupled to bus 218 via signal line 230, display 245 can be coupled to bus 218 via signal line 232, and storage device 247 can be coupled to bus 218 via signal line 234.
[0034] Processor 235 includes an arithmetic logic unit, a microprocessor, a general-purpose controller, or some other processor array to perform calculations and provide instructions to a display device. Processor 235 processes data and may include various computing architectures, including a complex instruction set computer (CISC) architecture, a reduced instruction set computer (RISC) architecture, or an architecture implementing a combination of instruction sets. Although Figure 2 shows a single processor 235, multiple processors 235 may be included. In different embodiments, processor 235 may be a single-core processor or a multi-core processor. Other processors (e.g., a graphics processing unit), an operating system, sensors, a display, and/or a physical configuration may be part of computing device 200.
[0035] Memory 237 stores instructions and/or data executable by processor 235. Instructions may include code and/or programs for performing the techniques described herein. Memory 237 may be a dynamic random access memory (DRAM) device, static RAM, or some other memory device. In some embodiments, memory 237 may also include non-volatile memory (e.g., static random access memory (SRAM) devices or flash memory), or similar permanent storage devices and media (including hard disk drives, compact disc read-only memory (CD-ROM) devices, DVD-ROM devices, DVD-RAM devices, DVD-RW devices, flash memory devices), or some other high-capacity storage devices for more permanent storage of information. Memory 237 includes code and programs operable to execute the metaverse engine 103, which will be described in more detail below.
[0036] I/O interface 239 can provide functionality that enables computing device 200 to interface with other systems and devices. Interface devices can be included as part of computing device 200 or can be standalone and communicate with computing device 200. For example, network communication devices, storage devices (e.g., memory 237 and/or storage device 247), and input/output devices can communicate via I/O interface 239. In another example, I/O interface 239 can receive data from server 101 and deliver data to metaverse engine 103 and components of metaverse engine 103, such as machine learning module 210. In some embodiments, I/O interface 239 can be connected to interface devices such as input devices (keyboard, pointing device, touchscreen, microphone 241, sensors, etc.) and/or output devices (display device, speaker 243, monitor, etc.).
[0037] Some examples of interface devices that can be connected to I/O interface 239 may include display 245, which may be used to display content (e.g., images, videos, and/or user interfaces of output applications as described herein) and to receive touch (or gesture) input from a user. Display 245 may include any suitable display device, such as a liquid crystal display (LCD), a light emitting diode (LED) display, a plasma display, a cathode ray tube (CRT), a television, a monitor, a touch screen, a 3D display, or other visual display device.
[0038] Microphone 241 includes hardware for detecting audio spoken by a person. Microphone 241 can send audio to metaverse engine 103 via I/O interface 239.
[0039] Speaker 243 includes hardware for generating audio for playback. For example, speaker 243 receives instructions from metaverse engine 103 to generate audio from another user after determining that the audio stream does not contain harmful instances. Speaker 243 translates the instructions into audio and generates audio for the user.
[0040] Storage device 247 stores data related to the metaverse engine 103. For example, storage device 247 may store training datasets for trained machine learning models, history and metadata associated with each user 125, etc. In embodiments where computing device 200 is server 101, storage device 247 is the same as database 199 in Figure 1.
[0041] Example Metaverse Engine 103 or Metaverse Application 104
[0042] Figure 2 shows a computing device 200 executing an example metaverse engine 103 or metaverse application 104. The computing device 200 includes a history module 202, a speech analyzer 204, a voice emotion analyzer 206, a text module 208, a machine learning module 210, a harmful module 212, and a user interface module 214. Although the modules are shown as part of the same metaverse engine 103 or metaverse application 104, those skilled in the art will recognize that the modules can be implemented by different computing devices 200. For example, the text module 208 could be part of a user device 115, providing analysis of text communications before they are sent to the metaverse engine 103, which is part of a server 101, in order to reduce the computational demands on the server 101.
[0043] The history module 202 generates historical information and metadata about users involved in the communication. In some embodiments, the history module 202 includes an instruction set executable by the processor 235 to generate the historical information and metadata. In some embodiments, the history module 202 is stored in the memory 237 of the computing device 200 and can be accessed and executed by the processor 235.
[0044] In some embodiments, after obtaining user permission, the history module 202 stores information about each communication session in the metaverse associated with the user and metadata associated with the user. Communication sessions may include audio streams, video streams, text communications, etc. After obtaining user permission, the history module 202 may store information about harmful instances associated with the user. For example, the history module 202 may identify when the user participated in a harmful instance, the harmful behavior they performed (e.g., using profanity, engaging in offensive behavior, bullying other users, threatening other users, etc.), and the specific user targeted by the harmful instance. In all cases where information about the user is stored, the history module 202 has obtained permission from the user, the user has been informed that they can delete the information, and the information is stored securely and in compliance with applicable regulations. Further details are discussed below with reference to the user interface module 214.
[0045] In some embodiments, after obtaining user permission, the history module 202 may store background information about harmful instances, such as specific experiences. For example, a user might use a lot of profanity while playing a violent shooting game, but might not exhibit harmful behavior in a non-violent role-playing game. The history module 202 may receive information about harmful instances from other modules, such as the speech analyzer 204 and the harmful module 212.
[0046] In some embodiments, after obtaining user permission, the history module 202 stores listener history and metadata about a user's reactions to harmful instances that another user directs at them. For example, the history module 202 may update the listener history and metadata to indicate whether a user is insensitive to offensive language, whether a user responds to offensive language with the same offensive language, whether a user reports another user after witnessing offensive behavior, etc. In another example, the history module 202 may track the number of times a particular user has been blocked and whether that particular user participated in events where other users who blocked that particular user were also present. In some embodiments, the history module 202 generates a sensitivity score that reflects a user's sensitivity to harmful instances using a scale (e.g., 3 out of 10, 0.9 out of 1, etc.).
[0047] In some embodiments, after obtaining user permission, the history module 202 stores metadata associated with the user. For example, metadata may include the user's place of residence, other demographic information (gender, gender identity, age, race, preferred pronouns, sexual orientation, etc.), one or more internet protocol (IP) addresses associated with the user, one or more languages spoken by the user, etc. In some embodiments, the history module 202 may combine metadata to characterize the user's reactions. For example, the history module 202 may identify that the user is generally insensitive to harmful instances in the metaverse unless the user is subjected to offensive language related to their religious beliefs, gender identity, race, etc.
[0048] The history module 202 can provide historical information and metadata to the machine learning module 210 as input to a trained machine learning model. The history module 202 can also provide historical information and metadata to other modules to provide context that influences the identification of harmful instances. For example, the speech analyzer 204 receives historical information and metadata because it uses different rules to identify harmful instances for users aged 13-16, 16-18, or over 18.
[0049] In some embodiments, the speech analyzer 204 analyzes speech during a communication session. In some embodiments, the speech analyzer 204 includes a set of instructions executable by the processor 235 to analyze speech during a communication session. In some embodiments, the speech analyzer 204 is stored in the memory 237 of the computing device 200 and can be accessed and executed by the processor 235.
[0050] Speech analyzer 204 receives an audio stream from a transmitting device. Because speech analysis can take several seconds, speech analyzer 204 performs continuous analysis of the speech in the audio stream. This analysis can be retrospective, meaning that speech analyzer 204 performs the analysis after the audio stream has been sent to the receiving device, or it can be performed each time a harmful instance is identified, regardless of whether the audio stream has been sent to the receiving device.
[0051] In some embodiments, the speech analyzer 204 includes a machine learning model trained to predict various attributes of the audio stream, such as vocal effort, speaking style, language, spoken activity, etc.
[0052] The speech analyzer 204 can perform automatic speech recognition (ASR), such as speech-to-text translation, and compare the translated text with a list of harmful words to identify harmful instances in the audio stream. The speech analyzer 204 can generate a speech analysis score for the user associated with the transmitting device. For example, the speech analyzer 204 can generate a speech analysis score based on the identification of harmful instances associated with the audio stream.
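Comparing recognized text against a list of harmful words can be sketched as below. The sketch assumes the ASR step has already produced words with start and end times; the `Word` type, the placeholder word list, and the choice to score the stream as the fraction of flagged words are all assumptions, since the disclosure does not define how the speech analysis score is computed.

```python
from dataclasses import dataclass
from typing import Iterable, List, Sequence


@dataclass
class Word:
    """One recognized word with its position in the audio stream (seconds)."""
    text: str
    start_s: float
    end_s: float


def find_harmful_instances(words: Sequence[Word],
                           harmful_words: Iterable[str]) -> List[Word]:
    """Return the recognized words that appear in the harmful-word list."""
    blocked = {w.lower() for w in harmful_words}
    # Strip common trailing punctuation before the case-insensitive lookup.
    return [w for w in words if w.text.lower().strip(".,!?") in blocked]


def speech_analysis_score(words: Sequence[Word],
                          harmful_words: Iterable[str]) -> float:
    """Fraction of recognized words that were flagged (illustrative scoring)."""
    if not words:
        return 0.0
    return len(find_harmful_instances(words, harmful_words)) / len(words)


if __name__ == "__main__":
    transcript = [Word("that", 0.0, 0.2), Word("Badword!", 0.2, 0.6),
                  Word("again", 0.6, 0.9), Word("ok", 0.9, 1.0)]
    print(find_harmful_instances(transcript, ["badword"]))
    print(speech_analysis_score(transcript, ["badword"]))
```

Keeping the timestamps on each flagged word is what lets a later step locate the instance in the audio stream.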
[0053] In some embodiments, the speech analyzer 204 generates a speech analysis score based on demographic information of a specific user. For example, the speech analyzer 204 applies different criteria for constituting a harmful instance based on whether the user is 13-16, 16-18, or 18 years or older (or 12-15, 15-18, or 18 years or older), the user's location, whether the audio stream is sent to users with different demographic information (e.g., when the audio stream is sent to a 13-year-old user, the audio stream may be identified as including harmful instances), or the type of game (e.g., shooting games versus puzzle games).
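Applying different criteria by age group can be sketched with a per-band threshold. The band boundaries follow the 13-16, 16-18, and 18-or-older grouping above, with each boundary age assigned to the older band; that assignment, the threshold values, and the rule of using the younger of speaker and listener are assumptions of this sketch.

```python
# Hypothetical thresholds: a lower value means stricter criteria.
THRESHOLDS = {"13-16": 0.2, "16-18": 0.4, "18+": 0.6}


def age_band(age: int) -> str:
    """Assign an age to a band; boundary ages go to the older band."""
    if age < 16:
        return "13-16"
    if age < 18:
        return "16-18"
    return "18+"


def is_harmful(raw_score: float, speaker_age: int, listener_age: int) -> bool:
    """Flag a raw score using the criteria of the younger participant."""
    band = age_band(min(speaker_age, listener_age))
    return raw_score >= THRESHOLDS[band]


if __name__ == "__main__":
    print(is_harmful(0.3, speaker_age=25, listener_age=13))
    print(is_harmful(0.3, speaker_age=25, listener_age=30))
```

The same raw score is flagged when the listener is 13 and not flagged when both users are adults, which mirrors the example of a stream sent to a 13-year-old user.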
[0054] In some embodiments, the speech analyzer 204 performs speech-to-text translation based on one or more languages spoken by the user. For example, the aquatic mammal called a "seal" in English is called a "phoque" in French, and knowing that the user speaks French keeps "phoque" from being mistaken for a harmful instance (i.e., the vulgar English word "fuck"). In some embodiments, the identification of the one or more languages spoken by the user is received as part of the metadata determined by the history module 202.
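The effect of language metadata on the "phoque" example can be sketched as follows. The sketch assumes a hypothetical ASR step that returns one candidate transcription per language for the same sound, and per-language harmful-word lists. The `flag_word` function, the language codes, and the data layout are illustrative assumptions.

```python
from typing import Dict, Sequence, Set


def flag_word(candidates: Dict[str, str], spoken_languages: Sequence[str],
              harmful_by_language: Dict[str, Set[str]]) -> bool:
    """Check a sound against the word list of the user's own language.

    candidates maps a language code to that language's transcription of the
    same sound. The first spoken language with a candidate is used.
    """
    for lang in spoken_languages:
        text = candidates.get(lang)
        if text is not None:
            return text.lower() in harmful_by_language.get(lang, set())
    return False  # no transcription available for any spoken language


if __name__ == "__main__":
    sound = {"en": "fuck", "fr": "phoque"}
    lists = {"en": {"fuck"}, "fr": set()}
    print(flag_word(sound, ["fr"], lists))  # French speaker
    print(flag_word(sound, ["en"], lists))  # English speaker
```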
[0055] In some embodiments, the speech analyzer 204 periodically provides speech analysis (e.g., speech analysis scores) to the machine learning module 210 as input to a trained machine learning model. The speech analysis scores may be associated with timestamps, aligning them with positions within the audio stream. In some embodiments, the speech analyzer 204 sends the speech analysis score to the machine learning module 210 whenever it identifies a harmful instance in the audio stream and updates the speech analysis score reflecting that identified harmful instance.
[0056] Voice emotion analyzer 206 analyzes emotions in an audio stream. In some embodiments, voice emotion analyzer 206 includes a set of instructions executable by processor 235 to analyze emotions in the audio stream. In some embodiments, voice emotion analyzer 206 is stored in memory 237 of computing device 200 and can be accessed and executed by processor 235.
[0057] In some embodiments, the voice emotion analyzer 206 identifies different speakers in an audio stream and associates each speaker with a corresponding audio identifier. The voice emotion analyzer 206 analyzes multiple voice parameters, such as one or more of the following: tone, pitch, and vocal intensity level for each user in the audio stream. In some embodiments, the voice emotion analyzer 206 analyzes tone by determining the positivity and energy of the user's voice. For example, the voice emotion analyzer 206 detects whether the user sounds excited, annoyed, neutral, sad, etc. In some embodiments, the voice emotion analyzer 206 determines the speaker's emotional state based on an emotion quadrant. The emotion quadrant includes four states: tension, happiness, anger, and sadness. The voice emotion analyzer 206 can detect emotions using transformer-based techniques, such as using wav2vec2.0 as part of the front end.
[0058] In some embodiments, the voice emotion analyzer 206 analyzes pitches ranging from 60 Hz to 2 kHz. In some embodiments, the voice emotion analyzer 206 determines the fundamental frequency of the sound and the range of pitches appearing in the audio stream.
[0059] In some embodiments, the voice emotion analyzer 206 analyzes voice intensity levels by determining a noise level and comparing it to a predetermined description of voice intensity. For example, the voice emotion analyzer 206 may determine that a person whispering produces a voice intensity of 20-30 dB, a person speaking softly produces a voice intensity of 30-55 dB, a person speaking at an average level produces a voice intensity of 55-65 dB, a person speaking loudly or shouting produces a voice intensity of 65-80 dB, and a person screaming produces a voice intensity of 80-120 dB. In some embodiments, the voice emotion analyzer 206 also identifies whether the voice intensity level increases over time, as this may indicate that the conversation is escalating into a potentially harassing argument.
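The intensity bands listed above can be expressed as a small lookup. The following is a minimal sketch, not the analyzer's actual implementation: the function names, the handling of boundary values such as exactly 55 dB, and the escalation rule (second half of a window louder than the first half by a hypothetical `min_rise_db`) are all assumptions made for illustration.

```python
from typing import List

# Bands taken from the paragraph above: (lower bound in dB, label).
# Assigning a boundary value (e.g. exactly 55 dB) to the louder band is an assumption.
INTENSITY_BANDS = [
    (80.0, "screaming"),
    (65.0, "loud_or_shouting"),
    (55.0, "average"),
    (30.0, "soft"),
    (20.0, "whisper"),
]


def classify_intensity(level_db: float) -> str:
    """Map a measured voice intensity in dB to a descriptive band."""
    if level_db > 120.0:
        return "out_of_range"
    for lower_bound, label in INTENSITY_BANDS:
        if level_db >= lower_bound:
            return label
    return "below_whisper"


def is_escalating(levels_db: List[float], min_rise_db: float = 10.0) -> bool:
    """Hypothetical escalation check over a window of intensity readings.

    Returns True when the mean of the later half of the window exceeds
    the mean of the earlier half by at least min_rise_db.
    """
    if len(levels_db) < 4:
        return False
    half = len(levels_db) // 2
    early = sum(levels_db[:half]) / half
    late = sum(levels_db[half:]) / (len(levels_db) - half)
    return late - early >= min_rise_db
```

For instance, `classify_intensity(60.0)` falls in the average band, and a window such as `[50, 52, 66, 70]` would be flagged as escalating under the assumed 10 dB rise.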
[0060] In some embodiments, the voice emotion analyzer 206 generates a voice emotion score for a user associated with an audio stream. In some embodiments, the voice emotion analyzer 206 generates a separate score for each of the following: tone, pitch, and vocal intensity level, determined based on one or more previous audio streams from the transmitting device. In some embodiments, the voice emotion analyzer 206 generates a voice emotion score that is a combination of tone, pitch, and vocal intensity. For example, because a user may not shout when angry, it may not be apparent that the user is angry unless an angry tone, large pitch changes, and a low vocal intensity level occur in combination.
[0061] Regardless of whether the user uses voice modulation software, the voice emotion analyzer 206 can analyze the emotions in the audio stream. For example, when a user selects voice modulation software to make themselves sound like a popular cartoon character, the voice emotion analyzer 206 detects that voice modulation is occurring and performs emotion analysis regardless of how the voice is modulated.
[0062] Voice emotion analyzer 206 can periodically generate one or more voice emotion scores and send information about voice emotion parameters, along with the one or more voice emotion scores, to machine learning module 210. For example, whenever speech analyzer 204 identifies a harmful instance in the audio stream and updates the speech analysis score to reflect the identified harmful instance, voice emotion analyzer 206 sends information about voice emotion parameters and one or more voice emotion scores to machine learning module 210. In another example, whenever a voice emotion parameter changes, such as when tone, pitch, or vocal intensity level changes, or when the change in tone, pitch, or vocal intensity level exceeds one or more predetermined thresholds (e.g., when a user changes from a vocal intensity level associated with normal speech to a vocal intensity level associated with shouting), voice emotion analyzer 206 sends information about the voice emotion parameters and one or more voice emotion scores to machine learning module 210.
[0063] In some embodiments, the voice emotion analyzer 206 may be stored on a transmitting device and perform rapid analysis of voice emotions. For example, the transmitting device may include a voice emotion analyzer 206 with a tone detection accuracy of 60%. The voice emotion analyzer 206 on the transmitting device may send information to a voice emotion analyzer 206 on server 101, which performs a more detailed analysis of the voice emotions.
[0064] Text module 208 analyzes text from a text channel. In some embodiments, text module 208 includes an instruction set executable by processor 235 to analyze the text. In some embodiments, text module 208 is stored in memory 237 of computing device 200 and can be accessed and executed by processor 235.
[0065] In some embodiments, a user can simultaneously participate in an audio stream and send text messages using a separate text channel. For example, during a video call, a user might be giving a presentation and also adding additional information to a chat box associated with the video call. In another example, a user might participate in a game via an audio stream while simultaneously sending direct text messages to specific users through the game software. Analyzing the audio stream and text messages can be helpful, as some users might be polite in the audio stream but then verbally abuse specific members via private messages during gameplay.
[0066] In some embodiments, text module 208 compares text with a list of harmful words and identifies harmful instances in the text. Text module 208 may generate a text score indicating the harmfulness level of the text based on the text message. Text module 208 may periodically generate text scores and send the text scores to machine learning module 210. In some embodiments, text module 208 sends the text score whenever a harmful instance in the text is identified and therefore the text module 208 updates the text score.
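The list comparison described above can be sketched as follows. This is an illustration under stated assumptions: the tokenizer (lowercase Latin letters only), the function names, and the scoring rule (fraction of tokens that are on the list) are hypothetical, and the blocklist entries are mild placeholders rather than a real moderation list.

```python
import re
from typing import Iterable, List, Tuple


def tokenize(text: str) -> List[str]:
    """Split text into lowercase word tokens (Latin letters only, an assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def find_harmful_instances(text: str, blocklist: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (token index, token) for every token that appears on the blocklist."""
    blocked = {word.lower() for word in blocklist}
    return [(i, tok) for i, tok in enumerate(tokenize(text)) if tok in blocked]


def text_score(text: str, blocklist: Iterable[str]) -> float:
    """Hypothetical text score: fraction of tokens that are on the blocklist."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(find_harmful_instances(text, blocklist)) / len(tokens)
```

A plain word list cannot capture context or sarcasm, which is one reason the text score is only one of several inputs to the machine learning model.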
[0067] In some embodiments, text module 208 removes harmful instances from text before sending the text to a receiving device. Text module 208 may simply remove harmful instances, such as a first user threatening another user. Alternatively, text module 208 may include a warning and an explanation of why harmful instances were removed from the text.
[0068] In some embodiments, text module 208 is stored on a sending device as part of metaverse application 104a, and text is analyzed on the sending device to save computing resources of server 101. Text module 208 on the sending device can send text scores to metaverse engine 103 as input to machine learning module 210.
[0069] Machine learning module 210 trains a machine learning model (or multiple models) to output the harmfulness level of an audio stream. In some embodiments, machine learning module 210 includes a set of instructions executable by processor 235 to train the machine learning model to output the harmfulness level of the audio stream. In some embodiments, machine learning module 210 is stored in memory 237 of computing device 200 and can be accessed and executed by processor 235.
[0070] In some embodiments, machine learning module 210 obtains a training dataset with audio streams that are manually labeled and paired with outputs from one or more of the following: history module 202, speech analyzer 204, voice emotion analyzer 206, and text module 208. In some embodiments, the manual labels identify harmful instances in the audio streams and metadata such as language, emotional state, vocal intensity, etc. For example, the training dataset may include audio streams paired with the following outputs: output from history module 202, including one or more of the following: harmful history of a first user (i.e., a speaker), speaker history and metadata associated with the first user, and listener history and metadata associated with a second user; output from speech analyzer 204, including periodically sent speech analysis scores of the first user; output from voice emotion analyzer 206, including periodically sent voice emotion scores of the first user, and in some embodiments, separate voice emotion scores for tone, pitch, and vocal intensity level; and output from text module 208, including text scores.
[0071] In some embodiments, the training dataset also includes automatically labeled audio streams that have been preprocessed by offline detection of harmful content. The labels may include timestamps indicating the locations of harmful instances within the audio streams.
[0072] In some embodiments, the training dataset also includes synthetic audio streams for speech-to-text translation, wherein the synthetic audio streams comprise a large corpus of both harmful and harmless speech. The synthetic audio streams may be labeled to include both harmful and harmless speech and may include timestamps detailing the locations where harmful instances occurred.
[0073] In some embodiments, the training dataset also includes audio streams augmented with different parameters to help train a machine learning model to output the harmfulness of the audio stream under different conditions. For example, the training dataset can be augmented with audio streams that include variations in the speaker's pitch, noise, codecs, echoes, background noise (e.g., traffic, natural sounds, slurred speech, etc.), music, and playback speed.
[0074] The machine learning module 210 trains a machine learning model using a training dataset in a supervised learning manner. The training dataset includes audio stream examples without harmful content and audio stream examples with one or more harmful instances. This allows the machine learning module 210 to use the distinction between harmful and harmless activities as labels during supervised learning to train the machine learning model to classify the input audio stream as including harmful or harmless activities.
[0075] In some embodiments, the training data for the machine learning model includes audio streams collected with user permission for training purposes and labeled by a human reviewer. For example, the human reviewer listens to the audio streams in the training data and identifies whether each audio stream contains harmful instances, and if so, timestamps the location within the audio stream where the harmful instance occurred. This manually generated data is referred to as ground truth labels. The model is then trained using this training data; for example, the model being trained generates labels for each audio stream in the training data and compares them to the ground truth labels, updating one or more model parameters using a feedback function based on the comparison results.
[0076] In some embodiments, the training data for the machine learning model also includes video streams. The training dataset can be labeled to include video stream examples without offensive behavior and video stream examples with one or more offensive behaviors. This allows the machine learning module 210 to use the distinction between offensive and non-offensive behaviors as labels during supervised learning to train the machine learning model to classify input video streams as containing or not containing offensive behavior. For example, a training dataset with offensive behavior may include a set of actions corresponding to offensive behaviors, such as lip movements forming profanity, arm movements within a threshold time period forming actions associated with profanity, and actions that are precursors to offensive behavior, such as an arm starting to move in a specific way that could lead to offensive behavior. The video stream may include videos of the user or videos of the user's virtual avatar.
[0077] In some embodiments, the machine learning model is a deep neural network. Types of deep neural networks include convolutional neural networks, deep belief networks, stacked autoencoders, generative adversarial networks, variational autoencoders, flow models, recurrent neural networks, and attention-based models. Deep neural networks use multiple layers to progressively extract higher-level features from the original input, where the input to a layer is different types of features extracted from other modules, and the output is a decision on whether to perform management.
[0078] Machine learning module 210 can generate layers that recognize increasingly detailed features and patterns within the speech of an audio stream, where the output of one layer is used as input to subsequent, more detailed layers until the final output is the degree of harm of the audio stream. An example of different layers in a deep neural network could include token embeddings, segment embeddings, and positional embeddings.
[0079] In some embodiments, the machine learning module 210 trains a machine learning model using a backpropagation algorithm. The backpropagation algorithm modifies the internal weights of the input signal based on feedback (e.g., at each node / layer of a multi-layer neural network), which can be a function of the output label produced by the model being trained (e.g., "this part of the audio stream is level 1 harmful") and the ground truth label included in the training data (e.g., "this part of the audio stream is level 2 harmful"). This weight adjustment can improve the accuracy of the model being trained.
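The feedback loop described above (compare the model's output label with the ground truth label, then adjust the weights) can be illustrated with a deliberately small stand-in. The sketch below trains a single logistic unit by gradient descent, which is the one-layer case of the weight update that backpropagation applies layer by layer; it is not the deep network of the embodiments. The two input features (a speech analysis score and a voice emotion score), the binary harmful/harmless labels, the toy data, and the learning rate are all assumptions.

```python
import math
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], int]  # (features, ground truth label 0 or 1)

# Toy data, invented for illustration: (speech score, emotion score) -> label.
DATA: List[Example] = [
    ((0.9, 0.8), 1), ((0.8, 0.9), 1), ((0.7, 0.95), 1),
    ((0.1, 0.2), 0), ((0.2, 0.1), 0), ((0.05, 0.3), 0),
]


def predict(weights: Sequence[float], bias: float, features: Sequence[float]) -> float:
    """Predicted probability that the example is harmful."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def mean_loss(weights: Sequence[float], bias: float, data: List[Example]) -> float:
    """Mean cross-entropy between predictions and ground truth labels."""
    eps = 1e-12
    total = 0.0
    for features, label in data:
        p = predict(weights, bias, features)
        total -= label * math.log(p + eps) + (1 - label) * math.log(1.0 - p + eps)
    return total / len(data)


def train(data: List[Example], epochs: int = 2000, lr: float = 0.5):
    """Full-batch gradient descent; the error term is (prediction - label)."""
    n_features = len(data[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, label in data:
            error = predict(weights, bias, features) - label
            for i, x in enumerate(features):
                grad_w[i] += error * x
            grad_b += error
        # Move each weight against its gradient, averaged over the dataset.
        weights = [w - lr * g / len(data) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(data)
    return weights, bias
```

In a multi-layer network the same error signal would be propagated backward through each layer using the chain rule; a real system would normally rely on an automatic differentiation library rather than hand-written gradients.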
[0080] After the machine learning module 210 trains the machine learning model, the trained machine learning model receives the following inputs: a speech analysis score of a first user associated with the transmitting device from the speech analyzer 204, information on one or more voice emotion parameters and one or more voice emotion scores of the first user from the voice emotion analyzer 206, and an audio stream from the transmitting device. The information on the one or more voice emotion parameters may include information about the user's tone, pitch, and vocal intensity level in the audio stream.
[0081] In some embodiments, the machine learning module 210 also receives harmful history of a first user from the history module 202, speaker history and metadata associated with the first user, and listener history and metadata associated with a second user associated with the receiving device. The metadata can be used to identify whether a speaker is likely to violate community guidelines. For example, a banned user might create a new user profile, but the metadata includes indicators that it is the same user, such as IP address and demographic information. In some embodiments, the machine learning module 210 also receives a text score indicating the harmfulness level of the text from the text module 208. In some embodiments, the trained machine learning model periodically receives a speech analysis score, information about one or more voice emotion parameters, and one or more voice emotion scores. In some embodiments, the trained machine learning model continuously receives an audio stream, and the trained machine learning model is iteratively applied to the audio stream, where each iteration corresponds to a respective portion of the audio stream. For example, the trained machine learning model may be applied to the audio stream every 1 second, every 2 seconds, every 0.5 seconds, etc.
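Iteratively applying a model to successive portions of a stream can be sketched as a windowing loop. This is a minimal illustration that assumes the audio is available as a sequence of float samples and that windows do not overlap; `mean_abs` is a hypothetical stand-in for the trained model, not the model itself.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def iter_windows(
    samples: Sequence[float], sample_rate: int, window_seconds: float
) -> Iterator[Tuple[float, Sequence[float]]]:
    """Yield (start time in seconds, window) for consecutive non-overlapping windows."""
    window_size = int(sample_rate * window_seconds)
    if window_size <= 0:
        raise ValueError("window must contain at least one sample")
    for start in range(0, len(samples), window_size):
        yield start / sample_rate, samples[start:start + window_size]


def score_stream(
    samples: Sequence[float],
    sample_rate: int,
    window_seconds: float,
    model: Callable[[Sequence[float]], float],
) -> List[Tuple[float, float]]:
    """Apply the model once per window and keep the timestamped outputs."""
    return [(t, model(w)) for t, w in iter_windows(samples, sample_rate, window_seconds)]


def mean_abs(window: Sequence[float]) -> float:
    """Hypothetical placeholder model: mean absolute amplitude of the window."""
    return sum(abs(x) for x in window) / len(window)
```

Shorter windows (e.g., 0.5 seconds) give fresher outputs at the cost of more model invocations; the timestamps let downstream modules align each output with its position in the stream.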
[0082] The trained machine learning model generates the harmfulness level of the audio stream as output. The harmfulness level of the audio stream reflects its current harmfulness and is a prediction of how harmful the audio stream may become.
[0083] In some embodiments, the machine learning module 210 sends the harmfulness level of the audio stream to the harmfulness module 212. The machine learning module 210 can provide the harmfulness level each time the trained machine learning model generates a harmfulness level.
[0084] The harmfulness module 212 introduces a time delay into the audio stream based on the harmfulness level and analyzes the audio stream for harmful instances. In some embodiments, the harmfulness module 212 includes a set of instructions executable by the processor 235 to introduce the time delay and analyze the audio stream for harmful instances. In some embodiments, the harmfulness module 212 is stored in the memory 237 of the computing device 200 and can be accessed and executed by the processor 235.
[0085] In some embodiments, the harmfulness module 212 receives the harmfulness level of the audio stream from the machine learning module 210. The harmfulness level can correspond to the time delay in sending the audio stream to the receiving device. For example, a level of 0 can indicate that there is no possibility of harmfulness in the audio stream, and the audio stream can be sent without time delay.
[0086] In some embodiments, the harmfulness module 212 determines whether to introduce a time delay into the transmitted audio stream to allow sufficient time for analyzing the audio stream for harmful instances. For example, the time delay could be 0 to 5 seconds. In some embodiments, if the harmfulness level is below a minimum threshold, the time delay is zero seconds, and the harmfulness module 212 sends the audio stream to the receiving device. In some embodiments, if the harmfulness level exceeds the minimum threshold, the harmfulness module 212 determines the amount of delay to apply, with the delay increasing as the harmfulness level increases. In some embodiments, the harmfulness module 212 also applies a more stringent level of censorship when the harmfulness level indicates that the audio stream is more likely to be harmful. Higher transmission delay is a negative feedback mechanism that, even without additional management, can suppress harmful behavior by slowing down interaction or preventing effective communication. In some embodiments, the harmfulness level may be high enough that the harmfulness module 212 mutes the user. For example, if a user uses profanity every other word, it might be easier to simply mute the audio stream until the user's offensive language ends.
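The threshold logic above can be sketched as a mapping from the model's output level to a delay decision. The following assumes the level is normalized to the range 0 to 1 and that the delay grows linearly between the thresholds; the threshold values, the linear ramp, and the names `DelayDecision` and `decide_delay` are hypothetical. Only the 0 to 5 second range comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DelayDecision:
    delay_seconds: float
    mute: bool


def decide_delay(
    level: float,
    min_threshold: float = 0.2,    # assumed value
    mute_threshold: float = 0.9,   # assumed value
    max_delay_seconds: float = 5.0,
) -> DelayDecision:
    """Map a normalized level (0..1) to a transmission delay or a mute decision."""
    if not 0.0 <= level <= 1.0:
        raise ValueError("level must be between 0 and 1")
    if level >= mute_threshold:
        # Level is high enough that the stream is muted rather than delayed further.
        return DelayDecision(max_delay_seconds, True)
    if level < min_threshold:
        # Below the minimum threshold the stream is forwarded without delay.
        return DelayDecision(0.0, False)
    # Between the thresholds the delay ramps linearly from 0 to the maximum.
    fraction = (level - min_threshold) / (mute_threshold - min_threshold)
    return DelayDecision(fraction * max_delay_seconds, False)
```

A linear ramp is only one option; a stepped schedule or a learned mapping would fit the same description.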
[0087] In some embodiments, the harmfulness module 212 performs speech-to-text translation of the audio stream, or receives the translation from one of the other modules. In some embodiments, the harmfulness module 212 performs speech-to-text translation after the speaker has finished speaking a sentence. In some embodiments, the harmfulness module 212 identifies harmful instances in the audio stream without first converting the audio stream to text.
[0088] The harmfulness module 212 identifies harmful instances in the audio stream. In some embodiments, the harmfulness module 212 adjusts the time delay of the audio stream transmission gradually, to avoid perceptible audio distortion, by adjusting the gaps between words or sentences. For example, the harmfulness module 212 lengthens gaps where silence or pauses are already present, so that the added delay is less noticeable. Pauses and silences can be detected in less than 100 ms, so the time required for the harmfulness module 212 to identify pauses and silences is less than the time required to wait for the end of a sentence. In some embodiments, the harmfulness module 212 tracks the timestamps of all audio signals / data packets to help add gaps in the correct locations and make the transition between speech and silence more seamless.
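One way to picture gradual delay adjustment is to lengthen existing pauses a little at a time instead of inserting one long gap. The sketch below assumes the stream has already been split into frames tagged as speech or silence; the frame representation, the per-pause cap, and the function name `stretch_pauses` are hypothetical.

```python
from typing import List, Tuple

Frame = Tuple[bytes, bool]  # (payload, is_silence)


def stretch_pauses(
    frames: List[Frame],
    extra_frames_needed: int,
    max_extra_per_pause: int = 2,
    silence_payload: bytes = b"\x00",
) -> Tuple[List[Frame], int]:
    """Insert silent frames only inside existing pauses.

    Returns the new frame list and the number of frames still to be
    added later (when the available pauses were too short).
    """
    out: List[Frame] = []
    remaining = extra_frames_needed
    added_in_current_pause = 0
    for payload, is_silence in frames:
        out.append((payload, is_silence))
        if not is_silence:
            # Speech frames pass through untouched and end the current pause.
            added_in_current_pause = 0
            continue
        if remaining > 0 and added_in_current_pause < max_extra_per_pause:
            out.append((silence_payload, True))
            remaining -= 1
            added_in_current_pause += 1
    return out, remaining
```

Capping the extra frames per pause spreads the added delay across several pauses, so no single gap grows conspicuously long.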
[0089] In some embodiments, the harmfulness module 212 replaces harmful instances in the audio stream with noise, or censors harmful instances by replacing them with silence.
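Replacing a flagged span with silence or noise can be sketched on a list of samples. This is an illustration only: it assumes float samples, spans given as start and end times in seconds, and uniform random noise with a hypothetical `noise_amplitude`; a production system would operate on encoded audio frames.

```python
import random
from typing import List, Sequence, Tuple


def mask_spans(
    samples: Sequence[float],
    sample_rate: int,
    spans: Sequence[Tuple[float, float]],
    mode: str = "silence",
    noise_amplitude: float = 0.1,
    seed: int = 0,
) -> List[float]:
    """Return a copy of samples with each (start_s, end_s) span replaced."""
    if mode not in ("silence", "noise"):
        raise ValueError("mode must be 'silence' or 'noise'")
    out = list(samples)
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    for start_s, end_s in spans:
        start = max(0, int(start_s * sample_rate))
        end = min(len(out), int(end_s * sample_rate))
        for i in range(start, end):
            if mode == "silence":
                out[i] = 0.0
            else:
                out[i] = rng.uniform(-noise_amplitude, noise_amplitude)
    return out
```

The stream length is unchanged, so the replacement does not shift timestamps for anything that follows the masked span.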
[0090] In some embodiments, the audio stream is part of a visual signal. The visual signal can be animation, such as virtual character animation or physical animation, or it can be a video stream, where the audio stream is part of the video stream. The visual signal is presented to one or more other users, all of whom participate in a metaverse, where each user's representation (e.g., their virtual character) resides in the same region of the metaverse, such that each virtual character can see other virtual characters during user interaction. Where the harmfulness module 212 introduces a time delay in the audio transmission, the harmfulness module 212 synchronizes the time delay with the visual signal so that the visual signal has the same time delay as the audio stream.
[0091] In some embodiments, the harmfulness module 212 analyzes the video stream for harmful instances. In response to identifying a harmful instance in the audio stream, the harmfulness module 212 can analyze the video stream for offensive behavior occurring within a predetermined time period of the harmful instance. For example, turning to Figure 3, an example user interface 300 of a video stream is shown in which the harmfulness module 212 has identified a harmful instance in the audio stream. The harmfulness module 212 performs image recognition on the video stream of the user's virtual avatar and identifies locations within the video stream where the speaker's mouth moves to form words corresponding to harmful instances in the audio stream. The harmfulness module 212 instructs the user interface module 214 to overlay a graphic 305 over the speaker's mouth when the speaker performs an offensive act. Because the graphic 305 draws attention to the offensive act, alternative mitigation actions are also possible, such as adding a blurring effect to the mouth, or even replacing the mouth with pixels that match the background.
[0092] In some embodiments, the harmfulness module 212 performs motion detection and / or object detection on the video to identify offensive behavior. In response to identifying offensive behavior, the harmfulness module 212 blurs the offensive behavior in the video stream or replaces it with pixels that match the background. In some embodiments, the harmfulness module 212 may analyze the user's virtual avatar while performing motion detection and / or object detection to determine whether the virtual avatar appears agitated, and use the avatar's appearance as a signal that the user may be performing offensive behavior.
[0093] Turning to Figure 4, another example user interface 400 of a video stream is shown in which offensive behavior is identified. In this example, the harmfulness module 212 determines that the user is about to commit offensive behavior and that the user appears angry. The harmfulness module 212 generates a mask 405 that replaces the pixels associated with the hand with pixels from the background to make the offensive behavior invisible.
[0094] User interface module 214 generates a user interface. In some embodiments, user interface module 214 includes an instruction set that can be executed by processor 235 to generate the user interface. In some embodiments, user interface module 214 is stored in memory 237 of computing device 200 and can be accessed and executed by processor 235.
[0095] User interface module 214 generates a user interface for user 125 associated with user device 115. The user interface can be used to initiate audio communication with other users, participate in games within the metaverse, send text messages to other users, initiate video communication with other users, etc. In some embodiments, the user interface includes options for adding user preferences, such as the ability to block other users 125.
[0096] In some embodiments, before the user engages with the metaverse, the user interface module 214 generates a user interface that includes information about how user information is collected, stored, and analyzed. For example, the user interface requests permission from the user to use any information associated with the user. The user is informed that they can delete their user information and that they have the option to select the types of information available for different purposes. The use of information complies with applicable regulations, and the data is securely stored. Data collection is not performed in certain locations or for certain user categories (e.g., based on age or other demographic information), data collection is temporary (i.e., the data is discarded after a period of time), and the data is not shared with third parties. Some data may be anonymized, aggregated across users, or otherwise modified to make it impossible to identify a particular user.
[0097] In some embodiments, the user interface module 214 provides a user interface that explains to the user that the Metaverse Engine 103 automatically detects harmful instances and can store the audio or video of the harmful instances in association with the user's account.
[0098] Example Method
[0099] Figure 5 is an example flowchart 500 for identifying harmful instances in communications. In flowchart 500, thick lines represent data streams including audio streams, and thin lines represent data streams of information.
[0100] Flowchart 500 includes audio and text communication from transmitting device 505 to receiving device 510. Transmitting device 505 receives audio input via a microphone and performs analog-to-digital conversion of the audio stream, compresses the audio stream, and sends the audio stream to real-time server 515. Real-time server 515 sends the audio stream to a constant-length multisecond buffer 520, which in turn sends the audio stream to module 525, which performs continuous backtracking speech analysis. Continuous backtracking speech analysis 525 is not performed in real time to allow sufficient time to improve the accuracy of the analysis. Real-time server 515 sends the audio stream to stream selection / mute / noise module 555 so that it can be forwarded to receiving device 510 if machine learning module 530 determines that the audio stream requires no delay. If machine learning module 530 determines that the audio stream requires delay, real-time server 515 sends the audio stream to adjustable multisecond buffer 545.
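A constant-length buffer such as buffer 520 behaves like a fixed delay line: each frame leaves only after a set number of later frames have arrived. The sketch below is a minimal illustration that measures the delay in frames; the class name `DelayBuffer` and the frame-count interface are assumptions, not the structure used in the embodiments.

```python
from collections import deque
from typing import Any, Deque, Optional


class DelayBuffer:
    """Fixed-length delay line measured in frames."""

    def __init__(self, delay_frames: int) -> None:
        if delay_frames < 0:
            raise ValueError("delay_frames must be non-negative")
        self._delay = delay_frames
        self._frames: Deque[Any] = deque()

    def push(self, frame: Any) -> Optional[Any]:
        """Add a frame; return the frame that is now due, or None while filling."""
        self._frames.append(frame)
        if len(self._frames) > self._delay:
            return self._frames.popleft()
        return None

    def __len__(self) -> int:
        return len(self._frames)
```

With 20 ms frames, for example, a 2-second buffer would hold 100 frames; while the frames wait in the buffer, a slower and more accurate analysis has time to run over them.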
[0101] The audio stream is also sent to module 535, which performs voice emotion analysis. The voice emotion analysis is then sent as input to machine learning module 530.
[0102] The transmitting device 505 also receives text input via a keyboard. The transmitting device 505 performs text encoding and sends the text to the text management module 540. The text management module 540 sends the text to the receiving device 510 and sends the text as input to the machine learning module 530.
[0103] The machine learning module 530 also receives harmful history for each game, speaker history and metadata, and listener history and metadata as input.
[0104] Machine learning module 530 predicts the likelihood of an unwanted event and determines the duration of the audio buffer. If there is no likelihood of an unwanted event, the audio stream has no delay, and machine learning module 530 sends the audio stream directly to receiving device 510. If there is a likelihood of an unwanted event, machine learning module 530 sends the audio stream to adjustable multi-second buffer 545. Adjustable multi-second buffer 545 sends the audio stream to module 550, which detects actual unwanted events. Module 550 sends unwanted events to module 555, which determines whether to select the stream, mute the stream, or add noise to the stream. Stream selector / mute / noise generator module 555 sends the audio stream to receiving device 510.
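The routing in flowchart 500 can be summarized as a two-stage decision: a predicted likelihood selects the direct or the buffered path, and on the buffered path a detector decides whether the chunk is forwarded, muted, or replaced with noise. The sketch below is a simplification under stated assumptions: the names `Action` and `route_chunk`, the zero likelihood threshold, and the `prefer_noise` switch are hypothetical, and `detect_event` stands in for module 550.

```python
from enum import Enum
from typing import Callable, Tuple


class Action(Enum):
    PASS_THROUGH = "pass_through"
    MUTE = "mute"
    NOISE = "noise"


def route_chunk(
    chunk: bytes,
    event_likelihood: float,
    detect_event: Callable[[bytes], bool],
    likelihood_threshold: float = 0.0,
    prefer_noise: bool = False,
) -> Tuple[str, Action]:
    """Return (path taken, action applied before forwarding to the receiver)."""
    if event_likelihood <= likelihood_threshold:
        # No predicted likelihood of an unwanted event: forward without delay.
        return "direct", Action.PASS_THROUGH
    # Otherwise the chunk waits in the adjustable buffer while the detector runs.
    if detect_event(chunk):
        return "buffered", (Action.NOISE if prefer_noise else Action.MUTE)
    return "buffered", Action.PASS_THROUGH
```

Keeping the prediction step (cheap, early) separate from the detection step (slower, on buffered audio) is what lets most traffic avoid any added latency.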
[0105] Figure 6 is another example flowchart 600 for identifying harmful instances in communications according to some embodiments described herein. In some embodiments, the metaverse engine 103 is stored on server 101. In some embodiments, the metaverse engine 103 is stored on user device 115. In some embodiments, the metaverse engine 103 is partially stored on server 101 and partially stored on user device 115.
[0106] Method 600 may begin at block 602. In block 602, an audio stream is received from the transmitting device. Block 602 may be followed by block 604.
[0107] In block 604, input is provided to a trained machine learning model, including the audio stream, a speech analysis score, information about one or more voice emotion parameters, and one or more voice emotion scores of a first user associated with the transmitting device. The trained machine learning model is iteratively applied to portions of the audio stream, where each iteration corresponds to a respective portion of the audio stream. Block 604 may be followed by block 606.
[0108] In block 606, the trained machine learning model generates the harmfulness level of the audio stream as output. Block 606 may be followed by block 608.
[0109] In block 608, the audio stream is sent to the receiving device. The transmission introduces a time delay into the audio stream based on the harmfulness level.
[0110] The methods, blocks, and / or operations described herein may be performed in a different order than those shown or described, and / or, where appropriate, performed simultaneously (partially or completely) with other blocks or operations. Some blocks or operations may be performed on a portion of data and later, for example, on another portion of data. Not all described blocks and operations need to be performed in all implementations. In some implementations, blocks and operations may be performed multiple times in a different order and / or at different times within a method.
[0111] The various embodiments described herein include acquiring data from various sensors in the physical environment, analyzing such data, generating recommendations, and providing a user interface. Data collection is conducted only with the specific user's permission and in accordance with applicable regulations. Data storage complies with applicable regulations, including anonymizing or otherwise modifying data to protect user privacy. Users are provided with clear information about data collection, storage, and use, and are given options to select the types of data that can be collected, stored, and used. Furthermore, users control the devices that can store data (e.g., user-only devices, client and server devices, etc.) and the devices that perform data analysis (e.g., user-only devices, client and server devices, etc.). Data is used for the specific purposes described herein. No data is shared with third parties without the user's explicit permission.
[0112] In the foregoing description, numerous specific details have been set forth for purposes of explanation to provide a thorough understanding of the specification. However, it will be apparent to those skilled in the art that this disclosure may be practiced without these specific details. In some instances, structures and devices are shown in block diagram form to avoid obscuring the description. For example, embodiments may be described primarily with reference to user interfaces and specific hardware. However, embodiments can be applied to any type of computing device capable of receiving data and commands, as well as any external device providing services.
[0113] References to "some embodiments" or "some examples" in the specification mean that a particular feature, structure, or characteristic described in conjunction with an embodiment or example may be included in at least one implementation of the specification. The phrase "in some embodiments" appearing in different places in the specification does not necessarily refer to the same embodiment.
[0114] Some portions of the detailed description above are presented in terms of algorithms and symbolic representations of operations on data bits within computer memory. These algorithmic descriptions and representations are the means by which those skilled in the art of data processing most effectively communicate the substance of their work to others skilled in the art. An algorithm is here, and generally, considered to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulation of physical quantities. Typically, though not always, these quantities take the form of electrical or magnetic data that can be stored, transmitted, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these data as bits, values, elements, symbols, characters, terms, numbers, or similar names.
[0115] However, it should be noted that all these terms and similar terms are associated with appropriate physical quantities and are merely convenient labels applied to these quantities. Unless explicitly stated in the following discussion, throughout the description, the use of terms such as 'processing,' 'computation,' 'derive,' and 'display' refers to the actions and processes of a computer system or similar electronic computing device. These actions and processes involve the manipulation and transformation of data, represented in the form of physical (electronic) quantities, within the registers and memory of the computer system, into other data, also represented in the form of physical quantities, stored in the computer system's memory or registers or other information storage, transmission, or display devices.
[0116] Embodiments of this specification may also relate to a processor for performing one or more steps of the methods described above. The processor may be a dedicated processor that is selectively activated or reconfigured by a computer program stored in a computer. Such a computer program may be stored in a non-transitory computer-readable storage medium, including but not limited to any type of disk, including optical discs, ROMs, CD-ROMs, magnetic disks, RAM, EPROMs, EEPROMs, magnetic cards or optical cards, flash memory including a USB key with non-volatile memory, or any type of medium suitable for storing electronic instructions, each medium being coupled to a computer system bus.
[0117] The specification may take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment that includes both hardware and software elements. In some embodiments, the specification is implemented in software, including but not limited to firmware, resident software, microcode, etc.
[0118] Furthermore, the specification may take the form of a computer program product accessible from a computer-usable or computer-readable medium that provides program code for use by, or in connection with, a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer-readable medium may be any means capable of containing, storing, communicating, propagating, or transmitting a program for use by or in connection with an instruction execution system, apparatus, or device.
[0119] A data processing system suitable for storing or executing program code will include at least one processor that is directly or indirectly connected to memory elements via a system bus. Memory elements may include local memory used during actual program code execution, mass storage, and cache memory, with the cache memory providing at least some temporary storage for program code to reduce the number of times code must be retrieved from mass storage during execution.
Claims
1. A computer-implemented method for determining whether to introduce a delay in an audio stream from a particular speaker, the method comprising: receiving an audio stream from a transmitting device; providing, as input to a trained machine learning model, the audio stream and a speech analysis score, information about one or more voice emotion parameters, and one or more voice emotion scores of a first user associated with the transmitting device, wherein the trained machine learning model is iteratively applied to the audio stream, and wherein the trained machine learning model is applied to the audio stream at preset time intervals; generating, with the trained machine learning model, a harmfulness level of the audio stream as output, wherein the harmfulness level identifies a current harmfulness level of the audio stream and includes a prediction of a future harmfulness level of the audio stream, and wherein a level of 0 indicates that there is no possibility of harmfulness in the audio stream; identifying a silence or pause between words in the audio stream, wherein the silence or pause corresponds to a specific timestamp in the audio stream; and sending the audio stream to a receiving device, wherein the sending introduces a time delay in the audio stream based on the harmfulness level, and wherein the time delay is introduced as a gap in the audio stream at the specific timestamp of the silence or pause between words.
2. The method according to claim 1, further comprising: identifying harmful instances in the audio stream; and before the audio stream is sent to the receiving device, replacing the harmful instances in the audio stream with noise or silence.
3. The method according to claim 1, wherein the audio stream is part of the metaverse.
4. The method according to claim 2, further comprising: updating the speech analysis score based on the identified harmful instances in the audio stream.
5. The method according to claim 1, further comprising: receiving text from a text channel associated with the transmitting device, wherein the text channel is separate from the audio stream; and generating a text score indicating a harmfulness level of the text; wherein the input to the trained machine learning model further includes the text score.
6. The method according to claim 1, wherein the input to the trained machine learning model further includes a harmfulness history of the first user, speaker history and metadata associated with the first user, and listener history and metadata associated with a second user, the second user being associated with the receiving device.
7. The method according to claim 1, wherein the one or more voice emotion parameters include tone, pitch, and a vocal intensity level determined based on one or more previous audio streams from the transmitting device.
8. The method according to claim 1, wherein the audio stream is provided together with a visual signal, and the method further comprises synchronizing the visual signal with the audio stream by introducing, in the visual signal, a time delay equal to the time delay of the audio stream.
9. The method according to claim 1, wherein the audio stream is part of a video stream, and the method further comprises: analyzing the audio stream to identify a harmful instance; in response to identifying the harmful instance, detecting a portion of the video stream that depicts offensive behavior, wherein the offensive behavior occurs within a predetermined time period of the harmful instance; and in response to detecting the offensive behavior, modifying at least the portion of the video stream by one or more of: blurring the portion, or replacing the portion with pixels that match a background area.
10. The method according to claim 1, wherein the audio stream is part of a video stream, and the method further comprises: performing motion detection on the video stream to detect an offensive gesture; and in response to detecting the offensive gesture, modifying at least a portion of the video stream by one or more of: blurring the portion, or replacing the portion with pixels that match a background area.
11. The method according to claim 1, wherein the time delay is zero seconds if the harmfulness level is below a minimum threshold.
12. An apparatus comprising: a processor; and a memory coupled to the processor and storing instructions that, when executed by the processor, cause the processor to perform operations comprising: receiving an audio stream from a transmitting device; providing, as input to a trained machine learning model, the audio stream and a speech analysis score, information about one or more voice emotion parameters, and one or more voice emotion scores of a first user associated with the transmitting device, wherein the trained machine learning model is iteratively applied to the audio stream, and wherein the trained machine learning model is applied to the audio stream at preset time intervals; generating, with the trained machine learning model, a harmfulness level of the audio stream as output, wherein the harmfulness level identifies a current harmfulness level of the audio stream and includes a prediction of a future harmfulness level of the audio stream, and wherein a level of 0 indicates that there is no possibility of harmfulness in the audio stream; identifying a silence or pause between words in the audio stream, wherein the silence or pause corresponds to a specific timestamp in the audio stream; and sending the audio stream to a receiving device, wherein the sending introduces a time delay in the audio stream based on the harmfulness level, and wherein the time delay is introduced as a gap in the audio stream at the specific timestamp of the silence or pause between words.
13. The apparatus according to claim 12, wherein the operations further comprise: identifying harmful instances in the audio stream; and before the audio stream is sent to the receiving device, replacing the harmful instances in the audio stream with noise or silence.
14. The apparatus according to claim 12, wherein the audio stream is part of the metaverse.
15. The apparatus according to claim 12, wherein the operations further comprise: updating the speech analysis score based on the identified harmful instances in the audio stream.
16. A non-transitory computer-readable medium storing instructions that, when executed by one or more computers, cause the one or more computers to perform operations comprising: receiving an audio stream from a transmitting device; providing, as input to a trained machine learning model, the audio stream and a speech analysis score, information about one or more voice emotion parameters, and one or more voice emotion scores of a first user associated with the transmitting device, wherein the trained machine learning model is iteratively applied to the audio stream, and wherein the trained machine learning model is applied to the audio stream at preset time intervals; generating, with the trained machine learning model, a harmfulness level of the audio stream as output, wherein the harmfulness level identifies a current harmfulness level of the audio stream and includes a prediction of a future harmfulness level of the audio stream, and wherein a level of 0 indicates that there is no possibility of harmfulness in the audio stream; identifying a silence or pause between words in the audio stream, wherein the silence or pause corresponds to a specific timestamp in the audio stream; and sending the audio stream to a receiving device, wherein the sending introduces a time delay in the audio stream based on the harmfulness level, and wherein the time delay is introduced as a gap in the audio stream at the specific timestamp of the silence or pause between words.
17. The computer-readable medium of claim 16, wherein the operations further comprise: identifying harmful instances in the audio stream; and before the audio stream is sent to the receiving device, replacing the harmful instances in the audio stream with noise or silence.
18. The computer-readable medium of claim 16, wherein the audio stream is part of the metaverse.
19. The computer-readable medium of claim 16, wherein the operations further comprise: updating the speech analysis score based on the identified harmful instances in the audio stream.
20. The computer-readable medium of claim 16, wherein the operations further comprise: receiving text from a text channel associated with the transmitting device, wherein the text channel is separate from the audio stream; and generating a text score indicating a harmfulness level of the text; wherein the input to the trained machine learning model further includes the text score.
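As a non-authoritative illustration of the steps recited in claims 1, 2, and 11, the following Python sketch shows one possible way to locate pauses, map a harmfulness level to a time delay, insert that delay as gaps at the pauses, and replace harmful instances with silence. It is a minimal sketch under stated assumptions, not the claimed implementation: audio is modeled as fixed-length frames of float samples, pauses are found with a simple energy threshold, the harmfulness level is assumed to lie in [0, 1] and to be supplied by the trained machine learning model (which is not modeled here), and the delay grows linearly with the level. All function names, parameters, and threshold values are hypothetical.

```python
from typing import List, Sequence, Tuple

Frame = List[float]  # one fixed-length chunk of samples (assumed layout)


def frame_energy(frame: Sequence[float]) -> float:
    """Mean squared amplitude of a frame (0.0 for an empty frame)."""
    return sum(s * s for s in frame) / len(frame) if frame else 0.0


def find_pauses(frames: Sequence[Frame], silence_energy: float = 1e-4,
                min_frames: int = 2) -> List[int]:
    """Return the start index of each quiet run of at least min_frames frames."""
    pauses: List[int] = []
    run_start = None
    for i, frame in enumerate(frames):
        if frame_energy(frame) < silence_energy:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_frames:
                pauses.append(run_start)
            run_start = None
    # A quiet run that reaches the end of the stream also counts as a pause.
    if run_start is not None and len(frames) - run_start >= min_frames:
        pauses.append(run_start)
    return pauses


def delay_seconds(level: float, min_threshold: float = 0.2,
                  max_delay_s: float = 2.0) -> float:
    """Map a harmfulness level to a delay; zero below the minimum threshold.

    The [0, 1] scale and the linear mapping are assumptions of this sketch.
    """
    if level < min_threshold:
        return 0.0
    return max_delay_s * min(level, 1.0)


def insert_gaps(frames: Sequence[Frame], pauses: Sequence[int],
                delay_s: float, frame_s: float) -> List[Frame]:
    """Spread delay_s of silent frames as evenly as possible over the pauses."""
    if not pauses or delay_s <= 0:
        return list(frames)
    total_gap = round(delay_s / frame_s)
    per_pause, extra = divmod(total_gap, len(pauses))
    frame_len = len(frames[0])
    rank = {p: k for k, p in enumerate(pauses)}
    out: List[Frame] = []
    for i, frame in enumerate(frames):
        if i in rank:
            n = per_pause + (1 if rank[i] < extra else 0)
            out.extend([0.0] * frame_len for _ in range(n))
        out.append(frame)
    return out


def mute_spans(frames: Sequence[Frame],
               spans: Sequence[Tuple[int, int]]) -> List[Frame]:
    """Return a copy with every frame in each [start, end) span set to silence."""
    out = [list(f) for f in frames]
    for start, end in spans:
        for i in range(max(start, 0), min(end, len(out))):
            out[i] = [0.0] * len(out[i])
    return out
```

In such a sketch, a caller would re-evaluate the harmfulness level at each preset interval, call `delay_seconds` on the result, and pass the returned value to `insert_gaps` together with the indices from `find_pauses`. Placing the added silence only at existing pauses lengthens the pauses without cutting into words, which is the effect the claims describe for the gap at the timestamp of the silence or pause.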