system
The system addresses power harassment by converting audio to text, analyzing it with NLP and machine learning, and providing immediate alerts, enhancing workplace communication by allowing users to adjust their behavior.
Patent Information
- Authority / Receiving Office: JP
- Patent Type: Applications
- Current Assignee / Owner: SOFTBANK GROUP CORP
- Filing Date: 2024-12-09
- Publication Date: 2026-06-19
AI Technical Summary
In modern working environments, power harassment is a significant issue with workers often unaware of their aggressive or high-pressure behavior, and existing systems lack effective real-time detection and prevention mechanisms.
A system that converts audio data into text using speech recognition, analyzes it with natural language processing, and evaluates the risk of harassment using a machine learning model, providing immediate alerts to users through terminals like smartphones and PCs.
Enables real-time detection and notification of potentially aggressive language, allowing users to adjust their behavior promptly, thereby improving workplace communication and reducing power harassment.
Abstract
Description
Technical Field
[0001] The technology of the present disclosure relates to a system.
Background Art
[0002] Patent Document 1 discloses a method for controlling a persona chatbot, which is performed by at least one processor, the method including steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a character of the chatbot, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance in response to the user utterance.
Prior Art Documents
Patent Documents
[0003]
Patent Document 1
Summary of the Invention
Problems to be Solved by the Invention
[0004] In the modern working environment, power harassment is a serious problem, and many workers bear a mental burden from high-pressure or aggressive words and actions. Nevertheless, the person responsible for such words and actions rarely has an opportunity to notice them, and effective means of preventing the problem are lacking. There is therefore a demand for a system that automatically makes people aware, in real time, of the possibility of power harassment.
Means for Solving the Problems
[0005] This invention provides means for acquiring audio data and converting it into text data, and means for analyzing the text data by natural language processing to detect overbearing or aggressive language. It further provides means for evaluating, using a machine learning model, the possibility that the ongoing behavior constitutes harassment, and means for notifying the user of that risk promptly and appropriately so that problems are prevented before they occur. Through this series of means, users gain the opportunity to evaluate their own behavior objectively and to improve it as needed.
[0006] "Audio data" refers to data used to record or transmit audio information in digital format.
[0007] "Text data" refers to audio data converted into a string of characters, which is textual information in a format that can be analyzed and manipulated.
[0008] "Means of analysis" refer to processes and techniques for determining the structure and meaning of language based on text data, and for identifying specific patterns and intentions.
[0009] "Overbearing or aggressive language" refers to expressions or language that may intimidate or emotionally hurt others.
[0010] A "machine learning model" is an algorithm that learns rules and patterns from data and performs predictions and classifications based on the input data.
[0011] "Means of evaluation" refers to the process of judging the analysis results against specific criteria and quantifying their value and risk.
[0012] "Means of notifying users of risks" refer to functions and mechanisms that provide users with information based on evaluation results and encourage them to take precautions or change their behavior.
Brief Explanation of the Drawings
[0013]
[Figure 1] This is a conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.
[Figure 2] This is a conceptual diagram showing an example of the essential functions of a data processing device and a smart device according to the first embodiment.
[Figure 3] This is a conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.
[Figure 4] This is a conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
[Figure 5] This is a conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.
[Figure 6] This is a conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
[Figure 7] This is a conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.
[Figure 8] This is a conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
[Figure 9] This shows an emotion map where multiple emotions are mapped.
[Figure 10] This shows an emotion map where multiple emotions are mapped.
[Figure 11] This is a sequence diagram showing the processing flow of the data processing system in Example 1.
[Figure 12] This is a sequence diagram showing the processing flow of the data processing system in Application Example 1.
[Figure 13] This is a sequence diagram showing the processing flow of the data processing system in Example 2, which incorporates an emotion engine.
[Figure 14] This is a sequence diagram showing the processing flow of the data processing system in Application Example 2, which combines an emotion engine.
Modes for Carrying Out the Invention
[0014] Hereinafter, an example of an embodiment of the system relating to the technology of this disclosure will be described with reference to the attached drawings.
[0015] First, the terms used in the following description will be explained.
[0016] In the following embodiments, a processor denoted by a reference numeral (hereinafter simply referred to as "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. The processor may also be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), and an APU (Accelerated Processing Unit).
[0017] In the following embodiments, RAM (Random Access Memory) denoted by a reference numeral is a memory in which information is temporarily stored and which the processor uses as a work memory.
[0018] In the following embodiments, storage denoted by a reference numeral is one or more non-volatile storage devices that store various programs, various parameters, and the like. Examples of non-volatile storage devices include flash memory (e.g., an SSD (Solid State Drive)), magnetic disks (e.g., hard disks), and magnetic tapes.
[0019] In the following embodiments, a communication I/F (Interface) denoted by a reference numeral is an interface that includes a communication processor, an antenna, and the like. The communication I/F controls communication between multiple computers. Examples of communication standards applied to the communication I/F include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), and Bluetooth (registered trademark).
[0020] In the following embodiments, "A and/or B" is synonymous with "at least one of A and B." That is, "A and/or B" may mean A alone, B alone, or a combination of A and B. Furthermore, in this specification, the same concept as "A and/or B" applies when three or more items are linked by "and/or."
[0021] [First Embodiment]
[0022] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.
[0023] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.
[0024] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).
[0025] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.
[0026] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.
[0027] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and/or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.
[0028] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.
[0029] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.
[0030] As shown in Figure 2, in the data processing device 12, a specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.
[0031] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0032] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used in conjunction with a specific processing program 56 by the data processing system 10. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0033] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".
[0034] This invention is a system that detects power harassment in real time and notifies the user, and is implemented on terminals such as smartphones and PCs. First, the terminal acquires audio data through a microphone and converts this audio data into text information using speech recognition technology. This converted text information, i.e., text data, is in a form that can be quickly analyzed. The terminal sends this text data to a server, and the server performs analysis on the received text data.
[0035] The server uses a natural language processing (NLP) engine to analyze text data. This engine classifies and identifies each word and phrase, taking into account context and grammatical structure. Next, the server evaluates the analysis results using a machine learning model. This model is trained on past data related to power harassment and has the ability to determine whether a particular context or wording constitutes power harassment.
[0036] When the server detects signs of a problem, it calculates a specific risk score. If this score exceeds a set threshold, the server issues an alert to the user. This alert is displayed on the user's screen via the terminal and may also be output as an audio notification. Specifically, the user may see a message such as, "This may be considered aggressive behavior. Please be careful with your words and actions."
[0037] Furthermore, after receiving this notification, users can provide feedback on the timing and content of the notification, allowing the system to continuously improve its accuracy. This entire process takes place in real time, helping users to immediately adjust their necessary actions.
[0038] As a concrete example, consider a situation where a user says during a meeting, "It's your fault this project is progressing so slowly." This phrase is likely to be analyzed by the server and deemed to be aggressive. The server can then send an alert to the user's terminal, prompting them to reconsider their statement. Through this process, the speaker is given an opportunity to reflect on their words and actions, contributing to an improved work environment.
[0039] The following describes the processing flow.
[0040] Step 1:
[0041] The terminal acquires audio data. When the user launches an application and enables the recording function, the terminal's microphone captures ambient sound and records it as audio data.
[0042] Step 2:
[0043] The terminal uses speech recognition technology to convert the acquired audio data into text data. Here, the spoken words are accurately converted into text information, which is a format that can be analyzed.
[0044] Step 3:
[0045] The terminal sends the converted text data to the server. This transmission occurs in real time and is used as data necessary for analysis on the server.
[0046] Step 4:
[0047] The server receives text data and performs analysis using a natural language processing (NLP) engine. Here, the server tokenizes the text, determines the meaning of each word, and analyzes the context.
[0048] Step 5:
[0049] The server evaluates the possibility of power harassment based on the analysis results. Using a machine learning model, it calculates a risk score for high-handed or aggressive behavior.
[0050] Step 6:
[0051] If the risk score exceeds a set threshold, the server generates an alert message. This message is intended to alert the user.
[0052] Step 7:
[0053] The terminal receives the alert message sent from the server and notifies the user. The notification is presented as an on-screen pop-up or as audio, giving the user prompt feedback.
[0054] Step 8:
[0055] Users can review alerts and modify their statements and attitudes as needed. As a result, users can become more aware of their own words and actions.
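As a non-limiting illustration, the exchange between the terminal and the server in Steps 3 to 7 could be sketched as follows. The JSON field names, the threshold value of 0.7, and the `toy_score` function are assumptions introduced only for this sketch; in particular, `toy_score` is a placeholder for the trained machine learning model, which the description does not specify.

```python
import json
from typing import Callable, Optional

# Alert wording taken from the description; the JSON field names are assumptions.
ALERT_TEXT = (
    "This may be considered aggressive behavior. "
    "Please be careful with your words and actions."
)


def terminal_build_request(text: str) -> str:
    """Step 3: the terminal serializes the recognized text for transmission."""
    return json.dumps({"text": text})


def server_handle_request(
    request_json: str,
    score_fn: Callable[[str], float],
    threshold: float = 0.7,  # hypothetical value for the "set threshold"
) -> str:
    """Steps 4 to 6: score the text and attach an alert if the threshold is exceeded."""
    text = json.loads(request_json)["text"]
    score = score_fn(text)
    alert = ALERT_TEXT if score > threshold else None
    return json.dumps({"risk_score": score, "alert": alert})


def terminal_handle_response(response_json: str) -> Optional[str]:
    """Step 7: return the alert text to show as a pop-up, or None."""
    return json.loads(response_json)["alert"]


def toy_score(text: str) -> float:
    """Placeholder for the trained model: flags one blaming phrase only."""
    return 0.9 if "your fault" in text.lower() else 0.1


request = terminal_build_request("It's your fault this project is progressing so slowly.")
response = server_handle_request(request, toy_score)
print(terminal_handle_response(response))
```

Passing the scoring function as a parameter keeps the message handling independent of whichever model the server actually uses.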
[0056] (Example 1)
[0057] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."
[0058] In the workplace, a challenge is to prevent friction and discord caused by inappropriate language and behavior, and to provide a healthier communication environment. Conventional systems have had difficulty detecting inappropriate behavior in real time and responding immediately, thus limiting their effectiveness in situations where immediate action is required.
[0059] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0060] In this invention, the server includes means for acquiring an acoustic signal and converting the acoustic signal into text information, means for analyzing the text information to detect aggressive or condescending linguistic expressions, and means for evaluating the text information using a machine learning model that evaluates the possibility that an ongoing linguistic expression constitutes inappropriate behavior in the workplace. This makes it possible to detect inappropriate remarks in the workplace in real time and quickly issue warnings to users.
[0061] An "acoustic signal" is a recording of ambient sound as an electrical signal, and is primarily used as audio data.
[0062] "Text information" refers to text data converted from acoustic signals, and serves as fundamental data for language analysis.
[0063] "Linguistic expression" refers to the words and phrases contained in text information, which convey specific meanings and emotions.
[0064] A "machine learning model" is an algorithm that learns from past data and is used to recognize specific patterns or rules.
[0065] "Inappropriate behavior" refers to verbal expressions or attitudes that worsen the workplace environment, and includes things like power harassment.
[0066] "Real-time" means that information processing is performed immediately at the moment it occurs, and results are provided without delay.
[0067] A specific description will be given regarding embodiments for carrying out this invention.
[0068] This system consists of a user terminal and a server. The terminal is a smartphone or personal computer that acquires ambient acoustic signals through its built-in microphone. The terminal then uses speech recognition software to convert the acquired acoustic signals into text information. Specifically, Google® Cloud Speech-to-Text or similar APIs are often used.
[0069] The text information converted by the terminal is sent to the server in a secure manner. The server analyzes this text information using a natural language processing engine. Natural language processing libraries such as spaCy and NLTK are used for the analysis. The server applies a machine learning model to determine whether the linguistic expressions contained in the text information are domineering or aggressive. Because this model is trained on a dataset of past inappropriate behavior, it can perform evaluations that are relevant to real-world workplace environments.
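As a non-limiting illustration, the kind of linguistic features that such an analysis might pass to the machine learning model could be sketched as follows. This sketch uses only the Python standard library as a stand-in for a library such as spaCy or NLTK, and the word lists and feature names are hypothetical; an actual system would be expected to learn such cues from data rather than hard-code them.

```python
import re
from collections import Counter
from typing import Dict, List

# Illustrative word lists only (assumptions, not taken from the description).
BLAME_CUES = {"fault", "useless", "incompetent"}
SECOND_PERSON = {"you", "your", "you're"}


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def extract_features(text: str) -> Dict[str, int]:
    """Count simple cues that a downstream model could use as input."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    return {
        "n_tokens": len(tokens),
        "blame_cues": sum(counts[w] for w in BLAME_CUES),
        "second_person": sum(counts[w] for w in SECOND_PERSON),
    }


print(extract_features("It's your fault this project is progressing so slowly."))
```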
[0070] Once the analysis is complete, if certain linguistic expressions are deemed inappropriate, the server will notify the user via their terminal. The user can then review their statements after receiving this warning, thus preventing potential disagreements and conflicts.
[0071] For example, if a user says "It's your fault this project is progressing slowly" during a meeting, this phrase is likely to be considered aggressive, and the server will issue a warning. Based on this warning, the user can reflect on their statement and correct it if necessary.
[0072] Furthermore, as part of improving the system's accuracy, users can submit feedback on warnings. This feedback is reflected in the server's machine learning model, contributing to improved overall system reliability.
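As a non-limiting illustration, user feedback on warnings could be accumulated as labeled examples for later retraining along the following lines. The `FeedbackStore` class, its method names, and the 0/1 labeling scheme are assumptions made for this sketch; the description does not specify how feedback is stored or how the model is retrained.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class FeedbackStore:
    """Collects user feedback on warnings as (text, label) training examples."""

    examples: List[Tuple[str, int]] = field(default_factory=list)

    def record(self, text: str, user_agrees: bool) -> None:
        # Label 1: the user confirms the warning was justified.
        # Label 0: the user reports a false positive.
        self.examples.append((text, 1 if user_agrees else 0))

    def false_positive_rate(self) -> float:
        """Share of recorded warnings that users rejected."""
        if not self.examples:
            return 0.0
        rejected = sum(1 for _, label in self.examples if label == 0)
        return rejected / len(self.examples)


store = FeedbackStore()
store.record("It's your fault this project is progressing slowly", user_agrees=True)
store.record("Could you send the report today?", user_agrees=False)
print(store.false_positive_rate())
```

Tracking the rejection rate gives one simple signal for when the threshold or the model may need adjustment.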
[0073] An example of a prompt for a generative AI model might be, "Please provide wording that points out that the instructions from your supervisor are unclear and may be causing misunderstandings."
[0074] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0075] Step 1:
[0076] The terminal uses a microphone to acquire ambient acoustic signals. The input consists of ambient sounds and conversations, which the terminal captures as digital audio data. This data is stored as a preliminary step before being processed by speech recognition software. Specifically, pre-processing such as volume level adjustment and noise cancellation is performed.
[0077] Step 2:
[0078] The terminal converts the acquired acoustic signals into text information using speech recognition software. The input is digital audio data, and the output is the corresponding text data. Using APIs such as Google Cloud Speech-to-Text, the conversion from audio to text information is performed according to a language model. In this process, silent portions of the audio data are omitted, and the important spoken parts are efficiently converted.
[0079] Step 3:
[0080] The terminal sends the converted text information to the server. The input is the generated text data, and the output is the data received by the server. Sending the data via encrypted communication ensures data security while delivering the data to the server with low latency.
[0081] Step 4:
[0082] The server analyzes the received text data using a natural language processing engine. The input is text data, and the output is analyzed data that includes linguistic features and contextual information. Libraries such as spaCy and NLTK are used to perform grammatical structure analysis and keyword extraction. Sentence tone analysis is also performed here, quantifying the degree of aggressiveness.
[0083] Step 5:
[0084] The server uses a machine learning model to assess the likelihood of inappropriate behavior based on the analyzed data. The input is the analyzed data, and the output is the assessment result with a risk score. A model trained on historical datasets is used to calculate the probability that a particular linguistic expression poses a risk.
[0085] Step 6:
[0086] If the risk score exceeds a threshold, the server sends an alert to the terminal. The input is the evaluation result, and the output is a warning message to the user. This alert is displayed as a pop-up on the terminal, and an audio notification is given if necessary. Specific warning messages may include, "Your remarks were judged to be aggressive. Please reconsider your words and actions."
[0087] Step 7:
[0088] Users can provide feedback on the warnings they receive. Input consists of user ratings and opinions, while output is stored as training data for the system. This allows the machine learning model to continuously improve, helping to prevent false positives and enhance accuracy.
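As a non-limiting illustration of the pre-processing mentioned in Steps 1 and 2, volume level adjustment and the omission of silent portions could be sketched as follows. The sketch assumes audio samples represented as floating-point values in the range -1.0 to 1.0; the frame size, energy threshold, and target peak are hypothetical values, and noise cancellation is not covered.

```python
from typing import List


def normalize_peak(samples: List[float], target_peak: float = 0.9) -> List[float]:
    """Scale the samples so that the loudest one reaches target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # pure silence: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]


def drop_silent_frames(
    samples: List[float],
    frame_size: int = 4,             # hypothetical; real frames are far longer
    energy_threshold: float = 0.01,  # hypothetical mean-square energy cutoff
) -> List[float]:
    """Keep only frames whose mean-square energy reaches the threshold."""
    kept: List[float] = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        energy = sum(s * s for s in frame) / len(frame)
        if energy >= energy_threshold:
            kept.extend(frame)
    return kept


audio = [0.0] * 4 + [0.5, -0.5, 0.25, -0.25] + [0.0] * 4
voiced = drop_silent_frames(audio)
print(normalize_peak(voiced))
```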
[0089] (Application Example 1)
[0090] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."
[0091] In modern living environments and workplaces, the likelihood of power harassment occurring due to communication issues is increasing. In particular, there is a need for effective means to immediately detect aggressive or condescending language during real-time conversations and to encourage improved communication. However, conventional systems lack sufficient real-time detection and notification capabilities to address this problem.
[0092] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0093] In this invention, the server includes means for acquiring audio information and converting said audio information into text information; means for analyzing said text information to detect high-pressure or aggressive expressions; means for evaluating said text information using a learning model that assesses the possibility that ongoing expressions constitute power harassment; and means for causing an automated device that monitors conversations in a living environment to alert the user if a problem is detected. This enables real-time detection of power harassment and rapid feedback to the user.
[0094] "Audio information" refers to sound data acquired through devices such as microphones.
[0095] "Text information" refers to a data format in which audio information is converted into text.
[0096] "Expression" refers to words and phrases that fulfill the role of language.
[0097] A "learning model" is an algorithm that is trained on past data and has the ability to recognize specific patterns and trends.
[0098] "Power harassment" refers to overbearing or aggressive behavior or remarks that stem from an imbalance of power in the workplace or living environment.
[0099] An "automated device" refers to a mechanical or electronic device designed to perform a specific task or function without human intervention.
[0100] "User" refers to an individual or group whose purpose is to use this system.
[0101] This invention relates to a system for detecting intimidating or aggressive behavior in a living environment in real time and for evaluating the possibility of power harassment. The system uses an automated device equipped with a microphone, display, and speaker as hardware, as well as a server and related software.
[0102] First, audio information is acquired through the microphone. The device uses the Google Speech-to-Text API, a speech recognition engine, to convert this audio information into text in real time. The text information thus converted is then sent from the device to the server.
[0103] On the server, the text information is analyzed using a natural language processing engine based on spaCy. To determine whether words and phrases are aggressive, the ongoing expression is evaluated by a learning model, specifically a machine learning algorithm using TensorFlow®. This machine learning model is trained on past cases of power harassment.
[0104] If the server determines that certain expressions are aggressive or offensive, the automated device will warn the user through a display or speaker. This warning allows for quick feedback and encourages the user to reconsider their words and actions immediately.
[0105] For example, if an automated device detects the phrase "You're not doing well at all" during a conversation at home, it will determine that this is an overbearing expression and offer a user-friendly suggestion such as "Try to think of a gentler way to say it."
[0106] An example of an input prompt for a generative AI model is: "Analyze the following conversation and determine if it may constitute harassment. Consider the context and tone. Conversation: 'Conversation content'."
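As a non-limiting illustration, the prompt quoted above could be assembled from recognized utterances as follows. The function name `build_prompt` and the choice to join utterances with spaces are assumptions made for this sketch; how the prompt is submitted to a generative AI model is not shown because the description does not specify it.

```python
from typing import Iterable

# Prompt wording taken from the description, with the conversation as a placeholder.
PROMPT_TEMPLATE = (
    "Analyze the following conversation and determine if it may constitute "
    "harassment. Consider the context and tone. Conversation: '{conversation}'."
)


def build_prompt(utterances: Iterable[str]) -> str:
    """Join the non-empty utterances and insert them into the template."""
    conversation = " ".join(u.strip() for u in utterances if u.strip())
    return PROMPT_TEMPLATE.format(conversation=conversation)


print(build_prompt(["You're not doing well at all."]))
```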
[0107] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0108] Step 1:
[0109] The device acquires audio information through its microphone. The input is ambient sound, which is captured as an analog audio signal. The device converts this audio signal into digital data and stores it in an appropriate format.
[0110] Step 2:
[0111] The device uses the Google Speech-to-Text API to convert digital audio data into text. The input is audio data, and the output is text data. This process uses speech recognition technology to transcribe the content of the audio data into text with high accuracy.
[0112] Step 3:
[0113] Text information is sent from the terminal to the server. The server receives this text information and performs analysis using spaCy, a natural language processing engine. The input is text data, and the output is the analysis result, which includes language structure and contextual information. The server evaluates the grammar and word meanings.
[0114] Step 4:
[0115] The server evaluates the analyzed data using a machine learning model (using TensorFlow). The model is trained on past cases of power harassment; the input is the analysis results, and the output is a risk score indicating the likelihood of power harassment. This process scores whether specific phrases or wording are aggressive or intimidating.
[0116] Step 5:
[0117] The server issues a warning to the user if the risk score exceeds a set threshold. This warning is output via the terminal's display and speaker. The input is the risk score, and the output is the warning message. This allows the user to immediately review their own words and actions.
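As a non-limiting illustration of how analysis results might be turned into a risk score between 0 and 1 in Step 4, a logistic function over weighted features is sketched below. This is a simple stand-in written with the standard library, not the TensorFlow model referred to above; the feature names, weights, and bias are hypothetical, whereas in the described system such parameters would result from training on past cases.

```python
import math
from typing import Dict

# Hypothetical parameters; a trained model would supply these.
WEIGHTS: Dict[str, float] = {
    "blame_cues": 2.0,
    "second_person": 0.8,
    "imperatives": 0.5,
}
BIAS = -2.0


def risk_score(features: Dict[str, float]) -> float:
    """Map feature counts to a score in (0, 1) with a logistic function."""
    z = BIAS + sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


print(risk_score({"blame_cues": 1, "second_person": 1}))
print(risk_score({}))
```

Because the output is bounded, it can be compared directly with the threshold used in Step 5.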
[0118] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.
[0119] This invention is a system that evaluates the possibility of power harassment in real time based on voice and text data and warns the user, further improving accuracy by incorporating an emotion engine. The terminal first records the user's voice through a microphone and converts it into text data using speech recognition technology. The text data obtained from this data conversion process is immediately sent to the server.
[0120] The server analyzes text data using a natural language processing (NLP) engine, evaluating the possibility of harassment by considering the language structure and context. Furthermore, an emotion engine recognizes the user's emotions from the audio and text data and extracts their emotional state. The results of the emotion engine's analysis influence the evaluation of harassment, taking into account situations where the user is not aggressive but is in a negative emotional state.
[0121] The machine learning model is trained on past cases of power harassment, and combines the text analysis with the emotion assessment to evaluate the risk of a remark numerically. Based on this evaluation, the server generates alerts and notifies the user when necessary. The notifications are delivered via the terminal and provide information to help the user take appropriate action.
[0122] As a concrete example, consider a situation where a user says during a meeting, "I want to know why you're not meeting my expectations." When this statement is made, the server performs linguistic analysis, and simultaneously, the emotion engine recognizes the user's emotions from the tone and speed of their voice. If the system determines that the user is feeling irritated or dissatisfied, the overall risk score may increase. As a result, the terminal will issue an alert to the user, such as, "Your statement may be misleading. Please maintain a calm tone."
[0123] By combining this with an emotion engine, it becomes possible to detect power harassment in a sophisticated way that considers not only the content of the language but also the emotions behind it. As a result, users can gain a deeper understanding and create a better communication environment.
[0124] The following describes the processing flow.
[0125] Step 1:
[0126] The terminal acquires audio data. The user activates the application's recording function and records conversations and voices, including ambient noise. This ensures that audio data is available for later analysis.
[0127] Step 2:
[0128] The terminal uses speech recognition technology to convert the acquired audio data into text data. Once the voice input is converted to text format, it is ready for natural language processing.
[0129] Step 3:
[0130] The terminal sends the converted text data to the server. The server thereby receives the data necessary to analyze the audio content. Because the analysis is performed on the server, the processing load on the terminal is kept to a minimum.
[0131] Step 4:
[0132] The server analyzes the text data using a natural language processing engine. Here, it tokenizes the text, interprets the tokens in context, and determines whether the text contains aggressive or condescending language.
[0133] Step 5:
[0134] The server uses an emotion engine to analyze the user's voice characteristics (tone, speed, and volume) and recognize the user's emotional state. This allows for a quantitative measurement of their emotions at that moment.
[0135] Step 6:
[0136] The server uses a machine learning model to integrate text data with emotional state assessments and calculate a risk score for power harassment. This score is then refined by a model trained on past case data.
[0137] Step 7:
[0138] If the risk score exceeds a set threshold, the server generates an alert for the user. The alert includes specific suggestions for improvement and cautionary messages.
[0139] Step 8:
[0140] The device notifies the user of alerts sent from the server. Notifications are efficiently delivered through on-screen pop-ups and audio alerts, providing the user with the information they need immediately.
[0141] Step 9:
[0142] Users review the notification content and modify their conversations as needed. This feedback loop allows users to develop an awareness of improving their words and actions.
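As an illustration only, Steps 6 and 7 of the above flow can be sketched in Python as follows. The weights, the threshold value, and the names `integrate_risk` and `assess` are assumptions introduced for this sketch and are not specified in this disclosure; a trained machine learning model would take the place of the fixed weighted sum.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical weights and threshold; this disclosure does not specify values.
TEXT_WEIGHT = 0.7
EMOTION_WEIGHT = 0.3
ALERT_THRESHOLD = 0.6


@dataclass
class Assessment:
    risk_score: float
    alert: Optional[str]


def integrate_risk(text_score: float, emotion_score: float) -> float:
    """Combine a text-based score and an emotion-based score, both in [0, 1]."""
    for value in (text_score, emotion_score):
        if not 0.0 <= value <= 1.0:
            raise ValueError("scores must lie in [0, 1]")
    return TEXT_WEIGHT * text_score + EMOTION_WEIGHT * emotion_score


def assess(text_score: float, emotion_score: float) -> Assessment:
    """Compute the risk score and attach an alert only above the threshold."""
    score = integrate_risk(text_score, emotion_score)
    alert = None
    if score > ALERT_THRESHOLD:
        alert = "Your statement may be misleading. Please maintain a calm tone."
    return Assessment(risk_score=score, alert=alert)
```

With these assumed values, a statement whose wording alone scores 0.5 produces no alert when the speaker is calm (emotion score 0.2) but does produce one when irritation is detected (emotion score 0.9), which mirrors the case described in paragraph [0122].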
[0143] (Example 2)
[0144] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".
[0145] In today's communication environment, there is a need to quickly detect remarks that may unintentionally be perceived as aggressive or condescending, and to provide appropriate feedback. However, conventional systems primarily analyze only the content of the language, without conducting detailed evaluations based on the underlying emotions and context. As a result, there is a challenge in that it is difficult for users to immediately understand and improve their own communication style.
[0146] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0147] In this invention, the server includes means for analyzing voice and text information to detect aggressive or condescending language, means for evaluating language information using a learning model to determine the possibility of power harassment, and means for performing emotional analysis and incorporating the results into the evaluation. This makes it possible to provide real-time feedback to the user and promote a deeper understanding of communication.
[0148] "Voice information" refers to data related to human voices acquired through input devices such as microphones.
[0149] "Textual information" refers to data that is represented as text after voice information has been converted.
[0150] A "learning model" is a mathematical model that performs pattern recognition and classification based on past data and examples.
[0151] "Analysis" is the process of analyzing data, understanding its content, and evaluating it.
[0152] "Evaluation" is the process of determining the value and risks of something based on analyzed data.
[0153] "Emotional analysis" is a technology that infers and extracts a person's emotional state from information expressed as voice or text.
[0154] "Notification" refers to the act of providing information to inform users of the evaluation results.
[0155] This system is designed to analyze voice information and assess the risk of power harassment in real time. A specific implementation is described below.
[0156] First, when a user speaks, the device acquires the audio information through its microphone. This audio information is then converted into text using speech recognition technology built into the device, such as general speech-to-text software. At this stage, high-quality microphones and hardware with noise-canceling capabilities are recommended to improve audio accuracy.
[0157] The converted text information is sent to the server in real time. The server receives this information and performs analysis using a natural language processing engine. The analysis utilizes language processing libraries for contextual understanding and syntactic analysis of the text information. For example, open-source and commercial natural language processing tools are commonly used.
[0158] As the next step in the analysis, the server utilizes an emotion analysis engine to extract the user's emotional state from the audio and text information. This emotion data is crucial for understanding the psychological aspects behind the statements. The emotion engine is trained using machine learning and employs data models to identify a variety of emotions.
[0159] The server integrates these analysis results and uses a learning model to assess the risk of power harassment. Because the learning model is trained on a large dataset based on past cases, it can provide accurate risk assessments.
[0160] When notifying a user of a risk, the server sends the assessment results to the terminal, and a warning message is displayed to the user. For example, the user might be advised, "Your statement may be misleading. Please maintain a calm tone."
[0161] As a concrete example, consider the behavior of a user who, during a meeting, asks, "Please tell me why you haven't met my expectations." In this case, the system would assess the possibility of the statement being misinterpreted as harassment and provide appropriate feedback. Possible prompts for the generative AI model could include instructions such as, "Perform a communication risk assessment based on text and audio data, and present the results."
[0162] This system allows users to receive a risk assessment that includes the emotional context of their statements, and to gain guidance for improving their communication style.
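The exchange in which the server sends the assessment result to the terminal and the terminal displays a warning can be sketched, purely as an example, with a JSON payload. The field names `risk_score` and `message`, the display format, and both function names are assumptions made for this sketch; the disclosure does not define a message format.

```python
import json
from typing import Optional


def encode_result(risk_score: float, message: Optional[str]) -> str:
    """Server side: serialize an assessment result for transmission."""
    return json.dumps({"risk_score": round(risk_score, 3), "message": message})


def render_notification(payload: str) -> Optional[str]:
    """Terminal side: return the text to display, or None if there is no warning."""
    data = json.loads(payload)
    if data.get("message") is None:
        return None
    return f"[Risk {data['risk_score']:.2f}] {data['message']}"
```

In this sketch the terminal stays silent when the payload carries no message, so the server alone decides whether a warning is warranted.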
[0163] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0164] Step 1:
[0165] When a user speaks, the device uses its microphone to acquire audio information. Since this audio information cannot be analyzed directly, it is converted into text using speech recognition technology. The converted text is then stored digitally within the device, and the process proceeds to the next step.
[0166] Step 2:
[0167] The terminal sends the converted text information to the server. The server feeds the received text information into a natural language processing engine to analyze the sentence structure and context. The input here is text information, and the output is evaluation information, including the detection results of aggressive or condescending language. This data processing is performed through syntactic and semantic analysis of the text.
[0168] Step 3:
[0169] The server also receives audio information and uses an emotion analysis engine to extract the user's emotional state. In this step, raw audio data is used as input, and the user's emotional state information after analysis is output. Emotions are read from the tone and speed of the voice, and the evaluation is performed by integrating this with the results of text analysis.
[0170] Step 4:
[0171] The server integrates the results of text analysis and emotion analysis, and uses a learning model to assess the risk of power harassment. The input consists of text and emotion analysis results, and the output is a risk score. This score calculation utilizes a machine learning algorithm that has been refined based on similar past cases.
[0172] Step 5:
[0173] If the risk assessment results in a score exceeding a threshold, the server generates a warning message and sends it to the terminal. The terminal notifies the user of this message and displays specific feedback such as "Your statement may be misleading. Please maintain a calm tone," prompting the user to immediately improve their behavior.
[0174] This series of processes allows users to receive real-time feedback and recognize communication risks.
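Step 3 reads emotions from the tone and speed of the voice. One simple way to turn such features into a number, given here only as a rough sketch, is to compare them with the speaker's usual values. The feature names, the baseline comparison, and the equal-weight average are all assumptions of this example; the emotion analysis engine in this disclosure is described as a trained model, not as this heuristic.

```python
from dataclasses import dataclass


@dataclass
class VoiceFeatures:
    pitch_hz: float       # average fundamental frequency (tone)
    words_per_min: float  # speaking speed
    volume_db: float      # average loudness


def relative_increase(value: float, baseline: float) -> float:
    """Fractional increase over the baseline, clipped to [0, 1]."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return min(max((value - baseline) / baseline, 0.0), 1.0)


def agitation_score(current: VoiceFeatures, baseline: VoiceFeatures) -> float:
    """Average the clipped relative increases of tone, speed, and volume."""
    parts = [
        relative_increase(current.pitch_hz, baseline.pitch_hz),
        relative_increase(current.words_per_min, baseline.words_per_min),
        relative_increase(current.volume_db, baseline.volume_db),
    ]
    return sum(parts) / len(parts)
```

A score of 0 means the voice matches the baseline or is calmer; values closer to 1 indicate a markedly higher, faster, and louder voice than usual.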
[0175] (Application Example 2)
[0176] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as a "server" and the smart device 14 as a "terminal".
[0177] In workplaces, educational institutions, and other settings, there is a growing need to prevent harassment and create a healthy communication environment. However, traditional methods have struggled to evaluate not only the content of statements but also background information such as emotions. In particular, many cases of harassment stem from minor misunderstandings or heightened emotions, and a system is needed that can recognize this in real time and warn users.
[0178] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0179] In this invention, the server includes means for acquiring sound data and converting said sound data into document data; means for analyzing said document data to detect intimidating or aggressive language; means for evaluating said document data using a learning model that evaluates the possibility of harassment for the language being spoken; means for recognizing emotional states from the sound data; and means for modifying the harassment evaluation based on the emotional states. This makes it possible to comprehensively evaluate the possibility of harassment from both emotional and linguistic aspects and provide warnings to users at an appropriate time.
[0180] "Sound data" refers to recordings of speech or sounds as digital information, primarily collected through devices such as microphones.
[0181] "Document data" refers to sound data converted into text, in a format that can be analyzed and evaluated through natural language processing.
[0182] "Intimidating or aggressive language" refers to language that includes expressions that exert physical or psychological pressure or aggression on others, and its detection is important in assessing the possibility of harassment.
[0183] A "learning model" is an algorithm or framework that is trained using historical data to perform a specific task, such as assessing the risk of harassment.
[0184] "Emotional state" refers to an individual's emotional state as analyzed from sound data, and is expressed in categories such as joy, anger, and sadness.
[0185] "Means for modifying the harassment evaluation" refers to processes and methods for precisely adjusting harassment risk assessments based on verbal analysis results, taking into account emotional states.
[0186] A "warning" is a message or notification that alerts users and is provided when a high risk of harassment is identified.
[0187] In order to implement this invention, both the server and the terminal must work together. To acquire the user's voice, the terminal is equipped with a voice input device (microphone) and collects sound data in real time. The collected sound data is converted into document data using speech recognition software on the terminal (e.g., Google's speech recognition API).
[0188] Subsequently, the document data is sent to a server, which uses a natural language processing engine to detect intimidating or aggressive language. Furthermore, emotion recognition software runs on the server, recognizing the user's emotional state from the audio data. This emotional state assessment is fed into a learning model and used to improve the accuracy of harassment risk assessments.
[0189] The server uses a learning model trained on past harassment cases to assess the overall risk of harassment. Based on this assessment, it displays warnings and alerts on the user's device in an easy-to-understand format. This process is continuous, enabling real-time feedback.
[0190] As a concrete example, consider a scenario where a user in a meeting says, "Why hasn't this problem been solved yet?" Speech recognition software converts the statement into document data, and a server analyzes its contents. Simultaneously, if frustration is detected in the audio data, emotion recognition software can increase the risk score and send an appropriate alert to the user.
[0191] An example of a prompt using a generative AI model is: "Analyze the given audio and text data for intimidating language and emotions, and assess the likelihood of harassment."
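A prompt of this kind could be assembled from the document data and the recognized emotional state roughly as follows. This is a minimal sketch: the function name `build_prompt`, the layout of the prompt, and the final line requesting a score and a reason are assumptions added for illustration, and no particular generative AI service or API is implied.

```python
def build_prompt(transcript: str, emotion_label: str, emotion_score: float) -> str:
    """Assemble the instruction and the inference data into one prompt string."""
    if not transcript.strip():
        raise ValueError("transcript must not be empty")
    instruction = (
        "Analyze the given audio and text data for intimidating language "
        "and emotions, and assess the likelihood of harassment."
    )
    return (
        f"{instruction}\n\n"
        f"Transcript: {transcript.strip()}\n"
        f"Detected emotion: {emotion_label} (score {emotion_score:.2f})\n"
        # Hypothetical output format, not taken from this disclosure.
        "Answer with a risk score between 0 and 1 and a one-sentence reason."
    )
```

For the statement in paragraph [0190], the call would be `build_prompt("Why hasn't this problem been solved yet?", "frustration", 0.8)`.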
[0192] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0193] Step 1:
[0194] The device acquires the user's voice as input through the microphone. The user's speech, collected as audio data, is passed to speech recognition software within the device. This converts the audio data into document data, which is then output as text data.
[0195] Step 2:
[0196] The terminal sends the document data obtained in step 1 to the server. The server uses a natural language processing engine to analyze this document data. It detects offensive language from the input text and outputs the results as evaluation data.
[0197] Step 3:
[0198] The server simultaneously inputs sound data into emotion recognition software to evaluate the user's emotional state. It outputs an emotion score, which quantifies the emotional state. This score is used later for risk assessment.
[0199] Step 4:
[0200] The server combines the evaluation data and the emotion score obtained from steps 2 and 3 and supplies them to the learning model. The generative AI model uses this data to assess the risk of harassment and outputs a risk score.
[0201] Step 5:
[0202] If the risk score calculated in step 4 exceeds a pre-set threshold, the server creates a prompt and sends an alert message to the terminal. The terminal notifies the user of the warning visually or audibly.
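The means for modifying the harassment evaluation based on emotional states can be pictured, as one possible sketch, as scaling a language-based score by a factor that depends on the recognized emotion category. The multiplier table, the `intensity` parameter, and the default threshold are hypothetical values chosen for this example and do not appear in this disclosure.

```python
# Hypothetical multipliers per emotion category; not taken from this disclosure.
EMOTION_MULTIPLIER = {"anger": 1.5, "sadness": 1.1, "joy": 0.8}


def modify_assessment(base_score: float, emotion: str, intensity: float) -> float:
    """Scale a language-based risk score according to the recognized emotion.

    intensity in [0, 1] interpolates between no change (0) and the full
    multiplier (1). Unknown emotion categories leave the score unchanged.
    """
    if not 0.0 <= base_score <= 1.0 or not 0.0 <= intensity <= 1.0:
        raise ValueError("base_score and intensity must lie in [0, 1]")
    multiplier = EMOTION_MULTIPLIER.get(emotion, 1.0)
    factor = 1.0 + (multiplier - 1.0) * intensity
    return min(base_score * factor, 1.0)


def should_warn(score: float, threshold: float = 0.6) -> bool:
    """Step 5: warn only when the modified score exceeds the threshold."""
    return score > threshold
```

Under these assumed values, a base score of 0.5 stays below the threshold on its own but rises to 0.75 when strong anger is recognized, so a warning would be issued.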
[0203] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0204] Data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (registered trademark) (Internet search, URL: https://openai.com/blog/chatgpt) and Gemini (registered trademark) (Internet search, URL: https://gemini.google.com/?hl=ja). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 infers from the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0205] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.
[0206] [Second Embodiment]
[0207] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.
[0208] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.
[0209] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0210] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.
[0211] The microphone 238 receives instructions from the user 20 by picking up the user's voice. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.
[0212] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0213] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0214] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0215] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0216] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0217] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0218] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".
[0219] This invention is a system that detects power harassment in real time and notifies the user, and is implemented on terminals such as smartphones and PCs. First, the terminal acquires audio data through a microphone and converts this audio data into text information using speech recognition technology. This converted text information, i.e., text data, is in a form that can be quickly analyzed. The terminal sends this text data to a server, and the server performs analysis on the received text data.
[0220] The server uses a natural language processing (NLP) engine to analyze text data. This engine classifies and identifies each word and phrase, taking into account context and grammatical structure. Next, the server evaluates the analysis results using a machine learning model. This model is trained on past data related to power harassment and has the ability to determine whether a particular context or wording constitutes power harassment.
[0221] When the server detects signs of a problem, it calculates a specific risk score. If this score exceeds a set threshold, the server issues an alert to the user. This alert is displayed on the user's screen via the terminal and may also be output as an audio notification. Specifically, the user may see a message such as, "This may be considered aggressive behavior. Please be careful with your words and actions."
[0222] Furthermore, after receiving this notification, users can provide feedback on the timing and content of the notification, allowing the system to continuously improve its accuracy. This entire process takes place in real time, helping users to adjust their behavior immediately.
[0223] As a concrete example, consider a situation where a user says during a meeting, "It's your fault this project is progressing so slowly." This phrase is likely to be analyzed by the server and deemed to be aggressive. The server can then send an alert to the user's terminal, prompting them to reconsider their statement. Through this process, the speaker is given an opportunity to reflect on their words and actions, contributing to an improved work environment.
[0224] The following describes the processing flow.
[0225] Step 1:
[0226] The device acquires audio data. When the user launches an application and enables the recording function, the device's microphone captures ambient sound and records it as audio data.
[0227] Step 2:
[0228] The device uses speech recognition technology to convert the acquired audio data into text data. Here, the spoken words are accurately converted into text information and made into an analyzable format.
[0229] Step 3:
[0230] The terminal sends the converted text data to the server. This transmission occurs in real time and is used as data necessary for analysis on the server.
[0231] Step 4:
[0232] The server receives text data and performs analysis using a natural language processing (NLP) engine. Here, the server tokenizes the text, determines the meaning of each word, and analyzes the context.
[0233] Step 5:
[0234] The server evaluates the possibility of power harassment based on the analysis results. Using a machine learning model, it calculates a risk score for high-handed or aggressive behavior.
[0235] Step 6:
[0236] If the risk score calculated by the server exceeds a set threshold, the server generates an alert message. This message is intended to alert the user.
[0237] Step 7:
[0238] The device receives alert messages sent from the server and notifies the user. The notification is provided as a screen pop-up or audio, prompting the user to provide prompt feedback.
[0239] Step 8:
[0240] Users can review alerts and modify their statements and attitudes as needed. As a result, users can become more aware of their own words and actions.
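Steps 4 to 6 of this text-only flow can be sketched as follows. The phrase table is a deliberately simple stand-in for the trained machine learning model, and its entries, weights, and the threshold are hypothetical values for this example; a real system would use a model trained on past cases, as described above.

```python
import re
from typing import List, Optional

# Hypothetical phrase weights standing in for a trained model.
PHRASE_WEIGHTS = {
    "your fault": 0.7,
    "useless": 0.8,
    "why haven't you": 0.4,
}
THRESHOLD = 0.5


def tokenize(text: str) -> List[str]:
    """Step 4: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def risk_score(text: str) -> float:
    """Step 5: sum the weights of matched phrases, capped at 1.0."""
    normalized = " ".join(tokenize(text))
    score = sum(w for phrase, w in PHRASE_WEIGHTS.items() if phrase in normalized)
    return min(score, 1.0)


def alert_for(text: str) -> Optional[str]:
    """Step 6: return an alert message only when the score exceeds the threshold."""
    if risk_score(text) > THRESHOLD:
        return ("This may be considered aggressive behavior. "
                "Please be careful with your words and actions.")
    return None
```

With this table, the statement from paragraph [0223], "It's your fault this project is progressing so slowly," matches the phrase "your fault" and triggers the alert, whereas a neutral sentence returns no alert.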
[0241] (Example 1)
[0242] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."
[0243] In the workplace, a challenge is to prevent friction and discord caused by inappropriate language and behavior, and to provide a healthier communication environment. Conventional systems have had difficulty detecting inappropriate behavior in real time and responding immediately, thus limiting their effectiveness in situations where immediate action is required.
[0244] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0245] In this invention, the server includes means for acquiring an acoustic signal and converting the acoustic signal into text information, means for analyzing the text information to detect aggressive or condescending language expressions, and means for evaluating the text information using a machine learning model that evaluates the possibility of inappropriate behavior in the workplace in relation to the ongoing language expression. This makes it possible to detect inappropriate remarks in the workplace in real time and quickly issue warnings to users.
[0246] An "acoustic signal" is a recording of ambient sound as an electrical signal, and is primarily used as audio data.
[0247] "Textual information" refers to text data converted from acoustic signals, and serves as fundamental data for language analysis.
[0248] "Linguistic expression" refers to the words and phrases contained in text information, which convey specific meanings and emotions.
[0249] A "machine learning model" is an algorithm that learns from past data and is used to recognize specific patterns or rules.
[0250] "Inappropriate behavior" refers to verbal expressions or attitudes that worsen the workplace environment, and includes things like power harassment.
[0251] "Real-time" means that information processing is performed immediately at the moment it occurs, and results are provided without delay.
[0252] A specific description will be given regarding embodiments for carrying out this invention.
[0253] This system consists of a user's terminal and a server. The terminal, such as a smartphone or personal computer, acquires ambient acoustic signals through its built-in microphone. The terminal then uses speech recognition software to convert the acquired acoustic signals into text information. Examples of such software include Google Cloud Speech-to-Text and similar APIs.
[0254] The text information converted by the terminal is sent to the server in a secure manner. The server analyzes this text information using a natural language processing engine. Natural language processing libraries such as spaCy and NLTK are used for the analysis. The server applies a machine learning model to determine whether the linguistic expressions contained in the text information are domineering or aggressive. Because this model is trained on a dataset of past inappropriate behavior, it can perform evaluations that are relevant to real-world workplace environments.
[0255] Once the analysis is complete, if certain linguistic expressions are deemed inappropriate, the server will notify the user via their terminal. The user can then review their statements after receiving this warning, thus preventing potential disagreements and conflicts.
[0256] For example, if a user says "It's your fault this project is progressing slowly" during a meeting, this phrase is likely to be considered aggressive, and the server will issue a warning. Based on this warning, the user can reflect on their statement and correct it if necessary.
[0257] Furthermore, as part of improving the system's accuracy, users can submit feedback on warnings. This feedback is reflected in the server's machine learning model, contributing to improved overall system reliability.
[0258] An example of a prompt for a generative AI model might be, "Please provide wording that points out that the instructions from your supervisor are unclear and may be causing misunderstandings."
[0259] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0260] Step 1:
[0261] The device uses a microphone to acquire ambient acoustic signals. The input consists of ambient sounds and conversations, which the device captures as digital audio data. This data is stored as a preliminary step before being processed by speech recognition software. Specifically, pre-processing such as adjusting the volume level and noise cancellation is performed.
[0262] Step 2:
[0263] The device converts acquired acoustic signals into text information using speech recognition software. The input is digital audio data, and the output is corresponding text data. Using APIs such as Google Cloud Speech-to-Text, the conversion from audio to text information is performed according to a language model. In this process, silent portions of the audio data are omitted, and important spoken parts are efficiently converted.
[0264] Step 3:
[0265] The terminal sends the converted text information to the server. The input is the generated text data, and the output is the data received by the server. Sending the data via encrypted communication keeps it secure while delivering it to the server with low latency.
[0266] Step 4:
[0267] The server analyzes the received text data using a natural language processing engine. The input is text data, and the output is analyzed data that includes linguistic features and contextual information. Libraries such as spaCy and NLTK are used to perform grammatical structure analysis and keyword extraction. Sentence tone analysis is also performed here, quantifying the degree of aggressiveness.
[0268] Step 5:
[0269] The server uses a machine learning model to assess the likelihood of inappropriate behavior based on the analyzed data. The input is the analyzed data, and the output is the assessment result with a risk score. A model trained on historical datasets is used to calculate the probability that a particular linguistic expression poses a risk.
[0270] Step 6:
[0271] If the risk score exceeds a threshold, the server sends an alert to the terminal. The input is the evaluation result, and the output is a warning message to the user. This alert is displayed as a pop-up on the terminal, and an audio notification is given if necessary. Specific warning messages may include, "Your remarks were judged to be aggressive. Please reconsider your words and actions."
[0272] Step 7:
[0273] Users can provide feedback on the warnings they receive. Input consists of user ratings and opinions, while output is stored as training data for the system. This allows the machine learning model to continuously improve, helping to prevent false positives and enhance accuracy.
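The feedback handling in Step 7 could, for example, be organized as a small store that turns user responses into labeled training examples. The class names, the `user_agrees` flag, and the false positive rate helper are assumptions of this sketch; the disclosure states only that feedback is stored as training data.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Feedback:
    text: str
    risk_score: float
    user_agrees: bool  # True if the user judged the warning appropriate


@dataclass
class FeedbackStore:
    records: List[Feedback] = field(default_factory=list)

    def add(self, feedback: Feedback) -> None:
        self.records.append(feedback)

    def training_examples(self) -> List[Tuple[str, int]]:
        """Label 1 = warning confirmed, 0 = false positive reported."""
        return [(r.text, int(r.user_agrees)) for r in self.records]

    def false_positive_rate(self) -> float:
        """Share of warnings the users rejected; 0.0 when nothing is stored."""
        if not self.records:
            return 0.0
        rejected = sum(1 for r in self.records if not r.user_agrees)
        return rejected / len(self.records)
```

The labeled pairs returned by `training_examples` are in a form that could be fed to a later retraining run, and the false positive rate gives a simple indication of whether the threshold is set too low.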
[0274] (Application Example 1)
[0275] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."
[0276] In modern living environments and workplaces, the likelihood of power harassment occurring due to communication issues is increasing. In particular, there is a need for effective means to immediately detect aggressive or condescending language during real-time conversations and to encourage improved communication. However, conventional systems lack sufficient real-time detection and notification capabilities to address this problem.
[0277] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0278] In this invention, the server includes means for acquiring voice information and converting the voice information into character information, means for analyzing the character information to detect high-pressure or aggressive expressions, means for evaluating the character information using a learning model that evaluates the possibility of power harassment for ongoing expressions, and an automated device that monitors conversations in the living environment and alerts the user when a problem is detected. This enables real-time detection of power harassment and prompt feedback to the user.
[0279] "Voice information" refers to sound data acquired through devices such as microphones.
[0280] "Character information" refers to data in which voice information has been converted into text.
[0281] "Expression" means words or phrases that play the role of language.
[0282] "Learning model" is an algorithm trained based on past data and having the ability to recognize specific patterns or trends.
[0283] "Power harassment" means high-pressure or aggressive acts or remarks against the background of power imbalance in the workplace or living environment.
[0284] "Automated device" refers to a mechanical or electronic device designed to perform specific operations or functions without human intervention.
[0285] "User" refers to an individual or group who uses this system.
[0286] This invention relates to a system for detecting high-pressure or aggressive expressions in real time in the living environment and evaluating the possibility of power harassment. The system uses an automated device equipped with a microphone, a display, and a speaker as hardware, as well as a server and related software.
[0287] First, voice information is obtained through a microphone. To convert this voice information into character information in real time, a speech recognition engine, the Google Speech-to-Text API, is used. The resulting character information is sent from the terminal to the server.
[0288] On the server, the character information is analyzed by a natural language processing engine using spaCy. To determine whether words and phrases are coercive, the ongoing expression is evaluated using a learning model built with a machine learning framework, specifically TensorFlow. This learning model is trained on past cases of power harassment.
[0289] When the server determines that a specific expression is coercive or aggressive, the automated device conveys a warning to the user through a display or speaker. This warning enables quick feedback and prompts the user to review their speech and actions immediately.
[0290] As a specific example, during a conversation within a household, if the automated device detects a phrase such as "You're not doing anything right at all", it is determined to be a coercive expression, and a friendly suggestion is made to the user, saying "Please think of a nicer way to say it".
[0291] Examples of input prompt sentences for the generative AI model include "Analyze the following conversation and determine whether there is a possibility of power harassment. Please also consider the context and tone. Conversation: 'Conversation content'".
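As a minimal sketch, the prompt above could be assembled from recent utterances as follows. The function name `build_prompt` and the choice to join utterances with spaces are assumptions; only the template wording comes from the example prompt.

```python
# Template wording taken from the example prompt; the placeholder name is assumed.
PROMPT_TEMPLATE = (
    "Analyze the following conversation and determine whether there is a "
    "possibility of power harassment. Please also consider the context and tone. "
    "Conversation: '{conversation}'"
)


def build_prompt(utterances: list[str]) -> str:
    """Join non-empty utterances and insert them into the template."""
    conversation = " ".join(u.strip() for u in utterances if u.strip())
    return PROMPT_TEMPLATE.format(conversation=conversation)
```

For example, `build_prompt(["Why is this late?", "I need it today."])` yields a prompt ending in `Conversation: 'Why is this late? I need it today.'`.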
[0292] The flow of the specific process in Application Example 1 will be described using FIG. 12.
[0293] Step 1:
[0294] The device acquires audio information through its microphone. The input is ambient sound, which is captured as an analog audio signal. The device converts this audio signal into digital data and stores it in an appropriate format.
[0295] Step 2:
[0296] The device uses the Google Speech-to-Text API to convert digital audio data into text. The input is audio data, and the output is text data. This process uses speech recognition technology to transcribe the content of the audio data into text with high accuracy.
[0297] Step 3:
[0298] Text information is sent from the terminal to the server. The server receives this text information and analyzes it using spaCy, a natural language processing engine. The input is text data, and the output is the analysis result, which includes language structure and contextual information. The server evaluates the grammar and word meanings.
[0299] Step 4:
[0300] The server evaluates the analyzed data using a machine learning model (using TensorFlow). The model is trained on past cases of workplace harassment; the input is the analysis results, and the output is a risk score indicating the likelihood of workplace harassment. This process scores whether specific phrases or wording are aggressive or intimidating.
[0301] Step 5:
[0302] The server issues a warning to the user if the risk score exceeds a set threshold. This warning is displayed and output via the terminal's display and speaker. For the user, the input is the risk score, and the output is the warning message. This allows the user to immediately review their own words and actions.
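Steps 3 to 5 above can be sketched as follows. This is an illustration under stated assumptions, not the disclosed implementation: the phrase table is a hypothetical stand-in for the trained TensorFlow model, the simple tokenizer stands in for the spaCy analysis, the threshold value is assumed, and the speech recognition of Steps 1 and 2 is omitted.

```python
from typing import Optional

# Hypothetical phrase weights standing in for a trained learning model.
PHRASE_WEIGHTS = {
    "not doing anything right": 0.8,
    "your fault": 0.7,
}
RISK_THRESHOLD = 0.5  # assumed value; the source only says "a set threshold"


def analyze(text: str) -> list[str]:
    """Step 3 stand-in: lowercase the text and split it into tokens."""
    return text.lower().replace(",", " ").replace(".", " ").split()


def risk_score(tokens: list[str]) -> float:
    """Step 4 stand-in: highest weight among the phrases that occur."""
    joined = " ".join(tokens)
    return max((w for p, w in PHRASE_WEIGHTS.items() if p in joined), default=0.0)


def maybe_warn(text: str) -> Optional[str]:
    """Step 5: return a warning message only if the score exceeds the threshold."""
    if risk_score(analyze(text)) > RISK_THRESHOLD:
        return "Please think of a nicer way to say it"
    return None
```

With these assumed weights, the household example "You're not doing anything right at all" produces the warning, while a neutral sentence returns `None`.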
[0303] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.
[0304] This invention is a system that evaluates the possibility of power harassment in real time based on voice and text data and warns the user, further improving accuracy by incorporating an emotion engine. The terminal first records the user's voice through a microphone and converts it into text data using speech recognition technology. The text data obtained from this data conversion process is immediately sent to the server.
[0305] The server analyzes text data using a natural language processing (NLP) engine, evaluating the possibility of harassment by considering the language structure and context. Furthermore, an emotion engine recognizes the user's emotions from the audio and text data and extracts their emotional state. The results of the emotion engine's analysis influence the evaluation of harassment, taking into account situations where the user is not aggressive but is in a negative emotional state.
[0306] The machine learning model is trained on past cases of workplace harassment, and uses this data and sentiment assessments to numerically evaluate the risk of the remarks. Based on this evaluation, the server generates alerts and notifies the user when necessary. The notifications are delivered via the device and provide information to help the user take appropriate action.
[0307] As a concrete example, consider a situation where a user says during a meeting, "I want to know why you're not meeting my expectations." When this statement is made, the server performs linguistic analysis, and simultaneously, the emotion engine recognizes the user's emotions from their tone and speed of voice. If the system determines that the user is feeling irritated or dissatisfied, the overall risk score may increase. As a result, the device will issue an alert to the user, such as, "Your statement may be misleading. Please maintain a calm tone."
[0308] In this way, incorporating the emotion engine enables more advanced detection of power harassment that takes into account not only the content of the language but also the emotions behind it. As a result, users can gain a deeper understanding of their own communication and achieve a better communication environment.
[0309] The following describes the processing flow.
[0310] Step 1:
[0311] The terminal acquires voice data. The user activates the recording function of the application and records conversations and voices, including surrounding noise. This secures the voice data used for subsequent analysis.
[0312] Step 2:
[0313] The terminal uses voice recognition technology to convert the acquired voice data into text data. Converting the voice input into text format prepares it for natural language processing.
[0314] Step 3:
[0315] The terminal sends the converted text data to the server. As a result, the server receives the data necessary for analyzing the content of the voice. Performing the analysis on the server minimizes the processing capability required at the terminal.
[0316] Step 4:
[0317] The server analyzes the text data using a natural language processing engine. Here, it understands the context of the tokenized words and determines whether there are high-pressure or aggressive words included.
[0318] Step 5:
[0319] The server uses an emotion engine to analyze the user's voice characteristics (tone, speed, and volume) and recognize the user's emotional state. This allows for a quantitative measurement of their emotions at that moment.
[0320] Step 6:
[0321] The server uses a machine learning model to integrate text data with emotional state assessments and calculate a risk score for power harassment. This score is then refined by a model trained on past case data.
[0322] Step 7:
[0323] The server generates an alert for the user if the risk score exceeds a set threshold. The alert includes specific suggestions for improvement and cautionary messages.
[0324] Step 8:
[0325] The device notifies the user of alerts sent from the server. Notifications are efficiently delivered through on-screen pop-ups and audio alerts, providing the user with the information they need immediately.
[0326] Step 9:
[0327] Users review the notification content and modify their conversations as needed. This feedback loop allows users to develop an awareness of improving their words and actions.
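Steps 5 and 6 above, in which voice characteristics are turned into an emotion measurement and integrated with the text evaluation, can be sketched as follows. The feature ranges, the averaging rule, and the 0.7 text weight are all assumptions for illustration; the source states only that tone, speed, and volume are analyzed and that a learning model integrates the results.

```python
from dataclasses import dataclass


@dataclass
class VoiceFeatures:
    pitch_hz: float          # tone
    words_per_minute: float  # speed
    volume_db: float         # volume


def _scale(value: float, low: float, high: float) -> float:
    """Map value from [low, high] to [0, 1], clamping outside the range."""
    return min(1.0, max(0.0, (value - low) / (high - low)))


def emotion_score(v: VoiceFeatures) -> float:
    """Assumed agitation measure: mean of the three scaled features."""
    parts = (
        _scale(v.pitch_hz, 120, 300),
        _scale(v.words_per_minute, 120, 220),
        _scale(v.volume_db, 55, 85),
    )
    return sum(parts) / len(parts)


def combined_risk(text_score: float, emo_score: float, text_weight: float = 0.7) -> float:
    """Weighted sum standing in for the learning model's integration step."""
    return text_weight * text_score + (1 - text_weight) * emo_score
```

In this sketch a statement with a moderate text score can still cross the alert threshold when the voice features indicate strong agitation, which mirrors the meeting example in which irritation raises the overall risk score.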
[0328] (Example 2)
[0329] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".
[0330] In today's communication environment, there is a need to quickly detect remarks that may unintentionally be perceived as aggressive or condescending, and to provide appropriate feedback. However, conventional systems primarily analyze only the content of the language, without conducting detailed evaluations based on the underlying emotions and context. As a result, there is a challenge in that it is difficult for users to immediately understand and improve their own communication style.
[0331] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0332] In this invention, the server includes means for analyzing voice and text information to detect aggressive or condescending language, means for evaluating language information using a learning model to determine the possibility of power harassment, and means for performing sentiment analysis and incorporating the results into the evaluation. This makes it possible to provide real-time feedback to the user and promote a deeper understanding of communication.
[0333] "Voice information" refers to data related to human voices acquired through input devices such as microphones.
[0334] "Textual information" refers to data that is represented as text after audio information has been converted.
[0335] A "learning model" is a mathematical model that performs pattern recognition and classification based on past data and examples.
[0336] "Analysis" is the process of examining data to understand and evaluate its content.
[0337] "Evaluation" is the process of determining the value and risks of something based on analyzed data.
[0338] "Emotional analysis" is a technology that infers and extracts a person's emotional state from information expressed as voice or text.
[0339] "Notification" refers to the act of providing information to inform users of the evaluation results.
[0340] This system is designed to analyze voice information and assess the risk of power harassment in real time. A specific implementation is described below.
[0341] First, when a user speaks, the device acquires the audio information through its microphone. This audio information is then converted into text using speech recognition technology built into the device, such as general speech-to-text software. At this stage, high-quality microphones and hardware with noise-canceling capabilities are recommended to improve audio accuracy.
[0342] The converted text information is sent to the server in real time. The server receives this information and performs analysis using a natural language processing engine. The analysis utilizes language processing libraries for contextual understanding and syntactic analysis of the text information. For example, open-source and commercial natural language processing tools are commonly used.
[0343] As the next step in the analysis, the server utilizes an emotion analysis engine to extract the user's emotional state from the audio and text information. This emotion data is crucial for understanding the psychological aspects behind the statements. The emotion engine is trained using machine learning and employs data models to identify a variety of emotions.
[0344] The server integrates these analysis results and uses a learning model to assess the risk of power harassment. Because the learning model is trained on a large dataset based on past cases, it can provide accurate risk assessments.
[0345] When notifying a user of a risk, the server sends the assessment results to the terminal, and a warning message is displayed to the user. For example, the user might be advised, "Your statement may be misleading. Please maintain a calm tone."
[0346] As a concrete example, consider the behavior of a user who, during a meeting, asks, "Please tell me why you haven't met my expectations." In this case, the system would assess the possibility of the statement being misinterpreted as harassment and provide appropriate feedback. Possible prompts for the generative AI model could include instructions such as, "Perform a communication risk assessment based on text and audio data, and present the results."
[0347] This system allows users to receive a risk assessment that includes the emotional context of their statements, and to gain guidance for improving their communication style.
[0348] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0349] Step 1:
[0350] When a user speaks, the device uses its microphone to acquire audio information. Since this audio information cannot be analyzed directly, it is converted into text using speech recognition technology. The converted text is then stored digitally within the device, and the process proceeds to the next step.
[0351] Step 2:
[0352] The terminal sends the converted text information to the server. The server feeds the received text information into a natural language processing engine to analyze the sentence structure and context. The input here is text information, and the output is evaluation information, including the detection results of aggressive or condescending language. This data processing is performed through syntactic and semantic analysis of the text.
[0353] Step 3:
[0354] The server also receives audio information and uses an emotion analysis engine to extract the user's emotional state. In this step, raw audio data is used as input, and the user's emotional state information after analysis is output. Emotions are read from the tone and speed of the voice, and the evaluation is performed by integrating this with the results of text analysis.
[0355] Step 4:
[0356] The server integrates the results of text analysis and sentiment analysis, and uses a learning model to assess the risk of power harassment. The input consists of text and sentiment analysis results, and the output is a risk score. This score calculation utilizes a machine learning algorithm that has been refined based on similar past cases.
[0357] Step 5:
[0358] If the risk assessment results in a score exceeding a threshold, the server generates a warning message and sends it to the terminal. The terminal notifies the user of this message and displays specific feedback such as "Your statement may be misleading. Please maintain a calm tone," prompting the user to immediately improve their behavior.
[0359] This series of processes allows users to receive real-time feedback and recognize communication risks.
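The exchange between the terminal and the server in Steps 2 and 5 can be sketched as a pair of JSON messages. The message fields, the function names, and the default threshold are hypothetical; the source does not define a message format. The scoring function is passed in as a parameter so that any model can be substituted.

```python
import json
from typing import Callable, Optional

WARNING_TEXT = "Your statement may be misleading. Please maintain a calm tone."


def encode_request(user_id: str, text: str) -> str:
    """Terminal side: wrap the recognized text in a request message."""
    return json.dumps({"type": "analyze", "user_id": user_id, "text": text})


def handle_request(raw: str, score_fn: Callable[[str], float],
                   threshold: float = 0.5) -> str:
    """Server side: score the text and attach a warning if the threshold is exceeded."""
    req = json.loads(raw)
    score = score_fn(req["text"])
    resp = {"type": "result", "user_id": req["user_id"],
            "risk_score": score, "warning": None}
    if score > threshold:
        resp["warning"] = WARNING_TEXT
    return json.dumps(resp)


def display_text(raw_response: str) -> Optional[str]:
    """Terminal side: extract the message to show, or None if there is none."""
    return json.loads(raw_response)["warning"]
```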
[0360] (Application Example 2)
[0361] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."
[0362] In workplaces, educational institutions, and other settings, there is a growing need to prevent harassment and create a healthy communication environment. However, traditional methods have struggled to evaluate not only the content of statements but also background information such as emotions. In particular, many cases of harassment stem from minor misunderstandings or heightened emotions, and a system is needed that can recognize this in real time and warn users.
[0363] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0364] In this invention, the server includes means for acquiring sound data and converting said sound data into document data; means for analyzing said document data to detect intimidating or aggressive language; means for evaluating said document data using a learning model that evaluates the possibility of harassment for the language being spoken; means for recognizing emotional states from the sound data; and means for modifying the harassment evaluation based on the emotional states. This makes it possible to comprehensively evaluate the possibility of harassment from both emotional and linguistic aspects and provide warnings to users at an appropriate time.
[0365] "Sound data" refers to recordings of speech or sounds as digital information, primarily collected through devices such as microphones.
[0366] "Document data" refers to sound data converted into text, in a format that can be analyzed and evaluated through natural language processing.
[0367] "Intimidating or aggressive language" refers to language that includes expressions that exert physical or psychological pressure or aggression on others, and its detection is important in assessing the possibility of harassment.
[0368] A "learning model" is an algorithm or framework that is trained using historical data to perform a specific task, such as assessing the risk of harassment.
[0369] "Emotional state" refers to an individual's emotional state as analyzed from sound data, and is expressed in categories such as joy, anger, and sadness.
[0370] "Means of modifying harassment assessments" refer to processes and methods for precisely adjusting harassment risk assessments based on verbal analysis results, taking into account emotional states.
[0371] A "warning" is a message or notification that alerts users and is provided when a high risk of harassment is identified.
[0372] In order to implement this invention, both the server and the terminal must work together. To acquire the user's voice, the terminal is equipped with a voice input device (microphone) and collects sound data in real time. The collected sound data is converted into document data using speech recognition software on the terminal (e.g., Google's speech recognition API).
[0373] Subsequently, the document data is sent to a server, which uses a natural language processing engine to detect intimidating or aggressive language. Furthermore, emotion recognition software runs on the server, recognizing the user's emotional state from the audio data. This emotional state assessment is fed into a learning model and used to improve the accuracy of harassment risk assessments.
[0374] The server uses a learning model trained on past harassment cases to assess the overall risk of harassment. Based on this assessment, it displays warnings and alerts on the user's device in an easy-to-understand format. This process is continuous, enabling real-time feedback.
[0375] As a concrete example, consider a scenario where a user in a meeting says, "Why hasn't this problem been solved yet?" Speech recognition software converts the statement into document data, and a server analyzes its contents. Simultaneously, if frustration is detected in the audio data, emotion recognition software can increase the risk score and send an appropriate alert to the user.
[0376] An example of a prompt using a generative AI model is: "Analyze the given audio and text data for intimidating language and emotions, and assess the likelihood of harassment."
[0377] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0378] Step 1:
[0379] The device acquires the user's voice as input through the microphone. The user's speech, collected as audio data, is passed to speech recognition software within the device. This converts the audio data into document data, which is then output as text data.
[0380] Step 2:
[0381] The terminal sends the document data obtained in step 1 to the server. The server uses a natural language processing engine to analyze this document data. It detects offensive language from the input text and outputs the results as evaluation data.
[0382] Step 3:
[0383] The server simultaneously inputs sound data into emotion recognition software to evaluate the user's emotional state. It outputs an emotion score, which quantifies the emotional state. This score is used later for risk assessment.
[0384] Step 4:
[0385] The server combines the evaluation data and emotion scores obtained from steps 2 and 3 and supplies them to the learning model. The generative AI model uses this data to assess the risk of harassment and outputs a risk score.
[0386] Step 5:
[0387] If the risk score calculated in step 4 exceeds a pre-set threshold, the server creates a prompt and sends an alert message to the terminal. The terminal notifies the user of the warning visually or audibly.
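The "means for modifying the harassment evaluation based on the emotional states" can be sketched as a category-based adjustment of the linguistic risk score. The multipliers below are assumed values chosen for illustration; the source states only that emotional states such as joy, anger, and sadness are recognized and that detected frustration can increase the risk score.

```python
# Assumed multipliers per recognized emotion category; unknown categories
# leave the score unchanged.
EMOTION_FACTORS = {"anger": 1.3, "sadness": 1.1, "joy": 1.0}


def adjust_risk(base_score: float, emotion: str) -> float:
    """Scale the linguistic risk score by the emotion factor, clamped to [0, 1]."""
    factor = EMOTION_FACTORS.get(emotion, 1.0)
    return min(1.0, max(0.0, base_score * factor))
```

With these assumed factors, a statement scored 0.5 on language alone rises to about 0.65 when anger is recognized, which could move it across a threshold set between the two values.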
[0388] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0389] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives as input a prompt containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
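The input contract described for the data generation model 58, namely a prompt with instructions plus inference data in one or more formats, can be sketched as a small wrapper. The type `InferenceInput` and the function `run_inference` are hypothetical names, and the model itself is represented by any callable supplied by the caller; no real generative AI API is assumed.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class InferenceInput:
    prompt: str                    # instructions for the model
    text: Optional[str] = None     # text data representing text
    audio: Optional[bytes] = None  # audio data representing speech
    image: Optional[bytes] = None  # image data representing images


def run_inference(model: Callable[[InferenceInput], str],
                  item: InferenceInput) -> str:
    """Validate the input, then delegate to the supplied model callable."""
    if not item.prompt.strip():
        raise ValueError("a prompt containing instructions is required")
    if item.text is None and item.audio is None and item.image is None:
        raise ValueError("no inference data supplied")
    return model(item)
```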
[0390] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.
[0391] [Third Embodiment]
[0392] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.
[0393] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.
[0394] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0395] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.
[0396] The microphone 238 receives instructions from the user 20 by picking up the user's voice. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs the audio data to the processor 46. The speaker 240 outputs audio according to instructions from the processor 46.
[0397] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0398] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0399] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0400] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0401] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0402] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0403] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".
[0404] This invention is a system that detects power harassment in real time and notifies the user, and is implemented on terminals such as smartphones and PCs. First, the terminal acquires audio data through a microphone and converts this audio data into text information using speech recognition technology. This converted text information, i.e., text data, is in a form that can be quickly analyzed. The terminal sends this text data to a server, and the server performs analysis on the received text data.
[0405] The server uses a natural language processing (NLP) engine to analyze text data. This engine classifies and identifies each word and phrase, taking into account context and grammatical structure. Next, the server evaluates the analysis results using a machine learning model. This model is trained on past data related to power harassment and has the ability to determine whether a particular context or wording constitutes power harassment.
[0406] When the server detects signs of a problem, it calculates a specific risk score. If this score exceeds a set threshold, the server issues an alert to the user. This alert is displayed on the user's screen via the terminal and may also be output as an audio notification. Specifically, the user may see a message such as, "This may be considered aggressive behavior. Please be careful with your words and actions."
[0407] Furthermore, after receiving this notification, users can provide feedback on the timing and content of the notification, allowing the system to continuously improve its accuracy. This entire process takes place in real time, helping users to adjust their behavior immediately as needed.
[0408] As a concrete example, consider a situation where a user says during a meeting, "It's your fault this project is progressing so slowly." The server analyzes this phrase and is likely to deem it aggressive. The server can then send an alert to the user's terminal, prompting them to reconsider their statement. Through this process, the speaker is given an opportunity to reflect on their words and actions, contributing to an improved work environment.
[0409] The following describes the processing flow.
[0410] Step 1:
[0411] The device acquires audio data. When the user launches an application and enables the recording function, the device's microphone captures ambient sound and records it as audio data.
[0412] Step 2:
[0413] The device uses speech recognition technology to convert the acquired audio data into text data. Here, the spoken words are accurately converted into text information and made into an analyzable format.
[0414] Step 3:
[0415] The terminal sends the converted text data to the server. This transmission occurs in real time and is used as data necessary for analysis on the server.
[0416] Step 4:
[0417] The server receives text data and performs analysis using a natural language processing (NLP) engine. Here, the server tokenizes the text, determines the meaning of each word, and analyzes the context.
[0418] Step 5:
[0419] The server evaluates the possibility of power harassment based on the analysis results. Using a machine learning model, it calculates a risk score for high-handed or aggressive behavior.
[0420] Step 6:
[0421] If the risk score calculated by the server exceeds a set threshold, the server generates an alert message. This message is intended to alert the user.
[0422] Step 7:
[0423] The device receives alert messages sent from the server and notifies the user. The notification is provided as a screen pop-up or audio, giving the user prompt feedback.
[0424] Step 8:
[0425] Users can review alerts and modify their statements and attitudes as needed. As a result, users can become more aware of their own words and actions.
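The per-utterance, real-time character of Steps 3 to 7 can be sketched as a generator that consumes utterances as they arrive and yields an alert only for those whose score exceeds the threshold. The function name `alert_stream`, the default threshold, and the pluggable `score_fn` are assumptions; the alert wording is taken from the example message in this embodiment.

```python
from typing import Callable, Iterable, Iterator

ALERT = ("This may be considered aggressive behavior. "
         "Please be careful with your words and actions.")


def alert_stream(utterances: Iterable[str],
                 score_fn: Callable[[str], float],
                 threshold: float = 0.5) -> Iterator[tuple[str, str]]:
    """Yield (utterance, alert) pairs for utterances scoring above the threshold."""
    for text in utterances:
        if score_fn(text) > threshold:
            yield text, ALERT
```

Because the generator processes one utterance at a time, the terminal can notify the user as soon as a flagged statement has been scored, without waiting for the conversation to end.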
[0426] (Example 1)
[0427] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0428] In the workplace, a challenge is to prevent friction and discord caused by inappropriate language and behavior, and to provide a healthier communication environment. Conventional systems have had difficulty detecting inappropriate behavior in real time and responding immediately, thus limiting their effectiveness in situations where immediate action is required.
[0429] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0430] In this invention, the server includes means for acquiring an acoustic signal and converting the acoustic signal into text information, means for analyzing the text information to detect aggressive or condescending language expressions, and means for evaluating the text information using a machine learning model that evaluates the possibility of inappropriate behavior in the workplace in relation to the ongoing language expression. This makes it possible to detect inappropriate remarks in the workplace in real time and quickly issue warnings to users.
[0431] An "acoustic signal" is a recording of ambient sound as an electrical signal, and is primarily used as audio data.
[0432] "Textual information" refers to text data converted from acoustic signals, and serves as fundamental data for language analysis.
[0433] "Linguistic expression" refers to the words and phrases contained in text information, which convey specific meanings and emotions.
[0434] A "machine learning model" is an algorithm that learns from past data and is used to recognize specific patterns or rules.
[0435] "Inappropriate behavior" refers to verbal expressions or attitudes that worsen the workplace environment, and includes things like power harassment.
[0436] "Real-time" means that information processing is performed immediately at the moment it occurs, and results are provided without delay.
[0437] A specific description will be given regarding embodiments for carrying out this invention.
[0438] This system consists of a user's terminal and a server. The terminal is a smartphone or personal computer that acquires ambient acoustic signals through its built-in microphone. The terminal then uses speech recognition software to convert the acquired acoustic signals into text information. Examples of such software include Google Cloud Speech-to-Text and similar APIs.
[0439] The text information converted by the terminal is sent to the server in a secure manner. The server analyzes this text information using a natural language processing engine. Natural language processing libraries such as SpaCy and NLTK are used for the analysis. The server applies a machine learning model to determine whether the linguistic expressions contained in the text information are domineering or aggressive. Because this model is trained on a dataset of past inappropriate behavior, it can perform evaluations that are relevant to real-world workplace environments.
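As a simplified stand-in for the server-side determination described above, the following sketch scores a sentence against a small phrase list using only the Python standard library. It deliberately replaces the natural language processing engine and the trained machine learning model with a fixed lexicon. The patterns, their weights, and the name `aggression_score` are hypothetical and serve only to show the shape of the input and output.

```python
import re

# Hypothetical phrase weights for illustration only; a trained model would
# learn such associations from a dataset of past inappropriate behavior.
AGGRESSIVE_PATTERNS = {
    r"\byour fault\b": 0.6,
    r"\buseless\b": 0.5,
    r"\bnot doing well at all\b": 0.5,
}


def aggression_score(text: str) -> float:
    """Sum the weights of all matching patterns and clip the result to 1.0."""
    lowered = text.lower()
    total = sum(
        weight
        for pattern, weight in AGGRESSIVE_PATTERNS.items()
        if re.search(pattern, lowered)
    )
    return min(total, 1.0)
```

Under these assumed weights, a sentence containing "your fault" receives a score of 0.6, and a sentence with no matching phrase receives 0.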
[0440] Once the analysis is complete, if certain linguistic expressions are deemed inappropriate, the server will notify the user via their terminal. The user can then review their statements after receiving this warning, thus preventing potential disagreements and conflicts.
[0441] For example, if a user says "It's your fault this project is progressing slowly" during a meeting, this phrase is likely to be considered aggressive, and the server will issue a warning. Based on this warning, the user can reflect on their statement and correct it if necessary.
[0442] Furthermore, as part of improving the system's accuracy, users can submit feedback on warnings. This feedback is reflected in the server's machine learning model, contributing to improved overall system reliability.
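One possible way to turn user feedback on a warning into labeled training data is sketched below. The record layout, the label convention, and the names `Feedback`, `to_training_example`, and `to_jsonl_line` are assumptions made for this sketch; the disclosure does not specify how feedback is stored or how the model is retrained.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Feedback:
    utterance: str      # the statement that triggered the warning
    risk_score: float   # score the server assigned at the time
    user_agrees: bool   # True if the user considers the warning justified


def to_training_example(fb: Feedback) -> dict:
    """Map feedback to a labeled example: 1 = inappropriate, 0 = false positive."""
    return {"text": fb.utterance, "label": 1 if fb.user_agrees else 0}


def to_jsonl_line(fb: Feedback) -> str:
    """Serialize the raw feedback as one JSON line for later retraining."""
    return json.dumps(asdict(fb))
```

Keeping the raw feedback alongside the derived label allows the labeling rule to be revised later without losing the user's original response.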
[0443] An example of a prompt for a generative AI model might be, "Please provide wording that points out that the instructions from your supervisor are unclear and may be causing misunderstandings."
[0444] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0445] Step 1:
[0446] The device uses a microphone to acquire ambient acoustic signals. The input consists of ambient sounds and conversations, which the device captures as digital audio data. This data is stored as a preliminary step before being processed by speech recognition software. Specifically, pre-processing such as adjusting the volume level and noise cancellation is performed.
[0447] Step 2:
[0448] The device converts acquired acoustic signals into text information using speech recognition software. The input is digital audio data, and the output is corresponding text data. Using APIs such as Google Cloud Speech-to-Text, the conversion from audio to text information is performed according to a language model. In this process, silent portions of the audio data are omitted, and important spoken parts are efficiently converted.
[0449] Step 3:
[0450] The terminal sends the converted text information to the server. The input is the generated text data, and the output is the data received by the server for processing. The data is sent over an encrypted channel, which protects it in transit while keeping delivery latency low.
[0451] Step 4:
[0452] The server analyzes the received text data using a natural language processing engine. The input is text data, and the output is analyzed data that includes linguistic features and contextual information. Libraries such as SpaCy and NLTK are used to perform grammatical structure analysis and keyword extraction. Sentence tone analysis is also performed here, quantifying the degree of aggressiveness.
[0453] Step 5:
[0454] The server uses a machine learning model to assess the likelihood of inappropriate behavior based on the analyzed data. The input is the analyzed data, and the output is the assessment result with a risk score. A model trained on historical datasets is used to calculate the probability that a particular linguistic expression poses a risk.
[0455] Step 6:
[0456] If the risk score exceeds a threshold, the server sends an alert to the terminal. The input is the evaluation result, and the output is a warning message to the user. This alert is displayed as a pop-up on the terminal, and an audio notification is given if necessary. Specific warning messages may include, "Your remarks were judged to be aggressive. Please reconsider your words and actions."
[0457] Step 7:
[0458] Users can provide feedback on the warnings they receive. Input consists of user ratings and opinions, while output is stored as training data for the system. This allows the machine learning model to continuously improve, helping to prevent false positives and enhance accuracy.
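The volume adjustment mentioned in Step 1 and the omission of silent portions mentioned in Step 2 of the above flow might, for example, be sketched as follows on a list of samples in the range -1.0 to 1.0. The frame size, the energy threshold, and the function names are hypothetical, and noise cancellation is not covered by this sketch.

```python
from typing import List


def normalize_volume(samples: List[float], target_peak: float = 0.9) -> List[float]:
    """Scale the samples so that the loudest one reaches target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # pure silence: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]


def drop_silent_frames(
    samples: List[float], frame_size: int = 4, min_energy: float = 0.01
) -> List[float]:
    """Keep only the frames whose mean squared amplitude reaches min_energy."""
    kept: List[float] = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        energy = sum(s * s for s in frame) / len(frame)
        if energy >= min_energy:
            kept.extend(frame)
    return kept
```

A real recording would use far larger frames (for example, tens of milliseconds of samples); the small default here only keeps the example easy to follow.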
[0459] (Application Example 1)
[0460] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0461] In modern living environments and workplaces, the likelihood of power harassment occurring due to communication issues is increasing. In particular, there is a need for effective means to immediately detect aggressive or condescending language during real-time conversations and to encourage improved communication. However, conventional systems lack sufficient real-time detection and notification capabilities to address this problem.
[0462] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0463] In this invention, the server includes means for acquiring audio information and converting said audio information into text information; means for analyzing said text information to detect high-pressure or aggressive expressions; means for evaluating said text information using a learning model that assesses the possibility of power harassment for ongoing expressions; and means for an automated device that monitors conversations in a living environment and alerts the user if a problem is detected. This enables real-time detection of power harassment and rapid feedback to the user.
[0464] "Audio information" refers to sound data acquired through devices such as microphones.
[0465] "Textual information" refers to a data format in which audio information is converted into text.
[0466] "Expression" refers to words and phrases that fulfill the role of language.
[0467] A "learning model" is an algorithm that is trained on past data and has the ability to recognize specific patterns and trends.
[0468] "Power harassment" refers to overbearing or aggressive behavior or remarks that stem from an imbalance of power in the workplace or living environment.
[0469] An "automated device" refers to a mechanical or electronic device designed to perform a specific task or function without human intervention.
[0470] "User" refers to an individual or group whose purpose is to use this system.
[0471] This invention relates to a system for detecting intimidating or aggressive behavior in a living environment in real time and for evaluating the possibility of power harassment. The system uses an automated device equipped with a microphone, display, and speaker as hardware, as well as a server and related software.
[0472] First, audio information is acquired through the microphone. The device uses the Google Speech-to-Text API, a speech recognition engine, to convert this audio information into text in real time. The text information thus converted is then sent from the device to the server.
[0473] On the server, textual information is analyzed using a natural language processing engine based on SpaCy. To determine whether words and phrases are aggressive, the ongoing expression is evaluated by a learning model, specifically a machine learning algorithm using TensorFlow. This machine learning model is trained based on past cases of power harassment.
[0474] If the server determines that certain expressions are aggressive or offensive, the automated device will warn the user through a display or speaker. This warning allows for quick feedback and encourages the user to reconsider their words and actions immediately.
[0475] For example, if an automated device detects the phrase "You're not doing well at all" during a conversation at home, it will determine that this is an overbearing expression and offer a user-friendly suggestion such as "Try to think of a gentler way to say it."
[0476] An example of an input prompt for a generative AI model is: "Analyze the following conversation and determine if it may constitute harassment. Consider the context and tone. Conversation: 'Conversation content'."
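For illustration, the input prompt quoted above could be assembled from recognized utterances as in the following sketch. The template text follows the example in the preceding paragraph, with the 'Conversation content' placeholder filled in at run time. The names `PROMPT_TEMPLATE` and `build_prompt` are assumptions introduced here.

```python
from typing import Iterable

# Template text taken from the example prompt in the description.
PROMPT_TEMPLATE = (
    "Analyze the following conversation and determine if it may constitute "
    "harassment. Consider the context and tone. Conversation: '{conversation}'"
)


def build_prompt(utterances: Iterable[str]) -> str:
    """Join the non-empty utterances and insert them into the template."""
    conversation = " ".join(u.strip() for u in utterances if u.strip())
    return PROMPT_TEMPLATE.format(conversation=conversation)
```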
[0477] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0478] Step 1:
[0479] The device acquires audio information through its microphone. The input is ambient sound, which is captured as an analog audio signal. The device converts this audio signal into digital data and stores it in an appropriate format.
[0480] Step 2:
[0481] The device uses the Google Speech-to-Text API to convert digital audio data into text. The input is audio data, and the output is text data. This process uses speech recognition technology to transcribe the content of the audio data into text with high accuracy.
[0482] Step 3:
[0483] Text information is sent from the terminal to the server. The server receives this text information and performs analysis using SpaCy, a natural language processing engine. The input is text data, and the output is the analysis result, which includes language structure and contextual information. The server evaluates the grammar and word meanings.
[0484] Step 4:
[0485] The server evaluates the analyzed data using a machine learning model (using TensorFlow). The model is trained on past cases of workplace harassment; the input is the analysis results, and the output is a risk score indicating the likelihood of workplace harassment. This process scores whether specific phrases or wording are aggressive or intimidating.
[0486] Step 5:
[0487] The server issues a warning to the user if the risk score exceeds a set threshold. This warning is displayed and output via the terminal's display and speaker. In this step, the input is the risk score, and the output is the warning message. This allows the user to immediately review their own words and actions.
[0488] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.
[0489] This invention is a system that evaluates the possibility of power harassment in real time based on voice and text data and warns the user, further improving accuracy by incorporating an emotion engine. The terminal first records the user's voice through a microphone and converts it into text data using speech recognition technology. The text data obtained from this data conversion process is immediately sent to the server.
[0490] The server analyzes text data using a natural language processing (NLP) engine, evaluating the possibility of harassment by considering the language structure and context. Furthermore, an emotion engine recognizes the user's emotions from the audio and text data and extracts their emotional state. The results of the emotion engine's analysis influence the evaluation of harassment, taking into account situations where the user is not aggressive but is in a negative emotional state.
[0491] The machine learning model is trained on past cases of workplace harassment, and uses this data and sentiment assessments to numerically evaluate the risk of the remarks. Based on this evaluation, the server generates alerts and notifies the user when necessary. The notifications are delivered via the device and provide information to help the user take appropriate action.
[0492] As a concrete example, consider a situation where a user says during a meeting, "I want to know why you're not meeting my expectations." When this statement is made, the server performs linguistic analysis, and simultaneously, the emotion engine recognizes the user's emotions from their tone and speed of voice. If the system determines that the user is feeling irritated or dissatisfied, the overall risk score may increase. As a result, the device will issue an alert to the user, such as, "Your statement may be misleading. Please maintain a calm tone."
[0493] By combining this with an emotion engine, it becomes possible to detect power harassment in a sophisticated way that considers not only the content of the language but also the emotions behind it. As a result, users can gain a deeper understanding and create a better communication environment.
[0494] The following describes the processing flow.
[0495] Step 1:
[0496] The device acquires audio data. The user activates the application's recording function and records conversations and voices, including ambient noise. This ensures that audio data is available for later analysis.
[0497] Step 2:
[0498] The device converts the acquired voice data into text data using speech recognition technology. Once the voice input is converted to text format, it is ready for natural language processing.
[0499] Step 3:
[0500] The terminal sends the converted text data to the server. The server then receives the data necessary to analyze the audio content. Offloading the analysis to the server keeps the processing load on the terminal low.
[0501] Step 4:
[0502] The server analyzes the text data using a natural language processing engine. Here, it understands the context of tokenized words and determines whether they contain aggressive or condescending language.
[0503] Step 5:
[0504] The server uses an emotion engine to analyze the user's voice characteristics (tone, speed, and volume) and recognize the user's emotional state. This allows for a quantitative measurement of their emotions at that moment.
[0505] Step 6:
[0506] The server uses a machine learning model to integrate text data with emotional state assessments and calculate a risk score for power harassment. This score is then refined by a model trained on past case data.
[0507] Step 7:
[0508] The server generates an alert for the user if the risk score exceeds a set threshold. The alert includes specific suggestions for improvement and cautionary messages.
[0509] Step 8:
[0510] The device notifies the user of alerts sent from the server. Notifications are efficiently delivered through on-screen pop-ups and audio alerts, providing the user with the information they need immediately.
[0511] Step 9:
[0512] Users review the notification content and modify their conversations as needed. This feedback loop allows users to develop an awareness of improving their words and actions.
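The quantification of voice characteristics in Step 5 of the above flow might be approximated as in the following sketch, which derives speaking speed and volume and combines them into a single value. This is a heuristic stand-in for the emotion engine, not its actual method. Tone (pitch) is omitted for brevity, and the baseline values and the names `speech_rate`, `rms_volume`, and `agitation_score` are hypothetical.

```python
import math
from typing import List


def speech_rate(word_count: int, duration_s: float) -> float:
    """Speaking speed in words per second."""
    return word_count / duration_s if duration_s > 0 else 0.0


def rms_volume(samples: List[float]) -> float:
    """Root-mean-square amplitude of the samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def agitation_score(
    rate_wps: float,
    volume_rms: float,
    baseline_rate: float = 2.5,    # hypothetical calm speaking speed
    baseline_volume: float = 0.1,  # hypothetical calm volume
) -> float:
    """Score in [0, 1] for how far speed and volume exceed a calm baseline."""
    rate_excess = max(0.0, rate_wps / baseline_rate - 1.0)
    volume_excess = max(0.0, volume_rms / baseline_volume - 1.0)
    return min(1.0, 0.5 * rate_excess + 0.5 * volume_excess)
```

In practice the baselines would likely be calibrated per user, since habitual speaking speed and volume differ between individuals.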
[0513] (Example 2)
[0514] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0515] In today's communication environment, there is a need to quickly detect remarks that may unintentionally be perceived as aggressive or condescending, and to provide appropriate feedback. However, conventional systems primarily analyze only the content of the language, without conducting detailed evaluations based on the underlying emotions and context. As a result, there is a challenge in that it is difficult for users to immediately understand and improve their own communication style.
[0516] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0517] In this invention, the server includes means for analyzing voice and text information to detect aggressive or condescending language, means for evaluating language information using a learning model to determine the possibility of power harassment, and means for performing sentiment analysis and incorporating the results into the evaluation. This makes it possible to provide real-time feedback to the user and promote a deeper understanding of communication.
[0518] "Voice information" refers to data related to human voices acquired through input devices such as microphones.
[0519] "Textual information" refers to data that is represented as text after audio information has been converted.
[0520] A "learning model" is a mathematical model that performs pattern recognition and classification based on past data and examples.
[0521] "Analysis" is the process of examining data in order to understand and evaluate its content.
[0522] "Evaluation" is the process of determining the value and risks of something based on analyzed data.
[0523] "Sentiment analysis" is a technology that infers and extracts a person's emotional state from information expressed as voice or text.
[0524] "Notification" refers to the act of providing information to inform users of the evaluation results.
[0525] This system is designed to analyze voice information and assess the risk of power harassment in real time. A specific implementation is described below.
[0526] First, when a user speaks, the device acquires the audio information through its microphone. This audio information is then converted into text using speech recognition technology built into the device, such as general speech-to-text software. At this stage, high-quality microphones and hardware with noise-canceling capabilities are recommended to improve audio accuracy.
[0527] The converted text information is sent to the server in real time. The server receives this information and performs analysis using a natural language processing engine. The analysis utilizes language processing libraries for contextual understanding and syntactic analysis of the text information. For example, open-source and commercial natural language processing tools are commonly used.
[0528] As the next step in the analysis, the server utilizes an emotion analysis engine to extract the user's emotional state from the audio and text information. This emotion data is crucial for understanding the psychological aspects behind the statements. The emotion engine is trained using machine learning and employs data models to identify a variety of emotions.
[0529] The server integrates these analysis results and uses a learning model to assess the risk of power harassment. Because the learning model is trained on a large dataset based on past cases, it can provide accurate risk assessments.
[0530] When notifying a user of a risk, the server sends the assessment results to the terminal, and a warning message is displayed to the user. For example, the user might be advised, "Your statement may be misleading. Please maintain a calm tone."
[0531] As a concrete example, consider the behavior of a user who, during a meeting, asks, "Please tell me why you haven't met my expectations." In this case, the system would assess the possibility of the statement being misinterpreted as harassment and provide appropriate feedback. Possible prompts for the generative AI model could include instructions such as, "Perform a communication risk assessment based on text and audio data, and present the results."
[0532] This system allows users to receive a risk assessment that includes the emotional context of their statements, and to gain guidance for improving their communication style.
[0533] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0534] Step 1:
[0535] When a user speaks, the device uses its microphone to acquire audio information. Since this audio information cannot be analyzed directly, it is converted into text using speech recognition technology. The converted text is then stored digitally within the device, and the process proceeds to the next step.
[0536] Step 2:
[0537] The terminal sends the converted text information to the server. The server feeds the received text information into a natural language processing engine to analyze the sentence structure and context. The input here is text information, and the output is evaluation information, including the detection results of aggressive or condescending language. This data processing is performed through syntactic and semantic analysis of the text.
[0538] Step 3:
[0539] The server also receives audio information and uses an emotion analysis engine to extract the user's emotional state. In this step, raw audio data is used as input, and the user's emotional state information after analysis is output. Emotions are read from the tone and speed of the voice, and the evaluation is performed by integrating this with the results of text analysis.
[0540] Step 4:
[0541] The server integrates the results of text analysis and sentiment analysis, and uses a learning model to assess the risk of power harassment. The input consists of text and sentiment analysis results, and the output is a risk score. This score calculation utilizes a machine learning algorithm that has been refined based on similar past cases.
[0542] Step 5:
[0543] If the risk assessment results in a score exceeding a threshold, the server generates a warning message and sends it to the terminal. The terminal notifies the user of this message and displays specific feedback such as "Your statement may be misleading. Please maintain a calm tone," prompting the user to immediately improve their behavior.
[0544] This series of processes allows users to receive real-time feedback and recognize communication risks.
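As one hedged illustration of Steps 4 and 5 above, the text analysis result and the sentiment analysis result could be combined with a logistic function and compared with a threshold. The weights, bias, threshold, and function names below are assumptions made for this sketch; the learning model of the present disclosure is trained on past cases rather than configured by hand.

```python
import math
from typing import Optional

# Feedback wording quoted from the description.
WARNING = "Your statement may be misleading. Please maintain a calm tone."


def risk_score(
    text_score: float,
    emotion_score: float,
    w_text: float = 4.0,     # hypothetical weight of the text analysis result
    w_emotion: float = 2.0,  # hypothetical weight of the sentiment result
    bias: float = -3.0,      # hypothetical offset
) -> float:
    """Combine both inputs (each in [0, 1]) into a risk score in (0, 1)."""
    z = w_text * text_score + w_emotion * emotion_score + bias
    return 1.0 / (1.0 + math.exp(-z))


def feedback_for(
    text_score: float, emotion_score: float, threshold: float = 0.5
) -> Optional[str]:
    """Return the warning message when the risk score exceeds the threshold."""
    if risk_score(text_score, emotion_score) > threshold:
        return WARNING
    return None
```

With these assumed weights, the same wording yields a higher score when the emotion score is higher, which mirrors the integration described in Step 3.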
[0545] (Application Example 2)
[0546] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0547] In workplaces, educational institutions, and other settings, there is a growing need to prevent harassment and create a healthy communication environment. However, traditional methods have struggled to evaluate not only the content of statements but also background information such as emotions. In particular, many cases of harassment stem from minor misunderstandings or heightened emotions, and a system is needed that can recognize this in real time and warn users.
[0548] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0549] In this invention, the server includes means for acquiring sound data and converting said sound data into document data; means for analyzing said document data to detect intimidating or aggressive language; means for evaluating said document data using a learning model that evaluates the possibility of harassment for the language being spoken; means for recognizing emotional states from the sound data; and means for modifying the harassment evaluation based on the emotional states. This makes it possible to comprehensively evaluate the possibility of harassment from both emotional and linguistic aspects and provide warnings to users at an appropriate time.
[0550] "Sound data" refers to recordings of speech or sounds as digital information, primarily collected through devices such as microphones.
[0551] "Document data" refers to audio data converted into text, in a format that can be analyzed and evaluated through natural language processing.
[0552] "Intimidating or aggressive language" refers to language that includes expressions that exert physical or psychological pressure or aggression on others, and its detection is important in assessing the possibility of harassment.
[0553] A "learning model" is an algorithm or framework that is trained using historical data to perform a specific task, such as assessing the risk of harassment.
[0554] "Emotional state" refers to an individual's emotional state as analyzed from sound data, and is expressed in categories such as joy, anger, and sadness.
[0555] "Means of modifying harassment assessments" refer to processes and methods for precisely adjusting harassment risk assessments based on verbal analysis results, taking into account emotional states.
[0556] A "warning" is a message or notification that alerts users and is provided when a high risk of harassment is identified.
[0557] In order to implement this invention, both the server and the terminal must work together. To acquire the user's voice, the terminal is equipped with a voice input device (microphone) and collects sound data in real time. The collected sound data is converted into document data using speech recognition software on the terminal (e.g., Google's speech recognition API).
[0558] Subsequently, the document data is sent to a server, which uses a natural language processing engine to detect intimidating or aggressive language. Furthermore, emotion recognition software runs on the server, recognizing the user's emotional state from the audio data. This emotional state assessment is fed into a learning model and used to improve the accuracy of harassment risk assessments.
[0559] The server uses a learning model trained on past harassment cases to assess the overall risk of harassment. Based on this assessment, it displays warnings and alerts on the user's device in an easy-to-understand format. This process is continuous, enabling real-time feedback.
[0560] As a concrete example, consider a scenario where a user in a meeting says, "Why hasn't this problem been solved yet?" Speech recognition software converts the statement into document data, and a server analyzes its contents. Simultaneously, if frustration is detected in the audio data, emotion recognition software can increase the risk score and send an appropriate alert to the user.
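The means for modifying the harassment evaluation based on the emotional state might, for example, apply a multiplier per emotion category, as in the sketch below. The categories follow those named in this application example (joy, anger, sadness). The multiplier values, the confidence blending, and the names `EMOTION_MULTIPLIERS` and `modify_assessment` are hypothetical.

```python
# Hypothetical multipliers; values above 1.0 raise the risk score.
EMOTION_MULTIPLIERS = {"anger": 1.3, "sadness": 1.1, "joy": 0.9}


def modify_assessment(base_score: float, emotion: str, confidence: float = 1.0) -> float:
    """Adjust a language-based risk score according to the recognized emotion.

    The multiplier is blended toward 1.0 when the emotion recognizer is less
    confident, and the result is clipped to the range [0, 1].
    """
    multiplier = EMOTION_MULTIPLIERS.get(emotion, 1.0)
    effective = 1.0 + (multiplier - 1.0) * confidence
    return max(0.0, min(1.0, base_score * effective))
```

An emotion category that is not listed leaves the language-based score unchanged in this sketch.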
[0561] An example of a prompt using a generative AI model is: "Analyze the given audio and text data for intimidating language and emotions, and assess the likelihood of harassment."
[0562] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0563] Step 1:
[0564] The device acquires the user's voice as input through the microphone. The user's speech, collected as audio data, is passed to speech recognition software within the device. This converts the audio data into document data, which is then output as text data.
[0565] Step 2:
[0566] The terminal sends the document data obtained in step 1 to the server. The server uses a natural language processing engine to analyze this document data. It detects offensive language from the input text and outputs the results as evaluation data.
[0567] Step 3:
[0568] The server simultaneously inputs sound data into emotion recognition software to evaluate the user's emotional state. It outputs an emotion score, which quantifies the emotional state. This score is used later for risk assessment.
[0569] Step 4:
[0570] The server combines the evaluation data and sentiment scores obtained from steps 2 and 3 and supplies them to the learning model. The generative AI model uses this data to assess the risk of harassment and outputs a risk score.
[0571] Step 5:
[0572] If the risk score calculated in step 4 exceeds a pre-set threshold, the server creates a prompt and sends an alert message to the terminal. The terminal notifies the user of the warning visually or audibly.
[0573] The specific processing unit 290 transmits the result of the specific processing to the headset terminal 314. In the headset terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0574] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 is input with prompts containing instructions, and with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 infers from the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and / or summarization.
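The pairing of a prompt with inference data described above can be pictured as a simple request structure. The class `InferenceRequest`, its field names, and the example values are assumptions made for this sketch and do not correspond to the interface of any particular generative AI service.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class InferenceRequest:
    prompt: str                    # instructions for the model
    audio: Optional[bytes] = None  # audio data representing speech
    text: Optional[str] = None     # text data representing text
    image: Optional[bytes] = None  # image data representing images

    def modalities(self) -> List[str]:
        """List which kinds of inference data accompany the prompt."""
        present = []
        if self.audio is not None:
            present.append("audio")
        if self.text is not None:
            present.append("text")
        if self.image is not None:
            present.append("image")
        return present


# Hypothetical example: a text-only request using a prompt from the description.
request = InferenceRequest(
    prompt="Analyze the given audio and text data for intimidating language "
    "and emotions, and assess the likelihood of harassment.",
    text="Why hasn't this problem been solved yet?",
)
```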
[0575] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset terminal 314.
[0576] [Fourth Embodiment]
[0577] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.
[0578] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.
[0579] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0580] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.
[0581] The microphone 238 receives instructions and the like from the user 20 by picking up the user's voice. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.
[0582] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0583] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0584] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.
[0585] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0586] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0587] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0588] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0589] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0590] This invention is a system that detects power harassment in real time and notifies the user, and is implemented on terminals such as smartphones and PCs. First, the terminal acquires audio data through a microphone and converts this audio data into text information using speech recognition technology. This converted text information, i.e., text data, is in a form that can be quickly analyzed. The terminal sends this text data to a server, and the server performs analysis on the received text data.
[0591] The server uses a natural language processing (NLP) engine to analyze text data. This engine classifies and identifies each word and phrase, taking into account context and grammatical structure. Next, the server evaluates the analysis results using a machine learning model. This model is trained on past data related to power harassment and has the ability to determine whether a particular context or wording constitutes power harassment.
[0592] When the server detects signs of a problem, it calculates a specific risk score. If this score exceeds a set threshold, the server issues an alert to the user. This alert is displayed on the user's screen via the terminal and may also be output as an audio notification. Specifically, the user may see a message such as, "This may be considered aggressive behavior. Please be careful with your words and actions."
[0593] Furthermore, after receiving this notification, users can provide feedback on the timing and content of the notification, allowing the system to continuously improve its accuracy. This entire process takes place in real time, helping users adjust their behavior immediately.
[0594] As a concrete example, consider a situation where a user says during a meeting, "It's your fault this project is progressing so slowly." This phrase is likely to be analyzed by the server and deemed to be aggressive. The server can then send an alert to the user's terminal, prompting them to reconsider their statement. Through this process, the speaker is given an opportunity to reflect on their words and actions, contributing to an improved work environment.
[0595] The following describes the processing flow.
[0596] Step 1:
[0597] The terminal acquires audio data. When the user launches the application and enables the recording function, the terminal's microphone captures ambient sound and records it as audio data.
[0598] Step 2:
[0599] The terminal uses speech recognition technology to convert the acquired audio data into text data. The spoken words are transcribed into text so that they can be analyzed.
[0600] Step 3:
[0601] The terminal sends the converted text data to the server. This transmission occurs in real time and is used as data necessary for analysis on the server.
[0602] Step 4:
[0603] The server receives text data and performs analysis using a natural language processing (NLP) engine. Here, the server tokenizes the text, determines the meaning of each word, and analyzes the context.
[0604] Step 5:
[0605] The server evaluates the possibility of power harassment based on the analysis results. Using a machine learning model, it calculates a risk score for high-handed or aggressive behavior.
[0606] Step 6:
[0607] If the risk score exceeds a set threshold, the server generates an alert message. This message is intended to draw the user's attention to the statement.
[0608] Step 7:
[0609] The terminal receives the alert message sent from the server and notifies the user. The notification is delivered as an on-screen pop-up or as audio, giving the user prompt feedback.
[0610] Step 8:
[0611] Users can review alerts and modify their statements and attitudes as needed. As a result, users can become more aware of their own words and actions.
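The server-side portion of the flow above (Steps 4 to 6) can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the names `Alert`, `process_utterance`, and `toy_score`, and the threshold value 0.7, do not appear in the disclosure, and `toy_score` is a placeholder standing in for the trained machine learning model.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Alert wording taken from the example message in the description.
ALERT_TEXT = (
    "This may be considered aggressive behavior. "
    "Please be careful with your words and actions."
)


@dataclass
class Alert:
    risk_score: float
    message: str


def process_utterance(
    text: str,
    score_fn: Callable[[str], float],
    threshold: float = 0.7,  # assumed value; the disclosure only says "a set threshold"
) -> Optional[Alert]:
    """Score the text and return an alert only if the score exceeds the threshold."""
    risk = score_fn(text)
    if risk > threshold:
        return Alert(risk_score=risk, message=ALERT_TEXT)
    return None


def toy_score(text: str) -> float:
    """Placeholder for the trained model: flags a single blaming phrase."""
    return 0.9 if "your fault" in text.lower() else 0.1


if __name__ == "__main__":
    alert = process_utterance(
        "It's your fault this project is progressing so slowly.", toy_score
    )
    print(alert.message if alert else "no alert")
```

Passing the scorer in as a function keeps the threshold logic independent of whichever model is actually used.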
[0612] (Example 1)
[0613] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0614] In the workplace, a challenge is to prevent friction and discord caused by inappropriate language and behavior, and to provide a healthier communication environment. Conventional systems have had difficulty detecting inappropriate behavior in real time and responding immediately, thus limiting their effectiveness in situations where immediate action is required.
[0615] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0616] In this invention, the server includes means for acquiring an acoustic signal and converting the acoustic signal into text information, means for analyzing the text information to detect aggressive or condescending language expressions, and means for evaluating the text information using a machine learning model that evaluates the possibility of inappropriate behavior in the workplace in relation to the ongoing language expression. This makes it possible to detect inappropriate remarks in the workplace in real time and quickly issue warnings to users.
[0617] An "acoustic signal" is a recording of ambient sound as an electrical signal, and is primarily used as audio data.
[0618] "Text information" refers to text data converted from acoustic signals, and serves as fundamental data for language analysis.
[0619] "Language expression" refers to the words and phrases contained in text information, which convey specific meanings and emotions.
[0620] A "machine learning model" is an algorithm that learns from past data and is used to recognize specific patterns or rules.
[0621] "Inappropriate behavior" refers to verbal expressions or attitudes that worsen the workplace environment, and includes things like power harassment.
[0622] "Real-time" means that information processing is performed immediately at the moment it occurs, and results are provided without delay.
[0623] A specific description will be given regarding embodiments for carrying out this invention.
[0624] This system consists of a user's terminal and a server. The terminal is, for example, a smartphone or personal computer, and acquires ambient acoustic signals through its built-in microphone. The terminal then uses speech recognition software to convert the acquired acoustic signals into text information. Examples of such software include Google Cloud Speech-to-Text and similar APIs.
[0625] The text information converted by the terminal is sent to the server in a secure manner. The server analyzes this text information using a natural language processing engine. Natural language processing libraries such as spaCy and NLTK are used for the analysis. The server applies a machine learning model to determine whether the language expressions contained in the text information are domineering or aggressive. Because this model is trained on a dataset of past inappropriate behavior, it can perform evaluations that are relevant to real-world workplace environments.
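As a rough illustration of the analysis described in [0625], the following Python sketch tokenizes text and sums weights from a small lexicon to produce an aggressiveness score between 0 and 1. The lexicon `AGGRESSIVE_TERMS`, its weights, and the function names are hypothetical. The disclosure names NLP libraries and a trained model, which this standard-library stand-in does not attempt to reproduce.

```python
import re
from typing import Dict, List

# Hypothetical lexicon: token -> weight. A real system would use a trained model.
AGGRESSIVE_TERMS: Dict[str, float] = {
    "fault": 0.6,
    "useless": 0.9,
    "stupid": 0.9,
}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens (apostrophes kept)."""
    return re.findall(r"[a-z']+", text.lower())


def aggressiveness(text: str) -> float:
    """Sum the lexicon weights of all tokens, capped at 1.0."""
    total = sum(AGGRESSIVE_TERMS.get(token, 0.0) for token in tokenize(text))
    return min(1.0, total)


if __name__ == "__main__":
    print(aggressiveness("It's your fault this project is progressing slowly"))
```

A lexicon lookup ignores context and grammar, so it can only approximate the contextual analysis the description calls for.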
[0626] Once the analysis is complete, if certain linguistic expressions are deemed inappropriate, the server will notify the user via their terminal. The user can then review their statements after receiving this warning, thus preventing potential disagreements and conflicts.
[0627] For example, if a user says "It's your fault this project is progressing slowly" during a meeting, this phrase is likely to be considered aggressive, and the server will issue a warning. Based on this warning, the user can reflect on their statement and correct it if necessary.
[0628] Furthermore, as part of improving the system's accuracy, users can submit feedback on warnings. This feedback is reflected in the server's machine learning model, contributing to improved overall system reliability.
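One way the feedback described in [0628] could be collected for later retraining is sketched below in Python. The classes `Feedback` and `FeedbackStore`, their fields, and the labeling scheme are assumptions made for illustration. The disclosure does not specify how feedback is stored or reflected in the model.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Feedback:
    text: str          # the statement that triggered the warning
    risk_score: float  # the score the server assigned
    user_agrees: bool  # True if the user judged the warning appropriate


@dataclass
class FeedbackStore:
    items: List[Feedback] = field(default_factory=list)

    def add(self, feedback: Feedback) -> None:
        self.items.append(feedback)

    def training_examples(self) -> List[Tuple[str, int]]:
        """Label 1 = confirmed inappropriate, 0 = reported false positive."""
        return [(fb.text, 1 if fb.user_agrees else 0) for fb in self.items]

    def false_positive_rate(self) -> float:
        """Share of warnings that users rejected (0.0 if no feedback yet)."""
        if not self.items:
            return 0.0
        rejected = sum(1 for fb in self.items if not fb.user_agrees)
        return rejected / len(self.items)


if __name__ == "__main__":
    store = FeedbackStore()
    store.add(Feedback("It's your fault this project is progressing slowly", 0.9, True))
    store.add(Feedback("Whose fault was the outage?", 0.8, False))
    print(store.training_examples(), store.false_positive_rate())
```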
[0629] An example of a prompt for a generative AI model might be, "Please provide wording that points out that the instructions from your supervisor are unclear and may be causing misunderstandings."
[0630] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0631] Step 1:
[0632] The device uses a microphone to acquire ambient acoustic signals. The input consists of ambient sounds and conversations, which the device captures as digital audio data. This data is stored as a preliminary step before being processed by speech recognition software. Specifically, pre-processing such as adjusting the volume level and noise cancellation is performed.
[0633] Step 2:
[0634] The device converts acquired acoustic signals into text information using speech recognition software. The input is digital audio data, and the output is corresponding text data. Using APIs such as Google Cloud Speech-to-Text, the conversion from audio to text information is performed according to a language model. In this process, silent portions of the audio data are omitted, and important spoken parts are efficiently converted.
[0635] Step 3:
[0636] The terminal sends the converted text information to the server. The input is the generated text data, and the output is the same data as received by the server. Sending the data over an encrypted connection ensures data security while delivering it to the server with low latency.
[0637] Step 4:
[0638] The server analyzes the received text data using a natural language processing engine. The input is text data, and the output is analyzed data that includes linguistic features and contextual information. Libraries such as spaCy and NLTK are used to perform grammatical structure analysis and keyword extraction. Sentence tone analysis is also performed here, quantifying the degree of aggressiveness.
[0639] Step 5:
[0640] The server uses a machine learning model to assess the likelihood of inappropriate behavior based on the analyzed data. The input is the analyzed data, and the output is the assessment result with a risk score. A model trained on historical datasets is used to calculate the probability that a particular linguistic expression poses a risk.
[0641] Step 6:
[0642] If the risk score exceeds the threshold, the server sends an alert to the terminal. The input is the evaluation result, and the output is a warning message to the user. This alert is displayed as a pop-up on the terminal, and an audio notification is given if necessary. A specific warning message might be, "Your remarks were judged to be aggressive. Please reconsider your words and actions."
[0643] Step 7:
[0644] Users can provide feedback on the warnings they receive. Input consists of user ratings and opinions, while output is stored as training data for the system. This allows the machine learning model to continuously improve, helping to prevent false positives and enhance accuracy.
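Steps 1 and 2 above mention volume adjustment, noise cancellation, and omission of silent portions before recognition. The Python sketch below shows simplified versions of these operations on a list of samples in the range -1.0 to 1.0. The function names, the target peak, the noise floor, and the frame size are all assumed values, and the noise gate is a crude stand-in for real noise cancellation.

```python
from typing import List


def normalize_peak(samples: List[float], target_peak: float = 0.9) -> List[float]:
    """Scale the samples so that the loudest one reaches target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # all-silent input: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]


def noise_gate(samples: List[float], floor: float = 0.02) -> List[float]:
    """Zero out samples whose magnitude is below the noise floor."""
    return [s if abs(s) >= floor else 0.0 for s in samples]


def drop_silent_frames(
    samples: List[float], frame_size: int = 4, floor: float = 0.02
) -> List[float]:
    """Keep only frames that contain at least one sample above the floor."""
    kept: List[float] = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        if any(abs(s) >= floor for s in frame):
            kept.extend(frame)
    return kept


if __name__ == "__main__":
    raw = [0.0, 0.01, 0.0, 0.0, 0.1, -0.5, 0.25, 0.0]
    print(drop_silent_frames(noise_gate(normalize_peak(raw))))
```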
[0645] (Application Example 1)
[0646] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0647] In modern living environments and workplaces, the likelihood of power harassment occurring due to communication issues is increasing. In particular, there is a need for effective means to immediately detect aggressive or condescending language during real-time conversations and to encourage improved communication. However, conventional systems lack sufficient real-time detection and notification capabilities to address this problem.
[0648] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0649] In this invention, the server includes means for acquiring audio information and converting said audio information into text information; means for analyzing said text information to detect high-pressure or aggressive expressions; means for evaluating said text information using a learning model that assesses the possibility of power harassment for ongoing expressions; and means for an automated device that monitors conversations in a living environment and alerts the user if a problem is detected. This enables real-time detection of power harassment and rapid feedback to the user.
[0650] "Audio information" refers to sound data acquired through devices such as microphones.
[0651] "Text information" refers to a data format in which audio information is converted into text.
[0652] "Expression" refers to the words and phrases used to convey meaning in language.
[0653] A "learning model" is an algorithm that is trained on past data and has the ability to recognize specific patterns and trends.
[0654] "Power harassment" refers to overbearing or aggressive behavior or remarks that stem from an imbalance of power in the workplace or living environment.
[0655] An "automated device" refers to a mechanical or electronic device designed to perform a specific task or function without human intervention.
[0656] "User" refers to an individual or group whose purpose is to use this system.
[0657] This invention relates to a system for detecting intimidating or aggressive behavior in a living environment in real time and for evaluating the possibility of power harassment. The system uses an automated device equipped with a microphone, display, and speaker as hardware, as well as a server and related software.
[0658] First, audio information is acquired through the microphone. The device uses the Google Speech-to-Text API, a speech recognition engine, to convert this audio information into text in real time. The text information thus converted is then sent from the device to the server.
[0659] On the server, text information is analyzed using a natural language processing engine based on spaCy. To determine whether words and phrases are aggressive, the ongoing expression is evaluated by a learning model, specifically a machine learning algorithm using TensorFlow. This machine learning model is trained based on past cases of power harassment.
[0660] If the server determines that certain expressions are aggressive or offensive, the automated device will warn the user through a display or speaker. This warning allows for quick feedback and encourages the user to reconsider their words and actions immediately.
[0661] For example, if an automated device detects the phrase "You're not doing well at all" during a conversation at home, it will determine that this is an overbearing expression and offer a user-friendly suggestion such as "Try to think of a gentler way to say it."
[0662] An example of an input prompt for a generative AI model is: "Analyze the following conversation and determine if it may constitute harassment. Consider the context and tone. Conversation: 'Conversation content'."
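The input prompt in [0662] contains a placeholder for the conversation content. A minimal Python sketch of filling that placeholder is shown below. The function name `build_prompt` and the whitespace normalization are assumptions, and the sketch only builds the prompt string without calling any generative AI model.

```python
# Template wording taken from the example prompt in the description.
PROMPT_TEMPLATE = (
    "Analyze the following conversation and determine if it may constitute "
    "harassment. Consider the context and tone. Conversation: '{conversation}'"
)


def build_prompt(conversation: str) -> str:
    """Insert the transcribed conversation into the prompt template."""
    cleaned = " ".join(conversation.split())  # collapse newlines and repeated spaces
    if not cleaned:
        raise ValueError("conversation must not be empty")
    return PROMPT_TEMPLATE.format(conversation=cleaned)


if __name__ == "__main__":
    print(build_prompt("You're not doing\nwell at all"))
```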
[0663] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0664] Step 1:
[0665] The device acquires audio information through its microphone. The input is ambient sound, which is captured as an analog audio signal. The device converts this audio signal into digital data and stores it in an appropriate format.
[0666] Step 2:
[0667] The device uses the Google Speech-to-Text API to convert digital audio data into text. The input is audio data, and the output is text data. This process uses speech recognition technology to transcribe the content of the audio data into text with high accuracy.
[0668] Step 3:
[0669] Text information is sent from the terminal to the server. The server receives this text information and performs analysis using spaCy, a natural language processing engine. The input is text data, and the output is the analysis result, which includes language structure and contextual information. The server evaluates the grammar and word meanings.
[0670] Step 4:
[0671] The server evaluates the analyzed data using a machine learning model (using TensorFlow). The model is trained on past cases of power harassment; the input is the analysis results, and the output is a risk score indicating the likelihood of power harassment. This process scores whether specific phrases or wording are aggressive or intimidating.
[0672] Step 5:
[0673] The server issues a warning to the user if the risk score exceeds a set threshold. This warning is output via the terminal's display and speaker. The input to this step is the risk score, and the output is the warning message. This allows the user to immediately review their own words and actions.
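The exchange between terminal and server in Steps 3 and 5 above could use messages along the following lines. This Python sketch is an assumption about the wire format: the disclosure does not define one, and the JSON field names (`session_id`, `text`, `risk_score`, `message`), the class names, and the threshold value are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, Optional


@dataclass
class TextMessage:       # terminal -> server
    session_id: str
    text: str


@dataclass
class WarningMessage:    # server -> terminal
    session_id: str
    risk_score: float
    message: str


def encode(msg) -> str:
    """Serialize a message dataclass to a JSON string."""
    return json.dumps(asdict(msg), ensure_ascii=False)


def decode_text_message(payload: str) -> TextMessage:
    """Parse a JSON payload sent by the terminal."""
    return TextMessage(**json.loads(payload))


def handle(
    payload: str, score_fn: Callable[[str], float], threshold: float = 0.7
) -> Optional[str]:
    """Return an encoded warning if the score exceeds the threshold, else None."""
    msg = decode_text_message(payload)
    risk = score_fn(msg.text)
    if risk > threshold:
        warning = WarningMessage(
            msg.session_id, risk, "Try to think of a gentler way to say it."
        )
        return encode(warning)
    return None


if __name__ == "__main__":
    request = encode(TextMessage("s1", "You're not doing well at all"))
    print(handle(request, lambda text: 0.9))
```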
[0674] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.
[0675] This invention is a system that evaluates the possibility of power harassment in real time based on voice and text data and warns the user, further improving accuracy by incorporating an emotion engine. The terminal first records the user's voice through a microphone and converts it into text data using speech recognition technology. The resulting text data is immediately sent to the server.
[0676] The server analyzes text data using a natural language processing (NLP) engine, evaluating the possibility of harassment by considering the language structure and context. Furthermore, an emotion engine recognizes the user's emotions from the audio and text data and extracts their emotional state. The results of the emotion engine's analysis influence the evaluation of harassment, so that situations where the wording itself is not aggressive but the user is in a negative emotional state are also taken into account.
[0677] The machine learning model is trained on past cases of power harassment, and uses the analysis results and the emotion assessment to numerically evaluate the risk of the remarks. Based on this evaluation, the server generates alerts and notifies the user when necessary. The notifications are delivered via the terminal and provide information to help the user take appropriate action.
[0678] As a concrete example, consider a situation where a user says during a meeting, "I want to know why you're not meeting my expectations." When this statement is made, the server performs linguistic analysis, and simultaneously, the emotion engine recognizes the user's emotions from their tone and speed of voice. If the system determines that the user is feeling irritated or dissatisfied, the overall risk score may increase. As a result, the device will issue an alert to the user, such as, "Your statement may be misleading. Please maintain a calm tone."
[0679] By combining this with an emotion engine, it becomes possible to detect power harassment in a sophisticated way that considers not only the content of the language but also the emotions behind it. As a result, users can gain a deeper understanding and create a better communication environment.
[0680] The following describes the processing flow.
[0681] Step 1:
[0682] The terminal acquires audio data. The user activates the application's recording function and records conversations and voices, including ambient noise. This ensures that audio data is available for later analysis.
[0683] Step 2:
[0684] The terminal converts the acquired audio data into text data using speech recognition technology. Once the voice input is converted to text format, it is ready for natural language processing.
[0685] Step 3:
[0686] The terminal sends the converted text data to the server. The server then receives the data necessary to analyze the audio content. Offloading the analysis to the server keeps the processing load on the terminal low.
[0687] Step 4:
[0688] The server analyzes the text data using a natural language processing engine. Here, it tokenizes the text, interprets the words in context, and determines whether the text contains aggressive or condescending language.
[0689] Step 5:
[0690] The server uses an emotion engine to analyze the user's voice characteristics (tone, speed, and volume) and recognize the user's emotional state. This allows for a quantitative measurement of their emotions at that moment.
[0691] Step 6:
[0692] The server uses a machine learning model to integrate text data with emotional state assessments and calculate a risk score for power harassment. This score is then refined by a model trained on past case data.
[0693] Step 7:
[0694] If the risk score exceeds a set threshold, the server generates an alert for the user. The alert includes specific suggestions for improvement and cautionary messages.
[0695] Step 8:
[0696] The terminal notifies the user of alerts sent from the server. Notifications are delivered through on-screen pop-ups and audio alerts, providing the user with the information they need immediately.
[0697] Step 9:
[0698] Users review the notification content and modify their conversations as needed. This feedback loop allows users to develop an awareness of improving their words and actions.
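Step 6 above integrates the text analysis with the emotional state assessment. One simple way to do this is a weighted average, sketched below in Python. The weighting scheme, the weight value 0.3, and the function name `fuse_risk` are assumptions. The disclosure leaves the integration to a trained machine learning model.

```python
def fuse_risk(
    text_score: float,
    emotion_negativity: float,
    emotion_weight: float = 0.3,  # assumed weight; not specified in the disclosure
) -> float:
    """Combine a text-based risk score with a negative-emotion score.

    All inputs are expected in the range [0, 1]; the result is also in [0, 1].
    """
    for name, value in (
        ("text_score", text_score),
        ("emotion_negativity", emotion_negativity),
        ("emotion_weight", emotion_weight),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return (1.0 - emotion_weight) * text_score + emotion_weight * emotion_negativity


if __name__ == "__main__":
    # Mildly worded statement delivered in an irritated voice.
    print(fuse_risk(text_score=0.2, emotion_negativity=1.0))
```

With a weighted average, a statement whose wording scores low can still receive a raised overall score when the voice indicates irritation, which matches the behavior described in [0676].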
[0699] (Example 2)
[0700] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0701] In today's communication environment, there is a need to quickly detect remarks that may unintentionally be perceived as aggressive or condescending, and to provide appropriate feedback. However, conventional systems primarily analyze only the content of the language, without conducting detailed evaluations based on the underlying emotions and context. As a result, there is a challenge in that it is difficult for users to immediately understand and improve their own communication style.
[0702] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0703] In this invention, the server includes means for analyzing voice and text information to detect aggressive or condescending language, means for evaluating language information using a learning model to determine the possibility of power harassment, and means for performing emotion analysis and incorporating the results into the evaluation. This makes it possible to provide real-time feedback to the user and promote a deeper understanding of communication.
[0704] "Voice information" refers to data related to human voices acquired through input devices such as microphones.
[0705] "Text information" refers to data that is represented as text after voice information has been converted.
[0706] A "learning model" is a mathematical model that performs pattern recognition and classification based on past data and examples.
[0707] "Analysis" is the process of examining data to understand its content.
[0708] "Evaluation" is the process of determining the value and risks of something based on analyzed data.
[0709] "Emotion analysis" is a technology that infers and extracts a person's emotional state from information expressed as voice or text.
[0710] "Notification" refers to the act of providing information to inform users of the evaluation results.
[0711] This system is designed to analyze voice information and assess the risk of power harassment in real time. A specific implementation is described below.
[0712] First, when a user speaks, the terminal acquires the audio information through its microphone. This audio information is then converted into text using speech recognition technology built into the terminal, such as general speech-to-text software. At this stage, high-quality microphones and hardware with noise-canceling capabilities are recommended to improve recognition accuracy.
[0713] The converted text information is sent to the server in real time. The server receives this information and performs analysis using a natural language processing engine. The analysis utilizes language processing libraries for contextual understanding and syntactic analysis of the text information. For example, open-source and commercial natural language processing tools are commonly used.
[0714] As the next step in the analysis, the server utilizes an emotion analysis engine to extract the user's emotional state from the audio and text information. This emotion data is crucial for understanding the psychological aspects behind the statements. The emotion engine is trained using machine learning and employs data models to identify a variety of emotions.
[0715] The server integrates these analysis results and uses a learning model to assess the risk of power harassment. Because the learning model is trained on a large dataset based on past cases, it can provide accurate risk assessments.
[0716] When notifying a user of a risk, the server sends the assessment results to the terminal, and a warning message is displayed to the user. For example, the user might be advised, "Your statement may be misleading. Please maintain a calm tone."
[0717] As a concrete example, consider the behavior of a user who, during a meeting, asks, "Please tell me why you haven't met my expectations." In this case, the system would assess the possibility of the statement being misinterpreted as harassment and provide appropriate feedback. Possible prompts for the generative AI model could include instructions such as, "Perform a communication risk assessment based on text and audio data, and present the results."
[0718] This system allows users to receive a risk assessment that includes the emotional context of their statements, and to gain guidance for improving their communication style.
[0719] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0720] Step 1:
[0721] When a user speaks, the terminal uses its microphone to acquire audio information. Since this audio information cannot be analyzed directly as language, it is converted into text using speech recognition technology. The converted text is then stored digitally within the terminal, and the process proceeds to the next step.
[0722] Step 2:
[0723] The terminal sends the converted text information to the server. The server feeds the received text information into a natural language processing engine to analyze the sentence structure and context. The input here is text information, and the output is evaluation information, including the detection results of aggressive or condescending language. This data processing is performed through syntactic and semantic analysis of the text.
[0724] Step 3:
[0725] The server also receives audio information and uses an emotion analysis engine to extract the user's emotional state. In this step, raw audio data is used as input, and the user's emotional state information after analysis is output. Emotions are read from the tone and speed of the voice, and the evaluation is performed by integrating this with the results of text analysis.
[0726] Step 4:
[0727] The server integrates the results of text analysis and emotion analysis, and uses a learning model to assess the risk of power harassment. The input consists of the text analysis and emotion analysis results, and the output is a risk score. This score calculation utilizes a machine learning algorithm that has been refined based on similar past cases.
[0728] Step 5:
[0729] If the risk assessment results in a score exceeding a threshold, the server generates a warning message and sends it to the terminal. The terminal notifies the user of this message and displays specific feedback such as "Your statement may be misleading. Please maintain a calm tone," prompting the user to immediately improve their behavior.
[0730] This series of processes allows users to receive real-time feedback and recognize communication risks.
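Step 3 above reads emotion from the tone and speed of the voice. The Python sketch below illustrates one hypothetical way to turn such measurements into an irritation score by comparing them with a per-user baseline. The feature names, the baseline values, the scaling spans, and the averaging rule are all assumptions, and the disclosure's emotion analysis engine is a trained model rather than a fixed rule like this.

```python
from dataclasses import dataclass


@dataclass
class VoiceFeatures:
    pitch_hz: float          # average pitch, used here as a proxy for tone
    words_per_second: float  # speaking speed


def _excess(value: float, baseline: float, span: float) -> float:
    """How far value exceeds baseline, scaled by span and clamped to [0, 1]."""
    return max(0.0, min(1.0, (value - baseline) / span))


def irritation_score(current: VoiceFeatures, baseline: VoiceFeatures) -> float:
    """Average the clamped excesses of pitch and speed over the baseline."""
    parts = [
        _excess(current.pitch_hz, baseline.pitch_hz, 100.0),                # assumed span
        _excess(current.words_per_second, baseline.words_per_second, 2.0),  # assumed span
    ]
    return sum(parts) / len(parts)


if __name__ == "__main__":
    calm = VoiceFeatures(pitch_hz=150.0, words_per_second=2.5)
    tense = VoiceFeatures(pitch_hz=200.0, words_per_second=3.5)
    print(irritation_score(tense, calm))
```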
[0731] (Application Example 2)
[0732] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0733] In workplaces, educational institutions, and other settings, there is a growing need to prevent harassment and create a healthy communication environment. However, traditional methods have struggled to evaluate not only the content of statements but also background information such as emotions. In particular, many cases of harassment stem from minor misunderstandings or heightened emotions, and a system is needed that can recognize this in real time and warn users.
[0734] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0735] In this invention, the server includes means for acquiring sound data and converting said sound data into document data; means for analyzing said document data to detect intimidating or aggressive language; means for evaluating said document data using a learning model that evaluates the possibility of harassment for the language being spoken; means for recognizing emotional states from the sound data; and means for modifying the harassment evaluation based on the emotional states. This makes it possible to comprehensively evaluate the possibility of harassment from both emotional and linguistic aspects and provide warnings to users at an appropriate time.
[0736] "Sound data" refers to recordings of speech or sounds as digital information, primarily collected through devices such as microphones.
[0737] "Document data" refers to sound data converted into text, in a format that can be analyzed and evaluated through natural language processing.
[0738] "Intimidating or aggressive language" refers to language that includes expressions that exert physical or psychological pressure or aggression on others, and its detection is important in assessing the possibility of harassment.
[0739] A "learning model" is an algorithm or framework that is trained using historical data to perform a specific task, such as assessing the risk of harassment.
[0740] "Emotional state" refers to an individual's emotions as inferred from sound data, and is expressed in categories such as joy, anger, and sadness.
[0741] "Means for modifying the harassment evaluation" refers to processes and methods that adjust the harassment risk evaluation obtained from the linguistic analysis by taking the emotional state into account.
[0742] A "warning" is a message or notification that alerts users and is provided when a high risk of harassment is identified.
[0743] In order to implement this invention, both the server and the terminal must work together. To acquire the user's voice, the terminal is equipped with a voice input device (microphone) and collects sound data in real time. The collected sound data is converted into document data using speech recognition software on the terminal (e.g., Google's speech recognition API).
[0744] Subsequently, the document data is sent to a server, which uses a natural language processing engine to detect intimidating or aggressive language. Furthermore, emotion recognition software runs on the server, recognizing the user's emotional state from the audio data. This emotional state assessment is fed into a learning model and used to improve the accuracy of harassment risk assessments.
[0745] The server uses a learning model trained on past harassment cases to assess the overall risk of harassment. Based on this assessment, it displays warnings and alerts on the user's device in an easy-to-understand format. This process is continuous, enabling real-time feedback.
[0746] As a concrete example, consider a scenario where a user in a meeting says, "Why hasn't this problem been solved yet?" Speech recognition software converts the statement into document data, and a server analyzes its contents. Simultaneously, if frustration is detected in the audio data, emotion recognition software can increase the risk score and send an appropriate alert to the user.
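The adjustment described in this example can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the `EmotionState` type, the `EMOTION_WEIGHTS` table, the multiplicative update rule, and all numeric values are assumptions introduced here for clarity.

```python
from dataclasses import dataclass

# Hypothetical weights: how strongly each recognized emotion changes the risk.
# These values are illustrative assumptions, not taken from the disclosure.
EMOTION_WEIGHTS = {"anger": 0.5, "frustration": 0.3, "sadness": 0.0, "joy": -0.2}


@dataclass
class EmotionState:
    label: str        # e.g. "anger", "joy", "sadness"
    intensity: float  # 0.0 (absent) to 1.0 (very strong)


def adjust_risk(linguistic_risk: float, emotion: EmotionState) -> float:
    """Modify a text-based harassment risk (0..1) using the emotional state."""
    weight = EMOTION_WEIGHTS.get(emotion.label, 0.0)
    adjusted = linguistic_risk * (1.0 + weight * emotion.intensity)
    # Clamp so the result stays a valid score in [0, 1].
    return max(0.0, min(1.0, adjusted))


# Assumed linguistic risk for "Why hasn't this problem been solved yet?"
base = 0.5
calm = adjust_risk(base, EmotionState("joy", 0.0))
frustrated = adjust_risk(base, EmotionState("frustration", 1.0))
print(calm, frustrated)
```

With these assumed weights, the same sentence receives a higher score when frustration is detected than when it is spoken calmly, which is the behavior the example describes.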
[0747] An example of a prompt using a generative AI model is: "Analyze the given audio and text data for intimidating language and emotions, and assess the likelihood of harassment."
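As a rough sketch of how such a prompt might be assembled before being passed to a generative AI model, the helper below joins the instruction quoted above with the inference data. The function name `build_prompt`, the idea of passing the audio-derived emotion as text, and the final "Answer with..." line are assumptions added for illustration; no specific model API is implied.

```python
INSTRUCTION = (
    "Analyze the given audio and text data for intimidating language and "
    "emotions, and assess the likelihood of harassment."
)


def build_prompt(document_data: str, emotion_label: str, emotion_score: float) -> str:
    """Combine the instruction with the inference data into one prompt string."""
    if not 0.0 <= emotion_score <= 1.0:
        raise ValueError("emotion_score is assumed to lie in [0, 1]")
    return "\n".join([
        INSTRUCTION,
        "",
        f'Transcript: "{document_data}"',
        f"Emotion recognized from audio: {emotion_label} (score {emotion_score:.2f})",
        # Assumed output format request, not part of the original prompt.
        "Answer with a risk score between 0 and 1 and a one-sentence reason.",
    ])


prompt = build_prompt("Why hasn't this problem been solved yet?", "frustration", 0.8)
print(prompt)
```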
[0748] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0749] Step 1:
[0750] The terminal acquires the user's voice as input through the microphone. The user's speech, collected as sound data, is passed to speech recognition software on the terminal. This converts the sound data into document data, which is then output as text.
[0751] Step 2:
[0752] The terminal sends the document data obtained in step 1 to the server. The server uses a natural language processing engine to analyze this document data. It detects intimidating or aggressive language in the input text and outputs the results as evaluation data.
[0753] Step 3:
[0754] The server simultaneously inputs sound data into emotion recognition software to evaluate the user's emotional state. It outputs an emotion score, which quantifies the emotional state. This score is used later for risk assessment.
[0755] Step 4:
[0756] The server combines the evaluation data and the emotion score obtained in steps 2 and 3 and supplies them to the learning model. The learning model, for example a generative AI model, uses this data to assess the risk of harassment and outputs a risk score.
[0757] Step 5:
[0758] If the risk score calculated in step 4 exceeds a pre-set threshold, the server creates a prompt and sends an alert message to the terminal. The terminal notifies the user of the warning visually or audibly.
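Steps 1 to 5 can be summarized in a short sketch. Everything below is a hedged illustration: `run_pipeline`, `Alert`, the threshold value, and the toy stand-ins for speech recognition, language detection, emotion recognition, and the learning model are all hypothetical names and rules, used only to show how data flows between the steps.

```python
from dataclasses import dataclass
from typing import Callable, Optional

RISK_THRESHOLD = 0.6  # assumed pre-set threshold used in step 5


@dataclass
class Alert:
    risk_score: float
    message: str


def run_pipeline(
    sound_data: bytes,
    transcribe: Callable[[bytes], str],            # step 1 (terminal)
    detect_language: Callable[[str], float],       # step 2 (server)
    recognize_emotion: Callable[[bytes], float],   # step 3 (server)
    assess_risk: Callable[[float, float], float],  # step 4 (server)
    threshold: float = RISK_THRESHOLD,
) -> Optional[Alert]:
    document_data = transcribe(sound_data)                # step 1
    evaluation = detect_language(document_data)           # step 2
    emotion_score = recognize_emotion(sound_data)         # step 3
    risk_score = assess_risk(evaluation, emotion_score)   # step 4
    if risk_score > threshold:                            # step 5
        return Alert(risk_score, "Your wording may come across as intimidating.")
    return None


# Toy stand-ins so the sketch runs without real speech or emotion models.
FLAGGED = ("useless", "shut up", "why hasn't")


def toy_transcribe(sound: bytes) -> str:
    return sound.decode("utf-8")  # pretend the bytes already contain the text


def toy_detect(text: str) -> float:
    hits = sum(1 for phrase in FLAGGED if phrase in text.lower())
    return min(1.0, hits / 2)


def toy_emotion(sound: bytes) -> float:
    return 0.9 if b"!" in sound else 0.1  # crude proxy for a raised voice


def toy_assess(evaluation: float, emotion_score: float) -> float:
    return 0.6 * evaluation + 0.4 * emotion_score


alert = run_pipeline(b"Why hasn't this problem been solved yet?!",
                     toy_transcribe, toy_detect, toy_emotion, toy_assess)
quiet = run_pipeline(b"Thanks, the fix looks good.",
                     toy_transcribe, toy_detect, toy_emotion, toy_assess)
print(alert, quiet)
```

Passing each stage in as a callable keeps the terminal-side and server-side components replaceable, which fits the statement that the processing may be split differently between the devices.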
[0759] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0760] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives a prompt containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions in the prompt and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0761] In the above embodiment, an example was given in which the specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.
[0762] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping, namely an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.
[0763] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. Above and below the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.
[0764] These emotions are distributed at the 3 o'clock position on the emotion map 400, and usually fluctuate between feelings of security and anxiety. In the right half of the emotion map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.
[0765] The inside of the emotion map 400 represents inner thoughts, while the outside represents actions. Therefore, the further toward the outside of the emotion map 400 an emotion lies, the more visible it becomes (that is, the more it is expressed in actions).
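One way to picture this layout is as a set of polar coordinates. The sketch below is only an illustration of the geometry described above: the `MapEntry` type, the clock-style angle convention, and the placements of the three example emotions are assumptions and are not taken from Figure 9.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class MapEntry:
    name: str
    radius: float     # 0.0 = center (inner feeling), 1.0 = outer edge (action)
    clock_deg: float  # 0 = 12 o'clock, measured clockwise (90 = 3 o'clock)

    def xy(self) -> tuple:
        a = math.radians(self.clock_deg)
        # x > 0 is the right half, y > 0 is the upper half.
        return (self.radius * math.sin(a), self.radius * math.cos(a))


def side(entry: MapEntry) -> str:
    """Right half: situational judgment; left half: reaction."""
    return "situation" if entry.xy()[0] > 0 else "reaction"


def valence(entry: MapEntry) -> str:
    """Upper half: pleasure; lower half: displeasure."""
    return "pleasure" if entry.xy()[1] > 0 else "displeasure"


def distance(a: MapEntry, b: MapEntry) -> float:
    """Straight-line distance between two entries on the map."""
    return math.dist(a.xy(), b.xy())


# Hypothetical placements, chosen only to exercise the helpers.
relief = MapEntry("relief", 0.5, 80.0)
anxiety = MapEntry("anxiety", 0.5, 100.0)
desire = MapEntry("desire", 0.5, 300.0)
print(side(relief), valence(relief), side(desire), valence(desire))
print(distance(relief, anxiety) < distance(relief, desire))
```

Here the two entries placed near the 3 o'clock position are close to each other, while the entry on the upper left is far from both, mirroring the idea that emotions likely to occur together are mapped close together.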
[0766] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https://ci.nii.ac.jp/naid/500000375379). The left half of the emotion map contains emotions belonging to a region called "response," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.
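The relationship between balances and pleasure or discomfort can be illustrated with a small calculation. This is a sketch under stated assumptions: the function `balance_valence`, the linear scaling, and the example quantities (`battery_level`, `tilt_deg`) with their ideal values and tolerances are hypothetical.

```python
def balance_valence(current: dict, ideal: dict, tolerance: dict) -> float:
    """Map deviation from ideal balances to a value in [-1, 1].

    +1.0 means every balance is at its ideal (pleasure);
    -1.0 means every balance is at or beyond its tolerance (discomfort).
    """
    deviations = []
    for key, target in ideal.items():
        # Normalize each deviation by its tolerance and cap it at 1.0.
        deviations.append(min(1.0, abs(current[key] - target) / tolerance[key]))
    mean_deviation = sum(deviations) / len(deviations)
    return 1.0 - 2.0 * mean_deviation


# Assumed balances for a robot: battery charge (0..1) and body tilt in degrees.
IDEAL = {"battery_level": 1.0, "tilt_deg": 0.0}
TOLERANCE = {"battery_level": 0.8, "tilt_deg": 30.0}

rested = balance_valence({"battery_level": 1.0, "tilt_deg": 0.0}, IDEAL, TOLERANCE)
strained = balance_valence({"battery_level": 0.2, "tilt_deg": 45.0}, IDEAL, TOLERANCE)
print(rested, strained)
```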
[0767] The emotion map defines two emotions that promote learning. One is the emotion around the middle of the negative "repentance" and "reflection" on the situation side. In other words, it is when the robot experiences negative emotions such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the emotion around the positive "desire" on the reaction side. In other words, it is when the robot has positive feelings such as "I want more" or "I want to know more."
[0768] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
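The two ideas in this paragraph, selecting an emotion from the output values and keeping nearby emotions at similar values, can be sketched as below. The map positions, the Gaussian neighbor weighting, and the function names are assumptions for illustration; the disclosure does not state how the similarity constraint is enforced during training.

```python
import math

# Hypothetical 2-D positions on the emotion map (illustrative only).
POSITIONS = {
    "reassured": (0.50, 0.10),
    "calm": (0.55, 0.05),
    "confident": (0.60, 0.15),
    "anger": (-0.60, -0.50),
}


def determine_emotion(values: dict) -> str:
    """Pick the emotion with the largest emotion value."""
    return max(values, key=values.get)


def neighbor_penalty(values: dict, positions: dict = POSITIONS, scale: float = 0.2) -> float:
    """Penalize value differences between emotions that lie close together.

    A term like this could be added to a training loss (an assumption here)
    so that neighboring emotions end up with similar values.
    """
    names = sorted(values)
    total = 0.0
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            d = math.dist(positions[a], positions[b])
            weight = math.exp(-((d / scale) ** 2))  # near pairs weigh more
            total += weight * (values[a] - values[b]) ** 2
    return total


smooth = {"reassured": 0.80, "calm": 0.75, "confident": 0.70, "anger": 0.05}
jagged = {"reassured": 0.80, "calm": 0.10, "confident": 0.70, "anger": 0.05}
print(determine_emotion(smooth))
print(neighbor_penalty(smooth) < neighbor_penalty(jagged))
```

The `smooth` output, in which "reassured," "calm," and "confident" have similar values, incurs a much smaller penalty than the `jagged` one, where a neighboring emotion is given a very different value.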
[0769] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.
[0770] In the above embodiment, an example was given in which a specific process is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and a distributed processing of the specific process may be performed by multiple computers, including computer 22. For example, a data generation model 58 may be provided in an external device of the data processing device 12, and the external device may generate data according to the input data.
[0771] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes specific processing according to the specific processing program 56.
[0772] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.
[0773] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.
[0774] The following types of processors can be used as hardware resources to perform specific processing. Examples of processors include a CPU, a general-purpose processor that functions as a hardware resource to perform specific processing by executing software, i.e., a program. Other examples of processors include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which have circuit configurations specifically designed to perform specific processing. All of these processors have built-in or connected memory, and all of them perform specific processing by using memory.
[0775] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.
[0776] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.
[0777] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.
[0778] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of the technical aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that you may delete unnecessary parts, add new elements, or replace elements in the descriptions and illustrations presented above, as long as you do not deviate from the essence of the technical aspects of this disclosure. Furthermore, in order to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge and the like that do not require special explanation to enable the implementation of the technical aspects of this disclosure have been omitted from the descriptions and illustrations presented above.
[0779] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted to be incorporated by reference.
[0780] The following is further disclosed regarding the embodiments described above.
[0781] (Claim 1)
[0782] A means for acquiring audio data and converting said audio data into text data,
[0783] A means for analyzing the aforementioned text data to detect aggressive or condescending language,
[0784] A means for evaluating text data using a machine learning model that assesses the possibility of power harassment in a given language,
[0785] A means of notifying the user of the risk of power harassment based on the aforementioned evaluation,
[0786] A system that includes this.
[0787] (Claim 2)
[0788] The system according to claim 1, further comprising means for analyzing the language contained in the aforementioned audio data in real time and converting it into text data.
[0789] (Claim 3)
[0790] The system according to claim 1, wherein the machine learning model is trained on past cases of power harassment and includes means for issuing an alert when a statement exceeds a certain threshold.
[0791] "Example 1"
[0792] (Claim 1)
[0793] A means for acquiring an acoustic signal and converting the acoustic signal into text information,
[0794] Means for analyzing the aforementioned textual information to detect high-handed or aggressive language expressions,
[0795] A means for evaluating textual information using a machine learning model that assesses the possibility of inappropriate behavior in the workplace environment for ongoing linguistic expressions,
[0796] A means of notifying users of the risk of inappropriate behavior based on the aforementioned evaluation,
[0797] A means of collecting feedback from users and using it to improve the accuracy of the machine learning model,
[0798] A system that includes this.
[0799] (Claim 2)
[0800] The system according to claim 1, further comprising means for analyzing the linguistic expression contained in the acoustic signal in real time and converting it into textual information.
[0801] (Claim 3)
[0802] The system according to claim 1, wherein the machine learning model is trained on past instances of inappropriate behavior and includes means for issuing a warning when the expression exceeds a certain threshold.
[0803] "Application Example 1"
[0804] (Claim 1)
[0805] A means for acquiring audio information and converting said audio information into text information,
[0806] Means for analyzing the aforementioned textual information to detect aggressive or condescending expressions,
[0807] A means for evaluating textual information using a learning model that assesses the possibility of power harassment in an ongoing expression,
[0808] A means of notifying users of the risk of power harassment based on the aforementioned evaluation,
[0809] A means comprising an automated device that monitors dialogue in the living environment and alerts the user if a problem is detected,
[0810] A system that includes this.
[0811] (Claim 2)
[0812] The system according to claim 1, further comprising means for analyzing expressions contained in the aforementioned audio information in real time and converting them into text information.
[0813] (Claim 3)
[0814] The system according to claim 1, wherein the learning model is trained based on past cases of power harassment and includes means for issuing a warning when the dialogue exceeds a certain threshold.
[0815] "Example 2 of combining an emotion engine"
[0816] (Claim 1)
[0817] A means for acquiring audio information and converting said audio information into text information,
[0818] Means for analyzing the aforementioned textual information to detect aggressive or condescending language,
[0819] A means for evaluating textual information using a learning model that assesses the possibility of power harassment in a given language,
[0820] A means for analyzing emotions from audio and text information and incorporating the analysis results into power harassment assessments,
[0821] A means of notifying users of the risk of power harassment based on the aforementioned evaluation,
[0822] ...
[0823] A system that includes this.
[0824] (Claim 2)
[0825] The system according to claim 1, further comprising means for analyzing the language contained in the aforementioned audio information in real time and converting it into text information.
[0826] (Claim 3)
[0827] The system according to claim 1, wherein the learning model is trained on past cases of power harassment and includes means for issuing a warning when a statement exceeds a certain threshold.
[0828] "Application example 2 when combining with an emotional engine"
[0829] (Claim 1)
[0830] A means for acquiring sound data and converting said sound data into document data,
[0831] Means for analyzing the aforementioned document data to detect intimidating or aggressive language,
[0832] A means for evaluating document data using a learning model that assesses the potential for harassment in the language being spoken,
[0833] A means of notifying users of the risk of harassment based on the aforementioned evaluation,
[0834] A means of recognizing emotional states from sound data,
[0835] Means for modifying the evaluation of harassment based on the aforementioned emotional state,
[0836] A system that includes this.
[0837] (Claim 2)
[0838] The system according to claim 1, further comprising means for immediately analyzing the language contained in the sound data and converting it into document data.
[0839] (Claim 3)
[0840] The system according to claim 1, wherein the learning model is trained based on past harassment cases and includes means for issuing an alarm when a statement exceeds a certain threshold.
[Explanation of symbols]
[0841]
10, 210, 310, 410 Data processing system
12 Data processing device
14 Smart device
214 Smart glasses
314 Headset-type terminal
414 Robot
Claims
1. A system comprising: means for acquiring audio data and converting the audio data into text data; means for analyzing the text data to detect aggressive or condescending language; means for evaluating the text data using a machine learning model that assesses the possibility of power harassment in the language being spoken; and means for notifying the user of the risk of power harassment based on the evaluation.
2. The system according to claim 1, further comprising means for analyzing the language contained in the aforementioned audio data in real time and converting it into text data.
3. The system according to claim 1, wherein the machine learning model is trained on past cases of power harassment and includes means for issuing an alert when a statement exceeds a certain threshold.