system
The system converts audio to text, analyzes it for power harassment, and provides immediate feedback, addressing the challenge of unintentional harassment by enabling real-time detection and improving workplace communication.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Applications
- Current Assignee / Owner
- SOFTBANK GROUP CORP
- Filing Date
- 2024-12-16
- Publication Date
- 2026-06-26
AI Technical Summary
Existing systems fail to detect and alert against power harassment in real time: perpetrators are often unaware of their own behavior, and victims struggle to express discomfort, so immediate detection and feedback mechanisms are needed.
A system that converts audio data to text, analyzes the text using natural language processing to calculate a power harassment score, and generates a warning message if the score exceeds a threshold, providing immediate feedback to the speaker.
Enables real-time detection and alerting of power harassment, allowing speakers to reflect on their behavior and improve communication, promoting a healthier workplace environment.
Description
Technical Field
[0001] The technology of the present disclosure relates to a system.
Background Art
[0002] Patent Document 1 discloses a persona chatbot control method performed by at least one processor, including steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a chatbot character, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance in response to the user utterance.
Prior Art Documents
Patent Documents
[0003]
Patent Document 1
Summary of the Invention
Problems to be Solved by the Invention
[0004] There is an urgent need to reduce the adverse effects of workplace harassment, especially power harassment, on the working environment. Many perpetrators are unaware that their behavior constitutes harassment, and in many cases victims cannot express their discomfort. Therefore, there is a need for means to detect power harassment speech in real time and promptly alert the speaker.
Means for Solving the Problems
[0005] This invention includes means for acquiring audio data and converting it into text data, and means for analyzing the text data and calculating a power harassment score based on specific criteria. Furthermore, it includes means for generating and notifying a warning message if the calculated score exceeds a certain threshold. This allows the speaker to reflect on their behavior and immediately make improvements.
[0006] "Audio data" refers to a format in which sound information is digitized, recorded, and processed.
[0007] "Text data" refers to a data format that records and processes character information in a digital form.
[0008] A "natural language processing model" is a collection of algorithms and techniques that enable computers to understand and analyze human language.
[0009] "Analysis results" refer to information obtained after processing text data, and are used to derive specific insights or decisions.
[0010] The "power harassment score" is an index that numerically indicates the likelihood of power harassment in a conversation.
[0011] The "reference value" is the threshold against which the power harassment score is compared to determine whether or not to issue a warning or notification.
[0012] A "warning message" is a notification containing information sent to draw attention to something.
[0013] "Notification" refers to a means or method of conveying information to a user.
Brief Description of the Drawings
[0014]
[Figure 1] A conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.
[Figure 2] A conceptual diagram showing an example of the main functions of a data processing device and a smart device according to the first embodiment.
[Figure 3] A conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.
[Figure 4] A conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
[Figure 5] A conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.
[Figure 6] A conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
[Figure 7] A conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.
[Figure 8] A conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
[Figure 9] An emotion map to which a plurality of emotions are mapped.
[Figure 10] An emotion map to which a plurality of emotions are mapped.
[Figure 11] A sequence diagram showing the processing flow of the data processing system in Example 1.
[Figure 12] A sequence diagram showing the processing flow of the data processing system in Application Example 1.
[Figure 13] A sequence diagram showing the processing flow of the data processing system in Example 2 when an emotion engine is combined.
[Figure 14] A sequence diagram showing the processing flow of the data processing system in Application Example 2 when an emotion engine is combined.
MODE FOR CARRYING OUT THE INVENTION
[0015] Hereinafter, an example of an embodiment of a system according to the technology of the present disclosure will be described with reference to the accompanying drawings.
[0016] First, the terms used in the following description will be explained.
[0017] In the following embodiments, a processor denoted by a reference numeral (hereinafter simply referred to as "processor") may be one arithmetic unit or a combination of multiple arithmetic units. The processor may also be one type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), and an APU (Accelerated Processing Unit).
[0018] In the following embodiments, a RAM (Random Access Memory) denoted by a reference numeral is a memory in which information is temporarily stored and which is used as a work memory by the processor.
[0019] In the following embodiments, a storage denoted by a reference numeral is one or more non-volatile storage devices that store various programs, various parameters, and the like. Examples of non-volatile storage devices include flash memory (e.g., an SSD (Solid State Drive)), magnetic disks (e.g., hard disks), and magnetic tapes.
[0020] In the following embodiments, a communication I/F (Interface) denoted by a reference numeral is an interface including a communication processor, an antenna, and the like. The communication I/F controls communication between multiple computers. Examples of communication standards applied to the communication I/F include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), and Bluetooth (registered trademark).
[0021] In the following embodiments, "A and/or B" is synonymous with "at least one of A and B." That is, "A and/or B" means that it may be A alone, B alone, or a combination of A and B. Furthermore, in this specification, the same concept as "A and/or B" applies when expressing three or more things linked by "and/or."
[0022] [First Embodiment]
[0023] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.
[0024] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.
[0025] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).
[0026] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.
[0027] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.
[0028] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and/or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.
[0029] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.
[0030] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.
[0031] As shown in Figure 2, in the data processing device 12, specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.
[0032] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0033] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used by the data processing system 10 in conjunction with the specific processing program 56. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0034] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".
[0035] This invention is implemented as a software application installed on a user's device, such as a smartphone or PC. The device first receives voice input and converts it into text data in real time. By using speech recognition technology, the user's conversation can be accurately captured as text information.
[0036] Next, the terminal sends the converted text data to the server via the internet connection. The server has a natural language processing model for analyzing the received text data, and uses this model to extract linguistic features and context. Based on the analysis results, the server calculates a power harassment score. The calculation method includes the frequency of occurrence of specific keywords and phrases, context, and matching with a database of past power harassment cases.
[0037] If the power harassment score exceeds the set threshold, the server immediately generates a warning message. This warning message includes specific problematic statements and the reasons for them, informing the perpetrator of how their words affect others.
[0038] Subsequently, a warning message is sent to the device and the user is notified. The notification appears as an on-screen pop-up or an audio alert, so the user can immediately notice its presence. For example, if the statement "You're always so useless" is detected, a warning will be issued because this statement carries the risk of belittling the other person.
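The keyword-and-phrase component of the score calculation described above can be sketched as follows. This is only a minimal illustration: the phrase list, the weights, and the threshold of 0.5 are hypothetical values chosen for the example, and the disclosure does not specify them or how context and past cases are weighted.

```python
import re

# Hypothetical phrase weights; the disclosure does not list actual keywords.
PHRASE_WEIGHTS = {
    "useless": 0.6,
    "incompetent": 0.6,
    "your fault": 0.4,
    "always": 0.2,
}
THRESHOLD = 0.5  # assumed reference value


def power_harassment_score(text: str) -> float:
    """Sum the weights of phrases found in the text, capped at 1.0."""
    lowered = text.lower()
    total = 0.0
    for phrase, weight in PHRASE_WEIGHTS.items():
        # Count whole-word occurrences of each phrase.
        hits = len(re.findall(r"\b" + re.escape(phrase) + r"\b", lowered))
        total += weight * hits
    return min(total, 1.0)


def needs_warning(text: str) -> bool:
    """Return True when the score exceeds the assumed threshold."""
    return power_harassment_score(text) > THRESHOLD
```

With these assumed weights, the statement "You're always so useless" scores above the threshold and would trigger a warning, while a neutral remark scores zero.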
[0039] This provides users with an opportunity to review their unconscious words and actions, creating an environment that helps prevent workplace harassment. It is believed that using this system will promote healthy communication in the workplace and contribute to increased productivity.
[0040] The following describes the processing flow.
[0041] Step 1:
[0042] The terminal captures the user's voice using the microphone and temporarily stores it as audio data. Noise filtering is performed at the same time to ensure sound quality.
[0043] Step 2:
[0044] The terminal converts the acquired audio data into text data in real time. It uses a speech recognition engine to achieve highly accurate transcription.
[0045] Step 3:
[0046] The terminal sends the converted text data to the server. During transmission, the data is encrypted to protect user privacy.
[0047] Step 4:
[0048] The server receives the transmitted text data and begins analysis. It uses a natural language processing model to extract linguistic features and understand the context.
[0049] Step 5:
[0050] The server calculates a power harassment score based on the analysis results. It checks for the appearance of specific keywords and phrases and calculates the score by comparing them with past power harassment case data.
[0051] Step 6:
[0052] The server compares the power harassment score to a baseline value. If the score exceeds the baseline value, it generates a warning message about the problematic remarks. This message explains which parts are problematic and why.
[0053] Step 7:
[0054] The server sends the generated warning message to the terminal.
[0055] Step 8:
[0056] The terminal notifies the user of the received warning message. The notification is provided visually and audibly, allowing the user to understand the content immediately.
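Steps 1 to 8 above can be sketched as two small functions, one for the terminal and one for the server. This is a sketch under stated assumptions, not the disclosed implementation: `WarningMessage`, `server_handle`, and `terminal_cycle` are hypothetical names, speech recognition, transport, and scoring are passed in as callables, and noise filtering and encryption are omitted.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class WarningMessage:
    statement: str  # the problematic remark
    reason: str     # why it was flagged
    score: float    # the calculated power harassment score


def server_handle(
    text: str,
    score_fn: Callable[[str], float],
    threshold: float,
) -> Optional[WarningMessage]:
    """Steps 4 to 6: analyze the text, score it, compare with the reference value."""
    score = score_fn(text)
    if score > threshold:
        reason = f"score {score:.2f} exceeds reference value {threshold:.2f}"
        return WarningMessage(text, reason, score)
    return None


def terminal_cycle(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    send: Callable[[str], Optional[WarningMessage]],
    notify: Callable[[str], None],
) -> None:
    """Steps 1 to 3 and 7 to 8: transcribe, send to the server, notify on a warning."""
    text = transcribe(audio)
    warning = send(text)
    if warning is not None:
        notify(f'Warning: "{warning.statement}" ({warning.reason})')
```

Passing the stages in as callables keeps the sketch independent of any particular speech recognition engine or network layer, neither of which the disclosure fixes at this point.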
[0057] (Example 1)
[0058] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."
[0059] In modern society, there is a need to reduce the risk of inappropriate remarks and harassment in the workplace and other communication settings, and to promote a healthy environment for dialogue. In particular, there is a need for a system that can immediately grasp the psychological impact that unconscious remarks made by participants have on others and respond appropriately.
[0060] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0061] In this invention, the server includes means for acquiring audio data and converting it into text information, means for transmitting the converted text information to a remote device, and means for analyzing the received text information using a natural language processing model. This makes it possible to immediately evaluate whether a participant's remarks have a psychological impact and to issue a warning if necessary.
[0062] "Audio data" refers to the digital representation of audio information spoken by a user.
[0063] "Textual information" refers to data obtained by analyzing audio data and converting it into a text format that humans can understand.
[0064] A "natural language processing model" is a computer program that understands human language and facilitates communication. It is used to analyze and extract the context and meaning of data.
[0065] A "psychological impact score" is an index that quantifies the potential psychological and emotional impact that a statement may have on others.
[0066] A "warning message" is information generated when a user's statement violates the standards, intended to alert the user to the issue.
[0067] A "remote device" is a device that can communicate via a network and is used to process or store information.
[0068] "Encryption" is a technology that transforms information based on a specific algorithm to improve data security and prevent unauthorized access.
[0069] This invention is implemented as a system consisting of software installed on a portable information terminal or computer device used by a user.
[0070] Terminal:
[0071] When a user speaks, the terminal acquires the audio data and converts it into text in real time. For example, a general-purpose speech recognition service is used for this process. This ensures that the spoken words are accurately converted into digital text.
[0072] Server:
[0073] After the speech is converted into text information, the text information is sent to a server via an internet connection. The server uses a generative AI model and a natural language processing model to analyze the text information. Here, the server extracts linguistic features and context from the text information and calculates a psychological impact score. This analysis uses machine learning and data mining techniques, and past cases stored in the database are also referenced.
[0074] Warning message:
[0075] If the psychological impact score exceeds a certain threshold, the server generates a warning message. This message is sent to the user's device and displayed on the display screen. For example, if a negative comment is detected, the user will receive a warning such as, "This comment may give an inappropriate impression."
[0076] Specific examples and prompt statements:
[0077] If a user says "You are incompetent" during a workplace conversation, the system analyzes this expression and determines whether to issue an appropriate warning. An example of a prompt provided to the generative AI model is, "Analyze this text and determine its psychological impact."
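How the quoted prompt might be combined with a transcribed statement can be sketched as follows. The instruction string is taken from the text above; the layout of the prompt and the `score: <number>` reply format parsed by `parse_score` are assumptions made for illustration, since the disclosure does not describe the model's output format.

```python
import re
from typing import Optional

# Instruction sentence quoted in the description.
PROMPT_INSTRUCTION = "Analyze this text and determine its psychological impact."


def build_prompt(utterance: str) -> str:
    """Combine the instruction with the transcribed utterance (assumed layout)."""
    return f'{PROMPT_INSTRUCTION}\n\nText: "{utterance.strip()}"'


def parse_score(reply: str) -> Optional[float]:
    """Extract a 'score: <number>' value from a model reply (assumed reply format)."""
    match = re.search(r"score\s*:\s*([0-9]*\.?[0-9]+)", reply, flags=re.IGNORECASE)
    return float(match.group(1)) if match else None
```

Returning `None` when no score is found lets the caller decide whether to retry or to skip the warning, rather than treating a malformed reply as a score of zero.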
[0078] This system allows users to review their own statements and improve the quality of their communication. As a result, it is expected to improve the health and productivity of interactions in the user environment.
[0079] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0080] Step 1:
[0081] The terminal acquires the user's speech as audio data. The input is the user's voice, which is converted into digital audio data. The terminal passes this audio data to its internal speech recognition engine, which converts it into text information in real time. The output of the conversion is text information that records the user's speech.
[0082] Step 2:
[0083] The terminal sends the converted text information to the server via the internet connection. The input here is the text information converted by the terminal. The terminal encrypts the text information using a secure transmission protocol and sends it to the server. The output is the text information received by the server.
[0084] Step 3:
[0085] The server analyzes the received text information. The input is text information sent from the terminal. The server utilizes its built-in generative AI model and natural language processing model to perform data analysis to extract linguistic features and context. This analysis yields a psychological impact score as output.
[0086] Step 4:
[0087] The server calculates a psychological impact score based on the analysis results. The input is the result of natural language processing. The server compares this with a database of past cases and calculates the psychological impact score according to specific criteria. If this score exceeds the criteria, a flag is output indicating that a warning is necessary.
[0088] Step 5:
[0089] The server generates a warning message based on the score flag. The input is information about the score exceeding the threshold. The server generates a warning message and determines the specific content to provide to the user. The output is the text data of the warning message.
[0090] Step 6:
[0091] The terminal receives warning messages sent from the server and notifies the user. The input is the warning message data transferred from the server. The terminal immediately alerts the user using screen displays and audio alerts. The output is a warning display that the user can see.
[0092] (Application Example 1)
[0093] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."
[0094] In the workplace, it is essential to prevent situations where remarks are unintentionally perceived as harassment. However, it is difficult for speakers to constantly be aware of how their individual remarks are received. Furthermore, the lack of a system to immediately provide corrective measures when problems are recognized after a remark has been made hinders the development of healthy workplace communication.
[0095] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0096] In this invention, the server includes means for acquiring voice information and converting it into text information, means for analyzing the text information and calculating a behavioral evaluation score, and means for generating warning information and providing suggestions. This enables the speaker to immediately recognize the impact their words have on others and to make appropriate behavioral modifications.
[0097] "Audio information" refers to data that represents content transmitted through sound in a digital format.
[0098] "Textual information" refers to data that represents audio information in text format.
[0099] "Means of conversion" refers to a function or device for converting audio information into text information in real time.
[0100] "Means of analysis" refers to a system or program that analyzes the acquired and converted text information to understand its meaning and context.
[0101] The "behavioral evaluation score" is an index that evaluates and quantifies the appropriateness of a statement based on analyzed textual information.
[0102] A "warning message" is a notification generated when the calculated behavioral evaluation score exceeds a certain threshold, clearly indicating the points that should be corrected in the statement in question.
[0103] "Means of providing suggestions" refers to a function that provides users with advice and hints to review and improve their statements based on warning information.
[0104] This invention is a system for evaluating the appropriateness of speech by acquiring voice information in real time and analyzing its content. This system integrates speech recognition technology and natural language processing technology and is installed in an information processing terminal that users use on a daily basis. Specifically, it utilizes an input device for acquiring voice information (e.g., a smartphone microphone) and speech recognition software for converting it to text (e.g., Google® Cloud Speech-to-Text).
[0105] The terminal first receives the user's speech through its microphone and converts this audio information into text using speech recognition software. The terminal then sends this text information to a server via the internet.
[0106] The server interprets the received text information and analyzes the linguistic features and context using a natural language processing model (e.g., one built with NLTK or Hugging Face's Transformers). Next, it calculates a behavioral evaluation score based on the analysis results. This score is calculated based on set criteria, and warning information is attached to statements that exceed a certain threshold.
[0107] Warning information is sent to the user, including suggestions for improvement regarding the relevant statement. Users can immediately check this notification on their smartphone screen and recognize the unintended impact of their statement. This improves the quality of everyday communication and contributes to building smoother relationships in the workplace.
[0108] For example, if someone says during a meeting, "It's your fault that this project is progressing so slowly," the system can analyze this statement and issue a warning if it excessively emphasizes the other person's responsibility. An example prompt for analyzing the user's statement would be, "Identify any negative comments or harassment in the following text. If any, explain why."
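The pairing of warning information with suggestions for improvement described above can be sketched as follows. The rule table, the wording of the reasons and suggestions, and the default threshold of 0.5 are hypothetical; the disclosure does not state how suggestions are derived.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical rules: (pattern, reason it is problematic, suggested alternative).
RULES = [
    ("your fault", "places the blame on one person",
     "Describe the delay and ask what support is needed."),
    ("useless", "belittles the listener",
     "Name the specific issue instead of judging the person."),
]


@dataclass
class WarningInfo:
    statement: str
    reasons: List[str] = field(default_factory=list)
    suggestions: List[str] = field(default_factory=list)


def evaluate(statement: str, score: float, threshold: float = 0.5) -> Optional[WarningInfo]:
    """Return warning information when the score exceeds the threshold, else None."""
    if score <= threshold:
        return None
    info = WarningInfo(statement)
    lowered = statement.lower()
    for pattern, reason, suggestion in RULES:
        if pattern in lowered:
            info.reasons.append(reason)
            info.suggestions.append(suggestion)
    return info
```

For the meeting remark quoted above, a score over the assumed threshold yields one reason and one suggestion; a score at or below it yields no warning at all.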
[0109] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0110] Step 1:
[0111] The terminal acquires user speech as audio information via a microphone. The input is user speech data. This audio information is converted into text in real time using speech recognition software such as Google Cloud Speech-to-Text. The output is the converted text data.
[0112] Step 2:
[0113] The terminal sends the converted text information to the server via the internet. The input is text data, which is then transferred to the server. Because the HTTPS protocol is used for transmission, the data is transmitted securely.
[0114] Step 3:
[0115] The server analyzes the received text information. The input is text data, and the analysis is performed using a natural language processing model (e.g., NLTK or Hugging Face's Transformers). During the analysis process, linguistic features and context are evaluated, and the information is structured. The output is the analysis result.
[0116] Step 4:
[0117] The server calculates a behavioral evaluation score based on the analysis results. The input is the analysis results, and the appropriateness of the statements is quantified by the scoring algorithm. The output is the behavioral evaluation score.
[0118] Step 5:
[0119] The server compares the calculated behavioral evaluation score to a set threshold value. If the score exceeds the threshold value, it generates a warning message. The input is the behavioral evaluation score, and the output is the warning message. This message includes an analysis of why the remarks were problematic and suggestions for improvement.
[0120] Step 6:
[0121] The terminal receives warning information sent from the server and notifies the user. The input is the warning information, which is displayed as a pop-up or audio alert. The user acknowledges this warning and is prompted to reconsider their statement. The output is feedback on the user's perception and actions.
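The HTTPS transmission in Step 2 of this flow can be sketched with the Python standard library. The endpoint URL and the JSON field name `text` are hypothetical, since the disclosure names neither a URL nor a message format. The function only builds the request and sends nothing.

```python
import json
import urllib.request

# Hypothetical endpoint; the disclosure does not name a URL or API.
SERVER_URL = "https://example.com/api/analyze"


def build_request(text: str, url: str = SERVER_URL) -> urllib.request.Request:
    """Build a JSON POST request for the transcribed text; refuse non-HTTPS URLs."""
    if not url.lower().startswith("https://"):
        raise ValueError("text must be sent over HTTPS")
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

A terminal could then pass the returned object to `urllib.request.urlopen`, which verifies the server certificate by default for HTTPS URLs.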
[0122] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.
[0123] This invention is embodied as a software application installed on a user's smartphone, PC, or other device. This system has the function of converting audio data into text data in real time, and then sending the text data to a server for analysis. The server is equipped with a natural language processing model and an emotion engine, which are used to evaluate the context and emotion of the user's speech.
[0124] The server extracts linguistic features from the received text data and uses an emotion engine to recognize the user's emotions. The emotion engine identifies emotional keywords and expressions appearing in the text and has the function of evaluating the intensity and type of those emotions. Based on the text data and emotion data, the server calculates a power harassment score, which serves as an indicator of whether the conversation content constitutes power harassment.
[0125] The emotion data recognized by the emotion engine is reflected in the calculation of the power harassment score, which is affected by whether the conversation is conducted calmly or in an emotionally aggressive manner. For example, the score increases when aggressive emotions are present, increasing the likelihood of a warning.
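One way the emotion data could raise the score is a simple multiplicative adjustment, sketched below. The formula, the weight of 0.5, and the assumption that both inputs lie in [0, 1] are illustrative choices; the disclosure only states that the score increases when aggressive emotions are present.

```python
def adjusted_score(base_score: float, aggression: float, weight: float = 0.5) -> float:
    """Raise the text-based score in proportion to detected aggression, capped at 1.0.

    base_score: score from the text analysis alone, assumed to be in [0, 1].
    aggression: intensity of aggressive emotion from the emotion engine, in [0, 1].
    weight: hypothetical factor controlling how strongly emotion affects the score.
    """
    if not (0.0 <= base_score <= 1.0 and 0.0 <= aggression <= 1.0):
        raise ValueError("inputs must lie in [0, 1]")
    return min(1.0, base_score * (1.0 + weight * aggression))
```

Under these assumptions, a remark with a base score of 0.4 stays below a threshold of 0.5 when spoken calmly but crosses it when spoken with maximum aggression.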
[0126] If the score exceeds a certain threshold, the server generates a warning message. This message includes the problematic statement, the reason it was deemed aggressive, and the user's emotional state. The generated warning message is sent to the device and the user is notified. The notification is made through a visual pop-up or an audio alert, allowing the user to re-evaluate their own statements and emotions.
[0127] This system provides users with the opportunity to correct inappropriate remarks they may have unconsciously made during communication, creating an environment aimed at preventing workplace harassment. The incorporation of an emotion engine enables more accurate feedback, making it a valuable tool for user self-improvement.
[0128] The following describes the processing flow.
[0129] Step 1:
[0130] The terminal continuously captures the user's voice using a microphone and stores it as audio data. This data is processed by a noise reduction algorithm to obtain a clear audio signal.
[0131] Step 2:
[0132] The terminal uses speech recognition software to convert the audio data into text data in real time. The converted text is recorded as an accurate textual record of the user's speech.
[0133] Step 3:
[0134] The terminal encrypts the text data before sending it to the server. This protects user privacy and enables secure data transmission.
[0135] Step 4:
[0136] The server inputs the received text data into a natural language processing (NLP) model for analysis. Here, linguistic features and contextual information from the text are extracted.
[0137] Step 5:
[0138] The server activates the emotion engine based on the analysis results of the NLP model. The emotion engine analyzes text dynamics and identifies emotional states such as positive, negative, and aggressive.
[0139] Step 6:
[0140] The server integrates acquired linguistic features and emotional information to calculate a power harassment score. The score quantifies the potential impact that the remarks have on the recipient.
[0141] Step 7:
[0142] The server compares the calculated power harassment score to a baseline value. If the score exceeds the baseline value, a warning message is generated. The message includes the content of the problematic remarks and the sentiment analysis results.
[0143] Step 8:
[0144] The server sends the generated warning message to the terminal.
[0145] Step 9:
[0146] The terminal notifies the user of the warning message. The notification appears as a pop-up on the terminal screen and can also be accompanied by an audio alert to prompt the user's immediate attention.
[0147] This allows users to review their own words, actions, and emotional states and receive feedback to take appropriate action.
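The emotion engine step and the warning message that includes the emotion result can be sketched as follows. The word lexicon, its intensity values, and the message wording are hypothetical stand-ins for the emotion identification model, which the disclosure does not detail.

```python
from typing import Dict, Optional, Tuple

# Hypothetical emotion lexicon: word -> (emotion type, intensity in [0, 1]).
LEXICON: Dict[str, Tuple[str, float]] = {
    "useless": ("aggressive", 0.9),
    "stupid": ("aggressive", 0.8),
    "sorry": ("negative", 0.3),
    "thanks": ("positive", 0.5),
}


def estimate_emotion(text: str) -> Tuple[str, float]:
    """Return the strongest emotion type found in the text and its intensity."""
    best: Dict[str, float] = {}
    for raw in text.lower().split():
        word = raw.strip(".,!?\"'")
        if word in LEXICON:
            kind, intensity = LEXICON[word]
            best[kind] = max(best.get(kind, 0.0), intensity)
    if not best:
        return ("neutral", 0.0)
    kind = max(best, key=lambda k: best[k])
    return (kind, best[kind])


def build_warning(text: str, score: float, threshold: float = 0.5) -> Optional[str]:
    """Step 7: return a message with the remark and the emotion result, or None."""
    if score <= threshold:
        return None
    kind, intensity = estimate_emotion(text)
    return f'Flagged remark: "{text}" (emotion: {kind}, intensity {intensity:.1f})'
```

A trained classifier would replace the lexicon lookup in practice; the sketch only shows where the emotion type and intensity enter the warning message.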
[0148] (Example 2)
[0149] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."
[0150] Traditional communication analysis systems have struggled to accurately identify emotions from voice and provide appropriate warnings to users when conversation content exceeds certain thresholds. In particular, there is a need to detect remarks that could potentially affect mental health, such as power harassment, in real time and provide immediate feedback. Furthermore, encryption is required for data transmission to protect privacy.
[0151] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0152] In this invention, the server includes means for acquiring voice information and converting it into text information, means for analyzing the converted text information using natural language processing, and means for identifying emotions and generating emotion data based on the analysis results. This makes it possible to identify emotions during voice communication, calculate the risk of power harassment, and provide a warning to the user.
[0153] "Voice information" refers to audio represented in a digital format so that it can be processed.
[0154] "Text information" refers to data obtained by converting voice information into written characters.
[0155] "Analysis" is the process of evaluating the context, content, and emotions of text information.
[0156] An "indicator" is an evaluation value quantified according to specific evaluation criteria, based on the results of the analysis.
[0157] "Warning information" is information generated to alert the user when the calculated indicator exceeds a certain threshold value.
[0158] "Emotion data" refers to data that indicates the intensity and type of emotion identified from text information.
[0159] "Encryption" is the process of transforming data into a form that cannot be interpreted by third parties in order to protect its content.
[0160] This invention is implemented by a system that captures speech through a user's voice input device and converts it to text in real time. The system uses speech recognition technology, for example a service such as the Google Speech-to-Text API. The user's device converts the captured speech into text information and sends that text information to the server. During transmission, the data is encrypted using the SSL/TLS protocol to secure the communication.
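The encrypted transmission described above can be sketched with the Python standard library as follows. The endpoint `SERVER_URL`, the JSON payload layout, and the function names are hypothetical, and the minimum TLS version is an assumption; the disclosure states only that SSL/TLS is used.

```python
import json
import ssl
import urllib.request

# Hypothetical endpoint; the disclosure does not name a server address.
SERVER_URL = "https://example.com/api/analyze"


def build_request(text: str) -> urllib.request.Request:
    """Wrap the recognized text in a JSON POST request for the server."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        SERVER_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def make_tls_context() -> ssl.SSLContext:
    """Create a client context that verifies the server certificate."""
    context = ssl.create_default_context()
    # Refusing legacy protocol versions is an assumption, not from the source.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


def send_text(text: str) -> bytes:
    """Send the text over HTTPS and return the raw response body."""
    request = build_request(text)
    with urllib.request.urlopen(
        request, context=make_tls_context(), timeout=10
    ) as response:
        return response.read()
```

In this sketch the text is protected in transit by TLS only; encrypting the payload itself before transmission would be a separate design choice.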
[0161] The server uses a natural language processing model to analyze the received text information. Examples of the models used include language models such as BERT and GPT-3 (registered trademark). This analysis extracts linguistic features from the text information and captures the context of the speech. Furthermore, an emotion engine is used to identify emotions within the text information, and their intensity and type are recorded as data.
[0162] Based on the emotion data and the analysis results, the server calculates an indicator. This indicator represents the likelihood that the conversation content constitutes power harassment. For example, if a user says, "Your way of doing things is completely wrong," the server reacts to the negative word "wrong" and records emotion data with a high negative intensity. If the indicator exceeds a certain threshold, the server generates a warning message and sends it to the user.
[0163] The generated warning information is displayed on the device and notified to the user through visual or audible alerts. This allows the user to understand in real time how their comments are being evaluated and how they can be improved.
[0164] As a concrete example, the following prompt sentences can be input into a generative AI model to simulate similar language analysis:
[0165] "Please explain how workplace conversations are recorded in real time, how negative emotional expressions are identified, and how they are evaluated and notified."
[0166] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0167] Step 1:
[0168] The device captures the user's voice through the microphone. This voice data is input in real time to a speech recognition service such as the Google Speech-to-Text API. The input voice data is processed and output as text information. For example, if the user says "Hello, how are you?", this phrase is converted into text information.
[0169] Step 2:
[0170] The terminal sends the converted text information to the server. The SSL/TLS encryption protocol is used during transmission to ensure data security. The input at this stage is the text information obtained in Step 1, and the output is the same text information, delivered to the server in encrypted form.
[0171] Step 3:
[0172] The server inputs the received text information into a natural language processing model, for example a language model such as BERT or GPT-3. The model analyzes the text information and extracts linguistic features and context. The input is the text information, and the output is the result of the contextual analysis. For example, the model analyzes the emotional nuances of a phrase.
[0173] Step 4:
[0174] The server generates emotion data via the emotion engine based on the analyzed contextual data. The input for this step is the contextual data from Step 3, which is processed to output the intensity and type of emotion. Specifically, the emotion engine identifies positive, negative, and neutral emotions.
[0175] Step 5:
[0176] The server uses the emotion data and the contextual analysis results to calculate the indicator. The emotion engine quantifies a score according to the intensity of the emotion and outputs it as the indicator. For example, a high score is obtained when negative emotions are strong.
[0177] Step 6:
[0178] The server determines whether the calculated indicator exceeds the threshold value. In this step, the indicator is the input, and the result of the determination (whether or not it exceeds the threshold value) is the output. If the indicator exceeds the threshold value, warning information is then generated.
[0179] Step 7:
[0180] The server generates warning information to be sent to the terminal. The input is the judgment result from step 6, which is used to create and output an appropriate warning message. For example, a message indicating that the user's statement is aggressive is constructed.
[0181] Step 8:
[0182] The terminal notifies the user of the received warning information. The input is warning information sent from the server, which is output as a screen display or audio alert. The user receives this and has an opportunity to re-evaluate their own statements.
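Steps 4 to 6 of this flow can be illustrated with a simple stand-in for the emotion engine. The word lists, intensities, threshold, and names below are hypothetical; the disclosure uses a language model and an emotion engine, whereas this sketch substitutes a small lexicon lookup so that it stays self-contained.

```python
import re
from typing import NamedTuple

# Hypothetical lexicons with per-word intensities in the range 0..1.
NEGATIVE_WORDS = {"wrong": 0.6, "useless": 0.9, "incompetent": 0.9}
POSITIVE_WORDS = {"thanks": 0.5, "great": 0.6, "hello": 0.2}
THRESHOLD = 0.5  # assumed threshold value


class EmotionData(NamedTuple):
    kind: str         # "positive", "negative", or "neutral"
    intensity: float  # 0..1


def emotion_engine(text: str) -> EmotionData:
    """Step 4: output the type and intensity of the emotion in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    negative = max((NEGATIVE_WORDS.get(w, 0.0) for w in words), default=0.0)
    positive = max((POSITIVE_WORDS.get(w, 0.0) for w in words), default=0.0)
    if negative == 0.0 and positive == 0.0:
        return EmotionData("neutral", 0.0)
    if negative >= positive:
        return EmotionData("negative", negative)
    return EmotionData("positive", positive)


def indicator(emotion: EmotionData) -> float:
    """Step 5: the indicator is high when negative emotion is strong."""
    return emotion.intensity if emotion.kind == "negative" else 0.0


def exceeds_threshold(value: float) -> bool:
    """Step 6: determine whether the indicator exceeds the threshold."""
    return value > THRESHOLD
```

With these assumed values, the remark "Your way of doing things is completely wrong" is classified as negative because of the word "wrong", and its indicator exceeds the threshold, whereas "Hello, how are you?" does not.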
[0183] (Application Example 2)
[0184] Next, we will describe Application Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 will be referred to as the "terminal".
[0185] In customer support interactions, staff members may unconsciously make inappropriate remarks to customers, which can lead to decreased customer satisfaction. To solve this problem, support is needed to enable staff members to evaluate their own words and actions in real time during conversations with customers and to respond appropriately.
[0186] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0187] In this invention, the server includes means for converting voice information into text information, means for analyzing the converted text information, means for calculating an evaluation score according to criteria based on the analysis results, and means for analyzing emotional data in real time and providing advice to support appropriate response methods. This enables the person in charge to always respond to customers with appropriate attitude and language while interacting with them.
[0188] "Voice information" refers to human speech and other sounds acquired as digital data.
[0189] "Text information" refers to text data converted from voice information.
[0190] "Analysis" is the process of analyzing content based on textual information to identify specific patterns or emotions.
[0191] An "evaluation score" is a numerical indicator calculated based on the analysis results, which shows the degree of conformity to a specific standard.
[0192] A "notification message" is information that indicates a warning when the evaluation score exceeds a certain threshold.
[0193] "Emotional data" refers to information about emotions extracted from textual data.
[0194] "Providing advice" means offering real-time guidance on appropriate actions based on evaluation scores and sentiment data.
[0195] "Encryption" is the process of protecting audio information by transforming it using a specific algorithm so that it cannot be easily read by a third party.
[0196] The system for realizing this application is basically designed as a software application for acquiring voice information in real time and converting it into text information. The user uses a device such as a smartphone or PC to input voice information through a microphone. The device then uses the SpeechRecognition library to convert the voice information into text information.
[0197] Text information transmitted from the terminal reaches the server, where it is analyzed using natural language processing technology. This analysis uses a language processing model (for example, a model built with TensorFlow (registered trademark)). The server extracts sentiment data from the text information and calculates an evaluation score based on it. Based on the sentiment data, the server also provides advice that encourages an appropriate response to the specific situation.
[0198] The server generates visual or auditory notification messages for customer service representatives based on sentiment data and evaluation scores. For example, if strong emotions indicating customer dissatisfaction are detected, advice such as "To improve customer satisfaction, please strive for a calm response" might be provided. This message is displayed via the terminal's screen or through an audio alert.
[0199] For example, if the system detects that a customer is experiencing high levels of stress during a customer service call, it analyzes the customer's emotions in real time and advises the staff member to remain calm. This approach is expected to improve customer satisfaction.
[0200] An example of a prompt message might be: "Develop an application that provides advice to facilitate smooth conversations with customers. It will use voice input, analyze sentiment in real time, and recommend appropriate responses as needed."
[0201] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0202] Step 1:
[0203] The user inputs voice information using the device's microphone. The device acquires this voice signal and converts the voice into text information using the SpeechRecognition library. In this step, the input is voice data, and the output is text data converted from the voice. Data processing involves voice sampling and subsequent conversion to text information.
[0204] Step 2:
[0205] After the text information is generated, the terminal sends it to the server via the network. The input is the text information, and the output is the text information as received by the server. The data is encrypted when it is sent to the server, ensuring security.
[0206] Step 3:
[0207] The server analyzes the content of the received text information using a natural language processing model. A language model built with TensorFlow is used for this purpose. The input is the received text information, and the output is the semantic information of the analyzed text. Data calculations include structural analysis of the language and extraction of sentiment data.
[0208] Step 4:
[0209] Based on the analysis results, the server extracts emotional data and calculates an evaluation score. This score is a numerical representation of how appropriate the conversation is. The input is the analysis results, and the output is the evaluation score. Data processing at this stage includes the evaluation of the type and intensity of emotions by the emotion engine.
[0210] Step 5:
[0211] Based on the evaluation score, the server generates real-time advice for the user. This advice suggests specific actions to facilitate smooth customer interactions. The input is the evaluation score, and the output is a visual or auditory notification message presented to the user. Here, a generative AI model is used to create appropriate advice.
[0212] Step 6:
[0213] Notification messages sent from the server are received by the terminal and presented to the user. Here, the user reflects on their response and improves it as needed. The input is the notification message from the server, and the output is the information displayed to the user. Specific actions include pop-up displays on the terminal screen and audio alerts.
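Steps 4 and 5 of this flow can be sketched as follows. The threshold, the scoring rule, and the advice table are assumptions. In particular, the disclosure generates the advice with a generative AI model, for which the fixed table here is only a stand-in, and the sketch assumes that a higher evaluation score indicates a greater need for advice.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed threshold and canned advice. The disclosure uses a generative AI
# model to write the advice; this table merely stands in for that model.
THRESHOLD = 0.6
ADVICE = {
    "anger": "To improve customer satisfaction, please strive for a calm response.",
    "anxiety": "Please explain the next steps slowly and clearly.",
}
DEFAULT_ADVICE = "Please review your wording before continuing."


@dataclass
class EmotionData:
    kind: str         # type of emotion detected in the conversation
    intensity: float  # strength of that emotion


def evaluation_score(emotion: EmotionData) -> float:
    """Assumed scoring: the emotion intensity, clamped to the range 0..1."""
    return min(max(emotion.intensity, 0.0), 1.0)


def notification_message(emotion: EmotionData) -> Optional[str]:
    """Return advice only when the evaluation score exceeds the threshold."""
    if evaluation_score(emotion) <= THRESHOLD:
        return None
    return ADVICE.get(emotion.kind, DEFAULT_ADVICE)
```

For example, `notification_message(EmotionData("anger", 0.9))` returns the calm-response advice quoted in the description, while a low-intensity emotion produces no notification.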
[0214] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0215] Data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives prompts containing instructions together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0216] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.
[0217] [Second Embodiment]
[0218] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.
[0219] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.
[0220] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0221] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.
[0222] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.
[0223] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0224] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0225] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0226] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0227] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0228] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0229] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".
[0230] This invention is implemented as a software application installed on a user's device, such as a smartphone or PC. The device first receives voice input and converts it into text data in real time. By using speech recognition technology, the user's conversation can be accurately captured as text information.
[0231] Next, the terminal sends the converted text data to the server via the internet connection. The server has a natural language processing model for analyzing the received text data, and uses this model to extract linguistic features and context. Based on the analysis results, the server calculates a power harassment score. The calculation method includes the frequency of occurrence of specific keywords and phrases, context, and matching with a database of past power harassment cases.
[0232] If the power harassment score exceeds the set threshold, the server immediately generates a warning message. This warning message includes specific problematic statements and the reasons for them, informing the perpetrator of how their words affect others.
[0233] Subsequently, a warning message is sent to the device and the user is notified. The notification appears as an on-screen pop-up or an audio alert, so the user can immediately notice its presence. For example, if the statement "You're always so useless" is detected, a warning will be issued because this statement carries the risk of belittling the other person.
[0234] This provides users with an opportunity to review their unconscious words and actions, creating an environment that helps prevent workplace harassment. It is believed that using this system will promote healthy communication in the workplace and contribute to increased productivity.
[0235] The following describes the processing flow.
[0236] Step 1:
[0237] The device captures the user's voice using the microphone and temporarily stores it as audio data. Noise filtering is also performed simultaneously to ensure sound quality.
[0238] Step 2:
[0239] The device converts acquired audio data into text data in real time. It uses a speech recognition engine to achieve highly accurate transcription.
[0240] Step 3:
[0241] The device sends the converted text data to the server. During transmission, the data is encrypted to protect user privacy.
[0242] Step 4:
[0243] The server receives the transmitted text data and begins analysis. It uses a natural language processing model to extract linguistic features and understand the context.
[0244] Step 5:
[0245] The server calculates a power harassment score based on the analysis results. It checks for the appearance of specific keywords and phrases and calculates the score by comparing them with past power harassment case data.
[0246] Step 6:
[0247] The server compares the power harassment score to a baseline value. If the score exceeds the baseline value, it generates a warning message about the problematic remarks. This message explains which parts are problematic and why.
[0248] Step 7:
[0249] The server sends the generated warning message to the terminal.
[0250] Step 8:
[0251] The device notifies the user of any received warning messages. The notification is provided visually and audibly, allowing the user to understand the content immediately.
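The score calculation in Step 5 of this flow, which combines keyword occurrence with matching against past power harassment cases, can be sketched as follows. The keyword list, the past-case entries, the weights, and the baseline value are all hypothetical, and `difflib.SequenceMatcher` is used only as a simple stand-in for whatever matching method the server actually employs.

```python
import difflib

# Hypothetical keyword list and past-case database.
KEYWORDS = ("useless", "incompetent", "stupid")
PAST_CASES = ("you're always so useless", "you are incompetent")

# Assumed weights and baseline value.
KEYWORD_WEIGHT = 0.5
CASE_WEIGHT = 0.5
BASELINE = 0.5


def keyword_score(text: str) -> float:
    """Score keyword occurrences: 0.5 per hit, capped at 1.0."""
    words = [w.strip(".,!?\"'") for w in text.lower().split()]
    hits = sum(1 for w in words if w in KEYWORDS)
    return min(1.0, 0.5 * hits)


def case_similarity(text: str) -> float:
    """Highest similarity ratio between the remark and any past case."""
    lowered = text.lower()
    return max(
        (difflib.SequenceMatcher(None, lowered, case).ratio()
         for case in PAST_CASES),
        default=0.0,
    )


def power_harassment_score(text: str) -> float:
    """Combine keyword occurrence and past-case matching into one score."""
    return (KEYWORD_WEIGHT * keyword_score(text)
            + CASE_WEIGHT * case_similarity(text))


def needs_warning(text: str) -> bool:
    """Step 6: compare the score with the baseline value."""
    return power_harassment_score(text) > BASELINE
```

Under these assumptions, the remark "You're always so useless" matches a past case exactly and contains one keyword, so its score exceeds the baseline, while a remark with no keywords cannot exceed it on similarity alone.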
[0252] (Example 1)
[0253] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".
[0254] In modern society, there is a need to reduce the risk of inappropriate remarks and harassment in the workplace and other communication settings, and to promote a healthy environment for dialogue. In particular, there is a need for a system that can immediately grasp the psychological impact that unconscious remarks made by participants have on others and respond appropriately.
[0255] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0256] In this invention, the server includes means for acquiring audio data and converting it into text information, means for transmitting the converted text information to a remote device, and means for analyzing the received text information using a natural language processing model. This makes it possible to immediately evaluate whether a participant's remarks have a psychological impact and to issue a warning if necessary.
[0257] "Audio data" refers to the digital representation of audio information spoken by a user.
[0258] "Textual information" refers to data obtained by analyzing audio data and converting it into a text format that humans can understand.
[0259] A "natural language processing model" is a computer program that understands human language and facilitates communication. It is used to analyze and extract the context and meaning of data.
[0260] A "psychological impact score" is an index that quantifies the potential psychological and emotional impact that a statement may have on others.
[0261] A "warning message" is information generated when a user's statement violates the standards, intended to alert the user to the issue.
[0262] A "remote device" is a device that can communicate via a network and is used to process or store information.
[0263] "Encryption" is a technology that transforms information based on a specific algorithm to improve data security and prevent unauthorized access.
[0264] This invention is implemented as a system consisting of software installed on a portable information terminal or computer device used by a user.
[0265] Terminal:
[0266] When a user speaks, the device acquires the audio data and converts it into text in real time. A general-purpose speech recognition service is used for this process, for example. This ensures that the spoken words are accurately converted into digital text.
[0267] Server:
[0268] After the audio data is converted, the resulting text information is sent to a server via an internet connection. The server uses a generative AI model and a natural language processing model to analyze the text information. Here, the server extracts linguistic features and context from the text information and calculates a psychological impact score. This analysis uses machine learning and data mining techniques and also refers to past cases stored in the database.
[0269] Warning message:
[0270] If the psychological impact score exceeds a certain threshold, the server generates a warning message. This message is sent to the user's device and displayed on the display screen. For example, if a negative comment is detected, the user will receive a warning such as, "This comment may give an inappropriate impression."
[0271] Specific examples and prompt statements:
[0272] If a user says "You are incompetent" during a workplace conversation, the system analyzes this expression and determines whether to issue an appropriate warning. An example of a prompt provided to the generative AI model is, "Analyze this text and determine its psychological impact."
[0273] This system allows users to review their own statements and improve the quality of their communication. As a result, it is expected to improve the health and productivity of interactions in the user environment.
[0274] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0275] Step 1:
[0276] The device acquires the user's speech as audio data. The input is the user's voice, which is converted into digital audio data. The device sends this audio data to its internal speech recognition engine, which converts it into text information in real time. The output as a result of the conversion is text information that records the user's speech.
[0277] Step 2:
[0278] The terminal sends the converted text information to the server via the internet connection. The input here is the text information converted by the terminal. The terminal encrypts the text information using a secure transmission protocol and sends it to the server. The output is the text information received by the server.
[0279] Step 3:
[0280] The server analyzes the received text information. The input is the text information transmitted from the terminal. The server uses its built-in generative AI model and natural language processing model to extract linguistic features and context. The output is the analysis result that is used to calculate the psychological impact score.
[0281] Step 4:
[0282] Based on the analysis results, the server calculates the psychological impact score. The input is the result of the natural language analysis. The server compares the result against the database of past cases and calculates the psychological impact score according to specific criteria. If the score exceeds the criterion, a flag indicating that a warning is required is output.
[0283] Step 5:
[0284] Based on the score flag, the server generates a warning message. The input is the information regarding the score that exceeds the criteria. The server determines the specific content to be provided to the user by generating a warning message. The output is the text data of the warning message.
[0285] Step 6:
[0286] The terminal receives the warning message transmitted from the server and notifies the user. The input is the warning message data transferred from the server. The terminal uses screen display or voice alerts to promptly prompt the user's attention. The output is a warning display that the user can confirm.
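The inputs and outputs of the six steps above can be summarized as one chain of calls. In this sketch, speech recognition, analysis, and user notification are passed in as functions so that the flow can be followed without any particular service; `run_flow`, its parameters, and the criterion value are hypothetical, and the encrypted transmission of Step 2 is omitted.

```python
from typing import Callable, Optional

CRITERION = 0.7  # assumed criterion for the psychological impact score


def run_flow(
    audio: bytes,
    recognize: Callable[[bytes], str],
    analyze: Callable[[str], float],
    notify: Callable[[str], None],
) -> Optional[str]:
    """Run Steps 1 to 6 once and return the warning message, if any."""
    # Step 1: the terminal converts audio data into text information.
    text = recognize(audio)
    # Step 2 (encrypted transmission to the server) is omitted here.
    # Steps 3 and 4: the server analyzes the text and calculates the score.
    score = analyze(text)
    # Step 4: output a flag indicating whether a warning is required.
    warning_required = score > CRITERION
    if not warning_required:
        return None
    # Step 5: the server generates the text of the warning message.
    message = f'This comment may give an inappropriate impression: "{text}"'
    # Step 6: the terminal notifies the user.
    notify(message)
    return message
```

A caller might, for example, pass `print` as `notify` during testing and replace it with a screen display or voice alert on the terminal.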
[0287] (Application Example 1)
[0288] Next, Application Example 1 will be described. In the following description, the data processing device 12 is referred to as the "server", and the smart glasses 214 are referred to as the "terminal".
[0289] In the workplace, it is essential to prevent situations where remarks are unintentionally perceived as harassment. However, it is difficult for speakers to constantly be aware of how their individual remarks are received. Furthermore, the lack of a system to immediately provide corrective measures when problems are recognized after a remark has been made hinders the development of healthy workplace communication.
[0290] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0291] In this invention, the server includes means for acquiring voice information and converting it into text information, means for analyzing the text information and calculating a behavioral evaluation score, and means for generating warning information and providing suggestions. This enables the speaker to immediately recognize the impact their words have on others and to modify their behavior appropriately.
[0292] "Voice information" refers to data that represents content transmitted through sound in a digital format.
[0293] "Text information" refers to data that represents voice information in text format.
[0294] "Means of conversion" refers to a function or device for converting voice information into text information in real time.
[0295] "Means of analysis" refers to a system or program that analyzes the acquired and converted text information to understand its meaning and context.
[0296] A "behavioral evaluation score" is an index that quantifies the appropriateness of a statement based on the analyzed text information.
[0297] A "warning message" is a notification generated when the calculated behavioral evaluation score exceeds a certain threshold, clearly indicating the points that should be corrected in the statement in question.
[0298] "Means of providing suggestions" refers to a function that provides users with advice and hints to review and improve their statements based on warning information.
[0299] This invention is a system for evaluating the appropriateness of speech by acquiring voice information in real time and analyzing its content. This system integrates speech recognition technology and natural language processing technology and is installed in an information processing terminal that users use on a daily basis. Specifically, it utilizes an input device for acquiring voice information (e.g., a smartphone microphone) and speech recognition software for converting it to text (e.g., Google Cloud Speech-to-Text).
[0300] The device first inputs the user's speech through its microphone and converts this speech information into text using speech recognition software. Then, the device sends this text information to a server via the internet.
[0301] The server interprets the received text information and analyzes its linguistic features and context using natural language processing tools (e.g., NLTK or Hugging Face's Transformers). Next, it calculates a behavioral evaluation score based on the analysis results. This score is calculated according to preset criteria, and warning information is attached to statements whose score exceeds a certain threshold.
[0302] Warning information is sent to the user, including suggestions for improvement regarding the relevant statement. Users can immediately check this notification on their smartphone screen and recognize the unintended impact of their statement. This improves the quality of everyday communication and contributes to building smoother relationships in the workplace.
[0303] As a specific example, if a statement like "The delay in the progress of this project is your responsibility" is made during a meeting, the system can analyze this statement and issue a warning if it is considered to overly emphasize the responsibility of the other party. An exemplary prompt sentence on the user side is "Please identify the words and phrases in the following text that correspond to negative statements or power harassment behaviors in the workplace. If there are any, please also explain the reasons."
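As a rough illustration of how the exemplary prompt above might be combined with a transcribed statement before it is passed to a generative AI model, consider the following sketch. The function name `build_prompt` and the layout of the prompt (instruction, blank line, quoted text) are assumptions made for illustration; the disclosure does not specify them.

```python
# Exemplary instruction quoted from the description above.
INSTRUCTION = (
    "Please identify the words and phrases in the following text that "
    "correspond to negative statements or power harassment behaviors in "
    "the workplace. If there are any, please also explain the reasons."
)


def build_prompt(statement: str) -> str:
    """Hypothetical helper: return the instruction followed by the statement."""
    cleaned = statement.strip()
    if not cleaned:
        raise ValueError("statement must not be empty")
    return f'{INSTRUCTION}\n\nText: "{cleaned}"'
```

Calling `build_prompt("The delay in the progress of this project is your responsibility")` yields a single string that could be submitted to whichever model the server uses.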
[0304] The flow of the specific processing in Application Example 1 will be described using Figure 12.
[0305] Step 1:
[0306] The terminal acquires the user's speech as voice information through the microphone. The input is the user's speech data. This voice information is converted into text information in real time using speech recognition software such as Google Cloud Speech-to-Text. The output is the converted text data.
[0307] Step 2:
[0308] The terminal transmits the converted text information to the server via the Internet. The input is the text data. Because the transfer uses the HTTPS protocol, the data is encrypted in transit.
[0309] Step 3:
[0310] The server analyzes the received text information. The input is the text data, and the analysis is performed using a natural language processing tool (e.g., NLTK or Hugging Face Transformers). During the analysis, the features and context of the language are evaluated, and the information is structured. The output is the analysis result.
[0311] Step 4:
[0312] The server calculates a behavioral evaluation score based on the analysis results. The input is the analysis results, and the appropriateness of the statements is quantified by the scoring algorithm. The output is the behavioral evaluation score.
[0313] Step 5:
[0314] The server compares the calculated behavioral evaluation score to a set threshold value. If the score exceeds the threshold value, it generates a warning message. The input is the behavioral evaluation score, and the output is the warning message. This message includes an analysis of why the remarks were problematic and suggestions for improvement.
[0315] Step 6:
[0316] The terminal receives the warning information sent from the server and notifies the user. The input is the warning information, which is presented as a pop-up or an audio alert. The user acknowledges the warning and is prompted to reconsider their statement. The output is feedback that encourages the user to recognize and adjust their behavior.
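Steps 4 and 5 above (scoring, threshold comparison, and warning generation) can be sketched as follows. This is a minimal illustration under stated assumptions: the phrase weights, the threshold of 0.5, the `WarningMessage` fields, and the suggestion text are all hypothetical, and a simple phrase lookup stands in for the natural language processing model that the description actually calls for.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed phrase weights for illustration only; the disclosure leaves the
# scoring algorithm unspecified.
PHRASE_WEIGHTS = {
    "your responsibility": 0.6,
    "your fault": 0.7,
}
THRESHOLD = 0.5  # assumed value for "a set threshold value"


@dataclass
class WarningMessage:
    statement: str
    score: float
    reason: str
    suggestion: str


def matched_phrases(text: str) -> List[str]:
    """Return the weighted phrases that occur in the text (case-insensitive)."""
    lowered = text.lower()
    return [phrase for phrase in PHRASE_WEIGHTS if phrase in lowered]


def behavioral_evaluation_score(text: str) -> float:
    """Step 4: add up the weights of matched phrases, capped at 1.0."""
    total = sum((PHRASE_WEIGHTS[p] for p in matched_phrases(text)), 0.0)
    return min(total, 1.0)


def evaluate(text: str) -> Optional[WarningMessage]:
    """Step 5: build a warning only when the score exceeds the threshold."""
    score = behavioral_evaluation_score(text)
    if score <= THRESHOLD:
        return None
    return WarningMessage(
        statement=text,
        score=score,
        reason="May overly emphasize the other party's responsibility: "
        + ", ".join(matched_phrases(text)),
        suggestion="Consider describing the situation and discussing next steps together.",
    )
```

With these assumed weights, the example statement about the project delay produces a warning, while a neutral statement returns `None` and nothing is sent to the terminal.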
[0317] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.
[0318] This invention is embodied as a software application installed on a user's smartphone, PC, or other device. This system has the function of converting audio data into text data in real time, and then sending the text data to a server for analysis. The server is equipped with a natural language processing model and an emotion engine, which are used to evaluate the context and emotion of the user's speech.
[0319] The server extracts linguistic features from the received text data and uses an emotion engine to recognize the user's emotions. The emotion engine identifies emotional keywords and expressions appearing in the text and has the function of evaluating the intensity and type of those emotions. Based on the text data and emotion data, the server calculates a power harassment score, which serves as an indicator of whether the conversation content constitutes power harassment.
[0320] The emotion data recognized by the emotion engine is reflected in the calculation of the power harassment score, so the score depends on whether the conversation is conducted calmly or in an emotionally aggressive manner. For example, the score increases when aggressive emotions are present, increasing the likelihood of a warning.
[0321] If the score exceeds a certain threshold, the server generates a warning message. This message includes the problematic statement, the reason it was deemed aggressive, and the user's emotional state. The generated warning message is sent to the device and the user is notified. The notification is made through a visual pop-up or an audio alert, allowing the user to re-evaluate their own statements and emotions.
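One way the emotion data might raise the score is sketched below. The multipliers, the emotion labels, and the linear interpolation by intensity are assumptions for illustration; the description states only that the score increases when aggressive emotions are present.

```python
# Assumed multipliers per emotion label; the disclosure does not give values.
EMOTION_MULTIPLIER = {"calm": 1.0, "negative": 1.2, "aggressive": 1.5}


def emotion_adjusted_score(text_score: float, emotion: str, intensity: float) -> float:
    """Scale a text-based score (0..1) by the detected emotion and its intensity (0..1)."""
    if not 0.0 <= text_score <= 1.0 or not 0.0 <= intensity <= 1.0:
        raise ValueError("text_score and intensity must be within [0, 1]")
    multiplier = EMOTION_MULTIPLIER.get(emotion, 1.0)  # unknown labels leave the score unchanged
    # Interpolate between no adjustment (intensity 0) and the full multiplier (intensity 1).
    adjusted = text_score * (1.0 + (multiplier - 1.0) * intensity)
    return min(adjusted, 1.0)
```

Under these assumptions, the same wording scores higher when delivered aggressively than when delivered calmly, so it crosses a fixed threshold sooner.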
[0322] This system provides users with the opportunity to correct inappropriate remarks they may have unconsciously made during communication, creating an environment aimed at preventing workplace harassment. The incorporation of an emotion engine enables more accurate feedback, making it a valuable tool for user self-improvement.
[0323] The following describes the processing flow.
[0324] Step 1:
[0325] The device continuously captures the user's voice using a microphone and stores it as audio data. This data is processed by a noise reduction algorithm to obtain a clear audio signal.
[0326] Step 2:
[0327] The device uses speech recognition software to convert voice data into text data in real time. The converted text is recorded as accurate textual information of the user's speech.
[0328] Step 3:
[0329] The device encrypts text data before sending it to the server. This ensures user privacy and enables secure data transmission.
[0330] Step 4:
[0331] The server inputs the received text data into a natural language processing (NLP) model for analysis. Here, linguistic features and contextual information from the text are extracted.
[0332] Step 5:
[0333] The server activates the emotion engine based on the analysis results of the NLP model. The emotion engine analyzes text dynamics and identifies emotional states such as positive, negative, and aggressive.
[0334] Step 6:
[0335] The server integrates acquired linguistic features and emotional information to calculate a power harassment score. The score quantifies the potential impact that the remarks have on the recipient.
[0336] Step 7:
[0337] The server compares the calculated harassment score to a baseline value. If the score exceeds the baseline value, a warning message is generated. The message includes the content of the problematic remarks and the sentiment analysis results.
[0338] Step 8:
[0339] The server sends the generated warning message to the terminal.
[0340] Step 9:
[0341] The device notifies the user of the warning message. The notification appears as a pop-up on the device screen and can also be accompanied by an audio alert to prompt the user's immediate attention.
[0342] This allows users to review their own words, actions, and emotional states and receive feedback to take appropriate action.
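Step 1 of the flow above mentions a noise reduction algorithm without naming one. As a stand-in only, the sketch below smooths a sequence of audio samples with a trailing moving average; this is a simple low-pass filter chosen for illustration, not the algorithm used by the disclosed system, and the function name and window size are assumptions.

```python
from typing import List, Sequence


def moving_average(samples: Sequence[float], window: int = 3) -> List[float]:
    """Smooth samples with a trailing moving average of the given window size."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed: List[float] = []
    running = 0.0
    for i, sample in enumerate(samples):
        running += sample
        if i >= window:
            # Drop the sample that has just left the window.
            running -= samples[i - window]
        # Early samples are averaged over however many values are available.
        smoothed.append(running / min(i + 1, window))
    return smoothed
```

A production system would more likely rely on spectral methods or the noise suppression built into the speech recognition front end.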
[0343] (Example 2)
[0344] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".
[0345] Traditional communication analysis systems have struggled to accurately identify emotions from voice and provide appropriate warnings to users when conversation content exceeds certain thresholds. In particular, there is a need to detect remarks that could potentially affect mental health, such as power harassment, in real time and provide immediate feedback. Furthermore, encryption is required for data transmission to protect privacy.
[0346] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0347] In this invention, the server includes means for acquiring voice information and converting it into text information, means for analyzing the converted text information using natural language processing, and means for identifying emotions and generating emotion data based on the analysis results. This makes it possible to identify emotions during voice communication, calculate the risk of power harassment, and provide a warning to the user.
[0348] "Audio information" refers to data that represents and processes audio in a digital format.
[0349] "Textual information" refers to data obtained by converting audio data into linguistic characters.
[0350] "Analysis" is the process of processing textual information and evaluating its context, content, and emotions.
[0351] An "indicator" is an evaluation value that is quantified according to specific evaluation criteria based on the results obtained from the analysis.
[0352] "Warning information" is information generated to alert the user when the calculated indicator exceeds a certain threshold value.
[0353] "Emotional data" refers to data that indicates the intensity and type of emotion based on textual information.
[0354] "Encryption" is the process of transforming data into a form that cannot be interpreted by third parties in order to protect its content.
[0355] This invention is implemented by a system that captures speech through a user's voice input device and converts it to text in real time. This system utilizes speech recognition technology, specifically using services such as the Google Speech-to-Text API. The user's device converts the captured speech into text information and sends that text information to the server. During transmission, the data is encrypted using the SSL / TLS protocol to ensure the security of the communication.
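The encrypted transmission described above could be set up with Python's standard library roughly as follows. The endpoint URL, the JSON payload shape, and the function names are hypothetical; only the use of TLS for the transfer comes from the description.

```python
import json
import ssl
import urllib.request

# Hypothetical endpoint; the disclosure does not name a server URL.
SERVER_URL = "https://analysis.example.com/v1/utterances"


def make_tls_context() -> ssl.SSLContext:
    """Create a client context that verifies certificates and requires TLS 1.2 or later."""
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


def build_request(text: str) -> urllib.request.Request:
    """Wrap the transcribed text in a JSON POST request addressed to the server."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        SERVER_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(text: str) -> bytes:
    """Send the request over TLS; this needs a reachable server at SERVER_URL."""
    request = build_request(text)
    with urllib.request.urlopen(request, context=make_tls_context()) as response:
        return response.read()
```

`ssl.create_default_context()` enables certificate verification and hostname checking by default, which is why the sketch only tightens the minimum protocol version.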
[0356] The server uses a natural language processing model to analyze the received text information. Specifically, language models such as BERT and GPT-3 are applied. This analysis extracts linguistic features from the text information and interprets the context of the utterance. Furthermore, an emotion engine is used to identify emotions within the text information, and their intensity and type are recorded as data.
[0357] Based on the emotion data and the analysis results, the server calculates an index. This index indicates the likelihood that the conversation content constitutes power harassment. For example, if a user says, "Your way of doing things is completely wrong," the server detects the negative word "wrong" and assigns the statement a high negative-emotion value. If the index exceeds a certain threshold, the server generates a warning message and sends it to the user.
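The "wrong" example above can be illustrated with a token-level sketch. The word weights, the intensifier list, the 1.5 boost, and the threshold of 0.6 are all assumed values; the description relies on a language model and an emotion engine rather than a fixed lexicon.

```python
import re

NEGATIVE_WORDS = {"wrong": 0.5, "useless": 0.8, "incompetent": 0.9}  # assumed weights
INTENSIFIERS = {"completely", "totally", "always"}  # assumed list
THRESHOLD = 0.6  # assumed value


def harassment_index(utterance: str) -> float:
    """Sum negative-word weights, boosting words that directly follow an intensifier."""
    tokens = re.findall(r"[a-z']+", utterance.lower())
    index = 0.0
    for position, token in enumerate(tokens):
        weight = NEGATIVE_WORDS.get(token)
        if weight is None:
            continue
        if position > 0 and tokens[position - 1] in INTENSIFIERS:
            weight *= 1.5
        index += weight
    return min(index, 1.0)


def needs_warning(utterance: str) -> bool:
    """Return True when the index exceeds the assumed threshold."""
    return harassment_index(utterance) > THRESHOLD
```

With these assumed values, "Your way of doing things is completely wrong" exceeds the threshold because of the intensifier, whereas the same sentence without "completely" does not.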
[0358] The generated warning information is displayed on the device and notified to the user through visual or audible alerts. This allows the user to understand in real time how their comments are being evaluated and how they can be improved.
[0359] As a concrete example, the following prompt sentences can be input into a generative AI model to simulate similar language analysis:
[0360] "Please explain how workplace conversations are recorded in real time, how negative emotional expressions are identified, and how they are evaluated and notified."
[0361] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0362] Step 1:
[0363] The device captures the user's voice through the microphone. This voice data is input in real time to a speech recognition service such as the Google Speech-to-Text API. The input voice data is processed and output as text information. For example, if the user says "Hello, how are you?", this phrase is converted into text information.
[0364] Step 2:
[0365] The terminal sends the converted text information to the server. The SSL / TLS encryption protocol is used during transmission to ensure data security. The input at this stage is the text information obtained in step 1, which is output to the server in encrypted form.
[0366] Step 3:
[0367] The server inputs the received text information into a natural language processing model. Specifically, it uses language models such as BERT or GPT-3. The model analyzes the text information and extracts linguistic features and context. The input is the text information, and the output is the result of the contextual analysis. For example, the emotional nuances of a phrase are analyzed.
[0368] Step 4:
[0369] The server generates emotion data via the emotion engine based on the analyzed contextual data. The input for this step is the contextual data from step 3, which is processed to output the intensity and type of emotion. Specifically, it identifies positive, negative, and neutral emotions.
[0370] Step 5:
[0371] The server uses the emotion data and the contextual analysis results to calculate the index. The emotion engine quantifies the intensity of the emotion as a score and outputs it as the index. For example, a high score is obtained when negative emotions are strong.
[0372] Step 6:
[0373] The server determines whether the calculated index exceeds the threshold value. In this step, the index is the input, and the result of the determination (whether or not the threshold value is exceeded) is the output. If the threshold value is exceeded, warning information is then generated.
[0374] Step 7:
[0375] The server generates warning information to be sent to the terminal. The input is the judgment result from step 6, which is used to create and output an appropriate warning message. For example, a message indicating that the user's statement is aggressive is constructed.
[0376] Step 8:
[0377] The terminal notifies the user of the received warning information. The input is warning information sent from the server, which is output as a screen display or audio alert. The user receives this and has an opportunity to re-evaluate their own statements.
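Steps 4 to 7 above can be sketched end to end as follows. The word lists, the 0.4 increment per word, the threshold of 0.5, and the warning text are assumptions for illustration; a word count stands in for the emotion engine, whose internals the description does not specify.

```python
import re
from dataclasses import dataclass
from typing import Optional

POSITIVE_WORDS = {"thanks", "great", "good", "appreciate"}  # assumed
NEGATIVE_WORDS = {"wrong", "useless", "terrible", "stupid"}  # assumed
THRESHOLD = 0.5  # assumed


@dataclass(frozen=True)
class EmotionData:
    kind: str         # "positive", "negative", or "neutral"
    intensity: float  # 0.0 to 1.0


def identify_emotion(text: str) -> EmotionData:
    """Step 4: derive the type and intensity of emotion from word counts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positives = sum(token in POSITIVE_WORDS for token in tokens)
    negatives = sum(token in NEGATIVE_WORDS for token in tokens)
    if positives == negatives:
        return EmotionData("neutral", 0.0)
    kind = "positive" if positives > negatives else "negative"
    return EmotionData(kind, min(1.0, 0.4 * abs(positives - negatives)))


def negativity_score(emotion: EmotionData) -> float:
    """Step 5: only negative emotion contributes to the score."""
    return emotion.intensity if emotion.kind == "negative" else 0.0


def warning_for(text: str) -> Optional[str]:
    """Steps 6 and 7: compare with the threshold and build the warning text."""
    if negativity_score(identify_emotion(text)) > THRESHOLD:
        return "Your statement may come across as aggressive."
    return None
```

The greeting used as the Step 1 example is classified as neutral here and produces no warning.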
[0378] (Application Example 2)
[0379] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the smart glasses 214 as the "terminal".
[0380] In customer support interactions, staff members may unconsciously make inappropriate remarks to customers, which can lead to decreased customer satisfaction. To solve this problem, support is needed to enable staff members to evaluate their own words and actions in real time during conversations with customers and to respond appropriately.
[0381] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0382] In this invention, the server includes means for converting voice information into text information, means for analyzing the converted text information, means for calculating an evaluation score according to criteria based on the analysis results, and means for analyzing emotional data in real time and providing advice to support appropriate response methods. This enables the person in charge to always respond to customers with an appropriate attitude and language while interacting with them.
[0383] "Audio information" refers to human speech and other sounds acquired as digital data.
[0384] "Textual information" refers to text data converted from audio information.
[0385] "Analysis" is the process of analyzing content based on textual information to identify specific patterns or emotions.
[0386] An "evaluation score" is a numerical indicator calculated based on the analysis results, which shows the degree of conformity to a specific standard.
[0387] A "notification message" is information that indicates a warning when the evaluation score exceeds a certain threshold.
[0388] "Emotional data" refers to information about emotions extracted from textual data.
[0389] "Providing advice" means offering real-time guidance on appropriate actions based on evaluation scores and sentiment data.
[0390] "Encryption" is the process of protecting audio information by transforming it using a specific algorithm so that it cannot be easily read by a third party.
[0391] The system for realizing this application is basically designed as a software application for acquiring voice information in real time and converting it into text information. The user uses a device such as a smartphone or PC to input voice information through a microphone. The device then uses the SpeechRecognition library to convert the voice information into text information.
[0392] Text information transmitted from the terminal reaches the server, where it undergoes text analysis using natural language processing technology. This analysis utilizes a language processing model (for example, a model using TensorFlow). The server extracts sentiment data from the text information and calculates an evaluation score based on it. Based on the sentiment data, it also provides advice to encourage appropriate responses in specific situations.
[0393] The server generates visual or auditory notification messages for customer service representatives based on sentiment data and evaluation scores. For example, if strong emotions indicating customer dissatisfaction are detected, advice such as "To improve customer satisfaction, please strive for a calm response" might be provided. This message is displayed via the terminal's screen or through an audio alert.
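A rule table is one simple way to map detected emotions to advice, as sketched below. Only the first advice string is taken from the description; the emotion labels, the intensity thresholds, and the second rule are hypothetical, and the description itself suggests a generative AI model may produce the advice instead.

```python
from typing import List, Optional, Tuple

# (emotion label, minimum intensity, advice). Labels and thresholds are assumed.
ADVICE_RULES: List[Tuple[str, float, str]] = [
    ("dissatisfaction", 0.7,
     "To improve customer satisfaction, please strive for a calm response"),
    ("confusion", 0.5,
     "Consider restating the explanation in simpler terms"),  # hypothetical rule
]


def select_advice(emotion: str, intensity: float) -> Optional[str]:
    """Return the first advice whose rule matches the detected emotion, if any."""
    for label, minimum, advice in ADVICE_RULES:
        if emotion == label and intensity >= minimum:
            return advice
    return None
```

Returning `None` when no rule matches keeps the terminal silent during ordinary conversation.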
[0394] For example, if the system detects that a customer is experiencing high levels of stress during a customer service call, it analyzes the customer's emotions in real time and advises the representative to remain calm. This approach is expected to improve customer satisfaction.
[0395] An example of a prompt message might be: "Develop an application that provides advice to facilitate smooth conversations with customers. It will use voice input, analyze sentiment in real time, and recommend appropriate responses as needed."
[0396] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0397] Step 1:
[0398] The user inputs voice information using the device's microphone. The device acquires this voice signal and converts the voice into text information using the SpeechRecognition library. In this step, the input is voice data, and the output is text data converted from the voice. Data processing involves voice sampling and subsequent conversion to text information.
[0399] Step 2:
[0400] After the text information is generated, the terminal sends it to the server via the network. The input is the text information, and the output is the text information as received by the server. The data is encrypted when it is sent to the server, ensuring security.
[0401] Step 3:
[0402] The server analyzes the content of the received text information using a natural language processing model. A language model built with TensorFlow is used for this purpose. The input is the received text information, and the output is the semantic information of the analyzed text. Data processing includes structural analysis of the language and extraction of sentiment data.
[0403] Step 4:
[0404] Based on the analysis results, the server extracts emotional data and calculates an evaluation score. This score is a numerical representation of how appropriate the conversation is. The input is the analysis results, and the output is the evaluation score. Data processing at this stage includes the evaluation of the type and intensity of emotions by the emotion engine.
[0405] Step 5:
[0406] Based on the evaluation score, the server generates real-time advice for the user. This advice suggests specific actions to facilitate smooth customer interactions. The input is the evaluation score, and the output is a visual or auditory notification message presented to the user. Here, a generative AI model is used to create appropriate advice.
[0407] Step 6:
[0408] Notification messages sent from the server are received by the terminal and presented to the user. Here, the user reflects on their response and improves it as needed. The input is the notification message from the server, and the output is the information displayed to the user. Specific actions include pop-up displays on the terminal screen and audio alerts.
[0409] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0410] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives a prompt containing instructions together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and / or summarization.
[0411] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.
[0412] [Third Embodiment]
[0413] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.
[0414] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.
[0415] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0416] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.
[0417] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.
[0418] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0419] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0420] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0421] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0422] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0423] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0424] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".
[0425] This invention is implemented as a software application installed on a user's device, such as a smartphone or PC. The device first receives voice input and converts it into text data in real time. By using speech recognition technology, the user's conversation can be accurately captured as text information.
[0426] Next, the terminal sends the converted text data to the server via the internet connection. The server has a natural language processing model for analyzing the received text data, and uses this model to extract linguistic features and context. Based on the analysis results, the server calculates a power harassment score. The calculation method includes the frequency of occurrence of specific keywords and phrases, context, and matching with a database of past power harassment cases.
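The calculation method described above combines keyword frequency with matching against past cases. A minimal sketch of that combination follows, using `difflib.SequenceMatcher` from the Python standard library for string similarity. The keyword weights, the two stored cases, and the equal weighting of the two components are assumptions, and the context component mentioned in the description is omitted for brevity.

```python
from difflib import SequenceMatcher

KEYWORD_WEIGHTS = {"useless": 0.5, "incompetent": 0.6, "your fault": 0.4}  # assumed
# Hypothetical stand-in for the database of past power harassment cases.
PAST_CASES = ["you're always so useless", "you are incompetent"]


def keyword_component(text: str) -> float:
    """Weight each keyword by how often it occurs, capped at 1.0."""
    lowered = text.lower()
    total = sum(weight * lowered.count(word) for word, weight in KEYWORD_WEIGHTS.items())
    return min(1.0, total)


def case_similarity(text: str) -> float:
    """Return the highest similarity ratio between the text and any past case."""
    lowered = text.lower()
    return max(SequenceMatcher(None, lowered, case).ratio() for case in PAST_CASES)


def score_with_case_matching(text: str, keyword_weight: float = 0.5,
                             case_weight: float = 0.5) -> float:
    """Combine both components into a single score between 0 and 1."""
    return keyword_weight * keyword_component(text) + case_weight * case_similarity(text)
```

Character-level similarity is a crude proxy for case matching; an embedding-based comparison would likely be used in practice.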
[0427] If the power harassment score exceeds the set threshold, the server immediately generates a warning message. This warning message identifies the specific problematic statements and explains why they are problematic, informing the speaker of how their words affect others.
[0428] Subsequently, a warning message is sent to the device and the user is notified. The notification appears as an on-screen pop-up or an audio alert, so the user can immediately notice its presence. For example, if the statement "You're always so useless" is detected, a warning will be issued because this statement carries the risk of belittling the other person.
[0429] This provides users with an opportunity to review their unconscious words and actions, creating an environment that helps prevent workplace harassment. It is believed that using this system will promote healthy communication in the workplace and contribute to increased productivity.
[0430] The following describes the processing flow.
[0431] Step 1:
[0432] The device captures the user's voice using the microphone and temporarily stores it as audio data. Noise filtering is also performed simultaneously to ensure sound quality.
[0433] Step 2:
[0434] The device converts acquired audio data into text data in real time. It uses a speech recognition engine to achieve highly accurate transcription.
[0435] Step 3:
[0436] The device sends the converted text data to the server. During transmission, the data is encrypted to protect user privacy.
[0437] Step 4:
[0438] The server receives the transmitted text data and begins analysis. It uses a natural language processing model to extract linguistic features and understand the context.
[0439] Step 5:
[0440] The server calculates a power harassment score based on the analysis results. It checks for the appearance of specific keywords and phrases and calculates the score by comparing them with past power harassment case data.
[0441] Step 6:
[0442] The server compares the power harassment score to a baseline value. If the score exceeds the baseline value, it generates a warning message about the problematic remarks. This message explains which parts are problematic and why.
[0443] Step 7:
[0444] The server sends the generated warning message to the terminal.
[0445] Step 8:
[0446] The device notifies the user of any received warning messages. The notification is provided visually and audibly, allowing the user to understand the content immediately.
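Steps 7 and 8 above involve passing a warning message from the server to the terminal. The sketch below serializes such a message as JSON and renders it as pop-up text. The field names, the JSON encoding, and the wording of the pop-up are assumptions; the description does not define a message format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class WarningMessage:
    statement: str         # the full statement that was analyzed
    problematic_part: str  # the part judged to be problematic
    reason: str            # why that part is problematic


def encode_warning(message: WarningMessage) -> bytes:
    """Server side: serialize the warning for transmission to the terminal."""
    return json.dumps(asdict(message), ensure_ascii=False).encode("utf-8")


def decode_warning(payload: bytes) -> WarningMessage:
    """Terminal side: rebuild the warning from the received bytes."""
    return WarningMessage(**json.loads(payload.decode("utf-8")))


def render_popup(message: WarningMessage) -> str:
    """Terminal side: produce the text shown in the on-screen pop-up."""
    return (f'Warning: "{message.problematic_part}" may be problematic. '
            f"Reason: {message.reason}")
```

The encoded bytes would travel over the encrypted channel described in Step 3, in the opposite direction.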
[0447] (Example 1)
[0448] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0449] In modern society, there is a need to reduce the risk of inappropriate remarks and harassment in the workplace and other communication settings, and to promote a healthy environment for dialogue. In particular, there is a need for a system that can immediately grasp the psychological impact that unconscious remarks made by participants have on others and respond appropriately.
[0450] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0451] In this invention, the server includes means for acquiring audio data and converting it into text information, means for transmitting the converted text information to a remote device, and means for analyzing the received text information using a natural language processing model. This makes it possible to immediately evaluate whether a participant's remarks have a psychological impact and to issue a warning if necessary.
[0452] "Audio data" refers to the digital representation of audio information spoken by a user.
[0453] "Textual information" refers to data obtained by analyzing audio data and converting it into a text format that humans can understand.
[0454] A "natural language processing model" is a computer program that understands human language and facilitates communication. It is used to analyze and extract the context and meaning of data.
[0455] A "psychological impact score" is an index that quantifies the potential psychological and emotional impact that a statement may have on others.
[0456] A "warning message" is information generated when a user's statement violates the standards, intended to alert the user to the issue.
[0457] A "remote device" is a device that can communicate via a network and is used to process or store information.
[0458] "Encryption" is a technology that transforms information based on a specific algorithm to improve data security and prevent unauthorized access.
[0459] This invention is implemented as a system consisting of software installed on a portable information terminal or computer device used by a user.
[0460] Terminal:
[0461] When a user speaks, the device acquires the audio data and converts it into text in real time. A general-purpose speech recognition service is used for this process, for example. This ensures that the spoken words are accurately converted into digital text.
[0462] Server:
[0463] After the audio is converted, the text information is sent to a server via an internet connection. The server uses a generative AI model and a natural language processing model to analyze the text information. Here, the server extracts linguistic features and context from the text information and calculates a psychological impact score. This analysis uses machine learning and data mining techniques and also refers to past cases stored in a database.
[0464] Warning message:
[0465] If the psychological impact score exceeds a certain threshold, the server generates a warning message. This message is sent to the user's device and displayed on the display screen. For example, if a negative comment is detected, the user will receive a warning such as, "This comment may give an inappropriate impression."
[0466] Specific examples and prompt statements:
[0467] If a user says "You are incompetent" during a workplace conversation, the system analyzes this expression and determines whether to issue an appropriate warning. An example of a prompt provided to the generative AI model is, "Analyze this text and determine its psychological impact."
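As a minimal sketch of how the transcribed statement could be combined with the prompt quoted above, the following Python fragment builds the text passed to the generative AI model. The function name `build_prompt` and the layout of the template are assumptions made for illustration; the description specifies only the instruction sentence.

```python
# Instruction sentence quoted in the description.
INSTRUCTION = "Analyze this text and determine its psychological impact."


def build_prompt(utterance: str) -> str:
    """Combine the instruction with the transcribed statement.

    The template layout (blank line, then a quoted 'Text:' field) is an
    assumption; the description does not define a prompt format.
    """
    return f'{INSTRUCTION}\n\nText: "{utterance.strip()}"'


prompt = build_prompt("You are incompetent")
print(prompt)
```

Keeping the instruction and the statement in separate fields makes it easier to swap in a different instruction sentence without touching the transcription path.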
[0468] This system allows users to review their own statements and improve the quality of their communication. As a result, it is expected to improve the health and productivity of interactions in the user environment.
[0469] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0470] Step 1:
[0471] The device acquires the user's speech as audio data. The input is the user's voice, which is converted into digital audio data. The device sends this audio data to its internal speech recognition engine, which converts it into text information in real time. The output as a result of the conversion is text information that records the user's speech.
[0472] Step 2:
[0473] The terminal sends the converted text information to the server via the internet connection. The input here is the text information converted by the terminal. The terminal encrypts the text information using a secure transmission protocol and sends it to the server. The output is the text information received by the server.
[0474] Step 3:
[0475] The server analyzes the received text information. The input is text information sent from the terminal. The server utilizes its built-in generative AI model and natural language processing model to perform data analysis to extract linguistic features and context. The output is the analysis result that is used to calculate the psychological impact score.
[0476] Step 4:
[0477] The server calculates a psychological impact score based on the analysis results. The input is the result of natural language processing. The server compares this with a database of past cases and calculates the psychological impact score according to specific criteria. If this score exceeds the threshold, a flag is output indicating that a warning is necessary.
[0478] Step 5:
[0479] The server generates a warning message based on the score flag. The input is information about the score exceeding the threshold. The server generates a warning message and determines the specific content to provide to the user. The output is the text data of the warning message.
[0480] Step 6:
[0481] The terminal receives warning messages sent from the server and notifies the user. The input is the warning message data transferred from the server. The terminal immediately alerts the user using screen displays and audio alerts. The output is a warning display that the user can see.
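The server-side portion of the flow above (Steps 3 to 5) can be sketched as follows. This is an illustration under stated assumptions, not the disclosed implementation: the phrase table, its weights, the 0 to 1 score range, and the threshold of 0.7 are all hypothetical, and a simple phrase lookup stands in for the generative AI model and the natural language processing model. Only the warning text is taken from the description.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed value: the description does not state a score range or a threshold.
THRESHOLD = 0.7

# Toy stand-in for the analysis of Steps 3 and 4. The phrases and weights are
# invented for illustration and do not come from the description.
PHRASE_WEIGHTS = {"incompetent": 0.9, "useless": 0.8, "your fault": 0.6}


@dataclass
class Result:
    score: float
    warning_needed: bool
    message: Optional[str]


def psychological_impact_score(text: str) -> float:
    """Return the largest weight among the listed phrases found in the text."""
    lowered = text.lower()
    return max((w for p, w in PHRASE_WEIGHTS.items() if p in lowered), default=0.0)


def evaluate(text: str) -> Result:
    score = psychological_impact_score(text)
    # Step 4: the flag is raised only when the score exceeds the threshold.
    flag = score > THRESHOLD
    # Step 5: the warning text is generated only when the flag is set.
    message = "This comment may give an inappropriate impression." if flag else None
    return Result(score, flag, message)


print(evaluate("You are incompetent"))
print(evaluate("Thanks for the update"))
```

Separating the score calculation from the threshold comparison mirrors the division between Step 4 and Step 5 and lets the threshold be tuned without changing the analysis.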
[0482] (Application Example 1)
[0483] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0484] In the workplace, it is essential to prevent situations where remarks are unintentionally perceived as harassment. However, it is difficult for speakers to constantly be aware of how their individual remarks are received. Furthermore, the lack of a system to immediately provide corrective measures when problems are recognized after a remark has been made hinders the development of healthy workplace communication.
[0485] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0486] In this invention, the server includes means for acquiring voice information and converting it into text information, means for analyzing the text information and calculating a behavioral evaluation score, and means for generating warning information and providing suggestions. This enables the speaker to immediately recognize the impact their words have on others and to make appropriate behavioral modifications.
[0487] "Audio information" refers to data that represents content transmitted through sound in a digital format.
[0488] "Text information" refers to data that represents audio information in text format.
[0489] "Means of conversion" refers to a function or device for converting audio information into text information in real time.
[0490] "Means of analysis" refers to a system or program that analyzes the acquired and converted text information to understand its meaning and context.
[0491] A "behavioral evaluation score" is an index that evaluates and quantifies the appropriateness of a statement based on the analyzed text information.
[0492] A "warning message" is a notification generated when the calculated behavioral evaluation score exceeds a certain threshold, clearly indicating the points that should be corrected in the statement in question.
[0493] "Means of providing suggestions" refers to a function that provides users with advice and hints to review and improve their statements based on warning information.
[0494] This invention is a system for evaluating the appropriateness of speech by acquiring voice information in real time and analyzing its content. This system integrates speech recognition technology and natural language processing technology and is installed in an information processing terminal that users use on a daily basis. Specifically, it utilizes an input device for acquiring voice information (e.g., a smartphone microphone) and speech recognition software for converting it to text (e.g., Google Cloud Speech-to-Text).
[0495] The terminal first captures the user's speech through its microphone and converts this audio information into text using speech recognition software. Then, the terminal sends this text information to a server via the internet.
[0496] The server interprets the received text information and analyzes the linguistic features and context using natural language processing models (e.g., models built with NLTK or Hugging Face's Transformers). Next, it calculates a behavioral evaluation score based on the analysis results. This score is calculated based on set criteria, and warning information is attached to statements whose score exceeds a certain threshold.
[0497] Warning information is sent to the user, including suggestions for improvement regarding the relevant statement. Users can immediately check this notification on their smartphone screen and recognize the unintended impact of their statement. This improves the quality of everyday communication and contributes to building smoother relationships in the workplace.
[0498] For example, if someone says during a meeting, "It's your fault that this project is progressing so slowly," the system can analyze this statement and issue a warning if it excessively emphasizes the other person's responsibility. An example of a prompt provided to the generative AI model would be, "Identify any negative comments or harassment in the following text. If any, explain why."
[0499] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0500] Step 1:
[0501] The device acquires user speech as audio information via a microphone. The input is user speech data. This audio information is converted into text in real time using speech recognition software such as Google Cloud Speech-to-Text. The output is the converted text data.
[0502] Step 2:
[0503] The terminal sends the converted text information to the server via the internet. The input is the text data, and the output is the text data received by the server. Because the HTTPS protocol is used for transmission, the data is encrypted in transit.
[0504] Step 3:
[0505] The server analyzes the received text information. The input is text data, and the analysis is performed using a natural language processing model (e.g., a model built with NLTK or Hugging Face's Transformers). During the analysis process, linguistic features and context are evaluated, and the information is structured. The output is the analysis result.
[0506] Step 4:
[0507] The server calculates a behavioral evaluation score based on the analysis results. The input is the analysis results, and the appropriateness of the statements is quantified by the scoring algorithm. The output is the behavioral evaluation score.
[0508] Step 5:
[0509] The server compares the calculated behavioral evaluation score to a set threshold value. If the score exceeds the threshold value, it generates a warning message. The input is the behavioral evaluation score, and the output is the warning message. This message includes an analysis of why the remarks were problematic and suggestions for improvement.
[0510] Step 6:
[0511] The terminal receives warning information sent from the server and notifies the user. The input is the warning information, which is displayed as a pop-up or audio alert. The user acknowledges this warning and is prompted to reconsider their statement. The output is feedback on the user's perception and actions.
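Step 5 above states that the warning includes the reason the remark was problematic and a suggestion for improvement. One way such warning information could be structured is sketched below. The rule table, its patterns, and the wording of the reasons and suggestions are hypothetical, and a regular-expression lookup is used here in place of the natural language processing model named in the description.

```python
import re
from typing import Dict, List

# Hypothetical rule table: patterns, reasons, and suggestions are illustrative
# only and are not part of the description.
RULES = [
    (re.compile(r"\byour fault\b", re.IGNORECASE),
     "The statement places sole responsibility on the listener.",
     "Describe the delay and ask what support is needed."),
    (re.compile(r"\b(always|never)\b", re.IGNORECASE),
     "Absolute wording can read as a personal attack.",
     "Refer to the specific occasion instead of generalizing."),
]


def build_warning(statement: str) -> Dict[str, object]:
    """Collect a reason and a suggestion for every rule the statement matches."""
    findings: List[Dict[str, str]] = [
        {"reason": reason, "suggestion": suggestion}
        for pattern, reason, suggestion in RULES
        if pattern.search(statement)
    ]
    return {"statement": statement, "findings": findings}


warning = build_warning("It's your fault that this project is progressing so slowly")
for finding in warning["findings"]:
    print(finding["reason"], "->", finding["suggestion"])
```

Returning the reason and the suggestion as separate fields lets the terminal decide how to present them, for example as a pop-up with the suggestion shown on request.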
[0512] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.
[0513] This invention is embodied as a software application installed on a user's smartphone, PC, or other device. This system has the function of converting audio data into text data in real time, and then sending the text data to a server for analysis. The server is equipped with a natural language processing model and an emotion engine, which are used to evaluate the context and emotion of the user's speech.
[0514] The server extracts linguistic features from the received text data and uses an emotion engine to recognize the user's emotions. The emotion engine identifies emotional keywords and expressions appearing in the text and has the function of evaluating the intensity and type of those emotions. Based on the text data and emotion data, the server calculates a power harassment score, which serves as an indicator of whether the conversation content constitutes power harassment.
[0515] The emotion data recognized by the emotion engine is reflected in the calculation of the power harassment score, so the score depends on whether the conversation is conducted calmly or with emotional aggression. For example, the score increases when aggressive emotions are present, increasing the likelihood of a warning.
[0516] If the score exceeds a certain threshold, the server generates a warning message. This message includes the problematic statement, the reason it was deemed aggressive, and the user's emotional state. The generated warning message is sent to the device and the user is notified. The notification is made through a visual pop-up or an audio alert, allowing the user to re-evaluate their own statements and emotions.
[0517] This system provides users with the opportunity to correct inappropriate remarks they may have unconsciously made during communication, creating an environment aimed at preventing workplace harassment. The incorporation of an emotion engine enables more accurate feedback, making it a valuable tool for user self-improvement.
[0518] The following describes the processing flow.
[0519] Step 1:
[0520] The device continuously captures the user's voice using a microphone and stores it as audio data. This data is processed by a noise reduction algorithm to obtain a clear audio signal.
[0521] Step 2:
[0522] The device uses speech recognition software to convert voice data into text data in real time. The converted text is recorded as accurate textual information of the user's speech.
[0523] Step 3:
[0524] The device encrypts text data before sending it to the server. This ensures user privacy and enables secure data transmission.
[0525] Step 4:
[0526] The server inputs the received text data into a natural language processing (NLP) model for analysis. Here, linguistic features and contextual information from the text are extracted.
[0527] Step 5:
[0528] The server activates the emotion engine based on the analysis results of the NLP model. The emotion engine analyzes text dynamics and identifies emotional states such as positive, negative, and aggressive.
[0529] Step 6:
[0530] The server integrates acquired linguistic features and emotional information to calculate a power harassment score. The score quantifies the potential impact that the remarks have on the recipient.
[0531] Step 7:
[0532] The server compares the calculated harassment score to a baseline value. If the score exceeds the baseline value, a warning message is generated. The message includes the content of the problematic remarks and the sentiment analysis results.
[0533] Step 8:
[0534] The server sends the generated warning message to the terminal.
[0535] Step 9:
[0536] The terminal notifies the user of the warning message. The notification appears as a pop-up on the terminal screen and can be accompanied by an audio alert to prompt the user's immediate attention.
[0537] This allows users to review their own words, actions, and emotional states and receive feedback to take appropriate action.
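The integration of linguistic features and emotional information in Steps 6 and 7 could, for example, take the form of an additive adjustment. The sketch below assumes a linguistic score in the range 0 to 1, an emotion label with an intensity in the range 0 to 1, and hypothetical weights and baseline value; none of these numbers appear in the description.

```python
# Assumed weights and baseline; the description gives no numeric values.
EMOTION_WEIGHT = {"positive": 0.0, "negative": 0.2, "aggressive": 0.4}
BASELINE = 0.6


def power_harassment_score(linguistic_score: float, emotion: str, intensity: float) -> float:
    """Add an emotion-dependent term to the linguistic score and clamp to [0, 1]."""
    raw = linguistic_score + EMOTION_WEIGHT[emotion] * intensity
    return max(0.0, min(1.0, raw))


# The same wording is scored twice, once spoken calmly and once aggressively.
calm = power_harassment_score(0.5, "positive", 0.8)
heated = power_harassment_score(0.5, "aggressive", 0.8)

print(calm > BASELINE)    # False: no warning is generated
print(heated > BASELINE)  # True: a warning is generated
```

With these assumed values, identical wording stays below the baseline when spoken calmly and exceeds it when aggressive emotion is detected, which is the behavior described for the emotion engine.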
[0538] (Example 2)
[0539] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0540] Traditional communication analysis systems have struggled to accurately identify emotions from voice and provide appropriate warnings to users when conversation content exceeds certain thresholds. In particular, there is a need to detect remarks that could potentially affect mental health, such as power harassment, in real time and provide immediate feedback. Furthermore, encryption is required for data transmission to protect privacy.
[0541] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0542] In this invention, the server includes means for acquiring voice information and converting it into text information, means for analyzing the converted text information using natural language processing, and means for identifying emotions and generating emotion data based on the analysis results. This makes it possible to identify emotions during voice communication, calculate the risk of power harassment, and provide a warning to the user.
[0543] "Audio information" refers to data that represents and processes audio in a digital format.
[0544] "Text information" refers to data obtained by converting audio information into text.
[0545] "Analysis" is the process of processing textual information and evaluating its context, content, and emotions.
[0546] An "indicator" is an evaluation value that is quantified according to specific evaluation criteria based on the results obtained from the analysis.
[0547] "Warning information" is information generated to alert the user when the calculated indicator exceeds a certain threshold value.
[0548] "Emotional data" refers to data that indicates the intensity and type of emotion based on textual information.
[0549] "Encryption" is the process of transforming data into a form that cannot be interpreted by third parties in order to protect its content.
[0550] This invention is implemented by a system that captures speech through a user's voice input device and converts it to text in real time. This system utilizes speech recognition technology, specifically using services such as the Google Speech-to-Text API. The user's device converts the captured speech into text information and sends that text information to the server. During transmission, the data is encrypted using the SSL / TLS protocol to ensure the security of the communication.
[0551] The server uses a natural language processing model to analyze the received text information. Specifically, language models such as BERT and GPT-3 are applied. This analysis extracts linguistic features from the text information and interprets the context of the utterance. Furthermore, an emotion engine is used to identify emotions within the text information, and their intensity and type are recorded as emotion data.
[0552] Based on the emotion data and analysis results, the server calculates an indicator. This indicator represents the likelihood that the conversation content constitutes power harassment. For example, if a user says, "Your way of doing things is completely wrong," the server reacts to the negative word "wrong" and assigns a high value as negative emotion data. If the indicator exceeds a certain threshold, the server generates warning information and sends it to the user.
[0553] The generated warning information is displayed on the terminal and conveyed to the user through visual or audible alerts. This allows the user to understand in real time how their comments are being evaluated and how they can be improved.
[0554] As a concrete example, the following prompt sentences can be input into a generative AI model to simulate similar language analysis:
[0555] "Please explain how workplace conversations are recorded in real time, how negative emotional expressions are identified, and how they are evaluated and notified."
[0556] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0557] Step 1:
[0558] The device captures the user's voice through the microphone. This voice data is input in real time to a speech recognition service such as the Google Speech-to-Text API. The input voice data is processed and output as text information. For example, if the user says "Hello, how are you?", this phrase is converted into text information.
[0559] Step 2:
[0560] The terminal sends the converted text information to the server. The SSL/TLS encryption protocol is used during transmission to ensure data security. The input at this stage is the text information obtained in Step 1, which is output to the server in encrypted form.
[0561] Step 3:
[0562] The server inputs the received text information into a natural language processing model. Specifically, it uses language models such as BERT or GPT-3. The model analyzes the text information and extracts linguistic features and context. From the input text information, it obtains the results of contextual analysis as output. For example, it analyzes the emotional nuances of a phrase.
[0563] Step 4:
[0564] The server generates emotion data via the emotion engine based on the analyzed contextual data. The input for this step is the contextual data from Step 3, which is processed to output the intensity and type of emotion. Specifically, it identifies positive, negative, and neutral emotions.
[0565] Step 5:
[0566] The server uses the emotion data and contextual analysis results to calculate the indicator. The emotion engine quantifies the score according to the intensity of the emotion and outputs this as the indicator. For example, a high score is obtained when negative emotions are strong.
[0567] Step 6:
[0568] The server determines whether the calculated indicator exceeds the threshold value. In this step, the indicator is used as input, and the result of the determination (whether or not it exceeds the threshold value) is output. If it exceeds the threshold value, warning information is then generated.
[0569] Step 7:
[0570] The server generates warning information to be sent to the terminal. The input is the judgment result from step 6, which is used to create and output an appropriate warning message. For example, a message indicating that the user's statement is aggressive is constructed.
[0571] Step 8:
[0572] The terminal notifies the user of the received warning information. The input is warning information sent from the server, which is output as a screen display or audio alert. The user receives this and has an opportunity to re-evaluate their own statements.
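For the encrypted transmission in Step 2, a terminal implemented in Python could prepare a TLS client context with the standard `ssl` module, as in the sketch below. The helper name `make_tls_context` and the choice of TLS 1.2 as the minimum protocol version are assumptions; the description states only that the SSL/TLS protocol is used.

```python
import ssl


def make_tls_context() -> ssl.SSLContext:
    """Create a client-side TLS context with certificate and hostname checks enabled."""
    # create_default_context() enables certificate verification and hostname
    # checking for client connections.
    context = ssl.create_default_context()
    # Assumption: protocol versions older than TLS 1.2 are refused.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


context = make_tls_context()
print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)
```

A context prepared this way can be passed to standard-library clients, for example through the `context` argument of `http.client.HTTPSConnection`, so that the text information is sent only over a verified, encrypted connection.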
[0573] (Application Example 2)
[0574] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0575] In customer support interactions, staff members may unconsciously make inappropriate remarks to customers, which can lead to decreased customer satisfaction. To solve this problem, support is needed to enable staff members to evaluate their own words and actions in real time during conversations with customers and to respond appropriately.
[0576] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0577] In this invention, the server includes means for converting voice information into text information, means for analyzing the converted text information, means for calculating an evaluation score according to criteria based on the analysis results, and means for analyzing emotional data in real time and providing advice to support appropriate response methods. This enables the person in charge to always respond to customers with appropriate attitude and language while interacting with them.
[0578] "Audio information" refers to human speech and other sounds acquired as digital data.
[0579] "Text information" refers to text data converted from audio information.
[0580] "Analysis" is the process of analyzing content based on textual information to identify specific patterns or emotions.
[0581] An "evaluation score" is a numerical indicator calculated based on the analysis results, which shows the degree of conformity to a specific standard.
[0582] A "notification message" is information that indicates a warning when the evaluation score exceeds a certain threshold.
[0583] "Emotional data" refers to information about emotions extracted from textual data.
[0584] "Providing advice" means offering real-time guidance on appropriate actions based on evaluation scores and sentiment data.
[0585] "Encryption" is the process of protecting audio information by transforming it using a specific algorithm so that it cannot be easily read by a third party.
[0586] The system for realizing this application is basically designed as a software application for acquiring voice information in real time and converting it into text information. The user uses a device such as a smartphone or PC to input voice information through a microphone. The device then uses the SpeechRecognition library to convert the voice information into text information.
[0587] Text information transmitted from the terminal reaches the server, where it undergoes text analysis using natural language processing technology. This analysis utilizes a language processing model (for example, a model using TensorFlow). The server extracts sentiment data from the text information and calculates an evaluation score based on it. Based on the sentiment data, it also provides advice to encourage appropriate responses in specific situations.
[0588] The server generates visual or auditory notification messages for customer service representatives based on sentiment data and evaluation scores. For example, if strong emotions indicating customer dissatisfaction are detected, advice such as "To improve customer satisfaction, please strive for a calm response" might be provided. This message is displayed via the terminal's screen or through an audio alert.
[0589] For example, if the system detects that a customer is experiencing high levels of stress during a customer service call, it analyzes the customer's emotions in real time and advises the representative to remain calm. This approach is expected to improve customer satisfaction.
[0590] An example of a prompt message might be: "Develop an application that provides advice to facilitate smooth conversations with customers. It will use voice input, analyze sentiment in real time, and recommend appropriate responses as needed."
[0591] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0592] Step 1:
[0593] The user inputs voice information using the device's microphone. The device acquires this voice signal and converts the voice into text information using the SpeechRecognition library. In this step, the input is voice data, and the output is text data converted from the voice. Data processing involves voice sampling and subsequent conversion to text information.
[0594] Step 2:
[0595] After the text information is generated, the terminal sends this information to the server. At this stage, the text data is transferred to the server via the network. The input is the text information, and the output is the text information as received by the server. This data is encrypted when it is sent to the server, ensuring security.
[0596] Step 3:
[0597] The server analyzes the content of the received text information using a natural language processing model. A language model built with TensorFlow is used for this purpose. The input is the received text information, and the output is the semantic information of the analyzed text. Data calculations include structural analysis of the language and extraction of emotion data.
[0598] Step 4:
[0599] Based on the analysis results, the server extracts emotional data and calculates an evaluation score. This score is a numerical representation of how appropriate the conversation is. The input is the analysis results, and the output is the evaluation score. Data processing at this stage includes the evaluation of the type and intensity of emotions by the emotion engine.
[0600] Step 5:
[0601] Based on the evaluation score, the server generates real-time advice for the user. This advice suggests specific actions to facilitate smooth customer interactions. The input is the evaluation score, and the output is a visual or auditory notification message presented to the user. Here, a generative AI model is used to create appropriate advice.
[0602] Step 6:
[0603] Notification messages sent from the server are received by the terminal and presented to the user. Here, the user reflects on their response and improves it as needed. The input is the notification message from the server, and the output is the information displayed to the user. Specific actions include pop-up displays on the terminal screen and audio alerts.
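The selection of advice in Step 5 can be illustrated with a simple rule-based stand-in for the generative AI model. In the sketch below, the emotion label "dissatisfaction", both cut-off values, and the second advice sentence are hypothetical; the first advice sentence is the example quoted in the description.

```python
from typing import Optional

# Assumed cut-offs; the description does not give numeric values.
INTENSITY_LIMIT = 0.7
SCORE_LIMIT = 0.5


def select_advice(evaluation_score: float, emotion_type: str, intensity: float) -> Optional[str]:
    """Return advice for the representative, or None when no notification is needed."""
    # Strong customer dissatisfaction takes priority over the evaluation score.
    if emotion_type == "dissatisfaction" and intensity >= INTENSITY_LIMIT:
        return "To improve customer satisfaction, please strive for a calm response."
    # Hypothetical fallback advice when only the evaluation score exceeds its limit.
    if evaluation_score > SCORE_LIMIT:
        return "Consider rephrasing your last remark more politely."
    return None


print(select_advice(0.2, "dissatisfaction", 0.9))
print(select_advice(0.2, "neutral", 0.1))
```

Returning `None` when neither condition holds keeps the terminal silent during ordinary conversation, so notifications appear only when they are actionable.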
[0604] The specific processing unit 290 transmits the result of the specific processing to the headset-type terminal 314. In the headset-type terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0605] The data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 is input with prompts containing instructions, and with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 infers from the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0606] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset terminal 314.
[0607] [Fourth Embodiment]
[0608] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.
[0609] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.
[0610] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0611] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.
[0612] The microphone 238 receives instructions from the user 20 by picking up the voice of the user 20. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.
[0613] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0614] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0615] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.
[0616] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0617] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0618] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0619] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0620] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0621] This invention is implemented as a software application installed on a user's device, such as a smartphone or PC. The device first receives voice input and converts it into text data in real time. By using speech recognition technology, the user's conversation can be accurately captured as text information.
[0622] Next, the terminal sends the converted text data to the server via the internet connection. The server has a natural language processing model for analyzing the received text data, and uses this model to extract linguistic features and context. Based on the analysis results, the server calculates a power harassment score. The calculation method includes the frequency of occurrence of specific keywords and phrases, context, and matching with a database of past power harassment cases.
[0623] If the power harassment score exceeds the set threshold, the server immediately generates a warning message. This warning message includes specific problematic statements and the reasons for them, informing the perpetrator of how their words affect others.
[0624] Subsequently, the warning message is sent to the device and the user is notified. The notification appears as an on-screen pop-up or an audio alert so that the user notices it immediately. For example, if the statement "You're always so useless" is detected, a warning is issued because this statement risks belittling the other person.
[0625] This provides users with an opportunity to review their unconscious words and actions, creating an environment that helps prevent workplace harassment. It is believed that using this system will promote healthy communication in the workplace and contribute to increased productivity.
[0626] The following describes the processing flow.
[0627] Step 1:
[0628] The device captures the user's voice using the microphone and temporarily stores it as audio data. Noise filtering is also performed simultaneously to ensure sound quality.
[0629] Step 2:
[0630] The device converts acquired audio data into text data in real time. It uses a speech recognition engine to achieve highly accurate transcription.
[0631] Step 3:
[0632] The device sends the converted text data to the server. During transmission, the data is encrypted to protect user privacy.
[0633] Step 4:
[0634] The server receives the transmitted text data and begins analysis. It uses a natural language processing model to extract linguistic features and understand the context.
[0635] Step 5:
[0636] The server calculates a power harassment score based on the analysis results. It checks for the appearance of specific keywords and phrases and calculates the score by comparing them with past power harassment case data.
[0637] Step 6:
[0638] The server compares the power harassment score to a baseline value. If the score exceeds the baseline value, it generates a warning message about the problematic remarks. This message explains which parts are problematic and why.
[0639] Step 7:
[0640] The server sends the generated warning message to the terminal.
[0641] Step 8:
[0642] The device notifies the user of any received warning messages. The notification is provided visually and audibly, allowing the user to understand the content immediately.
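Steps 4 to 6 above can be illustrated with a minimal sketch. The phrase list, the integer weights, the threshold of 2, and the names `FLAGGED_PHRASES`, `power_harassment_score`, and `analyze` are all assumptions made for illustration; the disclosure does not specify them, and a real server would use a natural language processing model rather than a fixed phrase table.

```python
import re
from dataclasses import dataclass

# Hypothetical phrase table: pattern -> (weight, reason).
# The phrases, weights, and threshold are illustrative assumptions.
FLAGGED_PHRASES = {
    r"\buseless\b": (3, "belittles the other person"),
    r"\balways\b": (1, "generalizes a criticism"),
    r"\bincompetent\b": (3, "attacks the other person's ability"),
}
THRESHOLD = 2  # assumed baseline value


@dataclass
class WarningMessage:
    statement: str
    score: int
    reasons: list


def power_harassment_score(text):
    """Return (score, reasons) based on how often flagged phrases appear."""
    score = 0
    reasons = []
    lowered = text.lower()
    for pattern, (weight, reason) in FLAGGED_PHRASES.items():
        hits = len(re.findall(pattern, lowered))
        if hits:
            score += weight * hits
            reasons.append(reason)
    return score, reasons


def analyze(text):
    """Score the text and build a warning only if the score exceeds the threshold."""
    score, reasons = power_harassment_score(text)
    if score > THRESHOLD:
        return WarningMessage(statement=text, score=score, reasons=reasons)
    return None
```

With these assumed weights, `analyze("You're always so useless")` returns a warning that names the statement and the reasons, while a neutral sentence returns `None`.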
[0643] (Example 1)
[0644] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0645] In modern society, there is a need to reduce the risk of inappropriate remarks and harassment in the workplace and other communication settings, and to promote a healthy environment for dialogue. In particular, there is a need for a system that can immediately grasp the psychological impact that unconscious remarks made by participants have on others and respond appropriately.
[0646] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0647] In this invention, the server includes means for acquiring audio data and converting it into text information, means for transmitting the converted text information to a remote device, and means for analyzing the received text information using a natural language processing model. This makes it possible to immediately evaluate whether a participant's remarks have a psychological impact and to issue a warning if necessary.
[0648] "Audio data" refers to the digital representation of audio information spoken by a user.
[0649] "Textual information" refers to data obtained by analyzing audio data and converting it into a text format that humans can understand.
[0650] A "natural language processing model" is a computer program that understands human language and facilitates communication. It is used to analyze and extract the context and meaning of data.
[0651] A "psychological impact score" is an index that quantifies the potential psychological and emotional impact that a statement may have on others.
[0652] A "warning message" is information generated when a user's statement violates the standards, intended to alert the user to the issue.
[0653] A "remote device" is a device that can communicate via a network and is used to process or store information.
[0654] "Encryption" is a technology that transforms information based on a specific algorithm to improve data security and prevent unauthorized access.
[0655] This invention is implemented as a system consisting of software installed on a portable information terminal or computer device used by a user.
[0656] Terminal:
[0657] When a user speaks, the device acquires the audio data and converts it into text in real time. A general-purpose speech recognition service is used for this process, for example. This ensures that the spoken words are accurately converted into digital text.
[0658] Server:
[0659] After the speech is converted into text information, the text information is sent to the server via an internet connection. The server uses a generative AI model and a natural language processing model to analyze the text information. Here, the server extracts linguistic features and context from the text information and calculates a psychological impact score. This analysis uses machine learning and data mining techniques, and also refers to past cases stored in a database.
[0660] Warning message:
[0661] If the psychological impact score exceeds a certain threshold, the server generates a warning message. This message is sent to the user's device and displayed on the display screen. For example, if a negative comment is detected, the user will receive a warning such as, "This comment may give an inappropriate impression."
[0662] Specific examples and prompt statements:
[0663] If a user says "You are incompetent" during a workplace conversation, the system analyzes this expression and determines whether to issue an appropriate warning. An example of a prompt provided to the generative AI model is, "Analyze this text and determine its psychological impact."
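As a minimal sketch of how the example prompt might be combined with a transcribed utterance, the following code builds the prompt text and passes it to a model. The instruction string is quoted from the paragraph above; the prompt layout, the function names, and the `model` callable (a placeholder for whatever generative AI client the server uses) are assumptions.

```python
from typing import Callable

# Instruction text quoted from the example prompt in the description.
INSTRUCTION = "Analyze this text and determine its psychological impact."


def build_prompt(utterance: str) -> str:
    """Combine the instruction with the transcribed utterance (assumed layout)."""
    return f'{INSTRUCTION}\n\nText: "{utterance}"'


def assess(utterance: str, model: Callable[[str], str]) -> str:
    """Send the prompt to a generative AI model and return its reply.

    `model` is a hypothetical stand-in: any callable that maps a prompt
    string to a reply string.
    """
    return model(build_prompt(utterance))
```

Keeping the model behind a plain callable lets the same prompt logic be exercised with a stub during testing and with a real client in deployment.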
[0664] This system allows users to review their own statements and improve the quality of their communication. As a result, it is expected to improve the health and productivity of interactions in the user environment.
[0665] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0666] Step 1:
[0667] The device acquires the user's speech as audio data. The input is the user's voice, which is converted into digital audio data. The device sends this audio data to its internal speech recognition engine, which converts it into text information in real time. The output as a result of the conversion is text information that records the user's speech.
[0668] Step 2:
[0669] The terminal sends the converted text information to the server via the internet connection. The input here is the text information converted by the terminal. The terminal encrypts the text information using a secure transmission protocol and sends it to the server. The output is the text information data received by the server.
[0670] Step 3:
[0671] The server analyzes the received text information. The input is the text information sent from the terminal. The server uses its built-in generative AI model and natural language processing model to extract linguistic features and context. The output is the analysis result from which the psychological impact score is calculated.
[0672] Step 4:
[0673] The server calculates a psychological impact score based on the analysis results. The input is the result of natural language processing. The server compares this with a database of past cases and calculates the psychological impact score according to specific criteria. If this score exceeds the criteria, a flag is output indicating that a warning is necessary.
[0674] Step 5:
[0675] The server generates a warning message based on the score flag. The input is information about the score exceeding the threshold. The server generates a warning message and determines the specific content to provide to the user. The output is the text data of the warning message.
[0676] Step 6:
[0677] The terminal receives warning messages sent from the server and notifies the user. The input is the warning message data transferred from the server. The terminal immediately alerts the user using screen displays and audio alerts. The output is a warning display that the user can see.
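The comparison with past cases in Step 4 could, under simple assumptions, be sketched as a similarity search. Here the score is the highest Jaccard similarity between the word set of the utterance and the word set of any stored case. The case list, the criterion of 0.5, and the choice of Jaccard similarity are illustrative assumptions; the disclosure does not state how the comparison is performed.

```python
import re

# Hypothetical database of past cases; contents are illustrative only.
PAST_CASES = [
    "you are incompetent",
    "you are always so useless",
    "why can you not do even this simple task",
]
CRITERION = 0.5  # assumed criterion for the similarity-based score


def tokens(text):
    """Lowercase word set used for the comparison."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def psychological_impact_score(text):
    """Score = highest similarity between the utterance and any past case."""
    words = tokens(text)
    return max(jaccard(words, tokens(case)) for case in PAST_CASES)


def needs_warning(text):
    """Output the flag described in Step 4 when the score exceeds the criterion."""
    return psychological_impact_score(text) > CRITERION
```

A word-overlap measure is only a stand-in here; an embedding-based comparison would capture paraphrases that share few words with the stored cases.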
[0678] (Application Example 1)
[0679] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0680] In the workplace, it is essential to prevent situations where remarks are unintentionally perceived as harassment. However, it is difficult for speakers to constantly be aware of how their individual remarks are received. Furthermore, the lack of a system to immediately provide corrective measures when problems are recognized after a remark has been made hinders the development of healthy workplace communication.
[0681] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0682] In this invention, the server includes means for acquiring voice information and converting it into text information, means for analyzing the text information and calculating a behavioral evaluation score, and means for generating warning information and providing suggestions. This enables the speaker to immediately recognize the impact their words have on others and to make appropriate behavioral modifications.
[0683] "Audio information" refers to data that represents content transmitted through sound in a digital format.
[0684] "Textual information" refers to data that represents audio information in text format.
[0685] "Means of conversion" refers to a function or device for converting audio information into text information in real time.
[0686] "Means of analysis" refers to a system or program that analyzes the acquired and converted text information to understand its meaning and context.
[0687] The "Behavioral Evaluation Score" is an index that evaluates and quantifies the appropriateness of a statement based on analyzed textual information.
[0688] A "warning message" is a notification generated when the calculated behavioral evaluation score exceeds a certain threshold, clearly indicating the points that should be corrected in the statement in question.
[0689] "Means of providing suggestions" refers to a function that provides users with advice and hints to review and improve their statements based on warning information.
[0690] This invention is a system for evaluating the appropriateness of speech by acquiring voice information in real time and analyzing its content. This system integrates speech recognition technology and natural language processing technology and is installed in an information processing terminal that users use on a daily basis. Specifically, it utilizes an input device for acquiring voice information (e.g., a smartphone microphone) and speech recognition software for converting it to text (e.g., Google Cloud Speech-to-Text).
[0691] The device first inputs the user's speech through its microphone and converts this speech information into text using speech recognition software. Then, the device sends this text information to a server via the internet.
[0692] The server interprets the received text information and analyzes its linguistic features and context using natural language processing tools (e.g., NLTK or Hugging Face's Transformers). Next, it calculates a behavioral evaluation score based on the analysis results. This score is calculated according to set criteria, and warning information is attached to statements whose score exceeds a certain threshold.
[0693] Warning information is sent to the user, including suggestions for improvement regarding the relevant statement. Users can immediately check this notification on their smartphone screen and recognize the unintended impact of their statement. This improves the quality of everyday communication and contributes to building smoother relationships in the workplace.
[0694] For example, if someone says during a meeting, "It's your fault that this project is progressing so slowly," the system can analyze this statement and issue a warning if it excessively emphasizes the other person's responsibility. An example prompt for this analysis would be, "Identify any negative comments or harassment in the following text. If any, explain why."
[0695] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0696] Step 1:
[0697] The device acquires user speech as audio information via a microphone. The input is user speech data. This audio information is converted into text in real time using speech recognition software such as Google Cloud Speech-to-Text. The output is the converted text data.
[0698] Step 2:
[0699] The terminal sends the converted text information to the server via the internet. The input is the text data, which is transferred to the server. Because the HTTPS protocol is used for transmission, the data is transmitted securely.
[0700] Step 3:
[0701] The server analyzes the received text information. The input is text data, and the analysis is performed using a natural language processing model (e.g., NLTK or Hugging Face's Transformers). During the analysis process, linguistic features and context are evaluated, and the information is structured. The output is the analysis result.
[0702] Step 4:
[0703] The server calculates a behavioral evaluation score based on the analysis results. The input is the analysis results, and the appropriateness of the statements is quantified by the scoring algorithm. The output is the behavioral evaluation score.
[0704] Step 5:
[0705] The server compares the calculated behavioral evaluation score to a set threshold value. If the score exceeds the threshold value, it generates a warning message. The input is the behavioral evaluation score, and the output is the warning message. This message includes an analysis of why the remarks were problematic and suggestions for improvement.
[0706] Step 6:
[0707] The terminal receives warning information sent from the server and notifies the user. The input is the warning information, which is displayed as a pop-up or audio alert. The user acknowledges this warning and is prompted to reconsider their statement. The output is feedback on the user's perception and actions.
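Steps 4 and 5 of this flow, in which the warning carries both a reason and a suggestion for improvement, might be sketched as follows. The rule table, the point values, the threshold, and the wording of the reasons and suggestions are hypothetical; they stand in for the scoring algorithm, which the description leaves unspecified.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical rules: (phrase, points, reason, suggestion).
RULES = [
    ("your fault", 3, "places the blame entirely on the other person",
     "Describe the delay and ask what support would help."),
    ("so slowly", 1, "expresses the criticism in an exaggerated way",
     "State the schedule gap in concrete terms."),
]
THRESHOLD = 2  # assumed threshold


@dataclass
class WarningInfo:
    score: int
    reasons: list
    suggestions: list


def evaluate(text: str) -> Optional[WarningInfo]:
    """Score the statement and return warning information if the score exceeds the threshold."""
    lowered = text.lower()
    matched = [rule for rule in RULES if rule[0] in lowered]
    score = sum(points for _, points, _, _ in matched)
    if score <= THRESHOLD:
        return None
    return WarningInfo(
        score=score,
        reasons=[reason for _, _, reason, _ in matched],
        suggestions=[tip for _, _, _, tip in matched],
    )
```

Pairing every rule with a suggestion means a warning can never be issued without at least one concrete way to rephrase the statement.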
[0708] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.
[0709] This invention is embodied as a software application installed on a user's smartphone, PC, or other device. This system has the function of converting audio data into text data in real time, and then sending the text data to a server for analysis. The server is equipped with a natural language processing model and an emotion engine, which are used to evaluate the context and emotion of the user's speech.
[0710] The server extracts linguistic features from the received text data and uses an emotion engine to recognize the user's emotions. The emotion engine identifies emotional keywords and expressions appearing in the text and has the function of evaluating the intensity and type of those emotions. Based on the text data and emotion data, the server calculates a power harassment score, which serves as an indicator of whether the conversation content constitutes power harassment.
[0711] The emotional data recognized by the emotion engine is reflected in the calculation of the power harassment score, which depends on whether the conversation is conducted calmly or in an emotionally aggressive manner. For example, the score increases when aggressive emotions are present, increasing the likelihood of a warning.
[0712] If the score exceeds a certain threshold, the server generates a warning message. This message includes the problematic statement, the reason it was deemed aggressive, and the user's emotional state. The generated warning message is sent to the device and the user is notified. The notification is made through a visual pop-up or an audio alert, allowing the user to re-evaluate their own statements and emotions.
[0713] This system provides users with the opportunity to correct inappropriate remarks they may have unconsciously made during communication, creating an environment aimed at preventing workplace harassment. The incorporation of an emotion engine enables more accurate feedback, making it a valuable tool for user self-improvement.
[0714] The following describes the processing flow.
[0715] Step 1:
[0716] The device continuously captures the user's voice using a microphone and stores it as audio data. This data is processed by a noise reduction algorithm to obtain a clear audio signal.
[0717] Step 2:
[0718] The device uses speech recognition software to convert voice data into text data in real time. The converted text is recorded as accurate textual information of the user's speech.
[0719] Step 3:
[0720] The device encrypts text data before sending it to the server. This ensures user privacy and enables secure data transmission.
[0721] Step 4:
[0722] The server inputs the received text data into a natural language processing (NLP) model for analysis. Here, linguistic features and contextual information from the text are extracted.
[0723] Step 5:
[0724] The server activates the emotion engine based on the analysis results of the NLP model. The emotion engine analyzes text dynamics and identifies emotional states such as positive, negative, and aggressive.
[0725] Step 6:
[0726] The server integrates acquired linguistic features and emotional information to calculate a power harassment score. The score quantifies the potential impact that the remarks have on the recipient.
[0727] Step 7:
[0728] The server compares the calculated harassment score to a baseline value. If the score exceeds the baseline value, a warning message is generated. The message includes the content of the problematic remarks and the sentiment analysis results.
[0729] Step 8:
[0730] The server sends the generated warning message to the terminal.
[0731] Step 9:
[0732] The device notifies the user of the warning message. The notification appears as a pop-up on the device screen and may be accompanied by an audio alert to draw the user's immediate attention.
[0733] This allows users to review their own words, actions, and emotional states and receive feedback to take appropriate action.
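One possible way to integrate linguistic features and emotional information in Steps 6 and 7 is to scale a linguistic score by a factor that depends on the identified emotional state. The multipliers, the 0 to 100 scale, the baseline value of 50, and the function names below are assumptions; the description states only that aggressive emotions raise the score.

```python
# Hypothetical multipliers per emotional state; no values are given in the description.
EMOTION_MULTIPLIER = {
    "positive": 0.5,
    "neutral": 1.0,
    "negative": 1.5,
    "aggressive": 2.0,
}
BASELINE = 50  # assumed baseline value on a 0-100 scale


def integrate(linguistic_score, emotion):
    """Combine the linguistic score with the emotional state, capped at 100."""
    factor = EMOTION_MULTIPLIER.get(emotion, 1.0)  # unknown states are treated as neutral
    return min(100.0, linguistic_score * factor)


def build_warning(statement, linguistic_score, emotion):
    """Return a warning dict if the integrated score exceeds the baseline, else None."""
    score = integrate(linguistic_score, emotion)
    if score <= BASELINE:
        return None
    return {"statement": statement, "emotion": emotion, "score": score}
```

Under these assumed values, the same statement with a linguistic score of 40 produces no warning when spoken calmly but does produce one when the emotion engine reports an aggressive state.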
[0734] (Example 2)
[0735] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0736] Traditional communication analysis systems have struggled to accurately identify emotions from voice and provide appropriate warnings to users when conversation content exceeds certain thresholds. In particular, there is a need to detect remarks that could potentially affect mental health, such as power harassment, in real time and provide immediate feedback. Furthermore, encryption is required for data transmission to protect privacy.
[0737] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0738] In this invention, the server includes means for acquiring voice information and converting it into text information, means for analyzing the converted text information using natural language processing, and means for identifying emotions and generating emotion data based on the analysis results. This makes it possible to identify emotions during voice communication, calculate the risk of power harassment, and provide a warning to the user.
[0739] "Audio information" refers to data that represents and processes audio in a digital format.
[0740] "Textual information" refers to data obtained by converting audio data into linguistic characters.
[0741] "Analysis" is the process of processing textual information and evaluating its context, content, and emotions.
[0742] An "indicator" is an evaluation value that is quantified according to specific evaluation criteria based on the results obtained from the analysis.
[0743] "Warning information" is information generated to alert the user when the calculated indicator exceeds a certain threshold value.
[0744] "Emotional data" refers to data that indicates the intensity and type of emotion based on textual information.
[0745] "Encryption" is the process of transforming data into a form that cannot be interpreted by third parties in order to protect its content.
[0746] This invention is implemented by a system that captures speech through a user's voice input device and converts it to text in real time. This system utilizes speech recognition technology, specifically using services such as the Google Speech-to-Text API. The user's device converts the captured speech into text information and sends that text information to the server. During transmission, the data is encrypted using the SSL/TLS protocol to ensure the security of the communication.
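The encrypted transmission could be sketched with the Python standard library as follows. The JSON payload format, the function names, and the `host` and `path` parameters are hypothetical; the description specifies only that SSL/TLS is used. `ssl.create_default_context()` returns a client context that requires certificate verification and host name checking.

```python
import http.client
import json
import ssl


def make_tls_context():
    """Default client context: verifies the server certificate and host name."""
    return ssl.create_default_context()


def encode_payload(text):
    """Serialize the transcribed text as UTF-8 JSON (assumed payload format)."""
    return json.dumps({"text": text}).encode("utf-8")


def send_text(host, path, text):
    """POST the text over HTTPS and return the HTTP status code.

    `host` and `path` are placeholders for the server's actual endpoint.
    """
    conn = http.client.HTTPSConnection(host, context=make_tls_context(), timeout=10)
    try:
        conn.request(
            "POST",
            path,
            body=encode_payload(text),
            headers={"Content-Type": "application/json"},
        )
        return conn.getresponse().status
    finally:
        conn.close()
```

Relying on the default context avoids hand-written protocol or cipher settings, which are a common source of weakened TLS configurations.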
[0747] The server uses a natural language processing model to analyze the received text information. Specifically, language models such as BERT and GPT-3 are applied. This analysis extracts linguistic features from the text information and understands the context of the utterance. Furthermore, an emotion engine is used to identify emotions within the text information, and their intensity and type are recorded as data.
[0748] Based on the emotional data and analysis results, the server calculates an indicator. This indicator represents the likelihood that the conversation content constitutes power harassment. For example, if a user says, "Your way of doing things is completely wrong," the server reacts to the negative word "wrong" and assigns a high negative-emotion value. If the indicator exceeds a certain threshold, the server generates a warning message and sends it to the user.
[0749] The generated warning information is displayed on the device and notified to the user through visual or audible alerts. This allows the user to understand in real time how their comments are being evaluated and how they can be improved.
[0750] As a concrete example, the following prompt sentences can be input into a generative AI model to simulate similar language analysis:
[0751] "Please explain how workplace conversations are recorded in real time, how negative emotional expressions are identified, and how they are evaluated and notified."
[0752] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0753] Step 1:
[0754] The device captures the user's voice through the microphone. This voice data is input in real time to a speech recognition service such as the Google Speech-to-Text API. The input voice data is processed and output as text information. For example, if the user says "Hello, how are you?", this phrase is converted into text information.
[0755] Step 2:
[0756] The terminal sends the converted text information to the server. The SSL/TLS encryption protocol is used during transmission to ensure data security. The input at this stage is the text information obtained in Step 1, which is output to the server in encrypted form.
[0757] Step 3:
[0758] The server inputs the received text information into a natural language processing model. Specifically, it uses language models such as BERT or GPT-3. The model analyzes the text information and extracts linguistic features and context. From the input text information, it obtains the results of contextual analysis as output. For example, it analyzes the emotional nuances of a phrase.
[0759] Step 4:
[0760] The server generates emotion data via the emotion engine based on the analyzed contextual data. The input for this step is the contextual data from Step 3, which is processed to output the intensity and type of emotion. Specifically, it identifies positive, negative, and neutral emotions.
[0761] Step 5:
[0762] The server uses the emotion data and contextual analysis results to calculate the indicator. The emotion engine quantifies a score according to the intensity of the emotion and outputs it as the indicator. For example, a high score is obtained when negative emotions are strong.
[0763] Step 6:
[0764] The server determines whether the calculated indicator exceeds the threshold value. In this step, the indicator is the input, and the result of the determination (whether or not it exceeds the threshold value) is the output. If the indicator exceeds the threshold value, warning information is then generated.
[0765] Step 7:
[0766] The server generates warning information to be sent to the terminal. The input is the judgment result from step 6, which is used to create and output an appropriate warning message. For example, a message indicating that the user's statement is aggressive is constructed.
[0767] Step 8:
[0768] The terminal notifies the user of the received warning information. The input is warning information sent from the server, which is output as a screen display or audio alert. The user receives this and has an opportunity to re-evaluate their own statements.
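Steps 4 to 6 of this flow, in which the emotion engine outputs a type and an intensity that are then turned into an indicator, could be sketched with a small lexicon. The word lists, the treatment of intensifiers such as "completely", the threshold of 1, and all names below are hypothetical stand-ins for the emotion engine, whose internals are not described.

```python
import re
from dataclasses import dataclass

# Hypothetical lexicons; the emotion engine's vocabulary is not defined in the description.
NEGATIVE = {"wrong", "useless", "incompetent"}
POSITIVE = {"thanks", "great", "good"}
INTENSIFIERS = {"completely", "always", "totally"}
THRESHOLD = 1  # assumed threshold for the indicator


@dataclass
class EmotionData:
    kind: str       # "positive", "negative", or "neutral"
    intensity: int  # emotion-word hits plus intensifier hits


def emotion_engine(text):
    """Classify the emotion type and estimate its intensity."""
    words = re.findall(r"[a-z']+", text.lower())
    neg = sum(w in NEGATIVE for w in words)
    pos = sum(w in POSITIVE for w in words)
    boost = sum(w in INTENSIFIERS for w in words)
    if neg > pos:
        return EmotionData("negative", neg + boost)
    if pos > neg:
        return EmotionData("positive", pos + boost)
    return EmotionData("neutral", 0)


def indicator(emotion):
    """Only negative emotion contributes to the indicator in this sketch."""
    return emotion.intensity if emotion.kind == "negative" else 0


def exceeds_threshold(text):
    """Determine whether the indicator exceeds the threshold."""
    return indicator(emotion_engine(text)) > THRESHOLD
```

For the statement "Your way of doing things is completely wrong", this sketch reports negative emotion with the intensity raised by the intensifier, so the indicator exceeds the assumed threshold.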
[0769] (Application Example 2)
[0770] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0771] In customer support interactions, staff members may unconsciously make inappropriate remarks to customers, which can lead to decreased customer satisfaction. To solve this problem, support is needed to enable staff members to evaluate their own words and actions in real time during conversations with customers and to respond appropriately.
[0772] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0773] In this invention, the server includes means for converting voice information into text information, means for analyzing the converted text information, means for calculating an evaluation score according to criteria based on the analysis results, and means for analyzing emotional data in real time and providing advice to support appropriate response methods. This enables the person in charge to always respond to customers with appropriate attitude and language while interacting with them.
[0774] "Audio information" refers to human speech and other sounds acquired as digital data.
[0775] "Textual information" refers to text data converted from audio information.
[0776] "Analysis" is the process of analyzing content based on textual information to identify specific patterns or emotions.
[0777] An "evaluation score" is a numerical indicator calculated based on the analysis results, which shows the degree of conformity to a specific standard.
[0778] A "notification message" is information that indicates a warning when the evaluation score exceeds a certain threshold.
[0779] "Emotional data" refers to information about emotions extracted from textual data.
[0780] "Providing advice" means offering real-time guidance on appropriate actions based on evaluation scores and emotional data.
[0781] "Encryption" is the process of protecting audio information by transforming it using a specific algorithm so that it cannot be easily read by a third party.
[0782] The system for realizing this application is basically designed as a software application for acquiring voice information in real time and converting it into text information. The user uses a device such as a smartphone or PC to input voice information through a microphone. The device then uses the SpeechRecognition library to convert the voice information into text information.
[0783] Text information transmitted from the terminal reaches the server, where it is analyzed using natural language processing technology. This analysis uses a language processing model (for example, a model built with TensorFlow). The server extracts emotional data from the text information and calculates an evaluation score based on it. Based on the emotional data, it also provides advice to encourage appropriate responses in specific situations.
[0784] The server generates visual or auditory notification messages for customer service representatives based on the emotional data and evaluation scores. For example, if strong emotions indicating customer dissatisfaction are detected, advice such as "To improve customer satisfaction, please strive for a calm response" might be provided. This message is presented on the terminal's screen or as an audio alert.
[0785] For example, if the system detects that a customer is experiencing high levels of stress during a customer service call, it analyzes the customer's emotions in real time and advises the representative to remain calm. This approach is expected to improve customer satisfaction.
[0786] An example of a prompt message might be: "Develop an application that provides advice to facilitate smooth conversations with customers. It will use voice input, analyze sentiment in real time, and recommend appropriate responses as needed."
[0787] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0788] Step 1:
[0789] The user inputs voice information using the device's microphone. The device acquires this voice signal and converts the voice into text information using the SpeechRecognition library. In this step, the input is voice data, and the output is text data converted from the voice. Data processing involves voice sampling and subsequent conversion to text information.
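Step 1 can be sketched as follows. This is a minimal sketch, not the disclosed implementation: the function names `to_text` and `recognize_wav_file`, the WAV-file input, the `ja-JP` language setting, and the choice of the Google Web Speech backend are assumptions made for illustration. Live capture from a microphone would replace the file source. The recognizer is passed in as a callable so that the conversion step can be exercised without audio hardware.

```python
from typing import Callable


def recognize_wav_file(wav_path: str, language: str = "ja-JP") -> str:
    """Transcribe a WAV file with the SpeechRecognition package.

    Assumption: the third-party SpeechRecognition package is installed and
    the Google Web Speech backend is acceptable for the deployment.
    """
    import speech_recognition as sr  # imported lazily; optional dependency

    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)  # read the whole file into memory
    return recognizer.recognize_google(audio, language=language)


def to_text(wav_path: str,
            recognizer: Callable[[str], str] = recognize_wav_file) -> str:
    """Step 1: audio in, whitespace-normalized text out."""
    raw = recognizer(wav_path)
    return " ".join(raw.split())
```

Passing a stub such as `lambda path: "please redo this"` as `recognizer` lets the rest of the pipeline be tested offline.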
[0790] Step 2:
[0791] After the text information is generated, the terminal sends it to the server over the network. The input is the text information, and the output is the text information as received by the server. The data is encrypted in transit to ensure security.
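The transfer in Step 2 might look like the sketch below. The disclosure does not name an encryption scheme, so this sketch assumes that encryption in transit is provided by TLS (an HTTPS endpoint). The function name `build_text_request`, the JSON field `text`, and the URL are hypothetical.

```python
import json
import urllib.request


def build_text_request(text: str, url: str) -> urllib.request.Request:
    """Build a JSON POST request carrying the transcribed text.

    Assumption: encryption is delegated to TLS, so plain-HTTP URLs are refused.
    """
    if not url.startswith("https://"):
        raise ValueError("text must be sent over HTTPS")
    body = json.dumps({"text": text}, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


# Sending is a single call (not executed here; the URL is a placeholder):
# req = build_text_request("transcribed text", "https://example.invalid/analyze")
# with urllib.request.urlopen(req, timeout=5) as response:
#     response.read()
```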
[0792] Step 3:
[0793] The server analyzes the content of the received text information using a natural language processing model; a language model built with TensorFlow is used for this purpose. The input is the received text information, and the output is the semantic information of the analyzed text. Data calculations include structural analysis of the language and extraction of sentiment data.
[0794] Step 4:
[0795] Based on the analysis results, the server extracts emotional data and calculates an evaluation score. This score is a numerical representation of how appropriate the conversation is. The input is the analysis results, and the output is the evaluation score. Data processing at this stage includes the evaluation of the type and intensity of emotions by the emotion engine.
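Steps 3 and 4 can be illustrated with the sketch below. It is not the disclosed model: a small hand-written lexicon stands in for the trained language model, and the emotion names, intensities, and scoring formula are all assumptions. In this sketch a higher score means the conversation needs more attention, which is one possible reading of the evaluation score.

```python
from typing import Dict

# Hypothetical lexicon standing in for a trained language model.
EMOTION_LEXICON: Dict[str, Dict[str, float]] = {
    "unacceptable": {"anger": 0.9},
    "terrible": {"anger": 0.7},
    "waiting": {"frustration": 0.6},
    "thanks": {"satisfaction": 0.8},
    "great": {"satisfaction": 0.7},
}

NEGATIVE_EMOTIONS = {"anger", "frustration"}


def extract_emotions(text: str) -> Dict[str, float]:
    """Step 3: return the strongest intensity seen for each emotion type."""
    emotions: Dict[str, float] = {}
    for word in text.lower().split():
        hits = EMOTION_LEXICON.get(word.strip(".,!?"), {})
        for emotion, intensity in hits.items():
            emotions[emotion] = max(emotions.get(emotion, 0.0), intensity)
    return emotions


def evaluation_score(emotions: Dict[str, float]) -> float:
    """Step 4: map emotion data to a score in [0, 1]; higher means more concern."""
    negative = max(
        (v for k, v in emotions.items() if k in NEGATIVE_EMOTIONS), default=0.0
    )
    positive = max(
        (v for k, v in emotions.items() if k not in NEGATIVE_EMOTIONS), default=0.0
    )
    # Positive emotions offset negative ones at half weight (an arbitrary choice).
    return min(1.0, max(0.0, negative - 0.5 * positive))
```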
[0796] Step 5:
[0797] Based on the evaluation score, the server generates real-time advice for the user. This advice suggests specific actions to facilitate smooth customer interactions. The input is the evaluation score, and the output is a visual or auditory notification message presented to the user. Here, a generative AI model is used to create appropriate advice.
[0798] Step 6:
[0799] Notification messages sent from the server are received by the terminal and presented to the user. Here, the user reflects on their response and improves it as needed. The input is the notification message from the server, and the output is the information displayed to the user. Specific actions include pop-up displays on the terminal screen and audio alerts.
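Steps 5 and 6 can be sketched as follows. The disclosure generates advice with a generative AI model; this sketch substitutes a fixed message (the example advice quoted earlier) so that it runs offline. The thresholds, the `Notification` type, and the channel names are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Notification:
    message: str
    channel: str  # "popup" or "audio"


def make_notification(score: float, threshold: float = 0.6,
                      audio_threshold: float = 0.85) -> Optional[Notification]:
    """Step 5: turn an evaluation score into advice, or None if none is needed."""
    if score <= threshold:
        return None
    message = "To improve customer satisfaction, please strive for a calm response."
    # Escalate to an audio alert only for the highest scores.
    channel = "audio" if score > audio_threshold else "popup"
    return Notification(message, channel)


def present(notification: Optional[Notification]) -> str:
    """Step 6: render the notification as the terminal might show it."""
    if notification is None:
        return ""
    prefix = "[AUDIO ALERT]" if notification.channel == "audio" else "[POPUP]"
    return f"{prefix} {notification.message}"
```

Keeping generation (Step 5) separate from presentation (Step 6) mirrors the server and terminal split described above.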
[0800] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0801] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 is input with prompts containing instructions, and with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 infers from the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0802] In the above embodiment, an example was given in which the specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.
[0803] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping, namely an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.
[0804] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. Above and below the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.
[0805] These emotions are distributed at the 3 o'clock position on the Emotion Map 400, and usually fluctuate between feelings of security and anxiety. In the right half of the Emotion Map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.
[0806] The inside of the Emotion Map 400 represents inner thoughts, while the outside represents actions. Therefore, the further toward the outside of the Emotion Map 400 an emotion lies, the more visible it becomes (that is, the more it is expressed in actions).
[0807] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https://ci.nii.ac.jp/naid/500000375379). The left half of the emotion map contains emotions belonging to a region called "response," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.
[0808] The emotion map defines two emotions that promote learning. One is the emotion around the middle of the negative "repentance" and "reflection" on the situation side. In other words, it is when the robot experiences negative emotions such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the emotion around the positive "desire" on the reaction side. In other words, it is when the robot has positive feelings such as "I want more" or "I want to know more."
[0809] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
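The property that emotions located close together on the map have similar values can be illustrated with the sketch below. It does not train a neural network. Instead, it places a few emotions at hypothetical map positions and smooths a set of emotion values with a Gaussian kernel over map distance, purely to show the stated property. The coordinates, the bandwidth, and all function names are assumptions.

```python
import math
from typing import Dict, Tuple

# Hypothetical positions (radius, clock angle in degrees) on an emotion map.
# 0 degrees is the 12 o'clock direction; 90 degrees is 3 o'clock.
EMOTION_POSITIONS: Dict[str, Tuple[float, float]] = {
    "reassured": (1.0, 80.0),
    "calm": (1.0, 90.0),
    "confident": (1.5, 85.0),
    "anxious": (1.0, 110.0),
    "angry": (2.0, 200.0),
}


def _to_xy(radius: float, angle_deg: float) -> Tuple[float, float]:
    """Convert a clock-style polar position to Cartesian coordinates."""
    angle = math.radians(angle_deg)
    return radius * math.sin(angle), radius * math.cos(angle)


def map_distance(a: str, b: str) -> float:
    """Euclidean distance between two emotions on the map."""
    return math.dist(_to_xy(*EMOTION_POSITIONS[a]), _to_xy(*EMOTION_POSITIONS[b]))


def smooth(values: Dict[str, float], bandwidth: float = 0.5) -> Dict[str, float]:
    """Kernel-weighted average so that nearby emotions receive similar values."""
    smoothed: Dict[str, float] = {}
    for emotion in values:
        weights = {
            other: math.exp(-((map_distance(emotion, other) / bandwidth) ** 2))
            for other in values
        }
        total = sum(weights.values())
        smoothed[emotion] = sum(weights[o] * values[o] for o in values) / total
    return smoothed


def dominant_emotion(values: Dict[str, float]) -> str:
    """Pick the emotion with the highest value as the determined emotion."""
    return max(values, key=values.get)
```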
[0810] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.
[0811] In the above embodiment, an example was given in which a specific process is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and a distributed processing of the specific process may be performed by multiple computers, including computer 22. For example, a data generation model 58 may be provided in an external device of the data processing device 12, and the external device may generate data according to the input data.
[0812] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes specific processing according to the specific processing program 56.
[0813] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.
[0814] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.
[0815] The following types of processors can be used as hardware resources to perform specific processing. Examples of processors include a CPU, a general-purpose processor that functions as a hardware resource to perform specific processing by executing software, i.e., a program. Other examples of processors include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which have circuit configurations specifically designed to perform specific processing. All of these processors have built-in or connected memory, and all of them perform specific processing by using memory.
[0816] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.
[0817] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.
[0818] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.
[0819] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of those aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that unnecessary parts may be deleted, new elements added, or elements replaced in the descriptions and illustrations presented above, as long as this does not deviate from the essence of the technical aspects of this disclosure. Furthermore, in order to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge and the like that do not require special explanation to enable the implementation of the technical aspects of this disclosure have been omitted from the descriptions and illustrations presented above.
[0820] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted to be incorporated by reference.
[0821] The following is further disclosed regarding the embodiments described above.
[0822] (Claim 1)
[0823] A means of acquiring audio data and converting it into text data,
[0824] A means for analyzing the converted text data,
[0825] Based on the analysis results, a means for calculating a power harassment score according to specific criteria,
[0826] A means of generating and notifying a warning message when the calculated score exceeds a certain threshold,
[0827] A system that includes this.
[0828] (Claim 2)
[0829] The system according to claim 1, comprising means for performing text data analysis using a natural language processing model.
[0830] (Claim 3)
[0831] The system according to claim 1, further comprising means for encrypting acquired audio data when transmitting it.
[0832] "Example 1"
[0833] (Claim 1)
[0834] A means of acquiring audio data and converting it into text information,
[0835] Means for transmitting the converted character information to a remote device,
[0836] A means of analyzing received text information using a natural language processing model,
[0837] A means to extract linguistic features and context based on the analysis results and calculate a psychological impact score according to the criteria,
[0838] A means for generating a warning message when the calculated psychological impact score exceeds a certain threshold,
[0839] A means of notifying and displaying a warning message on a display device,
[0840] A system that includes this.
[0841] (Claim 2)
[0842] The system according to claim 1, comprising means for analyzing textual information using a generative AI model.
[0843] (Claim 3)
[0844] The system according to claim 1, comprising means for encrypting information when transmitting acquired audio data.
[0845] "Application Example 1"
[0846] (Claim 1)
[0847] A means of acquiring audio information and converting it into text information,
[0848] A means for analyzing the converted character information,
[0849] A means for calculating a behavioral evaluation score according to specific criteria based on the analysis results,
[0850] A means of generating and notifying warning information when the calculated score exceeds the standard value,
[0851] A means of providing suggestions for modifying one's own actions based on the notified warning information,
[0852] ...
[0853] A system that includes this.
[0854] (Claim 2)
[0855] The system according to claim 1, comprising means for performing character information analysis using a natural language processing model.
[0856] (Claim 3)
[0857] The system according to claim 1, further comprising means for encrypting acquired voice information when transmitting it.
[0858] "Example 2 of combining an emotion engine"
[0859] (Claim 1)
[0860] A means of acquiring audio information and converting it into text information,
[0861] A means for analyzing the converted character information,
[0862] A means for calculating an index according to specific criteria based on the analysis results,
[0863] A means of generating and notifying warning information when the calculated indicator exceeds a standard value,
[0864] A means for identifying emotions based on textual information and generating emotion data,
[0865] A means of adjusting indicators based on emotional data,
[0866] A system that includes this.
[0867] (Claim 2)
[0868] The system according to claim 1, comprising means for performing character information analysis using natural language processing.
[0869] (Claim 3)
[0870] The system according to claim 1, further comprising means for encrypting acquired voice information when transmitting it.
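The index adjustment in the emotion engine configuration above can be sketched as follows. The disclosure does not specify how emotion data adjusts the index, so the multiplicative formula, the weights, and the function names are assumptions chosen only to make the idea concrete.

```python
from typing import Dict


def adjust_index(base_index: float, emotion_data: Dict[str, float],
                 weights: Dict[str, float]) -> float:
    """Scale the index by a weighted sum of emotion intensities, clamped to [0, 1].

    Positive weights raise the index (for example, anger); negative weights
    lower it (for example, calm). Emotions without a weight are ignored.
    """
    factor = 1.0 + sum(
        weights.get(name, 0.0) * value for name, value in emotion_data.items()
    )
    return min(1.0, max(0.0, base_index * factor))


def should_warn(index: float, standard_value: float) -> bool:
    """Warning information is generated only when the index exceeds the standard value."""
    return index > standard_value
```

With hypothetical weights `{"anger": 0.5, "calm": -0.3}`, a base index of 0.5 and an anger intensity of 0.8 give an adjusted index of about 0.7, which exceeds a standard value of 0.6 even though the unadjusted index does not.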
[0871] "Application example 2 when combining with an emotional engine"
[0872] (Claim 1)
[0873] A means of acquiring audio information and converting it into text information,
[0874] A means for analyzing the converted character information,
[0875] A means for calculating an evaluation score according to the criteria based on the analysis results,
[0876] A means for generating a notification message and sending it to a communication device when the calculated evaluation score exceeds a certain threshold,
[0877] A means of analyzing emotional data in real time and providing advice to support appropriate response methods,
[0878] A system that includes this.
[0879] (Claim 2)
[0880] The system according to claim 1, comprising means for performing character information analysis using language processing technology.
[0881] (Claim 3)
[0882] The system according to claim 1, comprising means for encrypting information when transmitting acquired audio information.
[Explanation of symbols]
[0883]
10, 210, 310, 410 Data processing system
12 Data processing device
14 Smart device
214 Smart glasses
314 Headset-type terminal
414 Robot
Claims
1. A system comprising: a means of acquiring audio information and converting it into text information; a means for analyzing the converted text information; a means for calculating a behavioral evaluation score according to specific criteria based on the analysis results; a means of generating and notifying warning information when the calculated score exceeds the standard value; and a means of providing suggestions for modifying one's own actions based on the notified warning information.
2. The system according to claim 1, comprising means for performing text information analysis using a natural language processing model.
3. The system according to claim 1, further comprising means for encrypting acquired audio information when transmitting it.