system

A system using speech recognition and analysis technologies for real-time fraud detection and warning addresses fictitious billing fraud by converting ambient sound to text, analyzing it for fraud patterns, and issuing timely alerts.

JP2026100715A · Pending · Publication Date: 2026-06-19 · SOFTBANK GROUP CORP

Patent Information

Authority / Receiving Office: JP · JP
Patent Type: Applications
Current Assignee / Owner: SOFTBANK GROUP CORP
Filing Date: 2024-12-09
Publication Date: 2026-06-19

IPC: H04M3/42; G08B25/08; G08B25/04; G08B29/18; G08B23/00; G08B29/10; G08B25/10; G08B25/06; H04M11/04; G08B29/16; G08B26/00; H04M1/72; H04M1/00; H04M1/656; G08B25/14; H04M1/71

AI Tagging

Application Domain

Cordless telephones; Special service for subscribers

AI Technical Summary

Technical Problem

Fictitious billing fraud targeting the elderly is difficult to prevent quickly and effectively because fraud methods have diversified, so a system is needed for real-time fraud risk recognition and immediate response.

Method used

A system utilizing speech recognition technology to convert ambient sound into text, analyze it for fraud patterns, and issue warnings or alerts to users and emergency contacts when fraud is likely.

Benefits of technology

Enables real-time detection and warning of fraudulent activities, protecting users by promptly informing them and their contacts, thereby preventing fraud.

✦ Generated by Eureka AI based on patent content.

Abstract

[Problem] To provide a system. [Solution] A system including: a speech recognition means that continuously acquires audio data and converts it into text data; an analysis means that analyzes the text data and compares it with past fraud patterns; and a notification means that issues a warning when it is determined that there is a high probability of fraud.

Description

Technical Field

[0001] The technology of the present disclosure relates to a system.

Background Art

[0002] Patent Document 1 discloses a persona chatbot control method performed by at least one processor, including steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a chatbot character, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance in response to the user utterance.

Prior Art Documents

Patent Documents

[0003]

Patent Document 1

Summary of the Invention

Problems to be Solved by the Invention

[0004] Fictitious billing fraud is a malicious form of fraud that targets the elderly in particular and has caused extensive damage. However, because fraud methods have diversified, current countermeasures struggle to prevent fraud quickly and effectively. Therefore, there is a need for a system that allows users to recognize the risk of fraud in real time and respond immediately.

Means for Solving the Problems

[0005] This invention provides a system that utilizes speech recognition and analysis technology. Specifically, it employs speech recognition means that constantly acquires ambient sound and converts that sound into text. Furthermore, it includes analysis means that analyzes the text data and compares it with past fraud patterns to evaluate the likelihood of fraud. If it is determined that there is a high probability of fraud, it includes notification means that promptly issues a warning to the user and designated emergency contacts. In this way, the invention provides a system that can prevent fraud and protect users.

[0006] "Speech recognition means" refers to a device or technology that acquires ambient sound as digital speech data and converts it into text data.

[0007] "Analysis means" refers to a device or technology that analyzes text data obtained by speech recognition means and compares it with a database of past frauds in order to assess the likelihood of fraud.

[0008] "Notification means" refers to a device or technology for issuing a warning to the user and their emergency contacts when analysis means determine that there is a high probability of fraud.

[0009] A "fraud database" is a collection of information that accumulates past fraud cases and is used to match fraud patterns.

Brief Description of the Drawings

[0010]
[Figure 1] A conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.
[Figure 2] A conceptual diagram showing an example of the essential functions of a data processing device and a smart device according to the first embodiment.
[Figure 3] A conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.
[Figure 4] A conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
[Figure 5] A conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.
[Figure 6] A conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
[Figure 7] A conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.
[Figure 8] A conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
[Figure 9] An emotion map on which multiple emotions are mapped.
[Figure 10] An emotion map on which multiple emotions are mapped.
[Figure 11] A sequence diagram showing the processing flow of the data processing system in Example 1.
[Figure 12] A sequence diagram showing the processing flow of the data processing system in Application Example 1.
[Figure 13] A sequence diagram showing the processing flow of the data processing system in Example 2, which incorporates an emotion engine.
[Figure 14] A sequence diagram showing the processing flow of the data processing system in Application Example 2, which incorporates an emotion engine.

Modes for Carrying Out the Invention

[0011] Hereinafter, an example of an embodiment of the system relating to the technology of this disclosure will be described with reference to the attached drawings.

[0012] First, the terminology used in the following description is explained.

[0013] In the following embodiments, a processor denoted by a reference numeral (hereinafter simply referred to as a "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. The processor may also be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), and an APU (Accelerated Processing Unit).

[0014] In the following embodiments, a RAM (Random Access Memory) denoted by a reference numeral is a memory in which information is temporarily stored and which the processor uses as work memory.

[0015] In the following embodiments, storage denoted by a reference numeral is one or more non-volatile storage devices that store various programs, various parameters, and the like. Examples of non-volatile storage devices include flash memory (e.g., an SSD (Solid State Drive)), magnetic disks (e.g., hard disks), and magnetic tapes.

[0016] In the following embodiments, a communication I/F (Interface) denoted by a reference numeral is an interface including a communication processor, an antenna, and the like. The communication I/F controls communication between multiple computers. Examples of communication standards applied to the communication I/F include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), and Bluetooth (registered trademark).

[0017] In the following embodiments, "A and/or B" is synonymous with "at least one of A and B." That is, "A and/or B" means that it may be A alone, B alone, or a combination of A and B. Furthermore, in this specification, the same concept as "A and/or B" applies when three or more items are linked by "and/or."

[0018] [First Embodiment]

[0019] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.

[0020] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.

[0021] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0022] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.

[0023] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.

[0024] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and/or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.

[0025] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.

[0026] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.

[0027] As shown in Figure 2, in the data processing device 12, a specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.

[0028] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0029] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used in conjunction with a specific processing program 56 by the data processing system 10. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0030] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0031] As an embodiment for carrying out the present invention, a system using speech recognition and analysis technology will be described. This system continuously monitors ambient sounds using a microphone installed in a specific terminal.

[0032] The device continuously acquires this audio data and converts it into a digital format. The acquired audio data is immediately converted into text data by an internal speech recognition engine. This text data is then sent to a server. The server analyzes the transmitted text data using natural language processing techniques to understand the context and intent of the conversation.

[0033] In this analysis, the server compares the text data against an internal fraud database. The fraud database contains a collection of various past fraud cases and includes patterns similar to fraudulent activities. The server checks for any matches with these patterns and assesses the likelihood of fraud.
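
The comparison described in [0033] can be sketched as a simple phrase lookup. This is a minimal illustration under stated assumptions, not the method of the disclosure: the in-memory `FRAUD_PATTERNS` list, the function names, the ASCII-only normalization, and the scoring rule (share of known patterns found in the text) are all hypothetical.

```python
import re

# Hypothetical stand-in for the fraud database: phrases seen in past cases.
FRAUD_PATTERNS = [
    "outstanding bill",
    "legal action",
    "pay today",
    "prepaid card",
]


def normalize(text: str) -> str:
    """Lowercase the text and collapse punctuation and whitespace to single spaces.

    ASCII-only, which is an assumption that suits this English example.
    """
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def match_fraud_patterns(text: str, patterns=FRAUD_PATTERNS):
    """Return (matched patterns, likelihood).

    The likelihood here is simply the share of patterns found in the text.
    """
    padded = f" {normalize(text)} "
    # Padding with spaces makes the check match whole words only.
    matched = [p for p in patterns if f" {normalize(p)} " in padded]
    likelihood = len(matched) / len(patterns) if patterns else 0.0
    return matched, likelihood


matched, likelihood = match_fraud_patterns(
    "You have an outstanding bill. Pay today or we will take legal action."
)
print(matched, likelihood)
```

A real system would likely weight patterns differently and tolerate paraphrases; the fixed fraction is used here only to keep the example easy to follow.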

[0034] If the server determines that there is a high probability of fraud, it sends the result to the device. Based on the information received, the device issues a warning to the user. This warning is provided as an audio or text message to inform the user of the potential fraud. The device also automatically sends an alert to the user's designated emergency contacts, such as family members or the police.
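
The notification behavior in [0034] could look like the following sketch. The `UserSettings` and `Notification` types, the field names, and the message wording are hypothetical; the text only states that the user is warned by audio or text and that emergency contacts are alerted automatically.

```python
from dataclasses import dataclass, field


@dataclass
class UserSettings:
    """Hypothetical per-user settings held on the terminal."""
    warn_by_voice: bool = True
    emergency_contacts: list = field(default_factory=list)


@dataclass
class Notification:
    recipient: str
    channel: str  # "voice" or "text"
    message: str


def build_notifications(high_risk: bool, settings: UserSettings) -> list:
    """Build the warning for the user plus one alert per emergency contact."""
    if not high_risk:
        return []
    channel = "voice" if settings.warn_by_voice else "text"
    out = [Notification("user", channel, "Warning: this conversation may be a scam.")]
    for contact in settings.emergency_contacts:
        out.append(Notification(contact, "text", "Alert: a possible scam call was detected."))
    return out


settings = UserSettings(emergency_contacts=["family", "police"])
notifications = build_notifications(True, settings)
for n in notifications:
    print(n.recipient, n.channel, n.message)
```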

[0035] A concrete example is a fictitious billing scam carried out over the phone. The terminal acquires the call content in real time and converts it into text, and the server detects scam-specific phrases such as "there is an outstanding bill" as a signal. If the probability of fraud is assessed as high, a warning or alert is immediately sent to the user and their emergency contacts. By introducing this system, real-time detection and warning of fraudulent activity are achieved, effectively protecting users from becoming victims.

[0036] The following describes the processing flow.

[0037] Step 1:

[0038] The device constantly monitors ambient sound and converts the acquired audio data into a digital format. The converted audio is processed as short audio segments at a pre-set buffering rate.

[0039] Step 2:

[0040] The device uses a speech recognition engine to convert speech data into text data in real time. In doing so, the speech recognition engine employs a machine learning model pre-trained on a large-scale speech dataset.

[0041] Step 3:

[0042] The terminal sends the converted text data to the server. The text data is securely transmitted to the server via the network using an encrypted protocol.

[0043] Step 4:

[0044] The server applies natural language processing algorithms to analyze the received text data. This allows the server to understand the content of the conversation and extract keywords and phrases related to the scam.

[0045] Step 5:

[0046] The server compares the analyzed text against a fraud database. This database contains past fraud cases and is used to identify patterns that are likely to be fraudulent.

[0047] Step 6:

[0048] The server calculates a fraud score and, if it determines that the conversation is highly likely to be a scam, sends the result to the device. This score and the reason for the determination serve as criteria for issuing a warning to the user.

[0049] Step 7:

[0050] Upon receiving the judgment result from the server, the device will send a warning to the user via voice or text message. This warning will inform the user of the potential for fraud and urge them to be cautious.

[0051] Step 8:

[0052] Depending on the user's settings, the device will send an alert to emergency contacts regarding a potential scam. These emergency contacts include family members and the police, prompting them to take appropriate action depending on the situation.

[0053] Step 9:

[0054] Users verify whether or not a scam is actually occurring based on the warnings and alerts they receive. They then provide feedback through their device, and if there are any misidentifications, the data is used for training purposes.

[0055] Step 10:

[0056] When the server receives feedback, it uses it to improve the AI model, thereby increasing the accuracy of subsequent analyses. The database and model are regularly updated based on the feedback.
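
Step 1 above mentions processing the digitized audio as short segments at a pre-set buffering rate. A minimal sketch of that buffering is shown below; the function name `segment_audio`, the use of integer samples, and the segment size are assumptions chosen for illustration.

```python
from typing import Iterable, Iterator, List


def segment_audio(samples: Iterable[int], segment_size: int) -> Iterator[List[int]]:
    """Yield consecutive fixed-size segments; the final segment may be shorter."""
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    buffer: List[int] = []
    for sample in samples:
        buffer.append(sample)
        if len(buffer) == segment_size:
            yield buffer
            buffer = []  # start a fresh buffer so yielded segments are not mutated
    if buffer:
        yield buffer  # flush whatever remains when the stream ends


segments = list(segment_audio(range(10), segment_size=4))
print(segments)
```

Because the function is a generator, each segment can be handed to the speech recognition engine as soon as it fills, without waiting for the whole stream.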

[0057] (Example 1)

[0058] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0059] Fraudulent activities are becoming more sophisticated every day, and conventional countermeasures often fail to detect them quickly. In particular, fraud conducted via voice calls requires rapid, real-time detection and warning, but achieving this has been difficult.

[0060] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0061] In this invention, the server includes an acoustic recognition means that continuously acquires audio information and converts it into text information, an information analysis means that analyzes the text information and compares it with existing fraud patterns, and a communication means that transmits the text information to an information processing device using a secure communication protocol. This makes it possible to detect fraudulent activity in real time and quickly issue warnings to users and related parties.

[0062] "Audio information" refers to data that represents acoustic signals in digital format, and is an essential element for use in speech recognition and analysis.

[0063] "Acoustic recognition means" refers to a technical method or device that receives audio information as input and converts the audio data into textual information.

[0064] "Textual information" refers to text data converted from speech, and is fundamental information for analysis and data processing.

[0065] "Information analysis means" refers to technical methods or devices for processing textual information and recognizing and analyzing specific patterns or intentions.

[0066] "Communication means" refers to technical methods or devices for sending and receiving information between remote devices or equipment, and is intended to transfer data through a secure protocol.

[0067] "Transmission means" refers to a technical method or device for transmitting information to users or other stakeholders based on analysis results, and for issuing warnings or notifications.

[0068] "Communication method" refers to a technical method or device that automatically sends information to a contact specified by the user to prompt a response in an emergency.

[0069] This invention relates to a fraud detection system that uses audio information, and continuously monitors surrounding audio information using a microphone installed in a specific terminal. To implement this system, the terminal is equipped with advanced speech recognition software as an acoustic recognition means. This software can instantly convert acoustic data into text data, which is written information.

[0070] The converted text information is sent to the server using a secure communication protocol (e.g., HTTPS). The server uses a natural language processing engine as an information analysis tool to analyze the text information and compare it with fraud patterns. This engine has an extensive database of existing frauds and performs highly accurate analysis based on past fraud patterns.
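
As an illustration of sending text to the server over HTTPS, the sketch below prepares a request with Python's standard `urllib.request`. The endpoint URL, the JSON field names, and the terminal identifier are hypothetical; the disclosure names HTTPS only as an example protocol and does not specify a message format.

```python
import json
import urllib.request

# Hypothetical endpoint; the source does not name a URL or message format.
SERVER_URL = "https://fraud-analysis.example.com/v1/transcripts"


def build_transcript_request(text: str, terminal_id: str) -> urllib.request.Request:
    """Prepare (but do not send) an HTTPS POST carrying one transcript segment."""
    if not SERVER_URL.startswith("https://"):
        raise ValueError("transcripts must be sent over HTTPS")
    body = json.dumps({"terminal_id": terminal_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        SERVER_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_transcript_request("You have an outstanding bill.", "terminal-14")
# urllib.request.urlopen(request) would perform the actual TLS-protected transfer.
print(request.get_method(), request.full_url)
```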

[0071] If the conversation is deemed highly likely to be a scam, the server uses its outgoing communication method to send a warning to the device. The device receives this warning and sends a notification to the user as a photo. This notification is also provided as an audio message or text message to immediately inform the user of the danger.

[0072] Furthermore, the device utilizes communication methods to automatically send alerts to emergency contacts pre-configured by the user. This automated contact function enables a swift response to the user's family and relevant authorities.

[0073] As a concrete example, consider a typical "It's me" scam. In this case, the terminal acquires the call content in real time and captures scam-specific phrases such as "You have an outstanding bill." The server immediately analyzes this information and, if it assesses the likelihood of fraud as high, issues a warning to the user and their emergency contacts.

[0074] An example of a prompt to input into a generative AI model is: "Please describe the specific steps of a system that analyzes voice data in real time and warns the user if it is suspected of being fraudulent."

[0075] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0076] Step 1:

[0077] The device uses a built-in microphone to acquire ambient sound information in real time. The input is an analog acoustic signal, which is then converted into digital data by a digital conversion processor. This process prepares the audio data for subsequent processing.

[0078] Step 2:

[0079] The converted audio data is input into speech recognition software, which is an acoustic recognition means within the device. Here, a deep learning algorithm is used to convert the audio data into text data (written information). In this process, a noise reduction algorithm is also used to enhance sound quality and improve recognition accuracy. The output generated is text data, which is written information.

[0080] Step 3:

[0081] The terminal sends the generated text information to the server using a secure communication protocol. The input is text data, which is encrypted using protocols such as HTTPS to ensure data security before being sent to the server. The output is text data received on the server side.

[0082] Step 4:

[0083] The server inputs the received text data into a natural language processing (NLP) engine for analysis. This involves contextual analysis of the text and matching it against fraud patterns. Taking the text information as input, the analysis produces evaluation data such as the likelihood of fraud. A generative AI model may be used in this process.

[0084] Step 5:

[0085] Based on text data deemed highly likely to be fraudulent, the server sends the evaluation results to the terminal. The input is the analysis result, which is immediately formatted into a notification format and sent via communication. The output is warning data received by the terminal.

[0086] Step 6:

[0087] The device sends a voice or text warning to the user based on the received warning data. Furthermore, it automatically sends alerts to pre-configured emergency contacts (such as family members or the police). This allows the user and their relevant parties to respond quickly. The input is warning data, and the output is a warning message.
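
The six steps of Example 1 can be sketched as one function that chains the stages. This is only an outline under assumptions: the stage functions are passed in as callables, the stubs `fake_recognize` and `fake_analyze` stand in for the real speech recognition and NLP engines, and the 0.5 threshold is an arbitrary illustrative value that does not appear in the source.

```python
from typing import Callable, List, Optional


def run_detection_pipeline(
    audio: bytes,
    recognize: Callable[[bytes], str],
    analyze: Callable[[str], float],
    threshold: float = 0.5,
) -> Optional[List[str]]:
    """Chain recognition, analysis, and warning generation.

    Returns the list of outgoing messages, or None when no warning is needed.
    """
    text = recognize(audio)        # Step 2: speech to text on the terminal
    likelihood = analyze(text)     # Steps 3-4: text is sent to and analyzed by the server
    if likelihood < threshold:     # Step 5: only high-likelihood results come back
        return None
    return [                       # Step 6: warn the user and alert contacts
        f"Warning to user: possible scam (likelihood {likelihood:.2f})",
        "Alert to emergency contacts: possible scam call detected",
    ]


# Stub stages used only to exercise the flow.
def fake_recognize(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_analyze(text: str) -> float:
    return 0.9 if "outstanding bill" in text.lower() else 0.1


messages = run_detection_pipeline(
    b"You have an outstanding bill.", fake_recognize, fake_analyze
)
print(messages)
```

Passing the stages as arguments keeps the terminal-side and server-side parts replaceable, which mirrors the split between the two devices in the description.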

[0088] (Application Example 1)

[0089] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0090] In today's world, where fraudulent activities are becoming increasingly diverse and sophisticated, consumers are at a higher risk of becoming victims of fraud through telephone and other communication methods. Traditional methods make it difficult to detect and warn about fraud in advance, and rapid countermeasures are needed. Therefore, a system is required that can detect fraudulent activity in real time and effectively issue warnings.

[0091] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0092] In this invention, the server includes a conversion means for continuously acquiring voice information and converting it from voice to text information, an analysis means for analyzing the text information and comparing it with past fraud patterns, a notification means for issuing a warning when it is determined that there is a high probability of fraud, and a warning issuing means for sending the warning to an electronic device as voice or push notification. This enables real-time fraud detection and immediate warning issuance.

[0093] "Audio information" refers to the digital representation of audible speech sounds that have been acquired.

[0094] "Text information" refers to data in string format generated by speech recognition technology based on audio information.

[0095] "Conversion means" refers to a process or device that has the function of converting audio information into text information.

[0096] "Analysis means" refers to a process or device that has the function of comparing text information with past fraud patterns and evaluating the degree of similarity.

[0097] A "notification means" is a process or device that has the function of transmitting a warning to the user via a communication means based on the analysis results.

[0098] A "warning transmission method" is a process or device that has the function of sending information that has been determined to be fraudulent to a user's electronic device via voice or push notification.

[0099] The system for implementing this invention is configured to continuously acquire voice information and detect potential fraud in real time. The server receives voice information transmitted from the user's electronic device. The device is equipped with a high-sensitivity microphone, which is built into smartphones and mobile devices. The voice information is converted into a digital format by the device's software. The conversion uses the Google® Cloud Speech-to-Text API to achieve high-precision speech recognition.

[0100] Next, the server analyzes the received text information. This analysis utilizes natural language processing techniques and software libraries such as NLTK (Natural Language Toolkit). The server compares the text information with a database of previously collected fraud patterns and evaluates whether a matching pattern exists. If there is a high probability of fraudulent activity, the server generates a warning and notifies the user through a warning dissemination system. This notification is delivered to the user's electronic device as an audio or push notification.
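
Since the analysis means is defined as evaluating the degree of similarity between text and past fraud patterns, one simple way to illustrate it is token-set (Jaccard) similarity. To keep the example dependency-free, a regular-expression tokenizer is swapped in where NLTK would be used; the function names and the sample patterns are hypothetical, and Jaccard similarity is an illustrative choice, not something the source specifies.

```python
import re


def tokenize(text: str) -> list:
    """Regex word tokenizer, standing in for an NLTK tokenizer."""
    return re.findall(r"[a-z0-9']+", text.lower())


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_pattern_match(text: str, patterns: list) -> tuple:
    """Return (pattern, similarity) for the stored pattern most similar to the text."""
    tokens = set(tokenize(text))
    scored = [(p, jaccard(tokens, set(tokenize(p)))) for p in patterns]
    return max(scored, key=lambda item: item[1], default=(None, 0.0))


# Hypothetical database entries.
patterns = ["you have an outstanding bill", "your account has been frozen"]
pattern, similarity = best_pattern_match("You have an outstanding bill, sir.", patterns)
print(pattern, round(similarity, 3))
```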

[0101] For example, if a user receives a phone call containing the scam-specific phrase, "You have an outstanding bill," the server will detect it. A warning message is immediately sent to the user's smartphone, allowing them to recognize the risk of fraud.

[0102] Furthermore, when a generative AI model is used to support the operation of this program, prompts such as the following can be utilized: "Write code to detect fraudulent phrases and display warnings. For example, generate code that identifies the phrase 'You have an outstanding bill.'" This allows the system to adapt to new fraud patterns.

[0103] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0104] Step 1:

[0105] The device constantly acquires audio information using its built-in microphone. It receives ambient sound as input and converts it into a digital audio file. This puts the audio data into a form that is easier to handle in subsequent steps.

[0106] Step 2:

[0107] The device converts the digitized audio data into text information using speech recognition technology. It takes a digital audio file as input and converts it to text using the Google Cloud Speech-to-Text API. As output, it produces the conversation in the audio as text information, formatted so that it can be grammatically analyzed.

[0108] Step 3:

[0109] The terminal sends the generated text information to the server. Text information is used as input and sent to the server via a secure communication path (e.g., HTTPS). As output, the server prepares the received text information for analysis.

[0110] Step 4:

[0111] The server analyzes the received text information and understands the context using natural language processing techniques. It uses text information as input and analyzes the meaning and relationships of each word using the NLTK library. The output is the analyzed contextual information.

[0112] Step 5:

[0113] The server compares text information against a database of fraud patterns. It uses the analysis results as input and compares them to known fraud patterns in the database. The output provides an evaluation of whether or not the information is likely to be fraudulent.

[0114] Step 6:

[0115] If the server determines that there is a high probability of fraud, it generates a notification. Using the evaluation results as input, it creates a message to warn the user based on the information identified as fraudulent. The warning content is prepared as output.

[0116] Step 7:

[0117] The server sends a warning to the terminal and notifies the user. Using the generated warning content as input, it sends it to the terminal as an audio or push notification. As output, the user can recognize the risk of fraud and take the necessary action.
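
Steps 6 and 7 (generating the warning content and sending it as a push notification) might be sketched as follows. The JSON field names and the message wording are hypothetical, and actual delivery through a push service is out of scope for this example.

```python
import json


def build_warning_payload(matched_phrases: list, likelihood: float) -> str:
    """Serialize the warning content (Step 6) for delivery as a push notification (Step 7)."""
    phrases = ", ".join(f'"{p}"' for p in matched_phrases)
    payload = {
        "type": "fraud_warning",
        "delivery": "push",
        "title": "Possible scam detected",
        "body": f"This call contains phrases seen in past fraud cases: {phrases}.",
        "likelihood": round(likelihood, 2),
    }
    # ensure_ascii=False keeps non-ASCII text (e.g. Japanese phrases) readable.
    return json.dumps(payload, ensure_ascii=False)


payload = build_warning_payload(["You have an outstanding bill"], 0.87)
print(payload)
```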

[0118] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0119] This invention comprises a system that integrates speech recognition technology, natural language processing technology, and sentiment analysis technology. The system consists of a terminal that acquires the user's voice and a server that performs the analysis.

[0120] The device has the ability to convert ambient sound into digital data and then convert this data into text data in real time through a speech recognition engine. This text data is sent to a server where further analysis is performed.

[0121] The server analyzes the received text data using natural language processing technology and a fraud database to assess the likelihood of fraud. Furthermore, an emotion engine analyzes the voice data to recognize the user's emotional state. If this emotion analysis detects, for example, that the user is experiencing stress or anxiety, the assessed likelihood of fraud is raised accordingly.

[0122] If a high probability of fraud is detected, the server sends the result to the device. The device immediately sends a warning to the user, providing customized advice based on the user's emotional state. For example, a user who is detected as anxious may receive advice such as, "Please stay calm and check carefully." Warnings and alerts are also sent to emergency contacts, enabling a quick response.

[0123] As a concrete example, consider a situation where a user receives a fraudulent phone call. The terminal converts the call content into text, and if the text contains fraud-related phrases, the server quickly identifies the risk of fraud. Furthermore, if surprise or confusion is detected in the user's tone of voice during the call, an enhanced warning is provided, and the user and their family are promptly notified. In this way, the present invention makes it possible to effectively detect fraudulent activity and implement defense measures that take into account the user's emotional state.

[0124] The following describes the processing flow.

[0125] Step 1:

[0126] The device constantly monitors ambient sound and acquires audio data through its microphone. This audio data is converted into a digital format and stored in an internal buffer.

[0127] Step 2:

[0128] The device uses a speech recognition engine to convert stored speech data into text data in real time. To improve the accuracy of speech recognition, the model is trained to handle specific accents and expressions.

[0129] Step 3:

[0130] The device sends voice and text data to the server. This data transmission is performed through an encrypted protocol to enhance security.

[0131] Step 4:

[0132] The server uses natural language processing techniques to extract keywords and phrases related to fraud from the received text data. Furthermore, it compares this data with a fraud database to determine if it is likely to be fraudulent.

[0133] Step 5:

[0134] The server activates an emotion engine to analyze the user's voice data and evaluate the user's emotional state. Emotions identified in this way, such as stress, anxiety, and anger, are factored into the fraud risk assessment.

[0135] Step 6:

[0136] The server calculates a fraud score based on the analysis results and makes a final decision based on that score and the user's emotional state.

[0137] Step 7:

[0138] The server sends the judgment result to the terminal, including information to issue a warning if there is a high probability of fraud.

[0139] Step 8:

[0140] The device displays or voices a warning message to the user. The message includes personalized advice based on the identified emotional state.

[0141] Step 9:

[0142] The device also sends warnings and alerts to emergency contacts (family and police) to encourage a quick response.

[0143] Step 10:

[0144] The user receives a warning and then actually checks whether or not it is a scam. After checking, the results are fed back to the device to help improve the system's accuracy.

[0145] Step 11:

[0146] The server receives user feedback and uses it to update the AI model and improve the accuracy of future analyses.
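
The scoring in Steps 5 and 6 of this flow can be sketched as below. This is a hedged illustration only: the disclosure does not specify how the text-based score and the emotional state are combined, so the additive rule, the `EMOTION_BOOST` weights, and the 0.7 threshold are assumptions introduced here.

```python
# Hypothetical weights: how much each detected emotion raises the score.
EMOTION_BOOST = {"stress": 0.15, "anxiety": 0.2, "anger": 0.1}


def fraud_score(text_score: float, emotions: dict) -> float:
    """Combine the text-based score with emotion intensities (each 0..1)."""
    boost = sum(EMOTION_BOOST.get(name, 0.0) * level
                for name, level in emotions.items())
    # Clamp so the combined score stays within 0..1.
    return max(0.0, min(1.0, text_score + boost))


def decide(text_score: float, emotions: dict, threshold: float = 0.7) -> dict:
    """Return the final score and whether a warning should be issued."""
    score = fraud_score(text_score, emotions)
    return {"score": round(score, 3), "warn": score >= threshold}
```

With these assumed weights, a text score of 0.6 alone stays below the threshold, whereas the same score combined with strong anxiety crosses it.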

[0147] (Example 2)

[0148] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0149] In modern times, fraudulent activities are becoming increasingly sophisticated, leading to a growing number of situations that cause users anxiety and stress. Conventional systems have limitations in detecting fraud and lack appropriate responses that take into account the user's emotional state. This invention aims to efficiently detect the risk of fraud while simultaneously providing protective measures that are tailored to the user's emotional state.

[0150] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0151] In this invention, the server includes recognition means for continuously acquiring voice information and converting it from voice to text information, evaluation means for analyzing the text information and comparing it with past fraud cases, and emotion analysis means for analyzing the user's emotional state and reinforcing the assessment of the possibility of fraud. This enables early detection of fraudulent activity and appropriate, customized responses tailored to the user's emotions.

[0152] "Audio information" refers to data obtained by converting sounds generated around the user into a digital format.

[0153] "Textual information" refers to text data converted by speech recognition technology.

[0154] "Recognition means" refers to a device or algorithm for converting audio information into text information in real time.

[0155] "Evaluation means" refers to the process of analyzing textual information and comparing it with previously collected fraud cases to determine the risk of fraud.

[0156] "Emotion analysis means" refers to technologies that analyze the tone and content of a user's voice to determine their current emotional state.

[0157] "Notification means" refers to a function that sends a warning to the user or their emergency contacts when a potential scam is detected.

[0158] "Advice generation means" refers to a system for providing individually tailored advice based on the user's emotional state.

[0159] This invention includes a system that integrates speech recognition technology, natural language processing technology, and sentiment analysis technology. This system consists of a terminal that acquires the user's voice information and a server that performs the analysis.

[0160] The device uses its built-in acoustic sensor to capture sound from the user's surroundings as digital data. Speech recognition software, typically accessed through a speech recognition API, converts this audio information into text in real time. The converted text is immediately sent to the server.

[0161] The server analyzes textual information using advanced natural language processing algorithms. This analysis utilizes generative AI models as language modeling technology. Furthermore, it has a function to evaluate the likelihood of fraud by comparing it with a previously collected fraud database. In addition, an emotion analysis engine is used to understand the user's emotional state from the tone and content of their voice. This technology helps determine whether the user is experiencing stress or anxiety, thereby enhancing the fraud likelihood assessment.

[0162] If the server determines that fraud is highly likely, it returns that information to the device. The device then immediately sends a warning to the user, including personalized advice tailored to their emotional state. For example, a user experiencing anxiety might receive a message advising them to "calm down and reassess the situation." Furthermore, the device includes a rapid notification function for emergency contacts, enabling immediate action when necessary.

[0163] A concrete example would be a situation where a user receives a suspicious phone call. The device converts the audio to text, and this text is analyzed by the server, resulting in an immediate fraud warning. An example of a prompt message might be, "Convert this audio data to text and check for signs of fraud. Then, analyze the user's emotional state to create an appropriate message." In this way, the system can effectively detect fraudulent activity and provide a defense that takes the user's emotional state into consideration.

[0164] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0165] Step 1:

[0166] The device uses acoustic sensors to acquire ambient sound information. The input is analog audio, which is then converted into digital data. This conversion process utilizes a speech recognition API to convert audio into text in real time. Specifically, the conversion begins when the user presses the "Start Audio Capture" button. The output is the converted text data.

[0167] Step 2:

[0168] The terminal sends the converted text information to the server. The input is text data, which is sent to the server using a secure protocol (e.g., HTTPS). Specifically, the text data is accumulated in a buffer in real time and transferred to the server each time it reaches a certain size. The output is the text information received by the server.

[0169] Step 3:

[0170] The server activates a natural language processing algorithm to analyze textual information. The input is text data received from the terminal, and a generative AI model is used to analyze the context of the text. In this process, the text is compared to a fraud database, and a risk assessment is performed. The output is assessment data indicating the risk of fraud.

[0171] Step 4:

[0172] The server uses an emotion analysis engine to analyze the user's emotional state. The input consists of voice and text information, which is used to estimate the user's emotions. Specifically, factors such as voice tone, speed, and volume are analyzed, and an emotion score is calculated. The output is data indicating the emotional state.

[0173] Step 5:

[0174] If the server determines that there is a high risk of fraud, it sends the evaluation result to the terminal. The input consists of evaluation data and emotional state data, which are used to create a warning. The output is the warning information sent to the terminal.

[0175] Step 6:

[0176] The device immediately issues a warning to the user. The input is warning information received from the server, which is then communicated to the user via the device's display or audio output. It also simultaneously provides customized advice based on the user's emotional state. Specifically, a notification sound plays, and a message is displayed on the screen. The output consists of a warning and advice to the user.

[0177] Step 7:

[0178] If necessary, the device will automatically send a notification to emergency contacts. Input consists of evaluation data from the server and the user's in-app settings, which are used to create the notification. Specifically, a message is sent to pre-registered email addresses or phone numbers. The output is an alert message sent to emergency contacts.
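
The buffering described in Step 2 of this flow (accumulate text, transfer each time a certain size is reached) can be sketched as follows. The class name `TextBuffer`, the `max_chars` limit, and the `send` callable are hypothetical; in a real terminal `send` would presumably post the chunk to the server over HTTPS, which is left out here so the sketch stays self-contained.

```python
class TextBuffer:
    """Accumulates recognized text and flushes it once a size limit is hit."""

    def __init__(self, send, max_chars=200):
        self._send = send            # callable that transmits one chunk
        self._max_chars = max_chars  # assumed size threshold, in characters
        self._parts = []
        self._size = 0

    def add(self, text: str) -> None:
        """Append newly recognized text; transfer when the limit is reached."""
        self._parts.append(text)
        self._size += len(text)
        if self._size >= self._max_chars:
            self.flush()

    def flush(self) -> None:
        """Send whatever has accumulated and reset the buffer."""
        if self._parts:
            self._send(" ".join(self._parts))
            self._parts, self._size = [], 0
```

Calling `flush()` explicitly when capture stops ensures that a final chunk smaller than the threshold is still transferred.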

[0179] (Application Example 2)

[0180] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0181] In recent years, fraudulent activities have become more sophisticated, making it difficult to adequately defend against them using traditional methods. Furthermore, many users experience anxiety and stress when encountering fraud, which can prevent them from taking appropriate action. There is a need for a system that addresses these issues, effectively detects potential fraud, and provides users with appropriate warnings and advice.

[0182] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0183] In this invention, the server includes a conversion means for continuously acquiring voice information and converting it from voice to text information, an analysis means for analyzing the text information and comparing it with past fraud patterns, a notification means for issuing a warning when it is determined that there is a high probability of fraud, an emotion analysis means for analyzing the tone of voice information and recognizing the user's emotional state, and an advice means for providing customized advice based on the user's emotional state. This makes it possible to quickly detect fraudulent activity and take appropriate action according to the user's emotional state.

[0184] "Voice information" refers to data obtained by converting ambient sounds into digital signals, which are used for conversation and voice commands.

[0185] "Textual information" refers to text data obtained by converting audio information using speech recognition technology, and is recorded as a string of characters.

[0186] "Conversion means" refers to technologies and devices for converting audio information into text information, and includes speech recognition engines.

[0187] "Analysis means" refers to technologies and devices for examining textual information and comparing it with previously recorded fraud patterns.

[0188] "Notification means" refers to technologies and devices used to issue warnings or alerts to users based on analysis results.

[0189] "Emotion analysis means" refers to technologies and devices that recognize a user's emotional state based on voice information, analyzing voice tone and word choice.

[0190] "Advice means" refers to technologies and devices that provide appropriate advice and action suggestions based on the user's emotional state.

[0191] A "server" is a computer system used for various data processing tasks, such as analyzing audio information and recognizing emotional states.

[0192] This invention is realized by a system that continuously acquires and analyzes audio information. Specifically, a terminal placed near the user acquires surrounding audio information. The terminal converts this audio information into digital data and then converts it into text information in real time using a speech recognition engine (e.g., Google Speech-to-Text API). This text information is sent to a server for further processing.

[0193] The server uses natural language processing technology (e.g., Google Cloud Natural Language API) to analyze textual information and compare it against past fraud patterns. If it is determined that there is a high probability of fraud, a notification system is activated and a warning is sent to the user.

[0194] The tone of voice information is also analyzed on the server to recognize the user's emotional state. Technologies such as IBM Watson® Tone Analyzer are used for emotion analysis. If the user is experiencing anxiety or stress, personalized advice is provided through the advice system.

[0195] For example, if a user receives a fraudulent phone call regarding their bank account, the device converts the conversation into text, and the server analyzes this information to identify the risk of fraud. Furthermore, if alarm is detected from the tone of voice, a warning is issued to the user saying, "This may be a scam. Please remain calm."

[0196] An example of a prompt for a generative AI model is: "Please tell me how to improve a smartphone app that performs voice analysis and emotion recognition to detect scam calls. I'm thinking about how to provide effective advice when the user is emotionally unstable."

[0197] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0198] Step 1:

[0199] The device acquires ambient audio information and converts it into digital signals. The input is audio data obtained from the device's microphone. The device then uses the Google Speech-to-Text API to convert this audio data into text information, which is the output of this step.

[0200] Step 2:

[0201] The terminal sends the converted text information to the server. The text information is passed to the server as is, without any processing. The server then prepares for further data analysis.

[0202] Step 3:

[0203] The server analyzes the received text information and compares it to past fraud patterns using natural language processing techniques. During this process, it uses the Google Cloud Natural Language API to process the text information and obtain output indicating the potential for fraud.

[0204] Step 4:

[0205] The server analyzes the tone of the voice information and uses IBM Watson Tone Analyzer to recognize the user's emotional state. This analysis yields an output of emotional states, including emotions such as surprise and anxiety.

[0206] Step 5:

[0207] The server issues warnings to users based on the likelihood of fraud and their emotional state. It sends notifications to the user's device via a notification system, customizing the message based on this data.

[0208] Step 6:

[0209] Users receive warnings and customized advice from their device and take appropriate action. Output to the user is displayed on the device screen or as an audio notification.
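
The message customization in Step 5 of this flow can be sketched as a simple lookup. This is an assumption-laden illustration: the `ADVICE` table, the default text, and the 0.8 cut-off for a "high" risk label are hypothetical, and only the phrases "This may be a scam" and "Please stay calm and check carefully" are taken from the examples in the text.

```python
# Hypothetical advice texts keyed by the recognized emotional state.
ADVICE = {
    "anxiety": "Please stay calm and check carefully.",
    "surprise": "Take a moment before responding to the caller.",
}
DEFAULT_ADVICE = "Do not share personal or payment details."


def compose_warning(fraud_likelihood: float, emotion: str) -> str:
    """Build a warning whose advice depends on the user's emotional state."""
    level = "high" if fraud_likelihood >= 0.8 else "elevated"
    advice = ADVICE.get(emotion, DEFAULT_ADVICE)
    return f"This may be a scam (risk: {level}). {advice}"
```

An emotion label that is not in the table falls back to the default advice rather than failing.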

[0210] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0211] Data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives as input prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
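
The pairing of a prompt containing instructions with inference data can be sketched as below. This is a minimal illustration only: `build_prompt` and its layout are hypothetical, no model is actually called, and the way the data generation model 58 really receives its inputs is not specified in the text.

```python
def build_prompt(instruction: str, transcript: str) -> str:
    """Join an instruction and text inference data into one prompt string."""
    return (
        f"{instruction.strip()}\n\n"
        "Transcript:\n"
        f'"""\n{transcript.strip()}\n"""'
    )
```

Delimiting the transcript keeps the instruction clearly separated from the inference data it applies to.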

[0212] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.

[0213] [Second Embodiment]

[0214] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.

[0215] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.

[0216] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0217] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.

[0218] The microphone 238 accepts instructions from the user 20 by receiving the voice of the user 20. The microphone 238 captures the voice of the user 20, converts it into audio data, and outputs the audio data to the processor 46. The speaker 240 outputs audio according to instructions from the processor 46.

[0219] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0220] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0221] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0222] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0223] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0224] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0225] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0226] As an embodiment for carrying out the present invention, a system using speech recognition and analysis technology will be described. This system continuously monitors ambient sounds using a microphone installed in a specific terminal.

[0227] The device continuously acquires this audio data and converts it into a digital format. The acquired audio data is immediately converted into text data by an internal speech recognition engine. This text data is then sent to a server. The server analyzes the transmitted text data using natural language processing techniques to understand the context and intent of the conversation.

[0228] In this analysis, the server compares the text data against an internal fraud database. The fraud database contains a collection of various past fraud cases and includes patterns similar to fraudulent activities. The server checks for any matches with these patterns and assesses the likelihood of fraud.

[0229] If the server determines that there is a high probability of fraud, it sends the result to the device. Based on the information received, the device issues a warning to the user. This warning is provided as an audio or text message to inform the user of the potential fraud. The device also automatically sends an alert to the user's designated emergency contacts, such as family members or the police.

[0230] A concrete example of implementation is a scenario in which a billing scam is carried out over the phone. The terminal acquires the call content in real time and converts it into text containing scam-specific phrases such as "there is an outstanding bill," and the server detects this signal. If the probability of fraud is assessed as high, a warning or alert is immediately sent to the user and their emergency contact. By introducing this system, real-time detection and warning of fraudulent activity are achieved, effectively protecting users from becoming victims.

[0231] The following describes the processing flow.

[0232] Step 1:

[0233] The device constantly monitors ambient sound and converts the acquired audio data into a digital format. The converted audio is processed as short audio segments at a pre-set buffering rate.

[0234] Step 2:

[0235] The device uses a speech recognition engine to convert speech data into text data in real time. In doing so, the speech recognition engine employs a machine learning model pre-trained on a large-scale speech dataset.

[0236] Step 3:

[0237] The terminal sends the converted text data to the server. The text data is securely transmitted to the server via the network using an encrypted protocol.

[0238] Step 4:

[0239] The server applies natural language processing algorithms to analyze the received text data. This allows the server to understand the content of the conversation and extract keywords and phrases related to the scam.

[0240] Step 5:

[0241] The server compares the analyzed text against a fraud database. This database contains past fraud cases and is used to identify patterns that are likely to be fraudulent.

[0242] Step 6:

[0243] The server calculates a fraud score and, if it determines that the conversation is highly likely to be a scam, sends the result to the device. This score and the reason for the determination serve as criteria for issuing a warning to the user.

[0244] Step 7:

[0245] Upon receiving the judgment result from the server, the device will send a warning to the user via voice or text message. This warning will inform the user of the potential for fraud and urge them to be cautious.

[0246] Step 8:

[0247] Depending on the user's settings, the device will send an alert to emergency contacts regarding a potential scam. These emergency contacts include family members and the police, prompting them to take appropriate action depending on the situation.

[0248] Step 9:

[0249] Users verify whether or not a scam is actually occurring based on the warnings and alerts they receive. They then provide feedback through their device, and if there are any misidentifications, the data is used for training purposes.

[0250] Step 10:

[0251] When the server receives feedback, it uses it to improve the AI model, thereby increasing the accuracy of subsequent analyses. The database and model are regularly updated based on the feedback.
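
The segmenting of converted audio into short pieces in Step 1 of this flow can be sketched as follows. The function name, the 16 kHz sample rate, and the 0.5 second segment length are assumptions for illustration; the text only states that audio is processed as short segments at a pre-set buffering rate.

```python
def segment_audio(samples, sample_rate=16000, segment_seconds=0.5):
    """Split a sequence of PCM samples into fixed-length segments.

    The last segment may be shorter than the others.
    """
    size = int(sample_rate * segment_seconds)
    if size <= 0:
        raise ValueError("segment size must be positive")
    return [samples[i:i + size] for i in range(0, len(samples), size)]
```

Each returned segment could then be handed to the speech recognition engine of Step 2 in turn.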

[0252] (Example 1)

[0253] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0254] Fraudulent activities are becoming more sophisticated every day, and conventional countermeasures often fail to detect them quickly. In particular, fraud conducted via voice calls requires rapid, real-time detection and warning, but achieving this has been difficult.

[0255] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0256] In this invention, the server includes an acoustic recognition means that continuously acquires voice information and converts it from sound to text information, an information analysis means that analyzes the text information and compares it with existing fraud patterns, and a communication means that transmits the text information to an information processing device using a secure communication protocol. This makes it possible to detect fraudulent activity in real time and quickly issue warnings to users and related parties.

[0257] "Audio information" refers to data that represents acoustic signals in digital format, and is an essential element for use in speech recognition and analysis.

[0258] "Acoustic recognition means" refers to a technical method or device that receives audio information as input and converts the audio data into textual information.

[0259] "Textual information" refers to text data converted from speech, and is fundamental information for analysis and data processing.

[0260] "Information analysis means" refers to technical methods or devices for processing textual information and recognizing and analyzing specific patterns or intentions.

[0261] "Communication means" refers to technical methods or devices for sending and receiving information between remote devices or equipment, and is intended to transfer data through a secure protocol.

[0262] "Transmission means" refers to a technical method or device for transmitting information to users or other stakeholders based on analysis results, and for issuing warnings or notifications.

[0263] "Communication method" refers to a technical technique or device that automatically sends information to a contact specified by the user to prompt a response in an emergency.

[0264] This invention relates to a fraud detection system that uses audio information, and continuously monitors surrounding audio information using a microphone installed in a specific terminal. To implement this system, the terminal is equipped with advanced speech recognition software as an acoustic recognition means. This software can instantly convert acoustic data into text data, which is written information.

[0265] The converted text information is sent to the server using a secure communication protocol (e.g., HTTPS). The server uses a natural language processing engine as an information analysis tool to analyze the text information and compare it with fraud patterns. This engine has an extensive database of existing frauds and performs highly accurate analysis based on past fraud patterns.

[0266] If the conversation is deemed highly likely to be a scam, the server uses the transmission means to send a warning to the device. The device receives this warning and sends a notification to the user as a photo. This notification is also provided as an audio message or text message to immediately inform the user of the danger.

[0267] Furthermore, the device utilizes communication methods to automatically send alerts to emergency contacts pre-configured by the user. This automated contact function enables a swift response to the user's family and relevant authorities.

[0268] As a concrete example, consider a typical "It's me" scam. In this case, the terminal acquires the call content in real time and captures scam-specific phrases such as "You have an outstanding bill." The server immediately analyzes this information and, if it assesses the likelihood of fraud as high, issues a warning to the user and their emergency contact.

[0269] An example of a prompt to input into a generative AI model is: "Please describe the specific steps of a system that analyzes voice data in real time and warns the user if it is suspected of being fraudulent."

[0270] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0271] Step 1:

[0272] The device uses a built-in microphone to acquire ambient sound information in real time. The input is an analog acoustic signal, which is then converted into digital data by a digital conversion processor. This process prepares the audio data for subsequent processing.

[0273] Step 2:

[0274] The converted audio data is input into speech recognition software, which is an acoustic recognition means within the device. Here, a deep learning algorithm is used to convert the audio data into text data (written information). In this process, a noise reduction algorithm is also used to enhance sound quality and improve recognition accuracy. The output generated is text data, which is written information.

[0275] Step 3:

[0276] The terminal sends the generated text information to the server using a secure communication protocol. The input is text data, which is encrypted using protocols such as HTTPS to ensure data security before being sent to the server. The output is text data received on the server side.

[0277] Step 4:

[0278] The server inputs the received text data into a natural language processing (NLP) engine for analysis. This involves contextual analysis of the text and matching it against fraud patterns. The input is the text information, and the output is evaluation data such as the likelihood of fraud. Generative AI models are useful in this process.

[0279] Step 5:

[0280] When the text data is judged highly likely to be fraudulent, the server sends the evaluation results to the terminal. The input is the analysis result, which is immediately converted into a notification format and sent via the communication means. The output is warning data received by the terminal.

[0281] Step 6:

[0282] The device sends a voice or text warning to the user based on the received warning data. Furthermore, it automatically sends alerts to pre-configured emergency contacts (such as family members or the police). This allows the user and their relevant parties to respond quickly. The input is warning data, and the output is a warning message.
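
Steps 5 and 6 can be illustrated with a minimal Python sketch. It assumes the warning data carries a likelihood and the phrase that triggered it; the names WarningData, format_warning, and dispatch_warning, and the two callback parameters, are hypothetical and do not appear in the specification. The callbacks stand in for whatever voice, text, or messaging channel the terminal actually uses.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class WarningData:
    """Warning data received by the terminal (hypothetical shape)."""
    likelihood: float      # fraud likelihood in the range 0.0 to 1.0
    matched_phrase: str    # phrase that led to the evaluation


def format_warning(warning: WarningData) -> str:
    """Turn warning data into the message shown or spoken to the user."""
    percent = round(warning.likelihood * 100)
    return (f"Warning: this call may be a scam ({percent}% likelihood). "
            f'Suspicious phrase: "{warning.matched_phrase}".')


def dispatch_warning(
    warning: WarningData,
    notify_user: Callable[[str], None],
    notify_contact: Callable[[str, str], None],
    contacts: List[str],
) -> List[str]:
    """Warn the user first, then alert every pre-configured emergency contact.

    Returns the contacts that were alerted, in order.
    """
    message = format_warning(warning)
    notify_user(message)
    alerted = []
    for contact in contacts:
        notify_contact(contact, message)
        alerted.append(contact)
    return alerted
```

Passing the delivery channels in as callbacks keeps the sketch independent of any particular notification service; a real terminal would supply functions that play audio, show text, or send a message.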

[0283] (Application Example 1)

[0284] Next, Application Example 1 will be described. In the following description, the data processing device 12 is referred to as a "server", and the smart glasses 214 are referred to as a "terminal".

[0285] In modern times when fraud is diversifying and becoming more sophisticated, the risk that consumers will suffer fraud through telephone or communication means is increasing. With conventional methods, it is difficult to detect fraud in advance and issue a warning, and prompt countermeasures are required. For this reason, a system that can detect fraud in real time and effectively issue a warning is needed.

[0286] The specific processing by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0287] In this invention, the server includes conversion means for constantly acquiring voice information and converting it from voice to text information, analysis means for analyzing the text information and comparing it with past fraud patterns, notification means for issuing a warning when it is determined that there is a high possibility of fraud, and warning transmission means for transmitting the warning to an electronic device as voice or push notification. Thereby, real-time detection of fraud and immediate warning transmission become possible.

[0288] "Voice information" is a digital representation of the acquired audible language voice.

[0289] "Text information" is data in string format generated by voice recognition technology based on voice information.

[0290] "Conversion means" is a process or device having a function of converting voice information into text information.

[0291] "Analysis means" is a process or device having a function of comparing text information with past fraud patterns and evaluating the degree of coincidence.

[0292] A "notification means" is a process or device that has the function of transmitting a warning to the user via a communication means based on the analysis results.

[0293] A "warning transmission means" is a process or device that has the function of sending information that has been determined to be fraudulent to a user's electronic device via voice or push notification.

[0294] The system for implementing this invention is configured to continuously acquire voice information and detect potential fraud in real time. The server receives voice information transmitted from the user's electronic device. The device is equipped with a high-sensitivity microphone, which is built into smartphones and mobile devices. The voice information is converted into a digital format by the device's software. The Google Cloud Speech-to-Text API is used for this conversion to achieve high-precision speech recognition.

[0295] Next, the server analyzes the received text information. This analysis utilizes natural language processing techniques and software libraries such as NLTK (Natural Language Toolkit). The server compares the text information with a database of previously collected fraud patterns and evaluates whether a matching pattern exists. If there is a high probability of fraudulent activity, the server generates a warning and notifies the user through the warning transmission means. This notification is delivered to the user's electronic device as a voice or push notification.

[0296] For example, if a user receives a phone call containing the scam-specific phrase, "You have an outstanding bill," the server will detect it. A warning message is immediately sent to the user's smartphone, allowing them to recognize the risk of fraud.

[0297] Furthermore, when a generative AI model is used to support the operation of this program, prompts such as the following can be utilized: "Write code to detect fraudulent phrases and display warnings. For example, generate code that identifies the phrase 'You have an outstanding bill.'" This allows the system to adapt to new fraud patterns.
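
As an illustration of the kind of code such a prompt asks for, the following Python sketch detects a known phrase in a transcript and builds a warning. Only the phrase "You have an outstanding bill" comes from the text; the function names and the normalization rules (lowercasing and ignoring punctuation) are assumptions made for the example.

```python
import re
from typing import List, Optional

# Phrase list; new fraud phrases can be appended as they are discovered.
FRAUD_PHRASES: List[str] = ["You have an outstanding bill"]


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def find_fraud_phrase(transcript: str,
                      phrases: List[str] = FRAUD_PHRASES) -> Optional[str]:
    """Return the first known fraud phrase found in the transcript, if any."""
    # Padding with spaces makes the match respect word boundaries.
    haystack = f" {normalize(transcript)} "
    for phrase in phrases:
        if f" {normalize(phrase)} " in haystack:
            return phrase
    return None


def warning_for(transcript: str) -> Optional[str]:
    """Build a warning message, or return None when nothing matches."""
    phrase = find_fraud_phrase(transcript)
    if phrase is None:
        return None
    return f'Warning: possible scam. Detected phrase: "{phrase}".'
```

Exact phrase matching of this kind is easy to audit but misses paraphrases, which is one reason the specification also compares text against a database of fraud patterns.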

[0298] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0299] Step 1:

[0300] The device constantly acquires audio information using its built-in microphone. It receives ambient sound as input and converts it into a digital audio file. This makes the audio data easier to handle in subsequent processing.

[0301] Step 2:

[0302] The device converts the digitized audio data into text information using speech recognition technology. It takes the digital audio file as input and converts it to text using the Google Cloud Speech-to-Text API. As output, the conversation in the audio is obtained as text information in a form that allows grammatical analysis.

[0303] Step 3:

[0304] The terminal sends the generated text information to the server. Text information is used as input and sent to the server via a secure communication path (e.g., HTTPS). As output, the server prepares the received text information for analysis.

[0305] Step 4:

[0306] The server analyzes the received text information and uses natural language processing technology to understand the context. Using the text information as input, it analyzes the meaning and relationship of each word using the NLTK library. As output, the analyzed context information is obtained.

[0307] Step 5:

[0308] The server compares the text information with a database of fraud patterns. Using the analysis result as input, it compares with the known fraud patterns in the database. As output, an evaluation result of whether there is suspicion of fraud is obtained.

[0309] Step 6:

[0310] If the server determines that the possibility of fraud is high, it generates a notification. Using the evaluation result as input, it creates a message for warning the user based on the information determined to be fraud. As output, the warning content is prepared.

[0311] Step 7:

[0312] The server sends the warning to the terminal to notify the user. Using the generated warning content as input, it sends it to the terminal as a voice or push notification. As output, the user can recognize the risk of fraud and take necessary actions.
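
Steps 4 to 6 can be sketched as a token-overlap comparison. This is a simplified stand-in, not the NLTK-based analysis named in the text: tokenization uses a regular expression from the Python standard library, the score is the share of a pattern's words that appear in the transcript, and the 0.8 threshold, the pattern names, and the second pattern are hypothetical.

```python
import re
from typing import Dict, Optional, Set, Tuple

# Small stand-in for the fraud pattern database.
FRAUD_PATTERNS: Dict[str, str] = {
    "billing": "you have an outstanding bill",
    "impersonation": "it's me I need money urgently",  # hypothetical entry
}


def tokenize(text: str) -> Set[str]:
    """Split text into a set of lowercase word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))


def best_match(transcript: str,
               patterns: Dict[str, str] = FRAUD_PATTERNS
               ) -> Tuple[Optional[str], float]:
    """Return the best-matching pattern name and its score.

    The score is the fraction of the pattern's words found in the transcript.
    """
    words = tokenize(transcript)
    best_name, best_score = None, 0.0
    for name, pattern in patterns.items():
        pattern_words = tokenize(pattern)
        score = len(pattern_words & words) / len(pattern_words)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


def evaluate(transcript: str, threshold: float = 0.8) -> Optional[str]:
    """Create the warning content when the best score reaches the threshold."""
    name, score = best_match(transcript)
    if name is None or score < threshold:
        return None
    return f"Warning: this conversation resembles a known '{name}' fraud pattern."
```

Because the comparison ignores word order, it tolerates extra words around a pattern, at the cost of occasional false matches on long transcripts.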

[0313] Furthermore, an emotion engine for estimating the user's emotion may be combined. That is, the specific processing unit 290 may estimate the user's emotion using the emotion identification model 59 and perform specific processing using the user's emotion.

[0314] The present invention comprises a system that integrates voice recognition technology, natural language processing technology, and emotion analysis technology. The system is composed of a terminal that acquires the user's voice and a server that performs analysis.

[0315] The device has the ability to convert ambient sound into digital data and then convert this data into text data in real time through a speech recognition engine. This text data is sent to a server where further analysis is performed.

[0316] The server analyzes the received text data using natural language processing technology and a fraud database to assess the likelihood of fraud. Furthermore, an emotion engine analyzes the voice data to recognize the user's emotional state. If this emotion analysis detects, for example, that the user is experiencing stress or anxiety, the assessed likelihood of fraud is raised further.

[0317] If a high probability of fraud is detected, the server sends the result to the device. The device immediately sends a warning to the user, providing customized advice based on the user's emotional state. For example, a user who is detected as anxious may receive advice such as, "Please stay calm and check carefully." Warnings and alerts are also sent to emergency contacts, enabling a quick response.

[0318] As a concrete example, consider a situation where a user receives a fraudulent phone call. The terminal transcribes the call content, and if it contains fraud-related phrases, the server quickly identifies the risk of fraud. Furthermore, if the user's surprise or confusion is detected from the tone of voice during the call, an enhanced warning is provided, and prompt notification is sent to the user and their family. In this way, the present invention makes it possible to effectively detect fraudulent activity and implement defense measures that take into account the user's emotional state.
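
One way to picture how a detected emotion raises the assessed likelihood is the Python sketch below. The specification does not give a formula, so the additive boost, the per-emotion weights, and the two thresholds are all assumptions chosen for illustration.

```python
from typing import Dict

# Hypothetical boost applied per emotion at full intensity.
EMOTION_BOOST: Dict[str, float] = {
    "stress": 0.15,
    "anxiety": 0.15,
    "surprise": 0.10,
    "confusion": 0.10,
}


def adjusted_likelihood(text_likelihood: float,
                        emotions: Dict[str, float]) -> float:
    """Raise the text-based likelihood according to detected emotions.

    `emotions` maps an emotion name to an intensity between 0.0 and 1.0.
    The result is capped at 1.0.
    """
    boost = sum(EMOTION_BOOST.get(name, 0.0) * intensity
                for name, intensity in emotions.items())
    return min(1.0, text_likelihood + boost)


def warning_level(likelihood: float,
                  threshold: float = 0.7,
                  enhanced: float = 0.9) -> str:
    """Map a likelihood to 'none', 'standard', or 'enhanced' (assumed levels)."""
    if likelihood >= enhanced:
        return "enhanced"
    if likelihood >= threshold:
        return "standard"
    return "none"
```

With these assumed numbers, a text likelihood of 0.6 alone produces no warning, while the same call with strong anxiety detected crosses the standard threshold.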

[0319] The following describes the processing flow.

[0320] Step 1:

[0321] The device constantly monitors ambient noise and acquires audio data through its microphone. This audio data is converted into a digital format and stored in an internal buffer.

[0322] Step 2:

[0323] The device uses a speech recognition engine to convert stored speech data into text data in real time. To improve the accuracy of speech recognition, the model is trained to handle specific accents and expressions.

[0324] Step 3:

[0325] The device sends voice and text data to the server. This data transmission is performed through an encrypted protocol to enhance security.

[0326] Step 4:

[0327] The server uses natural language processing techniques to extract keywords and phrases related to fraud from the received text data. Furthermore, it compares this data with a fraud database to determine if it is likely to be fraudulent.

[0328] Step 5:

[0329] The server activates an emotion engine to analyze the user's voice data and evaluate the user's emotional state. Identified emotions such as stress, anxiety, and anger influence the fraud risk assessment.

[0330] Step 6:

[0331] The server calculates a fraud score based on the analysis results and makes a final decision based on that score and the user's emotional state.

[0332] Step 7:

[0333] The server sends the judgment result to the terminal, including information to issue a warning if there is a high probability of fraud.

[0334] Step 8:

[0335] The device displays or voices a warning message to the user. The message includes personalized advice based on the identified emotional state.

[0336] Step 9:

[0337] The device also sends warnings and alerts to emergency contacts (family and police) to encourage a quick response.

[0338] Step 10:

[0339] The user receives a warning and then actually checks whether or not it is a scam. After checking, the results are fed back to the device to help improve the system's accuracy.

[0340] Step 11:

[0341] The server receives user feedback and uses it to update the AI model and improve the accuracy of future analyses.
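
The feedback loop in Steps 10 and 11 can be sketched in Python as follows. The specification only states that feedback is used to update the AI model; this example substitutes a much simpler mechanism, a per-pattern weight nudged toward 1.0 when the user confirms a scam and toward 0.0 on a false alarm. The function names, the 0.5 starting weight, and the 0.1 learning rate are hypothetical.

```python
from typing import Dict


def update_weight(weight: float, confirmed_scam: bool,
                  rate: float = 0.1) -> float:
    """Move a weight a fraction `rate` of the way toward the feedback target."""
    target = 1.0 if confirmed_scam else 0.0
    return weight + rate * (target - weight)


def apply_feedback(weights: Dict[str, float], pattern: str,
                   confirmed_scam: bool) -> Dict[str, float]:
    """Return a copy of `weights` with one pattern updated from user feedback.

    Patterns seen for the first time start from a neutral weight of 0.5.
    """
    updated = dict(weights)
    updated[pattern] = update_weight(updated.get(pattern, 0.5), confirmed_scam)
    return updated
```

Returning a new dictionary rather than mutating the input makes it straightforward to keep the previous weights for comparison or rollback.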

[0342] (Example 2)

[0343] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0344] In modern times, fraudulent activities are becoming increasingly sophisticated, leading to a growing number of situations that cause users anxiety and stress. Conventional systems have limitations in detecting fraud and lack appropriate responses that take into account the user's emotional state. This invention aims to efficiently detect the risk of fraud while simultaneously providing protective measures that are tailored to the user's emotional state.

[0345] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0346] In this invention, the server includes recognition means for continuously acquiring voice information and converting it from voice to text information, evaluation means for analyzing the text information and comparing it with past fraud cases, and emotion analysis means for analyzing the user's emotional state and reinforcing the assessment of the possibility of fraud. This enables early detection of fraudulent activity and appropriate, customized responses tailored to the user's emotions.

[0347] "Audio information" refers to data obtained by converting sounds generated around the user into a digital format.

[0348] "Textual information" refers to text data converted by speech recognition technology.

[0349] "Recognition means" refers to a device or algorithm for converting audio information into text information in real time.

[0350] "Evaluation means" refers to the process of analyzing textual information and comparing it with previously collected fraud cases to determine the risk of fraud.

[0351] "Emotion analysis means" refers to technologies that analyze the tone and content of a user's voice to determine their current emotional state.

[0352] "Notification means" refers to a function that sends a warning to the user or their emergency contacts when a potential scam is detected.

[0353] "Advice generation means" refers to a system for providing individually tailored advice based on the user's emotional state.

[0354] This invention includes a system that integrates speech recognition technology, natural language processing technology, and sentiment analysis technology. This system consists of a terminal that acquires the user's voice information and a server that performs the analysis.

[0355] The device uses its built-in acoustic sensor to capture sound from the user's surroundings as digital data. Speech recognition software, typically a speech recognition API, converts this audio information into text in real time. The converted text is immediately sent to the server.

[0356] The server analyzes textual information using advanced natural language processing algorithms. This analysis utilizes generative AI models as language modeling technology. Furthermore, it has a function to evaluate the likelihood of fraud by comparing it with a previously collected fraud database. In addition, an emotion analysis engine is used to understand the user's emotional state from the tone and content of their voice. This technology helps determine whether the user is experiencing stress or anxiety, thereby enhancing the fraud likelihood assessment.

[0357] If the server determines that fraud is highly likely, it returns that information to the device. The device then immediately sends a warning to the user, including personalized advice tailored to their emotional state. For example, a user experiencing anxiety might receive a message advising them to "calm down and reassess the situation." Furthermore, the device includes a rapid notification function for emergency contacts, enabling immediate action when necessary.

[0358] A concrete example would be a situation where a user receives a suspicious phone call. The device converts the audio to text, and this text is analyzed by the server, resulting in an immediate fraud warning. An example of a prompt message might be, "Convert this audio data to text and check for signs of fraud. Then, analyze the user's emotional state to create an appropriate message." In this way, the system can effectively detect fraudulent activity and provide a defense that takes the user's emotional state into consideration.
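
The emotion-tailored advice described above can be sketched as a simple lookup in Python. Only the advice for anxiety is taken from the text; the other entry, the default advice, and the names ADVICE_BY_EMOTION and build_warning are hypothetical.

```python
from typing import Dict, Optional

# Advice keyed by the user's estimated emotional state.
ADVICE_BY_EMOTION: Dict[str, str] = {
    "anxiety": "Please calm down and reassess the situation.",
    # Hypothetical entry, not taken from the specification:
    "confusion": "Take your time and ask a family member before acting.",
}

# Hypothetical fallback used when no emotion-specific advice exists.
DEFAULT_ADVICE = "Do not share personal or payment details."


def build_warning(fraud_likely: bool, emotion: str) -> Optional[str]:
    """Return a warning with tailored advice, or None when no warning is due."""
    if not fraud_likely:
        return None
    advice = ADVICE_BY_EMOTION.get(emotion, DEFAULT_ADVICE)
    return f"This may be a scam. {advice}"
```

A fixed lookup keeps the messages predictable; the specification also contemplates generating such messages with a generative AI model, which this sketch does not attempt.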

[0359] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0360] Step 1:

[0361] The device uses acoustic sensors to acquire ambient sound information. The input is analog audio, which is then converted into digital data. This conversion process utilizes a speech recognition API to convert audio into text in real time. Specifically, the conversion begins when the user presses the "Start Audio Capture" button. The output is the converted text data.

[0362] Step 2:

[0363] The terminal sends the converted character information to the server. The input is text data, which is sent to the server using a secure protocol (e.g., HTTPS). Specifically, the text data is accumulated in a buffer in real time and transferred to the server each time it reaches a certain size. The output is the character information received by the server.

[0364] Step 3:

[0365] The server activates a natural language processing algorithm to analyze textual information. The input is text data received from the terminal, and a generative AI model is used to analyze the context of the text. In this process, the text is compared to a fraud database, and a risk assessment is performed. The output is assessment data indicating the risk of fraud.

[0366] Step 4:

[0367] The server uses an emotion analysis engine to analyze the user's emotional state. The input consists of voice and text information, which is used to estimate the user's emotions. Specifically, factors such as voice tone, speed, and volume are analyzed, and an emotion score is calculated. The output is data indicating the emotional state.

[0368] Step 5:

[0369] If the server determines that there is a high risk of fraud, it sends the evaluation result to the terminal. The input consists of the evaluation data and the emotional state data, which are used to create a warning. The output is the warning information sent to the terminal.

[0370] Step 6:

[0371] The device immediately issues a warning to the user. The input is warning information received from the server, which is then communicated to the user via the device's display or audio output. It also simultaneously provides customized advice based on the user's emotional state. Specifically, a notification sound plays, and a message is displayed on the screen. The output consists of a warning and advice to the user.

[0372] Step 7:

[0373] If necessary, the device will automatically send a notification to emergency contacts. Input consists of evaluation data from the server and the user's in-app settings, which are used to create the notification. Specifically, a message is sent to pre-registered email addresses or phone numbers. The output is an alert message sent to emergency contacts.
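
The size-triggered transfer in Step 2 can be sketched in Python as a small buffer class. The class name TextBuffer, the character-count threshold, and the send callback are assumptions; in a real terminal the callback would transmit the text to the server over a secure protocol such as HTTPS.

```python
from typing import Callable, List


class TextBuffer:
    """Accumulates recognized text and sends it once it reaches a set size."""

    def __init__(self, send: Callable[[str], None], max_chars: int = 200):
        self._send = send            # called with each batch of text
        self._max_chars = max_chars  # flush threshold in characters
        self._parts: List[str] = []
        self._size = 0

    def add(self, fragment: str) -> None:
        """Store a fragment and flush when the threshold is reached."""
        self._parts.append(fragment)
        self._size += len(fragment)
        if self._size >= self._max_chars:
            self.flush()

    def flush(self) -> None:
        """Send everything buffered so far; do nothing if the buffer is empty."""
        if not self._parts:
            return
        self._send(" ".join(self._parts))
        self._parts.clear()
        self._size = 0
```

Calling flush() explicitly at the end of a call ensures the last, partially filled batch is still delivered.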

[0374] (Application Example 2)

[0375] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0376] In recent years, fraudulent activities have become more sophisticated, making it difficult to adequately defend against them using traditional methods. Furthermore, many users experience anxiety and stress when encountering fraud, which can prevent them from taking appropriate action. There is a need for a system that addresses these issues, effectively detects potential fraud, and provides users with appropriate warnings and advice.

[0377] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0378] In this invention, the server includes a conversion means for continuously acquiring voice information and converting it from voice to text information, an analysis means for analyzing the text information and comparing it with past fraud patterns, a notification means for issuing a warning when it is determined that there is a high probability of fraud, an emotion analysis means for analyzing the tone of voice information and recognizing the user's emotional state, and an advice means for providing customized advice based on the user's emotional state. This makes it possible to quickly detect fraudulent activity and take appropriate action according to the user's emotional state.

[0379] "Voice information" refers to data obtained by converting ambient sounds into digital signals, which are used for conversation and voice commands.

[0380] "Textual information" refers to text data obtained by converting audio information using speech recognition technology, and is recorded as a string of characters.

[0381] "Conversion means" refers to technologies and devices for converting audio information into text information, and includes speech recognition engines.

[0382] "Analysis means" refers to technologies and devices for examining textual information and comparing it with previously recorded fraud patterns.

[0383] "Notification means" refers to technologies and devices used to issue warnings or alerts to users based on analysis results.

[0384] "Emotion analysis means" refers to technologies and devices that recognize a user's emotional state based on voice information, analyzing voice tone and word choice.

[0385] "Advice means" refers to technologies and devices that provide appropriate advice and action suggestions based on the user's emotional state.

[0386] A "server" is a computer system used for various data processing tasks, such as analyzing audio information and recognizing emotional states.

[0387] This invention is realized by a system that continuously acquires and analyzes audio information. Specifically, a terminal placed near the user acquires surrounding audio information. The terminal converts this audio information into digital data and then converts it into text information in real time using a speech recognition engine (e.g., Google Speech-to-Text API). This text information is sent to a server for further processing.

[0388] The server uses natural language processing technology (e.g., Google Cloud Natural Language API) to analyze textual information and compare it against past fraud patterns. If it is determined that there is a high probability of fraud, a notification system is activated and a warning is sent to the user.

[0389] The tone of voice information is also analyzed on the server to recognize the user's emotional state. Technologies such as IBM Watson Tone Analyzer are used for emotion analysis. If the user is experiencing anxiety or stress, personalized advice is provided through the advice system.

[0390] For example, if a user receives a fraudulent phone call regarding their bank account, the device converts the conversation into text, and the server analyzes this information to identify the risk of fraud. Furthermore, if alarm is detected from the tone of voice, a warning is issued to the user saying, "This may be a scam. Please remain calm."
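
To make the idea of detecting alarm from the tone of voice more concrete, the Python sketch below scores how far a speaker's pitch, speaking rate, and volume rise above a baseline. It is a local stand-in under stated assumptions, not the tone analysis service named in the text: the feature set, the baseline values, the equal weighting, and the 0.2 threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VoiceFeatures:
    """Coarse voice measurements (hypothetical feature set)."""
    pitch_hz: float          # average pitch of the voice
    words_per_minute: float  # speaking rate
    volume_db: float         # average loudness


# Hypothetical baseline for the user's calm speaking voice.
BASELINE = VoiceFeatures(pitch_hz=180.0, words_per_minute=140.0, volume_db=60.0)


def _relative_rise(value: float, baseline: float) -> float:
    """Fractional increase over the baseline, clipped to the range 0.0 to 1.0."""
    return max(0.0, min(1.0, (value - baseline) / baseline))


def alarm_score(features: VoiceFeatures,
                baseline: VoiceFeatures = BASELINE) -> float:
    """Average the relative rises of pitch, speaking rate, and volume."""
    rises = [
        _relative_rise(features.pitch_hz, baseline.pitch_hz),
        _relative_rise(features.words_per_minute, baseline.words_per_minute),
        _relative_rise(features.volume_db, baseline.volume_db),
    ]
    return sum(rises) / len(rises)


def is_alarmed(features: VoiceFeatures, threshold: float = 0.2) -> bool:
    """Treat the user as alarmed when the score reaches the threshold."""
    return alarm_score(features) >= threshold
```

Treating decibels as a simple ratio is a crude simplification; the sketch is meant only to show the shape of the calculation, not a validated emotion model.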

[0391] An example of a prompt for a generative AI model is: "Please tell me how to improve a smartphone app that performs voice analysis and emotion recognition to detect scam calls. I'm thinking about how to provide effective advice when the user is emotionally unstable."

[0392] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0393] Step 1:

[0394] The device acquires ambient audio information and converts it into digital signals. This input is audio data obtained from the device's microphone. The device then uses the Google Speech-to-Text API to convert this audio data into text information and generates its output.

[0395] Step 2:

[0396] The terminal sends the converted character information to the server. This input character information is passed to the server as is, without any processing. The server then prepares for further data analysis.

[0397] Step 3:

[0398] The server analyzes the received text information and compares it to past fraud patterns using natural language processing techniques. During this process, it uses the Google Cloud Natural Language API to process the text information and obtain output indicating the potential for fraud.

[0399] Step 4:

[0400] The server analyzes the tone of the voice information and uses IBM Watson Tone Analyzer to recognize the user's emotional state. This analysis yields an output of emotional states, including emotions such as surprise and anxiety.

[0401] Step 5:

[0402] The server issues warnings to users based on the likelihood of fraud and their emotional state. It sends notifications to the user's device via a notification system, customizing the message based on this data.

[0403] Step 6:

[0404] Users receive warnings and customized advice from their device and take appropriate action. Output to the user is displayed on the device screen or as an audio notification.
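
The data flow of Steps 1 to 6 can be sketched in Python with each stage passed in as a function. The Pipeline class, its field names, and the 0.7 threshold are hypothetical; in practice the three callables would wrap the speech recognition, text analysis, and tone analysis services, which are replaced here by whatever functions the caller supplies.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Pipeline:
    """Wires the stages together; each stage is supplied by the caller."""
    transcribe: Callable[[bytes], str]        # speech to text on the terminal
    fraud_likelihood: Callable[[str], float]  # text analysis on the server
    emotion_of: Callable[[bytes], str]        # tone analysis on the server
    threshold: float = 0.7                    # hypothetical cut-off

    def run(self, audio: bytes) -> Optional[str]:
        """Return the message for the user, or None when no warning is due."""
        text = self.transcribe(audio)             # Steps 1 and 2
        likelihood = self.fraud_likelihood(text)  # Step 3
        emotion = self.emotion_of(audio)          # Step 4
        if likelihood < self.threshold:
            return None
        message = "This may be a scam."           # Step 5
        if emotion in ("surprise", "anxiety"):
            message += " Please remain calm."
        return message                            # Step 6: shown or spoken
```

Injecting the stages makes the flow testable with simple stub functions and lets any one service be swapped without touching the others.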

[0405] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0406] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives as input prompts containing instructions and inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0407] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.

[0408] [Third Embodiment]

[0409] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.

[0410] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.

[0411] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0412] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.

[0413] The microphone 238 receives instructions from the user 20 by picking up the user's voice. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0414] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0415] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0416] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0417] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0418] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0419] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0420] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".

[0421] As an embodiment for carrying out the present invention, a system using speech recognition and analysis technology will be described. This system continuously monitors ambient sounds using a microphone installed in a specific terminal.

[0422] The device continuously acquires this audio data and converts it into a digital format. The acquired audio data is immediately converted into text data by an internal speech recognition engine. This text data is then sent to a server. The server analyzes the transmitted text data using natural language processing techniques to understand the context and intent of the conversation.

[0423] In this analysis, the server compares the text data against an internal fraud database. The fraud database contains a collection of various past fraud cases and includes patterns similar to fraudulent activities. The server checks for any matches with these patterns and assesses the likelihood of fraud.

[0424] If the server determines that there is a high probability of fraud, it sends the result to the device. Based on the information received, the device issues a warning to the user. This warning is provided as an audio or text message to inform the user of the potential fraud. The device also automatically sends an alert to the user's designated emergency contacts, such as family members or the police.

[0425] A concrete example of implementation is a scenario in which a fictitious billing scam is carried out over the phone. The terminal acquires the call content in real time and converts it into text; when the text contains scam-specific phrases such as "there is an outstanding bill," the server detects this signal. If the probability of fraud is assessed as high, a warning or alert is immediately sent to the user and their emergency contact. By introducing this system, real-time detection and warning of fraudulent activity are achieved, helping to protect users from becoming victims.

[0426] The following describes the processing flow.

[0427] Step 1:

[0428] The device constantly monitors ambient sound and converts the acquired audio data into a digital format. The converted audio is processed as short audio segments at a pre-set buffering rate.

[0429] Step 2:

[0430] The device uses a speech recognition engine to convert speech data into text data in real time. In doing so, the speech recognition engine employs a machine learning model pre-trained on a large-scale speech dataset.

[0431] Step 3:

[0432] The terminal sends the converted text data to the server. The text data is securely transmitted to the server via the network using an encrypted protocol.

[0433] Step 4:

[0434] The server applies natural language processing algorithms to analyze the received text data. This allows the server to understand the content of the conversation and extract key keywords and phrases related to the scam.

[0435] Step 5:

[0436] The server compares the analyzed text against a fraud database. This database contains past fraud cases and is used to identify patterns that are likely to be fraudulent.

[0437] Step 6:

[0438] The server calculates a fraud score and, if it determines that the conversation is highly likely to be a scam, sends the result to the device. This score and the reason for the determination serve as criteria for issuing a warning to the user.

[0439] Step 7:

[0440] Upon receiving the judgment result from the server, the device will send a warning to the user via voice or text message. This warning will inform the user of the potential for fraud and urge them to be cautious.

[0441] Step 8:

[0442] Depending on the user's settings, the device will send an alert to emergency contacts regarding a potential scam. These emergency contacts include family members and the police, prompting them to take appropriate action depending on the situation.

[0443] Step 9:

[0444] Users verify whether or not a scam is actually occurring based on the warnings and alerts they receive. They then provide feedback through their device, and if there are any misidentifications, the data is used for training purposes.

[0445] Step 10:

[0446] When the server receives feedback, it uses it to improve the AI model, thereby increasing the accuracy of subsequent analyses. The database and model are regularly updated based on the feedback.
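The feedback loop of Steps 9 and 10 can be sketched as follows. This is only one possible reading: it assumes the fraud database stores a weight per phrase and nudges that weight toward 1 or 0 with an exponential moving average. The class `PatternStore`, the default weight of 0.5, and the learning rate are hypothetical choices, not the update method of this disclosure.

```python
from dataclasses import dataclass, field

@dataclass
class PatternStore:
    """Hypothetical stand-in for the fraud database: phrase -> weight in [0, 1]."""
    weights: dict = field(default_factory=dict)
    learning_rate: float = 0.2  # assumed value

    def apply_feedback(self, matched_phrases, was_fraud: bool) -> None:
        """Move each matched phrase's weight toward 1 (confirmed scam) or 0 (false alarm)."""
        target = 1.0 if was_fraud else 0.0
        for phrase in matched_phrases:
            old = self.weights.get(phrase, 0.5)  # unseen phrases start at 0.5
            self.weights[phrase] = old + self.learning_rate * (target - old)
```

Because each update only moves a weight part of the way toward the target, a single mistaken report from a user does not erase a pattern that many earlier reports confirmed.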

[0447] (Example 1)

[0448] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0449] Fraudulent activities are becoming more sophisticated every day, and conventional countermeasures often fail to detect them quickly. In particular, fraud conducted via voice calls requires rapid, real-time detection and warning, but achieving this has been difficult.

[0450] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0451] In this invention, the server includes an acoustic recognition means that continuously acquires voice information and converts it from sound to text information, an information analysis means that analyzes the text information and compares it with existing fraud patterns, and a communication means that transmits the text information to an information processing device using a secure communication protocol. This makes it possible to detect fraudulent activity in real time and quickly issue warnings to users and related parties.

[0452] "Audio information" refers to data that represents acoustic signals in digital format, and is an essential element for use in speech recognition and analysis.

[0453] "Acoustic recognition means" refers to a technical method or device that receives audio information as input and converts the audio data into textual information.

[0454] "Textual information" refers to text data converted from speech, and is fundamental information for analysis and data processing.

[0455] "Information analysis means" refers to technical methods or devices for processing textual information and recognizing and analyzing specific patterns or intentions.

[0456] "Communication means" refers to technical methods or devices for sending and receiving information between remote devices or equipment, and is intended to transfer data through a secure protocol.

[0457] "Transmission means" refers to a technical method or device for transmitting information to users or other stakeholders based on analysis results, and for issuing warnings or notifications.

[0458] "Communication method" refers to a technical method or device that automatically sends information to a contact specified by the user to prompt a response in an emergency.

[0459] This invention relates to a fraud detection system that uses audio information, and continuously monitors surrounding audio information using a microphone installed in a specific terminal. To implement this system, the terminal is equipped with advanced speech recognition software as an acoustic recognition means. This software can instantly convert acoustic data into text data, which is written information.

[0460] The converted text information is sent to the server using a secure communication protocol (e.g., HTTPS). The server uses a natural language processing engine as an information analysis tool to analyze the text information and compare it with fraud patterns. This engine has an extensive database of existing frauds and performs highly accurate analysis based on past fraud patterns.
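Sending the text over HTTPS can be sketched with the Python standard library. The endpoint URL, the JSON field names, and the function `build_transcript_request` are assumptions made for illustration; the disclosure specifies only that a secure protocol such as HTTPS is used.

```python
import json
import urllib.request

def build_transcript_request(
    text: str,
    device_id: str,
    endpoint: str = "https://fraud-server.example/api/transcripts",  # hypothetical URL
) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the transcript as JSON."""
    # Refuse plain HTTP so the transcript never leaves the terminal unencrypted.
    if not endpoint.lower().startswith("https://"):
        raise ValueError("transcripts must be sent over HTTPS")
    body = json.dumps({"device_id": device_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )
```

The request would then be sent with `urllib.request.urlopen(request, timeout=5)`, which verifies the server certificate by default for `https` URLs.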

[0461] If the conversation is deemed highly likely to be a scam, the server uses its transmission means to send a warning to the device. The device receives this warning and presents a visual notification to the user. This notification is also provided as an audio message or text message to immediately inform the user of the danger.

[0462] Furthermore, the device utilizes communication methods to automatically send alerts to emergency contacts pre-configured by the user. This automated contact function enables a swift response to the user's family and relevant authorities.
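A minimal sketch of the automated contact function follows. The `Contact` type and the `send` callable are hypothetical; `send` stands in for whatever messaging channel (SMS, e-mail, push) a real terminal would use.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Contact:
    name: str
    address: str  # phone number or e-mail address (hypothetical field)

def alert_contacts(
    contacts: List[Contact],
    message: str,
    send: Callable[[str, str], None],
) -> List[str]:
    """Send the alert to every contact; return the names that could not be reached."""
    failed = []
    for contact in contacts:
        try:
            send(contact.address, message)
        except Exception:
            # One unreachable contact must not stop alerts to the others.
            failed.append(contact.name)
    return failed
```

Returning the unreachable contacts, rather than stopping at the first error, lets the terminal retry later or tell the user which alerts did not go out.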

[0463] As a concrete example, consider a textbook case of an "It's me" scam. In this case, the terminal acquires the call content in real time and captures scam-specific phrases such as "You have an outstanding bill." The server immediately analyzes this information and, if it assesses the likelihood of fraud as high, issues a warning to the user and their emergency contact.

[0464] An example of a prompt to input into a generative AI model is: "Please describe the specific steps of a system that analyzes voice data in real time and warns the user if it is suspected of being fraudulent."

[0465] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0466] Step 1:

[0467] The device uses a built-in microphone to acquire ambient sound information in real time. The input is an analog acoustic signal, which is then converted into digital data by a digital conversion processor. This process prepares the audio data for subsequent processing.

[0468] Step 2:

[0469] The converted audio data is input into speech recognition software, which is an acoustic recognition means within the device. Here, a deep learning algorithm is used to convert the audio data into text data (written information). In this process, a noise reduction algorithm is also used to enhance sound quality and improve recognition accuracy. The output is this text data.

[0470] Step 3:

[0471] The terminal sends the generated text information to the server using a secure communication protocol. The input is text data, which is encrypted using protocols such as HTTPS to ensure data security before being sent to the server. The output is text data received on the server side.

[0472] Step 4:

[0473] The server inputs the received text data into a natural language processing (NLP) engine for analysis. This involves contextual analysis of the text and matching it against fraud patterns. The input is the text information, and the output is evaluation data such as the likelihood of fraud. Generative AI models are useful in this process.

[0474] Step 5:

[0475] When the text data is deemed highly likely to be fraudulent, the server sends the evaluation results to the terminal. The input is the analysis result, which is immediately formatted as a notification and sent by the transmission means. The output is warning data received by the terminal.

[0476] Step 6:

[0477] The device sends a voice or text warning to the user based on the received warning data. Furthermore, it automatically sends alerts to pre-configured emergency contacts (such as family members or the police). This allows the user and their relevant parties to respond quickly. The input is warning data, and the output is a warning message.
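The digital conversion in Step 1 is normally performed by analog-to-digital converter hardware. As a software illustration of the quantization part only, the sketch below maps normalized samples to signed 16-bit PCM. The 16-bit depth and the function names are assumptions; the disclosure does not specify a sample format.

```python
import struct

def quantize_pcm16(samples):
    """Map floats in [-1.0, 1.0] to signed 16-bit integers, clipping out-of-range values."""
    quantized = []
    for sample in samples:
        clipped = max(-1.0, min(1.0, sample))
        quantized.append(int(round(clipped * 32767)))
    return quantized

def to_bytes(pcm):
    """Pack the integers as little-endian 16-bit values, the sample layout used by WAV files."""
    return struct.pack("<%dh" % len(pcm), *pcm)
```

The resulting byte string is the kind of digital audio data that Step 2 would pass to the speech recognition software.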

[0478] (Application Example 1)

[0479] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0480] In today's world, where fraudulent activities are becoming increasingly diverse and sophisticated, consumers are at a higher risk of becoming victims of fraud through telephone and other communication methods. Traditional methods make it difficult to detect and warn about fraud in advance, and rapid countermeasures are needed. Therefore, a system is required that can detect fraudulent activity in real time and effectively issue warnings.

[0481] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0482] In this invention, the server includes a conversion means for continuously acquiring voice information and converting it from voice to text information, an analysis means for analyzing the text information and comparing it with past fraud patterns, a notification means for issuing a warning when it is determined that there is a high probability of fraud, and a warning issuing means for sending the warning to an electronic device as voice or push notification. This enables real-time fraud detection and immediate warning issuance.

[0483] "Audio information" refers to the digital representation of audible speech sounds that have been acquired.

[0484] "Text information" refers to data in string format generated by speech recognition technology based on audio information.

[0485] "Conversion means" refers to a process or device that has the function of converting audio information into text information.

[0486] "Analysis means" refers to a process or device that has the function of comparing text information with past fraud patterns and evaluating the degree of similarity.

[0487] A "notification means" is a process or device that has the function of transmitting a warning to the user via a communication means based on the analysis results.

[0488] A "warning issuing means" is a process or device that has the function of sending information that has been determined to be fraudulent to a user's electronic device via voice or push notification.

[0489] The system for implementing this invention is configured to continuously acquire voice information and detect potential fraud in real time. The server receives voice information transmitted from the user's electronic device. The device, such as a smartphone or other mobile device, is equipped with a built-in high-sensitivity microphone. The voice information is converted into a digital format by the device's software. The Google Cloud Speech-to-Text API is then used to convert the digitized voice information into text information with high-precision speech recognition.

[0490] Next, the server analyzes the received text information. This analysis utilizes natural language processing techniques and software libraries such as NLTK (Natural Language Toolkit). The server compares the text information with a database of previously collected fraud patterns and evaluates whether a matching pattern exists. If there is a high probability of fraudulent activity, the server generates a warning and notifies the user through the warning issuing means. This notification is delivered to the user's electronic device as an audio or push notification.

[0491] For example, if a user receives a phone call containing the scam-specific phrase, "You have an outstanding bill," the server will detect it. A warning message is immediately sent to the user's smartphone, allowing them to recognize the risk of fraud.

[0492] Furthermore, when a generative AI model is used to support the operation of this program, prompts such as the following can be utilized: "Write code to detect fraudulent phrases and display warnings. For example, generate code that identifies the phrase 'You have an outstanding bill.'" This allows the system to adapt to new fraud patterns.
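Code of the kind this prompt asks for might look like the following minimal sketch. It uses only the Python standard library rather than NLTK, and the phrase list contains only the one phrase quoted above; the function names and the warning wording are hypothetical.

```python
import re

# The only phrase taken from the text; a real list would be much longer.
SCAM_PHRASES = ["you have an outstanding bill"]

def find_scam_phrases(transcript: str, phrases=SCAM_PHRASES):
    """Return the known scam phrases that occur in the transcript as whole words."""
    words = re.findall(r"[a-z0-9']+", transcript.lower())
    joined = " ".join(words)  # punctuation and extra spaces are gone at this point
    return [p for p in phrases if re.search(r"\b" + re.escape(p) + r"\b", joined)]

def warning_for(transcript: str):
    """Return a warning string if a scam phrase is found, otherwise None."""
    hits = find_scam_phrases(transcript)
    if not hits:
        return None
    return "Warning: possible scam. Detected phrase(s): " + "; ".join(hits)
```

Adapting to a new fraud pattern then amounts to adding a phrase to the list, which is the kind of change a generative AI model could propose for review.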

[0493] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0494] Step 1:

[0495] The device constantly acquires audio information using its built-in microphone. It receives ambient sound as input and converts it into a digital audio file. This puts the audio data in a form that is easier to handle in subsequent processes.

[0496] Step 2:

[0497] The device converts the digitized audio data into text information using speech recognition technology. It uses a digital audio file as input and converts it to text using the Google Cloud Speech-to-Text API. As output, the conversation within the audio is rendered as text information in a format that allows for grammatical analysis.

[0498] Step 3:

[0499] The terminal sends the generated text information to the server. Text information is used as input and sent to the server via a secure communication path (e.g., HTTPS). As output, the server prepares the received text information for analysis.

[0500] Step 4:

[0501] The server analyzes the received text information and understands the context using natural language processing techniques. It uses text information as input and analyzes the meaning and relationships of each word using the NLTK library. The output is the analyzed contextual information.

[0502] Step 5:

[0503] The server compares text information against a database of fraud patterns. It uses the analysis results as input and compares them to known fraud patterns in the database. The output provides an evaluation of whether or not the information is likely to be fraudulent.

[0504] Step 6:

[0505] If the server determines that there is a high probability of fraud, it generates a notification. Using the evaluation results as input, it creates a message to warn the user based on the information identified as fraudulent. The warning content is prepared as output.

[0506] Step 7:

[0507] The server sends a warning to the terminal and notifies the user. Using the generated warning content as input, it sends it to the terminal as an audio or push notification. As output, the user can recognize the risk of fraud and take the necessary action.
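Steps 6 and 7 can be sketched as building a small warning record and serializing it for delivery. The threshold of 0.7, the field names, and the message wording are assumptions chosen for illustration; only the two delivery channels (audio and push notification) come from the text.

```python
import json
from dataclasses import dataclass
from enum import Enum

class Channel(Enum):
    VOICE = "voice"
    PUSH = "push"

@dataclass(frozen=True)
class FraudWarning:
    channel: Channel
    message: str
    score: float

def make_warning(score, matched, threshold=0.7, channel=Channel.PUSH):
    """Return a FraudWarning when the score reaches the threshold, otherwise None."""
    if score < threshold:
        return None
    message = "This call may be a scam. Suspicious phrase(s): " + ", ".join(matched)
    return FraudWarning(channel, message, round(score, 2))

def serialize(warning: FraudWarning) -> str:
    """Encode the warning as JSON for transmission to the terminal."""
    return json.dumps(
        {"channel": warning.channel.value, "message": warning.message, "score": warning.score}
    )
```

Keeping the channel in the record lets the terminal decide whether to speak the message or show it as a push notification.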

[0508] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0509] This invention comprises a system that integrates speech recognition technology, natural language processing technology, and sentiment analysis technology. The system consists of a terminal that acquires the user's voice and a server that performs the analysis.

[0510] The device has the ability to convert ambient sound into digital data and then convert this data into text data in real time through a speech recognition engine. This text data is sent to a server where further analysis is performed.

[0511] The server analyzes the received text data using natural language processing technology and a fraud database to assess the likelihood of fraud. Furthermore, an emotion engine analyzes the voice data to recognize the user's emotional state. If this emotion analysis detects, for example, that the user is experiencing stress or anxiety, the assessed likelihood of fraud is raised further.
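One way to express this emotion-based adjustment is sketched below. The formula, the `boost` factor, and the emotion keys are assumptions; the disclosure states only that detected stress or anxiety raises the assessed likelihood.

```python
def adjust_for_emotion(text_score: float, emotion_scores: dict, boost: float = 0.5) -> float:
    """Raise a text-based fraud score in [0, 1] when stress or anxiety is detected.

    emotion_scores maps emotion names to intensities in [0, 1] (hypothetical format).
    """
    distress = max(emotion_scores.get("stress", 0.0), emotion_scores.get("anxiety", 0.0))
    # Close part of the remaining gap to 1.0, in proportion to the distress level.
    adjusted = text_score + boost * distress * (1.0 - text_score)
    return min(1.0, max(0.0, adjusted))
```

Under this formula the emotional state can only raise the score, never lower it, and a score that is already 1.0 stays at 1.0.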

[0512] If a high probability of fraud is detected, the server sends the result to the device. The device immediately sends a warning to the user, providing customized advice based on the user's emotional state. For example, a user who is detected as anxious may receive advice such as, "Please stay calm and check carefully." Warnings and alerts are also sent to emergency contacts, enabling a quick response.

[0513] As a concrete example, consider a situation where a user receives a fraudulent phone call. The terminal converts the call content into text, and if it contains fraud-related phrases, the server quickly identifies the risk of fraud. Furthermore, if the user's surprise or confusion is detected from the tone of voice during the call, an enhanced warning is provided, and prompt notification is sent to the user and their family. In this way, the present invention makes it possible to effectively detect fraudulent activity and implement defense measures that take into account the user's emotional state.

[0514] The following describes the processing flow.

[0515] Step 1:

[0516] The device constantly monitors ambient noise and acquires audio data through its microphone. This audio data is converted into a digital format and stored in an internal buffer.

[0517] Step 2:

[0518] The device uses a speech recognition engine to convert stored speech data into text data in real time. To improve the accuracy of speech recognition, the model is trained to handle specific accents and expressions.

[0519] Step 3:

[0520] The device sends voice and text data to the server. This data transmission is performed through an encrypted protocol to enhance security.

[0521] Step 4:

[0522] The server uses natural language processing techniques to extract keywords and phrases related to fraud from the received text data. Furthermore, it compares this data with a fraud database to determine if it is likely to be fraudulent.

[0523] Step 5:

[0524] The server activates an emotion engine to analyze the user's voice data and evaluate the user's emotional state. Identified emotions such as stress, anxiety, and anger influence the fraud risk assessment.

[0525] Step 6:

[0526] The server calculates a fraud score based on the analysis results and makes a final decision based on that score and the user's emotional state.

[0527] Step 7:

[0528] The server sends the judgment result to the terminal, including information to issue a warning if there is a high probability of fraud.

[0529] Step 8:

[0530] The device displays or voices a warning message to the user. The message includes personalized advice based on the identified emotional state.

[0531] Step 9:

[0532] The device also sends warnings and alerts to emergency contacts (family and police) to encourage a quick response.

[0533] Step 10:

[0534] The user receives a warning and then actually checks whether or not it is a scam. After checking, the results are fed back to the device to help improve the system's accuracy.

[0535] Step 11:

[0536] The server receives user feedback and uses it to update the AI model and improve the accuracy of future analyses.
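The judgment and personalized advice of Steps 6 to 8 can be sketched as follows. Only the advice for an anxious user is taken from the text; the other advice strings, the threshold, the emotion floor, and the result format are hypothetical.

```python
ADVICE = {
    "anxiety": "Please stay calm and check carefully.",  # wording from the text
    "anger": "Please take a moment before responding to the caller.",  # hypothetical
}
DEFAULT_ADVICE = "Do not share personal or payment details until you have verified the caller."  # hypothetical

def dominant_emotion(emotions: dict, floor: float = 0.5):
    """Return the strongest emotion if its intensity reaches the floor, otherwise None."""
    if not emotions:
        return None
    name, value = max(emotions.items(), key=lambda item: item[1])
    return name if value >= floor else None

def build_judgment(fraud_score: float, emotions: dict, threshold: float = 0.7) -> dict:
    """Build the judgment result that the server would send to the terminal."""
    suspected = fraud_score >= threshold
    result = {"fraud_suspected": suspected, "score": fraud_score}
    if suspected:
        emotion = dominant_emotion(emotions)
        result["emotion"] = emotion
        result["advice"] = ADVICE.get(emotion, DEFAULT_ADVICE)
    return result
```

When no emotion is strong enough to count as dominant, the sketch falls back to the generic advice instead of guessing at the user's state.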

[0537] (Example 2)

[0538] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0539] In modern times, fraudulent activities are becoming increasingly sophisticated, leading to a growing number of situations that cause users anxiety and stress. Conventional systems have limitations in detecting fraud and lack appropriate responses that take into account the user's emotional state. This invention aims to efficiently detect the risk of fraud while simultaneously providing protective measures that are tailored to the user's emotional state.

[0540] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0541] In this invention, the server includes recognition means for continuously acquiring voice information and converting it from voice to text information, evaluation means for analyzing the text information and comparing it with past fraud cases, and emotion analysis means for analyzing the user's emotional state and reinforcing the assessment of the possibility of fraud. This enables early detection of fraudulent activity and appropriate, customized responses tailored to the user's emotions.

[0542] "Audio information" refers to data obtained by converting sounds generated around the user into a digital format.

[0543] "Textual information" refers to text data converted by speech recognition technology.

[0544] "Recognition means" refers to a device or algorithm for converting audio information into text information in real time.

[0545] "Evaluation means" refers to the process of analyzing textual information and comparing it with previously collected fraud cases to determine the risk of fraud.

[0546] "Emotion analysis means" refers to technologies that analyze the tone and content of a user's voice to determine their current emotional state.

[0547] "Notification method" refers to a function that sends a warning to the user or their emergency contacts when a potential scam is detected.

[0548] "Advice generation means" refers to a system for providing individually tailored advice based on the user's emotional state.

[0549] This invention includes a system that integrates speech recognition technology, natural language processing technology, and sentiment analysis technology. This system consists of a terminal that acquires the user's voice information and a server that performs the analysis.

[0550] The device uses its built-in acoustic sensor to capture sound emitted from the user's surroundings as digital data. Speech recognition software, typically a speech recognition API, converts the speech information into text in real time. The converted text is immediately sent to the server.

[0551] The server analyzes textual information using advanced natural language processing algorithms. This analysis utilizes generative AI models as language modeling technology. Furthermore, it has a function to evaluate the likelihood of fraud by comparing it with a previously collected fraud database. In addition, an emotion analysis engine is used to understand the user's emotional state from the tone and content of their voice. This technology helps determine whether the user is experiencing stress or anxiety, thereby enhancing the fraud likelihood assessment.
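As a rough illustration of how an emotion analysis engine might turn acoustic properties of the voice into a single score, the sketch below averages three normalized features. The choice of features (pitch variability, speaking rate, volume), the value ranges, and the equal weighting are all assumptions for illustration; a real engine would more likely use a trained model such as the emotion identification model 59.

```python
from dataclasses import dataclass

@dataclass
class VoiceFeatures:
    pitch_std_hz: float       # variability of pitch, as a proxy for tone
    words_per_minute: float   # speaking speed
    rms_volume: float         # loudness on a 0..1 scale

def _scale(value: float, low: float, high: float) -> float:
    """Linearly map value from [low, high] to [0, 1], clipping outside the range."""
    if high <= low:
        raise ValueError("high must be greater than low")
    return min(1.0, max(0.0, (value - low) / (high - low)))

def emotion_score(features: VoiceFeatures) -> float:
    """Return an agitation score in [0, 1]; the ranges below are illustrative guesses."""
    tone = _scale(features.pitch_std_hz, 20.0, 80.0)
    speed = _scale(features.words_per_minute, 120.0, 220.0)
    volume = _scale(features.rms_volume, 0.1, 0.6)
    return (tone + speed + volume) / 3.0
```

A score near 0 corresponds to calm, even speech, and a score near 1 to fast, loud speech with strongly varying pitch.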

[0552] If the server determines that the conversation is highly likely to be fraudulent, it returns that information to the device. The device then immediately sends a warning to the user, including personalized advice tailored to their emotional state. For example, a user experiencing anxiety might receive a message advising them to "calm down and reassess the situation." Furthermore, the device includes a rapid notification function to emergency contacts, enabling immediate action when necessary.

[0553] A concrete example would be a situation where a user receives a suspicious phone call. The device converts the audio to text, and this text is analyzed by the server, resulting in an immediate fraud warning. An example of a prompt message might be, "Convert this audio data to text and check for signs of fraud. Then, analyze the user's emotional state to create an appropriate message." In this way, the system can effectively detect fraudulent activity and provide a defense that takes the user's emotional state into consideration.

[0554] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0555] Step 1:

[0556] The device uses acoustic sensors to acquire ambient sound information. The input is analog audio, which is then converted into digital data. The digital data is then converted into text in real time using a speech recognition API. Specifically, the conversion begins when the user presses the "Start Audio Capture" button. The output is the converted text data.

[0557] Step 2:

[0558] The terminal sends the converted character information to the server. The input is text data, which is sent to the server using a secure protocol (e.g., HTTPS). Specifically, the text data is accumulated in a buffer in real time and transferred to the server each time it reaches a certain size. The output is the character information received by the server.

[0559] Step 3:

[0560] The server activates a natural language processing algorithm to analyze textual information. The input is text data received from the terminal, and a generative AI model is used to analyze the context of the text. In this process, the text is compared to a fraud database, and a risk assessment is performed. The output is assessment data indicating the risk of fraud.

[0561] Step 4:

[0562] The server uses an emotion analysis engine to analyze the user's emotional state. The input consists of voice and text information, which is used to estimate the user's emotions. Specifically, factors such as voice tone, speed, and volume are analyzed, and an emotion score is calculated. The output is data indicating the emotional state.

[0563] Step 5:

[0564] If the server determines that there is a high risk of fraud, it sends the evaluation result to the terminal. The input consists of evaluation data and emotional state data, which are used to create a warning. The output is the warning information sent to the terminal.

[0565] Step 6:

[0566] The device immediately issues a warning to the user. The input is warning information received from the server, which is then communicated to the user via the device's display or audio output. It also simultaneously provides customized advice based on the user's emotional state. Specifically, a notification sound plays, and a message is displayed on the screen. The output consists of a warning and advice to the user.

[0567] Step 7:

[0568] If necessary, the device will automatically send a notification to emergency contacts. Input consists of evaluation data from the server and the user's in-app settings, which are used to create the notification. Specifically, a message is sent to pre-registered email addresses or phone numbers. The output is an alert message sent to emergency contacts.
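The buffering described in Step 2, where text accumulates and is transferred each time it reaches a certain size, can be sketched as follows. The class name, the default size of 200 characters, and the `send` callable are hypothetical.

```python
class TranscriptBuffer:
    """Accumulate recognized text and hand it to `send` once it reaches max_chars."""

    def __init__(self, send, max_chars: int = 200):
        self._send = send            # callable taking one string, e.g. an HTTPS upload
        self._max_chars = max_chars
        self._parts = []
        self._size = 0

    def add(self, fragment: str) -> None:
        """Append a recognized fragment and flush if the size limit is reached."""
        self._parts.append(fragment)
        self._size += len(fragment)
        if self._size >= self._max_chars:
            self.flush()

    def flush(self) -> None:
        """Send whatever is buffered; does nothing when the buffer is empty."""
        if self._parts:
            self._send(" ".join(self._parts))
            self._parts, self._size = [], 0
```

Calling `flush()` when the call ends makes sure the last, possibly short, piece of the conversation is also analyzed.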

[0569] (Application Example 2)

[0570] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0571] In recent years, fraudulent activities have become more sophisticated, making it difficult to adequately defend against them using traditional methods. Furthermore, many users experience anxiety and stress when encountering fraud, which can prevent them from taking appropriate action. There is a need for a system that addresses these issues, effectively detects potential fraud, and provides users with appropriate warnings and advice.

[0572] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0573] In this invention, the server includes a conversion means for continuously acquiring voice information and converting it from voice to text information, an analysis means for analyzing the text information and comparing it with past fraud patterns, a notification means for issuing a warning when it is determined that there is a high probability of fraud, an emotion analysis means for analyzing the tone of voice information and recognizing the user's emotional state, and an advice means for providing customized advice based on the user's emotional state. This makes it possible to quickly detect fraudulent activity and take appropriate action according to the user's emotional state.

[0574] "Voice information" refers to data obtained by converting ambient sounds into digital signals, which are used for conversation and voice commands.

[0575] "Textual information" refers to text data obtained by converting audio information using speech recognition technology, and is recorded as a string of characters.

[0576] "Conversion means" refers to technologies and devices for converting audio information into text information, and includes speech recognition engines.

[0577] "Analysis means" refers to technologies and devices for examining textual information and comparing it with previously recorded fraud patterns.

[0578] "Notification means" refers to technologies and devices used to issue warnings or alerts to users based on analysis results.

[0579] "Emotion analysis means" refers to technologies and devices that recognize a user's emotional state based on voice information, analyzing voice tone and word choice.

[0580] "Advice means" refers to technologies and devices that provide appropriate advice and action suggestions based on the user's emotional state.

[0581] A "server" is a computer system used for various data processing tasks, such as analyzing audio information and recognizing emotional states.

[0582] This invention is realized by a system that continuously acquires and analyzes audio information. Specifically, a terminal placed near the user acquires surrounding audio information. The terminal converts this audio information into digital data and then converts it into text information in real time using a speech recognition engine (e.g., Google Speech-to-Text API). This text information is sent to a server for further processing.

[0583] The server uses natural language processing technology (e.g., Google Cloud Natural Language API) to analyze textual information and compare it against past fraud patterns. If it is determined that there is a high probability of fraud, a notification system is activated and a warning is sent to the user.

[0584] The tone of voice information is also analyzed on the server to recognize the user's emotional state. Technologies such as IBM Watson Tone Analyzer are used for emotion analysis. If the user is experiencing anxiety or stress, personalized advice is provided through the advice system.

[0585] For example, if a user receives a fraudulent phone call regarding their bank account, the device converts the conversation into text, and the server analyzes this information to identify the risk of fraud. Furthermore, if the user's alarm is detected from the tone of voice, a warning is issued to the user saying, "This may be a scam. Please remain calm."

[0586] An example of a prompt for a generative AI model is: "Please tell me how to improve a smartphone app that performs voice analysis and emotion recognition to detect scam calls. I'm thinking about how to provide effective advice when the user is emotionally unstable."

[0587] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0588] Step 1:

[0589] The device acquires ambient audio information and converts it into digital signals. The input is audio data obtained from the device's microphone. The device then uses the Google Speech-to-Text API to convert this audio data into text information, which is the output of this step.

[0590] Step 2:

[0591] The terminal sends the converted text information to the server. The input text information is passed to the server as is, without any processing. The server then prepares for further data analysis.

[0592] Step 3:

[0593] The server analyzes the received text information and compares it to past fraud patterns using natural language processing techniques. During this process, it uses the Google Cloud Natural Language API to process the text information and obtain output indicating the potential for fraud.

[0594] Step 4:

[0595] The server analyzes the tone of the voice information and uses IBM Watson Tone Analyzer to recognize the user's emotional state. This analysis yields an output of emotional states, including emotions such as surprise and anxiety.

[0596] Step 5:

[0597] The server issues warnings to users based on the likelihood of fraud and their emotional state. It sends notifications to the user's device via a notification system, customizing the message based on this data.

[0598] Step 6:

[0599] Users receive warnings and customized advice from their device and take appropriate action. Output to the user is displayed on the device screen or as an audio notification.
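
Steps 5 and 6 above combine the fraud likelihood with the recognized emotional state to customize the warning. The following is a minimal sketch of that combination, not the implementation of the disclosure. The threshold value, the emotion labels, the advice texts other than "Please remain calm.", and the names `AnalysisResult` and `build_warning` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed cutoff for "high likelihood of fraud"; the source gives no value.
FRAUD_THRESHOLD = 0.7

# Hypothetical advice per emotion label. Only "Please remain calm." is
# taken from the example warning in the text.
ADVICE_BY_EMOTION = {
    "anxiety": "Please remain calm.",
    "surprise": "Take a moment before you respond.",
}


@dataclass
class AnalysisResult:
    fraud_likelihood: float  # 0.0 to 1.0, from the text analysis (Step 3)
    emotion: str             # label from the tone analysis (Step 4)


def build_warning(result: AnalysisResult) -> Optional[str]:
    """Return a customized warning, or None when no warning is needed."""
    if result.fraud_likelihood < FRAUD_THRESHOLD:
        return None
    message = "This may be a scam."
    advice = ADVICE_BY_EMOTION.get(result.emotion)
    if advice:
        message = f"{message} {advice}"
    return message
```

For an anxious user on a high-risk call, `build_warning(AnalysisResult(0.9, "anxiety"))` yields the warning quoted earlier, while a low likelihood yields no warning at all.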

[0600] The specific processing unit 290 transmits the result of the specific processing to the headset terminal 314. In the headset terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0601] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 is input with prompts containing instructions, and with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 infers from the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and / or summarization.

[0602] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset terminal 314.

[0603] [Fourth Embodiment]

[0604] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.

[0605] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.

[0606] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0607] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.

[0608] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0609] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0610] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0611] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.

[0612] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0613] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0614] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0615] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0616] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0617] As an embodiment for carrying out the present invention, a system using speech recognition and analysis technology will be described. This system continuously monitors ambient sounds using a microphone installed in a specific terminal.

[0618] The device continuously acquires this audio data and converts it into a digital format. The acquired audio data is immediately converted into text data by an internal speech recognition engine. This text data is then sent to a server. The server analyzes the transmitted text data using natural language processing techniques to understand the context and intent of the conversation.

[0619] In this analysis, the server compares the text data against an internal fraud database. The fraud database contains a collection of various past fraud cases and includes patterns similar to fraudulent activities. The server checks for any matches with these patterns and assesses the likelihood of fraud.

[0620] If the server determines that there is a high probability of fraud, it sends the result to the device. Based on the information received, the device issues a warning to the user. This warning is provided as an audio or text message to inform the user of the potential fraud. The device also automatically sends an alert to the user's designated emergency contacts, such as family members or the police.

[0621] A concrete example of implementation is a scenario in which a billing scam is carried out over the phone. The terminal acquires the call content in real time and converts it into text that contains scam-specific phrases such as "there is an outstanding bill," and the server detects these phrases. If the probability of fraud is assessed as high, a warning or alert is immediately sent to the user and their emergency contact. By introducing this system, real-time detection and warning of fraudulent activity are achieved, effectively protecting users from becoming victims.
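
Because the call content arrives in real time as a series of short recognized segments, a scam-specific phrase may be split across two segments. The following sketch, under the assumption that the recognizer delivers plain-text segments in order, keeps a rolling window of recent words so that such a phrase is still found. The class name `RollingTranscript` and the window size are hypothetical.

```python
from collections import deque
from typing import Deque


class RollingTranscript:
    """Keeps the most recent words so phrases that span segments are found."""

    def __init__(self, phrase: str, max_words: int = 50) -> None:
        self.phrase = phrase.lower().split()
        # Oldest words are discarded automatically once max_words is reached.
        self.words: Deque[str] = deque(maxlen=max_words)

    def add_segment(self, segment_text: str) -> bool:
        """Append one recognized segment; return True if the phrase is present."""
        for word in segment_text.lower().split():
            # Strip surrounding punctuation so "bill." matches "bill".
            self.words.append(word.strip(".,!?\"'"))
        window = list(self.words)
        n = len(self.phrase)
        return any(
            window[i:i + n] == self.phrase
            for i in range(len(window) - n + 1)
        )
```

Note that `add_segment` keeps returning True until the phrase leaves the window, so a caller would normally issue the warning once and then suppress repeats.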

[0622] The following describes the processing flow.

[0623] Step 1:

[0624] The device constantly monitors ambient sound and converts the acquired audio data into a digital format. The converted audio is processed as short audio segments at a pre-set buffering rate.

[0625] Step 2:

[0626] The device uses a speech recognition engine to convert speech data into text data in real time. In doing so, the speech recognition engine employs a machine learning model pre-trained on a large-scale speech dataset.

[0627] Step 3:

[0628] The terminal sends the converted text data to the server. The text data is securely transmitted to the server via the network using an encrypted protocol.

[0629] Step 4:

[0630] The server applies natural language processing algorithms to analyze the received text data. This allows the server to understand the content of the conversation and extract key keywords and phrases related to the scam.

[0631] Step 5:

[0632] The server compares the analyzed text against a fraud database. This database contains past fraud cases and is used to identify patterns that are likely to be fraudulent.

[0633] Step 6:

[0634] The server calculates a fraud score and, if it determines that the conversation is highly likely to be a scam, sends the result to the device. This score and the reason for the determination serve as criteria for issuing a warning to the user.

[0635] Step 7:

[0636] Upon receiving the judgment result from the server, the device will send a warning to the user via voice or text message. This warning will inform the user of the potential for fraud and urge them to be cautious.

[0637] Step 8:

[0638] Depending on the user's settings, the device will send an alert to emergency contacts regarding a potential scam. These emergency contacts include family members and the police, prompting them to take appropriate action depending on the situation.

[0639] Step 9:

[0640] Users verify whether or not a scam is actually occurring based on the warnings and alerts they receive. They then provide feedback through their device, and if there are any misidentifications, the data is used for training purposes.

[0641] Step 10:

[0642] When the server receives feedback, it uses it to improve the AI model, thereby increasing the accuracy of subsequent analyses. The database and model are regularly updated based on the feedback.
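
Steps 6, 9, and 10 above can be sketched as follows: a score with its reasons, a threshold decision, and a log of user feedback from which misidentifications are collected for later retraining. This is an illustrative sketch only. The keyword weights, the threshold, and the class name `FraudScorer` are assumptions, and a simple keyword sum stands in for the AI model, which the source does not specify.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class FraudScorer:
    # Hypothetical keyword weights; a real system would learn these from data.
    weights: Dict[str, float] = field(default_factory=lambda: {
        "outstanding bill": 0.6,
        "immediately": 0.2,
        "bank account": 0.3,
    })
    threshold: float = 0.7  # assumed cutoff for "highly likely to be a scam"
    feedback_log: List[Tuple[str, bool]] = field(default_factory=list)

    def score(self, text: str) -> Tuple[float, List[str]]:
        """Return a score capped at 1.0 and the keywords that contributed."""
        lowered = text.lower()
        reasons = [k for k in self.weights if k in lowered]
        total = min(1.0, sum(self.weights[k] for k in reasons))
        return total, reasons

    def is_fraud(self, text: str) -> bool:
        return self.score(text)[0] >= self.threshold

    def record_feedback(self, text: str, was_fraud: bool) -> None:
        """Store the user's verdict (Step 9) for later model updates (Step 10)."""
        self.feedback_log.append((text, was_fraud))

    def misidentified(self) -> List[str]:
        """Texts where the current decision disagrees with the user's verdict."""
        return [t for t, label in self.feedback_log if self.is_fraud(t) != label]
```

Returning the contributing keywords alongside the score mirrors Step 6, where both the score and the reason for the determination are passed on to the device.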

[0643] (Example 1)

[0644] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0645] Fraudulent activities are becoming more sophisticated every day, and conventional countermeasures often fail to detect them quickly. In particular, fraud conducted via voice calls requires rapid, real-time detection and warning, but achieving this has been difficult.

[0646] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0647] In this invention, the server includes an acoustic recognition means that continuously acquires voice information and converts it from sound to text information, an information analysis means that analyzes the text information and compares it with existing fraud patterns, and a communication means that transmits the text information to an information processing device using a secure communication protocol. This makes it possible to detect fraudulent activity in real time and quickly issue warnings to users and related parties.

[0648] "Audio information" refers to data that represents acoustic signals in digital format, and is an essential element for use in speech recognition and analysis.

[0649] "Acoustic recognition means" refers to a technical method or device that receives audio information as input and converts the audio data into textual information.

[0650] "Textual information" refers to text data converted from speech, and is fundamental information for analysis and data processing.

[0651] "Information analysis means" refers to technical methods or devices for processing textual information and recognizing and analyzing specific patterns or intentions.

[0652] "Communication means" refers to technical methods or devices for sending and receiving information between remote devices or equipment, and is intended to transfer data through a secure protocol.

[0653] "Transmission means" refers to a technical method or device for transmitting information to users or other stakeholders based on analysis results, and for issuing warnings or notifications.

[0654] "Communication method" refers to a technical technique or device that automatically sends information to a contact specified by the user to prompt a response in an emergency.

[0655] This invention relates to a fraud detection system that uses audio information, and continuously monitors surrounding audio information using a microphone installed in a specific terminal. To implement this system, the terminal is equipped with advanced speech recognition software as an acoustic recognition means. This software can instantly convert acoustic data into text data, which is written information.

[0656] The converted text information is sent to the server using a secure communication protocol (e.g., HTTPS). The server uses a natural language processing engine as an information analysis tool to analyze the text information and compare it with fraud patterns. This engine has an extensive database of existing frauds and performs highly accurate analysis based on past fraud patterns.
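
As a sketch of how the terminal might package text information for transmission over HTTPS, the following builds a POST request with Python's standard `urllib.request` module without sending it. The endpoint URL, the JSON field names, and the function name `build_transcript_request` are hypothetical; the source states only that a secure protocol such as HTTPS is used.

```python
import json
import urllib.request

# Hypothetical endpoint; not taken from the source.
SERVER_URL = "https://fraud-analysis.example.com/v1/transcripts"


def build_transcript_request(terminal_id: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the text as JSON."""
    if not SERVER_URL.startswith("https://"):
        # Refuse to send transcripts over an unencrypted connection.
        raise ValueError("transcripts must be sent over HTTPS")
    body = json.dumps({"terminal_id": terminal_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        SERVER_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` with a timeout would perform the actual transmission; that call is left out here so the sketch runs without a network.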

[0657] If the conversation is deemed highly likely to be a scam, the server uses its transmission means to send a warning to the device. The device receives this warning and notifies the user. The notification is provided as an audio message or text message to immediately inform the user of the danger.

[0658] Furthermore, the device uses its communication method to automatically send alerts to emergency contacts pre-configured by the user. This automated contact function enables a swift response by the user's family and relevant authorities.

[0659] As a concrete example, consider a typical "It's me" scam. In this case, the terminal acquires the call content in real time and captures scam-specific phrases such as "You have an outstanding bill." The server immediately analyzes this information and, if it assesses the likelihood of fraud as high, issues a warning to the user and their emergency contact.

[0660] An example of a prompt to input into a generative AI model is: "Please describe the specific steps of a system that analyzes voice data in real time and warns the user if it is suspected of being fraudulent."

[0661] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0662] Step 1:

[0663] The device uses a built-in microphone to acquire ambient sound information in real time. The input is an analog acoustic signal, which is then converted into digital data by a digital conversion processor. This process prepares the audio data for subsequent processing.

[0664] Step 2:

[0665] The converted audio data is input into speech recognition software, which is an acoustic recognition means within the device. Here, a deep learning algorithm is used to convert the audio data into text data (written information). In this process, a noise reduction algorithm is also used to enhance sound quality and improve recognition accuracy. The output generated is text data, which is written information.

[0666] Step 3:

[0667] The terminal sends the generated text information to the server using a secure communication protocol. The input is text data, which is encrypted using protocols such as HTTPS to ensure data security before being sent to the server. The output is text data received on the server side.

[0668] Step 4:

[0669] The server inputs the received text data into a natural language processing (NLP) engine for analysis. This involves contextual analysis of the text and matching it against fraud patterns. Using the textual information as input, the analysis results generate evaluation data such as the likelihood of fraud. Generative AI models are useful in this process.

[0670] Step 5:

[0671] Based on text data deemed highly likely to be fraudulent, the server sends the evaluation results to the terminal. The input is the analysis result, which is immediately formatted into a notification format and sent via communication. The output is warning data received by the terminal.

[0672] Step 6:

[0673] The device sends a voice or text warning to the user based on the received warning data. Furthermore, it automatically sends alerts to pre-configured emergency contacts (such as family members or the police). This allows the user and their relevant parties to respond quickly. The input is warning data, and the output is a warning message.
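
Steps 5 and 6 above, in which the server formats the evaluation result as a notification and the terminal warns the user and the emergency contacts, can be sketched as follows. The JSON field names and the functions `format_warning` and `dispatch_warning` are assumptions, and the delivery channels are passed in as callables because the source does not fix how messages are delivered.

```python
import json
from typing import Callable, List


def format_warning(likelihood: float, phrases: List[str]) -> str:
    """Serialize the server's evaluation result as a notification (Step 5)."""
    return json.dumps({
        "type": "fraud_warning",
        "likelihood": likelihood,
        "matched_phrases": phrases,
    })


def dispatch_warning(
    payload: str,
    notify_user: Callable[[str], None],
    emergency_contacts: List[str],
    send_alert: Callable[[str, str], None],
) -> int:
    """Warn the user, then alert each pre-configured contact (Step 6).

    Returns the number of messages sent.
    """
    data = json.loads(payload)
    message = "Warning: this call may be a scam."
    if data["matched_phrases"]:
        message += " Suspicious phrase: " + data["matched_phrases"][0]
    notify_user(message)
    for contact in emergency_contacts:
        send_alert(contact, message)
    return 1 + len(emergency_contacts)
```

In a test setting, `notify_user` can simply be `print` and `send_alert` a function that records its arguments; in a deployment these would be replaced by speech output, on-screen display, or a messaging service.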

[0674] (Application Example 1)

[0675] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0676] In today's world, where fraudulent activities are becoming increasingly diverse and sophisticated, consumers are at a higher risk of becoming victims of fraud through telephone and other communication methods. Traditional methods make it difficult to detect and warn about fraud in advance, and rapid countermeasures are needed. Therefore, a system is required that can detect fraudulent activity in real time and effectively issue warnings.

[0677] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0678] In this invention, the server includes a conversion means for continuously acquiring voice information and converting it from voice to text information, an analysis means for analyzing the text information and comparing it with past fraud patterns, a notification means for issuing a warning when it is determined that there is a high probability of fraud, and a warning issuing means for sending the warning to an electronic device as voice or push notification. This enables real-time fraud detection and immediate warning issuance.

[0679] "Audio information" refers to the digital representation of audible speech sounds that have been acquired.

[0680] "Text information" refers to data in string format generated by speech recognition technology based on audio information.

[0681] "Conversion means" refers to a process or device that has the function of converting audio information into text information.

[0682] "Analysis means" refers to a process or device that has the function of comparing text information with past fraud patterns and evaluating the degree of similarity.

[0683] A "notification means" is a process or device that has the function of transmitting a warning to the user via a communication means based on the analysis results.

[0684] A "warning transmission method" is a process or device that has the function of sending information that has been determined to be fraudulent to a user's electronic device via voice or push notification.

[0685] The system for implementing this invention is configured to continuously acquire voice information and detect potential fraud in real time. The server receives voice information transmitted from the user's electronic device. The device is equipped with a high-sensitivity microphone, such as those built into smartphones and other mobile devices. The voice information is converted into a digital format by the device's software and then converted into text. The Google Cloud Speech-to-Text API is used for the speech-to-text conversion to achieve high-precision speech recognition.

[0686] Next, the server analyzes the received text information. This analysis utilizes natural language processing techniques and software libraries such as NLTK (Natural Language Toolkit). The server compares the text information with a database of previously collected fraud patterns and evaluates whether a matching pattern exists. If there is a high probability of fraudulent activity, the server generates a warning and notifies the user through a warning dissemination system. This notification is delivered to the user's electronic device as an audio or push notification.

[0687] For example, if a user receives a phone call containing the scam-specific phrase, "You have an outstanding bill," the server will detect it. A warning message is immediately sent to the user's smartphone, allowing them to recognize the risk of fraud.

[0688] Furthermore, when a generative AI model is used to support the operation of this program, prompts such as the following can be utilized: "Write code to detect fraudulent phrases and display warnings. For example, generate code that identifies the phrase 'You have an outstanding bill.'" This allows the system to adapt to new fraud patterns.
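
The kind of code that the prompt above asks for might look like the following sketch. It is not output from any particular generative AI model. The phrase list contains only the phrase named in the text, and the function names `detect_scam_phrase` and `warn_if_scam` are hypothetical. Matching ignores case and tolerates extra whitespace between words.

```python
import re
from typing import List, Optional, Pattern, Tuple

# The source names only this phrase; new phrases would be appended here.
SCAM_PHRASES: List[str] = ["You have an outstanding bill"]


def compile_phrase(phrase: str) -> Pattern[str]:
    """Compile a phrase into a case-insensitive, whitespace-tolerant regex."""
    words = [re.escape(w) for w in phrase.split()]
    return re.compile(r"\b" + r"\s+".join(words) + r"\b", re.IGNORECASE)


_PATTERNS: List[Tuple[str, Pattern[str]]] = [
    (p, compile_phrase(p)) for p in SCAM_PHRASES
]


def detect_scam_phrase(text: str) -> Optional[str]:
    """Return the first known scam phrase found in the text, or None."""
    for phrase, pattern in _PATTERNS:
        if pattern.search(text):
            return phrase
    return None


def warn_if_scam(text: str) -> Optional[str]:
    """Print and return a warning when a scam phrase is detected."""
    phrase = detect_scam_phrase(text)
    if phrase is None:
        return None
    warning = f'Warning: possible scam. Detected phrase: "{phrase}"'
    print(warning)
    return warning
```

Keeping the phrases in a plain list is a deliberate simplification: adapting to a new fraud pattern then amounts to adding an entry rather than changing the matching logic.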

[0689] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0690] Step 1:

[0691] The device constantly acquires audio information using its built-in microphone. It receives ambient sound as input and converts it into a digital audio file, which makes the audio data easier to handle in subsequent steps.

[0692] Step 2:

[0693] The device converts the digitized audio data into text information using speech recognition technology. It takes the digital audio file as input and converts it to text using the Google Cloud Speech-to-Text API. The output is the conversation in the audio as text information, formatted so that it can be analyzed grammatically.

[0694] Step 3:

[0695] The terminal sends the generated text information to the server. Text information is used as input and sent to the server via a secure communication path (e.g., HTTPS). As output, the server prepares the received text information for analysis.

[0696] Step 4:

[0697] The server analyzes the received text information and understands the context using natural language processing techniques. It uses text information as input and analyzes the meaning and relationships of each word using the NLTK library. The output is the analyzed contextual information.

[0698] Step 5:

[0699] The server compares text information against a database of fraud patterns. It uses the analysis results as input and compares them to known fraud patterns in the database. The output provides an evaluation of whether or not the information is likely to be fraudulent.

[0700] Step 6:

[0701] If the server determines that there is a high probability of fraud, it generates a notification. Using the evaluation results as input, it creates a message to warn the user based on the information identified as fraudulent. The warning content is prepared as output.

[0702] Step 7:

[0703] The server sends a warning to the terminal and notifies the user. Using the generated warning content as input, it sends it to the terminal as an audio or push notification. As output, the user can recognize the risk of fraud and take the necessary action.
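
Steps 4 and 5 above compare the analyzed text with known fraud patterns and evaluate the degree of similarity. The text names the NLTK library for the analysis; to stay self-contained, the following sketch substitutes a simple regular-expression tokenizer and the standard library's `difflib.SequenceMatcher`, which is an assumption and not the method of the disclosure. The pattern list and function names are likewise hypothetical.

```python
import re
from difflib import SequenceMatcher
from typing import List, Tuple

# Hypothetical stand-in for the database of previously collected fraud patterns.
KNOWN_FRAUD_PATTERNS: List[str] = [
    "you have an outstanding bill",
    "your account has been frozen",
    "please buy gift cards",
]


def tokenize(text: str) -> List[str]:
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def similarity(a: str, b: str) -> float:
    """Word-level similarity between two sentences, from 0.0 to 1.0."""
    return SequenceMatcher(None, tokenize(a), tokenize(b)).ratio()


def best_match(sentence: str) -> Tuple[str, float]:
    """Return the most similar known pattern and its similarity score."""
    scored = ((p, similarity(sentence, p)) for p in KNOWN_FRAUD_PATTERNS)
    return max(scored, key=lambda item: item[1])
```

Comparing word sequences rather than exact strings lets a slightly reworded sentence, such as one with an extra word inserted, still score close to the known pattern. The sketch compares one sentence at a time; a longer transcript would need to be split into sentences first.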

[0704] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0705] This invention comprises a system that integrates speech recognition technology, natural language processing technology, and sentiment analysis technology. The system consists of a terminal that acquires the user's voice and a server that performs the analysis.

[0706] The device has the ability to convert ambient sound into digital data and then convert this data into text data in real time through a speech recognition engine. This text data is sent to a server where further analysis is performed.

[0707] The server analyzes the received text data using natural language processing technology and a fraud database to assess the likelihood of fraud. Furthermore, an emotion engine analyzes the voice data to recognize the user's emotional state. If this emotion analysis detects, for example, that the user is experiencing stress or anxiety, the assessed likelihood of fraud is raised further.
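
One way the emotion analysis could raise the assessed likelihood is sketched below. The additive formula, the boost values, and the name `adjusted_likelihood` are assumptions; the source states only that detected stress or anxiety increases the likelihood of fraud, not how.

```python
from typing import Dict

# Hypothetical maximum boost per emotion label; labels not listed add nothing.
EMOTION_BOOST: Dict[str, float] = {"stress": 0.15, "anxiety": 0.15}


def adjusted_likelihood(text_likelihood: float, emotions: Dict[str, float]) -> float:
    """Raise the text-based likelihood according to detected emotions.

    `emotions` maps an emotion label to the emotion engine's confidence
    (0.0 to 1.0). The result is clamped so it never exceeds 1.0.
    """
    if not 0.0 <= text_likelihood <= 1.0:
        raise ValueError("text_likelihood must be between 0.0 and 1.0")
    boost = sum(
        EMOTION_BOOST.get(label, 0.0) * confidence
        for label, confidence in emotions.items()
    )
    return min(1.0, text_likelihood + boost)
```

With these assumed values, a text-based likelihood of 0.6 combined with fully confident anxiety rises to about 0.75, which would cross a warning threshold of 0.7 that the text alone would not.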

[0708] If a high probability of fraud is detected, the server sends the result to the device. The device immediately sends a warning to the user, providing customized advice based on the user's emotional state. For example, a user who is detected as anxious may receive advice such as, "Please stay calm and check carefully." Warnings and alerts are also sent to emergency contacts, enabling a quick response.

[0709] As a concrete example, consider a situation where a user receives a fraudulent phone call. The terminal transcribes the call content, and if it contains fraud-related phrases, the server quickly identifies the risk of fraud. Furthermore, if the user's surprise or confusion is detected from their tone of voice during the call, an enhanced warning is provided, and prompt notification is sent to the user and their family. In this way, the present invention makes it possible to effectively detect fraudulent activity and implement defense measures that take into account the user's emotional state.

[0710] The following describes the processing flow.

[0711] Step 1:

[0712] The device constantly monitors ambient noise and acquires audio data through its microphone. This audio data is converted into a digital format and stored in an internal buffer.

[0713] Step 2:

[0714] The device uses a speech recognition engine to convert stored speech data into text data in real time. To improve the accuracy of speech recognition, the model is trained to handle specific accents and expressions.

[0715] Step 3:

[0716] The device sends voice and text data to the server. This data transmission is performed through an encrypted protocol to enhance security.

[0717] Step 4:

[0718] The server uses natural language processing techniques to extract keywords and phrases related to fraud from the received text data. Furthermore, it compares this data with a fraud database to determine if it is likely to be fraudulent.

[0719] Step 5:

[0720] The server activates an emotion engine to analyze the user's voice data and evaluate the user's emotional state. Identified emotions such as stress, anxiety, and anger influence the fraud risk assessment.

[0721] Step 6:

[0722] The server calculates a fraud score based on the analysis results and makes a final decision based on that score and the user's emotional state.

[0723] Step 7:

[0724] The server sends the judgment result to the terminal, including information to issue a warning if there is a high probability of fraud.

[0725] Step 8:

[0726] The device displays or voices a warning message to the user. The message includes personalized advice based on the identified emotional state.

[0727] Step 9:

[0728] The device also sends warnings and alerts to emergency contacts (family and police) to encourage a quick response.

[0729] Step 10:

[0730] The user receives a warning and then actually checks whether or not it is a scam. After checking, the results are fed back to the device to help improve the system's accuracy.

[0731] Step 11:

[0732] The server receives user feedback and uses it to update the AI model and improve the accuracy of future analyses.
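
Step 1 of the flow above stores digitized audio in an internal buffer before speech recognition. The following sketch shows one way such a buffer could hand fixed-length segments to the recognizer. The class name `SegmentBuffer`, the use of integer PCM samples, and the segment size are assumptions for illustration.

```python
from typing import List


class SegmentBuffer:
    """Accumulates audio samples and emits fixed-length segments."""

    def __init__(self, segment_size: int) -> None:
        if segment_size <= 0:
            raise ValueError("segment_size must be positive")
        self.segment_size = segment_size
        self._pending: List[int] = []

    def push(self, samples: List[int]) -> List[List[int]]:
        """Add new samples and return every segment that is now complete."""
        self._pending.extend(samples)
        segments: List[List[int]] = []
        while len(self._pending) >= self.segment_size:
            segments.append(self._pending[:self.segment_size])
            del self._pending[:self.segment_size]
        return segments

    def flush(self) -> List[int]:
        """Return the leftover samples (shorter than a segment) and clear them."""
        rest, self._pending = self._pending, []
        return rest
```

For 16 kHz audio, a `segment_size` of 16000 would correspond to one-second segments; `flush` would be called when the call ends so that the final partial segment is not lost.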

[0733] (Example 2)

[0734] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0735] In modern times, fraudulent activities are becoming increasingly sophisticated, leading to a growing number of situations that cause users anxiety and stress. Conventional systems have limitations in detecting fraud and lack appropriate responses that take into account the user's emotional state. This invention aims to efficiently detect the risk of fraud while simultaneously providing protective measures that are tailored to the user's emotional state.

[0736] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0737] In this invention, the server includes recognition means for continuously acquiring voice information and converting it from voice to text information, evaluation means for analyzing the text information and comparing it with past fraud cases, and emotion analysis means for analyzing the user's emotional state and reinforcing the assessment of the possibility of fraud. This enables early detection of fraudulent activity and appropriate, customized responses tailored to the user's emotions.

[0738] "Audio information" refers to data obtained by converting sounds generated around the user into a digital format.

[0739] "Textual information" refers to text data converted by speech recognition technology.

[0740] "Recognition means" refers to a device or algorithm for converting audio information into text information in real time.

[0741] "Evaluation method" refers to the process of analyzing textual information and comparing it with previously collected fraud cases to determine the risk of fraud.

[0742] "Emotional analysis methods" refer to technologies that analyze the tone and content of a user's voice to determine their current emotional state.

[0743] "Notification method" refers to a function that sends a warning to the user or their emergency contacts when a potential scam is detected.

[0744] "Advice generation means" refers to a system for providing individually tailored advice based on the user's emotional state.

[0745] This invention includes a system that integrates speech recognition technology, natural language processing technology, and sentiment analysis technology. This system consists of a terminal that acquires the user's voice information and a server that performs the analysis.

[0746] The device uses its built-in acoustic sensor to capture sound from the user's surroundings as digital data. Speech recognition software, typically a speech recognition API, converts this voice information into text in real time. The converted text is immediately sent to the server.

[0747] The server analyzes the text information using natural language processing algorithms, with a generative AI model serving as the language model. The server also evaluates the likelihood of fraud by comparing the text with a previously collected fraud database. In addition, an emotion analysis engine estimates the user's emotional state from the tone and content of their voice. Knowing whether the user is experiencing stress or anxiety refines the fraud likelihood assessment.

[0748] If a server determines that a transaction is highly likely to be fraudulent, it returns that information to the device. The device then immediately sends a warning to the user, including personalized advice tailored to their emotional state. For example, a user experiencing anxiety might receive a message advising them to "calm down and reassess the situation." Furthermore, it includes a rapid notification function to emergency contacts, enabling immediate action when necessary.

[0749] A concrete example would be a situation where a user receives a suspicious phone call. The device converts the audio to text, and this text is analyzed by the server, resulting in an immediate fraud warning. An example of a prompt message might be, "Convert this audio data to text and check for signs of fraud. Then, analyze the user's emotional state to create an appropriate message." In this way, the system can effectively detect fraudulent activity and provide a defense that takes the user's emotional state into consideration.

[0750] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0751] Step 1:

[0752] The device uses acoustic sensors to acquire ambient sound information. The input is analog audio, which is then converted into digital data. This conversion process utilizes a speech recognition API to convert audio into text in real time. Specifically, the conversion begins when the user presses the "Start Audio Capture" button. The output is the converted text data.

[0753] Step 2:

[0754] The terminal sends the converted character information to the server. The input is text data, which is sent to the server using a secure protocol (e.g., HTTPS). Specifically, the text data is accumulated in a buffer in real time and transferred to the server each time it reaches a certain size. The output is the character information received by the server.
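The buffering behavior in Step 2 can be illustrated with a minimal sketch. The class name `TextBuffer`, the `send` callback, and the `max_chars` threshold are all hypothetical; the description only states that text is accumulated and transferred when it reaches "a certain size". In a real terminal, `send` would wrap an HTTPS request.

```python
from typing import Callable, List


class TextBuffer:
    """Accumulates recognized text and flushes it once a size threshold is reached."""

    def __init__(self, send: Callable[[str], None], max_chars: int = 200) -> None:
        self._send = send            # hypothetical transport callback (e.g. an HTTPS POST)
        self._max_chars = max_chars  # assumed threshold; the source gives no value
        self._chunks: List[str] = []
        self._size = 0

    def append(self, text: str) -> None:
        """Add a recognized fragment and flush if the threshold is reached."""
        self._chunks.append(text)
        self._size += len(text)
        if self._size >= self._max_chars:
            self.flush()

    def flush(self) -> None:
        """Send whatever is buffered and reset the buffer."""
        if not self._chunks:
            return
        self._send(" ".join(self._chunks))
        self._chunks = []
        self._size = 0


# Usage: collect "sent" payloads in a list instead of calling a real server.
sent: List[str] = []
buffer = TextBuffer(send=sent.append, max_chars=20)
buffer.append("hello this is")  # below the threshold, nothing is sent yet
buffer.append("your bank")      # threshold reached, both fragments are sent together
```

Calling `flush()` explicitly when audio capture stops would ensure that a final, partially filled buffer is not lost.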

[0755] Step 3:

[0756] The server activates a natural language processing algorithm to analyze textual information. The input is text data received from the terminal, and a generative AI model is used to analyze the context of the text. In this process, the text is compared to a fraud database, and a risk assessment is performed. The output is assessment data indicating the risk of fraud.
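As a rough illustration of the comparison against a fraud database in Step 3, the following sketch scores text by simple phrase matching. This is a deliberately simplified stand-in: the description uses a generative AI model for contextual analysis, whereas the phrases, weights, and the function `assess_fraud_risk` here are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class FraudPattern:
    phrase: str    # lowercase phrase taken from a past fraud case (illustrative)
    weight: float  # assumed contribution to the risk score


# Hypothetical stand-in for the fraud database.
FRAUD_PATTERNS: Sequence[FraudPattern] = (
    FraudPattern("gift card", 0.5),
    FraudPattern("wire transfer", 0.4),
    FraudPattern("do not tell anyone", 0.4),
)


def assess_fraud_risk(text: str, patterns: Sequence[FraudPattern] = FRAUD_PATTERNS) -> dict:
    """Return a risk score in [0, 1] and the phrases that matched."""
    lowered = text.lower()
    hits = [p for p in patterns if p.phrase in lowered]
    matched: List[str] = [p.phrase for p in hits]
    score = min(1.0, sum((p.weight for p in hits), 0.0))  # cap the summed weights at 1.0
    return {"risk": score, "matched": matched}


result = assess_fraud_risk("Please buy a gift card and do not tell anyone.")
```

The returned dictionary plays the role of the "assessment data indicating the risk of fraud" that Step 3 outputs.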

[0757] Step 4:

[0758] The server uses an emotion analysis engine to analyze the user's emotional state. The input consists of voice and text information, which is used to estimate the user's emotions. Specifically, factors such as voice tone, speed, and volume are analyzed, and an emotion score is calculated. The output is data indicating the emotional state.
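One way to picture the emotion score in Step 4 is as a combination of how far tone, speed, and volume deviate from a calm baseline. The sketch below assumes this interpretation; the baseline values, the spreads, and the equal weighting are illustrative assumptions and do not come from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceFeatures:
    pitch_hz: float          # stands in for "tone"
    words_per_minute: float  # speaking speed
    volume_db: float         # loudness


# Assumed calm-speech baseline and spread; these numbers are illustrative only.
BASELINE = VoiceFeatures(pitch_hz=180.0, words_per_minute=140.0, volume_db=60.0)
SPREAD = VoiceFeatures(pitch_hz=80.0, words_per_minute=80.0, volume_db=20.0)


def _deviation(value: float, baseline: float, spread: float) -> float:
    """Absolute deviation from the baseline, scaled by the spread and clamped to [0, 1]."""
    return min(1.0, abs(value - baseline) / spread)


def emotion_score(features: VoiceFeatures) -> float:
    """Average the three clamped deviations into one score in [0, 1]."""
    parts = (
        _deviation(features.pitch_hz, BASELINE.pitch_hz, SPREAD.pitch_hz),
        _deviation(features.words_per_minute, BASELINE.words_per_minute, SPREAD.words_per_minute),
        _deviation(features.volume_db, BASELINE.volume_db, SPREAD.volume_db),
    )
    return sum(parts) / len(parts)


calm = emotion_score(BASELINE)
agitated = emotion_score(VoiceFeatures(pitch_hz=260.0, words_per_minute=220.0, volume_db=80.0))
```

A score near 0 corresponds to speech close to the baseline, and a score near 1 to speech that is strongly raised in pitch, speed, and volume.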

[0759] Step 5:

[0760] If the server determines that there is a high risk of fraud, it sends the evaluation result to the terminal. The input consists of the evaluation data and the emotional state data, which are used to create a warning. The output is the warning information sent to the terminal.
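The creation of warning information from the evaluation data and the emotional state data might look like the following sketch. Both thresholds and the function name `build_warning` are assumptions; the first advice string is the example given earlier in this description, and the second is an illustrative placeholder.

```python
from typing import Optional

RISK_THRESHOLD = 0.7     # assumed cutoff for "high risk"; not given in the source
ANXIETY_THRESHOLD = 0.6  # assumed cutoff for treating the user as anxious


def build_warning(risk: float, emotion_score: float) -> Optional[dict]:
    """Return warning information for the terminal, or None when the risk is not high."""
    if risk < RISK_THRESHOLD:
        return None
    if emotion_score >= ANXIETY_THRESHOLD:
        advice = "Calm down and reassess the situation."
    else:
        # Illustrative wording for a user who does not appear anxious.
        advice = "Do not share personal or payment details until you verify the caller."
    return {"message": "This may be a scam.", "advice": advice, "risk": risk}


warning = build_warning(risk=0.9, emotion_score=0.8)
```

Returning `None` below the threshold keeps the terminal silent for ordinary conversations, so only high-risk evaluations produce a warning.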

[0761] Step 6:

[0762] The device immediately issues a warning to the user. The input is warning information received from the server, which is then communicated to the user via the device's display or audio output. It also simultaneously provides customized advice based on the user's emotional state. Specifically, a notification sound plays, and a message is displayed on the screen. The output consists of a warning and advice to the user.

[0763] Step 7:

[0764] If necessary, the device will automatically send a notification to emergency contacts. Input consists of evaluation data from the server and the user's in-app settings, which are used to create the notification. Specifically, a message is sent to pre-registered email addresses or phone numbers. The output is an alert message sent to emergency contacts.
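The decision in Step 7 combines the server's evaluation data with the user's in-app settings. A minimal sketch follows, assuming hypothetical setting names (`notify_contacts`, `notify_threshold`, `contacts`) and an assumed risk level at which notification counts as "necessary". Actual delivery by email or SMS is left out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class UserSettings:
    notify_contacts: bool = True   # hypothetical in-app toggle
    notify_threshold: float = 0.9  # assumed risk level that makes notification "necessary"
    contacts: List[str] = field(default_factory=list)  # pre-registered emails or phone numbers


def emergency_notifications(risk: float, settings: UserSettings) -> List[dict]:
    """Return one alert per registered contact, or an empty list if none should be sent."""
    if not settings.notify_contacts or risk < settings.notify_threshold:
        return []
    body = f"Possible fraud detected (risk {risk:.2f}). Please check on the user."
    return [{"to": contact, "body": body} for contact in settings.contacts]


settings = UserSettings(contacts=["family@example.com", "+81-00-0000-0000"])
alerts = emergency_notifications(0.95, settings)
```

Keeping message construction separate from delivery lets the same alerts be routed to email or telephone channels as appropriate.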

[0765] (Application Example 2)

[0766] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0767] In recent years, fraudulent activities have become more sophisticated, making it difficult to adequately defend against them using traditional methods. Furthermore, many users experience anxiety and stress when encountering fraud, which can prevent them from taking appropriate action. There is a need for a system that addresses these issues, effectively detects potential fraud, and provides users with appropriate warnings and advice.

[0768] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0769] In this invention, the server includes a conversion means for continuously acquiring voice information and converting it from voice to text information, an analysis means for analyzing the text information and comparing it with past fraud patterns, a notification means for issuing a warning when it is determined that there is a high probability of fraud, an emotion analysis means for analyzing the tone of voice information and recognizing the user's emotional state, and an advice means for providing customized advice based on the user's emotional state. This makes it possible to quickly detect fraudulent activity and take appropriate action according to the user's emotional state.

[0770] "Voice information" refers to data obtained by converting ambient sounds into digital signals, which are used for conversation and voice commands.

[0771] "Textual information" refers to text data obtained by converting audio information using speech recognition technology, and is recorded as a string of characters.

[0772] "Conversion means" refers to technologies and devices for converting audio information into text information, and includes speech recognition engines.

[0773] "Analysis means" refers to technologies and devices for examining textual information and comparing it with previously recorded fraud patterns.

[0774] "Notification means" refers to technologies and devices used to issue warnings or alerts to users based on analysis results.

[0775] "Emotion analysis means" refers to technologies and devices that recognize a user's emotional state from voice information by analyzing voice tone and word choice.

[0776] "Advice means" refers to technologies and devices that provide appropriate advice and action suggestions based on the user's emotional state.

[0777] A "server" is a computer system used for various data processing tasks, such as analyzing audio information and recognizing emotional states.

[0778] This invention is realized by a system that continuously acquires and analyzes audio information. Specifically, a terminal placed near the user acquires surrounding audio information. The terminal converts this audio information into digital data and then converts it into text information in real time using a speech recognition engine (e.g., Google Speech-to-Text API). This text information is sent to a server for further processing.

[0779] The server uses natural language processing technology (e.g., Google Cloud Natural Language API) to analyze textual information and compare it against past fraud patterns. If it is determined that there is a high probability of fraud, a notification system is activated and a warning is sent to the user.

[0780] The tone of voice information is also analyzed on the server to recognize the user's emotional state. Technologies such as IBM Watson Tone Analyzer are used for emotion analysis. If the user is experiencing anxiety or stress, personalized advice is provided through the advice system.

[0781] For example, if a user receives a fraudulent phone call regarding their bank account, the device converts the conversation into text, and the server analyzes this text to identify the risk of fraud. Furthermore, if agitation is detected in the user's tone of voice, a warning is issued to the user saying, "This may be a scam. Please remain calm."

[0782] An example of a prompt for a generative AI model is: "Please tell me how to improve a smartphone app that performs voice analysis and emotion recognition to detect scam calls. I'm thinking about how to provide effective advice when the user is emotionally unstable."

[0783] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0784] Step 1:

[0785] The device acquires ambient audio information and converts it into digital signals. The input is audio data obtained from the device's microphone. The device then uses the Google Speech-to-Text API to convert this audio data into text information, which is the output of this step.

[0786] Step 2:

[0787] The terminal sends the converted character information to the server. This input character information is passed to the server as is, without any processing. The server then prepares for further data analysis.

[0788] Step 3:

[0789] The server analyzes the received text information and compares it to past fraud patterns using natural language processing techniques. During this process, it uses the Google Cloud Natural Language API to process the text information and obtain output indicating the potential for fraud.

[0790] Step 4:

[0791] The server analyzes the tone of the voice information and uses IBM Watson Tone Analyzer to recognize the user's emotional state. This analysis yields an output of emotional states, including emotions such as surprise and anxiety.

[0792] Step 5:

[0793] The server issues warnings to users based on the likelihood of fraud and their emotional state. It sends notifications to the user's device via a notification system, customizing the message based on this data.

[0794] Step 6:

[0795] Users receive warnings and customized advice from their device and take appropriate action. Output to the user is displayed on the device screen or as an audio notification.
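The flow of Steps 1 to 6 can be sketched end to end as follows. To avoid guessing at the interfaces of the external services named above, the speech recognition, fraud analysis, and tone analysis components are passed in as plain callables; the class `Pipeline`, its field names, the threshold, and the stub lambdas are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Pipeline:
    transcribe: Callable[[bytes], str]         # terminal side: audio -> text
    fraud_probability: Callable[[str], float]  # server side: text -> probability in [0, 1]
    detect_emotion: Callable[[bytes], str]     # server side: audio -> emotion label
    threshold: float = 0.7                     # assumed cutoff; the source gives none

    def run(self, audio: bytes) -> Optional[str]:
        """Run the analysis steps and return the message for the user, if any."""
        text = self.transcribe(audio)               # Step 1
        probability = self.fraud_probability(text)  # Steps 2 and 3
        emotion = self.detect_emotion(audio)        # Step 4
        if probability < self.threshold:            # Step 5
            return None
        if emotion in {"anxiety", "surprise"}:
            return "This may be a scam. Please remain calm."
        return "This may be a scam."


# Stub components so the sketch runs without any external service.
pipeline = Pipeline(
    transcribe=lambda audio: audio.decode("utf-8"),
    fraud_probability=lambda text: 0.9 if "bank account" in text else 0.1,
    detect_emotion=lambda audio: "anxiety",
)
message = pipeline.run(b"your bank account has been frozen")
```

Injecting the components in this way also makes it straightforward to move the processing between the server and the terminal.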

[0796] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating the user's input in response to the result of the specific processing. The control unit 46A transmits the audio data indicating the user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0797] The data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives as input a prompt containing instructions together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions in the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0798] In the above embodiment, an example was given in which the specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.

[0799] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping, namely an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.

[0800] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. Above and below the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.

[0801] These emotions are distributed at the 3 o'clock position on the Emotion Map 400, and usually fluctuate between feelings of security and anxiety. In the right half of the Emotion Map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.

[0802] The inside of the Emotion Map 400 represents inner thoughts, while the outside represents actions. Therefore, the further toward the outside of the Emotion Map 400 an emotion lies, the more visible it becomes (that is, the more it is expressed in actions).

[0803] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https://ci.nii.ac.jp/naid/500000375379). The left half of the emotion map contains emotions belonging to a region called "response," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.

[0804] The emotion map defines two emotions that promote learning. One is the emotion around the middle of the negative "repentance" and "reflection" on the situation side. In other words, it is when the robot experiences negative emotions such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the emotion around the positive "desire" on the reaction side. In other words, it is when the robot has positive feelings such as "I want more" or "I want to know more."

[0805] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
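Once the neural network has produced an emotion value for each emotion on the map, the user's emotion still has to be selected from those values. The description does not state the selection rule, so the following sketch simply assumes that the emotion with the highest value is chosen; the function name `dominant_emotion` and the sample values are illustrative.

```python
from typing import Dict, Tuple


def dominant_emotion(emotion_values: Dict[str, float]) -> Tuple[str, float]:
    """Return the emotion with the highest value together with that value."""
    if not emotion_values:
        raise ValueError("emotion_values must not be empty")
    name = max(emotion_values, key=emotion_values.__getitem__)
    return name, emotion_values[name]


# Illustrative values: emotions that lie close together on the map have similar values.
values = {"reassured": 0.82, "calm": 0.80, "confident": 0.78, "anxious": 0.10}
label, value = dominant_emotion(values)
```

Because neighbouring emotions receive similar values, a practical implementation might also report the runner-up emotions rather than a single label.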

[0806] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.

[0807] In the above embodiment, an example was given in which a specific process is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and a distributed processing of the specific process may be performed by multiple computers, including computer 22. For example, a data generation model 58 may be provided in an external device of the data processing device 12, and the external device may generate data according to the input data.

[0808] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes specific processing according to the specific processing program 56.

[0809] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.

[0810] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.

[0811] The following types of processors can be used as hardware resources to perform specific processing. Examples of processors include a CPU, a general-purpose processor that functions as a hardware resource to perform specific processing by executing software, i.e., a program. Other examples of processors include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which have circuit configurations specifically designed to perform specific processing. All of these processors have built-in or connected memory, and all of them perform specific processing by using memory.

[0812] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.

[0813] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.

[0814] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.

[0815] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of the technical aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that you may delete unnecessary parts, add new elements, or replace elements in the descriptions and illustrations presented above, as long as you do not deviate from the essence of the technical aspects of this disclosure. Furthermore, in order to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge and the like that do not require special explanation to enable the implementation of the technical aspects of this disclosure have been omitted from the descriptions and illustrations presented above.

[0816] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted to be incorporated by reference.

[0817] The following is further disclosed regarding the embodiments described above.

[0818] (Claim 1)

[0819] A speech recognition means that continuously acquires audio data and converts it from audio to text data,

[0820] An analysis means for analyzing the aforementioned text data and comparing it with past fraud patterns,

[0821] A notification means that issues a warning when it is determined that there is a high probability of fraud,

[0822] A system comprising the above means.

[0823] (Claim 2)

[0824] The system according to claim 1, wherein the analysis means calculates the probability of fraud, and the notification means issues a warning based on that probability.

[0825] (Claim 3)

[0826] The system according to claim 1, wherein the notification means sends a warning to an emergency contact set by the user.

[0827] "Example 1"

[0828] (Claim 1)

[0829] A sound recognition means that continuously acquires audio information and converts it from sound to text information,

[0830] Information analysis means for analyzing the aforementioned textual information and comparing it with existing fraud patterns,

[0831] A communication means for transmitting the aforementioned textual information to an information processing device using a secure communication protocol,

[0832] A means of issuing a warning when it is determined that there is a high probability of fraud,

[0833] A means of communication that automatically sends the aforementioned warning to the user's pre-configured contacts,

[0834] A system that includes this.

[0835] (Claim 2)

[0836] The system according to claim 1, wherein the information analysis means evaluates the probability of fraud occurring, and the transmission means issues a warning based on that probability.

[0837] (Claim 3)

[0838] The system according to claim 1, wherein the communication means sends a warning to an emergency contact set by the user.

[0839] "Application Example 1"

[0840] (Claim 1)

[0841] A means for continuously acquiring audio information and converting it from audio to text information,

[0842] An analysis means for analyzing the aforementioned text information and comparing it with past fraud patterns,

[0843] A warning issuing means that issues a warning when it is determined that there is a high probability of fraud,

[0844] A warning transmission means that transmits the aforementioned warning to an electronic device as an audio or push notification,

[0845] A system that includes this.

[0846] (Claim 2)

[0847] The system according to claim 1, wherein the analysis means calculates the probability of fraud, and the warning issuing means issues a warning based on that probability.

[0848] (Claim 3)

[0849] The system according to claim 1, wherein the warning issuing means transmits the warning to an emergency contact set by the user.

[0850] "Example 2 of combining an emotion engine"

[0851] (Claim 1)

[0852] A recognition means that continuously acquires audio information and converts it from audio to text information,

[0853] An evaluation means for analyzing the aforementioned text information and comparing it with past fraud cases,

[0854] An emotion analysis means that analyzes the user's emotional state and uses it to refine the assessment of the possibility of fraud,

[0855] A notification means that issues a warning when it is determined that there is a high probability of fraud,

[0856] An advice generation means that provides customized advice according to the user's emotional state,

[0857] A system that includes this.

[0858] (Claim 2)

[0859] The system according to claim 1, wherein the evaluation means calculates the probability of fraud, and the notification means issues a warning based on that probability.

[0860] (Claim 3)

[0861] The system according to claim 1, wherein the notification means sends a warning to an emergency contact set by the user.

[0862] "Application example 2 when combining with an emotional engine"

[0863] (Claim 1)

[0864] A means for continuously acquiring audio information and converting it from audio to text information,

[0865] An analysis means for analyzing the aforementioned textual information and comparing it with past fraud patterns,

[0866] A notification means that issues a warning when it is determined that there is a high probability of fraud,

[0867] An emotion analysis means that analyzes the tone of the voice information and recognizes the user's emotional state,

[0868] An advice means that provides customized advice based on the user's emotional state,

[0869] A system that includes this.

[0870] (Claim 2)

[0871] The system according to claim 1, wherein the analysis means calculates the probability of fraud, and the notification means issues a warning based on that probability.

[0872] (Claim 3)

[0873] The system according to claim 1, wherein the notification means transmits a warning to an emergency contact set by the user.

[Explanation of Symbols]

[0874] 10, 210, 310, 410 Data processing system; 12 Data processing device; 14 Smart device; 214 Smart glasses; 314 Headset-type terminal; 414 Robot

Claims

1. A system comprising: a speech recognition means that continuously acquires audio data and converts it into text data; an analysis means for analyzing the text data and comparing it with past fraud patterns; and a notification means that issues a warning when it is determined that there is a high probability of fraud.

2. The system according to claim 1, wherein the analysis means calculates the probability of fraud, and the notification means issues a warning based on that probability.

3. The system according to claim 1, wherein the notification means transmits a warning to an emergency contact set by the user.