system
A system that converts communication content to text, analyzes for fraud signs, and issues warnings to users and third parties, effectively preventing phone fraud by using speech recognition and natural language processing.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Applications
- Current Assignee / Owner
- SOFTBANK GROUP CORP
- Filing Date
- 2024-12-13
- Publication Date
- 2026-06-25
AI Technical Summary
The increasing prevalence of phone fraud, particularly targeting the elderly, necessitates a technology that can quickly and accurately detect potential fraud and prevent damage before it occurs.
A system that acquires communication content, converts it into text data, analyzes the text for signs of fraud, and issues immediate warnings to users and pre-configured third parties, utilizing speech recognition and natural language processing technologies.
Enables real-time detection and prevention of fraudulent activities by providing users with timely warnings and allowing third parties to assess and respond to potential scams.
Description
Technical Field
[0001] The technology of the present disclosure relates to a system.
Background Art
[0002] Patent Document 1 discloses a method for controlling a persona chatbot performed by at least one processor, the method including steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a character of the chatbot, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance in response to the user utterance.
Prior Art Documents
Patent Documents
[0003]
Patent Document 1
Summary of the Invention
Problems to be Solved by the Invention
[0004] In modern society, the number of frauds committed over the phone has been increasing rapidly, and particularly, special fraud targeting the elderly has become a social problem. Since these frauds are carried out skillfully before the victims notice, it is difficult to prevent them in advance. In such a situation, there is a need for a technology that can quickly and accurately detect the possibility of fraud and prevent damage before it occurs.
Means for Solving the Problems
[0005] To solve this problem, we provide a means for acquiring communication content, converting it into text data, and analyzing the text data. Furthermore, we provide a system that includes a means for automatically issuing warnings when potential fraud is detected, and also a means for notifying pre-configured third parties. This makes it possible to monitor and prevent fraudulent activities conducted via telephone in real time.
[0006] "Communication content" refers to the collective term for information such as voice, data, and messages that are sent and received via telephone or electronic devices.
[0007] "Text data" refers to data obtained by converting audio or other formats into written information, which is in a format that can be analyzed by a computer.
[0008] "Analysis" is the process of thoroughly examining the obtained data, extracting necessary information and specific patterns from it, and giving them meaning.
[0009] "Fraud" refers to acts of deceiving others, particularly acts of using dishonest means for the purpose of obtaining money or property.
[0010] A "means of notifying warnings" refers to a system for communicating information to users or third parties when danger or anomalies are detected.
[0011] A "third party" is an individual or entity that is not a user or a direct party involved in the communication, but has the authority to receive information as needed.
Brief Description of the Drawings
[0012]
[Figure 1] It is a conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.
[Figure 2] It is a conceptual diagram showing an example of the main functions of a data processing device and a smart device according to the first embodiment.
[Figure 3] It is a conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.
[Figure 4] It is a conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
[Figure 5] It is a conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.
[Figure 6] It is a conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
[Figure 7] It is a conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.
[Figure 8] It is a conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
[Figure 9] It shows an emotion map to which a plurality of emotions are mapped.
[Figure 10] It shows an emotion map to which a plurality of emotions are mapped.
[Figure 11] It is a sequence diagram showing the processing flow of the data processing system in Example 1.
[Figure 12] It is a sequence diagram showing the processing flow of the data processing system in Application Example 1.
[Figure 13] It is a sequence diagram showing the processing flow of the data processing system in Example 2 when an emotion engine is combined.
[Figure 14] It is a sequence diagram showing the processing flow of the data processing system in Application Example 2 when an emotion engine is combined.
Embodiments for Carrying Out the Invention
[0013] Hereinafter, an example of an embodiment of a system according to the technology of the present disclosure will be described according to the accompanying drawings.
[0014] First, the language used in the following description will be explained.
[0015] In the following embodiments, a processor denoted by a reference numeral (hereinafter simply referred to as a "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. The processor may also be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), and an APU (Accelerated Processing Unit).
[0016] In the following embodiments, a RAM (Random Access Memory) denoted by a reference numeral is a memory that temporarily stores information and is used as working memory by the processor.
[0017] In the following embodiments, a storage denoted by a reference numeral is one or more non-volatile storage devices that store various programs, various parameters, and the like. Examples of non-volatile storage devices include flash memory (e.g., an SSD (Solid State Drive)), magnetic disks (e.g., hard disks), and magnetic tapes.
[0018] In the following embodiments, a communication I/F (interface) denoted by a reference numeral is an interface including a communication processor, an antenna, and the like. The communication I/F controls communication between multiple computers. Examples of communication standards applied to the communication I/F include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), and Bluetooth (registered trademark).
[0019] In the following embodiments, "A and/or B" is synonymous with "at least one of A and B." That is, "A and/or B" means A alone, B alone, or a combination of A and B. The same reading applies in this specification when three or more items are linked by "and/or."
[0020] [First Embodiment]
[0021] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.
[0022] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.
[0023] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).
[0024] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.
[0025] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.
[0026] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and/or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.
[0027] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.
[0028] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.
[0029] As shown in Figure 2, in the data processing device 12, specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.
[0030] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0031] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used in conjunction with the specific processing program 56 by the data processing system 10. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0032] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".
[0033] One embodiment of this invention is a system that primarily operates via a communication network. This system has the function of analyzing voice communications and detecting the possibility of fraud.
[0034] The server is responsible for monitoring voice communications initiated or received by users. Once a call begins, the server acquires the communication content in real time and converts the voice data into text using advanced speech recognition technology. This text data is then used for analysis to determine the potential for fraud.
[0035] During the analysis process, the server detects specific keywords, phrases, and conversational flow, and applies algorithms to evaluate signs of fraudulent activity. In particular, it focuses on analyzing requests related to money or personal information, and expressions that convey a sense of urgency.
[0036] If the server detects a potential scam, it immediately sends a warning notification to the terminal. This notification prompts the user to decide whether to continue the call and to remain vigilant. The server also notifies registered third parties, usually family members, allowing them to quickly assess the situation and take action.
[0037] As a concrete example, suppose a user receives a phone call from an unknown person who asks them to prepare money. The server analyzes the conversation and determines that it may be a sophisticated scam. A warning is immediately displayed on the terminal, and the user's family is also notified. As a result, the user and their family can take appropriate measures to protect themselves from fraudulent activity.
[0038] In this way, this system provides a means to prevent fraudulent activities via communications and ensure user safety through automated monitoring and notification functions.
[0039] The following describes the processing flow.
[0040] Step 1:
[0041] The server detects when a user initiates a call and acquires audio data in real time. This audio data is necessary to understand the content of the call in detail.
[0042] Step 2:
[0043] The server converts the acquired audio data into text data using speech recognition technology. This allows the content of the conversation to be treated as written information.
[0044] Step 3:
[0045] The server analyzes the converted text data. This analysis applies fraud detection algorithms to identify specific keywords and phrases.
[0046] Step 4:
[0047] The server determines the likelihood of fraud based on the analysis results. If it is determined that there is a high probability of fraud, it proceeds to the next step.
[0048] Step 5:
[0049] If the server determines that a call may be fraudulent, it sends a warning notification to the terminal. This prompts the user to pay attention to the content of the call.
[0050] Step 6:
[0051] The server notifies pre-registered third parties of potential fraud. Typically, family members are set to receive these notifications.
[0052] Step 7:
[0053] Users and third parties can review call recordings stored on the server. This allows for a detailed understanding of fraudulent activity and enables appropriate countermeasures.
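Steps 1 to 7 above can be sketched as a single monitoring object. This is a minimal illustration under stated assumptions, not the disclosed implementation: the class name CallMonitor, the phrase list SUSPICIOUS_PHRASES, and the callback parameters are all hypothetical, speech recognition is replaced by a caller-supplied stub, and the analysis is reduced to a plain substring check.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical phrase list; the source does not enumerate any keywords.
SUSPICIOUS_PHRASES = ("prepare money", "bank transfer", "right away")


@dataclass
class CallMonitor:
    """Steps 1-7 of the flow, with every external service stubbed out."""

    transcribe: Callable[[bytes], str]          # Step 2: speech recognition (stub)
    notify_user: Callable[[str], None]          # Step 5: warning to the terminal
    notify_third_party: Callable[[str], None]   # Step 6: warning to registered contacts
    recordings: List[bytes] = field(default_factory=list)  # Step 7: kept for later review

    def handle_audio(self, audio: bytes) -> bool:
        """Process one chunk of call audio; return True if fraud is suspected."""
        self.recordings.append(audio)            # Step 1: acquire and store the audio
        text = self.transcribe(audio).lower()    # Step 2: audio -> text
        hits = [p for p in SUSPICIOUS_PHRASES if p in text]  # Step 3: analyze the text
        suspected = len(hits) > 0                # Step 4: decide (assumed rule: any hit)
        if suspected:
            message = "Possible fraud: " + ", ".join(hits)
            self.notify_user(message)            # Step 5
            self.notify_third_party(message)     # Step 6
        return suspected
```

Passing the recognizer and the two notification channels in as callables keeps the sketch independent of any particular speech recognition service or messaging service.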
[0054] (Example 1)
[0055] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."
[0056] Currently, fraudulent activities conducted via voice communication are on the rise, and there is a need for effective means for users to prevent becoming victims. However, existing systems lack the ability to detect potential fraud in real time and respond quickly. Furthermore, it is difficult for users themselves to recognize the signs of fraud, and in many cases, fraud is only recognized after it has occurred. To improve this situation and ensure user safety, a more advanced and immediate detection system is needed.
[0057] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0058] In this invention, the server includes means for acquiring communication information and converting it into text information using speech recognition technology; means for analyzing the text information to extract specific language patterns and words and evaluate the possibility of fraud; and means for displaying an immediate alert on the user terminal when the possibility of fraud is evaluated. This makes it possible to analyze the possibility of fraud in real time and issue a rapid warning to the user.
[0059] "Communication information" refers to the exchange of any digital data, including voice data.
[0060] "Speech recognition technology" is a technology that processes speech data on a computer and converts it into text information.
[0061] "Textual information" refers to data in text format generated by speech recognition technology.
[0062] "Specific language patterns and words" refer to keywords or phrases that are useful for fraud detection, based on past fraud cases and risky expressions.
[0063] "Means of assessing the likelihood of fraud" refers to methods or algorithms for quantifying or ranking the likelihood of fraudulent activity based on analyzed textual information.
[0064] A "user terminal" refers to an electronic device owned by a user that is used to receive warnings and communications.
[0065] An "immediate alert" is a notification issued in real time to warn users when a potential scam is detected.
[0066] "External parties" refer to third parties who are designated as emergency contacts by the user, and information will be shared as needed.
[0067] Modes for carrying out the invention
[0068] This invention realizes a system for detecting and warning about fraudulent activities in voice communications, in which the server, terminal, and user each play specific roles. This system, which mainly operates over a communication network, uses advanced speech recognition and natural language processing technologies to analyze communication information in real time.
[0069] Server Role
[0070] The server monitors voice communications initiated or received by users. At the start of communication, the server acquires voice data in digital format. This process involves using a cloud-based API that provides speech recognition services to convert the voice data into text. Speech recognition technologies such as Google® Cloud Speech-to-Text API may be used.
[0071] The server sends the converted text information to a natural language processing engine, which uses various algorithms to detect specific language patterns and words. Python natural language processing libraries (e.g., NLTK and spaCy) are used here. Based on the analysis results, the likelihood of fraud is quantified or ranked, and preparations are made to immediately warn the user and external parties.
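The detection of specific language patterns and words can be illustrated as follows. The description names NLTK and spaCy; this sketch deliberately substitutes regular expressions from the Python standard library so that it runs without extra dependencies. The categories and patterns in PATTERNS are hypothetical examples, chosen to mirror the request types mentioned in the description (money, personal information, urgency).

```python
import re
from typing import Dict, List

# Hypothetical categories and patterns; the source does not list concrete expressions.
PATTERNS: Dict[str, List[str]] = {
    "money": [r"\btransfer\b", r"\bpay(?:ment)?\b", r"\bcash\b"],
    "personal_info": [r"\bpin\b", r"\bpassword\b", r"\baccount number\b"],
    "urgency": [r"\bimmediately\b", r"\bright now\b", r"\btoday only\b"],
}


def extract_suspicious_elements(text: str) -> Dict[str, List[str]]:
    """Return the matched expressions per category; categories with no match are omitted."""
    lowered = text.lower()
    found: Dict[str, List[str]] = {}
    for category, patterns in PATTERNS.items():
        matches = [m.group(0) for p in patterns for m in re.finditer(p, lowered)]
        if matches:
            found[category] = matches
    return found
```

For example, extract_suspicious_elements("Please pay immediately and tell me your PIN.") reports one match in each of the three categories. A production analysis would presumably add tokenization and morphological analysis, which plain regular expressions do not provide.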
[0072] Terminal role
[0073] The terminal receives warnings from the server and provides the user with an immediate alert. This alert is displayed as a pop-up notification or played as an audio alert to warn the user of danger. The terminal also has a built-in function to contact external parties and can send warning messages to pre-configured contacts as needed.
[0074] User roles
[0075] Users will receive a warning from their device and pay attention to the content of the call. If necessary, they will promptly end the call and take action such as reporting to the police or other relevant authorities. Users are also expected to coordinate with family and friends based on the call risk notified by the system.
[0076] Specific example
[0077] As a concrete example, consider a scenario where a user receives a call from an unknown number. If the caller requests immediate payment, the server's system detects this as a highly likely scam. The server immediately displays a warning on the user's device and simultaneously sends notifications to designated external parties. This approach allows users to quickly take appropriate measures to protect themselves from fraudulent activity.
[0078] Example of a prompt
[0079] "Please explain the system you have in place to warn users if a phone call is potentially fraudulent."
[0080] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0081] Step 1:
[0082] The server acquires the voice communication and captures it as digital data. When the user starts a call, the server receives the voice data in streaming format via the communication protocol. The input voice data is recorded on the server and stored for subsequent processing. The output is digital voice data.
[0083] Step 2:
[0084] The server converts audio data into text information using speech recognition technology. Using the acquired audio data as input, the server calls a speech recognition API to convert the audio to text. This process includes noise reduction and sampling to improve the accuracy of speech recognition. The output is communication information in text format.
[0085] Step 3:
[0086] The server analyzes textual information and extracts specific language patterns and words. Using text data as input, the server analyzes it using a natural language processing engine to identify potentially fraudulent keywords and expressions. This process involves text tokenization, morphological analysis, and semantic analysis. The output is a list of extracted suspicious language elements.
[0087] Step 4:
[0088] The server assesses the likelihood of fraud and sends a warning to the user's terminal if necessary. Using extracted language patterns as input, an algorithm is applied to calculate a fraud risk score. If the likelihood of fraud exceeds a certain threshold, the server generates a warning signal and sends it to the terminal. The output is a flag indicating whether or not a fraud warning was issued.
[0089] Step 5:
[0090] The terminal receives a warning from the server and provides an immediate warning to the user. Receiving the warning signal from the server, the terminal displays a notification to the user. This warning is delivered via a pop-up or audio notification, prompting the user to carefully review the call content. The output consists of visual and auditory warnings to the user.
[0091] Step 6:
[0092] The user receives a warning, examines the call content, and, if necessary, interrupts the call or contacts the relevant authorities. Upon receiving a warning from the device, the user carefully reviews the content. This allows the user to take appropriate action against fraud. As an output, specific actions are taken to ensure safety.
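The risk scoring and threshold check in Step 4 can be sketched as below. The source states only that a fraud risk score is calculated and compared with a threshold; the weights in CATEGORY_WEIGHTS, the value of ALERT_THRESHOLD, and the scoring rule itself are assumptions made for illustration. Integer points are used so that the comparison is exact.

```python
from typing import Dict, List

# Hypothetical weights (points per category) and threshold; the source gives no values.
CATEGORY_WEIGHTS: Dict[str, int] = {"money": 40, "personal_info": 30, "urgency": 30}
ALERT_THRESHOLD = 60


def fraud_risk_score(elements: Dict[str, List[str]]) -> int:
    """Add the weight of every category that has at least one extracted element (0-100)."""
    return sum(w for category, w in CATEGORY_WEIGHTS.items() if elements.get(category))


def should_warn(elements: Dict[str, List[str]]) -> bool:
    """Step 4 output: True when the score exceeds the threshold and a warning is issued."""
    return fraud_risk_score(elements) > ALERT_THRESHOLD
```

With these assumed values, a request for money combined with urgent wording scores 70 and triggers a warning, while urgent wording alone scores 30 and does not.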
[0093] (Application Example 1)
[0094] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."
[0095] In recent years, fraudulent activities using voice communication have been increasing, creating a need for automated security systems that can respond quickly to such threats. However, conventional technologies struggle to monitor voice communication and detect signs of fraud in real time, making it difficult to adequately ensure user safety. Furthermore, even when signs of fraud are detected, appropriate action is often not taken, making it difficult to prevent damage.
[0096] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0097] In this invention, the server includes means for acquiring communication information and converting said communication information into text data, means for analyzing said text data to detect signs of fraud, means for notifying a warning when signs of fraud are detected, means for notifying a pre-configured external party based on said warning, and means for displaying said warning and simultaneously prompting the user to choose whether to continue or terminate the communication. This enables the user to recognize the possibility of fraud in real time and respond quickly and appropriately.
[0098] "Communication information" refers to all information transmitted and received through communication, such as voice data and message content.
[0099] "Text data" refers to an information format in which speech is converted into text and recorded in an analyzable form.
[0100] "Signs of fraud" refer to suspicious elements that suggest fraudulent activity, based on specific keywords or expressions contained in the communication content.
[0101] "Means of notifying warnings" refers to methods or devices used to inform users of potential fraud.
[0102] "External parties" refer to third-party stakeholders, such as family members or monitoring service providers, who are pre-configured by the communication monitoring system and are separate from the user.
[0103] "Means of prompting the user to choose whether to continue or end communication" refers to interfaces or functions that support the user in deciding whether to continue or interrupt communication.
[0104] To implement this invention, the process begins with the server acquiring communication information. The server captures audio data received or transmitted through the user's terminal in real time. The hardware used is the smartphone's microphone, which transmits the audio data to the server in digital format. This audio data is processed within the server and converted into text data using the Google Cloud Speech-to-Text API.
[0105] Next, the server analyzes the converted text data to detect signs of fraud. These signs include specific keywords, unnatural expressions, or conversational patterns based on past fraud cases. A custom fraud detection algorithm is used for this analysis.
[0106] If signs of fraud are detected, the server will issue a warning. This warning will appear immediately on the user's device and will also notify pre-configured external parties (e.g., family members). This allows the user to decide whether to continue or interrupt the communication depending on the situation. Furthermore, the notification will be appropriately delivered via Firebase Cloud Messaging, and details will be shared with external parties as needed.
[0107] A concrete example is a scenario where a user is asked to "pay money immediately" during a call. In this case, the server quickly transcribes the speech into text and detects predetermined fraud-related keywords. As a result, it immediately displays a warning to the user and designated external parties, allowing the user to make a safe choice.
[0108] Examples of input prompts for a generative AI model:
[0109] "Please provide keywords to detect potential fraud during a phone call."
[0110] "Please list words that emphasize urgency."
[0111] This system provides an effective means of protecting users from fraudulent activities via voice communication through real-time monitoring and automated notifications.
[0112] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0113] Step 1:
[0114] The server receives audio data from the user's device in real time. The input is the audio signal from the device's microphone. The server digitizes this audio signal and acquires it as audio data. The output is the digitized audio data.
[0115] Step 2:
[0116] The server uses the Google Cloud Speech-to-Text API to convert digitized audio data into text data. The input is the audio data obtained in step 1. The server applies speech recognition to convert the audio data into text format. In this process, pronunciation recognition and grammatical analysis are performed. The output is the converted text data.
[0117] Step 3:
[0118] The server analyzes text data to detect signs of fraud. The input is the text data obtained in step 2. A custom fraud detection algorithm is applied to search for specific keywords and conversation patterns. If there is a possibility of fraud, the signs are extracted. The output is the analysis results regarding signs of fraud.
[0119] Step 4:
[0120] The server notifies the user of any signs of fraud. The input is the analysis results obtained in step 3. The server uses Firebase Cloud Messaging to send a warning message to the user's device. The output is the warning message displayed on the user's device.
[0121] Step 5:
[0122] The server displays options on the terminal so that the user can choose to continue or interrupt the communication. The input is the warning notification made in step 4. The user uses the on-screen options to decide whether to continue or end the communication. The output is the status of the communication based on the user's selection.
[0123] Step 6:
[0124] The server also notifies pre-configured external parties in response to signs of fraud. The inputs are the analysis results obtained in step 3 and the user's selection status from step 5. The server uses the external parties' contact information to send notifications via SMS, email, etc. The output is the notification message sent to the external parties.
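Steps 4 to 6 of this flow (warning, continue-or-terminate choice, and external notification) can be sketched as follows. The names Choice, WarningResult, and handle_fraud_signs are hypothetical, and the delivery channels (Firebase Cloud Messaging, SMS, email) are replaced by caller-supplied functions rather than real API calls, so this shows only the assumed control flow.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class Choice(Enum):
    CONTINUE = "continue"
    TERMINATE = "terminate"


@dataclass
class WarningResult:
    call_active: bool      # communication status after the user's selection
    notified: List[str]    # external parties that received a notification


def handle_fraud_signs(
    signs: List[str],
    ask_user: Callable[[str], Choice],
    senders: Dict[str, Callable[[str], None]],
) -> WarningResult:
    """Warn the user, ask whether to continue, then notify each external party."""
    if not signs:
        return WarningResult(call_active=True, notified=[])
    warning = "Warning: possible fraud (" + ", ".join(signs) + ")"
    choice = ask_user(warning)                 # Steps 4-5: show the warning with options
    notified: List[str] = []
    for contact, send in senders.items():      # Step 6: one stub sender per contact
        send(f"{warning}; user chose to {choice.value} the call")
        notified.append(contact)
    return WarningResult(call_active=(choice is Choice.CONTINUE), notified=notified)
```

Here the external notification includes the user's selection, following the statement that Step 6 takes the selection status as an input; whether the real system waits for that selection before notifying is not specified in the source.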
[0125] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.
[0126] One embodiment of this invention is an advanced analysis system that operates via a communication network. This system not only analyzes voice communications and detects potential fraud, but also has the function of recognizing the user's emotional state using an emotion engine.
[0127] The server monitors user calls and acquires audio data in real time. This acquired audio is first converted to text using speech recognition technology. Then, it is analyzed by a fraud detection algorithm. This analysis process considers not only specific keywords and phrases, but also the overall flow and context of the conversation to evaluate signs of fraud with high accuracy.
[0128] Furthermore, this system incorporates an emotion engine. The server extracts emotional characteristics such as tone, pitch, and speed from the voice data and uses this information to analyze the user's emotional state. For example, if a user shows extreme anxiety or distrust, the emotion engine detects this and uses it as information to further reinforce the possibility of fraud.
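As a rough illustration of how pitch and speaking speed might feed an emotion estimate, the sketch below scores how far both rise above a per-user baseline. This heuristic is an assumption made for illustration only: the source does not describe the emotion engine's internals, the function names and baseline parameters are hypothetical, and tone is not modeled here.

```python
from statistics import mean
from typing import Sequence


def speech_rate(word_count: int, duration_s: float) -> float:
    """Speaking speed in words per second (0.0 for an empty interval)."""
    return word_count / duration_s if duration_s > 0 else 0.0


def anxiety_score(
    pitch_hz: Sequence[float],
    rate_wps: float,
    baseline_pitch_hz: float,
    baseline_rate_wps: float,
) -> float:
    """Heuristic 0.0-1.0 score: relative rise of pitch and speed above the user's baseline."""
    # Only increases count; a lower pitch or slower speech contributes nothing.
    pitch_rise = max(0.0, mean(pitch_hz) / baseline_pitch_hz - 1.0) if pitch_hz else 0.0
    rate_rise = max(0.0, rate_wps / baseline_rate_wps - 1.0)
    return min(1.0, pitch_rise + rate_rise)
```

The pitch values would have to come from a separate pitch-tracking step on the audio, which is outside the scope of this sketch.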
[0129] When the server detects a potential scam, a standard warning notification is sent to the terminal. This notification helps the user make an informed decision about the situation. If the emotion engine determines that the user is emotionally distressed, the content and format of the notification are adjusted accordingly. In addition, notifications are sent to pre-registered third parties, usually family members, enabling a quick response.
[0130] As a concrete example, consider a scenario where a user receives a phone call from a stranger urgently requesting money. The server analyzes the content of the call, and if the emotion engine rates the user's anxiety as high, it determines that there is a high probability of fraud. In this case, a highlighted warning is displayed on the terminal, and the server immediately alerts the user's family. This allows the user and their family to take quick and appropriate action.
[0131] In this way, this system, which combines an emotion engine, significantly reduces the risk of fraud via communications and provides a more reliable means of protecting user safety.
[0132] The following describes the processing flow.
[0133] Step 1:
[0134] The server detects when a user initiates a call. It then acquires and begins monitoring the call's audio data in real time.
[0135] Step 2:
[0136] The server converts the acquired audio data into text data using a speech recognition system. This allows the communication content to be treated as text information.
[0137] Step 3:
[0138] The server analyzes the converted text data and applies fraud detection algorithms. It recognizes keywords and specific phrases and makes an initial assessment of the likelihood of fraud.
[0139] Step 4:
[0140] The server uses an emotion engine to extract emotional characteristics from the audio data. It analyzes the user's emotional state (e.g., anxiety, doubt) from the tone and pitch of the sound.
[0141] Step 5:
[0142] The server integrates the text analysis results and the emotion analysis results to make a comprehensive judgment on the likelihood of fraud. If emotional abnormalities are detected, the fraud alert level is increased.
[0143] Step 6:
[0144] If the server determines that there is a high probability of fraud, it will send a warning notification to the device. The notification will include details relevant to the situation.
[0145] Step 7:
[0146] The server also notifies pre-registered third parties of the warning. This allows family members and other relevant parties to understand the situation and provide support.
[0147] Step 8:
[0148] The user can collaborate with the third party who received the notification to review the content and circumstances of the call and take appropriate action. This makes it possible to prevent fraud.
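The integration in Step 5 can be illustrated with a minimal sketch. The weights, the 0.7 threshold, the anxiety cutoff, and all names below are illustrative assumptions that do not appear in the disclosure, which states only that the alert level is raised when emotional abnormalities are detected.

```python
from dataclasses import dataclass


@dataclass
class Assessment:
    fraud_score: float   # combined likelihood in [0, 1]
    alert_level: str     # "none", "standard", or "elevated"


def integrate(text_score: float, anxiety: float,
              threshold: float = 0.7, anxiety_cutoff: float = 0.6) -> Assessment:
    """Combine a text-based fraud score with an emotion score (both in [0, 1])."""
    # Hypothetical weighting: text analysis dominates, emotion reinforces it.
    combined = min(1.0, 0.8 * text_score + 0.2 * anxiety)
    if combined < threshold:
        return Assessment(combined, "none")
    # Raise the alert level when the emotion engine reports an abnormal state.
    level = "elevated" if anxiety >= anxiety_cutoff else "standard"
    return Assessment(combined, level)
```

With these assumed values, a call whose text score is high produces a standard alert, and the same call with a high anxiety reading produces an elevated alert.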
[0149] (Example 2)
[0150] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".
[0151] Fraudulent activities conducted via communications have long been a problem, and the methods used are becoming more diverse and sophisticated. Conventional systems only perform superficial analyses of communication content and do not take into account the emotional state of the user, leading to problems such as false positives and delayed responses. This increases the risk of users becoming involved in fraudulent activities.
[0152] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0153] In this invention, the server includes means for acquiring communication information and converting said communication information into digital text, means for analyzing said digital text to detect the possibility of fraudulent activity, and means for recognizing the user's emotional state using an emotion engine based on the possibility of fraudulent activity. This makes it possible to detect the possibility of fraudulent activity through communication with high accuracy and to provide prompt and appropriate warning notifications that take into account the user's emotional state.
[0154] "Communication information" is a general term for information transmitted and received over a network, including voice, text, and digital data.
[0155] "Converting to digital text" refers to the process of changing audio information and other data into text data that can be processed by a computer.
[0156] "Detecting potential fraudulent activity" means analyzing communication information to determine whether there is any suspicion of fraudulent activity such as fraud or impersonation.
[0157] An "emotion engine" is a collection of technologies and algorithms that analyze features such as tone, pitch, and speed from voice and text data to infer the emotional state of a user.
[0158] "Recognizing the user's emotional state" means analyzing and understanding the user's emotions (for example, their level of anxiety or caution) from their statements and tone of voice during communication.
[0159] "Notifying relevant information" means sending warning or alert messages to users and related parties based on the possibility of detected fraudulent activity.
[0160] "Pre-configured third parties" refers to third parties such as family members or friends that the user has registered as emergency contacts in advance.
[0161] This invention is a system for detecting fraudulent activity via communications and ensuring user safety. It mainly consists of three elements: a server, a terminal, and a user.
[0162] The server acquires user communication information via the network. The acquired audio data is converted into digital text using speech recognition technology. Speech recognition APIs can be utilized for this process. Subsequently, an algorithm using a generative AI model is applied to analyze the text data and detect potential fraudulent activity. In this process, advanced analysis is performed that considers not only specific keywords and phrases, but also the flow and context of the conversation.
[0163] Furthermore, the server utilizes an emotion engine to analyze characteristics such as tone, pitch, and speed of communication information to evaluate the user's emotional state. This allows the server to detect, for example, signs of anxiety or caution, and use that information to assess the possibility of fraudulent activity.
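As a rough illustration of the kind of features such an emotion engine might start from, the following sketch computes a loudness proxy, a crude pitch estimate, and a speaking rate using only the Python standard library. The zero-crossing pitch estimate is a simplification chosen for brevity and is an assumption of this sketch; the disclosure does not specify how tone, pitch, or speed are measured, and all function names are hypothetical.

```python
import math
from typing import Sequence


def rms_energy(samples: Sequence[float]) -> float:
    """Root-mean-square amplitude, a rough proxy for loudness."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def zero_crossing_pitch(samples: Sequence[float], sample_rate: int) -> float:
    """Crude pitch estimate in Hz from the zero-crossing count."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    duration = len(samples) / sample_rate
    # A periodic tone crosses zero twice per cycle.
    return crossings / (2 * duration)


def speaking_rate(word_count: int, duration_s: float) -> float:
    """Words per second, computed from the transcript and the audio length."""
    return word_count / duration_s


# Synthetic 200 Hz tone standing in for one second of voiced speech.
RATE = 8000
tone = [math.sin(2 * math.pi * 200 * n / RATE + 0.1) for n in range(RATE)]
features = {
    "pitch_hz": zero_crossing_pitch(tone, RATE),
    "energy": rms_energy(tone),
    "words_per_s": speaking_rate(12, 4.0),
}
```

Real speech is far less regular than a pure tone, so a practical system would likely use a more robust pitch tracker; the sketch only shows where such features would enter the analysis.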
[0164] The device receives instructions from the server and notifies the user of the information it needs to convey. If fraudulent activity is detected, the device displays a warning notification. This notification is tailored to the user's emotional state.
[0165] As a concrete example, consider a scenario where a user receives a phone call from a stranger demanding money. The server analyzes the communication content, and if the emotion engine detects the user's anxiety, it determines that there is a high probability of fraudulent activity. In this case, a highlighted warning is immediately displayed on the device, and the user is urged to act promptly on that information.
[0166] An example of a prompt message is, "Analyze the contents of this call for signs of fraudulent activity and assess the user's emotional state." In this way, the system is designed to protect users from fraudulent activity and to improve the security of communications.
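One way such a prompt might be combined with a call transcript, and the model's answer turned into values the server can act on, is sketched below. The JSON reply format, the fallback behavior, and the function names are assumptions made for illustration; the disclosure gives only the instruction sentence itself.

```python
import json
from typing import Tuple

INSTRUCTION = (
    "Analyze the contents of this call for signs of fraudulent activity "
    "and assess the user's emotional state."
)


def build_prompt(transcript: str) -> str:
    """Attach the call transcript to the instruction from the text."""
    # The JSON reply format is an assumption made so the answer is parseable.
    return (
        f"{INSTRUCTION}\n"
        'Reply as JSON: {"fraud_score": <0..1>, "emotion": "<label>"}\n'
        f"Transcript:\n{transcript}"
    )


def parse_reply(reply: str) -> Tuple[float, str]:
    """Parse the model reply; fall back to a neutral result if malformed."""
    try:
        data = json.loads(reply)
        score = min(1.0, max(0.0, float(data["fraud_score"])))
        return score, str(data["emotion"])
    except (ValueError, KeyError, TypeError):
        return 0.0, "unknown"
```

Clamping the score and tolerating malformed replies keeps a single bad model output from raising or suppressing an alert by accident.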
[0167] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0168] Step 1:
[0169] The server acquires user voice communication information over the network. The input is a voice stream on the network, which is received in real time. Specifically, the server collects the voice data in a specified format and initially stores it in a compressed state. The output is voice data ready for analysis.
[0170] Step 2:
[0171] The server converts the acquired audio data into digital text using speech recognition technology. The input is the audio data saved in step 1, which is converted into text using a speech recognition API. Specifically, the audio data is divided into frames, and each frame is analyzed and converted into a string. The output is the audio converted into text.
[0172] Step 3:
[0173] The server analyzes text data to detect potential fraudulent activity. The input is the text data obtained in step 2, and analysis is performed here using a generative AI model. Specifically, the model detects keywords, phrases, and context, and performs calculations to score the likelihood of fraudulent activity. The output is a score or evaluation regarding the likelihood of fraudulent activity.
[0174] Step 4:
[0175] The server extracts tone, pitch, and speed from the audio data and uses an emotion engine to analyze the user's emotional state. The input is the audio data from step 1, which is then analyzed to extract emotional features. Specifically, the frequency band of the audio is analyzed to detect emotional changes and provide information about the user's emotional state. The output is data indicating the user's emotional state.
[0176] Step 5:
[0177] If the server detects potential fraudulent activity, it generates and sends an alert notification to the device. The inputs are the fraudulent activity score from step 3 and the emotional state data from step 4. The server determines the format of the notification while considering the user's emotional state. Specifically, the server assembles the notification content and sends the information to the device. The output is the alert notification displayed on the device.
[0178] Step 6:
[0179] The server sends information to pre-registered individuals. The input is the alert notification generated in step 5, using pre-configured contact information. Specifically, it sends messages to emergency contacts depending on the content of the notification. The output is an alert message to family and friends.
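The data flow of Steps 1 to 6 can be sketched as a single function whose stages are passed in as callables. The stage functions, the 0.7 threshold, the message wording, and the contact address are placeholders assumed for this sketch; in practice they would wrap a speech recognition API, a generative AI model, and an emotion engine.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Alert:
    message: str
    fraud_score: float
    emotion: str


def run_pipeline(
    audio: bytes,
    transcribe: Callable[[bytes], str],        # Step 2
    score_fraud: Callable[[str], float],       # Step 3
    detect_emotion: Callable[[bytes], str],    # Step 4
    contacts: List[str],
    send: Callable[[str, str], None],          # (recipient, message)
    threshold: float = 0.7,
) -> Optional[Alert]:
    text = transcribe(audio)
    score = score_fraud(text)
    emotion = detect_emotion(audio)
    if score < threshold:
        return None
    # Step 5: build the alert for the user's device.
    alert = Alert(f"Possible fraud detected (score {score:.2f}).", score, emotion)
    send("device", alert.message)
    # Step 6: forward the alert to pre-registered contacts.
    for contact in contacts:
        send(contact, f"Alert for your contact: {alert.message}")
    return alert


# Usage with stub stages standing in for the real services.
sent = []
alert = run_pipeline(
    b"\x00\x01",
    transcribe=lambda a: "please transfer money now",
    score_fraud=lambda t: 0.9 if "money" in t else 0.1,
    detect_emotion=lambda a: "anxious",
    contacts=["family@example.com"],
    send=lambda to, msg: sent.append((to, msg)),
)
```

Passing the stages as parameters keeps the control flow testable without network access and leaves the choice of recognition and analysis services open.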
[0180] (Application Example 2)
[0181] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as a "server" and the smart device 14 as a "terminal".
[0182] In today's communication environment, fraudulent activities are becoming more sophisticated, making it difficult to detect potential fraud in real time and take appropriate countermeasures. In particular, the process of detecting signs of fraud from voice does not adequately consider changes in emotional state, and there is a risk that users may become victims without realizing it. Therefore, there is a need for a system that can analyze communication content, detect potential fraud with high accuracy, issue appropriate warnings based on the user's emotional state, and quickly notify third parties.
[0183] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0184] In this invention, the server includes means for acquiring communication content and converting said communication content into text data, means for analyzing said text data to detect the possibility of fraud, means for notifying a warning when the possibility of fraud is detected, means for analyzing the emotional state, and means for adjusting the content and format of the notification based on the emotional state. This makes it possible to detect the possibility of fraud with high accuracy while taking into account the user's emotional state, and to take swift and accurate countermeasures.
[0185] "Communication content" refers to information transmitted in voice or text format, and encompasses all data exchanged between users.
[0186] "Means of converting to text data" refers to the process and technology of analyzing audio data and converting it into a machine-readable text format.
[0187] "Methods for detecting potential fraud" refer to algorithms that analyze communication content and identify signs of fraudulent activity based on specific keywords and context.
[0188] "Means of notifying warnings" refers to alert functions or interfaces that inform users when something is deemed potentially fraudulent.
[0189] "Means of notifying pre-configured third parties" refers to a system that automatically sends warnings to trusted contacts of the user, such as family and friends, when a potential scam is detected.
[0190] "Methods for analyzing emotional states" refer to technologies for evaluating a speaker's psychological state and emotions based on the characteristics of their voice data.
[0191] "Means for adjusting the content and format of notifications" refers to a function that appropriately changes the way warnings are expressed and communicated based on the user's emotional state assessment.
[0192] The system based on this invention is designed to analyze the user's call content in real time and assess the likelihood of fraud. The user's device collects audio data during the call and first converts it from audio to text data using speech recognition software. At this point, it is common to use a speech recognition API such as Google Cloud Speech-to-Text.
[0193] The server then analyzes the acquired text data and runs an algorithm to detect potential fraud. This algorithm considers not only specific keywords and phrases, but also the overall flow and context of the conversation. It also uses sentiment analysis libraries such as Microsoft® Azure® Text Analytics to assess the user's emotional state and uses this information to reinforce the assessment of potential fraud.
[0194] If a potential scam is detected, an immediate warning notification is sent to the device. This warning is designed to alert the user, and its content and format are adjusted based on their emotional state. Furthermore, if necessary, a pre-configured third party, usually a family member, is also notified to enable a quick response. Upon receiving the notification, the family member can review the situation and provide appropriate support to the user.
[0195] For example, if a user is elderly and receives a suspicious phone call from a stranger, the system will display a warning that "this may be a scam" and send an alert to family members saying, "Dad may be involved in a phone scam." This feature allows users and their families to prevent fraud more quickly.
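The adjustment of notification content and format might look like the following sketch. The emotion label "anxious", the message wording for the user, and the highlighting policy are assumptions of this sketch; only the family message follows the example given in the text.

```python
from dataclasses import dataclass


@dataclass
class Notification:
    title: str
    body: str
    highlighted: bool


def build_user_warning(emotion: str) -> Notification:
    """Adapt wording and emphasis to the user's assessed emotional state."""
    if emotion == "anxious":
        # Calm, directive wording and a highlighted display (assumed policy).
        return Notification(
            "This may be a scam",
            "Stay calm. Do not send money. Hang up and call your family.",
            highlighted=True,
        )
    return Notification(
        "This may be a scam",
        "Be careful about requests for money or personal information.",
        highlighted=False,
    )


def build_family_alert(relation: str) -> str:
    """Message for a pre-configured third party, following the example in the text."""
    return f"{relation} may be involved in a phone scam."
```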
[0196] An example of a prompt would be, "What steps are necessary to build an AI system that detects fraud during phone calls with elderly people?"
[0197] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0198] Step 1:
[0199] The user initiates a call, and the device collects voice data in real time. The input is the user's voice, and the output is digitized voice data. The device sends this voice data to the server at regular intervals.
[0200] Step 2:
[0201] The server converts the received audio data into text data using speech recognition software. The input is digitized audio data, and the output is text data. This conversion is performed using the Google Cloud Speech-to-Text API.
[0202] Step 3:
[0203] The server analyzes text data and evaluates the likelihood of fraud based on a fraud detection algorithm. The input is text data, and the output is an evaluation result indicating the likelihood of fraud. This involves analysis of specific keywords and phrases, as well as contextual analysis.
[0204] Step 4:
[0205] The server further uses an emotion analysis library to extract emotional features from the audio data and evaluate the user's emotional state. The input is audio data, and the output is an indicator of the emotional state. Microsoft Azure Text Analytics is used.
[0206] Step 5:
[0207] If the server determines that there is a high probability of fraud, a warning notification will be displayed on the device. The input is the evaluation result of the likelihood of fraud and the emotional state, and the output is a warning message displayed on the device screen. The warning is adjusted from a normal notification format to content and format according to the emotional state.
[0208] Step 6:
[0209] The server notifies a pre-configured third party when a warning is issued. Inputs are a fraud probability assessment and emotional state indicators, and output is an alert notification to the third party. This allows the user's family to respond immediately.
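Step 1 states that the device sends voice data to the server at regular intervals. A minimal sketch of splitting raw audio into fixed-duration chunks is shown below; the mono PCM format, the 16 kHz sample rate, and the one-second interval are assumptions, since the disclosure does not specify them.

```python
from typing import Iterator


def chunk_pcm(pcm: bytes, sample_rate: int, sample_width: int,
              interval_s: float) -> Iterator[bytes]:
    """Yield consecutive chunks covering interval_s seconds of mono PCM audio."""
    # Bytes per chunk, rounded down to a whole number of samples.
    chunk_bytes = int(sample_rate * interval_s) * sample_width
    if chunk_bytes <= 0:
        raise ValueError("interval too short for this sample rate")
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


# 2.5 seconds of silent 16-bit mono audio at 16 kHz, sent in 1-second chunks.
audio = bytes(int(16000 * 2.5) * 2)
chunks = list(chunk_pcm(audio, sample_rate=16000, sample_width=2, interval_s=1.0))
```

The final chunk is shorter than the others when the call length is not a multiple of the interval, so the receiving side should not assume a fixed chunk size.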
[0210] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0211] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0212] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.
[0213] [Second Embodiment]
[0214] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.
[0215] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.
[0216] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).
[0217] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.
[0218] The microphone 238 receives instructions from the user 20 by picking up the user's voice. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to instructions from the processor 46.
[0219] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0220] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0221] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0222] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0223] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0224] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0225] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".
[0226] One embodiment of this invention is a system that primarily operates via a communication network. This system has the function of analyzing voice communications and detecting the possibility of fraud.
[0227] The server is responsible for monitoring voice communications initiated or received by users. Once a call begins, the server acquires the communication content in real time and converts the voice data into text using advanced speech recognition technology. This text data is then used for analysis to determine the potential for fraud.
[0228] During the analysis process, the server detects specific keywords, phrases, and conversational flow, and applies algorithms to evaluate signs of fraudulent activity. In particular, it focuses on analyzing requests related to money or personal information, and expressions that convey a sense of urgency.
[0229] If the server detects a potential scam, it immediately sends a warning notification to the device. This notification serves as a trigger for the user to decide whether to continue the call or to remain vigilant. The server also notifies registered third parties, usually family members, allowing them to quickly assess the situation and take action.
[0230] As a concrete example, suppose a user receives a phone call from an unknown person who asks them to prepare money. The server analyzes the conversation and determines that it may be a sophisticated scam. A warning is immediately displayed on the device, and the user's family is also notified. As a result, the user and their family can take appropriate measures to protect themselves from fraudulent activity.
[0231] In this way, this system provides a means to prevent fraudulent activities via communications and ensure user safety through automated monitoring and notification functions.
[0232] The following describes the processing flow.
[0233] Step 1:
[0234] The server detects when a user initiates a call and acquires audio data in real time. This audio data is necessary to understand the content of the call in detail.
[0235] Step 2:
[0236] The server converts the acquired audio data into text data using speech recognition technology. This allows the content of the conversation to be treated as written information.
[0237] Step 3:
[0238] The server analyzes the converted text data. This analysis applies fraud detection algorithms to identify specific keywords and phrases.
[0239] Step 4:
[0240] The server determines the likelihood of fraud based on the analysis results. If it is determined that there is a high probability of fraud, it proceeds to the next step.
[0241] Step 5:
[0242] If the server determines that a call may be fraudulent, it will send a warning notification to the device. This allows the user to pay attention to the content of the call.
[0243] Step 6:
[0244] The server notifies pre-registered third parties of potential fraud. Typically, family members are set to receive these notifications.
[0245] Step 7:
[0246] Users and third parties can review call recordings stored on the server. This allows for a detailed understanding of fraudulent activity and enables appropriate countermeasures.
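Because the call is analyzed in real time, the text arrives in segments rather than as one finished transcript. The sketch below accumulates cues across segments and raises a single alert once cues from enough categories have appeared. The three categories follow the text (money, personal information, urgency); the cue phrases, the two-category rule, and the class name are hypothetical, and plain substring matching is used only to keep the sketch short.

```python
from typing import Dict, List, Optional, Set

# Hypothetical cue lists for the three categories named in the text.
CUES: Dict[str, List[str]] = {
    "money": ["transfer", "cash", "payment", "bank account"],
    "personal_info": ["pin number", "password", "card number"],
    "urgency": ["immediately", "right now", "today only"],
}


class CallMonitor:
    """Accumulates cue categories over a call and fires a single alert."""

    def __init__(self, min_categories: int = 2) -> None:
        self.min_categories = min_categories
        self.seen: Set[str] = set()
        self.alerted = False

    def feed(self, segment: str) -> Optional[str]:
        """Process one transcribed segment; return an alert message or None."""
        lowered = segment.lower()
        for category, phrases in CUES.items():
            if any(p in lowered for p in phrases):
                self.seen.add(category)
        if not self.alerted and len(self.seen) >= self.min_categories:
            self.alerted = True
            return "Possible fraud: " + ", ".join(sorted(self.seen))
        return None
```

Firing the alert only once per call avoids flooding the user and the registered third parties with repeated notifications as further segments arrive.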
[0247] (Example 1)
[0248] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."
[0249] Currently, fraudulent activities conducted via voice communication are on the rise, and there is a need for effective means for users to prevent becoming victims. However, existing systems lack the ability to detect potential fraud in real time and respond quickly. Furthermore, it is difficult for users themselves to recognize the signs of fraud, and in many cases, fraud is only recognized after it has occurred. To improve this situation and ensure user safety, a more advanced and immediate detection system is needed.
[0250] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0251] In this invention, the server includes means for acquiring communication information and converting it into text information using speech recognition technology; means for analyzing the text information to extract specific language patterns and words and evaluate the possibility of fraud; and means for displaying an immediate alert on the user terminal when the possibility of fraud is evaluated. This makes it possible to analyze the possibility of fraud in real time and issue a rapid warning to the user.
[0252] "Communication information" refers to the exchange of any digital data, including voice data.
[0253] "Speech recognition technology" is a technology that processes speech data on a computer and converts it into text information.
[0254] "Textual information" refers to data in text format generated by speech recognition technology.
[0255] "Specific language patterns and words" refer to keywords or phrases that are useful for fraud detection, based on past fraud cases and risky expressions.
[0256] "Means of assessing the likelihood of fraud" refers to methods or algorithms for quantifying or ranking the likelihood of fraudulent activity based on analyzed textual information.
[0257] A "user terminal" refers to an electronic device owned by a user that is used to receive warnings and communications.
[0258] An "immediate alert" is a notification issued in real time to warn users when a potential scam is detected.
[0259] "External parties" refers to third parties designated by the user as emergency contacts, with whom information is shared as needed.
[0260] Modes for carrying out the invention
[0261] This invention realizes a system for detecting and warning about fraudulent activities in voice communications, in which the server, terminal, and user each play specific roles. This system, which mainly operates over a communication network, uses advanced speech recognition and natural language processing technologies to analyze communication information in real time.
[0262] Server role
[0263] The server monitors voice communications initiated or received by users. At the start of communication, the server acquires voice data in digital format. This process involves using a cloud-based API that provides speech recognition services to convert the voice data into text. Speech recognition technologies such as the Google Cloud Speech-to-Text API may be used.
[0264] The server sends the converted text information to a natural language processing engine, which uses various algorithms to detect specific language patterns and words. Python natural language processing libraries (e.g., NLTK and spaCy) are used here. Based on the analysis results, the likelihood of fraud is quantified or ranked, and preparations are made to immediately warn the user and external parties.
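The extraction of specific language patterns and words can be sketched as follows. The text mentions NLTK and spaCy; to stay self-contained, this sketch substitutes regular expressions from the Python standard library for those libraries. The pattern names and expressions are hypothetical examples, not a list taken from the disclosure.

```python
import re
from typing import List

# Hypothetical patterns; a real list would be curated from past fraud cases.
PATTERNS = {
    "payment_request": re.compile(
        r"\b(?:wire|transfer|pay|send)\b.*\b(?:money|cash|fee)\b"
    ),
    "urgency": re.compile(r"\b(?:immediately|right now|urgent(?:ly)?)\b"),
    "secrecy": re.compile(r"\bdo(?: not|n't) tell\b"),
}


def tokenize(text: str) -> List[str]:
    """Lowercase word tokenization (a stand-in for NLTK or spaCy tokenizers)."""
    return re.findall(r"[a-z']+", text.lower())


def extract_suspicious(text: str) -> List[str]:
    """Return the names of all patterns found in the text, in sorted order."""
    normalized = " ".join(tokenize(text))
    return sorted(name for name, rx in PATTERNS.items() if rx.search(normalized))
```

For instance, a sentence asking the listener to transfer money immediately and not to tell anyone would match all three assumed patterns.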
[0265] Terminal role
[0266] The device receives warnings from the server and provides the user with an immediate alert. This alert is displayed as a pop-up notification or audio alert to warn the user of danger. The device also has a built-in function to contact external parties and can send warning messages to pre-configured contacts as needed.
[0267] User role
[0268] Users will receive a warning from their device and pay attention to the content of the call. If necessary, they will promptly end the call and take action such as reporting to the police or other relevant authorities. Users are also expected to coordinate with family and friends based on the call risk notified by the system.
[0269] Specific example
[0270] As a concrete example, consider a scenario where a user receives a call from an unknown number. If the caller requests immediate payment, the server flags the call as highly likely to be a scam. The server immediately displays a warning on the user's device and simultaneously sends notifications to designated external parties. This approach allows users to quickly take appropriate measures to protect themselves from fraudulent activity.
[0271] Example of a prompt
[0272] "Please explain the system you have in place to warn users if a phone call is potentially fraudulent."
[0273] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0274] Step 1:
[0275] The server acquires the voice communication and captures it as digital data. When the user starts a call, the server receives the voice data in streaming format via the communication protocol. The voice data as input is recorded within the server and stored for subsequent processing. The output is digital voice data.
[0276] Step 2:
[0277] The server converts audio data into text information using speech recognition technology. Using the acquired audio data as input, the server calls a speech recognition API to convert the audio to text. This process includes noise reduction and sampling to improve the accuracy of speech recognition. The output is communication information in text format.
[0278] Step 3:
[0279] The server analyzes textual information and extracts specific language patterns and words. Using text data as input, the server analyzes it using a natural language processing engine to identify potentially fraudulent keywords and expressions. This process involves text tokenization, morphological analysis, and semantic analysis. The output is a list of extracted suspicious language elements.
[0280] Step 4:
[0281] The server assesses the likelihood of fraud and sends a warning to the user's terminal if necessary. Using extracted language patterns as input, an algorithm is applied to calculate a fraud risk score. If the likelihood of fraud exceeds a certain threshold, the server generates a warning signal and sends it to the terminal. The output is a flag indicating whether or not a fraud warning was issued.
[0282] Step 5:
[0283] The terminal receives the warning from the server and immediately warns the user. Using the warning signal from the server as input, the terminal displays a notification to the user. The warning takes the form of a pop-up or voice notification and prompts the user to check the content of the call carefully. The output is a visual and auditory warning to the user.
[0284] Step 6:
[0285] Upon receiving the warning from the terminal, the user carefully examines the content of the call and, if necessary, ends the call or contacts the relevant authorities. This allows the user to take appropriate measures against fraud. The output is the specific action taken to ensure safety.
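The scoring in Step 4 can be illustrated with a small weighted sum over the extracted language elements, compared against a threshold. The element names, the weights, and the 0.6 threshold are assumptions of this sketch; the disclosure says only that a fraud risk score is calculated and compared with a certain threshold.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical per-element weights; unknown elements get a small default.
WEIGHTS: Dict[str, float] = {
    "payment_request": 0.5,
    "urgency": 0.3,
    "secrecy": 0.3,
}
DEFAULT_WEIGHT = 0.1


def fraud_risk(elements: Iterable[str], threshold: float = 0.6) -> Tuple[float, bool]:
    """Return (risk score in [0, 1], warning flag) for extracted language elements."""
    # Count each distinct element once so repetition does not inflate the score.
    raw = sum(WEIGHTS.get(e, DEFAULT_WEIGHT) for e in set(elements))
    score = min(1.0, raw)
    return score, score >= threshold
```

Under these assumed weights, a payment request combined with urgency crosses the threshold, while urgency alone does not.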
[0286] (Application Example 1)
[0287] Next, Application Example 1 will be described. In the following description, the data processing device 12 is referred to as the "server", and the smart glasses 214 are referred to as the "terminal".
[0288] In recent years, frauds using voice communication have been increasing, and there is a demand for an automated security system that can respond quickly to them. However, with conventional technologies, it is difficult to monitor voice communication and detect signs of fraud in real time, and there is a problem that the safety of users cannot be sufficiently ensured. Also, when signs of fraud are detected, appropriate responses are often not taken, and it is difficult to prevent damage in advance.
[0289] The specific processing by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0290] In this invention, the server includes means for acquiring communication information and converting said communication information into character data, means for analyzing said character data to detect signs of fraud, means for notifying a warning when signs of fraud are detected, means for notifying a pre-configured external party based on said warning, and means for displaying said warning and simultaneously prompting the user to choose whether to continue or terminate the communication. This enables the user to recognize the possibility of fraud in real time and take a quick and appropriate response.
[0291] "Communication information" refers to all information transmitted and received through communication, such as voice data and message content.
[0292] "Character data" refers to an information format in which speech is converted into text and recorded in an analyzable form.
[0293] "Signs of fraud" refer to suspicious elements that suggest fraudulent activity, based on specific keywords or expressions contained in the communication content.
[0294] "Means of notifying warnings" refers to methods or devices used to inform users of potential fraud.
[0295] "External parties" refer to third-party stakeholders, such as family members or monitoring service providers, who are pre-configured by the communication monitoring system and are separate from the user.
[0296] "Means of prompting the user to choose whether to continue or end communication" refers to interfaces or functions that support the user in deciding whether to continue or interrupt communication.
[0297] To implement this invention, the process begins with the server acquiring communication information. The server captures audio data received or transmitted through the user's terminal in real time. The hardware used is the terminal's microphone, and the audio data is transmitted to the server in digital format. This audio data is processed within the server and converted into text data using the Google Cloud Speech-to-Text API.
[0298] Next, the server analyzes the converted text data to detect signs of fraud. These signs include specific keywords, unnatural expressions, or conversational patterns based on past fraud cases. A custom fraud detection algorithm is used for this analysis.
[0299] If signs of fraud are detected, the server issues a warning. The warning appears immediately on the user's terminal, and pre-configured external parties (e.g., family members) are also notified. This allows the user to decide whether to continue or interrupt the communication depending on the situation. The notification is delivered via Firebase Cloud Messaging, and details are shared with external parties as needed.
[0300] A concrete example is a scenario where a user is asked to "pay money immediately" during a call. In this case, the server quickly transcribes the speech into text and detects predetermined fraud-related keywords. As a result, it immediately displays a warning to the user and designated external parties, allowing the user to make a safe choice.
[0301] Examples of input prompts for a generative AI model:
[0302] "Please provide keywords to detect potential fraud during a phone call."
[0303] "Please list words that emphasize urgency."
[0304] This system provides an effective means to protect users from fraud through voice communication by means of real-time monitoring and automatic notification.
[0305] The flow of the specific process in Application Example 1 will be described using FIG. 12.
[0306] Step 1:
[0307] The server receives voice data from the user's terminal in real time. The input is the voice signal from the terminal's microphone. The server digitizes this voice signal and obtains it as voice data. The output is the digitized voice data.
[0308] Step 2:
[0309] The server uses the Google Cloud Speech-to-Text API to convert the digitized voice data into character data. The input is the voice data obtained in Step 1. The server executes voice recognition technology and converts the voice data into text format. In this process, pronunciation recognition and grammar analysis are performed. The output is the converted character data.
[0310] Step 3:
[0311] The server analyzes the character data to detect signs of fraud. The input is the character data obtained in Step 2. A custom fraud detection algorithm is applied to search for specific keywords and conversation patterns. If there is a possibility of fraud, the signs are extracted. The output is the analysis result regarding the signs of fraud.
[0312] Step 4:
[0313] When the server detects signs of fraud, it notifies the user with a warning. The input is the analysis result obtained in Step 3. The server uses Firebase Cloud Messaging to send a warning message to the user's terminal. The output is the warning message displayed on the user's terminal.
[0314] Step 5:
[0315] The server displays options on the terminal so that the user can choose to continue or interrupt the communication. The input is the warning notification made in Step 4. The user uses the on-screen options to decide whether to continue or end the communication. The output is the status of the communication based on the user's selection.
[0316] Step 6:
[0317] The server also notifies pre-configured external parties in response to signs of fraud. The inputs are the analysis result obtained in Step 3 and the user's selection status. The server uses the external parties' contact information to send notifications via SMS, email, etc. The output is the notification message sent to the external parties.
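Steps 3 to 6 above can be sketched as follows. This is an illustrative outline under stated assumptions, not the implementation of the disclosure: the keyword list (other than "pay money immediately"), the `CallSession` type, and the `outbox` queue are hypothetical, and the queue stands in for the push, SMS, or email delivery so that no external service is called.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical keyword list; only "pay money immediately" comes from the text.
FRAUD_KEYWORDS = ("pay money immediately", "wire the money", "gift card")


@dataclass
class CallSession:
    user_id: str
    external_contacts: list[str]
    active: bool = True
    outbox: list[tuple[str, str]] = field(default_factory=list)  # (recipient, message)


def detect_signs(character_data: str) -> list[str]:
    """Step 3: return the fraud-related keywords found in the transcribed text."""
    lowered = character_data.lower()
    return [kw for kw in FRAUD_KEYWORDS if kw in lowered]


def process_call_text(session: CallSession, character_data: str, user_choice: str) -> list[str]:
    """Steps 3 to 6 for one piece of transcribed text.

    user_choice stands in for the on-screen selection: "continue" or "terminate".
    """
    if user_choice not in ("continue", "terminate"):
        raise ValueError("user_choice must be 'continue' or 'terminate'")
    signs = detect_signs(character_data)
    if not signs:
        return signs
    # Step 4: warn the user (queued here instead of calling a push service).
    session.outbox.append(
        (session.user_id, "Warning: possible fraud. Continue or end the call?")
    )
    # Step 5: apply the user's selection.
    if user_choice == "terminate":
        session.active = False
    # Step 6: notify each pre-configured external party.
    for contact in session.external_contacts:
        session.outbox.append(
            (contact, f"Possible fraud in a call with {session.user_id}; "
                      f"the user chose to {user_choice}.")
        )
    return signs
```

In a real deployment the user's selection would arrive asynchronously from the terminal; it is passed as an argument here only to keep the sketch self-contained.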
[0318] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.
[0319] One embodiment of this invention is an advanced analysis system that operates via a communication network. This system not only analyzes voice communications and detects potential fraud, but also has the function of recognizing the user's emotional state using an emotion engine.
[0320] The server monitors user calls and acquires audio data in real time. This acquired audio is first converted to text using speech recognition technology. Then, it is analyzed by a fraud detection algorithm. This analysis process considers not only specific keywords and phrases, but also the overall flow and context of the conversation to evaluate signs of fraud with high accuracy.
[0321] Furthermore, this system incorporates an emotion engine. The server extracts emotional characteristics such as tone, pitch, and speed from the voice data and uses this information to analyze the user's emotional state. For example, if a user shows extreme anxiety or distrust, the emotion engine detects this and uses it as information to further reinforce the possibility of fraud.
[0322] When the server detects a potential scam, a standard warning notification is sent to the device. This notification helps the user make an informed decision about the situation. If the emotion engine determines that the user is emotionally unsettled, the content and format of the notification are adjusted accordingly. In addition, notifications are sent to pre-registered third parties, usually family members, enabling a quick response.
[0323] As a concrete example, consider a scenario where a user receives a phone call from a stranger urgently requesting money. The server analyzes the content of the call, and if the emotion engine rates the user's anxiety as high, the server determines that there is a high probability of fraud. In this case, a highlighted warning is displayed on the device, and the server immediately alerts the user's family. This allows the user and their family to take quick and appropriate action.
[0324] In this way, this system, which combines an emotion engine, significantly reduces the risk of fraud via communications and provides a more reliable means of protecting user safety.
[0325] The following describes the processing flow.
[0326] Step 1:
[0327] The server detects when a user initiates a call. It then acquires and begins monitoring the call's audio data in real time.
[0328] Step 2:
[0329] The server converts the acquired audio data into text data using a speech recognition system. This allows the communication content to be treated as text information.
[0330] Step 3:
[0331] The server analyzes the converted text data and applies fraud detection algorithms. It recognizes keywords and specific phrases and makes an initial assessment of the likelihood of fraud.
[0332] Step 4:
[0333] The server uses an emotion engine to extract emotional characteristics from the audio data. It analyzes the user's emotional state (e.g., anxiety, doubt) from the tone and pitch of the sound.
[0334] Step 5:
[0335] The server integrates text analysis results and sentiment analysis results to make a comprehensive judgment on the likelihood of fraud. If emotional abnormalities are detected, the fraud alert level is increased.
[0336] Step 6:
[0337] If the server determines that there is a high probability of fraud, it will send a warning notification to the device. The notification will include details relevant to the situation.
[0338] Step 7:
[0339] The server also notifies pre-registered third parties of the warning. This allows family members and other relevant parties to understand the situation and provide support.
[0340] Step 8:
[0341] The user can collaborate with the third party who received the notification to review the content and circumstances of the call and take appropriate action. This makes it possible to prevent fraud.
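The integration described in Step 5 can be sketched as follows. The three alert levels, the score ranges, and all threshold values are assumptions chosen for illustration; the disclosure only states that the alert level is increased when an emotional abnormality is detected.

```python
from enum import IntEnum


class AlertLevel(IntEnum):
    NONE = 0
    STANDARD = 1
    HIGHLIGHTED = 2


# Assumed thresholds on scores normalized to the range 0.0 to 1.0.
TEXT_ALERT_THRESHOLD = 0.5   # text score at or above this triggers a standard alert
TEXT_WATCH_THRESHOLD = 0.3   # below this, emotion alone does not raise an alert
ANXIETY_THRESHOLD = 0.7      # anxiety at or above this counts as an abnormality


def integrate(text_score: float, anxiety: float) -> AlertLevel:
    """Combine the text analysis result with the emotion analysis result."""
    level = AlertLevel.STANDARD if text_score >= TEXT_ALERT_THRESHOLD else AlertLevel.NONE
    if anxiety >= ANXIETY_THRESHOLD and text_score >= TEXT_WATCH_THRESHOLD:
        # Raise the alert level by one step, capped at the highest level.
        level = AlertLevel(min(level + 1, AlertLevel.HIGHLIGHTED))
    return level
```

With these assumed values, a borderline text score of 0.4 combined with high anxiety yields a standard alert, while the same text score with a calm user yields none.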
[0342] (Example 2)
[0343] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".
[0344] Fraudulent activities conducted via communications have long been a problem, and the methods used are becoming more diverse and sophisticated. Conventional systems only perform superficial analyses of communication content and do not take into account the emotional state of the user, leading to problems such as false positives and delayed responses. This increases the risk of users becoming involved in fraudulent activities.
[0345] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0346] In this invention, the server includes means for acquiring communication information and converting said communication information into digital text, means for analyzing said digital text to detect the possibility of fraudulent activity, and means for recognizing the user's emotional state using an emotion engine based on the possibility of fraudulent activity. This makes it possible to detect the possibility of fraudulent activity through communication with high accuracy and to provide prompt and appropriate warning notifications that take into account the user's emotional state.
[0347] "Communication information" is a general term for information transmitted and received over a network, including voice, text, and digital data.
[0348] "Converting to digital text" refers to the process of changing audio information and other data into text data that can be processed by a computer.
[0349] "Detecting potential fraudulent activity" means analyzing communication information to determine whether there is any suspicion of fraudulent activity such as fraud or impersonation.
[0350] An "emotion engine" is a collection of technologies and algorithms that analyze features such as tone, pitch, and speed from voice and text data to infer the emotional state of a user.
[0351] "Recognizing the user's emotional state" means analyzing and understanding the user's emotions (for example, their level of anxiety or caution) from their statements and tone of voice during communication.
[0352] "Notifying relevant information" means sending warning or alert messages to users and related parties based on the possibility of detected fraudulent activity.
[0353] "Pre-configured third parties" refers to third parties such as family members or friends that the user has registered as emergency contacts in advance.
[0354] This invention is a system for detecting fraudulent activity via communications and ensuring user safety. It mainly consists of three elements: a server, a terminal, and a user.
[0355] The server acquires user communication information via the network. The acquired audio data is converted into digital text using speech recognition technology. Speech recognition APIs can be utilized for this process. Subsequently, an algorithm using a generative AI model is applied to analyze the text data and detect potential fraudulent activity. In this process, advanced analysis is performed that considers not only specific keywords and phrases, but also the flow and context of the conversation.
[0356] Furthermore, the server utilizes an emotion engine to analyze characteristics such as tone, pitch, and speed of communication information to evaluate the user's emotional state. This allows the server to detect, for example, signs of anxiety or caution, and use that information to assess the possibility of fraudulent activity.
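As a rough illustration of extracting features related to tone, pitch, and speed, the sketch below computes simple per-frame proxies. The choice of features is an assumption: root-mean-square energy stands in for loudness, zero-crossing rate is only a coarse proxy for pitch, and speaking rate is derived from a word count. The disclosure does not state which features or models the emotion engine uses.

```python
from __future__ import annotations

import math


def frame_features(samples: list[float], frame_size: int) -> list[dict[str, float]]:
    """Split audio samples into non-overlapping frames and compute simple features."""
    if frame_size < 2:
        raise ValueError("frame_size must be at least 2")
    features = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        # Root-mean-square energy: a loudness proxy.
        rms = math.sqrt(sum(x * x for x in frame) / frame_size)
        # Zero-crossing rate: fraction of adjacent sample pairs that change sign.
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
        features.append({"rms": rms, "zcr": crossings / (frame_size - 1)})
    return features


def speaking_rate(word_count: int, duration_seconds: float) -> float:
    """Words per second, a simple proxy for speech speed."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return word_count / duration_seconds
```

Features such as these would typically be passed to a trained model rather than interpreted directly; that model is outside the scope of this sketch.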
[0357] The device receives instructions from the server and notifies the user of the information to be conveyed. If fraudulent activity is detected, the device displays a warning notification. This notification is tailored to the user's emotional state as evaluated by the server.
[0358] As a concrete example, consider a scenario where a user receives a phone call from a stranger demanding money. The server analyzes the communication content, and if the emotion engine detects the user's anxiety, it determines that there is a high probability of fraudulent activity. In this case, a highlighted warning is immediately displayed on the device, and the user is urged to take prompt action based on that information.
[0359] An example of a prompt message is, "Analyze the contents of this call for signs of fraudulent activity and assess the user's emotional state." This system is designed to protect users from fraudulent activity and contribute to improved security of communication.
[0360] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0361] Step 1:
[0362] The server acquires user voice communication information over the network. The input is a voice stream on the network, which is received in real time. Specifically, the server collects the voice data in a specified format and initially stores it in a compressed state. The output is voice data ready for analysis.
[0363] Step 2:
[0364] The server converts the acquired audio data into digital text using speech recognition technology. The input is the audio data saved in step 1, which is converted into text using a speech recognition API. Specifically, the audio data is divided into frames, and each frame is analyzed and converted into a string. The output is the audio converted into text.
[0365] Step 3:
[0366] The server analyzes text data to detect potential fraudulent activity. The input is the text data obtained in step 2, and analysis is performed here using a generative AI model. Specifically, the model detects keywords, phrases, and context, and performs calculations to score the likelihood of fraudulent activity. The output is a score or evaluation regarding the likelihood of fraudulent activity.
[0367] Step 4:
[0368] The server extracts tone, pitch, and speed from the audio data and uses an emotion engine to analyze the user's emotional state. The input is the audio data from step 1, which is then analyzed to extract emotional features. Specifically, the frequency band of the audio is analyzed to detect emotional changes and provide information about the user's emotional state. The output is data indicating the user's emotional state.
[0369] Step 5:
[0370] If the server detects potential fraudulent activity, it generates and sends an alert notification to the device. The inputs are the fraudulent activity score from step 3 and the emotional state data from step 4. The server determines the format of the notification while considering the user's emotional state. Specifically, the server assembles the notification content and sends the information to the device. The output is the alert notification displayed on the device.
[0371] Step 6:
[0372] The server sends information to pre-registered individuals. The input is the alert notification generated in step 5, using pre-configured contact information. Specifically, it sends messages to emergency contacts depending on the content of the notification. The output is an alert message to family and friends.
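The analysis in Step 3 using a generative AI model can be sketched as follows. The instruction text is the example prompt given above; everything else is an assumption for illustration, including the JSON reply format, the `fraud_score` and `emotion` field names, and the `model` callable, which stands in for whichever generative AI service is used.

```python
from __future__ import annotations

import json
from typing import Callable, Optional

INSTRUCTION = (
    "Analyze the contents of this call for signs of fraudulent activity "
    "and assess the user's emotional state."
)


def build_prompt(transcript: str) -> str:
    """Attach the transcript and an assumed JSON reply format to the instruction."""
    return (
        f"{INSTRUCTION}\n"
        'Reply with JSON only: {"fraud_score": <number from 0 to 1>, "emotion": "<label>"}\n'
        f"Transcript:\n{transcript}"
    )


def analyze_call(transcript: str, model: Callable[[str], str]) -> dict[str, Optional[object]]:
    """Send the prompt to the model and parse its reply defensively."""
    raw = model(build_prompt(transcript))
    try:
        data = json.loads(raw)
        score = float(data["fraud_score"])
        emotion = str(data["emotion"])
    except (ValueError, KeyError, TypeError):
        # The reply was not in the expected format; report no result.
        return {"fraud_score": None, "emotion": None}
    # Clamp the score to the expected range.
    return {"fraud_score": max(0.0, min(1.0, score)), "emotion": emotion}


def fake_model(prompt: str) -> str:
    """Stand-in for a real generative AI call, used only for this example."""
    return '{"fraud_score": 0.9, "emotion": "anxious"}'


print(analyze_call("Send the money today or you will be arrested.", fake_model))
```

Because generative models do not always follow a requested output format, the parser returns empty results rather than raising when the reply cannot be interpreted.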
[0373] (Application Example 2)
[0374] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."
[0375] In today's communication environment, fraudulent activities are becoming more sophisticated, making it difficult to detect potential fraud in real time and take appropriate countermeasures. In particular, the process of detecting signs of fraud from voice does not adequately consider changes in emotional state, and there is a risk that users may become victims without realizing it. Therefore, there is a need for a system that can analyze communication content, detect potential fraud with high accuracy, issue appropriate warnings based on the user's emotional state, and quickly notify third parties.
[0376] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0377] In this invention, the server includes means for acquiring communication content and converting said communication content into text data, means for analyzing said text data to detect the possibility of fraud, means for notifying a warning when the possibility of fraud is detected, means for analyzing the emotional state, and means for adjusting the content and format of the notification based on the emotional state. This makes it possible to detect the possibility of fraud with high accuracy while taking into account the user's emotional state, and to take swift and accurate countermeasures.
[0378] "Communication content" refers to information transmitted in voice or text format, and encompasses all data exchanged between users.
[0379] "Means of converting to text data" refers to the process and technology of analyzing audio data and converting it into a machine-readable text format.
[0380] "Methods for detecting potential fraud" refer to algorithms that analyze communication content and identify signs of fraudulent activity based on specific keywords and context.
[0381] "Means of notifying warnings" refers to alert functions or interfaces that inform users when something is deemed potentially fraudulent.
[0382] "Means of notifying pre-configured third parties" refers to a system that automatically sends warnings to trusted contacts of the user, such as family and friends, when a potential scam is detected.
[0383] "Methods for analyzing emotional states" refer to technologies for evaluating a speaker's psychological state and emotions based on the characteristics of their voice data.
[0384] "Means for adjusting the content and format of notifications" refers to a function that appropriately changes the way warnings are expressed and communicated based on the user's emotional state assessment.
[0385] The system based on this invention is designed to analyze the user's call content in real time and assess the likelihood of fraud. The user's device collects audio data during the call and first converts it from audio to text data using speech recognition software. At this point, it is common to use a speech recognition API such as Google Cloud Speech-to-Text.
[0386] The server then analyzes the acquired text data and runs an algorithm to detect potential fraud. This algorithm considers not only specific keywords and phrases, but also the overall flow and context of the conversation. It also uses sentiment analysis libraries such as Microsoft Azure Text Analytics to assess the user's emotional state and uses this information to reinforce the assessment of potential fraud.
[0387] If a potential scam is detected, an immediate warning notification is sent to the device. This warning is designed to alert the user, and its content and format are adjusted based on their emotional state. Furthermore, if necessary, a pre-configured third party, usually a family member, is also notified to enable a quick response. Upon receiving the notification, the family member can review the situation and provide appropriate support to the user.
[0388] For example, if a user is elderly and receives a suspicious phone call from a stranger, the system will display a warning that "this may be a scam" and send an alert to family members saying, "Dad may be involved in a phone scam." This feature allows users and their families to prevent fraud more quickly.
[0389] An example of a prompt would be, "What steps are necessary to build an AI system that detects fraud during phone calls with elderly people?"
[0390] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0391] Step 1:
[0392] The user initiates a call, and the device collects voice data in real time. The input is the user's voice, and the output is digitized voice data. The device sends this voice data to the server at regular intervals.
[0393] Step 2:
[0394] The server converts the received audio data into text data using speech recognition software. The input is digitized audio data, and the output is text data. This conversion is performed using the Google Cloud Speech-to-Text API.
[0395] Step 3:
[0396] The server analyzes text data and evaluates the likelihood of fraud based on a fraud detection algorithm. The input is text data, and the output is an evaluation result indicating the likelihood of fraud. This involves analysis of specific keywords and phrases, as well as contextual analysis.
[0397] Step 4:
[0398] The server further uses an emotion analysis library to extract emotional features from the audio data and evaluate the user's emotional state. The input is audio data, and the output is an indicator of the emotional state. Microsoft Azure Text Analytics is used.
[0399] Step 5:
[0400] If the server determines that there is a high probability of fraud, a warning notification is displayed on the device. The input is the evaluation result of the likelihood of fraud and the emotional state, and the output is a warning message displayed on the device screen. The warning is changed from the normal notification format to a content and format suited to the emotional state.
[0401] Step 6:
[0402] The server notifies a pre-configured third party when a warning is issued. Inputs are a fraud probability assessment and emotional state indicators, and output is an alert notification to the third party. This allows the user's family to respond immediately.
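Steps 5 and 6 above, in which the notification content and format are adjusted and third parties are alerted, can be sketched as follows. The message wording follows the example given earlier in this Application Example; the `Notification` type, the channel names, the additional guidance text, and the emotional-state threshold are assumptions for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass

HIGH_EMOTION_THRESHOLD = 0.7  # assumed value on a 0.0 to 1.0 indicator


@dataclass(frozen=True)
class Notification:
    recipient: str
    text: str
    style: str                  # "standard" or "highlighted"
    channels: tuple[str, ...]   # hypothetical channel names


def build_notifications(
    user_name: str, emotion_indicator: float, third_parties: list[str]
) -> list[Notification]:
    """Build the user's warning and the third-party alerts for one detection."""
    unsettled = emotion_indicator >= HIGH_EMOTION_THRESHOLD
    user_text = "This may be a scam."
    if unsettled:
        # Assumed adjustment: add explicit guidance for an unsettled user.
        user_text += " Please pause and speak with someone you trust."
    notifications = [
        Notification(
            recipient="user",
            text=user_text,
            style="highlighted" if unsettled else "standard",
            channels=("popup", "voice") if unsettled else ("popup",),
        )
    ]
    for contact in third_parties:
        notifications.append(
            Notification(
                recipient=contact,
                text=f"{user_name} may be involved in a phone scam.",
                style="standard",
                channels=("message",),
            )
        )
    return notifications
```

Keeping the adjustment rules in one function makes it straightforward to review how each emotional state changes what the user sees.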
[0403] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0404] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 is input with prompts containing instructions, and with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 infers from the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and / or summarization.
[0405] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.
[0406] [Third Embodiment]
[0407] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.
[0408] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.
[0409] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0410] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.
[0411] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.
[0412] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0413] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0414] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0415] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0416] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0417] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0418] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".
[0419] One embodiment of this invention is a system that primarily operates via a communication network. This system has the function of analyzing voice communications and detecting the possibility of fraud.
[0420] The server is responsible for monitoring voice communications initiated or received by users. Once a call begins, the server acquires the communication content in real time and converts the voice data into text using advanced speech recognition technology. This text data is then used for analysis to determine the potential for fraud.
[0421] During the analysis process, the server detects specific keywords, phrases, and conversational flow, and applies algorithms to evaluate signs of fraudulent activity. In particular, it focuses on analyzing requests related to money or personal information, and expressions that convey a sense of urgency.
[0422] If the server detects a potential scam, it immediately sends a warning notification to the device. This notification serves as a trigger for the user to decide whether to continue the call or to remain vigilant. The server also notifies registered third parties, usually family members, allowing them to quickly assess the situation and take action.
[0423] As a concrete example, suppose a user receives a phone call from an unknown person who asks them to prepare money. The server analyzes the conversation and determines that it may be a sophisticated scam. A warning is immediately displayed on the device, and the user's family is also notified. As a result, the user and their family can take appropriate measures to protect themselves from fraudulent activity.
[0424] In this way, this system provides a means to prevent fraudulent activities via communications and ensure user safety through automated monitoring and notification functions.
[0425] The following describes the processing flow.
[0426] Step 1:
[0427] The server detects when a user initiates a call and acquires audio data in real time. This audio data is necessary to understand the content of the call in detail.
[0428] Step 2:
[0429] The server converts the acquired audio data into text data using speech recognition technology. This allows the content of the conversation to be treated as written information.
[0430] Step 3:
[0431] The server analyzes the converted text data. This analysis applies fraud detection algorithms to identify specific keywords and phrases.
[0432] Step 4:
[0433] The server determines the likelihood of fraud based on the analysis results. If it is determined that there is a high probability of fraud, it proceeds to the next step.
[0434] Step 5:
[0435] If the server determines that a call may be fraudulent, it will send a warning notification to the device. This allows the user to pay attention to the content of the call.
[0436] Step 6:
[0437] The server notifies pre-registered third parties of potential fraud. Typically, family members are set to receive these notifications.
[0438] Step 7:
[0439] Users and third parties can review call recordings stored on the server. This allows for a detailed understanding of fraudulent activity and enables appropriate countermeasures.
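As a rough illustration of how Steps 1 to 6 might chain together, the following Python sketch wires the stages into a single handler. The class name `CallMonitor`, the term list, and the injected callables are assumptions made for illustration only; the disclosure does not specify them, and a practical system would replace the simple substring match with the fraud detection algorithm described above.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical term list; the source does not enumerate the actual keywords.
SUSPICIOUS_TERMS = ("transfer money", "prepare money", "bank account", "right now")


@dataclass
class CallMonitor:
    transcribe: Callable[[bytes], str]          # Step 2: speech recognition (injected)
    notify_user: Callable[[str], None]          # Step 5: warning to the device
    notify_third_party: Callable[[str], None]   # Step 6: warning to registered contacts
    recordings: List[bytes] = field(default_factory=list)  # kept for review in Step 7

    def handle_audio(self, audio: bytes) -> bool:
        """Run Steps 1 to 6 on one chunk of call audio; return True if a warning was sent."""
        self.recordings.append(audio)                       # Step 1: acquire and keep audio
        text = self.transcribe(audio).lower()               # Step 2: audio -> text
        hits = [t for t in SUSPICIOUS_TERMS if t in text]   # Step 3: keyword analysis
        if not hits:                                        # Step 4: likelihood judgment
            return False
        message = "Possible fraud detected: " + ", ".join(hits)
        self.notify_user(message)                           # Step 5
        self.notify_third_party(message)                    # Step 6
        return True


# Usage with stand-ins: bytes are decoded as text instead of calling a real recognizer.
user_msgs: List[str] = []
family_msgs: List[str] = []
monitor = CallMonitor(
    transcribe=lambda audio: audio.decode("utf-8"),
    notify_user=user_msgs.append,
    notify_third_party=family_msgs.append,
)
monitor.handle_audio(b"Please prepare money right now")
```

Injecting the recognizer and the two notifiers as callables keeps the sketch independent of any particular speech recognition or messaging service.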
[0440] (Example 1)
[0441] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0442] Currently, fraudulent activities conducted via voice communication are on the rise, and there is a need for effective means for users to prevent becoming victims. However, existing systems lack the ability to detect potential fraud in real time and respond quickly. Furthermore, it is difficult for users themselves to recognize the signs of fraud, and in many cases, fraud is only recognized after it has occurred. To improve this situation and ensure user safety, a more advanced and immediate detection system is needed.
[0443] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0444] In this invention, the server includes means for acquiring communication information and converting it into text information using speech recognition technology; means for analyzing the text information to extract specific language patterns and words and evaluate the possibility of fraud; and means for displaying an immediate alert on the user terminal when the possibility of fraud is evaluated. This makes it possible to analyze the possibility of fraud in real time and issue a rapid warning to the user.
[0445] "Communication information" refers to the exchange of any digital data, including voice data.
[0446] "Speech recognition technology" is a technology that processes speech data on a computer and converts it into text information.
[0447] "Text information" refers to data in text format generated by speech recognition technology.
[0448] "Specific language patterns and words" refer to keywords or phrases that are useful for fraud detection, based on past fraud cases and risky expressions.
[0449] "Means of assessing the likelihood of fraud" refers to methods or algorithms for quantifying or ranking the likelihood of fraudulent activity based on analyzed textual information.
[0450] A "user terminal" refers to an electronic device owned by a user that is used to receive warnings and communications.
[0451] An "immediate alert" is a notification issued in real time to warn users when a potential scam is detected.
[0452] "External parties" refer to third parties who are designated as emergency contacts by the user and with whom information is shared as needed.
[0453] Modes for carrying out the invention
[0454] This invention realizes a system for detecting and warning about fraudulent activities in voice communications, in which the server, terminal, and user each play specific roles. This system, which mainly operates over a communication network, uses advanced speech recognition and natural language processing technologies to analyze communication information in real time.
[0455] Server role
[0456] The server monitors voice communications initiated or received by users. At the start of communication, the server acquires voice data in digital format. This process involves using a cloud-based API that provides speech recognition services to convert the voice data into text. Speech recognition technologies such as the Google Cloud Speech-to-Text API may be used.
[0457] The server sends the converted text information to a natural language processing engine, which uses various algorithms to detect specific language patterns and words. Python natural language processing libraries (e.g., NLTK and spaCy) are used here. Based on the analysis results, the likelihood of fraud is quantified or ranked, and preparations are made to immediately warn the user and external parties.
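The detection and ranking described here could look roughly like the sketch below. It uses only the standard `re` module as a stand-in for the NLTK or spaCy processing mentioned above; the pattern groups, the function names, and the rule "two or more categories means high" are assumptions for illustration and do not come from the disclosure.

```python
import re
from typing import Dict, List

# Hypothetical patterns grouped by category; real ones would be derived from past fraud cases.
PATTERNS: Dict[str, List[str]] = {
    "money": [r"\btransfer\b", r"\bpay(ment)?\b", r"\bcash\b"],
    "personal_info": [r"\bpin\b", r"\bpassword\b", r"\baccount number\b"],
    "urgency": [r"\bimmediately\b", r"\bright now\b", r"\btoday only\b"],
}


def extract_categories(text: str) -> List[str]:
    """Return the categories that have at least one pattern occurring in the text."""
    lowered = text.lower()
    return [
        category
        for category, patterns in PATTERNS.items()
        if any(re.search(p, lowered) for p in patterns)
    ]


def rank_fraud_likelihood(text: str) -> str:
    """Rank the likelihood of fraud by the number of distinct categories present."""
    count = len(extract_categories(text))
    if count >= 2:
        return "high"
    if count == 1:
        return "medium"
    return "low"


rank = rank_fraud_likelihood("Please pay immediately and tell me your PIN")
```

Ranking by distinct categories rather than raw keyword counts is one possible design choice: a request for money combined with urgency is treated as more suspicious than the same word repeated several times.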
[0458] Terminal role
[0459] The device receives warnings from the server and provides the user with an immediate alert. This alert is displayed as a pop-up notification or audio alert to warn the user of danger. The device also has a built-in function to contact external parties and can send warning messages to pre-configured contacts as needed.
[0460] User role
[0461] Users will receive a warning from their device and pay attention to the content of the call. If necessary, they will promptly end the call and take action such as reporting to the police or other relevant authorities. Users are also expected to coordinate with family and friends based on the call risk notified by the system.
[0462] Specific example
[0463] As a concrete example, consider a scenario where a user receives a call from an unknown number. If the caller requests immediate payment, the server judges that the call is highly likely to be a scam. The server immediately displays a warning on the user's device and simultaneously sends notifications to designated external parties. This allows the user to quickly take appropriate measures to protect themselves from fraudulent activity.
[0464] Example of a prompt
[0465] "Please explain the system you have in place to warn users if a phone call is potentially fraudulent."
[0466] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0467] Step 1:
[0468] The server acquires the voice communication and captures it as digital data. When the user starts a call, the server receives the voice data in streaming format via the communication protocol. The received voice data is recorded on the server and stored for subsequent processing. The output is digital voice data.
[0469] Step 2:
[0470] The server converts audio data into text information using speech recognition technology. Using the acquired audio data as input, the server calls a speech recognition API to convert the audio to text. This process includes noise reduction and sampling to improve the accuracy of speech recognition. The output is communication information in text format.
[0471] Step 3:
[0472] The server analyzes textual information and extracts specific language patterns and words. Using text data as input, the server analyzes it using a natural language processing engine to identify potentially fraudulent keywords and expressions. This process involves text tokenization, morphological analysis, and semantic analysis. The output is a list of extracted suspicious language elements.
[0473] Step 4:
[0474] The server assesses the likelihood of fraud and sends a warning to the user's terminal if necessary. Using extracted language patterns as input, an algorithm is applied to calculate a fraud risk score. If the likelihood of fraud exceeds a certain threshold, the server generates a warning signal and sends it to the terminal. The output is a flag indicating whether or not a fraud warning was issued.
[0475] Step 5:
[0476] The terminal receives a warning from the server and provides an immediate warning to the user. Receiving the warning signal from the server, the terminal displays a notification to the user. This warning is delivered via a pop-up or audio notification, prompting the user to carefully review the call content. The output consists of visual and auditory warnings to the user.
[0477] Step 6:
[0478] The user receives a warning, examines the call content, and, if necessary, interrupts the call or contacts the relevant authorities. Upon receiving a warning from the device, the user carefully reviews the content of the call. This allows the user to take appropriate action against fraud. The output is the specific action taken to ensure safety.
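Steps 3 and 4 of this flow (extracting suspicious language elements, computing a fraud risk score, and comparing it with a threshold) might be sketched as follows. The weights, the threshold of 60 points, and the function names are hypothetical; the disclosure states only that a score is calculated and compared with "a certain threshold". Integer points are used to avoid floating-point rounding at the threshold.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical weights in points (0 to 100) and threshold; not specified in the source.
TERM_WEIGHTS: Dict[str, int] = {
    "money": 40,
    "transfer": 40,
    "account": 30,
    "immediately": 30,
    "police": 20,
}
THRESHOLD = 60


def tokenize(text: str) -> List[str]:
    """Step 3: split lower-cased text into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def assess(text: str) -> Tuple[int, bool, List[str]]:
    """Steps 3 and 4: return (risk score, warning flag, suspicious elements)."""
    found = sorted({t for t in tokenize(text) if t in TERM_WEIGHTS})
    score = min(100, sum(TERM_WEIGHTS[t] for t in found))
    return score, score >= THRESHOLD, found


score, warn, elements = assess("Transfer the money immediately")
```

The boolean in the returned tuple corresponds to the flag described as the output of Step 4; the terminal-side display of Step 5 is not covered by this sketch.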
[0479] (Application Example 1)
[0480] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0481] In recent years, fraudulent activities using voice communication have been increasing, creating a need for automated security systems that can respond quickly to such threats. However, conventional technologies struggle to monitor voice communication and detect signs of fraud in real time, making it difficult to adequately ensure user safety. Furthermore, even when signs of fraud are detected, appropriate action is often not taken, making it difficult to prevent damage.
[0482] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0483] In this invention, the server includes means for acquiring communication information and converting said communication information into text data, means for analyzing said text data to detect signs of fraud, means for notifying a warning when signs of fraud are detected, means for notifying a pre-configured external party based on said warning, and means for displaying said warning and simultaneously prompting the user to choose whether to continue or terminate the communication. This enables the user to recognize the possibility of fraud in real time and take a quick and appropriate response.
[0484] "Communication information" refers to all information transmitted and received through communication, such as voice data and message content.
[0485] "Text data" refers to speech that has been converted into text and recorded in an analyzable form.
[0486] "Signs of fraud" refer to suspicious elements that suggest fraudulent activity, based on specific keywords or expressions contained in the communication content.
[0487] "Means of notifying warnings" refers to methods or devices used to inform users of potential fraud.
[0488] "External parties" refer to third-party stakeholders, such as family members or monitoring service providers, who are pre-configured by the communication monitoring system and are separate from the user.
[0489] "Means of prompting the user to choose whether to continue or end communication" refers to interfaces or functions that support the user in deciding whether to continue or interrupt communication.
[0490] To implement this invention, the process begins with the server acquiring communication information. The server captures audio data received or transmitted through the user's terminal in real time. The hardware used is the smartphone's microphone, which transmits the audio data to the server in digital format. This audio data is processed within the server and converted into text data using the Google Cloud Speech-to-Text API.
[0491] Next, the server analyzes the converted text data to detect signs of fraud. These signs include specific keywords, unnatural expressions, or conversational patterns based on past fraud cases. A custom fraud detection algorithm is used for this analysis.
[0492] If signs of fraud are detected, the server will issue a warning. This warning will appear immediately on the user's device and will also notify pre-configured external parties (e.g., family members). This allows the user to decide whether to continue or interrupt the communication depending on the situation. Furthermore, the notification will be appropriately delivered via Firebase Cloud Messaging, and details will be shared with external parties as needed.
[0493] A concrete example is a scenario where a user is asked to "pay money immediately" during a call. In this case, the server quickly transcribes the speech into text and detects predetermined fraud-related keywords. As a result, it immediately displays a warning to the user and designated external parties, allowing the user to make a safe choice.
[0494] Examples of input prompts for a generative AI model:
[0495] "Please provide keywords to detect potential fraud during a phone call."
[0496] "Please list words that emphasize urgency."
[0497] This system provides an effective means of protecting users from fraudulent activities via voice communication through real-time monitoring and automated notifications.
[0498] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0499] Step 1:
[0500] The server receives audio data from the user's device in real time. The input is the audio signal from the device's microphone. The server digitizes this audio signal and acquires it as audio data. The output is the digitized audio data.
[0501] Step 2:
[0502] The server uses the Google Cloud Speech-to-Text API to convert digitized audio data into text data. The input is the audio data obtained in step 1. The server applies speech recognition to convert the audio data into text format. In this process, pronunciation recognition and grammatical analysis are performed. The output is the converted text data.
[0503] Step 3:
[0504] The server analyzes text data to detect signs of fraud. The input is the text data obtained in step 2. A custom fraud detection algorithm is applied to search for specific keywords and conversation patterns. If there is a possibility of fraud, the signs are extracted. The output is the analysis results regarding signs of fraud.
[0505] Step 4:
[0506] The server notifies the user of any signs of fraud. The input is the analysis results obtained in step 3. The server uses Firebase Cloud Messaging to send a warning message to the user's device. The output is the warning message displayed on the user's device.
[0507] Step 5:
[0508] The server displays options on the terminal so that the user can choose to continue or interrupt the communication. The input is the warning notification made in step 4. The user uses the on-screen options to decide whether to continue or end the communication. The output is the status of the communication based on the user's selection.
[0509] Step 6:
[0510] The server also notifies pre-configured external parties in response to signs of fraud. The inputs are the analysis results obtained in step 3 and the user's selection status from step 5. The server uses the external parties' contact information to send notifications via SMS, email, etc. The output is the notification message sent to the external parties.
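The data exchanged in steps 4 to 6 could be modeled as in the following sketch: a warning payload carrying the two options, the communication status that results from the user's selection, and one notice per external party. All names and field layouts here are hypothetical, and actual delivery (Firebase Cloud Messaging for the terminal, SMS or email for external parties) is deliberately left out.

```python
from enum import Enum
from typing import Dict, List


class Choice(Enum):
    CONTINUE = "continue"
    TERMINATE = "terminate"


def build_warning_payload(signs: List[str]) -> Dict[str, object]:
    """Steps 4 and 5: message shown on the terminal together with the two options."""
    return {
        "title": "Possible fraud detected",
        "body": "Detected signs: " + ", ".join(signs),
        "options": [c.value for c in Choice],
    }


def apply_choice(choice: Choice) -> str:
    """Step 5: communication status resulting from the user's selection."""
    return "call_ended" if choice is Choice.TERMINATE else "call_ongoing"


def build_external_notices(
    signs: List[str], status: str, contacts: List[str]
) -> List[Dict[str, str]]:
    """Step 6: one notice per pre-configured contact (delivery not shown)."""
    text = f"Fraud warning issued ({', '.join(signs)}). Call status: {status}."
    return [{"to": contact, "text": text} for contact in contacts]


signs = ["pay money immediately"]
payload = build_warning_payload(signs)
status = apply_choice(Choice.TERMINATE)
notices = build_external_notices(signs, status, ["family@example.com"])
```

Including the call status in the external notice reflects the step 6 input described above, which combines the analysis results with the user's selection.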
[0511] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.
[0512] One embodiment of this invention is an advanced analysis system that operates via a communication network. This system not only analyzes voice communications and detects potential fraud, but also has the function of recognizing the user's emotional state using an emotion engine.
[0513] The server monitors user calls and acquires audio data in real time. This acquired audio is first converted to text using speech recognition technology. Then, it is analyzed by a fraud detection algorithm. This analysis process considers not only specific keywords and phrases, but also the overall flow and context of the conversation to evaluate signs of fraud with high accuracy.
[0514] Furthermore, this system incorporates an emotion engine. The server extracts emotional characteristics such as tone, pitch, and speed from the voice data and uses this information to analyze the user's emotional state. For example, if a user shows extreme anxiety or distrust, the emotion engine detects this and uses it as information to further reinforce the possibility of fraud.
[0515] When the server detects a potential scam, a standard warning notification is sent to the device. This notification helps the user to make an informed decision about the situation. If the emotion engine determines that the user is emotionally distressed, the content and format of the notification are adjusted accordingly. In addition, notifications are sent to pre-registered third parties, usually family members, enabling a quick response.
[0516] As a concrete example, consider a scenario where a user receives a phone call from a stranger requesting money in an urgent manner. The server analyzes the content of the call, and if the emotion engine rates the user's anxiety as high, it determines that there is a high probability of fraud. In this case, a highlighted warning is displayed on the device, and the server immediately alerts the user's family. This allows the user and their family to take quick and appropriate action.
[0517] In this way, this system, which combines an emotion engine, significantly reduces the risk of fraud via communications and provides a more reliable means of protecting user safety.
[0518] The following describes the processing flow.
[0519] Step 1:
[0520] The server detects when a user initiates a call. It then acquires and begins monitoring the call's audio data in real time.
[0521] Step 2:
[0522] The server converts the acquired audio data into text data using a speech recognition system. This allows the communication content to be treated as text information.
[0523] Step 3:
[0524] The server analyzes the converted text data and applies fraud detection algorithms. It recognizes keywords and specific phrases and makes an initial assessment of the likelihood of fraud.
[0525] Step 4:
[0526] The server uses an emotion engine to extract emotional characteristics from the audio data. It analyzes the user's emotional state (e.g., anxiety, doubt) from the tone and pitch of the sound.
[0527] Step 5:
[0528] The server integrates the text analysis results and the emotion analysis results to make a comprehensive judgment on the likelihood of fraud. If emotional abnormalities are detected, the fraud alert level is increased.
[0529] Step 6:
[0530] If the server determines that there is a high probability of fraud, it will send a warning notification to the device. The notification will include details relevant to the situation.
[0531] Step 7:
[0532] The server also notifies pre-registered third parties of the warning. This allows family members and other relevant parties to understand the situation and provide support.
[0533] Step 8:
[0534] The user can collaborate with the third party who received the notification to review the content and circumstances of the call and take appropriate action. This makes it possible to prevent fraud.
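The integration in Step 5 of this flow, in which an emotional abnormality raises the fraud alert level, might be sketched as below. The four level names, the cut-off values, and the rule of raising the level by exactly one step are assumptions chosen for illustration; the disclosure says only that the alert level is increased.

```python
from typing import Dict

LEVELS = ["none", "caution", "warning", "critical"]
EMOTION_THRESHOLD = 0.7  # hypothetical cut-off for an "emotional abnormality"


def base_level(text_likelihood: float) -> int:
    """Step 3: map the text-based likelihood (0 to 1) to an initial alert level index."""
    if text_likelihood >= 0.6:
        return 2
    if text_likelihood >= 0.3:
        return 1
    return 0


def integrate(text_likelihood: float, emotions: Dict[str, float]) -> str:
    """Step 5: raise the alert level by one step if any emotion score is abnormal."""
    level = base_level(text_likelihood)
    strongest = max(emotions.values(), default=0.0)
    if strongest >= EMOTION_THRESHOLD:
        level = min(level + 1, len(LEVELS) - 1)
    return LEVELS[level]


alert = integrate(0.7, {"anxiety": 0.9, "doubt": 0.2})
```

In this sketch a strong emotion raises the level even when the text analysis found nothing, which yields "caution" rather than a full warning; whether that behavior is desirable is a design decision the disclosure leaves open.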
[0535] (Example 2)
[0536] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0537] Fraudulent activities conducted via communications have long been a problem, and the methods used are becoming more diverse and sophisticated. Conventional systems only perform superficial analyses of communication content and do not take into account the emotional state of the user, leading to problems such as false positives and delayed responses. This increases the risk of users becoming involved in fraudulent activities.
[0538] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0539] In this invention, the server includes means for acquiring communication information and converting said communication information into digital text, means for analyzing said digital text to detect the possibility of fraudulent activity, and means for recognizing the user's emotional state using an emotion engine based on the possibility of fraudulent activity. This makes it possible to detect the possibility of fraudulent activity through communication with high accuracy and to provide prompt and appropriate warning notifications that take into account the user's emotional state.
[0540] "Communication information" is a general term for information transmitted and received over a network, including voice, text, and digital data.
[0541] "Converting to digital text" refers to the process of changing audio information and other data into text data that can be processed by a computer.
[0542] "Detecting potential fraudulent activity" means analyzing communication information to determine whether there is any suspicion of fraudulent activity such as fraud or impersonation.
[0543] An "emotion engine" is a collection of technologies and algorithms that analyze features such as tone, pitch, and speed from voice and text data to infer the emotional state of a user.
[0544] "Recognizing the user's emotional state" means analyzing and understanding the user's emotions (for example, their level of anxiety or caution) from their statements and tone of voice during communication.
[0545] "Notifying relevant information" means sending warning or alert messages to users and related parties based on the possibility of detected fraudulent activity.
[0546] "Pre-configured third parties" refers to third parties such as family members or friends that the user has registered as emergency contacts in advance.
[0547] This invention is a system for detecting fraudulent activity via communications and ensuring user safety. It mainly consists of three elements: a server, a terminal, and a user.
[0548] The server acquires user communication information via the network. The acquired audio data is converted into digital text using speech recognition technology. Speech recognition APIs can be utilized for this process. Subsequently, an algorithm using a generative AI model is applied to analyze the text data and detect potential fraudulent activity. In this process, advanced analysis is performed that considers not only specific keywords and phrases, but also the flow and context of the conversation.
[0549] Furthermore, the server utilizes an emotion engine to analyze characteristics such as tone, pitch, and speed of communication information to evaluate the user's emotional state. This allows the server to detect, for example, signs of anxiety or caution, and use that information to assess the possibility of fraudulent activity.
[0550] The device receives instructions from the server and notifies the user of the information it needs to convey. If fraudulent activity is detected, the device displays a warning notification. This notification is tailored to the user's emotional state.
[0551] As a concrete example, consider a scenario where a user receives a phone call from a stranger demanding money. The server analyzes the communication content, and if the emotion engine detects the user's anxiety, it determines that there is a high probability of fraudulent activity. In this case, a highlighted warning is immediately displayed on the device, and the user is required to take prompt action based on that information.
[0552] An example of a prompt message is, "Analyze the contents of this call for signs of fraudulent activity and assess the user's emotional state." This system is designed to protect users from fraudulent activity and to contribute to improved security in communications.
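One way the prompt message quoted above might be combined with a call transcript and the model's reply turned into usable values is sketched below. The request for a JSON reply, the keys `fraud_score` and `emotion`, and both function names are assumptions; the disclosure does not describe the reply format, and the call to the generative AI model itself is omitted.

```python
import json
from typing import Dict, Union

PROMPT_INSTRUCTION = (
    "Analyze the contents of this call for signs of fraudulent activity "
    "and assess the user's emotional state."
)


def build_prompt(transcript: str) -> str:
    """Combine the instruction, an assumed reply format, and the call transcript."""
    return (
        f"{PROMPT_INSTRUCTION}\n"
        'Reply with JSON only, using the keys "fraud_score" (0 to 1) and "emotion".\n'
        f"Call transcript:\n{transcript}"
    )


def parse_reply(reply: str) -> Dict[str, Union[float, str]]:
    """Parse the model reply; fall back to a neutral result if it is not usable."""
    try:
        data = json.loads(reply)
        score = float(data["fraud_score"])
        emotion = str(data["emotion"])
    except (ValueError, KeyError, TypeError):
        return {"fraud_score": 0.0, "emotion": "unknown"}
    # Clamp the score so downstream thresholds always see a value in [0, 1].
    return {"fraud_score": min(1.0, max(0.0, score)), "emotion": emotion}


prompt = build_prompt("Caller: You must send the money today.")
result = parse_reply('{"fraud_score": 0.85, "emotion": "anxious"}')
```

Falling back to a neutral result on an unparsable reply is only one option; a deployment might instead treat a failed analysis as a reason to retry or to warn conservatively.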
[0553] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0554] Step 1:
[0555] The server acquires user voice communication information over the network. The input is a voice stream on the network, which is received in real time. Specifically, the server collects the voice data in a specified format and initially stores it in a compressed state. The output is voice data ready for analysis.
[0556] Step 2:
[0557] The server converts the acquired audio data into digital text using speech recognition technology. The input is the audio data saved in step 1, which is converted into text using a speech recognition API. Specifically, the audio data is divided into frames, and each frame is analyzed and converted into a string. The output is the audio converted into text.
[0558] Step 3:
[0559] The server analyzes text data to detect potential fraudulent activity. The input is the text data obtained in step 2, and analysis is performed here using a generative AI model. Specifically, the model detects keywords, phrases, and context, and performs calculations to score the likelihood of fraudulent activity. The output is a score or evaluation regarding the likelihood of fraudulent activity.
[0560] Step 4:
[0561] The server extracts tone, pitch, and speed from the audio data and uses an emotion engine to analyze the user's emotional state. The input is the audio data from step 1, which is then analyzed to extract emotional features. Specifically, the frequency band of the audio is analyzed to detect emotional changes and provide information about the user's emotional state. The output is data indicating the user's emotional state.
[0562] Step 5:
[0563] If the server detects potential fraudulent activity, it generates and sends an alert notification to the device. The inputs are the fraudulent activity score from step 3 and the emotional state data from step 4. The server determines the format of the notification while considering the user's emotional state. Specifically, the server assembles the notification content and sends the information to the device. The output is the alert notification displayed on the device.
[0564] Step 6:
[0565] The server sends information to pre-registered individuals. The inputs are the alert notification generated in step 5 and the pre-configured contact information. Specifically, the server sends messages to emergency contacts depending on the content of the notification. The output is an alert message to family and friends.
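The frame division of step 2 and the feature extraction of step 4 can be illustrated with the simplified sketch below. It computes only two stand-in features, loudness (RMS energy) and a rough pitch estimate from zero crossings, and does not attempt speaking speed or the emotion classification itself. The function names, the frame size, and the choice of these particular features are assumptions, not the method of the disclosure.

```python
import math
from typing import Dict, List


def split_frames(samples: List[float], frame_size: int) -> List[List[float]]:
    """Divide the signal into consecutive frames, dropping a trailing partial frame."""
    return [
        samples[i:i + frame_size]
        for i in range(0, len(samples) - frame_size + 1, frame_size)
    ]


def rms(frame: List[float]) -> float:
    """Root-mean-square energy of one frame, used here as a loudness measure."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def zero_crossing_pitch(frame: List[float], sample_rate: int) -> float:
    """Rough pitch estimate in Hz: a pure tone crosses zero twice per cycle."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return crossings * sample_rate / (2 * len(frame))


def voice_features(samples: List[float], sample_rate: int, frame_size: int) -> Dict[str, float]:
    """Average the per-frame features over the whole signal."""
    frames = split_frames(samples, frame_size)
    return {
        "loudness": sum(rms(f) for f in frames) / len(frames),
        "pitch_hz": sum(zero_crossing_pitch(f, sample_rate) for f in frames) / len(frames),
    }


# Synthetic test signal: one second of a 200 Hz tone with amplitude 0.5 at 8 kHz.
SAMPLE_RATE = 8000
tone = [0.5 * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE + 0.5) for n in range(SAMPLE_RATE)]
features = voice_features(tone, SAMPLE_RATE, frame_size=800)
```

For the synthetic tone, the pitch estimate comes out close to 200 Hz and the loudness close to 0.35, which is the RMS of a sine with amplitude 0.5. Real speech is far less regular, so a practical emotion engine would rely on more robust pitch tracking.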
[0566] (Application Example 2)
[0567] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0568] In today's communication environment, fraudulent activities are becoming more sophisticated, making it difficult to detect potential fraud in real time and take appropriate countermeasures. In particular, existing processes for detecting signs of fraud from voice do not adequately consider changes in emotional state, and there is a risk that users may become victims without realizing it. Therefore, there is a need for a system that can analyze communication content, detect potential fraud with high accuracy, issue appropriate warnings based on the user's emotional state, and quickly notify third parties.
[0569] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0570] In this invention, the server includes means for acquiring communication content and converting said communication content into text data, means for analyzing said text data to detect the possibility of fraud, means for notifying a warning when the possibility of fraud is detected, means for analyzing the emotional state, and means for adjusting the content and format of the notification based on the emotional state. This makes it possible to detect the possibility of fraud with high accuracy while taking into account the user's emotional state, and to take swift and accurate countermeasures.
[0571] "Communication content" refers to information transmitted in voice or text format, and encompasses all data exchanged between users.
[0572] "Means of converting to text data" refers to the process and technology of analyzing audio data and converting it into a machine-readable text format.
[0573] "Methods for detecting potential fraud" refer to algorithms that analyze communication content and identify signs of fraudulent activity based on specific keywords and context.
[0574] "Means of notifying warnings" refers to alert functions or interfaces that inform users when something is deemed potentially fraudulent.
[0575] "Means of notifying pre-configured third parties" refers to a system that automatically sends warnings to trusted contacts of the user, such as family and friends, when a potential scam is detected.
[0576] "Methods for analyzing emotional states" refer to technologies for evaluating a speaker's psychological state and emotions based on the characteristics of their voice data.
[0577] "Means for adjusting the content and format of notifications" refers to a function that appropriately changes the way warnings are expressed and communicated based on the user's emotional state assessment.
[0578] The system based on this invention is designed to analyze the user's call content in real time and assess the likelihood of fraud. The user's device collects audio data during the call, and this audio is first converted to text data using speech recognition software. A speech recognition API such as Google Cloud Speech-to-Text is commonly used for this conversion.
[0579] The server then analyzes the acquired text data and runs an algorithm to detect potential fraud. This algorithm considers not only specific keywords and phrases, but also the overall flow and context of the conversation. It also uses sentiment analysis libraries such as Microsoft Azure Text Analytics to assess the user's emotional state and uses this information to reinforce the assessment of potential fraud.
[0580] If a potential scam is detected, an immediate warning notification is sent to the device. This warning is designed to alert the user, and its content and format are adjusted based on their emotional state. Furthermore, if necessary, a pre-configured third party, usually a family member, is also notified to enable a quick response. Upon receiving the notification, the family member can review the situation and provide appropriate support to the user.
[0581] For example, if a user is elderly and receives a suspicious phone call from a stranger, the system will display a warning that "this may be a scam" and send an alert to family members saying, "Dad may be involved in a phone scam." This feature allows users and their families to prevent fraud more quickly.
[0582] An example of a prompt would be, "What steps are necessary to build an AI system that detects fraud during phone calls with elderly people?"
[0583] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0584] Step 1:
[0585] The user initiates a call, and the device collects voice data in real time. The input is the user's voice, and the output is digitized voice data. The device sends this voice data to the server at regular intervals.
[0586] Step 2:
[0587] The server converts the received audio data into text data using speech recognition software. The input is digitized audio data, and the output is text data. This conversion is performed using the Google Cloud Speech-to-Text API.
[0588] Step 3:
[0589] The server analyzes text data and evaluates the likelihood of fraud based on a fraud detection algorithm. The input is text data, and the output is an evaluation result indicating the likelihood of fraud. This involves analysis of specific keywords and phrases, as well as contextual analysis.
[0590] Step 4:
[0591] The server further uses an emotion analysis library to extract emotional features from the audio data and evaluate the user's emotional state. The input is audio data, and the output is an indicator of the emotional state. Microsoft Azure Text Analytics is used.
[0592] Step 5:
[0593] If the server determines that there is a high probability of fraud, a warning notification will be displayed on the device. The inputs are the evaluation result for the likelihood of fraud and the emotional state, and the output is a warning message displayed on the device screen. The content and format of the warning are adjusted from the normal notification format according to the emotional state.
[0594] Step 6:
[0595] The server notifies a pre-configured third party when a warning is issued. Inputs are a fraud probability assessment and emotional state indicators, and output is an alert notification to the third party. This allows the user's family to respond immediately.
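Step 1 above has the device send voice data to the server at regular intervals. The following is a minimal sketch of that chunking, assuming 16 kHz, 16-bit mono PCM and a 0.5-second interval; the function name `chunk_pcm` and all parameter values are illustrative assumptions and do not appear in this disclosure.

```python
from typing import Iterator


def chunk_pcm(
    pcm: bytes,
    sample_rate_hz: int = 16000,   # assumed sample rate
    sample_width_bytes: int = 2,   # assumed 16-bit samples
    interval_s: float = 0.5,       # assumed sending interval
) -> Iterator[bytes]:
    """Yield consecutive fixed-duration slices of mono PCM audio."""
    chunk_size = int(sample_rate_hz * interval_s) * sample_width_bytes
    if chunk_size <= 0:
        raise ValueError("interval too short for the given sample rate")
    for start in range(0, len(pcm), chunk_size):
        # The final slice may be shorter than chunk_size.
        yield pcm[start:start + chunk_size]


# 1.25 seconds of silence at the assumed format is 40000 bytes.
audio = bytes(40000)
chunks = list(chunk_pcm(audio))
```

Each yielded chunk would then be transmitted to the server, which forwards it to speech recognition as in Step 2.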
[0596] The specific processing unit 290 transmits the result of the specific processing to the headset terminal 314. In the headset terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0597] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives a prompt containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt, and outputs the inference results in a data format such as audio data or text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0598] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset terminal 314.
[0599] [Fourth Embodiment]
[0600] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.
[0601] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.
[0602] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).
[0603] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.
[0604] The microphone 238 receives instructions and the like from the user 20 by picking up the voice of the user 20. The microphone 238 captures the user's voice, converts it into audio data, and outputs the audio data to the processor 46. The speaker 240 outputs audio according to instructions from the processor 46.
[0605] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0606] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0607] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.
[0608] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0609] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0610] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0611] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0612] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0613] One embodiment of this invention is a system that primarily operates via a communication network. This system has the function of analyzing voice communications and detecting the possibility of fraud.
[0614] The server is responsible for monitoring voice communications initiated or received by users. Once a call begins, the server acquires the communication content in real time and converts the voice data into text using advanced speech recognition technology. This text data is then used for analysis to determine the potential for fraud.
[0615] During the analysis process, the server detects specific keywords, phrases, and conversational flow, and applies algorithms to evaluate signs of fraudulent activity. In particular, it focuses on analyzing requests related to money or personal information, and expressions that convey a sense of urgency.
[0616] If the server detects a potential scam, it immediately sends a warning notification to the device. This notification serves as a trigger for the user to decide whether to continue the call or to remain vigilant. The server also notifies registered third parties, usually family members, allowing them to quickly assess the situation and take action.
[0617] As a concrete example, suppose a user receives a phone call from an unknown person who asks them to prepare money. The server analyzes the conversation and determines that it may be a sophisticated scam. A warning is immediately displayed on the device, and the user's family is also notified. As a result, the user and their family can take appropriate measures to protect themselves from fraudulent activity.
[0618] In this way, this system provides a means to prevent fraudulent activities via communications and ensure user safety through automated monitoring and notification functions.
[0619] The following describes the processing flow.
[0620] Step 1:
[0621] The server detects when a user initiates a call and acquires audio data in real time. This audio data is necessary to understand the content of the call in detail.
[0622] Step 2:
[0623] The server converts the acquired audio data into text data using speech recognition technology. This allows the content of the conversation to be treated as written information.
[0624] Step 3:
[0625] The server analyzes the converted text data. This analysis applies fraud detection algorithms to identify specific keywords and phrases.
[0626] Step 4:
[0627] The server determines the likelihood of fraud based on the analysis results. If it is determined that there is a high probability of fraud, it proceeds to the next step.
[0628] Step 5:
[0629] If the server determines that a call may be fraudulent, it will send a warning notification to the device. This allows the user to pay attention to the content of the call.
[0630] Step 6:
[0631] The server notifies pre-registered third parties of potential fraud. Typically, family members are set to receive these notifications.
[0632] Step 7:
[0633] Users and third parties can review call recordings stored on the server. This allows for a detailed understanding of fraudulent activity and enables appropriate countermeasures.
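The processing flow above can be pictured as a single loop over incoming audio. The sketch below is only an illustration under stated assumptions: the speech recognizer, the fraud check, and both notification channels are passed in as plain callables, and the names `monitor_call` and `CallMonitorResult` are hypothetical. The accumulated transcript stands in for the stored record that Step 7 makes available for review.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class CallMonitorResult:
    transcript: List[str] = field(default_factory=list)
    warned: bool = False


def monitor_call(
    audio_chunks: Iterable[bytes],
    transcribe: Callable[[bytes], str],
    is_suspicious: Callable[[str], bool],
    warn_user: Callable[[str], None],
    notify_third_party: Callable[[str], None],
) -> CallMonitorResult:
    """Run Steps 1 to 6 over a stream of audio chunks."""
    result = CallMonitorResult()
    for chunk in audio_chunks:                       # Step 1: acquire audio
        result.transcript.append(transcribe(chunk))  # Step 2: speech to text
        if result.warned:
            continue  # warn at most once per call, keep recording text
        # Steps 3 and 4: analyze the conversation so far and judge it.
        if is_suspicious(" ".join(result.transcript)):
            warn_user("This call may be a scam.")                      # Step 5
            notify_third_party("A possible scam call was detected.")   # Step 6
            result.warned = True
    return result


# Usage with stand-in components (a real system would call a speech API).
user_warnings: List[str] = []
family_alerts: List[str] = []
outcome = monitor_call(
    [b"hello", b"please transfer money", b"now"],
    transcribe=lambda chunk: chunk.decode(),
    is_suspicious=lambda text: "transfer" in text,
    warn_user=user_warnings.append,
    notify_third_party=family_alerts.append,
)
```

Passing the components in as callables keeps the loop independent of any particular speech recognition or messaging service.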
[0634] (Example 1)
[0635] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0636] Currently, fraudulent activities conducted via voice communication are on the rise, and there is a need for effective means for users to prevent becoming victims. However, existing systems lack the ability to detect potential fraud in real time and respond quickly. Furthermore, it is difficult for users themselves to recognize the signs of fraud, and in many cases, fraud is only recognized after it has occurred. To improve this situation and ensure user safety, a more advanced and immediate detection system is needed.
[0637] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0638] In this invention, the server includes means for acquiring communication information and converting it into text information using speech recognition technology; means for analyzing the text information to extract specific language patterns and words and evaluate the possibility of fraud; and means for displaying an immediate alert on the user terminal when the possibility of fraud is evaluated. This makes it possible to analyze the possibility of fraud in real time and issue a rapid warning to the user.
[0639] "Communication information" refers to the exchange of any digital data, including voice data.
[0640] "Speech recognition technology" is a technology that processes speech data on a computer and converts it into text information.
[0641] "Text information" refers to data in text format generated by speech recognition technology.
[0642] "Specific language patterns and words" refer to keywords or phrases that are useful for fraud detection, based on past fraud cases and risky expressions.
[0643] "Means of assessing the likelihood of fraud" refers to methods or algorithms for quantifying or ranking the likelihood of fraudulent activity based on analyzed textual information.
[0644] A "user terminal" refers to an electronic device owned by a user that is used to receive warnings and communications.
[0645] An "immediate alert" is a notification issued in real time to warn users when a potential scam is detected.
[0646] "External parties" refers to third parties designated by the user as emergency contacts, with whom information is shared as needed.
[0647] Modes for carrying out the invention
[0648] This invention realizes a system for detecting and warning about fraudulent activities in voice communications, in which the server, terminal, and user each play specific roles. This system, which mainly operates over a communication network, uses advanced speech recognition and natural language processing technologies to analyze communication information in real time.
[0649] Server role
[0650] The server monitors voice communications initiated or received by users. At the start of communication, the server acquires voice data in digital format. This process involves using a cloud-based API that provides speech recognition services to convert the voice data into text. Speech recognition technologies such as the Google Cloud Speech-to-Text API may be used.
[0651] The server sends the converted text information to a natural language processing engine, which uses various algorithms to detect specific language patterns and words. Python natural language processing libraries (e.g., NLTK and spaCy) are used here. Based on the analysis results, the likelihood of fraud is quantified or ranked, and preparations are made to immediately warn the user and external parties.
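As a rough illustration of detecting specific language patterns and words, the sketch below uses Python's standard `re` module as a lightweight stand-in for the NLTK or spaCy processing mentioned above. The category names and the word lists are hypothetical examples chosen for illustration; the disclosure does not specify them.

```python
import re
from typing import Dict, List

# Hypothetical pattern sets; a real list would be built from past fraud cases.
PATTERNS: Dict[str, "re.Pattern[str]"] = {
    "money": re.compile(r"\b(transfer|wire|pay(?:ment)?|cash|gift cards?)\b", re.I),
    "personal_info": re.compile(r"\b(pin|password|account number|card number)\b", re.I),
    "urgency": re.compile(r"\b(immediately|right now|urgent(?:ly)?|today only)\b", re.I),
}


def extract_suspicious(text: str) -> Dict[str, List[str]]:
    """Return the matched expressions per category, lowercased."""
    hits: Dict[str, List[str]] = {}
    for category, pattern in PATTERNS.items():
        found = [m.group(0).lower() for m in pattern.finditer(text)]
        if found:
            hits[category] = found
    return hits


hits = extract_suspicious("Please pay immediately and tell me your PIN.")
```

The returned categories could then feed whatever scoring or ranking step the server applies.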
[0652] Terminal role
[0653] The device receives warnings from the server and provides the user with an immediate alert. This alert is displayed as a pop-up notification or audio alert to warn the user of danger. The device also has a built-in function to contact external parties and can send warning messages to pre-configured contacts as needed.
[0654] User role
[0655] Users will receive a warning from their device and pay attention to the content of the call. If necessary, they will promptly end the call and take action such as reporting to the police or other relevant authorities. Users are also expected to coordinate with family and friends based on the call risk notified by the system.
[0656] Specific example
[0657] As a concrete example, consider a scenario where a user receives a call from an unknown number. If the caller requests immediate payment, the server's system detects this as a highly likely scam. The server immediately displays a warning on the user's device and simultaneously sends notifications to designated external parties. This approach allows users to quickly take appropriate measures to protect themselves from fraudulent activity.
[0658] Example of a prompt
[0659] "Please explain the system you have in place to warn users if a phone call is potentially fraudulent."
[0660] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0661] Step 1:
[0662] The server acquires the voice communication and captures it as digital data. When the user starts a call, the server receives the voice data in streaming format via the communication protocol. The input voice data is recorded within the server and stored for subsequent processing. The output is digital voice data.
[0663] Step 2:
[0664] The server converts audio data into text information using speech recognition technology. Using the acquired audio data as input, the server calls a speech recognition API to convert the audio to text. This process includes noise reduction and sampling to improve the accuracy of speech recognition. The output is communication information in text format.
[0665] Step 3:
[0666] The server analyzes textual information and extracts specific language patterns and words. Using text data as input, the server analyzes it using a natural language processing engine to identify potentially fraudulent keywords and expressions. This process involves text tokenization, morphological analysis, and semantic analysis. The output is a list of extracted suspicious language elements.
[0667] Step 4:
[0668] The server assesses the likelihood of fraud and sends a warning to the user's terminal if necessary. Using extracted language patterns as input, an algorithm is applied to calculate a fraud risk score. If the likelihood of fraud exceeds a certain threshold, the server generates a warning signal and sends it to the terminal. The output is a flag indicating whether or not a fraud warning was issued.
[0669] Step 5:
[0670] The terminal receives a warning from the server and provides an immediate warning to the user. Receiving the warning signal from the server, the terminal displays a notification to the user. This warning is delivered via a pop-up or audio notification, prompting the user to carefully review the call content. The output consists of visual and auditory warnings to the user.
[0671] Step 6:
[0672] The user receives a warning, examines the call content, and, if necessary, interrupts the call or contacts the relevant authorities. Upon receiving a warning from the terminal, the user carefully reviews the content of the call. This allows the user to take appropriate action against fraud. The output is the specific action taken to ensure safety.
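Step 4 above calculates a fraud risk score and issues a warning when it exceeds a threshold. A minimal sketch of one possible scoring rule follows, assuming each detected category of suspicious language carries a fixed weight; the weights, the threshold value, and the function names are hypothetical and are not taken from this disclosure.

```python
from typing import Iterable

# Hypothetical per-category weights and warning threshold.
WEIGHTS = {"money": 0.5, "personal_info": 0.3, "urgency": 0.3}
THRESHOLD = 0.6


def fraud_risk_score(categories: Iterable[str]) -> float:
    """Sum the weights of the distinct detected categories, capped at 1.0."""
    total = sum(WEIGHTS.get(category, 0.0) for category in set(categories))
    return min(1.0, total)


def should_warn(categories: Iterable[str], threshold: float = THRESHOLD) -> bool:
    """Return the warning flag: True when the score exceeds the threshold."""
    return fraud_risk_score(categories) > threshold


flag_money_and_urgency = should_warn(["money", "urgency"])
flag_urgency_only = should_warn(["urgency"])
```

With these assumed weights, a request for money combined with urgency raises the flag, while urgency alone does not.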
[0673] (Application Example 1)
[0674] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0675] In recent years, fraudulent activities using voice communication have been increasing, creating a need for automated security systems that can respond quickly to such threats. However, conventional technologies struggle to monitor voice communication and detect signs of fraud in real time, making it difficult to adequately ensure user safety. Furthermore, even when signs of fraud are detected, appropriate action is often not taken, making it difficult to prevent damage.
[0676] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0677] In this invention, the server includes means for acquiring communication information and converting said communication information into text data, means for analyzing said text data to detect signs of fraud, means for notifying a warning when signs of fraud are detected, means for notifying a pre-configured external party based on said warning, and means for displaying said warning and simultaneously prompting the user to choose whether to continue or terminate the communication. This enables the user to recognize the possibility of fraud in real time and respond quickly and appropriately.
[0678] "Communication information" refers to all information transmitted and received through communication, such as voice data and message content.
[0679] "Text data" refers to information in which speech has been converted into text and recorded in an analyzable form.
[0680] "Signs of fraud" refer to suspicious elements that suggest fraudulent activity, based on specific keywords or expressions contained in the communication content.
[0681] "Means of notifying warnings" refers to methods or devices used to inform users of potential fraud.
[0682] "External parties" refer to third-party stakeholders, such as family members or monitoring service providers, who are pre-configured by the communication monitoring system and are separate from the user.
[0683] "Means of prompting the user to choose whether to continue or end communication" refers to interfaces or functions that support the user in deciding whether to continue or interrupt communication.
[0684] To implement this invention, the process begins with the server acquiring communication information. The server captures audio data received or transmitted through the user's terminal in real time. The hardware used is the smartphone's microphone, which transmits the audio data to the server in digital format. This audio data is processed within the server and converted into text data using the Google Cloud Speech-to-Text API.
[0685] Next, the server analyzes the converted text data to detect signs of fraud. These signs include specific keywords, unnatural expressions, or conversational patterns based on past fraud cases. A custom fraud detection algorithm is used for this analysis.
[0686] If signs of fraud are detected, the server will issue a warning. This warning will appear immediately on the user's device and will also notify pre-configured external parties (e.g., family members). This allows the user to decide whether to continue or interrupt the communication depending on the situation. Furthermore, the notification will be appropriately delivered via Firebase Cloud Messaging, and details will be shared with external parties as needed.
[0687] A concrete example is a scenario where a user is asked to "pay money immediately" during a call. In this case, the server quickly transcribes the speech into text and detects predetermined fraud-related keywords. As a result, it immediately displays a warning to the user and designated external parties, allowing the user to make a safe choice.
[0688] Examples of input prompts for a generative AI model:
[0689] "Please provide keywords to detect potential fraud during a phone call."
[0690] "Please list words that emphasize urgency."
[0691] This system provides an effective means of protecting users from fraudulent activities via voice communication through real-time monitoring and automated notifications.
[0692] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0693] Step 1:
[0694] The server receives audio data from the user's device in real time. The input is the audio signal from the device's microphone. The server digitizes this audio signal and acquires it as audio data. The output is the digitized audio data.
[0695] Step 2:
[0696] The server uses the Google Cloud Speech-to-Text API to convert digitized audio data into text data. The input is the audio data obtained in step 1. The server applies speech recognition to convert the audio data into text format. In this process, pronunciation recognition and grammatical analysis are performed. The output is the converted text data.
[0697] Step 3:
[0698] The server analyzes text data to detect signs of fraud. The input is the text data obtained in step 2. A custom fraud detection algorithm is applied to search for specific keywords and conversation patterns. If there is a possibility of fraud, the signs are extracted. The output is the analysis results regarding signs of fraud.
[0699] Step 4:
[0700] The server notifies the user of any signs of fraud. The input is the analysis results obtained in step 3. The server uses Firebase Cloud Messaging to send a warning message to the user's device. The output is the warning message displayed on the user's device.
[0701] Step 5:
[0702] The server displays options on the terminal so that the user can choose to continue or interrupt the communication. The input is the warning notification made in step 4. The user uses the on-screen options to decide whether to continue or end the communication. The output is the status of the communication based on the user's selection.
[0703] Step 6:
[0704] The server also notifies pre-configured external parties in response to signs of fraud. The inputs are the analysis results obtained in step 3 and the user's selection status from step 5. The server uses the external parties' contact information to send notifications via SMS, email, etc. The output is the notification message sent to the external parties.
[0705] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.
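The user's choice to continue or end the call, and the message then sent to external parties, can be sketched as follows. This is a simplified illustration only: the names `UserChoice`, `CallStatus`, `apply_choice`, and `external_message` are hypothetical, and actual delivery (for example via a push, SMS, or email service) is deliberately left out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class UserChoice(Enum):
    CONTINUE = "continue"
    TERMINATE = "terminate"


@dataclass
class CallStatus:
    active: bool
    warned: bool


def apply_choice(choice: UserChoice) -> CallStatus:
    """The call stays active only if the user chose to continue."""
    return CallStatus(active=(choice is UserChoice.CONTINUE), warned=True)


def external_message(signs: List[str], choice: UserChoice) -> str:
    """Summarize the detected signs and the user's selection status."""
    action = "continued" if choice is UserChoice.CONTINUE else "ended"
    return f"Possible fraud detected ({', '.join(signs)}). The user {action} the call."


status = apply_choice(UserChoice.TERMINATE)
message = external_message(["pay money immediately"], UserChoice.TERMINATE)
```

Including the selection status in the message lets the recipient see at a glance whether the call is still in progress.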
[0706] One embodiment of this invention is an advanced analysis system that operates via a communication network. This system not only analyzes voice communications and detects potential fraud, but also has the function of recognizing the user's emotional state using an emotion engine.
[0707] The server monitors user calls and acquires audio data in real time. This acquired audio is first converted to text using speech recognition technology. Then, it is analyzed by a fraud detection algorithm. This analysis process considers not only specific keywords and phrases, but also the overall flow and context of the conversation to evaluate signs of fraud with high accuracy.
[0708] Furthermore, this system incorporates an emotion engine. The server extracts emotional characteristics such as tone, pitch, and speed from the voice data and uses this information to analyze the user's emotional state. For example, if a user shows extreme anxiety or distrust, the emotion engine detects this and uses it as information to further reinforce the possibility of fraud.
[0709] When the server detects a potential scam, a standard warning notification is sent to the device. This notification helps the user make an informed decision about the situation. If the emotion engine determines that the user is emotionally distressed, the content and format of the notification are adjusted accordingly. In addition, notifications are sent to pre-registered third parties, usually family members, enabling a quick response.
[0710] As a concrete example, consider a scenario where a user receives a phone call from a stranger urgently requesting money. The server analyzes the content of the call, and if the emotion engine rates the user's anxiety as high, it determines that there is a high probability of fraud. In this case, a highlighted warning is displayed on the device, and the server immediately alerts the user's family. This allows the user and their family to take quick and appropriate action.
[0711] In this way, this system, which combines an emotion engine, significantly reduces the risk of fraud via communications and provides a more reliable means of protecting user safety.
[0712] The following describes the processing flow.
[0713] Step 1:
[0714] The server detects when a user initiates a call. It then acquires and begins monitoring the call's audio data in real time.
[0715] Step 2:
[0716] The server converts the acquired audio data into text data using a speech recognition system. This allows the communication content to be treated as text information.
[0717] Step 3:
[0718] The server analyzes the converted text data and applies fraud detection algorithms. It recognizes keywords and specific phrases and makes an initial assessment of the likelihood of fraud.
[0719] Step 4:
[0720] The server uses an emotion engine to extract emotional characteristics from the audio data. It analyzes the user's emotional state (e.g., anxiety, doubt) from the tone and pitch of the sound.
[0721] Step 5:
[0722] The server integrates text analysis results and sentiment analysis results to make a comprehensive judgment on the likelihood of fraud. If emotional abnormalities are detected, the fraud alert level is increased.
[0723] Step 6:
[0724] If the server determines that there is a high probability of fraud, it will send a warning notification to the device. The notification will include details relevant to the situation.
[0725] Step 7:
[0726] The server also notifies pre-registered third parties of the warning. This allows family members and other relevant parties to understand the situation and provide support.
[0727] Step 8:
[0728] The user can collaborate with the third party who received the notification to review the content and circumstances of the call and take appropriate action. This makes it possible to prevent fraud.
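Step 5 of this flow integrates the text analysis result with the emotion analysis result and raises the alert level when an emotional abnormality is detected. One way to sketch this is shown below, assuming both results are scores in the range 0 to 1. The thresholds and level names are hypothetical, and the rule that emotion only escalates an alert already triggered by the text analysis is an assumption of this sketch, not something the disclosure states.

```python
def alert_level(
    text_score: float,
    anxiety: float,
    text_threshold: float = 0.6,     # assumed fraud-score threshold
    anxiety_threshold: float = 0.7,  # assumed emotional-abnormality threshold
) -> str:
    """Combine the text and emotion results into an alert level."""
    if text_score < text_threshold:
        return "none"         # text analysis did not indicate fraud
    if anxiety >= anxiety_threshold:
        return "highlighted"  # emotional abnormality raises the alert level
    return "standard"


level_anxious = alert_level(text_score=0.8, anxiety=0.9)
level_calm = alert_level(text_score=0.8, anxiety=0.2)
```

The returned level could select between the standard warning notification and the highlighted warning described earlier.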
[0729] (Example 2)
[0730] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0731] Fraudulent activities conducted via communications have long been a problem, and the methods used are becoming more diverse and sophisticated. Conventional systems perform only superficial analyses of communication content and do not take into account the emotional state of the user, leading to problems such as false positives and delayed responses. This increases the risk of users becoming involved in fraudulent activities.
[0732] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0733] In this invention, the server includes means for acquiring communication information and converting said communication information into digital text, means for analyzing said digital text to detect the possibility of fraudulent activity, and means for recognizing the user's emotional state using an emotion engine based on the possibility of fraudulent activity. This makes it possible to detect the possibility of fraudulent activity through communication with high accuracy and to provide prompt and appropriate warning notifications that take into account the user's emotional state.
[0734] "Communication information" is a general term for information transmitted and received over a network, including voice, text, and digital data.
[0735] "Converting to digital text" refers to the process of changing audio information and other data into text data that can be processed by a computer.
[0736] "Detecting potential fraudulent activity" means analyzing communication information to determine whether there is any suspicion of fraudulent activity such as fraud or impersonation.
[0737] An "emotion engine" is a collection of technologies and algorithms that analyze features such as tone, pitch, and speed from voice and text data to infer the emotional state of a user.
[0738] "Recognizing the user's emotional state" means analyzing and understanding the user's emotions (for example, their level of anxiety or caution) from their statements and tone of voice during communication.
[0739] "Notifying relevant information" means sending warning or alert messages to users and related parties based on the possibility of detected fraudulent activity.
[0740] "Pre-configured third parties" refers to third parties such as family members or friends that the user has registered as emergency contacts in advance.
[0741] This invention is a system for detecting fraudulent activity via communications and ensuring user safety. It mainly consists of three elements: a server, a terminal, and a user.
[0742] The server acquires user communication information via the network. The acquired audio data is converted into digital text using speech recognition technology. Speech recognition APIs can be utilized for this process. Subsequently, an algorithm using a generative AI model is applied to analyze the text data and detect potential fraudulent activity. In this process, advanced analysis is performed that considers not only specific keywords and phrases, but also the flow and context of the conversation.
[0743] Furthermore, the server utilizes an emotion engine to analyze characteristics such as tone, pitch, and speed of communication information to evaluate the user's emotional state. This allows the server to detect, for example, signs of anxiety or caution, and use that information to assess the possibility of fraudulent activity.
[0744] The device receives instructions from the server and notifies the user of the information it needs to convey. If fraudulent activity is detected, the device displays a warning notification. This notification is tailored to match the user's pre-set emotional state.
[0745] As a concrete example, consider a scenario where a user receives a phone call from a stranger demanding money. The server analyzes the communication content, and if the emotion engine detects the user's anxiety, it determines that there is a high probability of fraudulent activity. In this case, a highlighted warning is immediately displayed on the device, and the user is required to take prompt action based on that information.
[0746] An example of a prompt message is, "Analyze the contents of this call for signs of fraudulent activity and assess the user's emotional state." This system is designed to protect users from fraudulent activity and contribute to improved security through communication.
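The prompt message above could be combined with the call transcript before it is passed to a generative AI model. The following is a minimal sketch only. The JSON reply format (keys `fraud_score` and `emotion`) and the function names `build_prompt` and `parse_reply` are assumptions made for illustration; the disclosure does not specify a reply format, and no particular model API is implied.

```python
import json
from typing import Tuple

INSTRUCTION = (
    "Analyze the contents of this call for signs of fraudulent activity "
    "and assess the user's emotional state."
)


def build_prompt(transcript: str) -> str:
    """Attach the call transcript to the instruction sentence."""
    # Requesting JSON is an assumption so that the reply can be parsed.
    return (
        f"{INSTRUCTION}\n"
        'Reply as JSON with keys "fraud_score" (0 to 1) and "emotion".\n'
        f"Transcript:\n{transcript}"
    )


def parse_reply(reply: str) -> Tuple[float, str]:
    """Parse the hypothetical JSON reply; return neutral values if it is malformed."""
    try:
        data = json.loads(reply)
        # Clamp the score into [0, 1] in case the model returns an out-of-range value.
        score = min(1.0, max(0.0, float(data["fraud_score"])))
        return score, str(data["emotion"])
    except (ValueError, KeyError, TypeError):
        return 0.0, "unknown"
```

Returning a neutral result on a malformed reply is one possible design choice; a deployed system might instead retry or escalate.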
[0747] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0748] Step 1:
[0749] The server acquires user voice communication information over the network. The input is a voice stream on the network, which is received in real time. Specifically, the server collects the voice data in a specified format and initially stores it in a compressed state. The output is voice data ready for analysis.
[0750] Step 2:
[0751] The server converts the acquired audio data into digital text using speech recognition technology. The input is the audio data saved in step 1, which is converted into text using a speech recognition API. Specifically, the audio data is divided into frames, and each frame is analyzed and converted into a string. The output is the audio converted into text.
[0752] Step 3:
[0753] The server analyzes text data to detect potential fraudulent activity. The input is the text data obtained in step 2, and analysis is performed here using a generative AI model. Specifically, the model detects keywords, phrases, and context, and performs calculations to score the likelihood of fraudulent activity. The output is a score or evaluation regarding the likelihood of fraudulent activity.
[0754] Step 4:
[0755] The server extracts tone, pitch, and speed from the audio data and uses an emotion engine to analyze the user's emotional state. The input is the audio data from step 1, which is then analyzed to extract emotional features. Specifically, the frequency band of the audio is analyzed to detect emotional changes and provide information about the user's emotional state. The output is data indicating the user's emotional state.
[0756] Step 5:
[0757] If the server detects potential fraudulent activity, it generates and sends an alert notification to the device. The inputs are the fraudulent activity score from step 3 and the emotional state data from step 4. The server determines the format of the notification while considering the user's emotional state. Specifically, the server assembles the notification content and sends the information to the device. The output is the alert notification displayed on the device.
[0758] Step 6:
[0759] The server sends information to pre-registered individuals. The input is the alert notification generated in step 5, using pre-configured contact information. Specifically, it sends messages to emergency contacts depending on the content of the notification. The output is an alert message to family and friends.
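Steps 2 to 6 above can be sketched as a single function. This is an illustrative outline under stated assumptions, not the implementation of the disclosure: the speech recognizer, the generative AI scorer, and the emotion engine are passed in as hypothetical callables, and the threshold of 0.7, the `"anxious"` label, and the message wording are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class CallResult:
    fraud_score: float
    emotion: str
    user_alert: Optional[str] = None
    contact_messages: List[str] = field(default_factory=list)


def handle_call(
    audio: bytes,
    transcribe: Callable[[bytes], str],        # Step 2: hypothetical speech recognizer
    score_fraud: Callable[[str], float],       # Step 3: hypothetical generative AI scorer
    estimate_emotion: Callable[[bytes], str],  # Step 4: hypothetical emotion engine
    contacts: List[str],
    threshold: float = 0.7,                    # assumed cut-off, not from the disclosure
) -> CallResult:
    text = transcribe(audio)
    result = CallResult(score_fraud(text), estimate_emotion(audio))
    if result.fraud_score >= threshold:
        # Step 5: the notification format depends on the user's emotional state.
        prefix = "[HIGHLIGHTED] " if result.emotion == "anxious" else ""
        result.user_alert = prefix + "This call may be fraudulent."
        # Step 6: one message per pre-registered emergency contact.
        result.contact_messages = [
            f"To {name}: {result.user_alert}" for name in contacts
        ]
    return result
```

Passing the three analysis stages as callables keeps the sketch independent of any specific speech recognition API or model.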
[0760] (Application Example 2)
[0761] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0762] In today's communication environment, fraudulent activities are becoming more sophisticated, making it difficult to detect potential fraud in real time and take appropriate countermeasures. In particular, existing processes for detecting signs of fraud from voice do not adequately consider changes in emotional state, and there is a risk that users may become victims without realizing it. Therefore, there is a need for a system that can analyze communication content, detect potential fraud with high accuracy, issue appropriate warnings based on the user's emotional state, and quickly notify third parties.
[0763] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0764] In this invention, the server includes means for acquiring communication content and converting said communication content into text data, means for analyzing said text data to detect the possibility of fraud, means for notifying a warning when the possibility of fraud is detected, means for analyzing the emotional state, and means for adjusting the content and format of the notification based on the emotional state. This makes it possible to detect the possibility of fraud with high accuracy while taking into account the user's emotional state, and to take swift and accurate countermeasures.
[0765] "Communication content" refers to information transmitted in voice or text format, and encompasses all data exchanged between users.
[0766] "Means of converting to text data" refers to the process and technology of analyzing audio data and converting it into a machine-readable text format.
[0767] "Means for detecting the possibility of fraud" refers to algorithms that analyze communication content and identify signs of fraudulent activity based on specific keywords and context.
[0768] "Means of notifying warnings" refers to alert functions or interfaces that inform users when something is deemed potentially fraudulent.
[0769] "Means of notifying pre-configured third parties" refers to a system that automatically sends warnings to trusted contacts of the user, such as family and friends, when a potential scam is detected.
[0770] "Means for analyzing the emotional state" refers to technologies for evaluating a speaker's psychological state and emotions based on the characteristics of their voice data.
[0771] "Means for adjusting the content and format of notifications" refers to a function that appropriately changes the way warnings are expressed and communicated based on the user's emotional state assessment.
[0772] The system based on this invention is designed to analyze the user's call content in real time and assess the likelihood of fraud. The user's device collects audio data during the call and first converts it from audio to text data using speech recognition software. At this point, it is common to use a speech recognition API such as Google Cloud Speech-to-Text.
[0773] The server then analyzes the acquired text data and runs an algorithm to detect potential fraud. This algorithm considers not only specific keywords and phrases, but also the overall flow and context of the conversation. It also uses sentiment analysis libraries such as Microsoft Azure Text Analytics to assess the user's emotional state and uses this information to reinforce the assessment of potential fraud.
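One way the emotional-state assessment might reinforce the text-based fraud assessment is a simple weighted adjustment. The sketch below is an assumption for illustration: the function name `reinforce_score`, the `negative_sentiment` input in the range 0 to 1, and the weight of 0.3 do not appear in the disclosure.

```python
def reinforce_score(
    text_score: float, negative_sentiment: float, weight: float = 0.3
) -> float:
    """Raise the text-based fraud score when the user's sentiment is negative.

    Both inputs are expected in [0, 1]; `weight` is an assumed tuning value.
    """
    for name, value in (
        ("text_score", text_score),
        ("negative_sentiment", negative_sentiment),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    # Move the score toward 1.0 in proportion to the negative sentiment,
    # so the result never leaves the [0, 1] range.
    return text_score + weight * negative_sentiment * (1.0 - text_score)
```

With this form, neutral sentiment leaves the text-based score unchanged, and strongly negative sentiment closes part of the remaining gap to 1.0.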
[0774] If a potential scam is detected, an immediate warning notification is sent to the device. This warning is designed to alert the user, and its content and format are adjusted based on their emotional state. Furthermore, if necessary, a pre-configured third party, usually a family member, is also notified to enable a quick response. Upon receiving the notification, the family member can review the situation and provide appropriate support to the user.
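The adjustment of warning content and format described above might look like the following sketch. The emotion labels, the two thresholds, and the message wording are hypothetical values chosen for illustration and are not taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WarningNotice:
    text: str
    highlighted: bool    # whether the device should emphasize the warning
    notify_family: bool  # whether the pre-configured third party is also notified


def adjust_warning(emotion: str, fraud_likelihood: float) -> WarningNotice:
    """Choose wording and format according to the user's estimated emotional state."""
    if emotion in ("anxious", "afraid"):
        # Softer wording, intended not to add to the user's distress.
        text = "You are safe. This may be a scam, so please do not send money."
    elif emotion == "calm":
        text = "This may be a scam."
    else:
        text = "Caution: this may be a scam. Consider ending the call."
    return WarningNotice(
        text=text,
        highlighted=fraud_likelihood >= 0.8,    # assumed threshold
        notify_family=fraud_likelihood >= 0.5,  # assumed threshold
    )
```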
[0775] For example, if a user is elderly and receives a suspicious phone call from a stranger, the system will display a warning that "this may be a scam" and send an alert to family members saying, "Dad may be involved in a phone scam." This feature allows users and their families to prevent fraud more quickly.
[0776] An example of a prompt would be, "What steps are necessary to build an AI system that detects fraud during phone calls with elderly people?"
[0777] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0778] Step 1:
[0779] The user initiates a call, and the device collects voice data in real time. The input is the user's voice, and the output is digitized voice data. The device sends this voice data to the server at regular intervals.
[0780] Step 2:
[0781] The server converts the received audio data into text data using speech recognition software. The input is digitized audio data, and the output is text data. This conversion is performed using the Google Cloud Speech-to-Text API.
[0782] Step 3:
[0783] The server analyzes text data and evaluates the likelihood of fraud based on a fraud detection algorithm. The input is text data, and the output is an evaluation result indicating the likelihood of fraud. This involves analysis of specific keywords and phrases, as well as contextual analysis.
[0784] Step 4:
[0785] The server further uses an emotion analysis library to extract emotional features from the audio data and evaluate the user's emotional state. The input is audio data, and the output is an indicator of the emotional state. Microsoft Azure Text Analytics is used.
[0786] Step 5:
[0787] If the server determines that there is a high probability of fraud, a warning notification is displayed on the device. The inputs are the evaluation result of the likelihood of fraud and the emotional state, and the output is a warning message displayed on the device screen. The content and format of the warning are changed from the normal notification format according to the emotional state.
[0788] Step 6:
[0789] The server notifies a pre-configured third party when a warning is issued. Inputs are a fraud probability assessment and emotional state indicators, and output is an alert notification to the third party. This allows the user's family to respond immediately.
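Step 1 above states that the device sends voice data to the server at regular intervals. A minimal sketch of that buffering, assuming integer PCM samples and a hypothetical helper name `chunk_audio`, could be:

```python
from typing import Iterable, Iterator, List


def chunk_audio(
    samples: Iterable[int], sample_rate: int, interval_s: float
) -> Iterator[List[int]]:
    """Group PCM samples into fixed-duration chunks for periodic upload."""
    chunk_size = int(sample_rate * interval_s)
    if chunk_size <= 0:
        raise ValueError("sample_rate * interval_s must cover at least one sample")
    buffer: List[int] = []
    for sample in samples:
        buffer.append(sample)
        if len(buffer) == chunk_size:
            yield buffer
            buffer = []
    if buffer:
        # Flush the remaining samples when the call ends.
        yield buffer
```

Each yielded chunk would then be sent to the server for the speech recognition in Step 2; the transport itself is outside the scope of this sketch.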
[0790] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0791] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. Prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images, are input to the data generation model 58. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0792] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.
[0793] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping, namely an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.
[0794] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. Above and below the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.
[0795] These emotions are distributed at the 3 o'clock position on the Emotion Map 400, and usually fluctuate between feelings of security and anxiety. In the right half of the Emotion Map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.
[0796] The inside of the Emotion Map 400 represents inner thoughts, while the outside represents actions. Therefore, the further toward the outside of the Emotion Map 400 an emotion is located, the more visible it becomes (that is, the more it is expressed in actions).
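The layout of the emotion map described above can be modeled with a clock position and a ring index. The sketch below is only an illustration of that geometry: the class `MapPoint`, the ring numbering, and the tolerance are assumptions, and the actual placement of individual emotions in Figure 9 is not reproduced here.

```python
import math
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class MapPoint:
    name: str
    clock: float  # clock position: 12 = top (pleasure), 6 = bottom (displeasure)
    ring: int     # 1 = innermost (inner thoughts); larger = expressed in actions

    def xy(self) -> Tuple[float, float]:
        # Convert the clock position to Cartesian coordinates (x right, y up).
        angle = math.radians(90.0 - self.clock * 30.0)
        return (self.ring * math.cos(angle), self.ring * math.sin(angle))


def describe(point: MapPoint) -> Dict[str, str]:
    """Classify a point by side (reaction or situation) and by valence."""
    x, y = point.xy()
    eps = 1e-9  # tolerance for floating-point results that should be exactly zero
    return {
        "side": "situation" if x > eps else "reaction" if x < -eps else "both",
        "valence": "pleasure" if y > eps else "displeasure" if y < -eps else "neutral",
    }
```

For example, a point at the 3 o'clock position falls on the situation side with neutral valence, which is consistent with the description of the right half of the map.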
[0797] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https: / / ci.nii.ac.jp / naid / 500000375379). The left half of the emotion map contains emotions belonging to a region called "response," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.
[0798] The emotion map defines two emotions that promote learning. One is the emotion around the middle of the negative "repentance" and "reflection" on the situation side. In other words, it is when the robot experiences negative emotions such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the emotion around the positive "desire" on the reaction side. In other words, it is when the robot has positive feelings such as "I want more" or "I want to know more."
[0799] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
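Once the neural network has produced an emotion value for each emotion, the user's emotion could be chosen as the one with the highest value. The following is a minimal sketch under that assumption; the function names, the tolerance of 0.1, and any numeric values used with it are hypothetical, and the network itself is not shown.

```python
from typing import Dict, List


def dominant_emotion(emotion_values: Dict[str, float]) -> str:
    """Return the emotion with the highest emotion value."""
    if not emotion_values:
        raise ValueError("emotion_values must not be empty")
    return max(emotion_values, key=emotion_values.get)


def similar_emotions(
    emotion_values: Dict[str, float], tolerance: float = 0.1
) -> List[str]:
    """List, in alphabetical order, emotions within `tolerance` of the highest value."""
    top = emotion_values[dominant_emotion(emotion_values)]
    return sorted(
        name for name, value in emotion_values.items() if top - value <= tolerance
    )
```

With illustrative values in which "reassured," "calm," and "confident" score close together, `similar_emotions` would return all three, reflecting the training goal that emotions located close together on the map receive similar values.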
[0800] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.
[0801] In the above embodiment, an example was given in which a specific process is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and a distributed processing of the specific process may be performed by multiple computers, including computer 22. For example, a data generation model 58 may be provided in an external device of the data processing device 12, and the external device may generate data according to the input data.
[0802] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes specific processing according to the specific processing program 56.
[0803] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.
[0804] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.
[0805] The following types of processors can be used as hardware resources to perform specific processing. Examples of processors include a CPU, a general-purpose processor that functions as a hardware resource to perform specific processing by executing software, i.e., a program. Other examples of processors include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which have circuit configurations specifically designed to perform specific processing. All of these processors have built-in or connected memory, and all of them perform specific processing by using memory.
[0806] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.
[0807] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.
[0808] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.
[0809] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of the technical aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that you may delete unnecessary parts, add new elements, or replace elements in the descriptions and illustrations presented above, as long as you do not deviate from the essence of the technical aspects of this disclosure. Furthermore, in order to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge and the like that do not require special explanation to enable the implementation of the technical aspects of this disclosure have been omitted from the descriptions and illustrations presented above.
[0810] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted to be incorporated by reference.
[0811] The following is further disclosed regarding the embodiments described above.
[0812] (Claim 1)
[0813] A means for acquiring communication content and converting said communication content into text data,
[0814] A means for analyzing the text data to detect the possibility of fraud,
[0815] A means of notifying a warning when a potential scam is detected,
[0816] A means of notifying a pre-configured third party based on the warning,
[0817] A system that includes this.
[0818] (Claim 2)
[0819] The system according to claim 1, further comprising means for recording the content of communications if the possibility of fraud is detected.
[0820] (Claim 3)
[0821] The system according to claim 1, further comprising means for performing analysis based on past fraud cases in detecting the possibility of fraud.
[0822] "Example 1"
[0823] (Claim 1)
[0824] A means for acquiring communication information and converting said communication information into text information using speech recognition technology,
[0825] A means for analyzing the textual information to extract specific language patterns and words and to evaluate the possibility of fraud,
[0826] A means of displaying an immediate alert on the user's terminal when the possibility of fraud is assessed,
[0827] A means of contacting designated external parties based on the alert,
[0828] A system that includes this.
[0829] (Claim 2)
[0830] The system according to claim 1, further comprising a function for recording communication information if the possibility of fraud is assessed.
[0831] (Claim 3)
[0832] The system according to claim 1, further comprising a function for performing data analysis based on past fraud cases in assessing the likelihood of fraud.
[0833] "Application Example 1"
[0834] (Claim 1)
[0835] A means for acquiring communication information and converting said communication information into text data,
[0836] A means for analyzing the text data to detect signs of fraud,
[0837] A means of notifying a warning when signs of fraud are detected,
[0838] Based on the warning, a means of notifying a pre-configured external party,
[0839] A means to display the warning and simultaneously prompt the user to choose whether to continue or terminate the communication,
[0840] A system that includes this.
[0841] (Claim 2)
[0842] The system according to claim 1, further comprising means for recording communication information when signs of fraud are detected.
[0843] (Claim 3)
[0844] The system according to claim 1, further comprising means for performing analysis based on past fraud cases in detecting signs of fraud.
[0845] "Example 2 of combining an emotion engine"
[0846] (Claim 1)
[0847] A means for acquiring communication information and converting said communication information into digital text,
[0848] A means for analyzing the digital text to detect the possibility of fraudulent activity,
[0849] A means of recognizing the user's emotional state using an emotion engine based on the possibility of the fraudulent activity,
[0850] A means of notifying information when potential fraudulent activity is detected,
[0851] A means of transmitting information to a pre-configured third party based on the information to be notified,
[0852] A system that includes this.
[0853] (Claim 2)
[0854] The system according to claim 1, further comprising means for adjusting information provided by the emotion engine in accordance with the emotional state when potential fraudulent activity is detected.
[0855] (Claim 3)
[0856] The system according to claim 1, further comprising means for performing analysis based on past cases of fraudulent activity in detecting the possibility of fraudulent activity.
[0857] "Application example 2 when combining with an emotional engine"
[0858] (Claim 1)
[0859] A means for acquiring communication content and converting said communication content into text data,
[0860] A means for analyzing the text data to detect the possibility of fraud,
[0861] A means of notifying a warning when a potential scam is detected,
[0862] A means of notifying a pre-configured third party based on the warning,
[0863] A means of analyzing emotional states,
[0864] A means of adjusting the content and format of notifications based on emotional state,
[0865] A system that includes this.
[0866] (Claim 2)
[0867] The system according to claim 1, further comprising means for recording the content of communications if the possibility of fraud is detected.
[0868] (Claim 3)
[0869] The system according to claim 1, further comprising means for performing analysis based on past fraud cases in detecting the possibility of fraud.
[Explanation of Symbols]
[0870]
10, 210, 310, 410 Data processing system
12 Data processing device
14 Smart device
214 Smart glasses
314 Headset-type terminal
414 Robot
Claims
1. A system comprising:
a means for acquiring communication information and converting said communication information into text data;
a means for analyzing the text data to detect signs of fraud;
a means for notifying a warning when signs of fraud are detected;
a means for notifying a pre-configured external party based on the warning; and
a means for displaying the warning and simultaneously prompting the user to choose whether to continue or terminate the communication.
2. The system according to claim 1, further comprising means for recording communication information when signs of fraud are detected.
3. The system according to claim 1, further comprising means for performing analysis based on past fraud cases in detecting signs of fraud.