system

The system addresses real-time detection and prevention of abnormal behavior by employing real-time video analysis and facial recognition, generating warnings to deter suspicious activities and enhance user safety.

JP2026096451A · Pending · Publication Date: 2026-06-15 · SOFTBANK GROUP CORP


Patent Information

Authority / Receiving Office: JP · JP
Patent Type: Applications
Current Assignee / Owner: SOFTBANK GROUP CORP
Filing Date: 2024-12-03
Publication Date: 2026-06-15

AI Technical Summary

Technical Problem

Traditional monitoring systems struggle to detect abnormal behavior or suspicious individuals in real-time and fail to implement immediate and effective countermeasures, leading to delayed crime prevention.

Method used

A system equipped with real-time video analysis capabilities to detect abnormal behavior and suspicious individuals, utilizing facial recognition and generative artificial intelligence to generate warnings and notifications through speakers, ensuring rapid response by administrators and law enforcement.

Benefits of technology

Enables immediate and effective crime prevention by detecting and deterring suspicious activities through real-time analysis and personalized warnings, enhancing user safety and security.

✦ Generated by Eureka AI based on patent content.


Patent Text Reader

Abstract

A system is provided. [Solution] A system that includes: a means for analyzing, in real time, video data received from a video capture device to detect abnormal behavior or suspicious individuals; a means for issuing a warning when the detected person's facial image is identified by comparison with an existing database; a means for generating voice messages using generative artificial intelligence to warn suspicious individuals; and a means for notifying administrators or the police of information regarding suspicious persons or abnormal behavior.


Description

Technical Field

[0001] The technology of the present disclosure relates to a system.

Background Art

[0002] Patent Document 1 discloses a persona chatbot control method performed by at least one processor, including steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a chatbot character, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance in response to the user utterance.

Prior Art Documents

Patent Documents

[0003]

Patent Document 1

Summary of the Invention

Problems to be Solved by the Invention

[0004] Traditional monitoring systems have the problem that even if they can detect abnormal behavior or suspicious persons, it is difficult to take immediate and effective countermeasures. In addition, due to the delay in identifying and tracking suspicious persons, it has become difficult to prevent crimes. The purpose of the present invention is to solve these problems and enable more rapid and effective crime prevention countermeasures.

Means for Solving the Problems

[0005] This invention provides a system equipped with means for analyzing surveillance video in real time. Specifically, it includes means for analyzing video data received from a video capture device to detect abnormal behavior or suspicious individuals. Furthermore, it includes means for identifying suspicious individuals by comparing the facial image of the detected person with an existing database, and for generating a warning voice if an individual is identified. This enables immediate warning and notification, effectively deterring suspicious individuals. In addition, the generated voice message warns the suspicious person through a speaker and prompts a quick response by notifying administrators and the police, thereby preventing crime from occurring.

[0006] A "video capture device" is a device used to acquire video data captured for surveillance purposes.

[0007] "Real-time analysis" refers to the process of processing data as soon as it is acquired and providing the analysis results without delay.

[0008] "Abnormal behavior" refers to situations in which actions or behaviors deviate from normal behavioral patterns.

[0009] A "suspicious person" refers to an individual who engages in unusual or unauthorized behavior in a normal environment.

[0010] A "face image" is image data that captures the face of a person and is used to identify them.

[0011] A "database" is a system for storing and managing information in an organized and efficient manner.

[0012] A "warning voice" is an audio message generated to convey a warning or caution to suspicious individuals.

[0013] "Generative artificial intelligence" refers to algorithms and systems that have the ability to generate natural language, and are used for dialogue and message creation.

[0014] "Administrator" refers to a person or organization responsible for the operation and monitoring of the entire system.

[0015] "Prompt response" is a process of taking immediate necessary measures when an abnormality or problem occurs.

[0016] "Prevention of crime occurrence" refers to measures and activities for preventing potential crimes in advance.

Brief Explanation of Drawings

[0017]
[Figure 1] It is a conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.
[Figure 2] It is a conceptual diagram showing an example of the main functions of a data processing device and a smart device according to the first embodiment.
[Figure 3] It is a conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.
[Figure 4] It is a conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
[Figure 5] It is a conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.
[Figure 6] It is a conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
[Figure 7] It is a conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.
[Figure 8] It is a conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
[Figure 9] It shows an emotion map to which multiple emotions are mapped.
[Figure 10] It shows an emotion map to which multiple emotions are mapped.
[Figure 11] It is a sequence diagram showing the processing flow of the data processing system in Example 1.
[Figure 12] It is a sequence diagram showing the processing flow of the data processing system in Application Example 1.
[Figure 13] It is a sequence diagram showing the processing flow of the data processing system in Example 2 when the emotion engine is combined.
[Figure 14] It is a sequence diagram showing the processing flow of the data processing system in Application Example 2 when the emotion engine is combined.

Mode for Carrying Out the Invention

[0018] Hereinafter, an example of an embodiment of the system according to the technology of the present disclosure will be described with reference to the accompanying drawings.

[0019] First, the language used in the following description will be explained.

[0020] In the following embodiments, the numbered processor (hereinafter simply referred to as "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. Also, the processor may be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), an APU (Accelerated Processing Unit), and the like.

[0021] In the following embodiments, the numbered RAM (Random Access Memory) is a memory in which information is temporarily stored and is used as a work memory by the processor.

[0022] In the following embodiments, the numbered storage is one or more non-volatile storage devices that store various programs and various parameters. Examples of non-volatile storage devices include flash memory (SSD (Solid State Drive)), magnetic disks (e.g., hard disks), and magnetic tapes.

[0023] In the following embodiments, the numbered communication interface (I/F) is an interface that includes a communication processor, an antenna, and the like. The communication interface manages communication between multiple computers. Examples of communication standards applicable to the communication interface include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), and Bluetooth (registered trademark).

[0024] In the following embodiments, "A and / or B" is synonymous with "at least one of A and B." That is, "A and / or B" means that it may be A alone, or B alone, or a combination of A and B. Furthermore, in this specification, the same concept as "A and / or B" applies when expressing three or more things linked by "and / or."

[0025] [First Embodiment]

[0026] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.

[0027] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.

[0028] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0029] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.

[0030] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.

[0031] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and / or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.

[0032] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.

[0033] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.

[0034] As shown in Figure 2, in the data processing device 12, a specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.

[0035] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0036] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used in conjunction with a specific processing program 56 by the data processing system 10. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0037] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0038] This invention relates to a highly functional security system that analyzes surveillance footage in real time to identify, warn, and notify about abnormal behavior and suspicious individuals. In this system, video data is collected from various locations where surveillance cameras are installed, and a server is responsible for processing this data. The server utilizes artificial intelligence algorithms for video analysis to determine whether the behavior deviates from normal patterns. The server also performs facial recognition in real time and compares the results with an existing database of suspicious individuals.

[0039] For example, in an unmanned store, a server monitors the movements of customers in real time via surveillance cameras. If a customer takes an unusually long time to return an item to its original place, or is detected moving back and forth unnaturally within the store, this is treated as abnormal behavior. The server then broadcasts a warning sound to the customer through the store's speakers to deter further actions. Additionally, the facial data of any detected suspicious individual is sent to the server and cross-referenced with a database.
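The two store cues described above (unusually long dwell time and unnatural back-and-forth movement) can be illustrated with a minimal rule-based sketch. This is not the machine learning model the disclosure refers to. The `TrackPoint` type, the one-dimensional position, and every threshold (`min_step`, `max_dwell_s`, `max_reversals`) are hypothetical values chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrackPoint:
    t: float  # seconds since the person was first seen (hypothetical unit)
    x: float  # position along one store axis in metres (hypothetical unit)


def count_reversals(points: List[TrackPoint], min_step: float = 0.5) -> int:
    """Count changes of walking direction, ignoring moves smaller than min_step."""
    if not points:
        return 0
    reversals = 0
    last_direction = 0          # 0 means no direction observed yet
    anchor = points[0].x        # position of the last significant move
    for point in points[1:]:
        delta = point.x - anchor
        if abs(delta) < min_step:
            continue            # jitter, not a real move
        direction = 1 if delta > 0 else -1
        if last_direction != 0 and direction != last_direction:
            reversals += 1
        last_direction = direction
        anchor = point.x
    return reversals


def is_abnormal(points: List[TrackPoint],
                max_dwell_s: float = 120.0,
                max_reversals: int = 3) -> bool:
    """Flag a track when dwell time or direction reversals exceed the assumed limits."""
    if len(points) < 2:
        return False
    dwell = points[-1].t - points[0].t
    return dwell > max_dwell_s or count_reversals(points) > max_reversals
```

A learned model would replace these fixed thresholds in practice. The sketch only makes concrete what "deviating from normal patterns" could mean for a single tracked customer.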

[0040] In private residences as well, the server monitors the yard and parking lot and immediately detects intruders when they enter the property. Once an intruder is detected, the server uses AI to generate sounds that suggest a human presence or normal household noises, giving the intruder the impression that someone is home. Users can receive notifications on their devices with information about the current status of the security system and any anomalies detected.

[0041] Through these embodiments, the present invention can contribute to crime prevention and provide a safer living environment.

[0042] The following describes the processing flow.

[0043] Step 1:

[0044] The server continuously receives video streams from surveillance cameras. The video is captured in real time for automated processing and divided into frames.

[0045] Step 2:

[0046] The server preprocesses each frame, removing unnecessary noise. This process includes adjusting the resolution and clearing the image. This improves the accuracy of the analysis.

[0047] Step 3:

[0048] The server uses artificial intelligence algorithms to detect abnormal behavior and suspicious movements from pre-processed frames. This analysis employs machine learning models to evaluate human movement and patterns.

[0049] Step 4:

[0050] The server extracts the facial image of the detected person and compares it against an existing database. This database contains pre-registered information on specific suspicious individuals.

[0051] Step 5:

[0052] If a suspicious person is identified, the server immediately generates an alarm sound. This sound is played through speakers in the store or home to warn the suspicious person and deter criminal activity.

[0053] Step 6:

[0054] The server issues a warning and simultaneously notifies administrators and the police. The notification includes details of the detected abnormal behavior and profile information of the suspicious person.

[0055] Step 7:

[0056] The server sends real-time notifications to the user's device. The user can check the details on their device and take prompt action, such as reporting to the police if necessary.
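The seven steps above can be sketched as a small orchestration loop. This is an illustrative skeleton, not the disclosed implementation. The `Event` type, the action names, and the three callables (`preprocess`, `detect`, `match_face`) are hypothetical placeholders for the preprocessing, behavior analysis, and face matching stages. It also assumes that the alarm is played only when a person is identified, while administrators, police, and the user are notified for every detection.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Iterable, List, Optional


@dataclass
class Event:
    frame_index: int
    behavior: str
    person_id: Optional[str]
    actions: List[str] = field(default_factory=list)


def run_pipeline(frames: Iterable[Any],
                 preprocess: Callable[[Any], Any],
                 detect: Callable[[Any], Optional[str]],
                 match_face: Callable[[Any], Optional[str]]) -> List[Event]:
    """Run Steps 1 to 7 over a stream of frames and record the actions taken."""
    events: List[Event] = []
    for index, frame in enumerate(frames):       # Step 1: receive frames
        clean = preprocess(frame)                # Step 2: preprocess the frame
        behavior = detect(clean)                 # Step 3: detect abnormal behavior
        if behavior is None:
            continue                             # nothing abnormal in this frame
        person_id = match_face(clean)            # Step 4: compare face with database
        event = Event(index, behavior, person_id)
        if person_id is not None:
            event.actions.append("play_alarm")   # Step 5: alarm for identified person
        event.actions.append("notify_admin_and_police")  # Step 6
        event.actions.append("notify_user")               # Step 7
        events.append(event)
    return events
```

Passing the stages in as callables keeps the control flow separate from whichever models or services implement each stage.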

[0057] (Example 1)

[0058] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0059] In recent years, with the increase in criminal activity and intrusions by suspicious individuals, the importance of security systems has grown. However, conventional systems sometimes have difficulty detecting abnormal behavior in real time and taking appropriate action, and further technological advancements are needed. This invention aims to solve these problems and enable more accurate anomaly detection and rapid response.

[0060] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0061] In this invention, the server includes means for analyzing video information received from a video input device in real time to detect abnormal behavior or suspicious persons, means for matching the facial features of the detected person with an existing information aggregate to issue a warning if identified, and means for generating voice information using generative artificial intelligence to warn intruders. This makes it possible to immediately detect abnormal behavior and issue appropriate warnings, thereby preventing crimes from occurring.

[0062] A "video input device" refers to a device that acquires visual information and outputs it as digital data, such as a surveillance camera or sensor camera.

[0063] "Video information" refers to image and video data acquired from video input devices, which are used as the subject of analysis.

[0064] "Real-time analysis" refers to a processing method that performs data processing immediately after it is acquired and outputs the results.

[0065] "Abnormal behavior" refers to actions or movements that deviate from normal behavioral patterns or are unnatural, and that require vigilance from a security standpoint.

[0066] A "suspicious person" refers to an individual deemed suspicious from a crime prevention standpoint and requiring monitoring and response.

[0067] "Facial features" refer to the structural characteristics and patterns of a face that are extracted for facial recognition, and are data used to identify an individual.

[0068] An "information aggregate" refers to a collection of data gathered for a specific purpose, such as information on suspicious individuals or facial recognition data.

[0069] "Generative artificial intelligence" is a type of artificial intelligence that has the ability to generate new data or content based on input data.

[0070] "Audio information" refers to data that represents an audible message or warning, and is played back through a speaker.

[0071] The present invention is a security system that analyzes video information transmitted from a video input device in real time within a surveillance system, detects abnormal behavior and suspicious individuals, and enables a swift and accurate response to these.

[0072] The main components of this system are a server, user terminals, and video input devices, which work together to provide a secure environment. The server uses a computing device equipped with an NVIDIA GPU to process complex video data in real time. The server utilizes software libraries such as OpenCV and TensorFlow® to run machine learning models that classify the movement of people in the video and determine whether it is abnormal. It also uses Amazon Rekognition for face recognition processing and quickly performs database matching with existing data collections.

[0073] When abnormal behavior is detected, the server uses the Azure® Speech API to generate audio information and sends a warning sound to a designated output device. Furthermore, it can utilize generative artificial intelligence to output ambient sounds to give intruders the impression that someone is inside the home. Using OpenAI® generative models, it generates natural conversations and ambient sounds. For example, when an intruder is detected, it can generate a voice message saying, "It's time to prepare dinner."

[0074] Users can receive push notifications on their devices when an anomaly is detected. The interface for this is provided through a standard smartphone application, allowing users to monitor the system status in real time and take appropriate action as needed.

[0075] Examples of prompt messages include, "Generate an announcement voice to be used when a customer exhibits suspicious behavior," and "Generate a voice that gives a potential intruder the impression that someone is home."
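As a minimal sketch of how such prompt messages might be selected per scenario, the two example prompts above can be stored in a lookup table. The scenario keys, the `build_prompt` helper, and the optional location suffix are hypothetical and are not part of the disclosure. The sketch makes no call to any generative AI service.

```python
from typing import Optional

# The two prompt texts are taken from the examples in the description;
# the dictionary keys are hypothetical scenario names.
PROMPTS = {
    "store_suspicious_behavior": (
        "Generate an announcement voice to be used when a customer "
        "exhibits suspicious behavior."
    ),
    "home_intrusion": (
        "Generate a voice that gives a potential intruder the impression "
        "that someone is home."
    ),
}


def build_prompt(scenario: str, location: Optional[str] = None) -> str:
    """Return the prompt for a scenario, optionally tagged with a location."""
    base = PROMPTS[scenario]  # raises KeyError for an unknown scenario
    if location:
        return f"{base} Location: {location}."
    return base
```

For example, `build_prompt("home_intrusion", "garden")` returns the home-intrusion prompt followed by `Location: garden.`.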

[0076] With a system configured in this manner, the present invention can contribute to crime prevention and provide a safer and more secure living environment.

[0077] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0078] Step 1:

[0079] The server receives video information in real time from the video input device. A video stream is provided as input, and the server divides it into frames and loads them into memory. Specifically, it acquires data via a network connection and prepares it for analysis.

[0080] Step 2:

[0081] The server performs video analysis on the video information for each frame loaded into memory. The input frames are first preprocessed, undergoing noise reduction and resolution adjustment. Next, a machine learning model using TensorFlow performs data calculations to detect abnormal behavior and suspicious individuals. The output is the coordinates of detected individuals and information on changes in their behavior patterns. Specifically, the AI model compares the detected behavior patterns with normal behavior patterns and identifies behaviors that are judged to be abnormal.

[0082] Step 3:

[0083] The server analyzes the faces of detected individuals using Amazon Rekognition. Face images from video frames are used as input, and feature points are extracted. The facial features are compared with the existing information aggregate, and if a match is found, a risk assessment is performed. The output is information on whether a particular person has been identified as a suspicious person. Specifically, the face recognition algorithm analyzes the facial features and performs a database search.

[0084] Step 4:

[0085] The server generates audio information using generative artificial intelligence when it detects abnormal behavior or suspicious individuals. A prompt such as "Generate a warning audio" is input to the generative AI model. In response, the model generates warning audio, which is then output to speakers via the network. Specifically, a speech synthesis engine converts the input text into natural-sounding speech.

[0086] Step 5:

[0087] The server notifies the user's terminal of detected suspicious individuals or anomalies. Anomaly information is provided as input and immediately sent to the terminal via a push notification service. The user receives this notification and has the opportunity to take necessary actions. The output is a notification message, and as a concrete action, the system-generated report is displayed in the terminal's notification center.
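The database comparison in Step 3 can be illustrated with a generic nearest-match search over facial feature vectors. This sketch deliberately swaps in plain cosine similarity as a stand-in. It does not reproduce Amazon Rekognition or its API, and the vector format, the `find_match` helper, and the 0.9 threshold are assumptions made for illustration.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_match(query: Sequence[float],
               database: Dict[str, Sequence[float]],
               threshold: float = 0.9) -> Optional[str]:
    """Return the ID of the most similar registered face, or None below the threshold."""
    best_id: Optional[str] = None
    best_score = threshold
    for person_id, features in database.items():
        score = cosine_similarity(query, features)
        if score >= best_score:
            best_id, best_score = person_id, score
    return best_id
```

Returning `None` when no entry reaches the threshold corresponds to the case where the detected person is not in the information aggregate, so no identification-based warning is triggered.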

[0088] (Application Example 1)

[0089] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0090] Conventional security systems have difficulty detecting abnormal behavior or suspicious individuals in real time, and accurately notifying users of such information. Furthermore, it is difficult for administrators located remotely to immediately grasp the situation. Additionally, even when an anomaly is detected, they cannot immediately issue appropriate warnings, resulting in ineffective responses.

[0091] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0092] In this invention, the server includes means for instantly analyzing video information received from a video acquisition device to detect abnormal behavior or suspicious persons, means for issuing a notification when the facial information of the detected person is identified by comparing it with an existing information set, and means for generating voice information using artificial intelligence to issue a warning. This enables administrators and personnel in remote locations to quickly grasp the situation as soon as an anomaly is detected and take appropriate action immediately.

[0093] A "video acquisition device" is a device capable of acquiring video information from the surroundings and transmitting it as digital data.

[0094] "Video information" refers to visual video data obtained from video acquisition devices, and useful information can be extracted by appropriately analyzing its content.

[0095] "Instant analysis" refers to a method of processing acquired information in real time and obtaining results quickly.

[0096] "Abnormal behavior" refers to actions that deviate from normal behavioral patterns and is identified as a situation that warrants attention.

[0097] A "suspicious person" refers to an individual who engages in unexpected or questionable behavior and should be viewed with caution from a crime prevention perspective.

[0098] "Detection" refers to the process of sensing or recognizing the presence or action of an object.

[0099] "Facial information" refers to identifiable data related to a person's face, which is used to identify individuals.

[0100] An "information set" refers to a collection of data that is an aggregate of existing databases or individual pieces of information.

[0101] "Notification" refers to the act of communicating detected information and informing relevant parties and systems.

[0102] "Artificial intelligence" refers to technology that enables machines to mimic human intellectual activity and perform actions with the aim of automatically executing specific functions.

[0103] "Auditory information" refers to synthesized sound data and is used as a means of conveying information through hearing.

[0104] A "warning" refers to an action taken to alert someone to a dangerous or suspicious situation and draw their attention.

[0105] "Administrator" refers to a person or organization responsible for monitoring the system and managing it to ensure its proper operation.

[0106] "Law enforcement agencies" refer to organizations that oversee the enforcement of laws in order to maintain public safety.

[0107] "External devices" refer to additional equipment or terminals that work in conjunction with a system to perform information processing or communication.

[0108] A "user terminal" refers to a device that a user directly operates and uses to receive information, and includes smartphones and computers.

[0109] An "information output device" is a device used to visualize or make information audible, and refers to a device that emits sound or images.

[0110] In the system implementing this invention, the server first receives video information in real time from a video acquisition device. The video acquisition device acquires video information of the surrounding environment and transmits it to the server as a digital signal. The server processes this video information immediately using an advanced video analysis algorithm and has the function of detecting abnormal behavior and suspicious persons. The main software used here is video analysis libraries such as TensorFlow and OpenCV. If an anomaly is detected as a result of the analysis, the server sends that information to the user terminal as a push notification. A smartphone or similar device is used as the user terminal, and the user can receive the notification of an anomaly detection immediately.

[0111] If a suspicious person is detected, the server utilizes a facial recognition API such as Amazon Rekognition to compare the person's facial information with an existing set of information. Once the facial information comparison is complete, the server generates an appropriate warning based on the detection result. The warning voice is synthesized as natural-sounding audio using a generative AI model. The server then sends this synthesized voice to an external audio output device, which broadcasts it as a warning to the surrounding area. For example, a speaker in the home might emit the voice to psychologically deter the suspicious person.

[0112] As a concrete example, if an unauthorized intrusion is detected in a home garden at night, the server will immediately recognize the anomaly and notify the user via their smartphone with a prompt message stating, "An anomaly has been detected. Please check the details." Upon receiving this notification, the user can check the situation via their smartphone and, if necessary, report it to law enforcement. This allows the user to respond to the situation efficiently and ensure their safety.

[0113] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0114] Step 1:

[0115] The server receives video information from the video acquisition device. The input is real-time digital video data transmitted from the video acquisition device, and the server uses this data to prepare for processing in the next step.

[0116] Step 2:

[0117] The server analyzes the received video information using video analysis libraries such as TensorFlow and OpenCV. In this step, the video is analyzed using a machine learning model based on the dataset to detect abnormal behavior or suspicious individuals that deviate from normal behavioral patterns. The output is the result of detecting abnormal behavior.

[0118] Step 3:

[0119] Based on the detection results, the server compares the suspicious person's facial information with an existing facial recognition database using tools such as Amazon Rekognition. This process involves searching and determining matches based on the detected facial information as input. The output is whether or not a specific person is found as a result of the matching.

[0120] Step 4:

[0121] The server uses a generative AI model to generate audio information for when a warning should be issued. In this step, information indicating that an anomaly or suspicious person has been detected is used as input, and the process of converting text data into speech results in a synthesized warning voice as output.

[0122] Step 5:

[0123] The server transmits the generated audio information to an external audio output device to issue a warning to the surrounding area. Specifically, it plays the audio through a home speaker or external output system to alert people. In this process, the data is processed for transmission to the output device, and the output is played back as actual audio.

[0124] Step 6:

[0125] The server sends an anomaly detection notification to the user's terminal as a push notification. Using the detected anomaly information as input, the server notifies the user with the prompt message, "An anomaly has been detected. Please check the details." The output is the notification the user receives on their terminal.

[0126] Step 7:

[0127] The user checks the notification received on their smartphone and understands the situation. In this step, the user receives a push notification as input and then checks the log and video details based on it. In this process, the next action the user should take (e.g., contacting law enforcement) is determined. The output is the action taken based on the user's judgment.
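Steps 1 through 6 above can be read as a linear pipeline. The following is a minimal sketch in Python under stated assumptions: the callables `detect`, `match_face`, `synthesize`, `play`, and `push` are hypothetical placeholders for the video-analysis model, the face-matching service, the speech synthesizer, the speaker, and the push-notification channel, and `WARNING_TEXT` is an invented example string. None of these names appear in the specification.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Notification text quoted from Step 6.
NOTIFY_TEXT = "An anomaly has been detected. Please check the details."
# Hypothetical warning text; the specification does not fix the wording.
WARNING_TEXT = "Warning: this area is under surveillance."


@dataclass
class Detection:
    """Outcome of Step 2 for one frame."""
    anomalous: bool
    face: Optional[bytes] = None  # face crop, if one was found


@dataclass
class Pipeline:
    """Steps 2-6 wired together; every stage is an injected callable."""
    detect: Callable[[bytes], Detection]          # Step 2: video analysis
    match_face: Callable[[bytes], Optional[str]]  # Step 3: database matching
    synthesize: Callable[[str], bytes]            # Step 4: text to speech
    play: Callable[[bytes], None]                 # Step 5: speaker output
    push: Callable[[str], None]                   # Step 6: push notification

    def handle_frame(self, frame: bytes) -> Optional[str]:
        """Process one received frame (Step 1 is the caller's job).

        Returns the matched person ID, or None when no anomaly was
        detected or no database entry matched.
        """
        result = self.detect(frame)
        if not result.anomalous:
            return None
        person = self.match_face(result.face) if result.face is not None else None
        self.play(self.synthesize(WARNING_TEXT))
        self.push(NOTIFY_TEXT)
        return person
```

Injecting each stage as a callable keeps the sketch independent of any particular library or cloud service; Step 7 (the user's own judgment) is outside the code.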

[0128] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0129] This invention provides an advanced security system that can respond immediately to suspicious individuals and abnormal behavior by combining an emotion engine with a surveillance system. The server receives video data from surveillance cameras and analyzes it in real time to identify abnormal behavior or suspicious individuals. The server is equipped with an AI-powered facial recognition system, which identifies suspicious individuals by comparing detected facial images with a database. Furthermore, after detection, it generates a warning voice using generative AI and delivers the voice message to the suspicious individual.

[0130] The emotion engine analyzes the emotional state of detected individuals and users, enabling more flexible responses. Specifically, the server uses the emotion analysis results to optimize the content and tone of warning messages to suit the situation. For example, if an intruder is in the initial stages of an intrusion, a milder warning is issued to avoid unnecessary tension and prevent the situation from escalating. On the other hand, if stress or anxiety is detected in the user's emotional state, a more assertive warning and prompt notification to the administrator are issued.
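The two rules in this paragraph can be expressed as a small decision function. The sketch below is illustrative only: the stage label `"initial"`, the emotion labels, the precedence of user distress over intrusion stage, and the fallback for all other cases are assumptions, not details given in the specification.

```python
from enum import Enum
from typing import Tuple


class Tone(Enum):
    MILD = "mild"
    ASSERTIVE = "assertive"


def choose_response(intrusion_stage: str, user_emotion: str) -> Tuple[Tone, bool]:
    """Return (warning tone, whether to notify the administrator promptly)."""
    # Assumed precedence: distress detected in the user overrides the stage rule.
    if user_emotion in {"stress", "anxiety"}:
        return Tone.ASSERTIVE, True
    # Early-stage intrusion: a milder warning to avoid escalating the situation.
    if intrusion_stage == "initial":
        return Tone.MILD, False
    # Assumed fallback for every other case.
    return Tone.ASSERTIVE, False
```

For example, `choose_response("initial", "calm")` yields a mild warning without escalation, while `choose_response("initial", "anxiety")` yields an assertive warning together with a prompt administrator notification.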

[0131] In this system, the terminal notifies the user of the details of the abnormal activity and the results of the emotion analysis. Users can quickly understand the situation and take the necessary action through the terminal. Furthermore, because these notifications include information refined by the emotion engine, they help victims or administrators make appropriate decisions.

[0132] As a concrete example, consider a scenario where a suspicious person is detected in a residential parking lot. The server identifies the suspicious person through video processing and facial recognition, and analyzes their emotions. If signs of anxiety or tension are detected, the emotion engine adjusts the warning voice accordingly and issues the warning in an appropriate tone. The user is then notified of this information via their terminal, allowing them to take appropriate action. In this way, the present invention enables advanced security measures in a variety of situations.

[0133] The following describes the processing flow.

[0134] Step 1:

[0135] The server continuously receives video streams from surveillance cameras. These video streams are processed as digital data and subdivided into individual frames.

[0136] Step 2:

[0137] The server uses AI algorithms to analyze the received video frames. This allows for the rapid detection of abnormal behavior patterns and suspicious individuals.

[0138] Step 3:

[0139] The server compares the detected person's facial image with the database. If the person is registered in the database as a suspicious individual, a specific warning flag is set.

[0140] Step 4:

[0141] The emotion engine is activated and evaluates the emotional state of the suspect and the user from the video and audio data. For example, it identifies feelings such as anxiety, tension, and anger.

[0142] Step 5:

[0143] The server generates the optimal warning message based on the analysis results of the emotion engine. Speech synthesis technology adjusts the tone and content according to the situation.

[0144] Step 6:

[0145] The device plays the generated warning audio through speakers at the site. This audio serves as a warning message to suspicious individuals, attempting to deter any further activity at the scene.

[0146] Step 7:

[0147] The server sends notifications containing the abnormal-behavior data and the emotion analysis results to administrators and the police. These notifications include video clips and the analyzed emotion information.

[0148] Step 8:

[0149] Users receive detailed notifications on their terminals. Recommendations based on the emotion analysis are displayed to help them make quick decisions.
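Step 3 of the flow above (comparing a detected face with the database and setting a warning flag) is often implemented by comparing feature vectors. The following is a minimal sketch, assuming that each face has already been converted into a numeric feature vector by some face recognition model; the cosine-similarity measure, the `threshold` value, and the function names are illustrative assumptions rather than the method of the specification.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_match(
    probe: Sequence[float],
    database: Dict[str, Sequence[float]],
    threshold: float = 0.9,  # assumed value; would need tuning on real data
) -> Optional[str]:
    """Return the ID of the most similar registered face, or None."""
    best_id, best_score = None, threshold
    for person_id, reference in database.items():
        score = cosine_similarity(probe, reference)
        if score >= best_score:
            best_id, best_score = person_id, score
    return best_id


def warning_flag(probe: Sequence[float], database: Dict[str, Sequence[float]]) -> bool:
    """True when the probe face matches a registered suspicious individual."""
    return find_match(probe, database) is not None
```

A higher threshold reduces false alarms at the cost of missed matches; the appropriate trade-off depends on the deployment.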

[0150] (Example 2)

[0151] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0152] Modern security systems face the challenge of lacking the flexibility to quickly detect suspicious individuals or abnormal behavior and respond appropriately. In particular, there is a need to prevent escalation of situations and achieve more effective security measures by providing warnings and notifications that take emotional states into consideration.

[0153] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0154] In this invention, the server includes means for immediately processing image information acquired from a video acquisition device to identify abnormal behavior or suspicious individuals; means for matching the facial information of the identified individual with a prior set of information to issue an alarm if identified; and means for generating voice instructions using artificial intelligence to issue warnings to suspicious individuals. This enables a rapid and flexible response to suspicious individuals and abnormal behavior, tailored to their emotional state.

[0155] An "image acquisition device" refers to equipment used to capture image information and process or analyze that data.

[0156] "Image information" refers to visual data obtained from an image acquisition device, and includes the visual characteristics of the object or individual being analyzed.

[0157] "Abnormal behavior" refers to actions or behaviors that deviate from normal behavioral patterns and may lead to suspicious situations.

[0158] A "suspicious individual" refers to an individual who exhibits unusual or suspicious behavior, as detected by a security system.

[0159] "Facial information" refers to data containing facial features used for individual identification, and is usually stored as image data.

[0160] "Prior information sets" refer to reference datasets that have been collected or registered in the past and are used as criteria for identification.

[0161] An "alarm" refers to a notification method used to warn or alert people to detected anomalies or suspicious situations.

[0162] "Generative artificial intelligence" refers to a system that uses artificial intelligence technology to create newly generated content and information.

[0163] "Voice instructions" refer to voice messages generated using speech synthesis technology to give instructions or warnings to a target.

[0164] This invention provides an advanced security system that enables immediate response to suspicious individuals and abnormal behavior through the cooperation of servers, terminals, and users. This system is implemented using the following combination of hardware and software.

[0165] The server acquires real-time image information transmitted from the video acquisition device. Using image processing libraries such as OpenCV, the server analyzes this image information and detects motion and facial features. The detected facial information is compared with a pre-collected database, and a face recognition model using TensorFlow identifies suspicious individuals.

[0166] Furthermore, the server uses generative artificial intelligence to generate warning messages for suspicious individuals. Specifically, it utilizes natural language processing models to create warning texts, which are then converted into voice instructions using speech synthesis technology. For example, a warning message such as "There is a suspicious person in the parking lot. Suspicious behavior has been detected." is generated. This message is optimized according to the emotional state of the suspicious individual detected by an emotion analysis device. Amazon Comprehend, among others, is used for emotion analysis.

[0167] The terminal plays the generated voice instructions to the suspicious individual through its speaker. Simultaneously, it provides the user with a notification containing information about the suspicious individual and the emotion analysis results. Based on the detailed information received through the terminal, the user can quickly assess the situation and take appropriate action, such as reporting to the police.

[0168] A concrete example of this system is intruder detection in a residential parking lot. The server analyzes video footage of the parking lot, identifies the intruder, and analyzes their emotions. An example of a prompt message might be, "What information is needed to generate a warning message if an intruder enters the parking lot?" In this way, the present invention enhances security measures in various environments.

[0169] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0170] Step 1:

[0171] The server receives image information from the video acquisition device. This input image information is video data consisting of multiple frames. The server uses OpenCV to extract frames from this video and executes a motion detection algorithm. Specifically, it uses methods such as background subtraction and optical flow analysis to identify moving parts. The output of this process is the frame image in which motion was detected.

[0172] Step 2:

[0173] The server detects face regions from frame images where motion has been detected. A face detection model using TensorFlow extracts regions containing facial features. This face region is then used as input data and compared against known faces in a database by a pre-trained face recognition model. The output is the face information of the identified suspicious individual.

[0174] Step 3:

[0175] The server performs emotion analysis on the identified facial information. This uses Amazon Comprehend to analyze emotional states and detect emotions such as anxiety and anger. The input data is the identified facial information, and the output data includes emotional states. This allows the server to understand the emotional state of the suspicious person.

[0176] Step 4:

[0177] The server utilizes a generative artificial intelligence model to generate warning messages for suspicious individuals. Specifically, it inputs a prompt into a natural language processing model to create a warning message such as, "There is a suspicious person in the parking lot." The output warning message is then converted into a voice instruction using speech synthesis technology.

[0178] Step 5:

[0179] The terminal plays the generated voice instructions through its speaker. Simultaneously, the server sends a notification about the abnormal situation to the terminal, which presents it to the user. The notification includes the emotion analysis results and suggests the next action the user should take. As output, the user receives an audio warning and a detailed explanation of the situation.

[0180] Step 6:

[0181] The user acts quickly based on the information provided by the device. This information includes transmitted voice instructions and device notifications. The user can use this information to take countermeasures against suspicious individuals. User actions include, for example, contacting the police or taking additional security measures.
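Step 1 of the flow above names background subtraction as one way to find moving regions. The sketch below illustrates the idea in plain Python on small grayscale frames represented as nested lists; it is a simplified stand-in for a library implementation, and the threshold, the minimum pixel count, and the running-average update rate `alpha` are assumed values chosen for illustration.

```python
from typing import List

Frame = List[List[int]]  # grayscale image, pixel values 0-255


def motion_mask(background: Frame, frame: Frame, threshold: int = 25) -> List[List[bool]]:
    """Mark pixels whose brightness differs from the background model."""
    return [
        [abs(p - b) > threshold for p, b in zip(row, bg_row)]
        for row, bg_row in zip(frame, background)
    ]


def has_motion(background: Frame, frame: Frame,
               threshold: int = 25, min_pixels: int = 2) -> bool:
    """True when enough pixels changed to count as motion rather than noise."""
    mask = motion_mask(background, frame, threshold)
    return sum(cell for row in mask for cell in row) >= min_pixels


def update_background(background: Frame, frame: Frame, alpha: float = 0.1) -> Frame:
    """Running-average background model that slowly absorbs scene changes."""
    return [
        [round((1 - alpha) * b + alpha * p) for b, p in zip(bg_row, row)]
        for bg_row, row in zip(background, frame)
    ]
```

Frames for which `has_motion` returns true would be the ones passed on to face detection in Step 2; the others can be used only to update the background model.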

[0182] (Application Example 2)

[0183] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as a "server" and the smart device 14 as a "terminal".

[0184] In recent years, security challenges in society have increased, with a particular need for early detection and appropriate response to suspicious individuals and abnormal behavior. However, current security systems struggle to respond flexibly, taking into account the emotional state of suspicious individuals, which can lead to false alarms and unnecessary tension. Therefore, there is a need to provide a system that improves detection accuracy and enables the generation and notification of warning messages based on appropriate situational judgment.

[0185] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0186] In this invention, the server includes means for analyzing video data received from a video capture device in real time to detect abnormal behavior or suspicious persons; means for matching the facial image of the detected person with an existing database and issuing a warning if identified; means for generating voice messages using artificial intelligence to warn suspicious persons; means for analyzing the emotional state of the detected person using an emotion analysis engine and optimizing the content and tone of the warning message; and means for providing notifications of abnormal activity based on the emotion analysis results to personal portable devices. This enables users to take appropriate action based on the emotional state of suspicious persons.

[0187] A "video capture device" is a device, such as a video camera or surveillance camera, that acquires and transmits video data in real time.

[0188] "Real-time analysis" refers to a method of processing input data immediately and outputting results without delay.

[0189] "Abnormal behavior" refers to movements or actions that differ from the norm and that warrant attention from a crime prevention standpoint.

[0190] A "suspicious person" refers to an individual who does not fit in with the normal environment or circumstances, or who is considered suspicious.

[0191] A "face image" is visual data of a person's face captured by a digital camera or other image acquisition device.

[0192] A "database" is a digital archive organized to efficiently store and search large amounts of information.

[0193] A "voice message" is a recorded or synthesized message that is produced to convey information through sound.

[0194] An "emotion analysis engine" is a system that automatically determines a person's emotional state based on data obtained from video and audio.

[0195] A "personal portable device" refers to a digital device that a user personally owns and can carry with them, such as a smartphone or tablet.

[0196] To implement this invention, the server uses video data received from a high-resolution video capture device. The server uses the open-source image processing library "OpenCV" to perform video analysis. This makes it possible to detect abnormal behavior and suspicious individuals in real time.

[0197] Next, the server inputs the detected facial image into a facial recognition system (e.g., a facial recognition API) and compares it against an existing database. If a known suspicious person is identified through the comparison, the server generates a voice message using generative artificial intelligence (AI). A speech synthesis service (e.g., a text-to-speech API) is used to generate the voice message.

[0198] In addition, the server uses an emotion analysis engine to analyze the emotional state of the detected person. Based on this analysis, the content and tone of the warning message are optimized. For example, issuing the warning in a milder tone where the situation calls for it can prevent unnecessary escalation.

[0199] The server then sends a notification to the user's personal portable device. This notification includes information about the abnormal activity and the emotion analysis results, and is displayed on the user's smartphone or smart glasses.

[0200] As a concrete example, consider a scenario where a suspicious person is detected at the entrance of a house. An example of a prompt message would be: "A suspicious person has been detected by the camera in front of the entrance. The person appears somewhat nervous. Generate a warning message in a calm tone and send a notification to the user." In this way, the user can respond to the situation appropriately and quickly.
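The example prompt in this paragraph can be produced from a template whose slots are filled with the detection location, the analyzed emotion, and the desired tone. The sketch below shows one way to do so; the function name and parameter names are hypothetical, and only the sentence pattern is taken from the example above.

```python
def build_warning_prompt(location: str, emotion: str, tone: str) -> str:
    """Fill the example prompt pattern with the current detection context."""
    for name, value in (("location", location), ("emotion", emotion), ("tone", tone)):
        if not value.strip():
            raise ValueError(f"{name} must not be empty")
    return (
        f"A suspicious person has been detected by the camera {location}. "
        f"The person appears {emotion}. "
        f"Generate a warning message in a {tone} tone "
        "and send a notification to the user."
    )


prompt = build_warning_prompt("in front of the entrance", "somewhat nervous", "calm")
```

With these arguments the function reproduces the example prompt quoted above; the resulting string would then be passed to whichever generative model the server uses.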

[0201] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0202] Step 1:

[0203] The server receives video data in real time from the video capture device. Based on this video data, it performs motion analysis using OpenCV to detect unusual movements. The output is the information indicating the detection of abnormal behavior.

[0204] Step 2:

[0205] The server extracts faces from the detected image data and inputs them into the facial recognition system. The facial images are compared against an existing database to identify specific suspicious individuals. This matching result is then output.

[0206] Step 3:

[0207] The server generates a voice message using artificial intelligence based on the above matching results. The input includes the identity of the suspicious person and the surrounding circumstances. A text-to-speech API is used to generate voice data with an appropriate warning tone. The output is the voice data.

[0208] Step 4:

[0209] The server utilizes an emotion analysis engine to analyze a person's emotional state from video data. The input includes video data, and the output includes the detected emotional state (e.g., stress, anxiety). Based on this analysis, the content and tone of the warning message generated in the previous step are adjusted.

[0210] Step 5:

[0211] Finally, the server sends the refined warning message to the terminal. The user's personal portable device receives this notification and displays it on the screen. The notification includes information about the suspicious person and the emotion analysis results, prompting the user to act quickly.
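The notification described in Step 5 carries both the identification result and the emotion analysis result. A minimal sketch of such a payload, serialized as JSON, is given below; the field names (`type`, `detected_at`, `suspicious_person`, `message`) are hypothetical and are not defined by the specification.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def build_notification(
    person_id: Optional[str],
    emotion: str,
    message: str,
    detected_at: Optional[datetime] = None,
) -> str:
    """Serialize one anomaly notification as a JSON string."""
    detected_at = detected_at or datetime.now(timezone.utc)
    payload = {
        "type": "anomaly_detected",
        "detected_at": detected_at.isoformat(),
        # person_id is None when the face matched no database entry.
        "suspicious_person": {"id": person_id, "emotion": emotion},
        "message": message,
    }
    return json.dumps(payload, ensure_ascii=False)
```

Keeping the emotion result in a structured field, rather than only inside the message text, lets the terminal decide how to present it to the user.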

[0212] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0213] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. Prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images, are input to the data generation model 58. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0214] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.

[0215] [Second Embodiment]

[0216] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.

[0217] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.

[0218] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0219] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.

[0220] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0221] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0222] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0223] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0224] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0225] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0226] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0227] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0228] This invention relates to a highly functional security system that analyzes surveillance footage in real time to identify, warn, and notify about abnormal behavior and suspicious individuals. In this system, video data is collected from various locations where surveillance cameras are installed, and a server is responsible for processing this data. The server utilizes artificial intelligence algorithms for video analysis to determine whether the behavior deviates from normal patterns. The server also performs facial recognition in real time and compares the results with an existing database of suspicious individuals.

[0229] For example, in an unmanned store, the server monitors the movements of customers in real time via surveillance cameras. If a customer takes an unusually long time to return an item to its original place, or is detected moving back and forth unnaturally within the store, this is treated as abnormal behavior. The server then broadcasts a warning sound to the customer through the store's speakers to deter further action. Additionally, the facial data of any detected suspicious individual is sent to the server and cross-referenced with the database.
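The two cues in the unmanned-store example, an unusually long dwell time and unnatural back-and-forth movement, can be approximated with simple rules on a tracked position. The sketch below assumes that a tracker already supplies a sequence of horizontal positions for one customer; the thresholds (`min_step`, `max_reversals`, `max_dwell_seconds`) are invented for illustration and are not values from the specification.

```python
from typing import Sequence


def count_reversals(xs: Sequence[float], min_step: float = 0.5) -> int:
    """Count changes of walking direction along one axis, ignoring jitter."""
    reversals = 0
    last_direction = 0
    for prev, cur in zip(xs, xs[1:]):
        delta = cur - prev
        if abs(delta) < min_step:
            continue  # movement too small to count as a step
        direction = 1 if delta > 0 else -1
        if last_direction and direction != last_direction:
            reversals += 1
        last_direction = direction
    return reversals


def is_abnormal(
    xs: Sequence[float],
    dwell_seconds: float,
    max_reversals: int = 3,
    max_dwell_seconds: float = 120.0,
) -> bool:
    """Flag a track that lingers too long or paces back and forth."""
    return dwell_seconds > max_dwell_seconds or count_reversals(xs) > max_reversals
```

A learned model, as described elsewhere in this specification, would replace such hand-written rules in practice; the rules merely make the two cues concrete.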

[0230] Even in the case of private residences, the server monitors the yard and parking lot, immediately detecting intruders when they enter the property. Once detected, the server uses AI to generate sounds that suggest a human presence or normal household noises, giving the intruder the impression that someone is home. Users can receive notifications on their devices to get information about the current status of the security system and any anomalies detected.

[0231] Through these embodiments, the present invention can contribute to crime prevention and provide a safer living environment.

[0232] The following describes the processing flow.

[0233] Step 1:

[0234] The server continuously receives video streams from surveillance cameras. The video is captured in real time for automated processing and divided into frames.

[0235] Step 2:

[0236] The server preprocesses each frame to remove unnecessary noise. This preprocessing includes adjusting the resolution and sharpening the image, which improves the accuracy of the analysis.

[0237] Step 3:

[0238] The server uses artificial intelligence algorithms to detect abnormal behavior and suspicious movements from pre-processed frames. This analysis employs machine learning models to evaluate human movement and patterns.

[0239] Step 4:

[0240] The server extracts the facial image of the detected person and compares it against an existing database. This database contains pre-registered information on specific suspicious individuals.

[0241] Step 5:

[0242] If a suspicious person is identified, the server immediately generates an alarm sound. This sound is played through speakers in the store or home to warn the suspicious person and deter criminal activity.

[0243] Step 6:

[0244] The server issues a warning and simultaneously notifies administrators and the police. The notification includes details of the detected abnormal behavior and profile information of the suspicious person.

[0245] Step 7:

[0246] The terminal delivers real-time notifications to the user. The user can check the details on the terminal and take prompt action, such as reporting to the police if necessary.
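The preprocessing in Step 2 of the flow above (noise removal and resolution adjustment) can be illustrated on a small grayscale frame. The sketch below uses a 3x3 mean filter for noise reduction and simple subsampling for resolution adjustment; both are assumed choices made for illustration, since the specification does not name particular filters.

```python
from typing import List

Image = List[List[int]]  # grayscale image, pixel values 0-255


def box_blur(img: Image) -> Image:
    """Replace each pixel by the integer mean of its 3x3 neighborhood."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            # Neighborhood is clipped at the image border.
            neighbors = [
                img[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
            ]
            row.append(sum(neighbors) // len(neighbors))
        out.append(row)
    return out


def downsample(img: Image, factor: int = 2) -> Image:
    """Keep every factor-th pixel in both directions."""
    return [row[::factor] for row in img[::factor]]


def preprocess(img: Image, factor: int = 2) -> Image:
    """Noise reduction followed by resolution adjustment."""
    return downsample(box_blur(img), factor)
```

Blurring before subsampling reduces the chance that an isolated noisy pixel survives into the lower-resolution frame passed to Step 3.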

[0247] (Example 1)

[0248] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0249] In recent years, with the increase in criminal activity and intrusions by suspicious individuals, the importance of security systems has grown. However, conventional systems sometimes have difficulty detecting abnormal behavior in real time and taking appropriate action, and further technological advancements are needed. This invention aims to solve these problems and enable more accurate anomaly detection and rapid response.

[0250] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0251] In this invention, the server includes means for analyzing video information received from a video input device in real time to detect abnormal behavior or suspicious persons, means for matching the facial features of the detected person with an existing information aggregate to issue a warning if identified, and means for generating voice information using generative artificial intelligence to warn intruders. This makes it possible to immediately detect abnormal behavior and issue appropriate warnings, thereby preventing crimes from occurring.

[0252] A "video input device" refers to a device that acquires visual information and outputs it as digital data, such as a surveillance camera or sensor camera.

[0253] "Video information" refers to image and video data acquired from video input devices, which are used as the subject of analysis.

[0254] "Real-time analysis" refers to a processing method that performs data processing immediately after it is acquired and outputs the results.

[0255] "Abnormal behavior" refers to actions or movements that deviate from normal behavioral patterns or are unnatural, and that require vigilance from a security standpoint.

[0256] A "suspicious person" refers to an individual deemed suspicious from a crime prevention standpoint and requiring monitoring and response.

[0257] "Facial features" refer to the structural characteristics and patterns of a face that are extracted for facial recognition, and are data used to identify an individual.

[0258] An "information aggregate" refers to a collection of data gathered for a specific purpose, such as information on suspicious individuals or facial recognition data.

[0259] "Generative artificial intelligence" is a type of artificial intelligence that has the ability to generate new data or content based on input data.

[0260] "Audio information" refers to data that represents an audible message or warning, and is played back through a speaker.

[0261] The present invention is a security system that analyzes video information transmitted from a video input device in real time within a surveillance system, detects abnormal behavior and suspicious individuals, and enables a swift and accurate response to these.

[0262] The main components of this system are a server, user terminals, and video input devices, which work together to provide a secure environment. The server uses a computing device equipped with an NVIDIA GPU to process complex video data in real time. The server utilizes software libraries such as OpenCV and TensorFlow to run machine learning models that classify the movement of people in the video and determine whether it is abnormal. It also uses Amazon Rekognition for face recognition processing and quickly matches the results against the existing information aggregate.

[0263] When abnormal behavior is detected, the server uses the Azure Speech API to generate speech information and send a warning sound to a designated output device. Furthermore, it can utilize generative artificial intelligence to output ambient sounds to give intruders the impression that someone is inside the home. Using OpenAI's generative model, it generates natural conversations and ambient sounds. For example, when an intruder is detected, it can generate a voice message saying, "It's time to prepare dinner."

[0264] Users can receive push notifications on their devices when an anomaly is detected. The interface for this is provided through a standard smartphone application, allowing users to monitor the system status in real time and take appropriate action as needed.

[0265] Examples of prompt messages include, "Generate an announcement voice to be used when a customer exhibits suspicious behavior," and "Generate a voice that gives a potential intruder the impression that someone is home."

[0266] With a system configured in this manner, the present invention can contribute to crime prevention and provide a safer and more secure living environment.

[0267] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0268] Step 1:

[0269] The server receives video information in real time from the video input device. A video stream is provided as input, and the server divides it into frames and loads them into memory. Specifically, it acquires data via a network connection and prepares it for analysis.
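As a minimal sketch of the frame handling in Step 1, assuming a raw byte stream with fixed-size frames (the frame layout, the `Frame` and `FrameBuffer` names, and the buffer capacity are all hypothetical and are not specified in this disclosure), the stream could be divided into frames and held in a bounded in-memory buffer so that analysis does not fall behind the live feed:

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Iterable, List


@dataclass
class Frame:
    index: int      # position of the frame in the incoming stream
    pixels: bytes   # raw image data for one frame (layout is an assumption)


class FrameBuffer:
    """Keeps only the most recent frames loaded in memory."""

    def __init__(self, capacity: int) -> None:
        # deque(maxlen=...) discards the oldest frame once capacity is reached.
        self._frames: Deque[Frame] = deque(maxlen=capacity)

    def push(self, frame: Frame) -> None:
        self._frames.append(frame)

    def drain(self) -> List[Frame]:
        # Hand all buffered frames to the analysis step and empty the buffer.
        frames = list(self._frames)
        self._frames.clear()
        return frames


def split_stream(stream: bytes, frame_size: int) -> Iterable[Frame]:
    """Divide a raw byte stream into complete fixed-size frames."""
    for start in range(0, len(stream) - frame_size + 1, frame_size):
        yield Frame(index=start // frame_size,
                    pixels=stream[start:start + frame_size])
```

Dropping the oldest frames when the buffer is full is one possible design choice for keeping latency low; a deployment that must not lose frames would need a different policy.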

[0270] Step 2:

[0271] The server performs video analysis on the video information for each frame loaded into memory. The input frames are first preprocessed, undergoing noise reduction and resolution adjustment. Next, a machine learning model using TensorFlow performs data calculations to detect abnormal behavior and suspicious individuals. The output is the coordinates of detected individuals and information on changes in their behavior patterns. Specifically, the AI model compares the detected behavior patterns with normal behavior patterns and identifies behaviors that are judged to be abnormal.
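The comparison with normal behavior patterns in Step 2 can be illustrated with a simple statistical stand-in. The sketch below does not use the TensorFlow model referred to above; it assumes, purely for illustration, that a person's track is a list of per-frame coordinates and that "normal" is summarized by a set of baseline movement speeds. The function names and the threshold value are hypothetical.

```python
from math import hypot
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) coordinates of a detected individual


def speeds(track: Sequence[Point]) -> List[float]:
    """Per-frame displacement of one tracked person, in pixels per frame."""
    return [hypot(x2 - x1, y2 - y1)
            for (x1, y1), (x2, y2) in zip(track, track[1:])]


def is_abnormal(track: Sequence[Point],
                normal_speeds: Sequence[float],
                z_threshold: float = 3.0) -> bool:
    """Flag a track whose mean speed deviates strongly from the baseline."""
    mu = mean(normal_speeds)
    sigma = pstdev(normal_speeds)
    observed = speeds(track)
    if not observed or sigma == 0:
        return False  # not enough information to make a judgment
    z = abs(mean(observed) - mu) / sigma
    return z > z_threshold
```

For example, with a baseline of roughly one pixel per frame, a track that moves five pixels per frame would be flagged, while a track that matches the baseline would not.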

[0272] Step 3:

[0273] The server analyzes the faces of detected individuals using Amazon Rekognition. Face images from video frames are used as input, and feature points are extracted. Face features are compared with an existing database of information aggregates, and if a match is found, a risk assessment is performed. The output is information on whether a particular person has been identified as a suspicious person. Specifically, the face recognition algorithm analyzes the facial features and performs a database search.
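The matching of face features against the information aggregate in Step 3 might be pictured as follows. This is not the Amazon Rekognition API; it is a self-contained sketch that assumes face features are already available as numeric vectors and that matching is done by cosine similarity against a threshold. The database layout, identifiers, and threshold are assumptions made for illustration.

```python
from math import sqrt
from typing import Dict, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_face(features: Sequence[float],
               database: Dict[str, Sequence[float]],
               threshold: float = 0.9) -> Optional[Tuple[str, float]]:
    """Return (person_id, score) of the best match at or above the threshold."""
    best_id: Optional[str] = None
    best_score = threshold
    for person_id, reference in database.items():
        score = cosine_similarity(features, reference)
        if score >= best_score:
            best_id, best_score = person_id, score
    return (best_id, best_score) if best_id is not None else None
```

A result of `None` corresponds to the case in which no match is found, so no risk assessment would be triggered.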

[0274] Step 4:

[0275] The server generates audio information using generative artificial intelligence when it detects abnormal behavior or suspicious individuals. A prompt such as "Generate a warning audio." is input to the generative AI model, which produces the warning content. Specifically, a speech synthesis engine converts the resulting text into natural-sounding speech, and the warning audio is then output to speakers via the network.
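The two stages of Step 4 (text generation followed by speech synthesis) can be sketched with the generative model and the speech synthesis engine passed in as plain callables. No real API is called here; `generate_text` and `synthesize` are hypothetical placeholders for whichever services are used, and the extra wording added to the prompt is an assumption.

```python
from typing import Callable


def build_warning_prompt(event: str, location: str) -> str:
    """Compose the prompt sent to the generative AI model."""
    return (
        "Generate a warning audio. "
        f"Event: {event}. Location: {location}. "
        "Keep the message under 20 words."  # hypothetical constraint
    )


def generate_warning_audio(event: str,
                           location: str,
                           generate_text: Callable[[str], str],
                           synthesize: Callable[[str], bytes]) -> bytes:
    """Prompt -> warning text -> synthesized speech data."""
    prompt = build_warning_prompt(event, location)
    text = generate_text(prompt)   # generative AI model (injected)
    return synthesize(text)        # speech synthesis engine (injected)
```

Injecting the two services keeps the flow testable with stand-ins and leaves the choice of provider open.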

[0276] Step 5:

[0277] The server notifies the user's terminal of detected suspicious individuals or anomalies. Anomaly information is provided as input and immediately sent to the terminal via a push notification service. The user receives this notification and has the opportunity to take necessary actions. The output is a notification message, and as a concrete action, the system-generated report is displayed in the terminal's notification center.
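One possible shape for the notification message of Step 5 is sketched below. The field names and the JSON encoding are assumptions; the disclosure does not specify a payload format, and delivery through an actual push notification service is outside the sketch.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class AnomalyNotification:
    title: str
    body: str
    camera_id: str     # hypothetical identifier of the video input device
    detected_at: str   # ISO 8601 timestamp in UTC


def build_notification(camera_id: str,
                       description: str,
                       detected_at: datetime) -> str:
    """Serialize anomaly information into a JSON notification payload."""
    note = AnomalyNotification(
        title="Anomaly detected",
        body=description,
        camera_id=camera_id,
        detected_at=detected_at.astimezone(timezone.utc).isoformat(),
    )
    return json.dumps(asdict(note), sort_keys=True)
```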

[0278] (Application Example 1)

[0279] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0280] In conventional security systems, it is difficult to detect abnormal behavior or suspicious persons in real time and to accurately notify such information. Further, there is a problem that it is difficult for an administrator who is at a remote location to immediately grasp the situation. Furthermore, even when an abnormality is detected, there is a problem that an appropriate warning cannot be immediately issued and effective countermeasures cannot be taken.

[0281] The specific processing by the specific processing unit 290 of the data processing apparatus 12 in Application Example 1 is realized by the following respective means.

[0282] In this invention, the server includes means for immediately analyzing video information received from a video acquisition device to detect abnormal behavior or a suspicious person, means for issuing a notification when the face information of the detected person is identified by comparison with an existing information set, and means for generating voice information using artificial intelligence to issue a warning. Thereby, when an abnormality is detected, an administrator or a person in charge at a remote location can quickly grasp the situation and immediately take appropriate countermeasures.

[0283] The "video acquisition device" is a device that can acquire surrounding video information and transmit it as digital data.

[0284] The "video information" refers to visual video data obtained from a video acquisition device, and useful information can be extracted by appropriately analyzing its content.

[0285] "Immediately analyze" refers to a method of processing the acquired information in real time so that the result can be quickly obtained.

[0286] "Abnormal behavior" refers to an action that deviates from the normal behavior pattern and is identified as a situation that warrants vigilance.

[0287] A "suspicious person" refers to a person who takes unexpected or suspicious actions and who warrants vigilance from a security perspective.

[0288] "Detection" refers to the process of sensing or recognizing the presence or action of an object.

[0289] "Facial information" refers to identifiable data related to a person's face, which is used to identify individuals.

[0290] An "information set" refers to a collection of data that is an aggregate of existing databases or individual pieces of information.

[0291] "Notification" refers to the act of communicating detected information and informing relevant parties and systems.

[0292] "Artificial intelligence" refers to technology that enables machines to mimic human intellectual activity and perform actions with the aim of automatically executing specific functions.

[0293] "Auditory information" refers to synthesized sound data and is used as a means of conveying information through hearing.

[0294] A "warning" refers to an action taken to alert someone to a dangerous or suspicious situation and draw their attention.

[0295] "Administrator" refers to a person or organization responsible for monitoring the system and managing it to ensure its proper operation.

[0296] "Law enforcement agencies" refer to organizations that oversee the enforcement of laws in order to maintain public safety.

[0297] "External devices" refer to additional equipment or terminals that work in conjunction with a system to perform information processing or communication.

[0298] A "user terminal" refers to a device that a user directly operates and uses to receive information, and includes smartphones and computers.

[0299] An "information output device" is a device used to visualize or make information audible, and refers to a device that emits sound or images.

[0300] In the system implementing this invention, the server first receives video information in real time from a video acquisition device. The video acquisition device acquires video information of the surrounding environment and transmits it to the server as a digital signal. The server processes this video information immediately using a video analysis algorithm and detects abnormal behavior and suspicious persons. The main software used here consists of libraries such as TensorFlow and OpenCV. If an anomaly is detected as a result of the analysis, the server sends that information to the user terminal as a push notification. A smartphone or similar device is used as the user terminal, and the user can receive the anomaly detection notification immediately.

[0301] If a suspicious person is detected, the server utilizes a facial recognition API such as Amazon Rekognition to compare the person's facial information with an existing set of information. Once the facial information comparison is complete, the server generates an appropriate warning based on the detection result. The warning voice is synthesized as natural-sounding audio using a generative AI model. The server then sends this synthesized voice to an external audio output device, where it is broadcast as a warning to the surrounding area. For example, a speaker in the home might emit the voice to psychologically deter the suspicious person.

[0302] As a concrete example, if an unauthorized intrusion is detected in a home garden at night, the server will immediately recognize the anomaly and notify the user via their smartphone with a prompt message stating, "An anomaly has been detected. Please check the details." Upon receiving this notification, the user can check the situation via their smartphone and, if necessary, report it to law enforcement. This allows the user to respond to the situation efficiently and ensure their safety.

[0303] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0304] Step 1:

[0305] The server receives video information from the video acquisition device. The input is real-time digital video data transmitted from the video acquisition device, and using this data, the server prepares to perform processing for the next step.

[0306] Step 2:

[0307] The server analyzes the received video information using a video analysis library such as TensorFlow or OpenCV. In this step, to detect abnormal behaviors or suspicious persons deviating from normal behavior patterns, the video is analyzed using a machine learning model based on a dataset. The output is the detection result of abnormal behavior.

[0308] Step 3:

[0309] Based on the detection result, the server uses a face recognition service like Amazon Rekognition to match the face information of the suspicious person with an existing face information database. In this process, a search and match determination are performed based on the detected face information as input. The output is the presence or absence of a specific person obtained as the matching result.

[0310] Step 4:

[0311] The server uses a generative AI model to generate voice information when a warning should be issued. In this step, the input is the information that an abnormality or a suspicious person has been detected, the processing converts text data into voice, and the output is the synthesized warning voice.

[0312] Step 5:

[0313] The server transmits the generated voice information to an external voice output device to issue a warning to the surroundings. Specifically, the voice is played through a home speaker or an external output system to prompt vigilance. In this process, data processing is performed for transmission to the output device, and the output is the actual voice played back.

[0314] Step 6:

[0315] The server sends an anomaly detection notification to the user's terminal as a push notification. Using the detected anomaly information as input, the server notifies the user with the prompt message, "An anomaly has been detected. Please check the details." The output is the notification the user receives on their terminal.

[0316] Step 7:

[0317] The user checks the notification received on their smartphone and understands the situation. In this step, the user receives a push notification as input and then checks the log and video details based on it. In this process, the next action the user should take (e.g., contacting law enforcement) is determined. The output is the action taken based on the user's judgment.
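The ordering and branching of Steps 2 to 6 above can be summarized in a short sketch. Each processing stage is represented by an injected callable, so no specific library or service is implied; the `Handlers` container, the step labels, and the reason strings are hypothetical. Only the notification text is taken from the description.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

NOTIFICATION_TEXT = "An anomaly has been detected. Please check the details."


@dataclass
class Handlers:
    detect_anomaly: Callable[[bytes], bool]           # Step 2
    match_face: Callable[[bytes], Optional[str]]      # Step 3
    synthesize_warning: Callable[[str], bytes]        # Step 4
    play_audio: Callable[[bytes], None]               # Step 5
    push_notify: Callable[[str], None]                # Step 6


def process_frame(frame: bytes, h: Handlers) -> List[str]:
    """Run Steps 2 to 6 for one frame; return the labels of executed steps."""
    executed = ["analyze"]
    if not h.detect_anomaly(frame):
        return executed  # nothing abnormal: later steps are skipped
    person = h.match_face(frame)
    executed.append("match_face")
    reason = "known suspicious person" if person else "abnormal behavior"
    audio = h.synthesize_warning(reason)
    executed.append("synthesize")
    h.play_audio(audio)
    executed.append("play")
    h.push_notify(NOTIFICATION_TEXT)
    executed.append("notify")
    return executed
```

Step 7 is performed by the user and therefore falls outside this server-side sketch.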

[0318] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0319] This invention provides an advanced security system that can immediately respond to suspicious individuals and abnormal behavior by combining an emotion engine with a surveillance system. The server receives video data from surveillance cameras and analyzes it in real time to identify abnormal behavior or suspicious individuals. The server is equipped with an AI-powered facial recognition system, which identifies suspicious individuals by comparing detected facial images with a database. Furthermore, after detection, it generates warning voices using generative AI and sends voice messages to the suspicious individuals.

[0320] The emotion engine analyzes the emotional state of detected individuals and users, enabling more flexible responses. Specifically, the server uses the emotion analysis results to optimize the content and tone of warning messages to suit the situation. For example, if an intruder is in the initial stages of an intrusion, a milder warning is issued to avoid unnecessary tension and prevent the situation from escalating. On the other hand, if stress or anxiety is detected in the user's emotional state, a more assertive warning and prompt notification to the administrator are issued.
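The two cases described above can be written as a small decision rule. This is a sketch only: the label strings, the priority given to the user's emotional state over the intrusion stage, and the "neutral" fallback are assumptions that the description does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WarningPlan:
    tone: str                    # "mild", "assertive", or "neutral"
    notify_administrator: bool   # whether to notify the administrator promptly


def plan_warning(intrusion_stage: str, user_emotion: str) -> WarningPlan:
    """Choose the warning tone from the emotion analysis results."""
    # Assumption: stress or anxiety in the user takes priority.
    if user_emotion in {"stress", "anxiety"}:
        return WarningPlan(tone="assertive", notify_administrator=True)
    # Early-stage intrusion: a milder warning to avoid escalation.
    if intrusion_stage == "initial":
        return WarningPlan(tone="mild", notify_administrator=False)
    # Fallback not described in the text; chosen here for completeness.
    return WarningPlan(tone="neutral", notify_administrator=False)
```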

[0321] In this system, the terminal notifies the user, providing details of abnormal activity and the results of emotion analysis. Users can quickly understand the situation and take necessary actions through the terminal. Furthermore, because these notifications include information refined by the emotion engine, they help victims or administrators make appropriate decisions.

[0322] As a concrete example, consider a scenario where a suspicious person is detected in a residential parking lot. The server identifies the suspicious person through video processing and facial recognition, and analyzes their emotions. If signs of anxiety or tension are detected, the emotion engine adjusts the warning voice based on that information, issuing a warning in an appropriate tone. The user is then notified of this information via their terminal, allowing them to take appropriate action. In this way, the present invention enables advanced security measures in a variety of situations.

[0323] The following describes the processing flow.

[0324] Step 1:

[0325] The server continuously receives video streams from surveillance cameras. These video streams are processed as digital data and subdivided into individual frames.

[0326] Step 2:

[0327] The server uses AI algorithms to analyze the received video frames. This allows for the rapid detection of abnormal behavior patterns and suspicious individuals.

[0328] Step 3:

[0329] The server compares the detected person's facial image with the database. If the person is registered in the database as a suspicious individual, a specific warning flag is set.

[0330] Step 4:

[0331] The emotion engine is activated and evaluates the emotional state of the suspect and the user from the video and audio data. For example, it identifies feelings such as anxiety, tension, and anger.

[0332] Step 5:

[0333] The server generates the optimal warning message based on the analysis results of the emotion engine. Speech synthesis technology adjusts the tone and content according to the situation.

[0334] Step 6:

[0335] The device plays the generated warning audio through speakers at the site. This audio serves as a warning message to suspicious individuals, attempting to deter any further activity at the scene.

[0336] Step 7:

[0337] The server sends notifications to administrators and police containing data on abnormal behavior and sentiment analysis. These notifications include video clips and analyzed sentiment information.

[0338] Step 8:

[0339] Users receive detailed notifications through their devices. Sentiment-based recommendations are displayed to help them make quick decisions.

[0340] (Example 2)

[0341] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0342] Modern security systems face the challenge of lacking the flexibility to quickly detect suspicious individuals or abnormal behavior and respond appropriately. In particular, there is a need to prevent escalation of situations and achieve more effective security measures by providing warnings and notifications that take emotional states into consideration.

[0343] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0344] In this invention, the server includes means for immediately processing image information acquired from a video acquisition device to identify abnormal behavior or suspicious individuals; means for matching the facial information of the identified individual with a prior set of information to issue an alarm if identified; and means for generating voice instructions using artificial intelligence to issue warnings to suspicious individuals. This enables a rapid and flexible response to suspicious individuals and abnormal behavior, tailored to their emotional state.

[0345] An "image acquisition device" refers to equipment used to capture image information and process or analyze that data.

[0346] "Image information" refers to visual data obtained from an image acquisition device, and includes the visual characteristics of the object or individual being analyzed.

[0347] "Abnormal behavior" refers to actions or behaviors that deviate from normal behavioral patterns and may lead to suspicious situations.

[0348] A "suspicious individual" refers to an individual who exhibits unusual or suspicious behavior, as detected by a security system.

[0349] "Facial information" refers to data containing facial features used for individual identification, and is usually stored as image data.

[0350] "Prior information sets" refer to reference datasets that have been collected or registered in the past and are used as criteria for identification.

[0351] An "alarm" refers to a notification method used to warn or alert people to detected anomalies or suspicious situations.

[0352] "Generative artificial intelligence" refers to a system that uses artificial intelligence technology to create newly generated content and information.

[0353] "Voice instructions" refer to voice messages generated using speech synthesis technology to give instructions or warnings to a target.

[0354] This invention provides an advanced security system that enables immediate response to suspicious individuals and abnormal behavior through the cooperation of servers, terminals, and users. This system is implemented using the following combination of hardware and software.

[0355] The server acquires real-time image information transmitted from the video acquisition device. Using image processing libraries such as OpenCV, the server analyzes this image information and detects motion and facial features. The detected facial information is compared with a pre-collected database, and a face recognition model using TensorFlow identifies suspicious individuals.

[0356] Furthermore, the server uses generative artificial intelligence to generate warning messages for suspicious individuals. Specifically, it utilizes natural language processing models to create warning texts, which are then converted into voice instructions using speech synthesis technology. For example, a warning message such as "There is a suspicious person in the parking lot. Suspicious behavior has been detected." is generated. This message is optimized according to the emotional state of the suspicious individual detected by an emotion analysis device. Amazon Comprehend, among others, is used for emotion analysis.

[0357] The terminal transmits the generated voice instructions to the suspicious individual via its speaker. Simultaneously, it provides the user with a notification containing information about the suspicious person and the emotion analysis results. Based on the detailed information received through the terminal, the user can quickly assess the situation and take appropriate action, such as reporting to the police.

[0358] A concrete example of this system is intruder detection in a residential parking lot. The server analyzes video footage of the parking lot, identifies the intruder, and analyzes their emotions. An example of a prompt message might be, "What information is needed to generate a warning message if an intruder enters the parking lot?" In this way, the present invention enhances security measures in various environments.

[0359] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0360] Step 1:

[0361] The server receives image information from the video acquisition device. This input image information is video data consisting of multiple frames. The server uses OpenCV to extract frames from this video and executes a motion detection algorithm. Specifically, it uses methods such as background subtraction and optical flow analysis to identify moving parts. The output of this process is the frame image in which motion was detected.
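The idea behind the motion detection in Step 1 can be shown with plain frame differencing, which is the simplest form of background subtraction (the previous frame serves as the background). The sketch deliberately avoids OpenCV and works on grayscale frames held as nested lists; the thresholds are hypothetical, and optical flow analysis is not covered.

```python
from typing import List

Gray = List[List[int]]  # grayscale frame: rows of 0-255 intensities


def motion_mask(previous: Gray, current: Gray,
                threshold: int = 25) -> List[List[bool]]:
    """Mark pixels whose intensity changed by more than the threshold."""
    return [
        [abs(c - p) > threshold for p, c in zip(prev_row, cur_row)]
        for prev_row, cur_row in zip(previous, current)
    ]


def has_motion(previous: Gray, current: Gray,
               threshold: int = 25, min_pixels: int = 2) -> bool:
    """Report motion when enough pixels changed between two frames."""
    mask = motion_mask(previous, current, threshold)
    changed = sum(cell for row in mask for cell in row)
    return changed >= min_pixels
```

Frames for which `has_motion` returns `True` would correspond to the output of this step, namely the frame images in which motion was detected.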

[0362] Step 2:

[0363] The server detects face regions from frame images where motion has been detected. A face detection model using TensorFlow extracts regions containing facial features. This face region is then used as input data and compared against known faces in a database by a pre-trained face recognition model. The output is the face information of the identified suspicious individual.

[0364] Step 3:

[0365] The server performs emotion analysis on the identified facial information. This uses Amazon Comprehend to analyze emotional states and detect emotions such as anxiety and anger. The input data is the identified facial information, and the output data includes emotional states. This allows the server to understand the emotional state of the suspicious person.

[0366] Step 4:

[0367] The server utilizes a generative artificial intelligence model to generate warning messages for suspicious individuals. Specifically, it inputs a prompt into a natural language processing model to create a warning message such as, "There is a suspicious person in the parking lot." The output warning message is then converted into voice instructions using speech synthesis technology.

[0368] Step 5:

[0369] The terminal transmits the generated voice instructions through its speaker. Simultaneously, the server sends a notification about the abnormal situation to the terminal and provides it to the user. The notification includes the results of sentiment analysis and suggests the next action the user should take. As output, the user receives an audio warning and a detailed explanation of the situation.

[0370] Step 6:

[0371] The user acts quickly based on the information provided by the device. This information includes transmitted voice instructions and device notifications. The user can use this information to take countermeasures against suspicious individuals. User actions include, for example, contacting the police or taking additional security measures.

[0372] (Application Example 2)

[0373] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0374] In recent years, security challenges in society have increased, with a particular need for early detection and appropriate response to suspicious individuals and abnormal behavior. However, current security systems struggle to respond flexibly, taking into account the emotional state of suspicious individuals, which can lead to false alarms and unnecessary tension. Therefore, there is a need to provide a system that improves detection accuracy and enables the generation and notification of warning messages based on appropriate situational judgment.

[0375] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0376] In this invention, the server includes means for analyzing video data received from a video capture device in real time to detect abnormal behavior or suspicious persons; means for matching the facial image of the detected person with an existing database and issuing a warning if identified; means for generating voice messages using artificial intelligence to warn suspicious persons; means for analyzing the emotional state of the detected person using an emotion analysis engine and optimizing the content and tone of the warning message; and means for providing notifications of abnormal activity based on the emotion analysis results to personal portable devices. This enables users to take appropriate action based on the emotional state of suspicious persons.

[0377] A "video capture device" is a device, such as a video camera or surveillance camera, that acquires and transmits video data in real time.

[0378] "Real-time analysis" refers to a method of processing input data immediately and outputting results without delay.

[0379] "Abnormal behavior" refers to movements or actions that differ from the norm and that should be taken into consideration in crime prevention.

[0380] A "suspicious person" refers to an individual who does not fit in with the normal environment or circumstances, or who is considered suspicious.

[0381] A "face image" is visual data of a person's face captured by a digital camera or other image acquisition device.

[0382] A "database" is a digital archive organized to efficiently store and search large amounts of information.

[0383] A "voice message" is a recorded or synthesized message that is produced to convey information through sound.

[0384] An "emotion analysis engine" is a system that automatically determines a person's emotional state based on data obtained from video and audio.

[0385] A "personal portable device" refers to a digital device that a user personally owns and can carry with them, such as a smartphone or tablet.

[0386] To implement this invention, the server uses video data received from a high-resolution video capture device. The server uses the open-source image processing library "OpenCV" to perform video analysis. This makes it possible to detect abnormal behavior and suspicious individuals in real time.

[0387] Next, the server inputs the detected facial image into a facial recognition system (e.g., a facial recognition API) and compares it against an existing database. If a known suspicious person is identified through the comparison, the server generates a voice message using generative artificial intelligence (AI). A speech synthesis service (e.g., a text-to-speech API) is used to generate the voice message.

[0388] In addition, the server uses an emotion analysis engine to analyze the emotional state of the detected person. Based on this analysis, the content and tone of the warning message are optimized. For example, issuing the warning in a milder tone when the situation calls for it can prevent unnecessary escalation.

[0389] The server then sends a notification to the user's personal portable device. This notification includes information about the abnormal activity and the emotion analysis results, which are displayed on the user's smartphone or smart glasses.

[0390] As a concrete example, consider a scenario where a suspicious person is detected at the entrance of a house. An example of a prompt message would be: "A suspicious person has been detected by the camera in front of the entrance. The person appears somewhat nervous. Generate a warning message in a calm tone and send a notification to the user." In this way, the user can respond to the situation appropriately and quickly.
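A prompt of the kind quoted above could be assembled from the detection results with a simple template. The function and parameter names below are hypothetical; only the wording of the sentences follows the example prompt.

```python
def build_prompt(location: str, emotion: str, tone: str) -> str:
    """Fill the example prompt with the detected location, emotion, and tone."""
    return (
        f"A suspicious person has been detected by the camera {location}. "
        f"The person appears {emotion}. "
        f"Generate a warning message in a {tone} tone "
        "and send a notification to the user."
    )
```

Calling `build_prompt("in front of the entrance", "somewhat nervous", "calm")` reproduces the example prompt given in the text.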

[0391] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0392] Step 1:

[0393] The server receives video data in real time from the video capture device. Based on this video data, it performs motion analysis using OpenCV to detect unusual movements. The output is the information indicating the detection of abnormal behavior.

[0394] Step 2:

[0395] The server extracts faces from the detected image data and inputs them into the facial recognition system. The facial images are compared against an existing database to identify specific suspicious individuals. This matching result is then output.

[0396] Step 3:

[0397] The server generates a voice message using artificial intelligence based on the above matching results. The input includes the identity of the suspicious person and the circumstances surrounding them. A text-to-speech API is used to generate voice data with an appropriate warning tone. The voice data is the output.

[0398] Step 4:

[0399] The server utilizes an emotion analysis engine to analyze a person's emotional state from video data. The input includes video data, and the output includes the detected emotional state (e.g., stress, anxiety). Based on this analysis, the content and tone of the warning message generated in the previous step are adjusted.

[0400] Step 5:

[0401] Finally, the server sends the refined warning message to the user's personal portable device, which receives this notification and displays it on the screen. The notification includes information about the suspicious person and the emotion analysis results, prompting the user to act quickly.

[0402] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0403] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives, as input, prompts containing instructions together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0404] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.

[0405] [Third Embodiment]

[0406] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.

[0407] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.

[0408] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0409] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.

[0410] The microphone 238 receives instructions from the user 20 by picking up the voice uttered by the user 20. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0411] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0412] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0413] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0414] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0415] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0416] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0417] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".

[0418] This invention relates to a highly functional security system that analyzes surveillance footage in real time to identify, warn, and notify about abnormal behavior and suspicious individuals. In this system, video data is collected from various locations where surveillance cameras are installed, and a server is responsible for processing this data. The server utilizes artificial intelligence algorithms for video analysis to determine whether the behavior deviates from normal patterns. The server also performs facial recognition in real time and compares the results with an existing database of suspicious individuals.

[0419] For example, in an unmanned store, a server monitors the movements of customers in real time via surveillance cameras. If a customer takes an unusually long time to return an item to its original place, or if they are detected moving back and forth unnaturally within the store, this is treated as abnormal behavior. The server then broadcasts a warning sound to the customer through the store's speakers to deter further actions. Additionally, the facial data of any detected suspicious individual is sent to the server and cross-referenced with a database.
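
As a rough illustration of the store example above, the following sketch flags a customer track as abnormal when the customer reverses direction many times or stays longer than a limit. The function names, the one-dimensional aisle position, and both thresholds are hypothetical values chosen for illustration. They are not parameters given in this disclosure, and a trained model may be used instead of such a rule.

```python
from typing import Sequence


def count_reversals(positions: Sequence[float]) -> int:
    """Count how often the direction of movement flips along one axis."""
    reversals = 0
    prev_sign = 0
    for a, b in zip(positions, positions[1:]):
        step = b - a
        sign = (step > 0) - (step < 0)  # -1, 0, or +1
        if sign != 0:
            if prev_sign != 0 and sign != prev_sign:
                reversals += 1
            prev_sign = sign
    return reversals


def is_abnormal_track(
    positions: Sequence[float],
    dwell_seconds: float,
    max_reversals: int = 4,            # assumed limit on back-and-forth movement
    max_dwell_seconds: float = 600.0,  # assumed limit on time spent in the store
) -> bool:
    """Return True if the track shows many reversals or a long stay."""
    too_many_turns = count_reversals(positions) > max_reversals
    too_long = dwell_seconds > max_dwell_seconds
    return too_many_turns or too_long
```

Stationary samples (a step of zero) are ignored so that a customer who pauses and then continues in the same direction is not counted as reversing.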

[0420] Even in the case of private residences, the server monitors the yard and parking lot, immediately detecting intruders when they enter the property. Once detected, the server uses AI to generate sounds that suggest a human presence or normal household noises, giving the intruder the impression that someone is home. Users can receive notifications on their devices to get information about the current status of the security system and any anomalies detected.

[0421] Through these embodiments, the present invention can contribute to crime prevention and provide a safer living environment.

[0422] The following describes the processing flow.

[0423] Step 1:

[0424] The server continuously receives video streams from surveillance cameras. The video is captured in real time for automated processing and divided into frames.

[0425] Step 2:

[0426] The server preprocesses each frame, removing unnecessary noise. This process includes adjusting the resolution and sharpening the image, which improves the accuracy of the analysis.

[0427] Step 3:

[0428] The server uses artificial intelligence algorithms to detect abnormal behavior and suspicious movements from pre-processed frames. This analysis employs machine learning models to evaluate human movement and patterns.

[0429] Step 4:

[0430] The server extracts the facial image of the detected person and compares it against an existing database. This database contains pre-registered information on specific suspicious individuals.

[0431] Step 5:

[0432] If a suspicious person is identified, the server immediately generates an alarm sound. This sound is played through speakers in the store or home to warn the suspicious person and deter criminal activity.

[0433] Step 6:

[0434] The server issues a warning and simultaneously notifies administrators and the police. The notification includes details of the detected abnormal behavior and profile information of the suspicious person.

[0435] Step 7:

[0436] The terminal sends real-time notifications to the user. The user can check the details on the terminal and take prompt action, such as reporting to the police if necessary.

[0437] (Example 1)

[0438] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0439] In recent years, with the increase in criminal activity and intrusions by suspicious individuals, the importance of security systems has grown. However, conventional systems sometimes have difficulty detecting abnormal behavior in real time and taking appropriate action, and further technological advancements are needed. This invention aims to solve these problems and enable more accurate anomaly detection and rapid response.

[0440] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0441] In this invention, the server includes means for analyzing video information received from a video input device in real time to detect abnormal behavior or suspicious persons, means for matching the facial features of the detected person with an existing information aggregate to issue a warning if identified, and means for generating voice information using generative artificial intelligence to warn intruders. This makes it possible to immediately detect abnormal behavior and issue appropriate warnings, thereby preventing crimes from occurring.

[0442] A "video input device" refers to a device that acquires visual information and outputs it as digital data, such as a surveillance camera or sensor camera.

[0443] "Video information" refers to image and video data acquired from video input devices, which are used as the subject of analysis.

[0444] "Real-time analysis" refers to a processing method that performs data processing immediately after it is acquired and outputs the results.

[0445] "Abnormal behavior" refers to actions or movements that deviate from normal behavioral patterns or are unnatural, and that require vigilance from a security standpoint.

[0446] A "suspicious person" refers to an individual deemed suspicious from a crime prevention standpoint and requiring monitoring and response.

[0447] "Facial features" refer to the structural characteristics and patterns of a face that are extracted for facial recognition, and are data used to identify an individual.

[0448] An "information aggregate" refers to a collection of data gathered for a specific purpose, such as information on suspicious individuals or facial recognition data.

[0449] "Generative artificial intelligence" is a type of artificial intelligence that has the ability to generate new data or content based on input data.

[0450] "Voice information" refers to data that represents an audible message or warning, and is played back through a speaker.

[0451] The present invention is a security system that analyzes video information transmitted from a video input device in real time within a surveillance system, detects abnormal behavior and suspicious individuals, and enables a swift and accurate response to these.

[0452] The main components of this system are a server, user terminals, and video input devices, which work together to provide a secure environment. The server uses a computing device equipped with an NVIDIA GPU to process complex video data in real time. The server utilizes software libraries such as OpenCV and TensorFlow to run machine learning models that classify the movement of people in the video and determine whether it is abnormal. It also uses Amazon Rekognition for face recognition processing and quickly performs database matching with existing data collections.

[0453] When abnormal behavior is detected, the server uses the Azure Speech API to generate speech information and send a warning sound to a designated output device. Furthermore, it can utilize generative artificial intelligence to output ambient sounds to give intruders the impression that someone is inside the home. Using OpenAI's generative model, it generates natural conversations and ambient sounds. For example, when an intruder is detected, it can generate a voice message saying, "It's time to prepare dinner."

[0454] Users can receive push notifications on their devices when an anomaly is detected. The interface for this is provided through a standard smartphone application, allowing users to monitor the system status in real time and take appropriate action as needed.

[0455] Examples of prompt messages include, "Generate an announcement voice to be used when a customer exhibits suspicious behavior," and "Generate a voice that gives a potential intruder the impression that someone is home."
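
The two prompt examples above can be held as templates and selected per scene, as in the sketch below. The dictionary keys `store` and `residence`, the function name `build_prompt`, and the optional `Context:` suffix are assumptions made for illustration only.

```python
PROMPT_TEMPLATES = {
    # Wording follows the two prompt examples given in the text.
    "store": (
        "Generate an announcement voice to be used when a customer "
        "exhibits suspicious behavior."
    ),
    "residence": (
        "Generate a voice that gives a potential intruder the impression "
        "that someone is home."
    ),
}


def build_prompt(scene: str, detail: str = "") -> str:
    """Return the prompt for a scene, optionally with extra context appended."""
    try:
        base = PROMPT_TEMPLATES[scene]
    except KeyError:
        raise ValueError(f"unknown scene: {scene!r}") from None
    return f"{base} Context: {detail}" if detail else base
```

For example, `build_prompt("residence", "yard, 23:10")` returns the residence prompt followed by `Context: yard, 23:10`.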

[0456] With a system configured in this manner, the present invention can contribute to crime prevention and provide a safer and more secure living environment.

[0457] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0458] Step 1:

[0459] The server receives video information in real time from the video input device. A video stream is provided as input, and the server divides it into frames and loads them into memory. Specifically, it acquires data via a network connection and prepares it for analysis.
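
One way to keep only recent frames in memory while the stream is being received is a fixed-size buffer, sketched below. The function name `sliding_windows`, the window size, and the use of `bytes` for an encoded frame are assumptions. The disclosure does not specify how frames are buffered.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List


def sliding_windows(frames: Iterable[bytes], window: int = 3) -> Iterator[List[bytes]]:
    """Yield the most recent `window` frames each time a new frame arrives."""
    buffer: Deque[bytes] = deque(maxlen=window)  # oldest frame is dropped automatically
    for frame in frames:
        buffer.append(frame)
        if len(buffer) == window:
            yield list(buffer)
```

Because the buffer has a maximum length, memory use stays bounded regardless of how long the stream runs.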

[0460] Step 2:

[0461] The server performs video analysis on the video information for each frame loaded into memory. The input frames are first preprocessed, undergoing noise reduction and resolution adjustment. Next, a machine learning model using TensorFlow performs data calculations to detect abnormal behavior and suspicious individuals. The output is the coordinates of detected individuals and information on changes in their behavior patterns. Specifically, the AI model compares the detected behavior patterns with normal behavior patterns and identifies behaviors that are judged to be abnormal.
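
The comparison between a detected behavior pattern and normal behavior patterns can be illustrated with a simple statistical stand-in. The sketch below scores a single behavior feature (for example, seconds spent in one spot) by its distance from the mean of normal samples. This is not the TensorFlow model described above. The function names, the choice of feature, and the threshold of 3.0 are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence


def anomaly_score(value: float, normal_samples: Sequence[float]) -> float:
    """Return how many standard deviations `value` lies from the normal mean.

    `normal_samples` needs at least two values for the sample deviation.
    """
    mu = mean(normal_samples)
    sigma = stdev(normal_samples)
    if sigma == 0:
        # All normal samples are identical: any different value is maximally unusual.
        return 0.0 if value == mu else float("inf")
    return abs(value - mu) / sigma


def is_abnormal(value: float, normal_samples: Sequence[float], threshold: float = 3.0) -> bool:
    """Flag the value when its score exceeds the assumed threshold."""
    return anomaly_score(value, normal_samples) > threshold
```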

[0462] Step 3:

[0463] The server analyzes the faces of detected individuals using Amazon Rekognition. Face images from video frames are used as input, and feature points are extracted. The facial features are compared with the existing information aggregate, and if a match is found, a risk assessment is performed. The output is information on whether a particular person has been identified as a suspicious person. Specifically, the face recognition algorithm analyzes the facial features and performs a database search.
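
The matching of facial features against an information aggregate can be sketched in general terms as a nearest-match search over feature vectors. The code below uses cosine similarity and a fixed threshold. It is a generic illustration and does not reproduce the Amazon Rekognition API. The function names, the vector representation, and the threshold of 0.9 are assumptions.

```python
import math
from typing import Dict, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_face(
    features: Sequence[float],
    aggregate: Dict[str, Sequence[float]],  # registered person ID -> feature vector
    threshold: float = 0.9,                 # assumed minimum similarity for a match
) -> Optional[Tuple[str, float]]:
    """Return (person_id, score) for the best match at or above the threshold."""
    best: Optional[Tuple[str, float]] = None
    for person_id, registered in aggregate.items():
        score = cosine_similarity(features, registered)
        if score >= threshold and (best is None or score > best[1]):
            best = (person_id, score)
    return best
```

Returning `None` when no registered entry reaches the threshold lets the caller treat an unknown face differently from a registered suspicious person.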

[0464] Step 4:

[0465] The server generates audio information using generative artificial intelligence when it detects abnormal behavior or suspicious individuals. A prompt such as "Generate a warning audio." is input to the generative AI model. In response, the model generates warning audio, which is then output to speakers via the network. Specifically, a speech synthesis engine converts the input text into natural-sounding speech.

[0466] Step 5:

[0467] The server notifies the user's terminal of detected suspicious individuals or anomalies. Anomaly information is provided as input and immediately sent to the terminal via a push notification service. The user receives this notification and has the opportunity to take necessary actions. The output is a notification message, and as a concrete action, the system-generated report is displayed in the terminal's notification center.
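
The notification message sent in Step 5 might be serialized as a small JSON document, as sketched below. The field names, the title text, and the function name `build_notification` are hypothetical. The actual payload format depends on the push notification service that is used, which this disclosure does not specify.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def build_notification(
    event_type: str,
    location: str,
    matched_person_id: Optional[str] = None,
    detected_at: Optional[datetime] = None,
) -> str:
    """Build a JSON notification body describing a detected anomaly."""
    detected_at = detected_at or datetime.now(timezone.utc)
    payload = {
        "title": "Anomaly detected",
        "event_type": event_type,                 # e.g. "intrusion"
        "location": location,                     # e.g. "yard"
        "matched_person_id": matched_person_id,   # None if no database match
        "detected_at": detected_at.isoformat(),
    }
    return json.dumps(payload, ensure_ascii=False)
```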

[0468] (Application Example 1)

[0469] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0470] Conventional security systems have difficulty detecting abnormal behavior or suspicious individuals in real time and accurately notifying users of such information. Furthermore, it is difficult for administrators located remotely to immediately grasp the situation. Additionally, even when an anomaly is detected, such systems cannot immediately issue appropriate warnings, resulting in ineffective responses.

[0471] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0472] In this invention, the server includes means for instantly analyzing video information received from a video acquisition device to detect abnormal behavior or suspicious persons, means for issuing a notification when the facial information of the detected person is identified by comparing it with an existing information set, and means for generating voice information using artificial intelligence to issue a warning. This enables administrators and personnel in remote locations to quickly grasp the situation as soon as an anomaly is detected and take appropriate action immediately.

[0473] A "video acquisition device" is a device capable of acquiring video information from the surroundings and transmitting it as digital data.

[0474] "Video information" refers to visual video data obtained from video acquisition devices, and useful information can be extracted by appropriately analyzing its content.

[0475] "Instant analysis" refers to a method of processing acquired information in real time and obtaining results quickly.

[0476] "Abnormal behavior" refers to actions that deviate from normal behavioral patterns and is identified as a situation that warrants attention.

[0477] A "suspicious person" refers to an individual who engages in unexpected or questionable behavior and should be viewed with caution from a crime prevention perspective.

[0478] "Detection" refers to the process of sensing or recognizing the presence or action of an object.

[0479] "Facial information" refers to identifiable data related to a person's face, which is used to identify individuals.

[0480] An "information set" refers to a collection of data that is an aggregate of existing databases or individual pieces of information.

[0481] "Notification" refers to the act of communicating detected information and informing relevant parties and systems.

[0482] "Artificial intelligence" refers to technology that enables machines to mimic human intellectual activity and perform actions with the aim of automatically executing specific functions.

[0483] "Voice information" refers to synthesized sound data and is used as a means of conveying information through hearing.

[0484] A "warning" refers to an action taken to alert someone to a dangerous or suspicious situation and draw their attention.

[0485] "Administrator" refers to a person or organization responsible for monitoring the system and managing it to ensure its proper operation.

[0486] "Law enforcement agencies" refer to organizations that oversee the enforcement of laws in order to maintain public safety.

[0487] "External devices" refer to additional equipment or terminals that work in conjunction with a system to perform information processing or communication.

[0488] A "user terminal" refers to a device that a user directly operates and uses to receive information, and includes smartphones and computers.

[0489] An "information output device" is a device used to visualize or make information audible, and refers to a device that emits sound or images.

[0490] In the system implementing this invention, the server first receives video information in real time from a video acquisition device. The video acquisition device acquires video information of the surrounding environment and transmits it to the server as a digital signal. The server processes this video information immediately using an advanced video analysis algorithm and has the function of detecting abnormal behavior and suspicious persons. The main software used here consists of video analysis libraries such as TensorFlow and OpenCV. If an anomaly is detected as a result of the analysis, the server sends that information to the user terminal as a push notification. A smartphone or similar device is used as the user terminal, and the user can receive the notification of an anomaly detection immediately.

[0491] If a suspicious person is detected, the server utilizes a facial recognition API such as Amazon Rekognition to compare the person's facial information with an existing set of information. Once the facial information comparison is complete, the server generates an appropriate warning based on the detection result. The warning voice is synthesized as natural-sounding audio using a generative AI model. This synthesized voice is then sent by the server to an external audio output device and broadcast as a warning to the surrounding area. For example, a speaker in the home might emit the voice to psychologically alert the suspicious person.

[0492] As a concrete example, if an unauthorized intrusion is detected in a home garden at night, the server will immediately recognize the anomaly and notify the user via their smartphone with a prompt message stating, "An anomaly has been detected. Please check the details." Upon receiving this notification, the user can check the situation via their smartphone and, if necessary, report it to law enforcement. This allows the user to respond to the situation efficiently and ensure their safety.

[0493] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0494] Step 1:

[0495] The server receives video information from the video acquisition device. The input is real-time digital video data transmitted from the video acquisition device, and the server uses this data to prepare for processing in the next step.

[0496] Step 2:

[0497] The server analyzes the received video information using video analysis libraries such as TensorFlow and OpenCV. In this step, the video is analyzed using a machine learning model trained on a dataset to detect abnormal behavior or suspicious individuals that deviate from normal behavioral patterns. The output is the result of detecting abnormal behavior.

[0498] Step 3:

[0499] Based on the detection results, the server compares the suspicious person's facial information with an existing facial recognition database using tools such as Amazon Rekognition. This process involves searching and determining matches based on the detected facial information as input. The output is whether or not a specific person is found as a result of the matching.

[0500] Step 4:

[0501] The server uses a generative AI model to generate audio information for when a warning should be issued. In this step, information indicating that an anomaly or suspicious person has been detected is used as input, and the process of converting text data into speech results in a synthesized warning voice as output.

[0502] Step 5:

[0503] The server transmits the generated audio information to an external audio output device to issue a warning to the surrounding area. Specifically, it plays the audio through a home speaker or external output system to alert people. In this process, the data is processed for transmission to the output device, and the output is played back as actual audio.

[0504] Step 6:

[0505] The server sends an anomaly detection notification to the user's terminal as a push notification. Using the detected anomaly information as input, the server notifies the user with the prompt message, "An anomaly has been detected. Please check the details." The output is the notification the user receives on their terminal.

[0506] Step 7:

[0507] The user checks the notification received on their smartphone and understands the situation. In this step, the user receives a push notification as input and then checks the log and video details based on it. In this process, the next action the user should take (e.g., contacting law enforcement) is determined. The output is the action taken based on the user's judgment.
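
The server-side part of the flow above (Steps 2 through 6) can be sketched as a single function that receives each processing stage as a callable. All names here (`PipelineResult`, `run_pipeline`, and the parameter names) are hypothetical, and the stages are placeholders. In an actual system they would wrap the video analysis, face matching, speech synthesis, audio output, and push notification components.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class PipelineResult:
    anomaly: bool
    matched_person: Optional[str] = None
    actions: List[str] = field(default_factory=list)


def run_pipeline(
    frame: Any,
    detect: Callable[[Any], bool],                # Step 2: abnormal behavior detection
    match_face: Callable[[Any], Optional[str]],   # Step 3: database matching
    synthesize: Callable[[str], bytes],           # Step 4: warning voice generation
    play: Callable[[bytes], None],                # Step 5: external audio output
    notify: Callable[[str], None],                # Step 6: push notification
) -> PipelineResult:
    """Run one frame through the stages; stop early if nothing is detected."""
    if not detect(frame):
        return PipelineResult(anomaly=False)
    person = match_face(frame)
    audio = synthesize("An anomaly has been detected.")
    play(audio)
    notify("An anomaly has been detected. Please check the details.")
    return PipelineResult(True, person, ["warning_played", "user_notified"])
```

Passing the stages in as callables keeps the control flow testable without cameras, speakers, or network services.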

[0508] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0509] This invention provides an advanced security system that can immediately respond to suspicious individuals and abnormal behavior by combining an emotion engine with a surveillance system. The server receives video data from surveillance cameras and analyzes it in real time to identify abnormal behavior or suspicious individuals. The server is equipped with an AI-powered facial recognition system, which identifies suspicious individuals by comparing detected facial images with a database. Furthermore, after detection, it generates a warning voice using generative AI and sends voice messages to the suspicious individuals.

[0510] The emotion engine analyzes the emotional state of detected individuals and users, enabling more flexible responses. Specifically, the server uses the emotion analysis results to optimize the content and tone of warning messages to suit the situation. For example, if an intruder is in the initial stages of an intrusion, a milder warning is issued to avoid unnecessary tension and prevent the situation from escalating. On the other hand, if stress or anxiety is detected in the user's emotional state, a more assertive warning and prompt notification to the administrator are issued.
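
The two rules described above can be written as a small decision function. The labels for the intrusion stage and the user's emotion, the `standard` fallback tone, and the priority given to the user's emotional state are all assumptions made for this sketch. The disclosure does not state how conflicting conditions are resolved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WarningPlan:
    tone: str                       # "mild", "standard", or "assertive"
    notify_admin_immediately: bool


def plan_warning(intrusion_stage: str, user_emotion: str) -> WarningPlan:
    """Choose the warning tone from the intrusion stage and the user's emotion."""
    # Assumed priority: the user's stress or anxiety overrides the mild warning.
    if user_emotion in {"stress", "anxiety"}:
        return WarningPlan(tone="assertive", notify_admin_immediately=True)
    if intrusion_stage == "initial":
        return WarningPlan(tone="mild", notify_admin_immediately=False)
    return WarningPlan(tone="standard", notify_admin_immediately=False)
```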

[0511] In this system, the terminal notifies the user, providing details of abnormal activity and the results of emotion analysis. Users can quickly understand the situation and take necessary actions through the terminal. Furthermore, because these notifications include information refined by the emotion engine, they help victims or administrators make appropriate decisions.

[0512] As a concrete example, consider a scenario where a suspicious person is detected in a residential parking lot. The server identifies the suspicious person through video processing and facial recognition, and analyzes their emotions. If signs of anxiety or tension are detected, the emotion engine adjusts the warning voice based on that information, issuing a warning in an appropriate tone. This information is then notified to the user via their terminal, allowing them to take appropriate action. In this way, the present invention enables advanced security measures in a variety of situations.

[0513] The following describes the processing flow.

[0514] Step 1:

[0515] The server continuously receives video streams from surveillance cameras. These video streams are processed as digital data and subdivided into individual frames.

[0516] Step 2:

[0517] The server uses AI algorithms to analyze the received video frames. This allows for the rapid detection of abnormal behavior patterns and suspicious individuals.

[0518] Step 3:

[0519] The server compares the detected person's facial image with the database. If the person is registered in the database as a suspicious individual, a specific warning flag is set.

[0520] Step 4:

[0521] The emotion engine is activated and evaluates the emotional state of the suspect and the user from the video and audio data. For example, it identifies feelings such as anxiety, tension, and anger.

[0522] Step 5:

[0523] The server generates the optimal warning message based on the analysis results of the emotion engine. Speech synthesis technology adjusts the tone and content according to the situation.

[0524] Step 6:

[0525] The terminal plays the generated warning audio through speakers at the site. This audio serves as a warning message to suspicious individuals, attempting to deter any further activity at the scene.

[0526] Step 7:

[0527] The server sends notifications to administrators and police containing data on the abnormal behavior and the emotion analysis. These notifications include video clips and the analyzed emotion information.

[0528] Step 8:

[0529] Users receive detailed notifications through their terminals. Emotion-based recommendations are displayed to help them make quick decisions.

[0530] (Example 2)

[0531] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0532] Modern security systems face the challenge of lacking the flexibility to quickly detect suspicious individuals or abnormal behavior and respond appropriately. In particular, there is a need to prevent escalation of situations and achieve more effective security measures by providing warnings and notifications that take emotional states into consideration.

[0533] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0534] In this invention, the server includes means for immediately processing image information acquired from a video acquisition device to identify abnormal behavior or suspicious individuals; means for matching the facial information of the identified individual with a prior set of information to issue an alarm if identified; and means for generating voice instructions using artificial intelligence to issue warnings to suspicious individuals. This enables a rapid and flexible response to suspicious individuals and abnormal behavior, tailored to their emotional state.

[0535] A "video acquisition device" refers to equipment used to capture image information and process or analyze that data.

[0536] "Image information" refers to visual data obtained from a video acquisition device, and includes the visual characteristics of the object or individual being analyzed.

[0537] "Abnormal behavior" refers to actions or behaviors that deviate from normal behavioral patterns and may lead to suspicious situations.

[0538] A "suspicious individual" refers to an individual who exhibits unusual or suspicious behavior, as detected by a security system.

[0539] "Facial information" refers to data containing facial features used for individual identification, and is usually stored as image data.

[0540] "Prior information sets" refer to reference datasets that have been collected or registered in the past and are used as criteria for identification.

[0541] An "alarm" refers to a notification method used to warn or alert people to detected anomalies or suspicious situations.

[0542] "Generative artificial intelligence" refers to a system that uses artificial intelligence technology to create newly generated content and information.

[0543] "Voice instructions" refer to voice messages generated using speech synthesis technology to give instructions or warnings to a target.

[0544] This invention provides an advanced security system that enables immediate response to suspicious individuals and abnormal behavior through the cooperation of servers, terminals, and users. This system is implemented using the following combination of hardware and software.

[0545] The server acquires real-time image information transmitted from the video acquisition device. Using image processing libraries such as OpenCV, the server analyzes this image information and detects motion and facial features. The detected facial information is compared with a pre-collected database, and a face recognition model using TensorFlow identifies suspicious individuals.

[0546] Furthermore, the server uses generative artificial intelligence to generate warning messages for suspicious individuals. Specifically, it utilizes natural language processing models to create warning texts, which are then converted into voice instructions using speech synthesis technology. For example, a warning message such as "There is a suspicious person in the parking lot. Suspicious behavior has been detected." is generated. This message is optimized according to the emotional state of the suspicious individual detected by an emotion analysis device. Amazon Comprehend, among others, is used for emotion analysis.

[0547] The terminal transmits the generated voice instructions to the suspicious individual via its speaker. Simultaneously, it provides the user with a notification containing information about the suspicious person and the emotion analysis results. Based on the detailed information received through the terminal, the user can quickly assess the situation and take appropriate action, such as reporting to the police.

[0548] A concrete example of this system is intruder detection in a residential parking lot. The server analyzes video footage of the parking lot, identifies the intruder, and analyzes their emotions. An example of a prompt message might be, "What information is needed to generate a warning message if an intruder enters the parking lot?" In this way, the present invention enhances security measures in various environments.

[0549] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0550] Step 1:

[0551] The server receives image information from the video acquisition device. This input image information is video data consisting of multiple frames. The server uses OpenCV to extract frames from this video and executes a motion detection algorithm. Specifically, it uses methods such as background subtraction and optical flow analysis to identify moving parts. The output of this process is the frame image in which motion was detected.
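
The idea behind background subtraction can be illustrated without OpenCV by comparing each frame with a static background image pixel by pixel. The sketch below is a simplified stand-in for the library-based processing described above, not the OpenCV implementation. The frame representation as nested lists of grayscale values, the function names, and both thresholds are assumptions.

```python
from typing import List, Sequence

Frame = List[List[int]]  # rows of grayscale pixel values, 0-255


def motion_ratio(background: Frame, frame: Frame, pixel_threshold: int = 25) -> float:
    """Fraction of pixels whose brightness differs from the background."""
    changed = 0
    total = 0
    for bg_row, row in zip(background, frame):
        for bg_px, px in zip(bg_row, row):
            total += 1
            if abs(px - bg_px) > pixel_threshold:
                changed += 1
    return changed / total if total else 0.0


def frames_with_motion(
    background: Frame,
    frames: Sequence[Frame],
    min_ratio: float = 0.1,  # assumed minimum share of changed pixels
) -> List[int]:
    """Return the indices of frames in which motion was detected."""
    return [i for i, f in enumerate(frames) if motion_ratio(background, f) >= min_ratio]
```

A production system would typically update the background model over time to cope with lighting changes, which this static comparison does not do.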

[0552] Step 2:

[0553] The server detects face regions from frame images where motion has been detected. A face detection model using TensorFlow extracts regions containing facial features. This face region is then used as input data and compared against known faces in a database by a pre-trained face recognition model. The output is the face information of the identified suspicious individual.

[0554] Step 3:

[0555] The server performs emotion analysis on the identified facial information. This uses Amazon Comprehend to analyze emotional states and detect emotions such as anxiety and anger. The input data is the identified facial information, and the output data includes emotional states. This allows the server to understand the emotional state of the suspicious person.

[0556] Step 4:

[0557] The server utilizes a generative artificial intelligence model to generate warning messages for suspicious individuals. Specifically, it inputs a prompt into a natural language processing model to create a warning message such as, "There is a suspicious person in the parking lot." The output warning message is then converted into a voice command using speech synthesis technology.

[0558] Step 5:

[0559] The terminal transmits the generated voice instructions through its speaker. Simultaneously, the server sends a notification about the abnormal situation to the terminal and provides it to the user. The notification includes the results of sentiment analysis and suggests the next action the user should take. As output, the user receives an audio warning and a detailed explanation of the situation.

[0560] Step 6:

[0561] The user acts quickly based on the information provided by the device. This information includes transmitted voice instructions and device notifications. The user can use this information to take countermeasures against suspicious individuals. User actions include, for example, contacting the police or taking additional security measures.
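
As an illustration of the motion detection in Step 1 above, the following is a minimal sketch of frame differencing, a simple form of background subtraction. It is not the OpenCV-based processing itself: frames are assumed to be small grayscale grids held as Python lists, and the names motion_ratio and frames_with_motion and both threshold values are hypothetical choices made for this example.

```python
from typing import List

Frame = List[List[int]]  # grayscale frame: rows of 0-255 pixel intensities


def motion_ratio(prev: Frame, curr: Frame, pixel_threshold: int = 25) -> float:
    """Return the fraction of pixels whose intensity changed by more than pixel_threshold."""
    total = 0
    changed = 0
    for prev_row, curr_row in zip(prev, curr):
        for p, c in zip(prev_row, curr_row):
            total += 1
            if abs(c - p) > pixel_threshold:
                changed += 1
    return changed / total if total else 0.0


def frames_with_motion(frames: List[Frame], min_ratio: float = 0.05) -> List[int]:
    """Return the indices of frames that differ noticeably from the preceding frame."""
    return [
        i
        for i in range(1, len(frames))
        if motion_ratio(frames[i - 1], frames[i]) >= min_ratio
    ]
```

The indices returned by frames_with_motion correspond to the "frame image in which motion was detected" that Step 1 passes on to face detection in Step 2.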

[0562] (Application Example 2)

[0563] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0564] In recent years, security challenges in society have increased, with a particular need for early detection and appropriate response to suspicious individuals and abnormal behavior. However, current security systems struggle to respond flexibly, taking into account the emotional state of suspicious individuals, which can lead to false alarms and unnecessary tension. Therefore, there is a need to provide a system that improves detection accuracy and enables the generation and notification of warning messages based on appropriate situational judgment.

[0565] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0566] In this invention, the server includes means for analyzing video data received from a video capture device in real time to detect abnormal behavior or suspicious persons; means for matching the facial image of the detected person with an existing database and issuing a warning if identified; means for generating voice messages using artificial intelligence to warn suspicious persons; means for analyzing the emotional state of the detected person using an emotion analysis engine and optimizing the content and tone of the warning message; and means for providing notifications of abnormal activity based on the emotion analysis results to personal portable devices. This enables users to take appropriate action based on the emotional state of suspicious persons.

[0567] A "video capture device" is a device, such as a video camera or surveillance camera, that acquires and transmits video data in real time.

[0568] "Real-time analysis" refers to a method of processing input data immediately and outputting results without delay.

[0569] "Abnormal behavior" refers to movements or actions that deviate from the norm and that should be taken into account in crime prevention.

[0570] A "suspicious person" refers to an individual who does not fit in with the normal environment or circumstances, or who is considered suspicious.

[0571] A "face image" is visual data of a person's face captured by a digital camera or other image acquisition device.

[0572] A "database" is a digital archive organized to efficiently store and search large amounts of information.

[0573] A "voice message" is a recorded or synthesized message that is produced to convey information through sound.

[0574] An "emotion analysis engine" is a system that automatically determines a person's emotional state based on data obtained from video and audio.

[0575] A "personal portable device" refers to a digital device that a user personally owns and can carry with them, such as a smartphone or tablet.

[0576] To implement this invention, the server uses video data received from a high-resolution video capture device. The server uses the open-source image processing library "OpenCV" to perform video analysis. This makes it possible to detect abnormal behavior and suspicious individuals in real time.

[0577] Next, the server inputs the detected facial image into a facial recognition system (e.g., a facial recognition API) and compares it against an existing database. If a known suspicious person is identified through the comparison, the server generates a voice message using generative artificial intelligence (AI). A speech synthesis service (e.g., a text-to-speech API) is used to generate the voice message.

[0578] In addition, the server uses an emotion analysis engine to analyze the emotional state of the detected person. Based on this analysis, the content and tone of the warning message are optimized. For example, if a milder tone of warning is needed, unnecessary escalation can be prevented.

[0579] The server then sends a notification to the user's personal mobile device. This notification includes information about abnormal activity and sentiment analysis results, which will be displayed on the user's smartphone or smart glasses.
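
As one possible illustration of such a notification, the sketch below packs the abnormal-activity information and the sentiment analysis result into a JSON payload. The field names, the title string, and the function to_push_payload are assumptions made for this example; the actual notification format is not specified here.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class AnomalyNotification:
    location: str  # where the activity was observed (hypothetical field)
    activity: str  # short description of the abnormal activity
    emotion: str   # result reported by the emotion analysis engine


def to_push_payload(notification: AnomalyNotification) -> str:
    """Serialize the notification as JSON text for delivery to the user's device."""
    return json.dumps(
        {"title": "Abnormal activity detected", "body": asdict(notification)},
        sort_keys=True,
    )
```

A device such as a smartphone or smart glasses would parse this payload and display the body fields to the user.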

[0580] As a concrete example, consider a scenario where a suspicious person is detected at the entrance of a house. An example of a prompt message would be: "A suspicious person has been detected by the camera in front of the entrance. The person appears somewhat nervous. Generate a warning message in a calm tone and send a notification to the user." In this way, the user can respond to the situation appropriately and quickly.
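
The following minimal sketch shows how a prompt of this form could be assembled from the detected location and emotional state. The rule that maps emotions to a tone (calm for nervous or anxious states, firm otherwise) and the names choose_tone and build_warning_prompt are assumptions made for illustration, not part of the described system.

```python
# Emotional states for which a milder warning is assumed to be appropriate.
MILD_TONE_EMOTIONS = ("nervous", "anxious", "stressed")


def choose_tone(emotion: str) -> str:
    """Pick a warning tone from the detected emotion (hypothetical rule)."""
    lowered = emotion.lower()
    if any(word in lowered for word in MILD_TONE_EMOTIONS):
        return "calm"
    return "firm"


def build_warning_prompt(location: str, emotion: str) -> str:
    """Assemble a prompt for the generative AI in the style of the example above."""
    tone = choose_tone(emotion)
    return (
        f"A suspicious person has been detected by the camera {location}. "
        f"The person appears {emotion}. "
        f"Generate a warning message in a {tone} tone and send a notification to the user."
    )
```

Calling build_warning_prompt("in front of the entrance", "somewhat nervous") reproduces the example prompt given above.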

[0581] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0582] Step 1:

[0583] The server receives video data in real time from the video capture device. Based on this video data, it performs motion analysis using OpenCV to detect unusual movements. The output is the information indicating the detection of abnormal behavior.

[0584] Step 2:

[0585] The server extracts faces from the detected image data and inputs them into the facial recognition system. The facial images are compared against an existing database to identify specific suspicious individuals. This matching result is then output.

[0586] Step 3:

[0587] The server generates a voice message using artificial intelligence based on the above matching results. The input information includes the identity of the suspicious person and the circumstances surrounding them. A text-to-speech API is used to generate voice data with an appropriate warning tone. The voice data is the output.

[0588] Step 4:

[0589] The server utilizes an emotion analysis engine to analyze a person's emotional state from video data. The input includes video data, and the output includes the detected emotional state (e.g., stress, anxiety). Based on this analysis, the content and tone of the warning message generated in the previous step are adjusted.

[0590] Step 5:

[0591] Finally, the server sends the refined warning message to the device. The user's personal mobile device receives this notification and displays it on the screen. The notification includes information about the suspicious person and the sentiment analysis results, prompting the user to act quickly.
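
The database comparison in Step 2 of the flow above can be illustrated with the following minimal sketch. It assumes that each face has already been converted into a numeric feature vector by some face recognition model, and it compares vectors by cosine similarity. The function names, the dictionary-based database, and the 0.9 threshold are hypothetical choices for this example.

```python
import math
from typing import Dict, List, Optional, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Return the cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_face(
    embedding: List[float],
    known_faces: Dict[str, List[float]],
    threshold: float = 0.9,
) -> Optional[Tuple[str, float]]:
    """Return (person_id, score) for the best match at or above threshold, else None."""
    best_id: Optional[str] = None
    best_score = threshold
    for person_id, known in known_faces.items():
        score = cosine_similarity(embedding, known)
        if score >= best_score:
            best_id, best_score = person_id, score
    return (best_id, best_score) if best_id is not None else None
```

A result of None corresponds to a person who is not registered in the database, in which case no matching result identifying a specific suspicious individual is output.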

[0592] The specific processing unit 290 transmits the result of the specific processing to the headset terminal 314. In the headset terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0593] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives prompts containing instructions together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0594] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset terminal 314.

[0595] [Fourth Embodiment]

[0596] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.

[0597] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.

[0598] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0599] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.

[0600] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0601] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0602] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0603] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.

[0604] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0605] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0606] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0607] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0608] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0609] This invention relates to a highly functional security system that analyzes surveillance footage in real time to identify, warn, and notify about abnormal behavior and suspicious individuals. In this system, video data is collected from various locations where surveillance cameras are installed, and a server is responsible for processing this data. The server utilizes artificial intelligence algorithms for video analysis to determine whether the behavior deviates from normal patterns. The server also performs facial recognition in real time and compares the results with an existing database of suspicious individuals.

[0610] For example, in an unmanned store, a server monitors the movements of customers in real time via surveillance cameras. If a customer takes an unusually long time to return an item to its original place, or if they are detected moving back and forth unnaturally within the store, this is treated as abnormal behavior. The server then broadcasts a warning sound to the customer through the store's speakers to deter further actions. Additionally, the facial data of any detected suspicious individuals is cross-referenced with a database sent to the server.
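
The two cues mentioned in this example, unnatural back-and-forth movement and an unusually long handling time, could be checked with simple rules such as those in the following sketch. It assumes that the customer's horizontal position has already been tracked frame by frame; the function names and all threshold values are hypothetical.

```python
from typing import Sequence


def count_direction_reversals(x_positions: Sequence[float], min_step: float = 0.5) -> int:
    """Count how often the tracked person switches between moving left and right."""
    reversals = 0
    last_direction = 0
    for prev, curr in zip(x_positions, x_positions[1:]):
        delta = curr - prev
        if abs(delta) < min_step:
            continue  # ignore jitter smaller than min_step
        direction = 1 if delta > 0 else -1
        if last_direction and direction != last_direction:
            reversals += 1
        last_direction = direction
    return reversals


def is_abnormal(
    x_positions: Sequence[float],
    handling_seconds: float,
    max_reversals: int = 3,
    max_handling_seconds: float = 60.0,
) -> bool:
    """Flag behavior with many direction reversals or a long item-handling time."""
    too_many_reversals = count_direction_reversals(x_positions) > max_reversals
    return too_many_reversals or handling_seconds > max_handling_seconds
```

A machine learning model, as described elsewhere in this specification, could replace these fixed thresholds; the rules here only make the two cues concrete.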

[0611] Even in the case of private residences, the server monitors the yard and parking lot, immediately detecting intruders when they enter the property. Once detected, the server uses AI to generate sounds that suggest a human presence or normal household noises, giving the intruder the impression that someone is home. Users can receive notifications on their devices to get information about the current status of the security system and any anomalies detected.

[0612] Through these embodiments, the present invention can contribute to crime prevention and provide a safer living environment.

[0613] The following describes the processing flow.

[0614] Step 1:

[0615] The server continuously receives video streams from surveillance cameras. The video is captured in real time for automated processing and divided into frames.

[0616] Step 2:

[0617] The server preprocesses each frame to remove unnecessary noise. This process includes adjusting the resolution and sharpening the image, which improves the accuracy of the analysis.

[0618] Step 3:

[0619] The server uses artificial intelligence algorithms to detect abnormal behavior and suspicious movements from pre-processed frames. This analysis employs machine learning models to evaluate human movement and patterns.

[0620] Step 4:

[0621] The server extracts the facial image of the detected person and compares it against an existing database. This database contains pre-registered information on specific suspicious individuals.

[0622] Step 5:

[0623] If a suspicious person is identified, the server immediately generates an alarm sound. This sound is played through speakers in the store or home to warn the suspicious person and deter criminal activity.

[0624] Step 6:

[0625] The server issues a warning and simultaneously notifies administrators and the police. The notification includes details of the detected abnormal behavior and profile information of the suspicious person.

[0626] Step 7:

[0627] The device sends real-time notifications to the user. The user can check the details on their device and take prompt action, such as reporting to the police if necessary.
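
The order of Steps 2 to 7 above can be summarized in the following minimal sketch. The preprocessing, behavior detection, and face matching stages are passed in as functions so that the sketch stays independent of any particular library. The names Outcome and process_frame are hypothetical, and the choice to notify the user even when no registered person is matched is an assumption of this example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class Outcome:
    alarm_played: bool = False
    notified: List[str] = field(default_factory=list)


def process_frame(
    frame: Any,
    preprocess: Callable[[Any], Any],
    detect_abnormal: Callable[[Any], bool],
    match_face: Callable[[Any], Optional[str]],
) -> Outcome:
    """Run one frame through the flow and report which actions were triggered."""
    outcome = Outcome()
    clean = preprocess(frame)              # Step 2: noise removal, resolution adjustment
    if not detect_abnormal(clean):         # Step 3: abnormal behavior detection
        return outcome
    person_id = match_face(clean)          # Step 4: comparison with the database
    if person_id is not None:
        outcome.alarm_played = True        # Step 5: alarm sound through the speakers
        outcome.notified += ["administrator", "police"]  # Step 6
    outcome.notified.append("user")        # Step 7 (assumed for any detected anomaly)
    return outcome
```

For instance, process_frame(frame, lambda f: f, lambda f: True, lambda f: None) plays no alarm and notifies only the user.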

[0628] (Example 1)

[0629] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0630] In recent years, with the increase in criminal activity and intrusions by suspicious individuals, the importance of security systems has grown. However, conventional systems sometimes have difficulty detecting abnormal behavior in real time and taking appropriate action, and further technological advancements are needed. This invention aims to solve these problems and enable more accurate anomaly detection and rapid response.

[0631] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0632] In this invention, the server includes means for analyzing video information received from a video input device in real time to detect abnormal behavior or suspicious persons, means for matching the facial features of the detected person with an existing information aggregate to issue a warning if identified, and means for generating voice information using generative artificial intelligence to warn intruders. This makes it possible to immediately detect abnormal behavior and issue appropriate warnings, thereby preventing crimes from occurring.

[0633] A "video input device" refers to a device that acquires visual information and outputs it as digital data, such as a surveillance camera or sensor camera.

[0634] "Video information" refers to image and video data acquired from video input devices, which are used as the subject of analysis.

[0635] "Real-time analysis" refers to a processing method that performs data processing immediately after it is acquired and outputs the results.

[0636] "Abnormal behavior" refers to actions or movements that deviate from normal behavioral patterns or are unnatural, and that require vigilance from a security standpoint.

[0637] A "suspicious person" refers to an individual deemed suspicious from a crime prevention standpoint and requiring monitoring and response.

[0638] "Facial features" refer to the structural characteristics and patterns of a face that are extracted for facial recognition, and are data used to identify an individual.

[0639] An "information aggregate" refers to a collection of data gathered for a specific purpose, such as information on suspicious individuals or facial recognition data.

[0640] "Generative artificial intelligence" is a type of artificial intelligence that has the ability to generate new data or content based on input data.

[0641] "Audio information" refers to data that represents an audible message or warning, and is played back through a speaker.

[0642] The present invention is a security system that analyzes video information transmitted from a video input device in real time within a surveillance system, detects abnormal behavior and suspicious individuals, and enables a swift and accurate response to these.

[0643] The main components of this system are a server, user terminals, and video input devices, which work together to provide a secure environment. The server uses a computing device equipped with an NVIDIA GPU to process complex video data in real time. The server utilizes software libraries such as OpenCV and TensorFlow to run machine learning models that classify the movement of people in the video and determine whether it is abnormal. It also uses Amazon Rekognition for face recognition processing and quickly performs database matching with existing data collections.

[0644] When abnormal behavior is detected, the server uses the Azure Speech API to generate speech information and send a warning sound to a designated output device. Furthermore, it can utilize generative artificial intelligence to output ambient sounds to give intruders the impression that someone is inside the home. Using OpenAI's generative model, it generates natural conversations and ambient sounds. For example, when an intruder is detected, it can generate a voice message saying, "It's time to prepare dinner."

[0645] Users can receive push notifications on their devices when an anomaly is detected. The interface for this is provided through a standard smartphone application, allowing users to monitor the system status in real time and take appropriate action as needed.

[0646] Examples of prompt messages include, "Generate an announcement voice to be used when a customer exhibits suspicious behavior," and "Generate a voice that gives a potential intruder the impression that someone is home."
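
As a small illustration, the two prompt messages above could be selected according to the monitored site, as in the following sketch. The scenario keys "store" and "home" and the function prompt_for are hypothetical names introduced for this example.

```python
PROMPTS = {
    "store": "Generate an announcement voice to be used when a customer exhibits suspicious behavior.",
    "home": "Generate a voice that gives a potential intruder the impression that someone is home.",
}


def prompt_for(scenario: str) -> str:
    """Return the prompt message for the given scenario, or raise ValueError if unknown."""
    try:
        return PROMPTS[scenario]
    except KeyError:
        raise ValueError(f"unknown scenario: {scenario!r}") from None
```

The returned text would then be input to the generative AI model as the prompt.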

[0647] With a system configured in this manner, the present invention can contribute to crime prevention and provide a safer and more secure living environment.

[0648] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0649] Step 1:

[0650] The server receives video information in real time from the video input device. A video stream is provided as input, and the server divides it into frames and loads them into memory. Specifically, it acquires data via a network connection and prepares it for analysis.

[0651] Step 2:

[0652] The server performs video analysis on the video information for each frame loaded into memory. The input frames are first preprocessed, undergoing noise reduction and resolution adjustment. Next, a machine learning model using TensorFlow performs data calculations to detect abnormal behavior and suspicious individuals. The output is the coordinates of detected individuals and information on changes in their behavior patterns. Specifically, the AI model compares the detected behavior patterns with normal behavior patterns and identifies behaviors that are judged to be abnormal.

[0653] Step 3:

[0654] The server analyzes the faces of detected individuals using Amazon Rekognition. Face images from video frames are used as input, and feature points are extracted. Face features are compared with an existing database of information aggregates, and if a match is found, a risk assessment is performed. The output is information on whether a particular person has been identified as a suspicious person. Specifically, the face recognition algorithm analyzes the facial features and performs a database search.

[0655] Step 4:

[0656] The server generates audio information using generative artificial intelligence when it detects abnormal behavior or suspicious individuals. A prompt is input to the generative AI model, such as "Generate a warning audio." This prompts the server to generate a warning audio, which is then output to speakers via the network. Specifically, a speech synthesis engine converts the input text into natural-sounding speech.

[0657] Step 5:

[0658] The server notifies the user's terminal of detected suspicious individuals or anomalies. Anomaly information is provided as input and immediately sent to the terminal via a push notification service. The user receives this notification and has the opportunity to take necessary actions. The output is a notification message, and as a concrete action, the system-generated report is displayed in the terminal's notification center.
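
The comparison with normal behavior patterns in Step 2 of the flow above can be illustrated with the following minimal sketch. It is a simple statistical stand-in, not the TensorFlow model itself: a single measured feature (for example, walking speed) is compared with samples of normal behavior, and a large deviation is treated as abnormal. The function names and the threshold of 3.0 standard deviations are assumptions made for this example.

```python
from statistics import mean, pstdev
from typing import Sequence


def deviation_score(value: float, normal_samples: Sequence[float]) -> float:
    """Return how many standard deviations value lies from the normal mean."""
    mu = mean(normal_samples)
    sigma = pstdev(normal_samples)
    if sigma == 0:
        return 0.0 if value == mu else float("inf")
    return abs(value - mu) / sigma


def is_abnormal_behavior(
    value: float, normal_samples: Sequence[float], threshold: float = 3.0
) -> bool:
    """Flag the observation when it deviates from normal behavior by more than threshold."""
    return deviation_score(value, normal_samples) > threshold
```

In practice several features would be evaluated together by a trained model; the single-feature score only shows the idea of measuring deviation from a normal pattern.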

[0659] (Application Example 1)

[0660] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0661] Conventional security systems have difficulty detecting abnormal behavior or suspicious individuals in real time, and accurately notifying users of such information. Furthermore, it is difficult for administrators located remotely to immediately grasp the situation. Additionally, even when an anomaly is detected, they cannot immediately issue appropriate warnings, resulting in ineffective responses.

[0662] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0663] In this invention, the server includes means for instantly analyzing video information received from a video acquisition device to detect abnormal behavior or suspicious persons, means for issuing a notification when the facial information of the detected person is identified by comparing it with an existing information set, and means for generating voice information using artificial intelligence to issue a warning. This enables administrators and personnel in remote locations to quickly grasp the situation as soon as an anomaly is detected and take appropriate action immediately.

[0664] A "video acquisition device" is a device capable of acquiring video information from the surroundings and transmitting it as digital data.

[0665] "Video information" refers to visual video data obtained from video acquisition devices, and useful information can be extracted by appropriately analyzing its content.

[0666] "Instant analysis" refers to a method of processing acquired information in real time and obtaining results quickly.

[0667] "Abnormal behavior" refers to actions that deviate from normal behavioral patterns and is identified as a situation that warrants attention.

[0668] A "suspicious person" refers to an individual who engages in unexpected or questionable behavior and should be viewed with caution from a crime prevention perspective.

[0669] "Detection" refers to the process of sensing or recognizing the presence or action of an object.

[0670] "Facial information" refers to identifiable data related to a person's face, which is used to identify individuals.

[0671] An "information set" refers to a collection of data that is an aggregate of existing databases or individual pieces of information.

[0672] "Notification" refers to the act of communicating detected information and informing relevant parties and systems.

[0673] "Artificial intelligence" refers to technology that enables machines to mimic human intellectual activity and perform actions with the aim of automatically executing specific functions.

[0674] "Auditory information" refers to synthesized sound data and is used as a means of conveying information through hearing.

[0675] A "warning" refers to an action taken to alert someone to a dangerous or suspicious situation and draw their attention.

[0676] "Administrator" refers to a person or organization responsible for monitoring the system and managing it to ensure its proper operation.

[0677] "Law enforcement agencies" refer to organizations that oversee the enforcement of laws in order to maintain public safety.

[0678] "External devices" refer to additional equipment or terminals that work in conjunction with a system to perform information processing or communication.

[0679] A "user terminal" refers to a device that a user directly operates and uses to receive information, and includes smartphones and computers.

[0680] An "information output device" is a device used to visualize or make information audible, and refers to a device that emits sound or images.

[0681] In the system implementing this invention, the server first receives video information in real time from a video acquisition device. The video acquisition device acquires video information of the surrounding environment and transmits it to the server as a digital signal. The server processes this video information immediately using an advanced video analysis algorithm and has the function of detecting abnormal behavior and suspicious persons. The main software used here is video analysis libraries such as TensorFlow and OpenCV. If an anomaly is detected as a result of the analysis, the server sends that information to the user terminal as a push notification. A smartphone or similar device is used as the user terminal, and the user can receive the notification of an anomaly detection immediately.

[0682] If a suspicious person is detected, the server utilizes a facial recognition API such as Amazon Rekognition to compare the person's facial information with an existing set of information. Once the facial information comparison is complete, the server generates an appropriate warning based on the detection result. The warning voice is synthesized as natural-sounding audio using a generative AI model. The server then sends this synthesized voice to an external audio output device, which broadcasts it as a warning to the surrounding area. For example, a speaker in the home might emit the voice to psychologically deter the suspicious person.

[0683] As a concrete example, if an unauthorized intrusion is detected in a home garden at night, the server will immediately recognize the anomaly and notify the user via their smartphone with a prompt message stating, "An anomaly has been detected. Please check the details." Upon receiving this notification, the user can check the situation via their smartphone and, if necessary, report it to law enforcement. This allows the user to respond to the situation efficiently and ensure their safety.

[0684] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0685] Step 1:

[0686] The server receives video information from the video acquisition device. The input is real-time digital video data transmitted from the video acquisition device, and the server uses this data to prepare for processing in the next step.

[0687] Step 2:

[0688] The server analyzes the received video information using video analysis libraries such as TensorFlow and OpenCV. In this step, the video is analyzed using a machine learning model trained on a dataset to detect abnormal behavior or suspicious individuals that deviate from normal behavioral patterns. The output is the result of detecting abnormal behavior.

[0689] Step 3:

[0690] Based on the detection results, the server compares the suspicious person's facial information with an existing facial recognition database using tools such as Amazon Rekognition. This process involves searching and determining matches based on the detected facial information as input. The output is whether or not a specific person is found as a result of the matching.

[0691] Step 4:

[0692] The server uses a generative AI model to generate audio information for when a warning should be issued. In this step, information indicating that an anomaly or suspicious person has been detected is used as input, and the process of converting text data into speech results in a synthesized warning voice as output.

[0693] Step 5:

[0694] The server transmits the generated audio information to an external audio output device to issue a warning to the surrounding area. Specifically, it plays the audio through a home speaker or external output system to alert people. In this process, the data is processed for transmission to the output device, and the output is played back as actual audio.

[0695] Step 6:

[0696] The server sends an anomaly detection notification to the user's terminal as a push notification. Using the detected anomaly information as input, the server notifies the user with the prompt message, "An anomaly has been detected. Please check the details." The output is the notification the user receives on their terminal.

[0697] Step 7:

[0698] The user checks the notification received on their smartphone and understands the situation. In this step, the user receives a push notification as input and then checks the log and video details based on it. In this process, the next action the user should take (e.g., contacting law enforcement) is determined. The output is the action taken based on the user's judgment.
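
The server-side portion of the steps above (Steps 2 to 6) can be sketched as a single routine. This is a minimal illustration under stated assumptions, not the implementation of this disclosure: the names `Detection` and `handle_frame` and the injected callables (`detect`, `match_face`, `synthesize`, `play`, `notify`) are hypothetical stand-ins for the video analysis library, the facial recognition API, the generative AI model, the external audio output device, and the push notification service. The spoken warning text is a placeholder; only the push notification wording comes from the description.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Detection:
    """Result of Step 2 for one frame (hypothetical type)."""
    abnormal: bool
    face: Optional[bytes] = None  # cropped face image, if one was found


def handle_frame(
    frame: bytes,
    detect: Callable[[bytes], Detection],
    match_face: Callable[[bytes], Optional[str]],
    synthesize: Callable[[str], bytes],
    play: Callable[[bytes], None],
    notify: Callable[[str], None],
) -> bool:
    """Run Steps 2 to 6 on one received frame; return True if a warning was issued."""
    detection = detect(frame)  # Step 2: analyze the video
    if not detection.abnormal:
        return False
    # Step 3: compare the face with the existing set of information.
    person = match_face(detection.face) if detection.face is not None else None
    # Steps 4 and 5: synthesize the warning voice and play it on site.
    # Placeholder wording; a real system might vary it depending on `person`.
    play(synthesize("Warning: you are being recorded."))
    # Step 6: push notification, using the message quoted in the description.
    notify("An anomaly has been detected. Please check the details.")
    return True
```

Passing the external services in as callables keeps the control flow testable without a camera, a speaker, or network access.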

[0699] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0700] This invention provides an advanced security system that can immediately respond to suspicious individuals and abnormal behavior by combining an emotion engine with a surveillance system. The server receives video data from surveillance cameras and analyzes it in real time to identify abnormal behavior or suspicious individuals. The server is equipped with an AI-powered facial recognition system, which identifies suspicious individuals by comparing detected facial images with a database. Furthermore, after detection, it generates warning voices using generative AI and sends voice messages to the suspicious individuals.

[0701] The emotion engine analyzes the emotional state of detected individuals and users, enabling more flexible responses. Specifically, the server uses the emotion analysis results to optimize the content and tone of warning messages to suit the situation. For example, if an intruder is in the initial stages of an intrusion, a milder warning is issued to avoid unnecessary tension and prevent the situation from escalating. On the other hand, if stress or anxiety is detected in the user's emotional state, a more assertive warning and prompt notification to the administrator are issued.
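
The tone adjustment described above can be sketched as a small decision function. This is only an illustration under stated assumptions: `WarningPlan`, `plan_warning`, the stage label `"initial"`, and the emotion labels are hypothetical, and the precedence between the two rules (the user's stress or anxiety overrides the early-stage rule) and the default case are choices made for this sketch, not something the description specifies.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class WarningPlan:
    tone: str               # "mild" or "assertive"
    notify_admin_now: bool  # whether to notify the administrator promptly


def plan_warning(intrusion_stage: str, user_emotions: Set[str]) -> WarningPlan:
    """Choose a warning tone from the emotion analysis results."""
    # Stress or anxiety in the user: more assertive warning, prompt notification.
    if user_emotions & {"stress", "anxiety"}:
        return WarningPlan(tone="assertive", notify_admin_now=True)
    # Early stage of an intrusion: milder warning to avoid escalation.
    if intrusion_stage == "initial":
        return WarningPlan(tone="mild", notify_admin_now=False)
    # Assumption: the description leaves other cases open; default to assertive.
    return WarningPlan(tone="assertive", notify_admin_now=False)
```

For example, `plan_warning("initial", set())` selects the mild tone, while `plan_warning("initial", {"anxiety"})` selects the assertive tone with prompt notification.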

[0702] In this system, the terminal notifies the user, providing details of abnormal activity and the results of sentiment analysis. Users can quickly understand the situation and take necessary actions through the terminal. Furthermore, because these notifications include information refined by the emotion engine, they help victims or administrators make appropriate decisions.

[0703] As a concrete example, consider a scenario where a suspicious person is detected in a residential parking lot. The server identifies the suspicious person through video processing and facial recognition, and analyzes their emotions. If signs of anxiety or tension are detected, the emotion engine adjusts the warning voice based on that information, issuing a warning in an appropriate tone. The user is then notified of this information via their terminal, allowing them to take appropriate action. In this way, the present invention enables advanced security measures in a variety of situations.

[0704] The following describes the processing flow.

[0705] Step 1:

[0706] The server continuously receives video streams from surveillance cameras. These video streams are processed as digital data and subdivided into individual frames.

[0707] Step 2:

[0708] The server uses AI algorithms to analyze the received video frames. This allows for the rapid detection of abnormal behavior patterns and suspicious individuals.

[0709] Step 3:

[0710] The server compares the detected person's facial image with the database. If the person is registered in the database as a suspicious individual, a specific warning flag is set.

[0711] Step 4:

[0712] The emotion engine is activated and evaluates the emotional state of the suspect and the user from the video and audio data. For example, it identifies feelings such as anxiety, tension, and anger.

[0713] Step 5:

[0714] The server generates the optimal warning message based on the analysis results of the emotion engine. Speech synthesis technology adjusts the tone and content according to the situation.

[0715] Step 6:

[0716] The device plays the generated warning audio through speakers at the site. This audio serves as a warning message to suspicious individuals, attempting to deter any further activity at the scene.

[0717] Step 7:

[0718] The server sends notifications to administrators and police containing data on abnormal behavior and sentiment analysis. These notifications include video clips and analyzed sentiment information.

[0719] Step 8:

[0720] Users receive detailed notifications through their devices. Sentiment-based recommendations are displayed to help them make quick decisions.

[0721] (Example 2)

[0722] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0723] Modern security systems face the challenge of lacking the flexibility to quickly detect suspicious individuals or abnormal behavior and respond appropriately. In particular, there is a need to prevent escalation of situations and achieve more effective security measures by providing warnings and notifications that take emotional states into consideration.

[0724] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0725] In this invention, the server includes means for immediately processing image information acquired from a video acquisition device to identify abnormal behavior or suspicious individuals; means for matching the facial information of the identified individual with a prior set of information to issue an alarm if identified; and means for generating voice instructions using artificial intelligence to issue warnings to suspicious individuals. This enables a rapid and flexible response to suspicious individuals and abnormal behavior, tailored to their emotional state.

[0726] A "video acquisition device" refers to equipment used to capture image information and process or analyze that data.

[0727] "Image information" refers to visual data obtained from a video acquisition device, and includes the visual characteristics of the object or individual being analyzed.

[0728] "Abnormal behavior" refers to actions or behaviors that deviate from normal behavioral patterns and may lead to suspicious situations.

[0729] A "suspicious individual" refers to an individual who exhibits unusual or suspicious behavior, as detected by a security system.

[0730] "Facial information" refers to data containing facial features used for individual identification, and is usually stored as image data.

[0731] "Prior information sets" refer to reference datasets that have been collected or registered in the past and are used as criteria for identification.

[0732] An "alarm" refers to a notification method used to warn or alert people to detected anomalies or suspicious situations.

[0733] "Generative artificial intelligence" refers to a system that uses artificial intelligence technology to create newly generated content and information.

[0734] "Voice instructions" refer to voice messages generated using speech synthesis technology to give instructions or warnings to a target.

[0735] This invention provides an advanced security system that enables immediate response to suspicious individuals and abnormal behavior through the cooperation of servers, terminals, and users. This system is implemented using the following combination of hardware and software.

[0736] The server acquires real-time image information transmitted from the video acquisition device. Using image processing libraries such as OpenCV, the server analyzes this image information and detects motion and facial features. The detected facial information is compared with a pre-collected database, and a face recognition model using TensorFlow identifies suspicious individuals.
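
The comparison against a pre-collected database can be sketched as a nearest-match search over face feature vectors. This is a minimal sketch under stated assumptions: the description does not say how faces are represented or compared, so the use of embedding vectors, cosine similarity, the function names, and the threshold of 0.8 are all hypothetical. In practice the vectors would come from the face recognition model.

```python
import math
from typing import Mapping, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_face(
    embedding: Sequence[float],
    database: Mapping[str, Sequence[float]],
    threshold: float = 0.8,  # hypothetical cut-off for declaring a match
) -> Optional[str]:
    """Return the ID of the most similar registered face, or None if no face
    in the database reaches the threshold."""
    best_id, best_score = None, threshold
    for person_id, reference in database.items():
        score = cosine_similarity(embedding, reference)
        if score >= best_score:
            best_id, best_score = person_id, score
    return best_id
```

Returning `None` below the threshold matters here: an unregistered visitor should not be reported as the closest registered individual.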

[0737] Furthermore, the server uses generative artificial intelligence to generate warning messages for suspicious individuals. Specifically, it utilizes natural language processing models to create warning texts, which are then converted into voice instructions using speech synthesis technology. For example, a warning message such as "There is a suspicious person in the parking lot. Suspicious behavior has been detected." is generated. This message is optimized according to the emotional state of the suspicious individual detected by an emotion analysis device. Amazon Comprehend, among others, is used for emotion analysis.

[0738] The device transmits generated voice instructions to the suspicious individual via its speaker. Simultaneously, it provides the user with a notification containing information about the suspicious person and sentiment analysis results. Based on the detailed information received through the device, the user can quickly assess the situation and take appropriate action, such as reporting to the police.

[0739] A concrete example of this system is intruder detection in a residential parking lot. The server analyzes video footage of the parking lot, identifies the intruder, and analyzes their emotions. An example of a prompt message might be, "What information is needed to generate a warning message if an intruder enters the parking lot?" In this way, the present invention enhances security measures in various environments.

[0740] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0741] Step 1:

[0742] The server receives image information from the video acquisition device. This input image information is video data consisting of multiple frames. The server uses OpenCV to extract frames from this video and executes a motion detection algorithm. Specifically, it uses methods such as background subtraction and optical flow analysis to identify moving parts. The output of this process is the frame image in which motion was detected.

[0743] Step 2:

[0744] The server detects face regions from frame images where motion has been detected. A face detection model using TensorFlow extracts regions containing facial features. This face region is then used as input data and compared against known faces in a database by a pre-trained face recognition model. The output is the face information of the identified suspicious individual.

[0745] Step 3:

[0746] The server performs emotion analysis on the identified facial information. This uses Amazon Comprehend to analyze emotional states and detect emotions such as anxiety and anger. The input data is the identified facial information, and the output data includes emotional states. This allows the server to understand the emotional state of the suspicious person.

[0747] Step 4:

[0748] The server utilizes a generative artificial intelligence model to generate warning messages for suspicious individuals. Specifically, it inputs a prompt into a natural language processing model to create a warning message such as, "There is a suspicious person in the parking lot." The output warning message is then converted into a voice instruction using speech synthesis technology.

[0749] Step 5:

[0750] The terminal transmits the generated voice instructions through its speaker. Simultaneously, the server sends a notification about the abnormal situation to the terminal and provides it to the user. The notification includes the results of sentiment analysis and suggests the next action the user should take. As output, the user receives an audio warning and a detailed explanation of the situation.

[0751] Step 6:

[0752] The user acts quickly based on the information provided by the device. This information includes transmitted voice instructions and device notifications. The user can use this information to take countermeasures against suspicious individuals. User actions include, for example, contacting the police or taking additional security measures.
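
The motion detection in Step 1 can be illustrated with plain frame differencing, a simpler relative of the background subtraction named in the description. This sketch deliberately does not use OpenCV so that it runs on its own: frames are modeled as nested lists of grayscale values, and the names `motion_ratio` and `frames_with_motion` and both thresholds are assumptions chosen for illustration.

```python
from typing import List

Frame = List[List[int]]  # grayscale frame: rows of pixel values in 0..255


def motion_ratio(previous: Frame, current: Frame, pixel_threshold: int = 25) -> float:
    """Fraction of pixels whose brightness changed by more than pixel_threshold."""
    changed = total = 0
    for prev_row, cur_row in zip(previous, current):
        for p, c in zip(prev_row, cur_row):
            total += 1
            if abs(c - p) > pixel_threshold:
                changed += 1
    return changed / total if total else 0.0


def frames_with_motion(frames: List[Frame], min_ratio: float = 0.01) -> List[int]:
    """Indices of frames that differ noticeably from the frame before them."""
    return [
        i
        for i in range(1, len(frames))
        if motion_ratio(frames[i - 1], frames[i]) >= min_ratio
    ]
```

Only the frames returned by `frames_with_motion` would be passed on to face detection in Step 2, which keeps the more expensive model off static footage.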

[0753] (Application Example 2)

[0754] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0755] In recent years, security challenges in society have increased, with a particular need for early detection and appropriate response to suspicious individuals and abnormal behavior. However, current security systems struggle to respond flexibly, taking into account the emotional state of suspicious individuals, which can lead to false alarms and unnecessary tension. Therefore, there is a need to provide a system that improves detection accuracy and enables the generation and notification of warning messages based on appropriate situational judgment.

[0756] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0757] In this invention, the server includes means for analyzing video data received from a video capture device in real time to detect abnormal behavior or suspicious persons; means for matching the facial image of the detected person with an existing database and issuing a warning if identified; means for generating voice messages using artificial intelligence to warn suspicious persons; means for analyzing the emotional state of the detected person using an emotion analysis engine and optimizing the content and tone of the warning message; and means for providing notifications of abnormal activity based on the emotion analysis results to personal portable devices. This enables users to take appropriate action based on the emotional state of suspicious persons.

[0758] A "video capture device" is a device, such as a video camera or surveillance camera, that acquires and transmits video data in real time.

[0759] "Real-time analysis" refers to a method of processing input data immediately and outputting results without delay.

[0760] "Abnormal behavior" refers to movements or actions that are not normal or differ from the norm, and is an action that should be taken into consideration in crime prevention.

[0761] A "suspicious person" refers to an individual who does not fit in with the normal environment or circumstances, or who is considered suspicious.

[0762] A "face image" is visual data of a person's face captured by a digital camera or other image acquisition device.

[0763] A "database" is a digital archive organized to efficiently store and search large amounts of information.

[0764] A "voice message" is a recorded or synthesized message that is produced to convey information through sound.

[0765] An "emotion analysis engine" is a system that automatically determines a person's emotional state based on data obtained from video and audio.

[0766] A "personal portable device" refers to a digital device that a user personally owns and can carry with them, such as a smartphone or tablet.

[0767] To implement this invention, the server uses video data received from a high-resolution video capture device. The server uses the open-source image processing library "OpenCV" to perform video analysis. This makes it possible to detect abnormal behavior and suspicious individuals in real time.

[0768] Next, the server inputs the detected facial image into a facial recognition system (e.g., a facial recognition API) and compares it against an existing database. If a known suspicious person is identified through the comparison, the server generates a voice message using generative artificial intelligence (AI). A speech synthesis service (e.g., a text-to-speech API) is used to generate the voice message.

[0769] In addition, the server uses an emotion analysis engine to analyze the emotional state of the detected person. Based on this analysis, the content and tone of the warning message are optimized. For example, issuing the warning in a milder tone when the situation calls for it can prevent unnecessary escalation.

[0770] The server then sends a notification to the user's personal mobile device. This notification includes information about abnormal activity and sentiment analysis results, which will be displayed on the user's smartphone or smart glasses.

[0771] As a concrete example, consider a scenario where a suspicious person is detected at the entrance of a house. An example of a prompt message would be: "A suspicious person has been detected by the camera in front of the entrance. The person appears somewhat nervous. Generate a warning message in a calm tone and send a notification to the user." In this way, the user can respond to the situation appropriately and quickly.
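
A prompt of this form can be assembled from the detection results with a simple template. This is a minimal sketch: the function name `build_warning_prompt` and its three parameters are hypothetical, and the wording of the template is taken directly from the example prompt above.

```python
def build_warning_prompt(location: str, observed_emotion: str, tone: str) -> str:
    """Fill the example prompt with the camera location, the emotion reported
    by the emotion analysis engine, and the desired tone of the warning."""
    return (
        f"A suspicious person has been detected by the camera {location}. "
        f"The person appears {observed_emotion}. "
        f"Generate a warning message in a {tone} tone and send a notification to the user."
    )
```

Calling `build_warning_prompt("in front of the entrance", "somewhat nervous", "calm")` reproduces the example prompt quoted above.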

[0772] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0773] Step 1:

[0774] The server receives video data in real time from the video capture device. Based on this video data, it performs motion analysis using OpenCV to detect unusual movements. The output is the information indicating the detection of abnormal behavior.

[0775] Step 2:

[0776] The server extracts faces from the detected image data and inputs them into the facial recognition system. The facial images are compared against an existing database to identify specific suspicious individuals. This matching result is then output.

[0777] Step 3:

[0778] The server generates a voice message using artificial intelligence based on the above matching results. The input information includes the identity of the suspicious person and the circumstances surrounding them. A text-to-speech API is used to generate voice data with an appropriate warning tone. The voice data is the output.

[0779] Step 4:

[0780] The server utilizes an emotion analysis engine to analyze a person's emotional state from video data. The input includes video data, and the output includes the detected emotional state (e.g., stress, anxiety). Based on this analysis, the content and tone of the warning message generated in the previous step are adjusted.

[0781] Step 5:

[0782] The server ultimately sends a refined warning message to the device. The user's personal mobile device receives this notification and displays it on the screen. The notification includes information about the suspicious person and sentiment analysis results, which prompt the user to take a quick action.
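
The notification sent in Step 5 can be sketched as a small structured payload. The description only says that the notification carries information about the suspicious person and the sentiment analysis results, so the JSON encoding, the function name `build_notification`, and every field name below are assumptions made for illustration.

```python
import json
from typing import List


def build_notification(person_label: str, emotions: List[str], suggested_action: str) -> str:
    """Serialize the notification for the user's personal portable device."""
    payload = {
        "event": "suspicious_person_detected",  # hypothetical event name
        "person": person_label,                 # result of the face matching step
        "emotion_analysis": emotions,           # e.g. ["stress", "anxiety"]
        "suggested_action": suggested_action,   # prompt for a quick user action
    }
    return json.dumps(payload, ensure_ascii=False)
```

On the receiving side, the device can decode the string with `json.loads` and render the fields on screen.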

[0783] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0784] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. Prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images, are input to the data generation model 58. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0785] In the above embodiment, an example was given in which the specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.

[0786] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping, namely an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.

[0787] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. Above and below the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.

[0788] These emotions are distributed at the 3 o'clock position on the Emotion Map 400, and usually fluctuate between feelings of security and anxiety. In the right half of the Emotion Map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.

[0789] The inside of the Emotion Map 400 represents inner thoughts, while the outside represents actions. Therefore, the further toward the outside of the Emotion Map 400 an emotion lies, the more visible (expressed in actions) it becomes.

[0790] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https://ci.nii.ac.jp/naid/500000375379). The left half of the emotion map contains emotions belonging to a region called "response," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.

[0791] The emotion map defines two emotions that promote learning. One is the emotion around the middle of the negative "repentance" and "reflection" on the situation side. In other words, it is when the robot experiences negative emotions such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the emotion around the positive "desire" on the reaction side. In other words, it is when the robot has positive feelings such as "I want more" or "I want to know more."

[0792] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
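
The last step, going from per-emotion values to a determined emotion, can be sketched as follows. The description does not state the selection rule, so picking the emotion with the highest value is an assumption, as are the function name `determine_emotion` and the numeric values in the usage note. The neural network that produces the values is not modeled here.

```python
from typing import Mapping


def determine_emotion(emotion_values: Mapping[str, float]) -> str:
    """Return the emotion with the highest emotion value.

    Assumption: the highest value wins; ties resolve to the first such
    emotion in the mapping's iteration order.
    """
    if not emotion_values:
        raise ValueError("emotion_values must not be empty")
    return max(emotion_values, key=lambda name: emotion_values[name])
```

With hypothetical values such as `{"reassured": 0.71, "calm": 0.74, "confident": 0.69, "anxious": 0.05}`, neighboring emotions score similarly, as in Figure 10, and `determine_emotion` selects `"calm"`.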

[0793] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.

[0794] In the above embodiment, an example was given in which a specific process is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and a distributed processing of the specific process may be performed by multiple computers, including computer 22. For example, a data generation model 58 may be provided in an external device of the data processing device 12, and the external device may generate data according to the input data.

[0795] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes the specific processing according to the specific processing program 56.

[0796] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.

[0797] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.

[0798] The following types of processors can be used as hardware resources to perform specific processing. One example is a CPU, which is a general-purpose processor that functions as a hardware resource performing specific processing by executing software, i.e., a program. Other examples include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which have circuit configurations specifically designed to perform specific processing. All of these processors have built-in or connected memory, and all of them perform specific processing by using that memory.

[0799] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.

[0800] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.

[0801] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.

[0802] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of the technical aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that you may delete unnecessary parts, add new elements, or replace elements in the descriptions and illustrations presented above, as long as you do not deviate from the essence of the technical aspects of this disclosure. Furthermore, in order to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge and the like that do not require special explanation to enable the implementation of the technical aspects of this disclosure have been omitted from the descriptions and illustrations presented above.

[0803] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted as being incorporated by reference.

[0804] The following is further disclosed regarding the embodiments described above.

[0805] (Claim 1)

[0806] A means for analyzing video data received from a video capture device in real time to detect abnormal behavior or suspicious individuals,

[0807] A means of issuing a warning if the detected person's facial image is identified by comparing it with an existing database,

[0808] A means of generating voice messages using generative artificial intelligence to warn suspicious individuals,

[0809] Means of notifying administrators or the police of information regarding suspicious persons or unusual behavior,

[0810] A system comprising the above means.

[0811] (Claim 2)

[0812] The system according to claim 1, comprising a method for integrating diverse data modalities to achieve high-precision video analysis.

[0813] (Claim 3)

[0814] The system according to claim 1, further comprising means for providing a user terminal with a notification regarding the detection of an anomaly.

[0815] "Example 1"

[0816] (Claim 1)

[0817] A means for analyzing video information received from a video input device in real time to detect abnormal behavior or suspicious individuals,

[0818] A means of issuing a warning when the facial features of a detected person are identified by comparing them with an existing data collection,

[0819] A means of generating voice information using generative artificial intelligence to warn intruders,

[0820] Means of notifying administrators or public authorities of information regarding unusual behavior or suspicious individuals,

[0821] A means for generating everyday sounds to confuse suspicious individuals and playing them back within a designated area,

[0822] A means of providing a rapid notification regarding anomaly detection to the user terminal,

[0823] A system comprising the means listed above.

[0824] (Claim 2)

[0825] The system according to claim 1, comprising a method for achieving high-precision video analysis by integrating multiple information formats.

[0826] (Claim 3)

[0827] The system according to claim 1, further comprising means for determining an appropriate response to detected abnormal behavior using an artificial intelligence model.

[0828] "Application Example 1"

[0829] (Claim 1)

[0830] A means for immediately analyzing video information received from a video acquisition device to detect abnormal behavior or suspicious persons,

[0831] A means of issuing a notification if the facial information of a detected person is identified by comparing it with an existing set of information,

[0832] A means of generating voice information and issuing warnings using artificial intelligence,

[0833] Means of notifying administrators or law enforcement agencies of information regarding suspicious persons or unusual behavior,

[0834] A means of connecting to an external device and providing notifications to the user terminal when an anomaly is detected,

[0835] A means for controlling surrounding output devices by providing information on anomaly detection in real time,

[0836] A system comprising the means listed above.

[0837] (Claim 2)

[0838] The system according to claim 1, comprising a method for integrating diverse information modalities to achieve high-precision video analysis.

[0839] (Claim 3)

[0840] The system according to claim 1, further comprising means for generating an alarm sound by outputting external information.

[0841] "Example 2 of combining an emotion engine"

[0842] (Claim 1)

[0843] A means for immediately processing image information acquired from a video acquisition device to identify abnormal behavior or suspicious individuals,

[0844] A means for issuing an alarm when the facial information of an identified individual is matched with a set of prior information,

[0845] A means of generating voice instructions using generative artificial intelligence and issuing warnings to suspicious individuals,

[0846] A means for determining the emotional state of a detected individual using an emotion analysis device and optimizing the content and tone of the warning,

[0847] Means for reporting information about suspicious individuals or abnormal behavior to controllers or security agencies,

[0848] A system comprising the means listed above.

[0849] (Claim 2)

[0850] The system according to claim 1, comprising a method for achieving advanced image processing through multi-layer information analysis.

[0851] (Claim 3)

[0852] The system according to claim 1, further comprising means for providing a warning regarding anomaly detection to a user terminal.
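The emotion-based step in this example, in which the content and tone of the warning are adjusted to the detected individual's emotional state, might look roughly like the following Python sketch. It is a hedged illustration rather than the disclosed implementation: the emotion labels, the tone table `TONE_BY_EMOTION`, and the function `build_warning_prompt` are all hypothetical, and the resulting prompt would be handed to whatever generative model the system uses.

```python
# Hypothetical mapping from an emotion label to a speaking tone.
# Neither the labels nor the tones come from the disclosure.
TONE_BY_EMOTION = {
    "agitated": "calm and de-escalating",
    "fearful": "reassuring but firm",
    "neutral": "firm and direct",
}
DEFAULT_TONE = "firm and direct"


def build_warning_prompt(emotion: str, behavior: str) -> str:
    """Build a prompt asking a generative model for a tone-adjusted warning."""
    tone = TONE_BY_EMOTION.get(emotion, DEFAULT_TONE)  # fall back for unknown labels
    return (
        f"Write a one-sentence spoken warning in a {tone} tone "
        f"for a person observed {behavior}."
    )
```

Keeping the tone selection in a small table separate from the generation call makes it possible to review or change the policy without touching the model integration.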

[0853] "Application Example 2 of combining an emotion engine"

[0854] (Claim 1)

[0855] A means for analyzing video data received from a video capture device in real time to detect abnormal behavior or suspicious individuals,

[0856] A means of issuing a warning if the detected person's facial image is identified by comparing it with an existing database,

[0857] A means of generating voice messages using generative artificial intelligence to warn suspicious individuals,

[0858] Means of notifying administrators or the police of information regarding suspicious persons or unusual behavior,

[0859] A means of analyzing the emotional state of a person detected using an emotion analysis engine and optimizing the content and tone of warning messages,

[0860] A means for providing notifications of abnormal activity based on emotion analysis results to a personal portable device,

[0861] A system comprising the means listed above.

[0862] (Claim 2)

[0863] The system according to claim 1, comprising a method for integrating diverse data modalities to achieve high-precision video analysis.

[0864] (Claim 3)

[0865] The system according to claim 1, further comprising means for providing a personal portable device with a notification regarding the detection of an anomaly.

[Explanation of symbols]

[0866]
10, 210, 310, 410 Data Processing Systems
12 Data Processing Devices
14 Smart Devices
214 Smart Glasses
314 Headset-type terminal
414 Robots

Claims

1. A system comprising: means for analyzing, in real time, video data received from a video capture device to detect abnormal behavior or suspicious individuals; means for issuing a warning if a detected person's facial image is identified by comparison with an existing database; means for generating voice messages using generative artificial intelligence to warn suspicious individuals; and means for notifying administrators or the police of information regarding suspicious persons or unusual behavior.

2. The system according to claim 1, comprising a method for integrating various data modalities to achieve high-precision video analysis.

3. The system according to claim 1, further comprising means for providing a user terminal with a notification regarding the detection of an anomaly.