system

A system that converts voice and image data into text and image data to generate detailed manuals, automating tasks and reducing human intervention, addresses labor shortages and inefficiencies in small and medium-sized enterprises.

JP2026103513A · Pending · Publication Date: 2026-06-24 · SOFTBANK GROUP CORP


Patent Information

Authority / Receiving Office: JP · JP
Patent Type: Applications
Current Assignee / Owner: SOFTBANK GROUP CORP
Filing Date: 2024-12-12
Publication Date: 2026-06-24

Application Information

Patent Timeline

12 Dec 2024: Application
24 Jun 2026: Publication (JP2026103513A)

IPC: G05B23/02

AI Tagging

Application Domain

Electric testing/monitoring

Technology Topics

Engineering · Video recording


AI Technical Summary

Technical Problem

Existing technologies face challenges in efficiently automating business processes in small and medium-sized enterprises due to labor shortages and difficulties in integrating voice explanations with specific operation procedures, leading to inefficiencies and instability in task execution.

Method used

A system that converts voice and image data into text and image data, generates detailed work manuals, and automates task execution using AI agents, minimizing human intervention.

Benefits of technology

The system enables efficient and practical automation of tasks by creating comprehensive manuals and allowing AI agents to execute procedures accurately and consistently, reducing the burden on human resources.

✦ Generated by Eureka AI based on patent content.


Patent Text Reader

Abstract

[Problem] To provide a system. [Solution] A system including: an acquisition means for acquiring information on work content as audio and video; a conversion means for converting the audio into text information; a generation means for generating a work instruction sheet that describes the work procedure based on the video and the text information; a recording means for storing the work instruction sheet; and a control means for automatically performing the work in accordance with the work instruction sheet, wherein the control means includes a display means having a function of presenting the work procedure using a visual device.


Description

Technical Field


[0001] The technology of the present disclosure relates to a system.

Background Art

[0002] Patent Document 1 discloses a method for controlling a persona chatbot, which is performed by at least one processor, and includes steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a character of the chatbot, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance that responds to the user utterance.

Prior Art Documents

Patent Documents

[0003]

Patent Document 1

Summary of the Invention

Problems to be Solved by the Invention

[0004] Against a background of labor shortages and difficulties with business succession in small and medium-sized enterprises, improved business efficiency is required. There is a need for a system in which a user teaches an AI agent a task currently performed by humans, the content is quickly and accurately turned into a business manual, and the AI automatically executes the task from the next time onward. However, with existing technologies, it is difficult to effectively integrate the voice explanation of a task with the specific operation procedure to generate a practical manual. In addition, when automating tasks based on the generated manual, there is no method that performs the tasks stably regardless of the specific instructions or environment. These problems need to be solved.

Means for Solving the Problems

[0005] This invention provides an input means for inputting work content using voice and images. The voice data is converted into text data by a conversion means. Subsequently, a work manual containing detailed work procedures is generated by a generation means based on the image and text data. This work manual is stored by a storage means and can be accessed by the user at any time. Furthermore, an execution means allows an AI agent to automatically perform the work according to the stored work manual from the next time onward. This system enables efficient and practical automation of tasks from the initial instruction, significantly reducing the burden on human resources.

[0006] "Input means" refers to a function that allows users to provide work details to the system via voice and images.

[0007] "Conversion means" refers to a function for converting input audio data into text data.

[0008] "Generation means" refers to a function that combines image data with text data to create a business manual that describes work procedures.

[0009] "Storage means" refers to a function that saves the generated business manuals so that users can refer to them in the future.

[0010] "Execution means" refers to a function that automatically carries out tasks based on the generated work manual.

Brief Description of the Drawings

[0011]
[Figure 1] A conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.
[Figure 2] A conceptual diagram showing an example of the main functions of a data processing device and a smart device according to the first embodiment.
[Figure 3] A conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.
[Figure 4] A conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
[Figure 5] A conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.
[Figure 6] A conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
[Figure 7] A conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.
[Figure 8] A conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
[Figure 9] An emotion map on which multiple emotions are mapped.
[Figure 10] An emotion map on which multiple emotions are mapped.
[Figure 11] A sequence diagram showing the processing flow of the data processing system in Example 1.
[Figure 12] A sequence diagram showing the processing flow of the data processing system in Application Example 1.
[Figure 13] A sequence diagram showing the processing flow of the data processing system in Example 2, which incorporates an emotion engine.
[Figure 14] A sequence diagram showing the processing flow of the data processing system in Application Example 2, which incorporates an emotion engine.

Modes for Carrying Out the Invention

[0012] Hereinafter, an example of an embodiment of the system relating to the technology of this disclosure will be described with reference to the attached drawings.

[0013] First, the terminology used in the following description is explained.

[0014] In the following embodiments, a processor denoted by a reference numeral (hereinafter simply referred to as the "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. The processor may also be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), and an APU (Accelerated Processing Unit).

[0015] In the following embodiments, RAM (Random Access Memory) denoted by a reference numeral is memory in which information is temporarily stored and which the processor uses as work memory.

[0016] In the following embodiments, storage denoted by a reference numeral is one or more non-volatile storage devices that store various programs, various parameters, and the like. Examples of non-volatile storage devices include flash memory (e.g., an SSD (Solid State Drive)), magnetic disks (e.g., hard disks), and magnetic tapes.

[0017] In the following embodiments, a communication I/F (Interface) denoted by a reference numeral is an interface that includes a communication processor, an antenna, and the like. The communication I/F controls communication between multiple computers. Examples of communication standards applied to the communication I/F include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), and Bluetooth (registered trademark).

[0018] In the following embodiments, "A and/or B" is synonymous with "at least one of A and B." That is, "A and/or B" means A alone, B alone, or a combination of A and B. Furthermore, in this specification, the same concept as "A and/or B" applies when three or more items are linked by "and/or."

[0019] [First Embodiment]

[0020] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.

[0021] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.

[0022] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0023] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.

[0024] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.

[0025] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and / or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.

[0026] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.

[0027] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.

[0028] As shown in Figure 2, in the data processing device 12, a specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.

[0029] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0030] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used in conjunction with a specific processing program 56 by the data processing system 10. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0031] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0032] An embodiment of the system of the present invention employs a process that includes the following procedures. In this process, the user, terminal, and server each play a specific role.

[0033] The user provides details of their work to the system via a terminal and verbally explains the series of operations involved. During this process, the terminal captures the user's screen in real time and simultaneously records the user's voice. This data is transmitted to the server as the work progresses.

[0034] The server converts the audio data transmitted from the terminal into text data using speech recognition technology. This conversion process organizes the content explained in audio into text. The server also analyzes the image data acquired from the terminal and extracts keyframes and operation details necessary for performing the task.
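The source does not say how keyframes are chosen. As a minimal sketch, assuming screenshots arrive as small grayscale pixel grids and that a keyframe is any frame that differs enough from the last kept one, frame differencing could look like this. The names `mean_abs_diff` and `select_keyframes` and the threshold value are illustrative assumptions, not part of the disclosure.

```python
from typing import List, Sequence

# Assumption: a frame is a grayscale image given as rows of 0-255 pixel values.
Frame = Sequence[Sequence[int]]


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average absolute per-pixel difference between two same-sized frames."""
    total = 0
    count = 0
    for row_a, row_b in zip(a, b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            total += abs(pixel_a - pixel_b)
            count += 1
    return total / count if count else 0.0


def select_keyframes(frames: Sequence[Frame], threshold: float = 10.0) -> List[int]:
    """Return indices of frames that differ noticeably from the last kept frame.

    The first frame is always kept so the manual has a starting screenshot.
    """
    if not frames:
        return []
    kept = [0]
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[kept[-1]], frames[i]) > threshold:
            kept.append(i)
    return kept


# Toy 2x2 "screens": nothing changes, a dialog appears, then it closes again.
blank = [[0, 0], [0, 0]]
dialog = [[0, 200], [0, 200]]
keyframe_indices = select_keyframes([blank, blank, dialog, dialog, blank])
```

Comparing against the last kept frame rather than the immediately preceding one avoids missing slow, gradual screen changes; a production system would more likely work on real image arrays with an image processing library.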

[0035] The server, acting as the generation mechanism, integrates the obtained text and image data to generate a detailed operational manual. This manual includes step-by-step instructions to ensure the user can accurately reproduce the initial operation. The generated operational manual is securely stored on a cloud-based storage device, allowing the user to access and modify it as needed.
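One way to picture the integration of text and image data is to pair each spoken instruction with a screenshot captured while it was being explained. The sketch below assumes that both the recognized utterances and the keyframes carry timestamps; the dataclasses, field names, and file names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Utterance:
    start_s: float  # when the user started speaking, in seconds
    text: str       # recognized text for this stretch of speech


@dataclass
class Keyframe:
    time_s: float   # capture time of the screenshot, in seconds
    image_ref: str  # hypothetical file name or storage key


@dataclass
class ManualStep:
    number: int
    instruction: str
    image_ref: Optional[str]


def build_manual(utterances: List[Utterance], keyframes: List[Keyframe]) -> List[ManualStep]:
    """Pair each utterance with the first keyframe captured before the next utterance starts."""
    ordered = sorted(utterances, key=lambda u: u.start_s)
    frames = sorted(keyframes, key=lambda k: k.time_s)
    steps = []
    for i, utterance in enumerate(ordered):
        end_s = ordered[i + 1].start_s if i + 1 < len(ordered) else float("inf")
        image = next(
            (k.image_ref for k in frames if utterance.start_s <= k.time_s < end_s),
            None,  # no screenshot fell inside this utterance's time window
        )
        steps.append(ManualStep(i + 1, utterance.text, image))
    return steps


manual = build_manual(
    [
        Utterance(0.0, "Open the invoice screen."),
        Utterance(5.0, "Enter the customer name."),
        Utterance(9.0, "Click Issue."),
    ],
    [Keyframe(1.2, "frame_001.png"), Keyframe(6.0, "frame_002.png")],
)
```

Steps without a matching screenshot keep `image_ref` as `None`, so a reviewer can see which instructions still lack a picture when editing the manual.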

[0036] Subsequently, the next time a user wants to perform the same task, the server will automatically execute the procedure based on the saved task manual. As a means of execution, the server replicates the necessary steps on the terminal, automating the task. User intervention is minimized, resulting in increased efficiency.
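How the server replicates steps on the terminal is not specified. A minimal sketch, assuming each saved step is an (action, argument) pair and that the terminal exposes one handler per action, is shown below. The action names and the log format are hypothetical; the returned log loosely corresponds to the execution logs mentioned later in the processing flow.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: each manual step is stored as (action name, argument).
Step = Tuple[str, str]


def run_manual(steps: List[Step], handlers: Dict[str, Callable[[str], None]]) -> List[str]:
    """Replay steps in order and return one log line per attempted step.

    Execution stops at the first unknown action or failing step so that a
    partially applied procedure is visible in the log.
    """
    log = []
    for number, (action, argument) in enumerate(steps, start=1):
        handler = handlers.get(action)
        if handler is None:
            log.append(f"step {number}: unknown action '{action}', stopped")
            break
        try:
            handler(argument)
        except Exception as exc:
            log.append(f"step {number}: {action} failed ({exc}), stopped")
            break
        log.append(f"step {number}: {action} ok")
    return log


# Stand-in handlers that only record what a real UI driver would do.
performed: List[str] = []
handlers = {
    "type": lambda text: performed.append(f"typed {text}"),
    "click": lambda button: performed.append(f"clicked {button}"),
}
log = run_manual([("type", "ACME Corp"), ("click", "Issue")], handlers)
```

Stopping on the first failure is a design choice for this sketch: continuing after a failed field entry could issue an incorrect invoice, so halting and logging is the safer default.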

[0037] For example, if a user is teaching an AI agent how to issue invoices, the user operates the invoice issuance software and provides explanations for each operation. The terminal collects this information and sends it to the server. The server converts the received audio into text and the screen operations into images, and integrates them to generate an invoice issuance manual. As a result, the server can then automatically perform the invoice issuance procedure according to this manual the next time around.

[0038] This approach integrates the entire process, from creating operational manuals to automating their execution, into a single workflow, providing users with an effective means to improve operational efficiency.

[0039] The following describes the processing flow.

[0040] Step 1:

[0041] The user uses a terminal and explains the task verbally while demonstrating it. The user operates the software on the screen while giving verbal instructions to demonstrate specific operations.

[0042] Step 2:

[0043] The terminal activates its screen capture function and records the user's screen activity. It also records the user's voice explanation via the microphone. This data is temporarily stored for later transmission to the server.

[0044] Step 3:

[0045] The terminal compresses the recorded screen and audio data and sends them to the server using a secure communication protocol. Data integrity is verified to ensure that no data loss occurs.

[0046] Step 4:

[0047] The server converts the audio data received from the terminal into text data using speech recognition technology. In this process, appropriate dictionary data is used to handle specialized terminology and industry jargon that needs to be recognized.

[0048] Step 5:

[0049] The server identifies important keyframes from the screen data and extracts images related to user actions. These images will later become part of the operational manual that will be generated.

[0050] Step 6:

[0051] The server combines the converted text data with extracted images to automatically generate a detailed manual outlining the work procedures. This manual organizes the work procedures step by step and presents them in a user-friendly format.

[0052] Step 7:

[0053] The server stores the generated work manual in cloud storage. Users can access the manual at any time to review or modify the work procedure.

[0054] Step 8:

[0055] The next time the user requests that the same task be performed by AI, the server will automatically execute the steps based on the stored task manual. The task will proceed autonomously, following instructions from the server to the terminal.

[0056] Step 9:

[0057] The server saves execution logs of tasks and makes them accessible to users for future improvement and verification. This helps users improve task accuracy and troubleshoot problems more efficiently.
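Step 3 above mentions compression and an integrity check without naming an algorithm. The following sketch assumes zlib compression and a SHA-256 digest computed over the uncompressed recording; the function names `pack` and `unpack` are illustrative, and the secure transport itself (for example TLS) is outside the sketch.

```python
import hashlib
import zlib
from typing import Tuple


def pack(recording: bytes) -> Tuple[bytes, str]:
    """Compress a recording and return it with the SHA-256 digest of the original bytes."""
    return zlib.compress(recording), hashlib.sha256(recording).hexdigest()


def unpack(payload: bytes, expected_digest: str) -> bytes:
    """Decompress a payload and raise ValueError if the digest does not match."""
    recording = zlib.decompress(payload)
    if hashlib.sha256(recording).hexdigest() != expected_digest:
        raise ValueError("recording was altered or truncated in transit")
    return recording


# Terminal side: compress and fingerprint. Server side: restore and verify.
original = b"screen and audio bytes " * 100
payload, digest = pack(original)
restored = unpack(payload, digest)
```

Hashing the uncompressed bytes means the check covers the compression round trip as well as the transfer. Note that a badly corrupted payload may already fail inside `zlib.decompress` with `zlib.error` before the digest is compared.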

[0058] (Example 1)

[0059] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0060] In traditional business processes, the creation of documentation and the automation of procedures were often done manually, which was time-consuming and labor-intensive. Furthermore, manual document creation was prone to human error, compromising efficiency and accuracy. In addition, there was a lack of comprehensive systems to streamline and automate these business processes.

[0061] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0062] In this invention, the server includes information gathering means for acquiring audio and visual information for users to describe work content, information conversion means for converting the audio information into text information, and manual generation means for constructing work procedures using the visual and text information. This enables the automation and efficient execution of work procedures.

[0063] "User" refers to an individual or organization that uses the system to input business details and benefits from the streamlined process.

[0064] "Work content" refers to information including procedures for tasks and operations performed with a specific purpose.

[0065] "Audio information" refers to sound data that records work-related instructions and content explained verbally by the user.

[0066] "Visual information" refers to digital image data, including user interfaces and visual elements in business processes.

[0067] "Information gathering means" refers to devices or functions used to collect audio and visual information obtained from users.

[0068] "Information conversion means" refers to technology or equipment for converting acquired audio information into text information.

[0069] "Text information" refers to data in which audio information is represented as text.

[0070] "Manual generation means" refers to a device or function that documents business procedures based on collected visual and text information.

[0071] "Business procedures" refer to information that describes the sequence of operations and actions necessary to perform a task.

[0072] A "data storage area" refers to a storage system that stores data such as business procedures and manuals, and makes them accessible as needed.

[0073] "Means of performing business operations" refers to a system that has the function of automatically reproducing or executing a process according to saved business procedures.

[0074] A "network environment" refers to a technological infrastructure that enables the transmission and reception of digital data via the internet and other communication methods.

[0075] To implement this invention, a system is constructed in which a user, a terminal, and a server work together to automate tasks. The user verbally explains the task and performs the procedure on the terminal. The terminal captures the user's screen in real time and records audio. This terminal is a computer equipped with an audio input device and a display capture function.

[0076] The server converts the audio data transmitted from the terminal into text using speech recognition technology. General-purpose speech recognition technologies (such as Google® Cloud Speech-to-Text) can be used for this purpose. The server also uses image recognition technology to analyze captured visual information and extract key points related to the business. Image processing libraries such as OpenCV can be used for this image analysis.
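General-purpose recognizers can mis-transcribe jargon, which is why the processing flow earlier mentions dictionary data for specialized terminology. The source does not describe how that dictionary is applied; the sketch below assumes a simple post-processing pass over the transcript, with a hypothetical mapping from common misrecognitions to preferred terms.

```python
import re
from typing import Dict


def apply_domain_dictionary(transcript: str, dictionary: Dict[str, str]) -> str:
    """Replace known misrecognitions with the preferred domain term.

    Matching is case-insensitive and limited to whole words or phrases;
    longer entries are applied first so they win over shorter overlaps.
    """
    result = transcript
    for wrong in sorted(dictionary, key=len, reverse=True):
        replacement = dictionary[wrong]
        pattern = re.compile(r"\b" + re.escape(wrong) + r"\b", re.IGNORECASE)
        # A function replacement keeps backslashes in the term literal.
        result = pattern.sub(lambda _match: replacement, result)
    return result


# Hypothetical entries for an invoicing workflow.
invoice_terms = {"in voice": "invoice", "pee oh number": "PO number"}
corrected = apply_domain_dictionary(
    "Enter the pee oh number on the in voice screen.", invoice_terms
)
```

Some speech recognition services also accept vocabulary hints at recognition time; post-processing is used here only because it needs no external service to demonstrate.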

[0077] The server integrates the converted text information with the analyzed visual information to generate a manual outlining the work procedures. This manual is stored in a cloud storage system (e.g., Amazon S3) and is accessible at any time. Furthermore, users can modify this manual as needed.
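Because users can modify a stored manual, it can help to keep earlier revisions rather than overwrite them. The class below is an in-memory stand-in for the cloud storage system, written under that assumption; `ManualStore` and its methods are hypothetical and do not use or mirror any real storage service API.

```python
from typing import Dict, List


class ManualStore:
    """In-memory stand-in for cloud storage that keeps every saved revision."""

    def __init__(self) -> None:
        self._revisions: Dict[str, List[str]] = {}

    def save(self, manual_id: str, content: str) -> int:
        """Store a new revision and return its 1-based version number."""
        history = self._revisions.setdefault(manual_id, [])
        history.append(content)
        return len(history)

    def load(self, manual_id: str, version: int = 0) -> str:
        """Return the requested revision; version 0 means the latest one."""
        history = self._revisions[manual_id]
        if version == 0:
            return history[-1]
        if not 1 <= version <= len(history):
            raise KeyError(f"{manual_id} has no version {version}")
        return history[version - 1]


store = ManualStore()
first_version = store.save("invoice-issuance", "1. Open the invoice screen.")
second_version = store.save(
    "invoice-issuance", "1. Open the invoice screen.\n2. Click Issue."
)
latest = store.load("invoice-issuance")
original_text = store.load("invoice-issuance", version=1)
```

Keeping revisions lets an automated run be traced back to the exact manual text it followed, which is useful when a user edit changes the behavior.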

[0078] To streamline actual operations, the server can automatically execute processes based on saved work procedures. This minimizes user intervention and improves the reproducibility and efficiency of operations. For example, when issuing invoices, the user only needs to teach the necessary information input procedure once, and the server can then automatically reproduce that procedure the next time. This significantly reduces the burden of repetitive daily tasks.

[0079] An example of a prompt message could be: "Please describe the procedure for issuing an invoice. This procedure should include details of the required field entries and button clicks." Using such prompts generates operational manuals that contain specific and reproducible information, thus ensuring consistency in operations and improving operational efficiency.
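A prompt like this is typically assembled from a fixed instruction plus the recognized explanation. The sketch below assumes the transcript is simply appended after the instruction; the function name, parameters, and the "Transcript of the user's explanation" heading are illustrative choices, not wording from the source.

```python
from typing import List


def build_manual_prompt(task: str, details: List[str], transcript: str) -> str:
    """Compose an instruction followed by the recognized explanation."""
    instruction = (
        f"Please describe the procedure for {task}. "
        f"This procedure should include details of {' and '.join(details)}."
    )
    return f"{instruction}\n\nTranscript of the user's explanation:\n{transcript.strip()}"


prompt = build_manual_prompt(
    task="issuing an invoice",
    details=["the required field entries", "button clicks"],
    transcript="First I open the invoice screen, then I enter the customer name.",
)
```

Parameterizing the task name and required details lets the same template serve other procedures without rewriting the instruction text.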

[0080] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0081] Step 1:

[0082] The user verbally explains the task to the terminal and operates the software. Input includes the user's voice and actions. Specifically, the user launches invoice issuance software and explains the operation while entering the necessary information. During this process, the terminal utilizes both a voice input device and a screen capture function.

[0083] Step 2:

[0084] The terminal captures the user's screen and audio in real time and sends them to the server as audio and visual data. The input is the user's audio and screen, and the output is the audio and visual data sent to the server. Specifically, the terminal saves the displayed invoice issuance screen as a screenshot and records the audio explanation.

[0085] Step 3:

[0086] The server converts the transmitted audio data into text data using speech recognition technology. The input for this step is audio data, and the output is text data. Here, speech recognition software is used to organize the user's explanation into textual information.

[0087] Step 4:

[0088] The server analyzes visual data using image analysis technology and extracts keyframes to identify business procedures. The input is visual data, and the output is data indicating important operational points. Specifically, button clicks and data entry steps are identified from the image.

[0089] Step 5:

[0090] The server integrates the converted text data and the analyzed visual data to generate a manual outlining the business procedures. The input consists of text and visual data, while the output is a detailed business manual. The server combines these to clearly demonstrate the step-by-step process of invoice issuance.

[0091] Step 6:

[0092] The server saves the generated work manuals to cloud storage. The input is the work manual, and the output is the saved data accessible online. This allows users to access the manuals at any time and review or edit their contents.

[0093] Step 7:

[0094] The next time the task is requested, the server initiates an automated execution process based on the saved work procedures. The input is the work manual, and the output is the execution of the automated work procedures. The server reproduces the previously saved procedures and automatically performs the specific operations within the software.
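The flow above amounts to "teach once, replay afterwards". A minimal sketch of that decision, assuming manuals are looked up by a task name and that a missing manual means the task still needs a demonstration, could look like this. `TaskAgent` and its method names are hypothetical.

```python
from typing import Callable, Dict, List


class TaskAgent:
    """Remembers taught procedures and replays them on later requests."""

    def __init__(self) -> None:
        self._manuals: Dict[str, List[str]] = {}

    def teach(self, task: str, steps: List[str]) -> None:
        """Save (or overwrite) the procedure for a task."""
        self._manuals[task] = list(steps)

    def request(self, task: str, perform: Callable[[str], None]) -> bool:
        """Replay the saved procedure; return False if the task was never taught."""
        steps = self._manuals.get(task)
        if steps is None:
            return False
        for step in steps:
            perform(step)
        return True


agent = TaskAgent()
done: List[str] = []
first_try = agent.request("issue invoice", done.append)   # nothing saved yet
agent.teach("issue invoice", ["enter customer name", "click Issue"])
second_try = agent.request("issue invoice", done.append)  # replays both steps
```

Returning `False` instead of raising lets the caller fall back to the demonstration flow of Steps 1 to 6 when no manual exists yet.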

[0095] (Application Example 1)

[0096] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0097] In logistics centers, it is crucial for new workers to quickly and efficiently learn the shipping process. However, current manual training methods are insufficient to address the complexity and diverse procedures involved, making efficient training difficult. There is a need to improve this situation, standardize work procedures, and support the smooth execution of operations.

[0098] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0099] In this invention, the server includes an acquisition means for collecting work content in audio and video format, a conversion means for converting the audio into text information, and a generation means for generating work instructions that describe the work procedures based on the video and text information. This enables the presentation of work procedures in real time using a visual device, facilitating efficient training of new employees and automated execution of tasks.

[0100] "Work content" refers to a series of operational activities, including the shipping process at the logistics center.

[0101] "Means of acquisition that collect information via audio and video" refers to equipment used to record workers' operating procedures and voice explanations as audio and video data.

[0102] "The conversion means for converting the aforementioned audio into text information" refers to a technology that analyzes the collected audio data and expresses it as text information.

[0103] "Generation means for generating work instructions that describe work procedures based on the aforementioned video and text information" refers to a function that integrates the acquired data and creates a document that clearly describes the work procedures.

[0104] "Display means for presenting work procedures using visual devices" refers to a function that shows procedures to workers in real time through visual devices such as smart glasses.

[0105] "Recording means" refers to equipment or technology for storing generated work instructions and keeping them accessible as needed.

[0106] "Control means" refers to mechanisms or technologies for automatically executing tasks based on work instructions.

[0107] A "shared data environment" is a system for storing and synchronizing data across multiple devices and users via a network such as the cloud.

[0108] The system for implementing this invention mainly consists of three components: a server, a terminal (including a visual device), and a user. Specifically, it is implemented in the following form.

[0109] When users perform shipping processes at a logistics center, they use visual devices such as smart glasses to collect operating procedures and audio instructions. The visual devices acquire work content in real time as audio and video. As a result, the devices, as acquisition means, can record work procedures in detail.

[0110] Audio data obtained from visual devices is converted into text information by a server. Specifically, speech recognition technology is used to convert the audio data into text data. This conversion uses speech analysis software (e.g., Google Speech-to-Text API).

[0111] Next, the server generates a work instruction sheet based on the acquired video data and converted text information. The work instruction sheet documents the procedure and presents it to the user step by step. Image processing technology (e.g., OpenCV) and data integration technology (e.g., Python data processing libraries) are used in this generation process.

[0112] The generated work instructions are stored in a shared data environment, which serves as the recording means. This environment uses a cloud platform for data storage and synchronization. Users can access these instructions as needed to review and carry out their work.

[0113] As a concrete example, consider the shipping process for a new product at a logistics center. To learn this process, workers wear smart glasses, and the system captures and transmits visual and audio data to a server. The work instructions created from this data are then presented to the new worker through the visual device the next time they perform a similar task, facilitating rapid learning.
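Presenting instructions on a visual device can be thought of as stepping through the stored procedure one item at a time. The sketch below assumes the worker confirms each step before the next one is shown; the `StepPresenter` class, the display string format, and the sample shipping steps are illustrative assumptions rather than details from the source.

```python
from typing import List, Optional


class StepPresenter:
    """Shows one work instruction at a time and advances on confirmation."""

    def __init__(self, steps: List[str]) -> None:
        self._steps = list(steps)
        self._index = 0

    def current(self) -> Optional[str]:
        """Return the text to display, or None once every step is confirmed."""
        if self._index >= len(self._steps):
            return None
        return f"Step {self._index + 1}/{len(self._steps)}: {self._steps[self._index]}"

    def confirm(self) -> None:
        """Mark the displayed step as done and move to the next one."""
        if self._index < len(self._steps):
            self._index += 1


presenter = StepPresenter(["Scan the shipping label", "Place the parcel on the conveyor"])
first_display = presenter.current()
presenter.confirm()
second_display = presenter.current()
presenter.confirm()
finished = presenter.current() is None
```

In a real deployment the confirmation might come from a voice command or gesture captured by the smart glasses; here it is a plain method call so the logic stays testable.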

[0114] The following are examples of prompt messages:

[0115] "Please convert the audio explanation of the logistics center's shipping process into text, analyze the video footage of the operations, and output the results in the format of an operations manual."

[0116] This system, through its function of presenting work procedures using visual devices, can promote the standardization and efficiency of work processes and efficiently support new employee training in logistics centers.

[0117] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0118] Step 1:

[0119] The user wears a visual device and begins the shipping process at the logistics center. The user's visual device simultaneously collects audio and video during the work. The input consists of audio data emitted by the user and real-time video data, which are temporarily recorded in the visual device's storage.

[0120] Step 2:

[0121] The terminal sends the collected audio data to the server, which uses speech recognition technology to convert the audio into text. The input is audio data, and the server uses speech recognition software to convert the audio data into text strings; this text data is the output. In this way, the procedures explained in audio are recorded as text.

[0122] Step 3:

[0123] The server analyzes the video data sent from the terminal and extracts keyframes that are important for the task. The input is video data, the server applies video frame analysis to detect important scenes, and the output is the image data extracted as keyframes. Specifically, the server uses its computing resources to detect changes and patterns within the images.

[0124] Step 4:

[0125] The server integrates the converted text information and extracted keyframes to generate a work instruction document. The input consists of text data and keyframe image data, and the output is the generated work instruction document. The integration method utilizes natural language processing to format the procedure step by step.

[0126] Step 5:

[0127] The server saves the generated work instructions to a shared data environment in the cloud. The input is the work instructions, and the output is the work instructions stored in cloud storage. Specifically, the data is uploaded to the cloud via the Internet with security ensured.

[0128] Step 6:

[0129] The next time the user performs a similar task, they will put on the visual device, and the server will display the saved work instructions. The input is the work instructions retrieved from the cloud, and the output is the procedure displayed on the visual device. Specifically, the visual device's display sequentially presents the instructions for each step.
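Step 3 of this flow detects changes within the images in order to extract keyframes, but the description does not name a specific algorithm. The following is a minimal sketch that assumes simple frame differencing: a frame becomes a keyframe when its mean absolute pixel difference from the previous keyframe exceeds a threshold. Frames are represented as plain lists of grayscale values so that the example runs without an image library; the function names and the threshold value are assumptions for illustration.

```python
from typing import List, Sequence

Frame = Sequence[Sequence[int]]  # grayscale image as rows of 0-255 values


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average absolute pixel difference between two same-sized frames."""
    total = 0
    count = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            total += abs(pa - pb)
            count += 1
    return total / count if count else 0.0


def select_keyframes(frames: List[Frame], threshold: float = 20.0) -> List[int]:
    """Return indices of frames that differ strongly from the last keyframe."""
    if not frames:
        return []
    keyframes = [0]  # always keep the first frame as the reference
    for index in range(1, len(frames)):
        if mean_abs_diff(frames[keyframes[-1]], frames[index]) > threshold:
            keyframes.append(index)
    return keyframes


# Hypothetical 2x2 frames: a small change is ignored, a large one is kept.
dark = [[0, 0], [0, 0]]
slightly_brighter = [[5, 5], [5, 5]]
bright = [[200, 200], [200, 200]]
print(select_keyframes([dark, slightly_brighter, bright, bright]))  # [0, 2]
```

In practice the frames would come from an image processing library such as OpenCV, which the description names, and the threshold would need tuning for the camera and the workplace.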

[0130] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0131] To implement the system of the present invention, a collaborative process involving the user, terminal, server, and emotion engine is employed. Through this process, it becomes possible not only to automate tasks but also to optimize work procedures based on the user's emotional state.

[0132] First, the user provides the details of their work to the system using a terminal. The user explains the work procedure verbally while performing operations on the screen. The terminal collects this audio and screen recording in real time. The recorded data is sent to the server.

[0133] The server analyzes the received data and converts the voice data of the business into text data using speech recognition technology. Meanwhile, it uses image analysis algorithms to extract important operational frames related to the business procedure from the screen data.

[0134] The emotion engine recognizes emotional data from the user's tone of voice and facial expressions. This emotional data reflects the user's psychological state during the work explanation and is used to improve the efficiency of work performance. For example, if the user is showing signs of stress, the emotion engine can add supplementary explanations to make the work procedure easier to understand.

[0135] The server integrates this data to generate a work manual. This manual includes text data, image data, and supplementary information based on the user's emotional state, supporting specific and efficient work execution. The generated manual is stored on a cloud-based storage system, which the user can access.

[0136] Subsequently, when the user performs the same task again, the server automatically executes the work procedure according to the stored manual. Based on past emotional data obtained from the emotion engine, the procedure is fine-tuned to ensure that the task is performed efficiently and stress-free.

[0137] As a concrete example, consider a scenario where a user teaches a system the customer support inquiry handling process. The user operates the customer management system while verbally explaining the appropriate response to the inquiry. The terminal records this and sends it to the server, which converts the audio to text and the screen to images, while an emotion engine monitors the user's stress level. The server integrates these elements and, if the stress level is high, generates a simplified manual. This allows another user to apply the system and handle inquiries stress-free the next time.

[0138] The following describes the processing flow.

[0139] Step 1:

[0140] The user uses a terminal and explains the procedure verbally while demonstrating the task. The user operates the relevant software via the screen, indicating the necessary actions.

[0141] Step 2:

[0142] The terminal captures the user's voice while they are working and records their screen activity. It also records the user's facial expressions using a built-in or connected camera.

[0143] Step 3:

[0144] The terminal sends the recorded audio, screen data, and facial expression data to the server. These are compressed in real time and transferred using a secure communication method.

[0145] Step 4:

[0146] The server converts the audio data into text data using a speech recognition algorithm. This process utilizes a customized dictionary to handle specialized terminology specific to the business.

[0147] Step 5:

[0148] The server analyzes screen data, identifies keyframes and interactions of user actions, and extracts image data to clarify business procedures.

[0149] Step 6:

[0150] The emotion engine analyzes the user's emotional state from their voice tone and facial expressions. Based on this analysis, it determines at which step the user is experiencing stress.

[0151] Step 7:

[0152] The server generates a work manual based on the text data, the image data, and the emotion analysis results. The generated manual includes supplementary information and warnings tailored to the stress points identified by the emotion analysis.

[0153] Step 8:

[0154] The server saves the completed business manual to cloud storage, making it accessible to users. The manual includes comments and suggestions for improvement based on user feedback.

[0155] Step 9:

[0156] The next time the same user or another user performs a similar task, the server refers to the stored manual and prepares to execute the task automatically. Emotion data is used to optimize the procedure for smoother task execution.

[0157] Step 10:

[0158] The server saves sentiment analysis results and execution logs after task completion, which are then used to further improve operational efficiency and enhance the user experience. Through this process, user feedback can be incorporated into the design of business processes.
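Steps 6 and 7 of this flow determine at which step the user experiences stress and insert supplementary information at those points. A minimal sketch of that annotation follows. It assumes the emotion engine has already produced a stress score between 0 and 1 for each step; the score scale, the threshold, and the names `ManualStep` and `annotate_stress_points` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ManualStep:
    number: int
    text: str
    notes: List[str] = field(default_factory=list)


def annotate_stress_points(steps: List[ManualStep],
                           stress_by_step: Dict[int, float],
                           threshold: float = 0.7) -> List[int]:
    """Add a note to each step whose stress score exceeds the threshold.

    Returns the numbers of the steps that were flagged.
    """
    flagged = []
    for step in steps:
        score = stress_by_step.get(step.number, 0.0)
        if score > threshold:
            step.notes.append(
                f"Caution: this step was stressful when recorded "
                f"(score {score:.2f}); take extra time here."
            )
            flagged.append(step.number)
    return flagged


# Hypothetical customer support procedure with assumed stress scores.
steps = [
    ManualStep(1, "Open the customer record."),
    ManualStep(2, "Select the refund category."),
    ManualStep(3, "Send the reply."),
]
flagged = annotate_stress_points(steps, {1: 0.2, 2: 0.9, 3: 0.4})
print(flagged)
print(steps[1].notes)
```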

[0159] (Example 2)

[0160] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0161] In modern business, efficient and stress-free work execution is essential. However, when procedures are complex and not optimized to accommodate the emotional states of individual users, work efficiency decreases and stress increases. Solving this problem makes it possible to automate tasks and improve the user experience.

[0162] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0163] In this invention, the server includes input means for inputting work content in voice and image format, recognition means for extracting emotional information from the user's voice tone and facial expression data, and optimization means for creating a work manual that includes supplementary information to optimize work execution considering the emotional information. As a result, work procedures are optimized based on the user's emotional state, enabling efficient and stress-free work execution.

[0164] "Input means" refers to a device or technology for receiving work details from a user in audio and image format and inputting them into the system.

[0165] "Conversion means" refers to a device or technology for analyzing acquired audio data and converting it into text data.

[0166] "Generation means" refers to a device or technology for automatically generating a business manual that describes business procedures based on data obtained from audio and images.

[0167] "Recognition means" refers to a device or technology for analyzing a user's voice tone and facial expression data and extracting emotional information from them.

[0168] "Optimization means" refers to a device or technology for creating a work manual that takes extracted emotional information into account and includes supplementary information to optimize work performance.

[0169] "Memory means" refers to a device or technology for storing generated operational manuals and making them accessible as needed.

[0170] "Execution means" refers to a device or technology that automatically performs tasks according to stored operational manuals and further fine-tunes procedures based on emotional information.

[0171] In implementing the system of the present invention, close cooperation is necessary between the user, terminal, server, and emotion engine. Through this cooperation, automation of tasks and improvement of the user experience can be achieved.

[0172] First, the user explains their work verbally and uses the terminal's screen as needed. For example, they might explain the customer support inquiry procedure verbally while operating a customer management system. During this process, the terminal collects the user's voice via a microphone and captures the actions on the screen. Recording software is used to collect the voice data, and capture software is used to collect the screen data.

[0173] The terminal transmits collected audio and image data to the server in real time. On the server, an audio recognition engine is used to convert the audio data into text. Specifically, audio analysis software is used for audio recognition. Meanwhile, the image data is analyzed by an image analysis tool to extract important operational details related to business procedures.

[0174] Simultaneously, an emotion engine embedded in the server analyzes the user's voice tone and facial expressions to recognize emotional information. This emotional information is used to optimize work procedures, and supplementary information is added according to the user's stress level.

[0175] The generated work manuals are stored in cloud storage. For example, a cloud storage service is used to securely store and retrieve the data. Users can access the stored manuals at any time, enabling automated and efficient procedure execution based on the manuals when repeating tasks.

[0176] A concrete example of a prompt message could be: "Please provide a voice-based explanation of the customer support inquiry handling procedure, perform operations on the customer management system using the terminal, and generate a manual based on the stress level." By following this prompt message and giving instructions to the system, an automated process is achieved.

[0177] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0178] Step 1:

[0179] The user explains the work content verbally and uses the terminal's operation screen. Specifically, the user explains the work procedure verbally and displays or inputs relevant information using the terminal's screen. The input consists of voice data and screen operation data, which are used in the next processing step.

[0180] Step 2:

[0181] The terminal records audio data from the user and captures screen operations. Specifically, it records audio data through the microphone and records the screen using screen capture software. The audio data and screen operation data are sent to the server as output from the terminal.

[0182] Step 3:

[0183] The server converts the received audio data into text data using speech recognition technology. For example, it uses dedicated speech recognition software to process the audio waveform into text information. The converted text data is output and its contents are used to analyze business procedures.

[0184] Step 4:

[0185] The server analyzes the received screen operation data using an image analysis tool. Specifically, it extracts important frames from the captured screen and identifies the content of each operation. The image analysis algorithm processes the screen data, and the results of the operation content extraction are output.

[0186] Step 5:

[0187] The emotion engine on the server recognizes user emotional information from audio and video data. Based on voice tone and facial expression patterns, calculations are performed to extract emotional states such as stress levels. The recognized emotional information is then output and used as a reference for optimizing business operations.

[0188] Step 6:

[0189] The server integrates the text data, the extracted operation data, and the emotional information, and uses a generative AI model to create optimized operational manuals. From these inputs, specific procedure manuals are generated to streamline work processes. These operational manuals are the output and are stored in the cloud.

[0190] Step 7:

[0191] Users can access saved operational manuals and check work procedures as needed. They perform actual tasks based on the information in the manuals, and the system provides efficient support. The operational manuals, as output information, assist the user's operations.
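The server-side portion of this flow (Steps 3 to 6) can be pictured as a pipeline in which each means is a replaceable function. The sketch below is an illustration under that assumption only. The speech recognition, image analysis, emotion recognition, and generation functions are stand-ins that return fixed values, since the description does not specify their interfaces, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Recording:
    audio: bytes
    frames: List[bytes]


@dataclass
class Manual:
    transcript: str
    operations: List[str]
    stress_level: float
    body: str


def run_pipeline(recording: Recording,
                 transcribe: Callable[[bytes], str],
                 extract_operations: Callable[[List[bytes]], List[str]],
                 estimate_stress: Callable[[Recording], float],
                 generate: Callable[[str, List[str], float], str]) -> Manual:
    """Run the conversion, analysis, recognition, and generation stages in order."""
    transcript = transcribe(recording.audio)            # Step 3
    operations = extract_operations(recording.frames)   # Step 4
    stress = estimate_stress(recording)                 # Step 5
    body = generate(transcript, operations, stress)     # Step 6
    return Manual(transcript, operations, stress, body)


# Stand-in stages; real ones would call recognition and generation services.
def fake_transcribe(audio: bytes) -> str:
    return "Open the customer record, then reply to the inquiry."


def fake_extract(frames: List[bytes]) -> List[str]:
    return [f"operation captured in frame {i}" for i, _ in enumerate(frames)]


def fake_stress(recording: Recording) -> float:
    return 0.8


def fake_generate(transcript: str, operations: List[str], stress: float) -> str:
    lines = [transcript] + operations
    if stress > 0.5:
        lines.append("Note: extra guidance added because stress was high.")
    return "\n".join(lines)


manual = run_pipeline(Recording(b"...", [b"f0", b"f1"]),
                      fake_transcribe, fake_extract, fake_stress, fake_generate)
print(manual.body)
```

Passing the stages as functions keeps the flow testable without any external service, which is why the stand-ins are sufficient here.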

[0192] (Application Example 2)

[0193] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as a "server" and the smart device 14 as a "terminal".

[0194] In today's world, there is a demand for increased efficiency and reduced emotional burden in people's work. However, conventional systems have not adequately optimized the automation of work procedures, and in particular, they struggle to respond flexibly while considering the emotional state of users. Therefore, there is a need to achieve efficient work automation while reducing the stress and anxiety that users experience during work.

[0195] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0196] In this invention, the server includes an acquisition means for inputting work content in voice and image format, an emotion analysis means for analyzing emotional states, and a generation means for generating optimized work instruction manuals. This enables efficient and flexible automation of tasks in response to the user's emotions.

[0197] "Acquisition means" refers to a system that has the function of effectively inputting work details in the form of voice and images.

[0198] "Conversion means" refers to the technology that converts input audio data into corresponding text data.

[0199] The "generation means" refers to a device that has the function of automatically creating a work instruction manual containing work procedures from text and image data.

[0200] "Emotion analysis means" refers to technology that determines an individual's emotional state from the user's voice and other data.

[0201] "Optimization means" refers to a function that optimizes work instructions based on emotional data to match the emotional state of the user, thereby creating an efficient procedure.

[0202] "Memory means" refers to technologies for securely storing generated work instruction manuals and optimized data.

[0203] "Execution means" refers to a function that automatically carries out tasks based on the work instruction manual.

[0204] A "virtual environment" refers to a virtual space or infrastructure within a computer system used for performing business operations.

[0205] The system implementing this invention operates primarily through the cooperation of a server, a terminal, and a user. When a user begins work, the terminal acquires audio and image data in real time. This acquisition means has the function of converting the audio explanation of the work into text and the function of capturing the screen. The generated data is sent to the server.

[0206] The server uses speech recognition software to convert audio data into text data. The common speech recognition library "speech_recognition" is applied for this purpose. Furthermore, image analysis software is used to extract information necessary for business procedures from the acquired image data. The image processing library "OpenCV" is often used for this purpose.

[0207] In addition, the server performs sentiment analysis on the acquired data, using the "emotion_recognition" library to evaluate the user's emotional state. Based on this sentiment, the generation mechanism optimizes the work instructions, enabling users to perform their tasks more efficiently and with less stress. The optimized work instructions are stored in the memory mechanism and can be accessed at any time as needed.

[0208] One application of this system is a smart home robot that receives user instructions and performs cleaning and household chores. It can analyze the user's stress level and provide additional suggestions or support as needed. An example of a prompt message would be: "If the user instructs, 'Make dinner,' analyze the voice data to determine if the user is stressed and, if necessary, offer comforting suggestions to the user."

[0209] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0210] Step 1:

[0211] The terminal receives voice input for work-related tasks from the user and simultaneously captures screen operations. Both voice and image data are acquired as input. The terminal saves the acquired voice as digital data and records the screen capture in still image or video format.

[0212] Step 2:

[0213] The terminal sends audio data to the server. The server converts the received audio data into text data using the "speech_recognition" library. The output of this step is text data represented as character information.

[0214] Step 3:

[0215] The terminal sends image data to the server. The server uses image processing libraries such as "OpenCV" to analyze and extract important business operation frames. Specifically, it extracts image features related to a particular operation procedure and identifies them.

[0216] Step 4:

[0217] The server processes data obtained from audio and images using the "emotion_recognition" library to perform emotion analysis. It uses text data and metadata from images as input. Based on this analysis, it identifies the user's emotional state and calculates stress levels and satisfaction indicators.

[0218] Step 5:

[0219] The server integrates text data of acquired work content, image processing results, and emotional data to generate work instruction manuals. The generated manuals are optimized to adapt to the user's emotional state. These manuals provide guidance for efficiently executing specific work procedures.

[0220] Step 6:

[0221] The server stores optimized work instructions in a storage device, allowing users to access them later. By storing them in a cloud service, users can view the work instructions from any device.

[0222] Step 7:

[0223] When the user performs the same task again, the server provides data to execute the task according to the stored instructions. During this time, it makes fine adjustments based on sentiment analysis data to support the task execution.
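Step 4 of this flow calculates stress levels and satisfaction indicators, but the description does not explain how the "emotion_recognition" library computes them, and its interface is not assumed here. Purely as a stand-in, the sketch below derives both indicators from the converted text by counting words from two small hand-written word lists. The word lists, the function name, and the scoring rule are all assumptions and are far simpler than an actual emotion model.

```python
import re
from typing import Dict

# Hypothetical word lists; a real system would use a trained model instead.
STRESS_WORDS = {"confusing", "difficult", "hurry", "stuck", "worried"}
SATISFACTION_WORDS = {"done", "easy", "fine", "good", "smooth"}


def emotion_indicators(transcript: str) -> Dict[str, float]:
    """Return the share of stress and satisfaction words in the transcript."""
    words = re.findall(r"[a-z']+", transcript.lower())
    if not words:
        return {"stress": 0.0, "satisfaction": 0.0}
    stress_hits = sum(1 for w in words if w in STRESS_WORDS)
    satisfaction_hits = sum(1 for w in words if w in SATISFACTION_WORDS)
    return {
        "stress": stress_hits / len(words),
        "satisfaction": satisfaction_hits / len(words),
    }


result = emotion_indicators("This step is confusing and I am stuck")
print(result)  # {'stress': 0.25, 'satisfaction': 0.0}
```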

[0224] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0225] Data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
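The input to the data generation model 58, a prompt plus inference data of several types, can be sketched as a simple request structure. This is not the interface of any particular generative AI service. The class names, the field names, and the JSON layout are assumptions chosen only to show how an instruction and its accompanying audio, text, and image data might be bundled.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class InferenceItem:
    kind: str       # "audio", "text", or "image", as listed in the description
    reference: str  # hypothetical locator such as a file name


@dataclass
class InferenceRequest:
    prompt: str
    items: List[InferenceItem] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the prompt and its inference data into one JSON document."""
        return json.dumps(asdict(self), ensure_ascii=False)


# Hypothetical request; the file names are placeholders.
request = InferenceRequest(
    prompt="Summarize the recorded procedure as a step-by-step operations manual.",
    items=[
        InferenceItem("audio", "explanation.wav"),
        InferenceItem("image", "keyframe_001.png"),
    ],
)
print(request.to_json())
```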

[0226] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.

[0227] [Second Embodiment]

[0228] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.

[0229] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.

[0230] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0231] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.

[0232] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0233] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0234] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0235] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0236] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0237] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0238] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0239] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0240] An embodiment of the system of the present invention employs a process that includes the following procedures. In this process, the user, terminal, and server each play a specific role.

[0241] The user provides details of their work to the system via a terminal and verbally explains the series of operations involved. During this process, the terminal captures the user's screen in real time and simultaneously records the user's voice. This data is transmitted to the server as it progresses.

[0242] The server converts the audio data transmitted from the terminal into text data using speech recognition technology. This conversion process organizes the content explained in the audio into text. The server also analyzes the image data acquired from the terminal and extracts keyframes and operation details necessary for performing the task.

[0243] The server, acting as the generation mechanism, integrates the obtained text and image data to generate a detailed operational manual. This manual includes step-by-step instructions to ensure the user can accurately reproduce the initial operation. The generated operational manual is securely stored on a cloud-based storage device, allowing the user to access and modify it as needed.

[0244] The next time the user wants to perform the same task, the server automatically executes the procedure based on the saved operational manual. As the execution means, the server replicates the necessary steps on the terminal, automating the task. User intervention is minimized, resulting in increased efficiency.

[0245] For example, if a user is teaching an AI agent how to issue invoices, the user operates the invoice issuance software and provides explanations for each operation. The terminal collects this information and sends it to the server. The server converts the received audio into text and the screen operations into images, and integrates them to generate an invoice issuance manual. As a result, the server can then automatically perform the invoice issuance procedure according to this manual the next time around.
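The automatic execution in the invoice example can be sketched as replaying the stored manual one operation at a time while recording an execution log. The sketch assumes that the manual has already been reduced to named actions with arguments and that a handler exists for each action on the terminal; the action names, the handler table, and the log format are hypothetical, and no real invoicing software is operated.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Operation:
    action: str    # hypothetical action name, e.g. "open_form"
    argument: str


@dataclass
class LogEntry:
    step: int
    action: str
    ok: bool
    detail: str


def execute_manual(operations: List[Operation],
                   handlers: Dict[str, Callable[[str], str]]) -> List[LogEntry]:
    """Run operations in order and stop at the first one that cannot complete."""
    log: List[LogEntry] = []
    for number, op in enumerate(operations, start=1):
        handler = handlers.get(op.action)
        if handler is None:
            log.append(LogEntry(number, op.action, False, "no handler registered"))
            break  # stop rather than continue from an unknown state
        try:
            log.append(LogEntry(number, op.action, True, handler(op.argument)))
        except Exception as exc:
            log.append(LogEntry(number, op.action, False, str(exc)))
            break
    return log


# Stand-in handlers that only describe what they would do.
handlers = {
    "open_form": lambda name: f"opened {name}",
    "fill_field": lambda value: f"entered {value}",
}
operations = [
    Operation("open_form", "invoice"),
    Operation("fill_field", "customer=ACME"),
    Operation("send_email", "billing@example.com"),  # no handler on purpose
]
log = execute_manual(operations, handlers)
for entry in log:
    print(entry)
```

Stopping at the first failure is a design choice made for the sketch; it keeps the log easy to read when a task has to be resumed by the user.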

[0246] This approach integrates the entire process, from creating operational manuals to automating their execution, into a single workflow, providing users with an effective means to improve operational efficiency.

[0247] The following describes the processing flow.

[0248] Step 1:

[0249] The user uses a terminal and explains the task verbally while demonstrating it. The user operates the software on the screen while giving verbal instructions to demonstrate specific operations.

[0250] Step 2:

[0251] The terminal activates its screen capture function and records the user's screen activity. It also records the user's voice explanation via the microphone. This data is temporarily stored for later transmission to the server.

[0252] Step 3:

[0253] The terminal compresses the recorded screen and audio data and sends them to the server using a secure communication protocol. Data integrity is verified to ensure that no data is lost.

[0254] Step 4:

[0255] The server converts the audio data received from the terminal into text data using speech recognition technology. In this process, appropriate dictionary data is used to handle specialized terminology and industry jargon that needs to be recognized.

[0256] Step 5:

[0257] The server identifies important keyframes from the screen data and extracts images related to user actions. These images will later become part of the operational manual that will be generated.

[0258] Step 6:

[0259] The server combines the converted text data with extracted images to automatically generate a detailed manual outlining the work procedures. This manual organizes the work procedures step by step and presents them in a user-friendly format.

[0260] Step 7:

[0261] The server generates and stores the work manuals in cloud storage. Users can access these manuals at any time to review or modify their work procedures.

[0262] Step 8:

[0263] The next time the user requests that the same task be performed by AI, the server will automatically execute the steps based on the stored task manual. The task will proceed autonomously, following instructions from the server to the terminal.

[0264] Step 9:

[0265] The server saves execution logs of tasks and makes them accessible to users for future improvement and verification. This helps improve task accuracy and makes troubleshooting more efficient.
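Step 3 of this flow compresses the recorded data and verifies its integrity, without naming the compression method or the verification method. The following minimal sketch assumes zlib compression and a SHA-256 digest, both chosen only for illustration; the secure communication protocol itself is outside the scope of the sketch, and the function names are hypothetical.

```python
import hashlib
import zlib
from typing import Tuple


def pack(payload: bytes) -> Tuple[bytes, str]:
    """Compress the payload and return it with a digest of the original bytes."""
    digest = hashlib.sha256(payload).hexdigest()
    return zlib.compress(payload), digest


def unpack(compressed: bytes, expected_digest: str) -> bytes:
    """Decompress and confirm that the bytes match the digest sent with them."""
    payload = zlib.decompress(compressed)
    if hashlib.sha256(payload).hexdigest() != expected_digest:
        raise ValueError("integrity check failed")
    return payload


# Terminal side: hypothetical recorded data.
recorded = b"screen and audio data " * 100
blob, digest = pack(recorded)

# Server side: restore the data and verify that nothing was lost.
restored = unpack(blob, digest)
print(len(recorded), len(blob), restored == recorded)
```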

[0266] (Example 1)

[0267] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0268] In traditional business processes, the creation of documentation and the automation of procedures were often done manually, which was time-consuming and labor-intensive. Furthermore, manual document creation was prone to human error, compromising efficiency and accuracy. In addition, there was a lack of comprehensive systems to streamline and automate these business processes.

[0269] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0270] In this invention, the server includes information gathering means for acquiring audio and visual information for users to describe work content, information conversion means for converting the audio information into text information, and manual generation means for constructing work procedures using the visual and text information. This enables the automation and efficient execution of work procedures.

[0271] "User" refers to an individual or organization that uses the system to input business details and benefits from the streamlined process.

[0272] "Job description" refers to information including procedures for tasks and operations performed with a specific purpose.

[0273] "Audio information" refers to sound data that records work-related instructions and content explained verbally by the user.

[0274] "Visual information" refers to digital image data, including user interfaces and visual elements in business processes.

[0275] "Information gathering means" refers to devices or functions used to collect audio and visual information obtained from users.

[0276] "Information conversion means" refers to technology or equipment for converting acquired audio information into text information.

[0277] "Textual information" refers to data in which audio information is represented as text.

[0278] "Manual generation means" refers to a device or function that documents business procedures based on collected visual and textual information.

[0279] "Business procedures" refer to information that describes the sequence of operations and actions necessary to perform a task.

[0280] The "data storage area" refers to a storage system that stores data such as business procedures and manuals and enables access as needed.

[0281] The "business execution means" refers to a system that has the function of automatically reproducing or executing a process according to the stored business procedures.

[0282] The "network environment" refers to a technical infrastructure that enables the transmission and reception of digital data through the Internet or other communication means.
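The relationship between the terms defined above can be pictured as a small data model. The following is only an illustrative sketch: the names `ProcedureStep`, `Manual`, and `generate_manual`, and the idea of referring to an image by a file name, are assumptions introduced here and do not appear in the specification.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProcedureStep:
    """One step of a business procedure: text plus an optional image reference."""
    index: int
    text: str                        # textual information converted from audio
    image_ref: Optional[str] = None  # hypothetical reference to a captured image


@dataclass
class Manual:
    """A manual documenting a business procedure as ordered steps."""
    title: str
    steps: List[ProcedureStep] = field(default_factory=list)


def generate_manual(title: str, texts: List[str],
                    image_refs: List[Optional[str]]) -> Manual:
    """Sketch of a manual generation means: pair each text with an image reference."""
    if len(texts) != len(image_refs):
        raise ValueError("texts and image_refs must have the same length")
    steps = [ProcedureStep(i + 1, text, ref)
             for i, (text, ref) in enumerate(zip(texts, image_refs))]
    return Manual(title=title, steps=steps)
```

In this sketch the textual information and visual information are assumed to have already been split into matching segments; how that segmentation is done is left open.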

[0283] To implement this invention, a system is constructed in which a user, a terminal, and a server cooperate to achieve business automation. The user verbally explains the business content and performs the procedure on the terminal. The terminal captures the user's operation screen in real time and records the voice. A computer equipped with a voice input device and a display capture function is used for this terminal.

[0284] The server converts the voice data transmitted from the terminal into text by voice recognition technology. General-purpose voice recognition technology (for example, Google Cloud Speech-to-Text) can be used for this voice recognition. Also, the server uses image recognition technology to analyze the captured visual information and extract the key points of the business. Here, image processing libraries such as OpenCV can be used for image analysis.
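One common way to extract the key points from captured screens is to keep only those frames that differ noticeably from the last frame kept. The sketch below illustrates that idea in dependency-free Python; it is not the method claimed in the specification. The frame representation (rows of grayscale values), the function names, and the threshold value are all assumptions, and a practical system would more likely perform the comparison with an image processing library such as OpenCV.

```python
from typing import List, Sequence

Frame = Sequence[Sequence[int]]  # grayscale frame: rows of pixel values 0-255


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average absolute per-pixel difference between two same-sized frames."""
    total = 0
    count = 0
    for row_a, row_b in zip(a, b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            total += abs(pixel_a - pixel_b)
            count += 1
    return total / count if count else 0.0


def select_keyframes(frames: List[Frame], threshold: float = 10.0) -> List[int]:
    """Return indices of frames that differ from the last kept frame by more than threshold."""
    if not frames:
        return []
    kept = [0]  # the first frame is always kept as the initial reference
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[kept[-1]], frames[i]) > threshold:
            kept.append(i)
    return kept
```

A low threshold keeps many near-duplicate frames, while a high threshold may miss small but meaningful changes such as a single field being filled in, so the value would need tuning for the target software.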

[0285] The server integrates the converted text information and the analyzed visual information and generates a manual describing the business procedure. This manual is stored in a cloud storage system (for example, Amazon S3) and can be accessed at any time. The user can also modify this manual as needed.
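The store, access, and modify cycle described here can be sketched with a key-value store holding the manual as a JSON document. The class below is an in-memory stand-in used purely for illustration; it does not call any real cloud storage API, and the class name, key format, and manual layout are assumptions.

```python
import json
from typing import Dict


class InMemoryManualStore:
    """Illustrative stand-in for a cloud object store: keys map to JSON documents."""

    def __init__(self) -> None:
        self._objects: Dict[str, str] = {}

    def save(self, key: str, manual: dict) -> None:
        """Serialize the manual and store it under the given key."""
        self._objects[key] = json.dumps(manual, ensure_ascii=False)

    def load(self, key: str) -> dict:
        """Return the stored manual; raises KeyError if the key is unknown."""
        return json.loads(self._objects[key])

    def update_step(self, key: str, step_index: int, new_text: str) -> None:
        """Let a user correct the text of one step (zero-based index) and re-save."""
        manual = self.load(key)
        manual["steps"][step_index]["text"] = new_text
        self.save(key, manual)
```

For example, a manual saved under a hypothetical key such as `manuals/invoice.json` can later be loaded, corrected with `update_step`, and loaded again with the correction in place.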

[0286] To streamline actual operations, the server can automatically execute the process based on the stored operation procedures. This minimizes user intervention and improves the reproducibility and efficiency of operations. As a specific example, when a user issues an invoice, the server can automatically reproduce the procedure after being taught the input steps for the necessary information once. This significantly reduces the burden of repetitive daily operations.
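The "teach once, reproduce later" behavior in the invoice example can be sketched as replaying a recorded list of actions against the target software. Everything in the sketch is hypothetical: `FakeInvoiceForm` stands in for the real invoicing software, and the step format with `action` and `target` keys is an assumption made for illustration.

```python
from typing import Dict, List


class FakeInvoiceForm:
    """Hypothetical stand-in for the invoicing software's user interface."""

    def __init__(self) -> None:
        self.fields: Dict[str, str] = {}
        self.clicked: List[str] = []

    def fill(self, field: str, value: str) -> None:
        self.fields[field] = value

    def click(self, button: str) -> None:
        self.clicked.append(button)


def replay(steps: List[dict], ui: FakeInvoiceForm, values: Dict[str, str]) -> List[str]:
    """Re-run recorded steps against the UI and return a simple execution log."""
    log: List[str] = []
    for step in steps:
        action, target = step["action"], step["target"]
        if action == "fill":
            ui.fill(target, values[target])  # value supplied for this particular run
        elif action == "click":
            ui.click(target)
        else:
            raise ValueError(f"unknown action: {action}")
        log.append(f"{action}:{target}")
    return log
```

Keeping the recorded steps separate from the per-run values is one possible design choice: the procedure is taught once, while the data entered (customer, amount) changes with each invoice.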

[0287] As an example of the prompt text, a format such as "Please explain the operation procedure for issuing an invoice. Include details of necessary field inputs and button clicks in this explanation." can be used. The operation manual generated using such prompts contains specific and reproducible information, ensuring consistent operations and leading to improved operational efficiency.

[0288] The flow of the specific process in Example 1 will be described using Figure 11.

[0289] Step 1:

[0290] The user verbally explains the operation details to the terminal and operates the software. The input includes the user's voice and operation details. As a specific action, the user launches the invoice issuance software and explains the operation while entering the necessary items. At this time, the terminal uses a voice input device and a screen capture function.

[0291] Step 2:

[0292] The terminal captures the user's operation screen and voice in real-time and transmits them to the server as voice data and visual data. The input is the user's voice and operation screen, and the output is the voice data and visual data transmitted to the server. Specifically, the terminal saves the invoice issuance screen currently being displayed as a screenshot and records the spoken voice.

[0293] Step 3:

[0294] The server converts the transmitted audio data into text data using speech recognition technology. The input for this step is audio data, and the output is text data. Here, speech recognition software is used to organize the user's explanation into textual information.

[0295] Step 4:

[0296] The server analyzes visual data using image analysis technology and extracts keyframes to identify business procedures. The input is visual data, and the output is data indicating important operational points. Specifically, button clicks and data entry steps are identified from the image.

[0297] Step 5:

[0298] The server integrates the converted text data and parsed visual data to generate a manual outlining the business procedures. The input consists of text and visual data, while the output is a detailed business manual. The server combines these to clearly demonstrate the step-by-step process of invoice issuance.

[0299] Step 6:

[0300] The server saves the generated work manuals to cloud storage. The input is the work manual, and the output is the saved data accessible online. This allows users to access the manuals at any time and review or edit their contents.

[0301] Step 7:

[0302] In the next task, the server will initiate an automated execution process based on the saved work procedures. The input is the work manual, and the output is the execution of the automated work procedures. The server will reproduce the previously saved procedures and automatically perform the specific operations within the software.

[0303] (Application Example 1)

[0304] Next, Application Example 1 will be described. In the following description, the data processing device 12 is referred to as a "server", and the smart glasses 214 are referred to as a "terminal".

[0305] In a logistics center, it is important for new workers to learn the shipping process quickly and efficiently. However, the current manual guidance method cannot cope well with the complexity of the operations and the variety of procedures, which makes efficient training difficult. There is a need to improve this situation, standardize the operation procedures, and support the smooth execution of the actual work.

[0306] The specific processing by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0307] In this invention, the server includes an acquisition means for collecting work content as audio and video, a conversion means for converting the audio into character information, and a generation means for generating a work instruction manual describing the work procedure based on the video and the character information. Thereby, it is possible to present the work procedure in real time using a visual device, enabling efficient new worker training and automatic execution of the work.

[0308] The "work content" is a series of business activities including the shipping process in a logistics center.

[0309] The "acquisition means for collecting as audio and video" is a device for recording the operation procedures and audio explanations of workers as audio-video data.

[0310] The "conversion means for converting the audio into character information" is a technology for analyzing the collected audio data and expressing it as character information.

[0311] "Display means for presenting work procedures using visual devices" refers to a function that shows procedures to workers in real time through visual devices such as smart glasses.

[0313] "Recording means" refers to equipment or technology for storing generated work instructions and keeping them accessible as needed.

[0314] "Control means" refers to mechanisms or technologies for automatically executing tasks based on work instructions.

[0315] A "shared data environment" is a system for storing and synchronizing data across multiple devices and users via a network such as the cloud.

[0316] The system for implementing this invention mainly consists of three components: a server, a terminal (including a visual device), and a user. Specifically, it is implemented in the following form.

[0317] When users perform shipping processes at a logistics center, they use visual devices such as smart glasses to collect operating procedures and audio instructions. The visual devices acquire work content in real time as audio and video. As a result, the devices, as acquisition means, can record work procedures in detail.

[0318] Audio data obtained from visual devices is converted into text information by a server. Specifically, speech recognition technology is used to convert the audio data into text data. This conversion uses speech analysis software (e.g., Google Speech-to-Text API).

[0319] Next, the server generates a work instruction sheet based on the acquired video data and converted text information. The work instruction sheet documents the procedure and presents it to the user step by step. Image processing technology (e.g., OpenCV) and data integration technology (e.g., Python data processing libraries) are used in this generation process.
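One way to picture the data integration step is to pair each transcribed utterance with the most recent key frame captured before it was spoken. The sketch below assumes that both the text segments and the key frames carry timestamps in seconds; the specification does not state this, and the function name and data layout are illustrative only.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def align_segments_to_keyframes(
    segments: List[Tuple[float, str]],  # (start time in seconds, text)
    keyframe_times: List[float],        # capture times in seconds, sorted ascending
) -> List[Tuple[str, Optional[float]]]:
    """Pair each segment with the latest key frame captured at or before its start."""
    result: List[Tuple[str, Optional[float]]] = []
    for start, text in segments:
        position = bisect_right(keyframe_times, start)
        frame_time = keyframe_times[position - 1] if position > 0 else None
        result.append((text, frame_time))
    return result
```

A segment spoken before any key frame exists is paired with `None`, so the corresponding step in the work instruction sheet would simply have no image.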

[0320] The generated work instructions are stored in a shared data environment as a means of recording data. This environment utilizes a cloud platform for data storage and synchronization. Users can access these instructions as needed to review and execute their work.

[0321] As a concrete example, consider the shipping process for a new product at a logistics center. To learn this process, workers wear smart glasses, and the system captures and transmits visual and audio data to a server. The work instructions created from this data are then presented to the new worker through the visual device the next time they perform a similar task, facilitating rapid learning.

[0322] The following are examples of prompt messages:

[0323] "Please convert the audio explanation of the logistics center's shipping process into text, analyze the video footage of the operations, and output the results in the format of an operations manual."

[0324] This system, through its function of presenting work procedures using visual devices, can promote the standardization and efficiency of work processes and efficiently support new employee training in logistics centers.

[0325] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0326] Step 1:

[0327] The user wears a visual device and begins the shipping process at the logistics center. The user's visual device simultaneously collects audio and video during the work. The input consists of audio data emitted by the user and real-time video data, which are temporarily recorded in the visual device's storage.

[0328] Step 2:

[0329] The terminal sends the collected audio data to the server, which converts the audio into text using speech recognition technology. The input is audio data, and the output is the text data that the server's speech recognition software produces from it. In this way, the procedures explained in audio are recorded as text.

[0330] Step 3:

[0331] The server analyzes video data sent from the terminal and extracts key frames important for the task. The input is video data, and the output is image data extracted as key frames. The server uses video frame analysis technology to detect important scenes; specifically, it detects changes and patterns within the images.

[0332] Step 4:

[0333] The server integrates the converted text information and extracted keyframes to generate a work instruction document. The input consists of text data and keyframe image data, and the output is the generated work instruction document. The integration method utilizes natural language processing to format the procedure step by step.

[0334] Step 5:

[0335] The server saves the generated work instruction document to a shared data environment in the cloud. The input is the work instruction document, and the output is the work instruction document stored in cloud storage. The specific operation involves uploading the data to the cloud via the internet while ensuring security.

[0336] Step 6:

[0337] The next time the user performs a similar task, they will put on the visual device, and the server will display the saved work instructions. The input is the work instructions retrieved from the cloud, and the output is the procedure displayed on the visual device. Specifically, the visual device's display sequentially presents the instructions for each step.
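The sequential presentation in Step 6 can be sketched as a small cursor over the stored instructions. The class name and the text format below are assumptions; how the visual device actually renders the text and detects that a step is finished is outside this sketch.

```python
from typing import List


class StepPresenter:
    """Holds the stored instructions and exposes them one step at a time."""

    def __init__(self, steps: List[str]) -> None:
        if not steps:
            raise ValueError("at least one step is required")
        self._steps = list(steps)
        self._position = 0

    def current(self) -> str:
        """Text to show on the display for the current step."""
        return f"Step {self._position + 1}/{len(self._steps)}: {self._steps[self._position]}"

    def advance(self) -> bool:
        """Move to the next step; return False when already at the last step."""
        if self._position + 1 >= len(self._steps):
            return False
        self._position += 1
        return True
```

The worker (or the system, on detecting completion) would call `advance()` after each step, and the display would be refreshed with `current()`.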

[0338] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0339] To implement the system of the present invention, a collaborative process involving the user, terminal, server, and emotion engine is employed. Through this process, it becomes possible not only to automate tasks but also to optimize work procedures based on the user's emotional state.

[0340] First, the user provides the details of their work to the system using a terminal. The user explains the work procedure verbally while performing operations on the screen. The terminal collects this audio and screen recording in real time. The recorded data is sent to the server.

[0341] The server analyzes the received data and converts the voice data of the business into text data using speech recognition technology. Meanwhile, it uses image analysis algorithms to extract important operational frames related to the business procedure from the screen data.

[0342] The emotion engine recognizes emotional data from the user's tone of voice and facial expressions. This emotional data reflects the user's psychological state during the work explanation and is used to improve the efficiency of work performance. For example, if the user is showing signs of stress, the emotion engine can add supplementary explanations to make the work procedure easier to understand.

[0343] The server integrates this data to generate a work manual. This manual includes text data, image data, and supplementary information based on the user's emotional state, supporting specific and efficient work execution. The generated manual is stored on a cloud-based storage system, which the user can access.
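The addition of supplementary information at stressful steps can be sketched as follows. The sketch assumes that the emotion engine yields a stress score between 0 and 1 for each step; the score scale, the threshold of 0.7, and the wording of the note are all assumptions introduced for illustration.

```python
from typing import Dict, List


def annotate_stress_points(
    steps: List[str],
    stress_by_step: Dict[int, float],  # step number (1-based) -> stress score 0.0-1.0
    threshold: float = 0.7,
) -> List[str]:
    """Render numbered steps, adding a note after each step whose stress meets the threshold."""
    lines: List[str] = []
    for number, text in enumerate(steps, start=1):
        lines.append(f"{number}. {text}")
        if stress_by_step.get(number, 0.0) >= threshold:
            lines.append(f"   Note: step {number} was found difficult; "
                         "proceed carefully and re-check the input.")
    return lines
```

Steps without a recorded score are treated as unstressed, so missing emotion data never adds notes to the manual.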

[0344] Subsequently, when the user performs the same task again, the server automatically executes the work procedure according to the stored manual. Based on past emotional data obtained from the emotion engine, the procedure is fine-tuned to ensure that the task is performed efficiently and stress-free.

[0345] As a concrete example, consider a scenario where a user teaches the system the customer support inquiry handling process. The user operates the customer management system while verbally explaining the appropriate response to the inquiry. The terminal records this and sends it to the server, which converts the audio to text and the screen recording to images, while the emotion engine monitors the user's stress level. The server integrates these elements and, if the stress level is high, generates a simplified manual. This allows another user to use the system and handle inquiries with less stress the next time.

[0346] The following describes the processing flow.

[0347] Step 1:

[0348] The user uses a terminal and explains the procedure verbally while demonstrating the task. The user operates the relevant software via the screen, indicating the necessary actions.

[0349] Step 2:

[0350] The terminal captures the user's voice while they are working and records their screen activity. It also records the user's facial expressions using a built-in or connected camera.

[0351] Step 3:

[0352] The terminal sends the recorded audio, screen data, and facial expression data to the server. These are compressed in real time and transferred using a secure communication method.

[0353] Step 4:

[0354] The server converts the audio data into text data using a speech recognition algorithm. This process utilizes a customized dictionary to handle specialized terminology specific to the business.
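The effect of a customized dictionary can be illustrated by a post-processing pass that replaces commonly misrecognized phrases with the correct business terms. This is only one possible reading: the specification does not say whether the dictionary is applied during or after recognition, and the function name and the example entries are hypothetical.

```python
import re
from typing import Dict


def apply_term_dictionary(transcript: str, terms: Dict[str, str]) -> str:
    """Replace misrecognized phrases with canonical terms (whole words, case-insensitive)."""
    # Process longer phrases first so they take precedence over shorter overlapping ones.
    for wrong in sorted(terms, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(wrong) + r"\b", flags=re.IGNORECASE)
        replacement = terms[wrong]
        # A function is passed to sub() so the replacement text is used literally.
        transcript = pattern.sub(lambda match, text=replacement: text, transcript)
    return transcript
```

For instance, a dictionary mapping "in voice" to "invoice" would repair a transcript in which the recognizer split the word.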

[0355] Step 5:

[0356] The server analyzes screen data, identifies keyframes and interactions of user actions, and extracts image data to clarify business procedures.

[0357] Step 6:

[0358] The emotion engine analyzes the user's emotional state from their voice tone and facial expressions. Based on this analysis, it determines at which step the user is experiencing stress.

[0359] Step 7:

[0360] The server generates a work manual based on the text data, image data, and sentiment analysis results. In the generated manual, the server inserts supplementary information and warnings specifically tailored to the points at which the sentiment analysis indicated that the user experienced stress.

[0361] Step 8:

[0362] The server saves the completed business manual to cloud storage, making it accessible to users. The manual includes comments and suggestions for improvement based on user feedback.

[0363] Step 9:

[0364] The next time the same user or another user performs a similar task, the server refers to the stored manual and prepares to execute the task automatically. Emotional data is used to optimize the procedure for smoother task execution.

[0365] Step 10:

[0366] The server saves sentiment analysis results and execution logs after task completion, which are then used to further improve operational efficiency and enhance the user experience. Through this process, user feedback can be incorporated into the design of business processes.
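How saved logs might feed later improvement can be sketched by writing one JSON record per executed step and then averaging the recorded stress per step across runs. The record fields (`run_id`, `step`, `status`, `stress`) and the use of JSON lines are assumptions made for this sketch.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable


def make_log_line(run_id: str, step: int, status: str, stress: float) -> str:
    """Serialize one executed step as a single JSON line."""
    return json.dumps({"run_id": run_id, "step": step, "status": status, "stress": stress})


def average_stress_by_step(lines: Iterable[str]) -> Dict[int, float]:
    """Average the recorded stress score for each step number across all log lines."""
    sums: Dict[int, float] = defaultdict(float)
    counts: Dict[int, int] = defaultdict(int)
    for line in lines:
        record = json.loads(line)
        sums[record["step"]] += record["stress"]
        counts[record["step"]] += 1
    return {step: sums[step] / counts[step] for step in sums}
```

Steps with a consistently high average would be natural candidates for redesign or for additional explanation in the manual.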

[0367] (Example 2)

[0368] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0369] In modern business, efficient and stress-free work execution is essential. However, when procedures are complex and not optimized to accommodate the emotional states of individual users, work efficiency decreases and stress increases. Solving this problem makes it possible to automate tasks and improve the user experience.

[0370] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0371] In this invention, the server includes input means for inputting work content in voice and image format, recognition means for extracting emotional information from the user's voice tone and facial expression data, and optimization means for creating a work manual that includes supplementary information to optimize work execution considering the emotional information. As a result, work procedures are optimized based on the user's emotional state, enabling efficient and stress-free work execution.

[0372] "Input means" refers to a device or technology for receiving work details from a user in audio and image format and inputting them into the system.

[0373] "Conversion means" refers to a device or technology for analyzing acquired audio data and converting it into text data.

[0374] "Generation means" refers to a device or technology for automatically generating a business manual that describes business procedures based on data obtained from audio and images.

[0375] "Recognition means" refers to a device or technology for analyzing a user's voice tone and facial expression data and extracting emotional information from them.

[0376] "Optimization means" refers to a device or technology for creating a work manual that takes extracted emotional information into account and includes supplementary information to optimize work performance.

[0377] "Memory means" refers to a device or technology for storing generated operational manuals and making them accessible as needed.

[0378] "Execution means" refers to a device or technology that automatically performs tasks according to stored operational manuals and further fine-tunes procedures based on emotional information.

[0379] In implementing the system of the present invention, close cooperation is necessary between the user, terminal, server, and emotion engine. Through this cooperation, automation of tasks and improvement of the user experience can be achieved.

[0380] First, the user explains their work verbally and uses the terminal's screen as needed. For example, they might explain the customer support inquiry procedure verbally while operating a customer management system. During this process, the terminal collects the user's voice via a microphone and captures the actions on the screen. Recording software is used to collect the voice data, and capture software is used to collect the screen data.

[0381] The terminal transmits collected audio and image data to the server in real time. On the server, an audio recognition engine is used to convert the audio data into text. Specifically, audio analysis software is used for audio recognition. Meanwhile, the image data is analyzed by an image analysis tool to extract important operational details related to business procedures.

[0382] Simultaneously, an emotion engine embedded in the server analyzes the user's voice tone and facial expressions to recognize emotional information. This emotional information is used to optimize work procedures, and supplementary information is added according to the user's stress level.
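The specification does not describe how voice tone and facial expression are combined into a stress level. As one simple possibility, the sketch below takes a weighted average of two scores and maps the result to a coarse level. The 0 to 1 score scale, the weights, the level boundaries, and both function names are assumptions.

```python
def fuse_stress(voice_score: float, face_score: float, voice_weight: float = 0.5) -> float:
    """Weighted average of a voice-based and a face-based stress score, clamped to 0.0-1.0."""
    if not 0.0 <= voice_weight <= 1.0:
        raise ValueError("voice_weight must be between 0 and 1")
    fused = voice_weight * voice_score + (1.0 - voice_weight) * face_score
    return min(1.0, max(0.0, fused))


def stress_level(score: float) -> str:
    """Map a fused score to a coarse level used to decide how much support to add."""
    if score >= 0.7:
        return "high"
    if score >= 0.4:
        return "medium"
    return "low"
```

A coarse level rather than a raw score keeps the later decision simple, for example adding supplementary information only when the level is "high".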

[0383] The generated work manuals are stored in cloud storage. For example, a cloud storage service is used to securely store and retrieve the data. Users can access the stored manuals at any time, enabling automated and efficient procedure execution based on the manuals when repeating tasks.

[0384] A concrete example of a prompt message could be: "Please provide a voice-based explanation of the customer support inquiry handling procedure, perform operations on the customer management system using the terminal, and generate a manual based on the stress level." By following this prompt message and giving instructions to the system, an automated process is achieved.

[0385] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0386] Step 1:

[0387] The user explains the work content verbally and uses the terminal's operation screen. Specifically, the user explains the work procedure verbally and displays or inputs relevant information using the terminal's screen. The input consists of voice data and screen operation data, which are used in the next processing step.

[0388] Step 2:

[0389] The terminal records audio data from the user and captures screen operations. Specifically, it records audio data through the microphone and records the screen using screen capture software. The audio data and screen operation data are sent to the server as output from the terminal.

[0390] Step 3:

[0391] The server converts the received audio data into text data using speech recognition technology. For example, it uses dedicated speech recognition software to process the audio waveform into text information. The converted text data is output and its contents are used to analyze business procedures.

[0392] Step 4:

[0393] The server analyzes the received screen operation data using an image analysis tool. Specifically, it extracts important frames from the captured screen and identifies the content of each operation. The image analysis algorithm processes the screen data, and the results of the operation content extraction are output.

[0394] Step 5:

[0395] The emotion engine on the server recognizes user emotional information from audio and video data. Based on voice tone and facial expression patterns, calculations are performed to extract emotional states such as stress levels. The recognized emotional information is then output and used as a reference for optimizing business operations.

[0396] Step 6:

[0397] The server integrates text data, extracted operational data, and sentiment information to create optimized operational manuals using a generative AI model. Using each input data, specific procedure manuals are generated to streamline work processes. These operational manuals become the output and are stored in the cloud.

[0398] Step 7:

[0399] Users can access saved operational manuals and check work procedures as needed. They perform actual tasks based on the information in the manuals, and the system provides efficient support. The operational manuals, as output information, assist users in their operations.
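The integration in Step 6, in which text data, extracted operation data, and emotional information are passed to a generative AI model, can be sketched as assembling a single prompt. The wording of the prompt and the function name are assumptions, and the call to the model itself is deliberately left out.

```python
from typing import List


def build_manual_prompt(task: str, transcript: str,
                        operations: List[str], stress_level: str) -> str:
    """Assemble the inputs of Step 6 into one prompt string for a generative AI model."""
    operation_lines = "\n".join(f"- {operation}" for operation in operations)
    return (
        f"Create a step-by-step operational manual for the task: {task}.\n"
        f"Spoken explanation (transcribed):\n{transcript}\n"
        f"Operations identified from the screen capture:\n{operation_lines}\n"
        f"Estimated user stress level: {stress_level}.\n"
        "Where the stress level is high, add supplementary explanations."
    )
```

Keeping the three inputs in clearly labeled sections of the prompt is one way to let the model distinguish what the user said from what was observed on the screen.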

[0400] (Application Example 2)

[0401] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0402] In today's world, there is a demand for increased efficiency and reduced emotional burden in people's work. However, conventional systems have not adequately optimized the automation of work procedures, and in particular, they struggle to respond flexibly while considering the emotional state of users. Therefore, there is a need to achieve efficient work automation while reducing the stress and anxiety that users experience during work.

[0403] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0404] In this invention, the server includes an acquisition means for inputting work content in voice and image format, an emotion analysis means for analyzing emotional states, and a generation means for generating optimized work instruction manuals. This enables efficient and flexible automation of tasks in response to the user's emotions.

[0405] "Acquisition means" refers to a system that has the function of inputting work details in the form of voice and images.

[0406] "Conversion means" refers to the technology that converts input audio data into corresponding text data.

[0407] The "generation means" refers to a device that has the function of automatically creating a work instruction manual containing work procedures from text and image data.

[0408] "Emotional analysis means" refers to technology that determines an individual's emotional state from the user's voice and other data.

[0409] "Optimization means" refers to a function that optimizes work instructions based on emotional data so that they match the emotional state of the user, thereby creating an efficient procedure.

[0410] "Memory means" refers to technologies for securely storing generated work instruction manuals and optimized data.

[0411] "Execution means" refers to a function that automatically carries out tasks based on the work instruction manual.

[0412] A "virtual environment" refers to a virtual space or infrastructure within a computer system used for performing business operations.

[0413] The system implementing this invention operates primarily through the cooperation of a server, a terminal, and a user. When a user begins work, the terminal acquires audio and image data in real time. This acquisition means has the function of converting the audio explanation of the work into text and the function of capturing the screen. The generated data is sent to the server.

[0414] The server uses speech recognition software to convert audio data into text data. The common speech recognition library "speech_recognition" is applied for this purpose. Furthermore, image analysis software is used to extract information necessary for business procedures from the acquired image data. The image processing library "OpenCV" is often used for this purpose.

[0415] In addition, the server performs sentiment analysis on the acquired data, using the "emotion_recognition" library to evaluate the user's emotional state. Based on this emotional state, the generation means optimizes the work instructions, enabling users to perform their tasks more efficiently and with less stress. The optimized work instructions are stored in the memory means and can be accessed at any time as needed.

[0416] One application of this system is a smart home robot that receives user instructions and performs cleaning and household chores. It can analyze the user's stress level and provide additional suggestions or support as needed. An example of a prompt message would be: "If the user instructs, 'Make dinner,' analyze the voice data to determine if the user is stressed and, if necessary, offer comforting suggestions to the user."

[0417] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0418] Step 1:

[0419] The terminal receives voice input for work-related tasks from the user and simultaneously captures screen operations. Both voice and image data are acquired as input. The terminal saves the acquired voice as digital data and records the screen capture in still image or video format.

[0420] Step 2:

[0421] The terminal sends audio data to the server. The server converts the received audio data into text data using the "speech_recognition" library. The output of this step is text data represented as character information.

[0422] Step 3:

[0423] The terminal sends image data to the server. The server uses image processing libraries such as "OpenCV" to analyze and extract important business operation frames. Specifically, it extracts image features related to a particular operation procedure and identifies them.

[0424] Step 4:

[0425] The server processes data obtained from audio and images using the "emotion_recognition" library to perform emotion analysis. It uses text data and metadata from images as input. Based on this analysis, it identifies the user's emotional state and calculates stress levels and satisfaction indicators.

[0426] Step 5:

[0427] The server integrates text data of acquired work content, image processing results, and emotional data to generate work instruction manuals. The generated manuals are optimized to adapt to the user's emotional state. These manuals provide guidance for efficiently executing specific work procedures.

[0428] Step 6:

[0429] The server stores optimized work instructions in a storage device, allowing users to access them later. By storing them in a cloud service, users can view the work instructions from any device.

[0430] Step 7:

[0431] When the user performs the same task again, the server provides data to execute the task according to the stored instructions. During this time, it makes fine adjustments based on sentiment analysis data to support the task execution.

[0432] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input in response to the result of the specific processing. The control unit 46A transmits the audio data indicating the user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0433] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0434] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.

[0435] [Third Embodiment]

[0436] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.

[0437] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.

[0438] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0439] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.

[0440] The microphone 238 accepts instructions from the user 20 by receiving the voice of the user 20. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs the audio data to the processor 46. The speaker 240 outputs audio according to instructions from the processor 46.

[0441] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0442] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0443] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0444] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0445] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0446] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0447] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".

[0448] An embodiment of the system of the present invention employs a process that includes the following procedures. In this process, the user, terminal, and server each play a specific role.

[0449] The user provides details of their work to the system via a terminal and verbally explains the series of operations involved. During this process, the terminal captures the user's screen in real time and simultaneously records the user's voice. This data is transmitted to the server as the work progresses.

[0450] The server converts the audio data transmitted from the terminal into text data using speech recognition technology. This conversion process organizes the content explained in the audio into text. The server also analyzes the image data acquired from the terminal and extracts keyframes and operation details necessary for performing the task.

[0451] The server, acting as the generation mechanism, integrates the obtained text and image data to generate a detailed operational manual. This manual includes step-by-step instructions to ensure the user can accurately reproduce the initial operation. The generated operational manual is securely stored on a cloud-based storage device, allowing the user to access and modify it as needed.

[0452] Subsequently, the next time a user wants to perform the same task, the server will automatically execute the procedure based on the saved task manual. As a means of execution, the server replicates the necessary steps on the terminal, automating the task. User intervention is minimized, resulting in increased efficiency.

[0453] For example, if a user is teaching an AI agent how to issue invoices, the user operates the invoice issuance software and provides explanations for each operation. The terminal collects this information and sends it to the server. The server converts the received audio into text and the screen operations into images, and integrates them to generate an invoice issuance manual. As a result, the server can then automatically perform the invoice issuance procedure according to this manual the next time around.

[0454] This approach integrates the entire process, from creating operational manuals to automating their execution, into a single workflow, providing users with an effective means to improve operational efficiency.

[0455] The following describes the processing flow.

[0456] Step 1:

[0457] The user uses a terminal and explains the task verbally while demonstrating it. The user operates the software on the screen while giving verbal instructions to demonstrate specific operations.

[0458] Step 2:

[0459] The terminal activates its screen capture function and records the user's screen activity. It also records the user's spoken explanation via the microphone. This data is temporarily stored for later transmission to the server.

[0460] Step 3:

[0461] The terminal compresses the recorded screen and audio data and sends them to the server using a secure communication protocol. Data integrity is verified to ensure that no data is lost.

[0462] Step 4:

[0463] The server converts the audio data received from the terminal into text data using speech recognition technology. In this process, appropriate dictionary data is used to handle specialized terminology and industry jargon that needs to be recognized.

[0464] Step 5:

[0465] The server identifies important keyframes from the screen data and extracts images related to user actions. These images will later become part of the operational manual that will be generated.

[0466] Step 6:

[0467] The server combines the converted text data with extracted images to automatically generate a detailed manual outlining the work procedures. This manual organizes the work procedures step by step and presents them in a user-friendly format.

[0468] Step 7:

[0469] The server stores the generated work manual in cloud storage. Users can access the manual at any time to review or modify their work procedures.

[0470] Step 8:

[0471] The next time the user requests that the same task be performed by AI, the server will automatically execute the steps based on the stored task manual. The task will proceed autonomously, following instructions from the server to the terminal.

[0472] Step 9:

[0473] The server saves execution logs of tasks and makes them accessible to users for later verification and improvement. This helps improve task accuracy and makes troubleshooting more efficient.
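Step 3 of the flow above (compress the recording, then verify its integrity on arrival) could be sketched as follows. This is only an illustration under stated assumptions: the function names `pack_recording` and `unpack_recording`, the packet layout, and the choice of zlib and SHA-256 are hypothetical and are not specified in this disclosure. Transport security (for example TLS) is assumed to be handled separately.

```python
import hashlib
import zlib


def pack_recording(audio: bytes, screen: bytes) -> bytes:
    """Compress a recording and prepend a SHA-256 digest of the original bytes."""
    # Length-prefix the audio so the two streams can be separated again.
    payload = len(audio).to_bytes(8, "big") + audio + screen
    digest = hashlib.sha256(payload).digest()  # always 32 bytes
    return digest + zlib.compress(payload)


def unpack_recording(packet: bytes) -> tuple[bytes, bytes]:
    """Decompress a packet and raise ValueError if the digest does not match."""
    digest, compressed = packet[:32], packet[32:]
    payload = zlib.decompress(compressed)
    if hashlib.sha256(payload).digest() != digest:
        raise ValueError("integrity check failed: payload digest mismatch")
    audio_len = int.from_bytes(payload[:8], "big")
    audio = payload[8:8 + audio_len]
    screen = payload[8 + audio_len:]
    return audio, screen
```

In this sketch the terminal would call `pack_recording` before sending and the server would call `unpack_recording` on receipt. A digest mismatch signals that the recording should be retransmitted rather than processed.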

[0474] (Example 1)

[0475] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0476] In traditional business processes, the creation of documentation and the automation of procedures were often done manually, which was time-consuming and labor-intensive. Furthermore, manual document creation was prone to human error, compromising efficiency and accuracy. In addition, there was a lack of comprehensive systems to streamline and automate these business processes.

[0477] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0478] In this invention, the server includes information gathering means for acquiring audio and visual information for users to describe work content, information conversion means for converting the audio information into text information, and manual generation means for constructing work procedures using the visual and text information. This enables the automation and efficient execution of work procedures.

[0479] "User" refers to an individual or organization that uses the system to input business details and benefits from the streamlined process.

[0480] "Work content" refers to information including procedures for tasks and operations performed with a specific purpose.

[0481] "Audio information" refers to sound data that records work-related instructions and content explained verbally by the user.

[0482] "Visual information" refers to digital image data, including user interfaces and visual elements in business processes.

[0483] "Information gathering means" refers to devices or functions used to collect audio and visual information obtained from users.

[0484] "Information conversion means" refers to technology or equipment for converting acquired audio information into text information.

[0485] "Text information" refers to data in which audio information is represented as text.

[0486] "Manual generation means" refers to a device or function that documents business procedures based on collected visual and textual information.

[0487] "Work procedures" refers to information that describes the sequence of operations and actions necessary to perform a task.

[0488] A "data storage area" refers to a storage system that stores data such as business procedures and manuals, and makes them accessible as needed.

[0489] "Means of performing business operations" refers to a system that has the function of automatically reproducing or executing a process according to saved business procedures.

[0490] A "network environment" refers to a technological infrastructure that enables the transmission and reception of digital data via the internet and other communication methods.

[0491] To implement this invention, a system is constructed in which a user, a terminal, and a server work together to automate tasks. The user verbally explains the task and performs the procedure on the terminal. The terminal captures the user's screen in real time and records audio. This terminal is a computer equipped with an audio input device and a display capture function.

[0492] The server converts the audio data sent from the terminal into text using speech recognition technology. General-purpose speech recognition technologies (such as Google Cloud Speech-to-Text) can be used for this purpose. The server also uses image recognition technology to analyze captured visual information and extract key points related to the business. Image processing libraries such as OpenCV can be used for this image analysis.
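One simple way to extract key points from captured screens is to keep only the frames that differ noticeably from the last kept frame. The sketch below illustrates that idea with plain Python lists so that it runs without extra libraries. The frame representation, the function names, and the threshold value are assumptions for illustration. In practice an image processing library such as OpenCV would supply the frames and the difference computation, and this disclosure does not state which analysis method is actually used.

```python
from typing import List, Sequence

Frame = Sequence[int]  # flattened grayscale pixels, 0-255 (assumed representation)


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average per-pixel absolute difference between two equally sized frames."""
    if len(a) != len(b):
        raise ValueError("frames must have the same size")
    if not a:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_keyframes(frames: List[Frame], threshold: float = 10.0) -> List[int]:
    """Return indices of frames that differ from the last kept frame by more than threshold."""
    if not frames:
        return []
    kept = [0]  # always keep the first frame as the starting screen
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[kept[-1]], frames[i]) > threshold:
            kept.append(i)
    return kept
```

Comparing against the last kept frame, rather than the immediately preceding one, avoids missing a screen that changes gradually over several captures.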

[0493] The server integrates the converted text information with the analyzed visual information to generate a manual outlining the work procedures. This manual is stored in a cloud storage system (e.g., Amazon S3) and is accessible at any time. Furthermore, users can modify this manual as needed.
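The integration of text information and visual information into a manual might look like the following sketch, which pairs each transcribed utterance with the first screenshot captured while that utterance was being spoken. The data classes, field names, and the time-based pairing rule are assumptions made for illustration and are not taken from this disclosure.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Utterance:
    start_sec: float  # when the user started speaking this sentence
    text: str


@dataclass
class Keyframe:
    time_sec: float   # capture time of the screenshot
    image_path: str   # hypothetical location of the saved image


@dataclass
class ManualStep:
    number: int
    instruction: str
    image_path: Optional[str]


def build_manual(utterances: List[Utterance], keyframes: List[Keyframe]) -> List[ManualStep]:
    """Pair each utterance with the first keyframe captured before the next utterance starts."""
    ordered = sorted(utterances, key=lambda u: u.start_sec)
    frames = sorted(keyframes, key=lambda k: k.time_sec)
    steps = []
    for i, utt in enumerate(ordered):
        end = ordered[i + 1].start_sec if i + 1 < len(ordered) else float("inf")
        image = next((k.image_path for k in frames if utt.start_sec <= k.time_sec < end), None)
        steps.append(ManualStep(i + 1, utt.text, image))
    return steps


def render_text(steps: List[ManualStep]) -> str:
    """Render the manual as plain text, one numbered line per step."""
    lines = []
    for s in steps:
        suffix = f" [see {s.image_path}]" if s.image_path else ""
        lines.append(f"Step {s.number}: {s.instruction}{suffix}")
    return "\n".join(lines)
```

Keeping the steps as structured records, and rendering them to text only at the end, would make it straightforward for a user to edit individual steps later.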

[0494] To streamline actual operations, the server can automatically execute processes based on saved work procedures. This minimizes user intervention and improves the reproducibility and efficiency of operations. For example, when a user issues an invoice, they only need to be taught the necessary information input procedure once, and the server can then automatically reproduce that procedure the next time. This significantly reduces the burden of repetitive daily tasks.

[0495] An example of a prompt message could be: "Please describe the procedure for issuing an invoice. This procedure should include details of the required field entries and button clicks." Using such prompts generates operational manuals that contain specific and reproducible information, thus ensuring consistency in operations and improving operational efficiency.

[0496] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0497] Step 1:

[0498] The user verbally explains the task to the terminal and operates the software. Input includes the user's voice and actions. Specifically, the user launches invoice issuance software and explains the operation while entering the necessary information. During this process, the terminal utilizes both a voice input device and a screen capture function.

[0499] Step 2:

[0500] The terminal captures the user's screen and audio in real time and sends them to the server as audio and visual data. The input is the user's audio and screen, and the output is the audio and visual data sent to the server. Specifically, the terminal saves the displayed invoice issuance screen as a screenshot and records the audio explanation.

[0501] Step 3:

[0502] The server converts the transmitted audio data into text data using speech recognition technology. The input for this step is audio data, and the output is text data. Here, speech recognition software is used to organize the user's explanation into textual information.

[0503] Step 4:

[0504] The server analyzes visual data using image analysis technology and extracts keyframes to identify business procedures. The input is visual data, and the output is data indicating important operational points. Specifically, button clicks and data entry steps are identified from the image.

[0505] Step 5:

[0506] The server integrates the converted text data and parsed visual data to generate a manual outlining the business procedures. The input consists of text and visual data, while the output is a detailed business manual. The server combines these to clearly demonstrate the step-by-step process of invoice issuance.

[0507] Step 6:

[0508] The server saves the generated work manuals to cloud storage. The input is the work manual, and the output is the saved data accessible online. This allows users to access the manuals at any time and review or edit their contents.

[0509] Step 7:

[0510] In the next task, the server will initiate an automated execution process based on the saved work procedures. The input is the work manual, and the output is the execution of the automated work procedures. The server will reproduce the previously saved procedures and automatically perform the specific operations within the software.
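The automated execution in Step 7 can be pictured as replaying stored steps through a table of action handlers. The sketch below assumes that each stored step has already been reduced to an (action, argument) pair and that the terminal exposes one handler per action. Both are hypothetical simplifications, since this disclosure does not define the stored format or the control interface. The returned log is an added convenience in the spirit of the execution logs mentioned earlier.

```python
from typing import Callable, Dict, List, Tuple

# A stored step is an (action, argument) pair, e.g. ("click", "Issue").
Step = Tuple[str, str]
Handler = Callable[[str], None]


def replay(steps: List[Step], handlers: Dict[str, Handler]) -> List[str]:
    """Run each step through its handler and return an execution log."""
    log = []
    for number, (action, argument) in enumerate(steps, start=1):
        handler = handlers.get(action)
        if handler is None:
            # Stop rather than guess: an unknown action needs user attention.
            log.append(f"step {number}: unknown action '{action}', stopped")
            break
        try:
            handler(argument)
            log.append(f"step {number}: {action} {argument!r} ok")
        except Exception as exc:
            log.append(f"step {number}: {action} {argument!r} failed: {exc}")
            break
    return log
```

Stopping at the first unknown or failed step is a deliberate choice in this sketch: continuing after a failed field entry could, for example, issue an invoice with missing data.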

[0511] (Application Example 1)

[0512] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0513] In logistics centers, it is crucial for new workers to quickly and efficiently learn the shipping process. However, current manual training methods are insufficient to address the complexity and diverse procedures involved, making efficient training difficult. There is a need to improve this situation, standardize work procedures, and support the smooth execution of operations.

[0514] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0515] In this invention, the server includes an acquisition means for collecting work content in audio and video format, a conversion means for converting the audio into text information, and a generation means for generating work instructions that describe the work procedures based on the video and text information. This enables the presentation of work procedures in real time using a visual device, facilitating efficient training of new employees and automated execution of tasks.

[0516] "Work content" refers to a series of operational activities, including the shipping process at the logistics center.

[0517] "Means of acquisition that collect information via audio and video" refers to equipment used to record workers' operating procedures and voice explanations as audio and video data.

[0518] "The conversion means for converting the aforementioned audio into text information" refers to a technology that analyzes the collected audio data and expresses it as text information.

[0519] "Generation means for generating work instructions that describe work procedures based on the aforementioned video and text information" refers to a function that integrates the acquired data and creates a document that clearly describes the work procedures.

[0520] "Display means for presenting work procedures using visual devices" refers to a function that shows procedures to workers in real time through visual devices such as smart glasses.

[0521] "Recording means" refers to equipment or technology for storing generated work instructions and keeping them accessible as needed.

[0522] "Control means" refers to mechanisms or technologies for automatically executing tasks based on work instructions.

[0523] A "shared data environment" is a system for storing and synchronizing data across multiple devices and users via a network such as the cloud.

[0524] The system for implementing this invention mainly consists of three components: a server, a terminal (including a visual device), and a user. Specifically, it is implemented in the following form.

[0525] When users perform shipping processes at a logistics center, they use visual devices such as smart glasses to collect operating procedures and audio instructions. The visual devices acquire work content in real time as audio and video. As a result, the devices, as acquisition means, can record work procedures in detail.

[0526] Audio data obtained from visual devices is converted into text information by a server. Specifically, speech recognition technology is used to convert the audio data into text data. This conversion uses speech analysis software (e.g., Google Speech-to-Text API).

[0527] Next, the server generates a work instruction sheet based on the acquired video data and converted text information. The work instruction sheet documents the procedure and presents it to the user step by step. Image processing technology (e.g., OpenCV) and data integration technology (e.g., Python data processing libraries) are used in this generation process.

[0528] The generated work instructions are stored in a shared data environment as a means of recording data. This environment utilizes a cloud platform for data storage and synchronization. Users can access these instructions as needed to review and execute their work.

[0529] As a concrete example, consider the shipping process for a new product at a logistics center. To learn this process, workers wear smart glasses, and the system captures and transmits visual and audio data to a server. The work instructions created from this data are then presented to the new worker through the visual device the next time they perform a similar task, facilitating rapid learning.

[0530] The following are examples of prompt messages:

[0531] "Please convert the audio explanation of the logistics center's shipping process into text, analyze the video footage of the operations, and output the results in the format of an operations manual."

[0532] This system, through its function of presenting work procedures using visual devices, can promote the standardization and efficiency of work processes and efficiently support new employee training in logistics centers.

[0533] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0534] Step 1:

[0535] The user wears a visual device and begins the shipping process at the logistics center. The user's visual device simultaneously collects audio and video during the work. The input consists of audio data emitted by the user and real-time video data, which are temporarily recorded in the visual device's storage.

[0536] Step 2:

[0537] The terminal sends the collected audio data to the server, which uses speech recognition technology to convert the audio into text. The input is audio data, and the server uses speech recognition software to convert the audio data into text strings; this text data is the output. In this way, the procedures explained in audio are recorded as text.

[0538] Step 3:

[0539] The server analyzes the video data sent from the terminal and extracts the keyframes that are important for the task. The input is video data, and the output is image data extracted as keyframes. The server applies video frame analysis technology to detect important scenes; specifically, it detects changes and patterns within the images.

[0540] Step 4:

[0541] The server integrates the converted text information and extracted keyframes to generate a work instruction document. The input consists of text data and keyframe image data, and the output is the generated work instruction document. The integration method utilizes natural language processing to format the procedure step by step.

[0542] Step 5:

[0543] The server saves the generated work instructions to a shared data environment in the cloud. The input is the work instructions, and the output is the work instructions stored in cloud storage. Specifically, the data is uploaded to the cloud via the internet in a secure manner.

[0544] Step 6:

[0545] The next time the user performs a similar task, they will put on the visual device, and the server will display the saved work instructions. The input is the work instructions retrieved from the cloud, and the output is the procedure displayed on the visual device. Specifically, the visual device's display sequentially presents the instructions for each step.
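The sequential presentation in Step 6 could be modeled as a small cursor over the stored instructions, as in the sketch below. The class name, the "[n/total]" display format, and the idea that the wearer triggers `advance` (for example by a voice command) are assumptions for illustration only.

```python
from typing import List


class StepPresenter:
    """Walks through stored work instructions one step at a time."""

    def __init__(self, instructions: List[str]) -> None:
        if not instructions:
            raise ValueError("work instructions must contain at least one step")
        self._instructions = instructions
        self._index = 0

    def current(self) -> str:
        """Text to render on the visual device for the current step."""
        total = len(self._instructions)
        return f"[{self._index + 1}/{total}] {self._instructions[self._index]}"

    def advance(self) -> bool:
        """Move to the next step; return False when the last step is already shown."""
        if self._index + 1 >= len(self._instructions):
            return False
        self._index += 1
        return True
```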

[0546] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0547] To implement the system of the present invention, a collaborative process involving the user, terminal, server, and emotion engine is employed. Through this process, it becomes possible not only to automate tasks but also to optimize work procedures based on the user's emotional state.

[0548] First, the user provides the details of their work to the system using a terminal. The user explains the work procedure verbally while performing operations on the screen. The terminal collects this audio and screen recording in real time. The recorded data is sent to the server.

[0549] The server analyzes the received data and converts the voice data of the business into text data using speech recognition technology. Meanwhile, it uses image analysis algorithms to extract important operational frames related to the business procedure from the screen data.

[0550] The emotion engine recognizes emotional data from the user's tone of voice and facial expressions. This emotional data reflects the user's psychological state during the work explanation and is used to improve the efficiency of work performance. For example, if the user is showing signs of stress, the emotion engine can add supplementary explanations to make the work procedure easier to understand.

[0551] The server integrates this data to generate a work manual. This manual includes text data, image data, and supplementary information based on the user's emotional state, supporting specific and efficient work execution. The generated manual is stored on a cloud-based storage system, which the user can access.
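Adding supplementary information based on the user's emotional state might be sketched as follows. The sketch assumes that the emotion engine has already produced a stress score between 0 and 1 for some steps. The score scale, the threshold, and the wording of the note are hypothetical, and how the emotion identification model 59 actually derives such scores is not described here.

```python
from typing import Dict, List


def annotate_steps(steps: List[str], stress: Dict[int, float], threshold: float = 0.7) -> List[str]:
    """Append a supplementary note to each step whose stress score exceeds the threshold.

    `stress` maps a 1-based step number to a score in [0, 1]; steps without a
    score are treated as unstressed.
    """
    annotated = []
    for number, text in enumerate(steps, start=1):
        if stress.get(number, 0.0) > threshold:
            text += " (Note: this step was stressful when recorded; take extra care.)"
        annotated.append(text)
    return annotated
```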

[0552] Subsequently, when the user performs the same task again, the server automatically executes the work procedure according to the stored manual. Based on past emotional data obtained from the emotion engine, the procedure is fine-tuned to ensure that the task is performed efficiently and stress-free.

[0553] As a concrete example, consider a scenario where a user teaches a system the customer support inquiry handling process. The user operates the customer management system while verbally explaining the appropriate response to the inquiry. The terminal records this and sends it to the server, which converts the audio to text and the screen to images, while an emotion engine monitors the user's stress level. The server integrates these elements and, if the stress level is high, generates a simplified manual. This allows another user to apply the system and handle inquiries stress-free the next time.

[0554] The following describes the processing flow.

[0555] Step 1:

[0556] The user uses a terminal and explains the procedure verbally while demonstrating the task. The user operates the relevant software via the screen, indicating the necessary actions.

[0557] Step 2:

[0558] The terminal captures the user's voice while they are working and records their screen activity. It also records the user's facial expressions using a built-in or connected camera.

[0559] Step 3:

[0560] The terminal sends the recorded audio, screen data, and facial expression data to the server. The data is compressed in real time and transferred using a secure communication method.

[0561] Step 4:

[0562] The server converts the audio data into text data using a speech recognition algorithm. This process utilizes a customized dictionary to handle specialized terminology specific to the business.

[0563] Step 5:

[0564] The server analyzes screen data, identifies keyframes and interactions of user actions, and extracts image data to clarify business procedures.

[0565] Step 6:

[0566] The emotion engine analyzes the user's emotional state from their voice tone and facial expressions. Based on this analysis, it determines at which step the user is experiencing stress.

[0567] Step 7:

[0568] The server generates a work manual based on the text data, image data, and sentiment analysis results. Supplementary information and warnings are inserted into the generated manual at the steps where the sentiment analysis indicated that the user experienced stress.

[0569] Step 8:

[0570] The server saves the completed business manual to cloud storage, making it accessible to users. The manual includes comments and suggestions for improvement based on user feedback.

[0571] Step 9:

[0572] The next time the same user or another user performs a similar task, the server refers to the stored manual and prepares to execute the task automatically. Emotion data is used to optimize the procedure for smoother task execution.

[0573] Step 10:

[0574] The server saves sentiment analysis results and execution logs after task completion, which are then used to further improve operational efficiency and enhance the user experience. Through this process, user feedback can be incorporated into the design of business processes.
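Step 4 of the flow above mentions a customized dictionary for specialized terminology. One lightweight reading of that idea is a post-processing pass that replaces known misrecognitions in the transcript, sketched below. This is an assumption about how such a dictionary could be applied. Many speech recognition services instead accept vocabulary hints at recognition time, and this disclosure does not say which approach is intended. The function name and the sample entries are hypothetical.

```python
import re
from typing import Dict


def apply_term_dictionary(transcript: str, terms: Dict[str, str]) -> str:
    """Replace known misrecognitions with the correct specialized term.

    Matching is case-insensitive and limited to whole words or phrases.
    Longer keys are tried first so a phrase wins over a word it contains.
    """
    if not terms:
        return transcript
    keys = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b", re.IGNORECASE
    )
    lookup = {k.lower(): v for k, v in terms.items()}
    return pattern.sub(lambda m: lookup[m.group(0).lower()], transcript)
```

For example, with the entries `{"in voice": "invoice", "sea RM": "CRM"}`, the transcript "Open the In Voice screen in the sea rm" would become "Open the invoice screen in the CRM".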

[0575] (Example 2)

[0576] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0577] In modern business, efficient and stress-free work execution is essential. However, when procedures are complex and not optimized to accommodate the emotional states of individual users, work efficiency decreases and stress increases. Solving this problem makes it possible to automate tasks and improve the user experience.

[0578] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0579] In this invention, the server includes input means for inputting work content in voice and image format, recognition means for extracting emotional information from the user's voice tone and facial expression data, and optimization means for creating a work manual that includes supplementary information to optimize work execution considering the emotional information. As a result, work procedures are optimized based on the user's emotional state, enabling efficient and stress-free work execution.

[0580] "Input means" refers to a device or technology for receiving work details from a user in audio and image format and inputting them into the system.

[0581] "Conversion means" refers to a device or technology for analyzing acquired audio data and converting it into text data.

[0582] "Generation means" refers to a device or technology for automatically generating a business manual that describes business procedures based on data obtained from audio and images.

[0583] "Recognition means" refers to a device or technology for analyzing a user's voice tone and facial expression data and extracting emotional information from them.

[0584] "Optimization means" refers to a device or technology for creating a work manual that takes extracted emotional information into account and includes supplementary information to optimize work performance.

[0585] "Memory means" refers to a device or technology for storing generated operational manuals and making them accessible as needed.

[0586] "Execution means" refers to a device or technology that automatically performs tasks according to stored operational manuals and further fine-tunes procedures based on emotional information.

[0587] In implementing the system of the present invention, close cooperation is necessary between the user, terminal, server, and emotion engine. Through this cooperation, automation of tasks and improvement of the user experience can be achieved.

[0588] First, the user explains their work verbally and uses the terminal's screen as needed. For example, they might explain the customer support inquiry procedure verbally while operating a customer management system. During this process, the terminal collects the user's voice via a microphone and captures the actions on the screen. Recording software is used to collect the voice data, and capture software is used to collect the screen data.

[0589] The terminal transmits the collected audio and image data to the server in real time. On the server, a speech recognition engine converts the audio data into text; specifically, speech analysis software is used for this recognition. Meanwhile, the image data is analyzed by an image analysis tool to extract important operational details related to business procedures.

[0590] Simultaneously, an emotion engine embedded in the server analyzes the user's voice tone and facial expressions to recognize emotional information. This emotional information is used to optimize work procedures, and supplementary information is added according to the user's stress level.

[0591] The generated work manuals are stored in cloud storage. For example, a cloud storage service is used to securely store and retrieve the data. Users can access the stored manuals at any time, enabling automated and efficient procedure execution based on the manuals when repeating tasks.

[0592] A concrete example of a prompt message could be: "Please provide a voice-based explanation of the customer support inquiry handling procedure, perform operations on the customer management system using the terminal, and generate a manual based on the stress level." Giving the system instructions with such a prompt message achieves the automated process.

[0593] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0594] Step 1:

[0595] The user explains the work content verbally and uses the terminal's operation screen. Specifically, the user explains the work procedure verbally and displays or inputs relevant information using the terminal's screen. The input consists of voice data and screen operation data, which are used in the next processing step.

[0596] Step 2:

[0597] The terminal records audio data from the user and captures screen operations. Specifically, it records audio through the microphone and records the screen using screen capture software. The audio data and screen operation data are the terminal's output and are sent to the server.

[0598] Step 3:

[0599] The server converts the received audio data into text data using speech recognition technology. For example, it uses dedicated speech recognition software to process the audio waveform into text information. The converted text data is output and its contents are used to analyze business procedures.

[0600] Step 4:

[0601] The server analyzes the received screen operation data using an image analysis tool. Specifically, it extracts important frames from the captured screen and identifies the content of each operation. The image analysis algorithm processes the screen data, and the results of the operation content extraction are output.

[0602] Step 5:

[0603] The emotion engine on the server recognizes user emotional information from audio and video data. Based on voice tone and facial expression patterns, calculations are performed to extract emotional states such as stress levels. The recognized emotional information is then output and used as a reference for optimizing business operations.

[0604] Step 6:

[0605] The server integrates the text data, the extracted operation data, and the emotional information, and uses a generative AI model to create an optimized operational manual. From these inputs, it generates a concrete procedure manual that streamlines the work process. The operational manual is the output of this step and is stored in the cloud.

[0606] Step 7:

[0607] Users can access saved operational manuals and check work procedures as needed. They perform actual tasks based on the information in the manuals, and the system provides efficient support. The operational manuals, as output information, assist the user's operations.
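
Steps 5 and 6 of this flow can be illustrated with a minimal Python sketch. It assumes that the emotion engine yields one stress level per step, normalized to the range 0 to 1, and that the supplementary information is a short note attached to a step. The class name ManualStep, the function build_manual, the 0.6 threshold, and the wording of the note are all hypothetical; none of them appears in the disclosure, and the generative AI model itself is not modeled here.

```python
from dataclasses import dataclass, field


@dataclass
class ManualStep:
    number: int
    instruction: str          # text obtained by speech recognition (Step 3)
    operation: str            # operation identified from the screen capture (Step 4)
    notes: list[str] = field(default_factory=list)


def build_manual(instructions, operations, stress_levels, threshold=0.6):
    """Combine per-step text, operations, and stress levels into manual steps.

    stress_levels holds one hypothetical score in [0, 1] per step; a step whose
    score reaches the threshold receives a supplementary note.
    """
    if not (len(instructions) == len(operations) == len(stress_levels)):
        raise ValueError("inputs must have one entry per step")
    manual = []
    for i, (text, op, stress) in enumerate(
        zip(instructions, operations, stress_levels), start=1
    ):
        step = ManualStep(number=i, instruction=text, operation=op)
        if stress >= threshold:
            step.notes.append(
                "The user showed signs of stress here; proceed slowly and double-check the input."
            )
        manual.append(step)
    return manual


manual = build_manual(
    ["Open the inquiry list", "Enter the customer ID"],
    ["click: Inquiries tab", "type: customer_id field"],
    [0.2, 0.8],
)
for step in manual:
    print(step.number, step.instruction, "|", step.operation, "|", step.notes)
```

In this sketch only the second step, whose assumed stress level is 0.8, receives a supplementary note.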

[0608] (Application Example 2)

[0609] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0610] In today's world, there is a demand for increased efficiency and reduced emotional burden in people's work. However, conventional systems have not adequately optimized the automation of work procedures, and in particular, they struggle to respond flexibly while considering the emotional state of users. Therefore, there is a need to achieve efficient work automation while reducing the stress and anxiety that users experience during work.

[0611] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0612] In this invention, the server includes an acquisition means for inputting work content in voice and image format, an emotion analysis means for analyzing emotional states, and a generation means for generating optimized work instruction manuals. This enables efficient and flexible automation of tasks in response to the user's emotions.

[0613] "Acquisition means" refers to a system that has the function of inputting work details in the form of voice and images.

[0614] "Conversion means" refers to the technology that converts input audio data into corresponding text data.

[0615] The "generation means" refers to a device that has the function of automatically creating a work instruction manual containing work procedures from text and image data.

[0616] "Emotion analysis means" refers to technology that determines an individual's emotional state from the user's voice and other data.

[0617] "Optimization means" refers to a function that adjusts the work instructions to the user's emotional state, based on emotional data, so that the resulting procedure is efficient.

[0618] "Memory means" refers to technologies for securely storing generated work instruction manuals and optimized data.

[0619] "Execution means" refers to a function that automatically carries out tasks based on the work instruction manual.

[0620] A "virtual environment" refers to a virtual space or infrastructure within a computer system used for performing business operations.

[0621] The system implementing this invention operates primarily through the cooperation of a server, a terminal, and a user. When a user begins work, the terminal acquires audio and image data in real time. This acquisition means has the function of converting the audio explanation of the work into text and the function of capturing the screen. The generated data is sent to the server.

[0622] The server uses speech recognition software to convert audio data into text data. The common speech recognition library "speech_recognition" is applied for this purpose. Furthermore, image analysis software is used to extract information necessary for business procedures from the acquired image data. The image processing library "OpenCV" is often used for this purpose.

[0623] In addition, the server performs emotion analysis on the acquired data, using the "emotion_recognition" library to evaluate the user's emotional state. Based on the result, the generation means optimizes the work instructions so that users can perform their tasks more efficiently and with less stress. The optimized work instructions are stored by the memory means and can be accessed whenever needed.

[0624] One application of this system is a smart home robot that receives user instructions and performs cleaning and household chores. It can analyze the user's stress level and provide additional suggestions or support as needed. An example of a prompt message would be: "If the user instructs, 'Make dinner,' analyze the voice data to determine if the user is stressed and, if necessary, offer comforting suggestions to the user."

[0625] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0626] Step 1:

[0627] The terminal receives voice input for work-related tasks from the user and simultaneously captures screen operations. Both voice and image data are acquired as input. The terminal saves the acquired voice as digital data and records the screen capture in still image or video format.

[0628] Step 2:

[0629] The terminal sends audio data to the server. The server converts the received audio data into text data using the "speech_recognition" library. The output of this step is text data represented as character information.

[0630] Step 3:

[0631] The terminal sends image data to the server. The server uses image processing libraries such as "OpenCV" to analyze and extract important business operation frames. Specifically, it extracts image features related to a particular operation procedure and identifies them.

[0632] Step 4:

[0633] The server processes data obtained from audio and images using the "emotion_recognition" library to perform emotion analysis. It uses text data and metadata from images as input. Based on this analysis, it identifies the user's emotional state and calculates indicators of stress level and satisfaction.

[0634] Step 5:

[0635] The server integrates text data of acquired work content, image processing results, and emotional data to generate work instruction manuals. The generated manuals are optimized to adapt to the user's emotional state. These manuals provide guidance for efficiently executing specific work procedures.

[0636] Step 6:

[0637] The server stores optimized work instructions in a storage device, allowing users to access them later. By storing them in a cloud service, users can view the work instructions from any device.

[0638] Step 7:

[0639] When the user performs the same task again, the server provides the data needed to perform the task according to the stored instructions. While the task is in progress, it makes fine adjustments based on the emotion analysis data to support the user.
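
The fine adjustment in Step 7 is not specified further in the text. As one possible reading, the sketch below tunes the pacing of the stored instructions to a stress level in the range 0 to 1. The names StepPlan and adjust_plan, the pause rule, and the 0.7 confirmation threshold are assumptions made only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepPlan:
    instruction: str
    pause_seconds: float
    ask_confirmation: bool


def adjust_plan(instructions, stress_level, base_pause=1.0):
    """Return a per-step plan tuned to a stress level in [0, 1].

    Higher stress lengthens the pause between steps (up to twice the base)
    and, above 0.7, asks the user to confirm each step. Both rules are
    illustrative assumptions, not part of the described system.
    """
    if not 0.0 <= stress_level <= 1.0:
        raise ValueError("stress_level must be in [0, 1]")
    pause = base_pause * (1.0 + stress_level)
    confirm = stress_level > 0.7
    return [StepPlan(text, pause, confirm) for text in instructions]


calm = adjust_plan(["Preheat the oven", "Set the table"], stress_level=0.0)
tense = adjust_plan(["Preheat the oven", "Set the table"], stress_level=1.0)
```

The stored instructions themselves are unchanged; only the way they are delivered varies with the measured emotional state.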

[0640] The specific processing unit 290 transmits the result of the specific processing to the headset terminal 314. In the headset terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0641] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives a prompt containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. It performs inference on the input data according to the instructions in the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0642] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset terminal 314.

[0643] [Fourth Embodiment]

[0644] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.

[0645] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.

[0646] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0647] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.

[0648] The microphone 238 receives instructions from the user 20 in the form of speech. It captures the voice of the user 20, converts it into audio data, and outputs the audio data to the processor 46. The speaker 240 outputs audio according to instructions from the processor 46.

[0649] Camera 42 is a small digital camera equipped with an optical system (a lens, an aperture, and a shutter) and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor. It captures images of the area around the user 20 (for example, an imaging range defined by an angle of view equivalent to a typical healthy person's field of vision).

[0650] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0651] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.

[0652] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0653] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0654] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0655] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0656] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0657] An embodiment of the system of the present invention employs a process that includes the following procedures. In this process, the user, terminal, and server each play a specific role.

[0658] The user provides details of their work to the system via a terminal and verbally explains the series of operations involved. During this process, the terminal captures the user's screen in real time and simultaneously records the user's voice. This data is transmitted to the server as it progresses.

[0659] The server converts the audio data transmitted from the terminal into text data using speech recognition technology. This conversion process organizes the content explained in the audio into text. The server also analyzes the image data acquired from the terminal and extracts keyframes and operation details necessary for performing the task.

[0660] The server, acting as the generation means, integrates the obtained text and image data to generate a detailed operational manual. This manual includes step-by-step instructions so that the user can accurately reproduce the original operation. The generated operational manual is securely stored on a cloud-based storage device, allowing the user to access and modify it as needed.

[0661] The next time the user wants to perform the same task, the server automatically executes the procedure based on the saved operational manual. Acting as the execution means, the server reproduces the necessary steps on the terminal, automating the task. This minimizes user intervention and increases efficiency.

[0662] For example, when a user teaches an AI agent how to issue invoices, the user operates the invoice issuance software and explains each operation aloud. The terminal collects this information and sends it to the server. The server converts the received audio into text, extracts images of the screen operations, and integrates them to generate an invoice issuance manual. From then on, the server can carry out the invoice issuance procedure automatically according to this manual.
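
How a saved manual might be replayed can be sketched as follows. The sketch assumes that the manual has already been reduced to a list of structured steps, each with an action name and a target; that representation, the function run_manual, the handler table, and the rule of stopping at the first step that cannot be handled are all hypothetical. A dictionary stands in for the state of the invoice issuance software.

```python
def run_manual(steps, handlers):
    """Execute manual steps in order and return a log of what was done.

    steps: list of dicts such as {"action": "click", "target": "Issue"}.
    handlers: maps an action name to a callable that takes the step dict.
    Execution stops at the first step that has no handler.
    """
    log = []
    for index, step in enumerate(steps, start=1):
        handler = handlers.get(step["action"])
        if handler is None:
            log.append((index, "skipped: no handler for " + step["action"]))
            break
        handler(step)
        log.append((index, "done"))
    return log


form = {}        # stands in for the fields of the invoice software
clicked = []     # records which buttons were pressed

handlers = {
    "type": lambda s: form.__setitem__(s["target"], s["value"]),
    "click": lambda s: clicked.append(s["target"]),
}

invoice_manual = [
    {"action": "type", "target": "customer", "value": "ACME"},
    {"action": "type", "target": "amount", "value": "1200"},
    {"action": "click", "target": "Issue"},
]

log = run_manual(invoice_manual, handlers)
```

Keeping the handlers separate from the manual means the same stored steps could, in principle, drive a different terminal by swapping the handler table.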

[0663] This approach integrates the entire process, from creating operational manuals to automating their execution, into a single workflow, providing users with an effective means to improve operational efficiency.

[0664] The following describes the processing flow.

[0665] Step 1:

[0666] The user uses a terminal and explains the task verbally while demonstrating it. The user operates the software on the screen while giving verbal instructions to demonstrate specific operations.

[0667] Step 2:

[0668] The terminal activates its screen capture function and records the user's screen activity. It also records the user's voice explanation via the microphone. This data is temporarily stored for later transmission to the server.

[0669] Step 3:

[0670] The terminal compresses the recorded screen and audio data and sends it to the server using a secure communication protocol. Data integrity is verified to confirm that nothing was lost in transit.

[0671] Step 4:

[0672] The server converts the audio data received from the terminal into text data using speech recognition technology. In this process, appropriate dictionary data is used to handle specialized terminology and industry jargon that needs to be recognized.

[0673] Step 5:

[0674] The server identifies important keyframes from the screen data and extracts images related to user actions. These images will later become part of the operational manual that will be generated.

[0675] Step 6:

[0676] The server combines the converted text data with extracted images to automatically generate a detailed manual outlining the work procedures. This manual organizes the work procedures step by step and presents them in a user-friendly format.

[0677] Step 7:

[0678] The server stores the generated work manuals in cloud storage. Users can access these manuals at any time to review or modify their work procedures.

[0679] Step 8:

[0680] The next time the user asks the AI to perform the same task, the server automatically executes the steps based on the stored work manual. The task proceeds autonomously as the server sends instructions to the terminal.

[0681] Step 9:

[0682] The server saves execution logs of tasks and makes them accessible to users for later verification and improvement. This helps improve task accuracy and makes troubleshooting more efficient.
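
The compression and integrity check in Step 3 can be sketched with the Python standard library. The text does not name a compression format or a verification method, so zlib and a SHA-256 digest are assumptions chosen for illustration, as are the function names pack and unpack. The secure communication protocol itself is outside the sketch, and a plain digest only detects accidental corruption.

```python
import hashlib
import zlib


def pack(payload: bytes) -> tuple[bytes, str]:
    """Terminal side: compress the payload and compute a SHA-256 digest of the original."""
    return zlib.compress(payload), hashlib.sha256(payload).hexdigest()


def unpack(compressed: bytes, expected_digest: str) -> bytes:
    """Server side: decompress and verify the digest; raise if the data changed."""
    payload = zlib.decompress(compressed)
    if hashlib.sha256(payload).hexdigest() != expected_digest:
        raise ValueError("integrity check failed")
    return payload


recording = b"screen-frames-and-audio" * 100   # placeholder for the recorded data
blob, digest = pack(recording)
restored = unpack(blob, digest)
```

The digest is computed over the uncompressed data, so the server confirms that what it decompressed matches what the terminal recorded.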

[0683] (Example 1)

[0684] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0685] In traditional business processes, the creation of documentation and the automation of procedures were often done manually, which was time-consuming and labor-intensive. Furthermore, manual document creation was prone to human error, compromising efficiency and accuracy. In addition, there was a lack of comprehensive systems to streamline and automate these business processes.

[0686] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0687] In this invention, the server includes information gathering means for acquiring the audio and visual information with which users describe work content, information conversion means for converting the audio information into text information, and manual generation means for constructing work procedures from the visual and text information. This enables work procedures to be automated and executed efficiently.

[0688] "User" refers to an individual or organization that uses the system to input business details and benefits from the streamlined process.

[0689] "Work content" refers to information including procedures for tasks and operations performed for a specific purpose.

[0690] "Audio information" refers to sound data that records work-related instructions and content explained verbally by the user.

[0691] "Visual information" refers to digital image data, including user interfaces and visual elements in business processes.

[0692] "Information gathering means" refers to devices or functions used to collect audio and visual information obtained from users.

[0693] "Information conversion means" refers to technology or equipment for converting acquired audio information into text information.

[0694] "Textual information" refers to data in which audio information is represented as text.

[0695] "Manual generation means" refers to a device or function that documents business procedures based on collected visual and textual information.

[0696] "Business procedures" refer to information that describes the sequence of operations and actions necessary to perform a task.

[0697] A "data storage area" refers to a storage system that stores data such as business procedures and manuals, and makes them accessible as needed.

[0698] "Means of performing business operations" refers to a system that has the function of automatically reproducing or executing a process according to saved business procedures.

[0699] A "network environment" refers to a technological infrastructure that enables the transmission and reception of digital data via the internet and other communication methods.

[0700] To implement this invention, a system is constructed in which a user, a terminal, and a server work together to automate tasks. The user verbally explains the task and performs the procedure on the terminal. The terminal captures the user's screen in real time and records audio. This terminal is a computer equipped with an audio input device and a display capture function.

[0701] The server converts the audio data sent from the terminal into text using speech recognition technology. General-purpose speech recognition technologies (such as Google Cloud Speech-to-Text) can be used for this purpose. The server also uses image recognition technology to analyze captured visual information and extract key points related to the business. Image processing libraries such as OpenCV can be used for this image analysis.

[0702] The server integrates the converted text information with the analyzed visual information to generate a manual outlining the work procedures. This manual is stored in a cloud storage system (e.g., Amazon S3) and is accessible at any time. Furthermore, users can modify this manual as needed.
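
The storage behaviour described here, where a manual is saved, retrieved at any time, and modified by the user, can be sketched without a real cloud service. The class ManualStore below is a hypothetical in-memory stand-in for an object store; the JSON encoding, the key name, and the idea of keeping every saved version are assumptions for illustration and say nothing about how Amazon S3 would actually be called.

```python
import json


class ManualStore:
    """In-memory stand-in for an object store; keeps every saved version."""

    def __init__(self):
        self._objects = {}   # key -> list of JSON-encoded versions

    def save(self, key, manual):
        encoded = json.dumps(manual, ensure_ascii=False).encode("utf-8")
        self._objects.setdefault(key, []).append(encoded)
        return len(self._objects[key])          # version number, starting at 1

    def load(self, key, version=None):
        versions = self._objects[key]
        blob = versions[-1] if version is None else versions[version - 1]
        return json.loads(blob.decode("utf-8"))


store = ManualStore()
manual = {"title": "Issue an invoice", "steps": ["Open the invoice screen", "Click Issue"]}
v1 = store.save("manuals/invoice.json", manual)

manual["steps"].insert(1, "Enter the customer name")   # the user edits the manual
v2 = store.save("manuals/invoice.json", manual)

latest = store.load("manuals/invoice.json")
original = store.load("manuals/invoice.json", version=1)
```

Because each save stores an encoded snapshot, a later edit does not alter the earlier version.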

[0703] To streamline actual operations, the server can automatically execute processes based on saved work procedures. This minimizes user intervention and improves the reproducibility and efficiency of operations. For example, when a user issues an invoice, they only need to teach the system the necessary information input procedure once, and the server can then automatically reproduce that procedure the next time. This significantly reduces the burden of repetitive daily tasks.

[0704] An example of a prompt message could be: "Please describe the procedure for issuing an invoice. This procedure should include details of the required field entries and button clicks." Using such prompts generates operational manuals that contain specific and reproducible information, thus ensuring consistency in operations and improving operational efficiency.

[0705] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0706] Step 1:

[0707] The user verbally explains the task to the terminal and operates the software. Input includes the user's voice and actions. Specifically, the user launches invoice issuance software and explains the operation while entering the necessary information. During this process, the terminal utilizes both a voice input device and a screen capture function.

[0708] Step 2:

[0709] The terminal captures the user's screen and audio in real time and sends them to the server as audio and visual data. The input is the user's audio and screen, and the output is the audio and visual data sent to the server. Specifically, the terminal saves the displayed invoice issuance screen as a screenshot and records the audio explanation.

[0710] Step 3:

[0711] The server converts the transmitted audio data into text data using speech recognition technology. The input for this step is audio data, and the output is text data. Here, speech recognition software is used to organize the user's explanation into textual information.

[0712] Step 4:

[0713] The server analyzes visual data using image analysis technology and extracts keyframes to identify business procedures. The input is visual data, and the output is data indicating important operational points. Specifically, button clicks and data entry steps are identified from the image.

[0714] Step 5:

[0715] The server integrates the converted text data and parsed visual data to generate a manual outlining the business procedures. The input consists of text and visual data, while the output is a detailed business manual. The server combines these to clearly demonstrate the step-by-step process of invoice issuance.

[0716] Step 6:

[0717] The server saves the generated work manuals to cloud storage. The input is the work manual, and the output is the saved data accessible online. This allows users to access the manuals at any time and review or edit their contents.

[0718] Step 7:

[0719] In the next task, the server will initiate an automated execution process based on the saved work procedures. The input is the work manual, and the output is the execution of the automated work procedures. The server will reproduce the previously saved procedures and automatically perform the specific operations within the software.
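
The integration in Step 5 is described only at a high level. One way to combine the two streams, assuming that both the transcript segments and the keyframes carry timestamps in seconds, is to attach each keyframe to the segment that was being spoken when the frame was captured. The timestamps, the function name align, and the file names are hypothetical and serve only to illustrate the idea.

```python
from bisect import bisect_right


def align(segments, keyframes):
    """Attach each keyframe to the transcript segment being spoken at that time.

    segments: list of (start_seconds, text), sorted by start time.
    keyframes: list of (timestamp_seconds, image_name).
    Returns a list of {"text": ..., "images": [...]} in segment order.
    """
    starts = [start for start, _ in segments]
    steps = [{"text": text, "images": []} for _, text in segments]
    for timestamp, image in keyframes:
        index = bisect_right(starts, timestamp) - 1
        if index >= 0:                       # ignore frames before the first segment
            steps[index]["images"].append(image)
    return steps


segments = [
    (0.0, "Open the invoice screen"),
    (5.0, "Enter the customer name"),
    (12.0, "Click Issue"),
]
keyframes = [
    (1.5, "frame_001.png"),
    (6.0, "frame_002.png"),
    (7.2, "frame_003.png"),
    (12.0, "frame_004.png"),
]
steps = align(segments, keyframes)
```

Each resulting entry pairs one spoken instruction with the screenshots that illustrate it, which is the shape a step-by-step manual needs.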

[0720] (Application Example 1)

[0721] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0722] In logistics centers, it is crucial for new workers to quickly and efficiently learn the shipping process. However, current manual training methods are insufficient to address the complexity and diverse procedures involved, making efficient training difficult. There is a need to improve this situation, standardize work procedures, and support the smooth execution of operations.

[0723] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0724] In this invention, the server includes an acquisition means for collecting work content in audio and video format, a conversion means for converting the audio into text information, and a generation means for generating work instructions that describe the work procedures based on the video and text information. This enables the presentation of work procedures in real time using a visual device, facilitating efficient training of new employees and automated execution of tasks.

[0725] "Work content" refers to a series of operational activities, including the shipping process at the logistics center.

[0726] "Means of acquisition that collect information via audio and video" refers to equipment used to record workers' operating procedures and voice explanations as audio and video data.

[0727] "The conversion means for converting the aforementioned audio into text information" refers to a technology that analyzes the collected audio data and expresses it as text information.

[0728] "Generation means for generating work instructions that describe work procedures based on the aforementioned video and text information" refers to a function that integrates the acquired data and creates a document that clearly describes the work procedures.

[0729] "Display means for presenting work procedures using visual devices" refers to a function that shows procedures to workers in real time through visual devices such as smart glasses.

[0730] "Recording means" refers to equipment or technology for storing generated work instructions and keeping them accessible as needed.

[0731] "Control means" refers to mechanisms or technologies for automatically executing tasks based on work instructions.

[0732] A "shared data environment" is a system for storing and synchronizing data across multiple devices and users via a network such as the cloud.

[0733] The system for implementing this invention mainly consists of three components: a server, a terminal (including a visual device), and a user. Specifically, it is implemented in the following form.

[0734] When users perform shipping processes at a logistics center, they use visual devices such as smart glasses to collect operating procedures and audio instructions. The visual devices acquire work content in real time as audio and video. As a result, the devices, as acquisition means, can record work procedures in detail.

[0735] Audio data obtained from visual devices is converted into text information by a server. Specifically, speech recognition technology is used to convert the audio data into text data. This conversion uses speech analysis software (e.g., Google Speech-to-Text API).

[0736] Next, the server generates a work instruction sheet based on the acquired video data and converted text information. The work instruction sheet documents the procedure and presents it to the user step by step. Image processing technology (e.g., OpenCV) and data integration technology (e.g., Python data processing libraries) are used in this generation process.

[0737] The generated work instructions are stored in the shared data environment, which serves as the recording means. This environment uses a cloud platform for data storage and synchronization. Users can access these instructions as needed to review and execute their work.

[0738] As a concrete example, consider the shipping process for a new product at a logistics center. To learn this process, workers wear smart glasses, and the system captures and transmits visual and audio data to a server. The work instructions created from this data are then presented to the new worker through the visual device the next time they perform a similar task, facilitating rapid learning.

[0739] The following are examples of prompt messages:

[0740] "Please convert the audio explanation of the logistics center's shipping process into text, analyze the video footage of the operations, and output the results in the format of an operations manual."

[0741] This system, through its function of presenting work procedures using visual devices, can promote the standardization and efficiency of work processes and efficiently support new employee training in logistics centers.

[0742] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0743] Step 1:

[0744] The user wears a visual device and begins the shipping process at the logistics center. The user's visual device simultaneously collects audio and video during the work. The input consists of audio data emitted by the user and real-time video data, which are temporarily recorded in the visual device's storage.

[0745] Step 2:

[0746] The terminal sends the collected audio data to the server, which uses speech recognition technology to convert the audio into text. The input is audio data, and the server uses speech recognition software to convert the audio data into text strings; this text data is the output. In this way, the procedures explained in audio are recorded as text.

[0747] Step 3:

[0748] The server analyzes video data sent from the terminal and extracts key frames important for the task. The input is video data; it utilizes video frame analysis technology to detect important scenes, and the output is image data extracted as keyframes. Specifically, it uses computing resources to detect changes and patterns within the image.
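
The change detection described in Step 3 can be sketched as simple frame differencing. This is a minimal illustration rather than the method of this disclosure: frames are assumed to be flattened lists of grayscale pixel values, and the names `mean_abs_diff`, `select_keyframes`, and the threshold value are hypothetical. A production system would more likely operate on decoded video frames through an image processing library such as OpenCV.

```python
from typing import List, Sequence

# Hypothetical representation: one frame = flattened grayscale pixels (0-255).
Frame = Sequence[int]


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average per-pixel absolute difference between two equally sized frames."""
    if len(a) != len(b):
        raise ValueError("frames must have the same number of pixels")
    if not a:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_keyframes(frames: List[Frame], threshold: float = 20.0) -> List[int]:
    """Return indices of frames that differ enough from the last kept keyframe."""
    if not frames:
        return []
    keyframes = [0]  # always keep the first frame as the initial reference
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[keyframes[-1]], frames[i]) >= threshold:
            keyframes.append(i)
    return keyframes


frames = [
    [10, 10, 10, 10],
    [12, 11, 10, 9],     # nearly identical to frame 0
    [200, 200, 10, 10],  # large change: treated as a new scene
    [201, 199, 11, 10],  # nearly identical to frame 2
]
print(select_keyframes(frames))  # prints [0, 2]
```

Comparing each frame against the last kept keyframe, rather than against its immediate predecessor, keeps slow gradual changes from being missed.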

[0749] Step 4:

[0750] The server integrates the converted text information and extracted keyframes to generate a work instruction document. The input consists of text data and keyframe image data, and the output is the generated work instruction document. The integration method utilizes natural language processing to format the procedure step by step.
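
One way to picture the integration in Step 4 is to align transcribed utterances with keyframes by timestamp. The sketch below substitutes plain timestamp matching for the natural language processing mentioned above, and every name in it (`Utterance`, `Keyframe`, `InstructionStep`, `build_instruction_steps`, the file names) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Utterance:
    start_sec: float  # when the spoken explanation begins
    text: str         # transcribed text from the speech recognition step


@dataclass
class Keyframe:
    time_sec: float   # timestamp of the extracted frame
    image_path: str   # hypothetical file name of the saved frame


@dataclass
class InstructionStep:
    number: int
    text: str
    image_path: Optional[str]


def build_instruction_steps(
    utterances: List[Utterance], keyframes: List[Keyframe]
) -> List[InstructionStep]:
    """Pair each utterance with the first keyframe seen before the next utterance starts."""
    ordered_utts = sorted(utterances, key=lambda u: u.start_sec)
    ordered_kfs = sorted(keyframes, key=lambda k: k.time_sec)
    steps: List[InstructionStep] = []
    for i, utt in enumerate(ordered_utts):
        # Each utterance owns the time window up to the start of the next one.
        end = ordered_utts[i + 1].start_sec if i + 1 < len(ordered_utts) else float("inf")
        match = next(
            (kf for kf in ordered_kfs if utt.start_sec <= kf.time_sec < end), None
        )
        steps.append(InstructionStep(i + 1, utt.text, match.image_path if match else None))
    return steps


steps = build_instruction_steps(
    [Utterance(8.0, "Attach the shipping label."), Utterance(0.0, "Scan the barcode.")],
    [Keyframe(2.5, "kf_001.png"), Keyframe(9.0, "kf_002.png")],
)
for s in steps:
    print(f"Step {s.number}: {s.text} [{s.image_path}]")
```

An utterance with no keyframe in its window simply gets no image, so the step list stays complete even when the video analysis finds nothing notable.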

[0751] Step 5:

[0752] The server saves the generated work instructions to a shared data environment in the cloud. The input is the work instruction document, and the output is the same document stored in cloud storage. The specific operation involves uploading the data to the cloud via the internet while ensuring security.

[0753] Step 6:

[0754] The next time the user performs a similar task, they will put on the visual device, and the server will display the saved work instructions. The input is the work instructions retrieved from the cloud, and the output is the procedure displayed on the visual device. Specifically, the visual device's display sequentially presents the instructions for each step.
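
Steps 5 and 6 amount to a store, retrieve, and present cycle. The following sketch assumes a local JSON file as a stand-in for the cloud storage and a plain string per step as a stand-in for the visual device's display; the function names and the file name are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Iterator, List

Step = Dict[str, str]


def save_instructions(steps: List[Step], path: Path) -> None:
    """Serialize the instruction steps as JSON (the file stands in for cloud storage)."""
    path.write_text(json.dumps({"steps": steps}, ensure_ascii=False, indent=2), encoding="utf-8")


def load_instructions(path: Path) -> List[Step]:
    """Read the steps back in their original order."""
    return json.loads(path.read_text(encoding="utf-8"))["steps"]


def present_sequentially(steps: List[Step]) -> Iterator[str]:
    """Yield one display string per step, in order."""
    for number, step in enumerate(steps, start=1):
        yield f"[{number}/{len(steps)}] {step['text']} (image: {step['image']})"


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "shipping_instructions.json"
    save_instructions(
        [
            {"text": "Scan the barcode.", "image": "kf_001.png"},
            {"text": "Attach the shipping label.", "image": "kf_002.png"},
        ],
        store,
    )
    shown = list(present_sequentially(load_instructions(store)))

for line in shown:
    print(line)
```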

[0755] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0756] To implement the system of the present invention, a collaborative process involving the user, terminal, server, and emotion engine is employed. Through this process, it becomes possible not only to automate tasks but also to optimize work procedures based on the user's emotional state.

[0757] First, the user provides the details of their work to the system using a terminal. The user explains the work procedure verbally while performing operations on the screen. The terminal collects this audio and screen recording in real time. The recorded data is sent to the server.

[0758] The server analyzes the received data and converts the voice data of the business into text data using speech recognition technology. Meanwhile, it uses image analysis algorithms to extract important operational frames related to the business procedure from the screen data.

[0759] The emotion engine recognizes emotional data from the user's tone of voice and facial expressions. This emotional data reflects the user's psychological state during the work explanation and is used to improve the efficiency of work performance. For example, if the user is showing signs of stress, the emotion engine can add supplementary explanations to make the work procedure easier to understand.

[0760] The server integrates this data to generate a work manual. This manual includes text data, image data, and supplementary information based on the user's emotional state, supporting specific and efficient work execution. The generated manual is stored on a cloud-based storage system, which the user can access.

[0761] Subsequently, when the user performs the same task again, the server automatically executes the work procedure according to the stored manual. Based on past emotional data obtained from the emotion engine, the procedure is fine-tuned to ensure that the task is performed efficiently and stress-free.

[0762] As a concrete example, consider a scenario where a user teaches a system the customer support inquiry handling process. The user operates the customer management system while verbally explaining the appropriate response to the inquiry. The terminal records this and sends it to the server, which converts the audio to text and the screen to images, while an emotion engine monitors the user's stress level. The server integrates these elements and, if the stress level is high, generates a simplified manual. This allows another user to apply the system and handle inquiries stress-free the next time.

[0763] The following describes the processing flow.

[0764] Step 1:

[0765] The user uses a terminal and explains the procedure verbally while demonstrating the task. The user operates the relevant software via the screen, indicating the necessary actions.

[0766] Step 2:

[0767] The terminal captures the user's voice while they are working and records their screen activity. It also records the user's facial expressions using a built-in or connected camera.

[0768] Step 3:

[0769] The terminal sends the recorded audio, screen data, and facial expression data to the server. These are compressed in real time and transferred using a secure communication method.
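
As an illustration of the real-time compression in Step 3, the sketch below compresses recorded data chunk by chunk with Python's standard `zlib` module. The choice of zlib and the function names are assumptions, since the text does not name a codec, and the secure transport (for example, TLS) is deliberately left out.

```python
import zlib
from typing import Iterable, Iterator


def compress_stream(chunks: Iterable[bytes], level: int = 6) -> Iterator[bytes]:
    """Compress chunks incrementally so each piece can be sent as soon as it is recorded."""
    compressor = zlib.compressobj(level)
    for chunk in chunks:
        # Z_SYNC_FLUSH emits everything buffered so far, so the receiver
        # can decode this chunk without waiting for the end of the stream.
        yield compressor.compress(chunk) + compressor.flush(zlib.Z_SYNC_FLUSH)
    yield compressor.flush()  # finishes the stream


def decompress_stream(pieces: Iterable[bytes]) -> Iterator[bytes]:
    """Server-side counterpart: decode pieces in arrival order."""
    decompressor = zlib.decompressobj()
    for piece in pieces:
        data = decompressor.decompress(piece)
        if data:
            yield data
    tail = decompressor.flush()
    if tail:
        yield tail


chunks = [b"audio-frame-" * 50, b"screen-frame-" * 50]
compressed = list(compress_stream(chunks))
restored = b"".join(decompress_stream(compressed))
print(restored == b"".join(chunks))
print(sum(map(len, compressed)), "bytes sent for", sum(map(len, chunks)), "bytes recorded")
```

Flushing after every chunk costs a little compression ratio but bounds the latency between recording and transmission, which is the trade-off a real-time transfer would typically accept.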

[0770] Step 4:

[0771] The server converts the audio data into text data using a speech recognition algorithm. This process utilizes a customized dictionary to handle specialized terminology specific to the business.
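
One simple reading of the customized dictionary in Step 4 is a post-correction pass over the raw transcript. This is an assumption: a speech recognizer may instead accept vocabulary hints directly, and the text does not say which approach is used. The function name and the sample term map below are hypothetical.

```python
import re
from typing import Dict


def apply_custom_dictionary(transcript: str, term_map: Dict[str, str]) -> str:
    """Replace known misrecognitions with the correct business term.

    Matching is case-insensitive and limited to whole words or phrases.
    """
    corrected = transcript
    # Longer phrases first, so a short key cannot pre-empt a longer one.
    for wrong in sorted(term_map, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(wrong) + r"\b", flags=re.IGNORECASE)
        # A function replacement avoids backslash interpretation in the term.
        corrected = pattern.sub(lambda _m, term=term_map[wrong]: term, corrected)
    return corrected


term_map = {"see are em": "CRM", "tick it": "ticket"}
print(apply_custom_dictionary("Open the see are em and close the tick it.", term_map))
# prints: Open the CRM and close the ticket.
```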

[0772] Step 5:

[0773] The server analyzes screen data, identifies keyframes and interactions of user actions, and extracts image data to clarify business procedures.

[0774] Step 6:

[0775] The emotion engine analyzes the user's emotional state from their voice tone and facial expressions. Based on this analysis, it determines at which step the user is experiencing stress.

[0776] Step 7:

[0777] The server generates a work manual based on the text data, image data, and emotion analysis results. Supplementary information and warnings are inserted into the generated manual at the stress points identified for the user by the emotion analysis.
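
The insertion of supplementary information at stress points in Step 7 might look like the sketch below. The per-step stress scores, the 0.7 threshold, the wording of the note, and the names `ManualStep` and `annotate_stress_points` are all hypothetical; the text does not specify how stress is quantified.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ManualStep:
    number: int
    text: str
    notes: Tuple[str, ...] = ()


def annotate_stress_points(
    steps: List[ManualStep], stress_by_step: Dict[int, float], threshold: float = 0.7
) -> List[ManualStep]:
    """Return copies of the steps, adding a note where the stress score reaches the threshold."""
    annotated: List[ManualStep] = []
    for step in steps:
        score = stress_by_step.get(step.number, 0.0)  # unscored steps count as calm
        if score >= threshold:
            note = f"Caution (stress score {score:.2f}): take this step slowly and double-check the input."
            step = replace(step, notes=step.notes + (note,))
        annotated.append(step)
    return annotated


manual = annotate_stress_points(
    [ManualStep(1, "Open the inquiry."), ManualStep(2, "Issue the refund.")],
    {1: 0.2, 2: 0.9},
)
for step in manual:
    print(step.number, step.text, list(step.notes))
```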

[0778] Step 8:

[0779] The server saves the completed business manual to cloud storage, making it accessible to users. The manual includes comments and suggestions for improvement based on user feedback.

[0780] Step 9:

[0781] The next time the same user or another user performs a similar task, the server refers to the stored manual and prepares to execute the task automatically. Emotion data is used to optimize the procedure for smoother task execution.

[0782] Step 10:

[0783] The server saves sentiment analysis results and execution logs after task completion, which are then used to further improve operational efficiency and enhance the user experience. Through this process, user feedback can be incorporated into the design of business processes.

[0784] (Example 2)

[0785] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0786] In modern business, efficient and stress-free work execution is essential. However, when procedures are complex and not optimized to accommodate the emotional states of individual users, work efficiency decreases and stress increases. Solving this problem makes it possible to automate tasks and improve the user experience.

[0787] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0788] In this invention, the server includes input means for inputting work content in voice and image format, recognition means for extracting emotional information from the user's voice tone and facial expression data, and optimization means for creating a work manual that includes supplementary information to optimize work execution considering the emotional information. As a result, work procedures are optimized based on the user's emotional state, enabling efficient and stress-free work execution.

[0789] "Input means" refers to a device or technology for receiving work details from a user in audio and image format and inputting them into the system.

[0790] "Conversion means" refers to a device or technology for analyzing acquired audio data and converting it into text data.

[0791] "Generation means" refers to a device or technology for automatically generating a business manual that describes business procedures based on data obtained from audio and images.

[0792] "Recognition means" refers to a device or technology for analyzing a user's voice tone and facial expression data and extracting emotional information from them.

[0793] "Optimization means" refers to a device or technology for creating a work manual that takes extracted emotional information into account and includes supplementary information to optimize work performance.

[0794] "Memory means" refers to a device or technology for storing generated operational manuals and making them accessible as needed.

[0795] "Execution means" refers to a device or technology that automatically performs tasks according to stored operational manuals and further fine-tunes procedures based on emotional information.
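
The relationship among the means defined above can be sketched as a chain of interchangeable functions. This is only one possible decomposition, not the structure of this disclosure: the stress score as a single float, the names `Recording`, `Manual`, and `run_pipeline`, and the stub implementations are all hypothetical, and the execution means is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Recording:
    audio: bytes
    frames: List[bytes]


@dataclass
class Manual:
    steps: List[str]
    notes: List[str]


def run_pipeline(
    acquire: Callable[[], Recording],                # input means
    convert: Callable[[bytes], str],                 # conversion means
    recognize: Callable[[Recording], float],         # recognition means (stress score)
    generate: Callable[[str, List[bytes]], Manual],  # generation means
    optimize: Callable[[Manual, float], Manual],     # optimization means
    store: Callable[[Manual], None],                 # memory means
) -> Manual:
    """Run the means in the order the definitions suggest and return the stored manual."""
    recording = acquire()
    text = convert(recording.audio)
    stress = recognize(recording)
    manual = optimize(generate(text, recording.frames), stress)
    store(manual)
    return manual


# Stub implementations, for illustration only.
saved: List[Manual] = []
manual = run_pipeline(
    acquire=lambda: Recording(b"...", [b"frame"]),
    convert=lambda audio: "Open the inquiry. Reply to the customer.",
    recognize=lambda rec: 0.8,
    generate=lambda text, frames: Manual(
        [s.strip() + "." for s in text.split(".") if s.strip()], []
    ),
    optimize=lambda m, stress: Manual(
        m.steps, m.notes + (["Take your time."] if stress > 0.5 else [])
    ),
    store=saved.append,
)
print(manual.steps, manual.notes)
```

Passing each means as a function keeps the sketch agnostic about whether a given means is a device, a library, or a remote service.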

[0796] In implementing the system of the present invention, close cooperation is necessary between the user, terminal, server, and emotion engine. Through this cooperation, automation of tasks and improvement of the user experience can be achieved.

[0797] First, the user explains their work verbally and uses the terminal's screen as needed. For example, they might explain the customer support inquiry procedure verbally while operating a customer management system. During this process, the terminal collects the user's voice via a microphone and captures the actions on the screen. Recording software is used to collect the voice data, and capture software is used to collect the screen data.

[0798] The terminal transmits the collected audio and image data to the server in real time. On the server, a speech recognition engine converts the audio data into text; specifically, speech analysis software is used for the recognition. Meanwhile, the image data is analyzed by an image analysis tool to extract important operational details related to business procedures.

[0799] Simultaneously, an emotion engine embedded in the server analyzes the user's voice tone and facial expressions to recognize emotional information. This emotional information is used to optimize work procedures, and supplementary information is added according to the user's stress level.

[0800] The generated work manuals are stored in cloud storage. For example, a cloud storage service is used to securely store and retrieve the data. Users can access the stored manuals at any time, enabling automated and efficient procedure execution based on the manuals when repeating tasks.

[0801] A concrete example of a prompt message could be: "Please provide a voice-based explanation of the customer support inquiry handling procedure, perform operations on the customer management system using the terminal, and generate a manual based on the stress level." By following this prompt message and giving instructions to the system, an automated process is achieved.

[0802] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0803] Step 1:

[0804] The user explains the work content verbally and uses the terminal's operation screen. Specifically, the user explains the work procedure verbally and displays or inputs relevant information using the terminal's screen. The input consists of voice data and screen operation data, which are used in the next processing step.

[0805] Step 2:

[0806] The terminal records audio data from the user and captures screen operations. Specifically, it records audio data through the microphone and records the screen using screen capture software. The audio data and screen operation data are sent to the server as output from the terminal.

[0807] Step 3:

[0808] The server converts the received audio data into text data using speech recognition technology. For example, it uses dedicated speech recognition software to process the audio waveform into text information. The converted text data is output and its contents are used to analyze business procedures.

[0809] Step 4:

[0810] The server analyzes the received screen operation data using an image analysis tool. Specifically, it extracts important frames from the captured screen and identifies the content of each operation. The image analysis algorithm processes the screen data, and the results of the operation content extraction are output.

[0811] Step 5:

[0812] The emotion engine on the server recognizes user emotional information from audio and video data. Based on voice tone and facial expression patterns, calculations are performed to extract emotional states such as stress levels. The recognized emotional information is then output and used as a reference for optimizing business operations.

[0813] Step 6:

[0814] The server integrates text data, extracted operational data, and sentiment information to create optimized operational manuals using a generative AI model. Using each input data, specific procedure manuals are generated to streamline work processes. These operational manuals become the output and are stored in the cloud.

[0815] Step 7:

[0816] Users can access saved operational manuals and check work procedures as needed. They perform actual tasks based on the information in the manuals, and the system provides efficient support. The operational manuals, as output information, assist the user's operations.

[0817] (Application Example 2)

[0818] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0819] In today's world, there is a demand for increased efficiency and reduced emotional burden in people's work. However, conventional systems have not adequately optimized the automation of work procedures, and in particular, they struggle to respond flexibly while considering the emotional state of users. Therefore, there is a need to achieve efficient work automation while reducing the stress and anxiety that users experience during work.

[0820] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0821] In this invention, the server includes an acquisition means for inputting work content in voice and image format, an emotion analysis means for analyzing emotional states, and a generation means for generating optimized work instruction manuals. This enables efficient and flexible automation of tasks in response to the user's emotions.

[0822] "Means of acquisition" refers to a system that has the function of effectively inputting work details in the form of voice and images.

[0823] "Conversion means" refers to the technology that converts input audio data into corresponding text data.

[0824] The "generation means" refers to a device that has the function of automatically creating a work instruction manual containing work procedures from text and image data.

[0825] "Emotional analysis means" refers to technology that determines an individual's emotional state from the user's voice and other data.

[0826] "Optimization means" refers to technology that optimizes work instructions based on emotional data to match the emotional state of the user, thereby creating an efficient procedure.

[0827] "Memory means" refers to technologies for securely storing generated work instruction manuals and optimized data.

[0828] "Execution means" refers to a function that automatically carries out tasks based on the work instruction manual.

[0829] A "virtual environment" refers to a virtual space or infrastructure within a computer system used for performing business operations.

[0830] The system implementing this invention operates primarily through the cooperation of a server, a terminal, and a user. When a user begins work, the terminal acquires audio and image data in real time. This acquisition means has the function of converting the audio explanation of the work into text and the function of capturing the screen. The generated data is sent to the server.

[0831] The server uses speech recognition software to convert audio data into text data. The common speech recognition library "speech_recognition" is applied for this purpose. Furthermore, image analysis software is used to extract information necessary for business procedures from the acquired image data. The image processing library "OpenCV" is often used for this purpose.

[0832] In addition, the server performs sentiment analysis on the acquired data, using the "emotion_recognition" library to evaluate the user's emotional state. Based on this sentiment, the generation mechanism optimizes the work instructions, enabling users to perform their tasks more efficiently and with less stress. The optimized work instructions are stored in the memory mechanism and can be accessed at any time as needed.

[0833] One application of this system is a smart home robot that receives user instructions and performs cleaning and household chores. It can analyze the user's stress level and provide additional suggestions or support as needed. An example of a prompt message would be: "If the user instructs, 'Make dinner,' analyze the voice data to determine if the user is stressed and, if necessary, offer comforting suggestions to the user."

[0834] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0835] Step 1:

[0836] The terminal receives voice input for work-related tasks from the user and simultaneously captures screen operations. Both voice and image data are acquired as input. The terminal saves the acquired voice as digital data and records the screen capture in still image or video format.

[0837] Step 2:

[0838] The terminal sends audio data to the server. The server converts the received audio data into text data using the "speech_recognition" library. The output of this step is text data represented as character information.

[0839] Step 3:

[0840] The terminal sends image data to the server. The server uses image processing libraries such as "OpenCV" to analyze and extract important business operation frames. Specifically, it extracts image features related to a particular operation procedure and identifies them.

[0841] Step 4:

[0842] The server processes data obtained from audio and images using the "emotion_recognition" library to perform emotion analysis. It uses text data and metadata from images as input. Based on this analysis, it identifies the user's emotional state and calculates indicators of stress level and satisfaction.

[0843] Step 5:

[0844] The server integrates text data of acquired work content, image processing results, and emotional data to generate work instruction manuals. The generated manuals are optimized to adapt to the user's emotional state. These manuals provide guidance for efficiently executing specific work procedures.

[0845] Step 6:

[0846] The server stores optimized work instructions in a storage device, allowing users to access them later. By storing them in a cloud service, users can view the work instructions from any device.

[0847] Step 7:

[0848] When the user performs the same task again, the server provides data to perform the task according to the stored instructions. During this time, it makes fine adjustments based on sentiment analysis data to support the performance of the task.

[0849] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0850] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. Prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images, are input to the data generation model 58. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
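
The pairing of a prompt with multimodal inference data can be pictured as a small request structure. The sketch below is not the interface of ChatGPT, Gemini, or any other real service; the class names, field names, and file names are hypothetical and serve only to show the shape of the input described above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Literal

Modality = Literal["audio", "text", "image"]


@dataclass
class InferenceInput:
    modality: Modality
    content: str  # the text itself, or a reference (e.g. a file name) to audio/image data


@dataclass
class GenerationRequest:
    prompt: str                                  # the instruction to the model
    inputs: List[InferenceInput] = field(default_factory=list)

    def to_payload(self) -> Dict[str, object]:
        """Flatten the request into a plain dictionary, ready for serialization."""
        return {
            "prompt": self.prompt,
            "inputs": [{"modality": i.modality, "content": i.content} for i in self.inputs],
        }


request = GenerationRequest(
    prompt="Convert the audio explanation into text, analyze the footage, "
           "and output the results in the format of an operations manual.",
    inputs=[
        InferenceInput("audio", "shipping_explanation.wav"),
        InferenceInput("image", "kf_001.png"),
    ],
)
payload = request.to_payload()
print(payload["inputs"])
```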

[0851] In the above embodiment, an example was given in which the specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.

[0852] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping, namely an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.

[0853] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. Above and below the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.

[0854] These emotions are distributed at the 3 o'clock position on the Emotion Map 400, and usually fluctuate between feelings of security and anxiety. In the right half of the Emotion Map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.

[0855] The inside of the emotion map 400 represents inner thoughts, while the outside represents actions. Therefore, the closer an emotion lies to the outside of the emotion map 400, the more visible it becomes (that is, the more it is expressed in actions).
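
The layout of the emotion map 400 described above can be expressed as a simple coordinate convention. The sketch assumes a Cartesian plane centered on the map, with a hypothetical `inner_radius` separating the inner and outer rings; the source gives no coordinates, so the sample point and all names are illustrative.

```python
import math
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class MapPoint:
    x: float  # negative = left half, positive = right half
    y: float  # positive = upper (pleasure) side, negative = lower (displeasure) side

    @property
    def radius(self) -> float:
        """Distance from the center of the concentric circles."""
        return math.hypot(self.x, self.y)


def describe(point: MapPoint, inner_radius: float = 1.0) -> Dict[str, str]:
    """Classify a point according to the layout rules stated for the emotion map."""
    if point.x > 0:
        origin = "situational judgment"
    elif point.x < 0:
        origin = "reaction in the brain"
    else:
        origin = "both"  # directly above or below the center
    if point.y > 0:
        valence = "pleasure"
    elif point.y < 0:
        valence = "displeasure"
    else:
        valence = "neutral"
    layer = "inner (primitive)" if point.radius <= inner_radius else "outer (expressed in action)"
    return {"origin": origin, "valence": valence, "layer": layer}


print(describe(MapPoint(2.0, 0.5)))
```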

[0856] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https://ci.nii.ac.jp/naid/500000375379). The left half of the emotion map contains emotions belonging to a region called "response," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.

[0857] The emotion map defines two emotions that promote learning. One is the negative emotion around the midpoint between "repentance" and "reflection" on the situation side. In other words, it arises when the robot experiences negative feelings such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the positive emotion around "desire" on the response side. In other words, it arises when the robot has positive feelings such as "I want more" or "I want to know more."

[0858] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
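
The property that emotions located close together receive similar values can be illustrated with distance-based soft targets. This is only one conceivable way to build such training targets, not the training procedure of this disclosure: the map coordinates, the Gaussian falloff, `sigma`, and the function names are all assumptions.

```python
import math
from typing import Dict, Tuple

# Hypothetical 2-D positions on the emotion map; the source gives no coordinates.
POSITIONS: Dict[str, Tuple[float, float]] = {
    "reassured": (1.0, 0.2),
    "calm": (1.1, 0.0),
    "confident": (1.2, 0.3),
    "anxious": (1.0, -1.5),
}


def soft_targets(
    labelled: str, positions: Dict[str, Tuple[float, float]], sigma: float = 0.5
) -> Dict[str, float]:
    """Emotion values that decay with map distance from the labelled emotion."""
    lx, ly = positions[labelled]
    return {
        name: math.exp(-((x - lx) ** 2 + (y - ly) ** 2) / (2 * sigma ** 2))
        for name, (x, y) in positions.items()
    }


def determine_emotion(values: Dict[str, float]) -> str:
    """Pick the emotion with the highest emotion value."""
    return max(values, key=values.get)


values = soft_targets("calm", POSITIONS)
for name, value in values.items():
    print(f"{name}: {value:.3f}")
print("determined:", determine_emotion(values))
```

With these placements, "reassured" and "confident" receive values close to that of "calm", while the distant "anxious" receives a value near zero.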

[0859] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.

[0860] In the above embodiment, an example was given in which a specific process is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and a distributed processing of the specific process may be performed by multiple computers, including computer 22. For example, a data generation model 58 may be provided in an external device of the data processing device 12, and the external device may generate data according to the input data.

[0861] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes the specific processing according to the specific processing program 56.

[0862] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.

[0863] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.

[0864] The following types of processors can be used as hardware resources to perform the specific processing. One example is a CPU, a general-purpose processor that functions as a hardware resource performing the specific processing by executing software, i.e., a program. Other examples include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), and ASICs (Application Specific Integrated Circuits), whose circuit configurations are designed specifically to perform the specific processing. Every one of these processors has built-in or connected memory, and every one performs the specific processing by using that memory.

[0865] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.

[0866] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.

[0867] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.

[0868] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of the technical aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that you may delete unnecessary parts, add new elements, or replace elements in the descriptions and illustrations presented above, as long as you do not deviate from the essence of the technical aspects of this disclosure. Furthermore, in order to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge and the like that do not require special explanation to enable the implementation of the technical aspects of this disclosure have been omitted from the descriptions and illustrations presented above.

[0869] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted as being incorporated by reference.

[0870] The following is further disclosed regarding the embodiments described above.

[0871] (Claim 1)

[0872] An input means for inputting work content by voice and images,

[0873] A conversion means for converting the aforementioned voice into text data,

[0874] A generation means for generating a work manual that describes work procedures based on the aforementioned images and text data,

[0875] A storage means for storing the aforementioned work manual,

[0876] An execution means that automatically performs the work in accordance with the aforementioned work manual,

[0877] A system comprising the above means.

[0878] (Claim 2)

[0879] The input means has the function of receiving an audio explanation of the work and capturing the screen.

[0880] The system according to claim 1.

[0881] (Claim 3)

[0882] The aforementioned execution means has the function of performing the work in a cloud environment.

[0883] The system according to claim 1.
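
As a non-limiting illustration, the chain of means recited in claim 1 above (input, conversion, generation, storage, execution) can be sketched as a small pipeline. Everything named here is an assumption made for illustration and does not appear in the disclosure: the classes `WorkInput`, `WorkManual`, and `ManualStore`, the functions `generate_manual`, `execute_manual`, and `run_pipeline`, the caller-supplied `transcribe` and `run_step` callables, and the rule that one sentence of transcript becomes one step.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class WorkInput:
    """Work content supplied by voice and images (input means)."""
    audio: bytes
    images: List[bytes]


@dataclass
class WorkManual:
    """A work manual: ordered procedure steps plus the images they refer to."""
    title: str
    steps: List[str]
    images: List[bytes] = field(default_factory=list)


def generate_manual(title: str, text: str, images: List[bytes]) -> WorkManual:
    """Generation means (assumed rule: one transcript sentence per step)."""
    steps = [s.strip() for s in text.split(".") if s.strip()]
    return WorkManual(title=title, steps=steps, images=images)


class ManualStore:
    """Storage means: an in-memory stand-in for a real data store."""

    def __init__(self) -> None:
        self._manuals: Dict[str, WorkManual] = {}

    def save(self, manual: WorkManual) -> None:
        self._manuals[manual.title] = manual

    def load(self, title: str) -> WorkManual:
        return self._manuals[title]


def execute_manual(manual: WorkManual, run_step: Callable[[str], None]) -> int:
    """Execution means: perform each step in order and return the step count."""
    for step in manual.steps:
        run_step(step)
    return len(manual.steps)


def run_pipeline(work: WorkInput, title: str,
                 transcribe: Callable[[bytes], str],
                 store: ManualStore,
                 run_step: Callable[[str], None]) -> WorkManual:
    """Wire the means together in the order recited in the claim."""
    text = transcribe(work.audio)                       # conversion means
    manual = generate_manual(title, text, work.images)  # generation means
    store.save(manual)                                  # storage means
    execute_manual(store.load(title), run_step)         # execution means
    return manual
```

In this sketch the speech recognizer and the step executor are passed in as plain callables, so a real speech-to-text engine or automation backend could be substituted without changing the pipeline itself.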

[0884] "Example 1"

[0885] (Claim 1)

[0886] An information gathering means for acquiring audio information and visual information with which a user describes work content,

[0887] An information conversion means for converting the aforementioned audio information into text information,

[0888] A manual generation means for constructing business procedures using the aforementioned visual information and text information,

[0889] A data management means for storing the aforementioned business procedures in a data storage area,

[0890] A business execution means for reproducing and automating the work based on the aforementioned business procedures,

[0891] A system comprising the above means.

[0892] (Claim 2)

[0893] The aforementioned information gathering means has the function of recording an audio explanation of the work and acquiring the operation screen as video.

[0894] The system according to claim 1.

[0895] (Claim 3)

[0896] The aforementioned business execution means has a processing function that executes business in a network environment.

[0897] The system according to claim 1.

[0898] "Application Example 1"

[0899] (Claim 1)

[0900] An acquisition means for acquiring information on work content through audio and video,

[0901] A conversion means for converting the aforementioned audio into text information,

[0902] A generation means for generating a work instruction sheet that describes the work procedure based on the aforementioned video and text information,

[0903] A recording means for storing the aforementioned work instruction sheet,

[0904] A control means that automatically performs the work in accordance with the aforementioned work instruction sheet,

[0905] The control means including a display means that has a function of presenting the work procedure on a visual device,

[0906] A system comprising the above means.

[0907] (Claim 2)

[0908] The acquisition means has the function of receiving audio instructions for the work and collecting visual information.

[0909] The system according to claim 1.

[0910] (Claim 3)

[0911] The control means has the function of performing tasks in a shared data environment.

[0912] The system according to claim 1.
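
As a non-limiting illustration of the control means and its display means in claim 1 of this application example, the sketch below presents each step on a display before performing it. The names `Display`, `ListDisplay`, and `Controller`, the "Step n/total" message format, and the choice to show a step immediately before executing it are all assumptions for illustration; the disclosure does not specify them.

```python
from typing import Callable, List, Protocol


class Display(Protocol):
    """Anything that can present a message on a visual device."""

    def show(self, message: str) -> None: ...


class ListDisplay:
    """Hypothetical display means that records what it would render."""

    def __init__(self) -> None:
        self.shown: List[str] = []

    def show(self, message: str) -> None:
        self.shown.append(message)


class Controller:
    """Control means: presents each step, then performs it."""

    def __init__(self, display: Display, perform: Callable[[str], None]) -> None:
        self.display = display
        self.perform = perform

    def run(self, steps: List[str]) -> None:
        total = len(steps)
        for number, step in enumerate(steps, start=1):
            # Present the step on the visual device before acting on it.
            self.display.show(f"Step {number}/{total}: {step}")
            self.perform(step)
```

Because `Display` is a structural protocol, the same controller could drive smart glasses, a headset-type terminal, or a robot-mounted screen, provided each exposes a `show` method.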

[0913] "Example 2 of combining an emotion engine"

[0914] (Claim 1)

[0915] An input means for inputting work content by voice and images,

[0916] A conversion means for converting the aforementioned voice into text data,

[0917] A generation means for generating a work manual that describes work procedures based on the aforementioned images and text data,

[0918] A recognition means for extracting emotional information from the user's voice tone and facial expression data,

[0919] An optimization means for creating, taking the aforementioned emotional information into account, a work manual that includes supplementary information for optimizing work execution,

[0920] A storage means for storing the aforementioned work manual,

[0921] An execution means that automatically performs the work in accordance with the aforementioned work manual and fine-tunes the procedures based on the emotional information,

[0922] A system comprising the above means.

[0923] (Claim 2)

[0924] The system according to claim 1, wherein the input means has the function of receiving an audio explanation of a task, capturing a screen, and collecting emotional information.

[0925] (Claim 3)

[0926] The system according to claim 1, wherein the execution means has a function of optimizing business procedures using a generative AI model while performing the work in a cloud environment.
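
As a non-limiting illustration of the optimization means in claim 1 of this example, the sketch below attaches a supplementary note to a step when the emotional information recorded for that step suggests difficulty. The emotion labels, the `SUPPLEMENTS` table and its note texts, the `Step` class, and the `add_supplements` function are assumptions for illustration only. How emotional information is actually extracted from voice tone and facial expression data is outside the scope of this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Step:
    """One manual step with an optional supplementary note."""
    text: str
    note: Optional[str] = None


# Hypothetical mapping from an emotion label to supplementary information.
SUPPLEMENTS: Dict[str, str] = {
    "stressed": "Take extra care here; this step was demanding for the demonstrator.",
    "uncertain": "Double-check the result of this step before continuing.",
}


def add_supplements(steps: List[str], emotions: List[str]) -> List[Step]:
    """Pair each step with the note for its emotion label, if one is defined."""
    if len(steps) != len(emotions):
        raise ValueError("one emotion label is required per step")
    return [Step(text, SUPPLEMENTS.get(emotion))
            for text, emotion in zip(steps, emotions)]
```

Labels with no entry in the table, such as a neutral state, leave the step unchanged, so the manual only grows where the recorded emotion indicates that extra guidance may help.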

[0927] "Application Example 2 of combining an emotion engine"

[0928] (Claim 1)

[0929] An acquisition means for acquiring information on work content through voice and images,

[0930] A conversion means for converting the aforementioned voice into text data,

[0931] A generation means for generating a work instruction manual that describes work procedures based on the aforementioned images and text data,

[0932] An emotion analysis means for analyzing the emotional state of the user,

[0933] An optimization means for optimizing the work instruction manual based on the emotional data obtained by the emotion analysis means,

[0934] A storage means for storing the optimized work instruction manual,

[0935] An execution means that automatically performs the work in accordance with the aforementioned work instruction manual,

[0936] A system comprising the above means.

[0937] (Claim 2)

[0938] The acquisition means has the function of receiving voice explanations of the work and capturing the display screen, and additionally has the function of analyzing emotional data.

[0939] The system according to claim 1.

[0940] (Claim 3)

[0941] The aforementioned execution means has the function of performing the work in a virtual environment, and additionally has the function of fine-tuning work procedures based on the emotional state.

[0942] The system according to claim 1.

[Explanation of Symbols]

[0943]
10, 210, 310, 410 Data processing system
12 Data processing device
14 Smart device
214 Smart glasses
314 Headset-type terminal
414 Robot

Claims

1. A system comprising: an acquisition means for acquiring information on work content through audio and video; a conversion means for converting the audio into text information; a generation means for generating a work instruction sheet that describes the work procedure based on the video and the text information; a recording means for storing the work instruction sheet; and a control means that automatically performs the work in accordance with the work instruction sheet, wherein the control means includes a display means that has a function of presenting the work procedure on a visual device.

2. The system according to claim 1, wherein the acquisition means has a function of receiving audio instructions for the work and collecting visual information.

3. The system according to claim 1, wherein the control means has a function of performing the work in a shared data environment.

Citation Information

Patent Citations

Persona chatbot control method and system
JP2022180282A
