system

The automated interview system addresses the inefficiencies and subjectivity of traditional interviews by objectively evaluating applicants' emotions and skills, ensuring fair and efficient matching through real-time data analysis and personalized feedback.

JP2026104619A · Pending · Publication Date: 2026-06-25 · SOFTBANK GROUP CORP

Patent Information

Authority / Receiving Office
JP · JP
Patent Type
Applications
Current Assignee / Owner
SOFTBANK GROUP CORP
Filing Date
2024-12-13
Publication Date
2026-06-25

AI Technical Summary

Technical Problem

Conventional interview processes are costly in labor and often result in unfair evaluations due to interviewer subjectivity, leading to oppressive interviews and mismatches between companies and job seekers, with applicants unable to demonstrate their true abilities efficiently.

Method used

An automated interview system that objectively evaluates applicants' emotions and motivation by converting voice and image data into text in real time, analyzing emotional states, and calculating a suitability score to improve the fairness and efficiency of the interview process.

Benefits of technology

Enables fair and efficient interviews by accurately assessing applicants' skills and characteristics, providing feedback, and suggesting next steps, thereby improving the matching process between companies and job seekers.

✦ Generated by Eureka AI based on patent content.


Abstract

A system is provided. [Solution] The system includes: means for conducting automated interviews for the purpose of data acquisition; means for converting audio data to text in real time; means for analyzing audio and image data to infer emotional states; means for calculating the degree of fit between the applicant and the organization based on the analysis results; means for performing facial expression analysis using video equipment for the purpose of data visualization; and means for evaluating the psychological state of applicants by performing voice analysis.

Description

Technical Field

[0001] The technology of the present disclosure relates to a system.

Background Art

[0002] Patent Document 1 discloses a method for controlling a persona chatbot, which is performed by at least one processor and includes steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of the chatbot's character, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance in response to the user utterance.

Prior Art Documents

Patent Documents

[0003]

Patent Document 1

Summary of the Invention

Problems to be Solved by the Invention

[0004] Conventional interview processes are costly in labor and often result in unfair evaluations due to the interviewer's subjectivity. There is also a risk of oppressive interviews and power harassment. As a result, there are cases where job seekers cannot demonstrate their true abilities, and there may be a mismatch between companies and job seekers. Therefore, there is a need to realize a fair and efficient interview process and support better matching between companies and job seekers.

Means for Solving the Problems

[0005] This invention provides an automated interview system for data acquisition. Specifically, it includes a means for objectively evaluating the applicant's emotions and motivation by converting voice data into text in real time and further analyzing the emotional state from the voice and image data. Based on the analysis results, it calculates the degree of matching between the applicant and the organization and provides feedback on the interview results, thereby improving the fairness and efficiency of the interview process. Furthermore, by making suggestions for the next interview stage, it is possible to smoothly advance the entire recruitment process.

[0006] "Automated interviews" refer to the process of conducting interviews and evaluations using artificial intelligence and automation technology, without the involvement of human interviewers.

[0007] "Audio data" refers to conversational and auditory information collected during an interview, which serves as the foundational data that is converted into text and sentiment information for analysis.

[0008] "Text conversion" is the process of converting audio data into text information, and is carried out using natural language processing technology.

[0009] "Emotional state" refers to the applicant's psychological state and emotions, and is information inferred through the analysis of voice and facial expressions.

[0010] "Analysis methods" refer to technologies and algorithms used to process audio and image data in order to understand the characteristics and trends of the data.

[0011] "Fit" is an evaluation index that quantifies the degree to which an applicant's skills and characteristics match the company's job requirements.

[0012] "Feedback" refers to the information and advice provided to applicants through interview results and evaluations.

[0013] A "proposal" is the act of recommending the next steps or actions to the applicant based on the analysis results.

Brief Description of the Drawings

[0014]
[Figure 1] A conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.
[Figure 2] A conceptual diagram showing an example of the main functions of a data processing device and a smart device according to the first embodiment.
[Figure 3] A conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.
[Figure 4] A conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
[Figure 5] A conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.
[Figure 6] A conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
[Figure 7] A conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.
[Figure 8] A conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
[Figure 9] An emotion map to which a plurality of emotions are mapped.
[Figure 10] An emotion map to which a plurality of emotions are mapped.
[Figure 11] A sequence diagram showing the processing flow of the data processing system in Example 1.
[Figure 12] A sequence diagram showing the processing flow of the data processing system in Application Example 1.
[Figure 13] A sequence diagram showing the processing flow of the data processing system in Example 2 when an emotion engine is combined.
[Figure 14] A sequence diagram showing the processing flow of the data processing system in Application Example 2 when an emotion engine is combined.

MODE FOR CARRYING OUT THE INVENTION

[0015] Hereinafter, an example of an embodiment of the system according to the technology of the present disclosure will be described with reference to the accompanying drawings.

[0016] First, the terms used in the following description will be explained.

[0017] In the following embodiments, a processor denoted by a reference numeral (hereinafter simply referred to as "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. The processor may also be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), an APU (Accelerated Processing Unit), and the like.

[0018] In the following embodiments, a RAM (Random Access Memory) denoted by a reference numeral is a memory in which information is temporarily stored and which is used as a work memory by the processor.

[0019] In the following embodiments, a storage denoted by a reference numeral is one or more non-volatile storage devices that store various programs, various parameters, and the like. Examples of non-volatile storage devices include flash memory (e.g., an SSD (Solid State Drive)), magnetic disks (e.g., hard disks), and magnetic tapes.

[0020] In the following embodiments, a communication interface (I/F) denoted by a reference numeral is an interface that includes a communication processor, an antenna, and the like. The communication interface manages communication between multiple computers. Examples of communication standards applicable to the communication interface include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), and Bluetooth (registered trademark).

[0021] In the following embodiments, "A and/or B" is synonymous with "at least one of A and B." That is, "A and/or B" means that it may be A alone, B alone, or a combination of A and B. Furthermore, in this specification, the same concept as "A and/or B" applies when expressing three or more things linked by "and/or."

[0022] [First Embodiment]

[0023] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.

[0024] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.

[0025] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0026] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.

[0027] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.

[0028] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and/or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.

[0029] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.

[0030] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.

[0031] As shown in Figure 2, in the data processing device 12, a specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.

[0032] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0033] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used in conjunction with a specific processing program 56 by the data processing system 10. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0034] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0035] This invention is implemented in the form of an automated interview system and consists of three main components: a server, a terminal, and a user. The functions performed by each component of the system and their specific roles are described below.

[0036] 1. Server Functions and Roles

[0037] The server acts as the central hub of the system, managing interviews, processing and analyzing data, and providing results. Specifically, the server prepares an optimal set of questions for each applicant and sends them to the terminal in real time. Furthermore, the server quickly converts the received audio data into text and analyzes the text using natural language processing technology. The emotional state is inferred from the tone and tempo of the voice and, if the interview is in video format, from facial expressions in the image data. This allows the server to quantify the applicant's suitability and skills and calculate their fit with the company.

[0038] 2. Functions and roles of the terminal

[0039] The terminal functions as an interface with the user. Specifically, the terminal presents the user with questions sent from the server and records the user's responses as audio data. It also sends the recorded audio data to the server in real time, maintaining a data pipeline to ensure the smooth progress of the interview. The terminal provides an environment where the user can behave naturally and without stress.

[0040] 3. User Roles

[0041] As the main participant in the interview process, the user verbally answers questions presented via the terminal. The user's responses are recorded in real time and sent to the server as audio data for analysis. Users are expected to present their skills and experience without feeling nervous.

[0042] These components work together organically to enable fair and efficient automated interviews. For example, the server sends a question to the terminal saying, "Please introduce yourself," and the user answers. Through this series of interactions, the system can evaluate the applicant's abilities and characteristics with high accuracy.

[0043] The following describes the processing flow.

[0044] Step 1:

[0045] The server initiates the interview session and selects the first question from a pre-configured list. The selected question is then sent to the terminal.

[0046] Step 2:

[0047] The terminal presents the user with questions received from the server, either verbally or in text. The interface is appropriately adjusted to ensure the user can comfortably answer the questions.

[0048] Step 3:

[0049] The user verbally answers the questions presented by the terminal. The answers are recorded in real time by the terminal.

[0050] Step 4:

[0051] The terminal sends the recorded audio data to the server. The transmission is performed without delay, ensuring high-quality data.

[0052] Step 5:

[0053] The server inputs the received audio data into a natural language processing (NLP) engine and converts it into text. The converted text becomes the basis for analyzing interview responses.

[0054] Step 6:

[0055] The server analyzes text data, extracts keywords, and evaluates the content of the responses. Furthermore, it infers the user's emotional state through analysis of tone and pace.

[0056] Step 7:

[0057] Based on the analysis results, the server matches the user's skill set with the skills required by the company and calculates a suitability score.

[0058] Step 8:

[0059] The server sends the suitability score and analysis results to the terminal, while simultaneously selecting the next question and continuing the evaluation as needed.

[0060] Step 9:

[0061] The terminal displays the feedback and analysis results received from the server to the user and, if necessary, presents the next question.

[0062] Step 10:

[0063] Once the process is complete, the server sends the final evaluation results to the company and also provides the user with information about the interview results and the next steps.
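The ten steps above can be read as a simple loop over questions. The following is a minimal sketch under stated assumptions, not the implementation described in the disclosure: the names `InterviewSession`, `run_session`, and the three stub callables are hypothetical, and taking the mean of per-question scores as the final evaluation is an assumption made only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class InterviewSession:
    """Hypothetical container for one automated interview session."""
    questions: List[str]
    transcripts: List[str] = field(default_factory=list)
    scores: List[float] = field(default_factory=list)


def run_session(
    session: InterviewSession,
    ask: Callable[[str], bytes],         # terminal: present question, return recorded audio (Steps 2-4)
    transcribe: Callable[[bytes], str],  # server: audio -> text (Step 5)
    score: Callable[[str], float],       # server: text -> suitability score (Steps 6-7)
) -> float:
    """Run the question loop and return a final score (mean is an assumption)."""
    for question in session.questions:   # Steps 1 and 8: pick the next question
        audio = ask(question)
        text = transcribe(audio)
        session.transcripts.append(text)
        session.scores.append(score(text))
    # Step 10: aggregate into a final evaluation result.
    if not session.scores:
        return 0.0
    return sum(session.scores) / len(session.scores)


# Stub callables standing in for the real terminal and server components.
def fake_ask(question: str) -> bytes:
    return f"answer to: {question}".encode("utf-8")


def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_score(text: str) -> float:
    # Toy score: longer answers score higher, capped at 1.0.
    return min(len(text.split()) / 10.0, 1.0)


session = InterviewSession(questions=["Please introduce yourself."])
final_score = run_session(session, fake_ask, fake_transcribe, fake_score)
```

Passing the terminal and server behaviour in as callables keeps the loop independent of any particular speech recognition or scoring service.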

[0064] (Example 1)

[0065] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0066] In traditional interview processes, questions were standardized, making it difficult to evaluate applicants based on their individual characteristics. Furthermore, real-time data processing was insufficient, posing a challenge in realizing an efficient applicant evaluation system. Additionally, the assessment of applicants' emotional states and the calculation of suitability were cumbersome, making it difficult to achieve appropriate matching between companies and applicants.

[0067] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0068] In this invention, the server includes means for dynamically creating personalized questions based on generated information, means for transmitting the question set to a terminal in real time, and means for recording the user's responses as audio information and transmitting the information to the server. This enables an interview process optimized for each applicant and allows for real-time data analysis to evaluate suitability.

[0069] "A means of dynamically creating personalized questions based on generated information" refers to a method for generating appropriate questions in real time based on information that differs for each applicant, thereby providing a customized interview experience.

[0070] "A means of sending question sets to terminals in real time" refers to a method of instantly distributing questions generated on a server to terminals, enabling smooth interview progress.

[0071] "Means for recording the user's response as audio information and transmitting said information to a server" refers to means for saving the content of the applicant's verbal response as audio data and promptly transmitting that data to a server.

[0072] "Means for converting audio information into text information using automatic conversion means" refers to means that use speech recognition technology to convert recorded audio data into text.

[0073] "A means of analyzing textual information converted using natural language processing technology to evaluate emotional states" refers to a method of extracting emotions and intentions from interview content obtained as text using natural language processing technology, and evaluating the psychological state of the applicant.

[0074] "Methods for evaluating the suitability of applicants to the organization based on analyzed information" refer to methods for quantifying applicant characteristics and skills and calculating the degree of suitability by comparing them with the organization's requirements.

[0075] This invention is implemented as an automated interview system and mainly consists of three components: a server, a terminal, and a user. This system is designed to efficiently evaluate the suitability of each applicant, and its details are described below.

[0076] Server Functions

[0077] The server functions as the central device of the system. Based on data from applicants, it generates personalized questions using a generative AI model. For example, it dynamically generates questions based on the applicant's history using natural language generation technology. To achieve this, it uses machine learning frameworks such as TensorFlow® and PyTorch to manage question generation.

[0078] The server also receives voice information transmitted from the user via the terminal and converts it into text. This conversion process uses speech recognition software such as the Google® Cloud Speech-to-Text API. The converted text information is analyzed using natural language processing libraries such as NLTK and spaCy. This analysis extracts and quantifies the applicant's emotions and skills to calculate the degree of fit.
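To make the text analysis step concrete, the following is a minimal standard-library sketch of keyword extraction from a transcript. It stands in for the NLP libraries named above rather than using them, and the `SKILL_TERMS` vocabulary and the function name `extract_skill_keywords` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical skill vocabulary; a real deployment would presumably derive
# it from the organization's job requirements.
SKILL_TERMS = {"python", "software", "development", "testing", "leadership"}


def extract_skill_keywords(transcript: str) -> Counter:
    """Count occurrences of known skill terms in a transcript (case-insensitive)."""
    tokens = re.findall(r"[a-z]+", transcript.lower())
    return Counter(token for token in tokens if token in SKILL_TERMS)


keywords = extract_skill_keywords(
    "I have been working in software development for five years."
)
```

Here `keywords` counts one occurrence each of "software" and "development"; a production system would likely add lemmatization and phrase matching.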

[0079] Terminal functions

[0080] The terminal acts as an interface connecting the user and the server. It visually displays questions sent from the server for the user to review. Furthermore, it can also present questions audibly using Text-to-Speech technology. During the question-and-answer session, the terminal records the user's responses using a high-quality microphone and sends the audio data to the server in real time. This communication takes place in a stable internet environment.

[0081] User roles

[0082] The user is the subject of the interview and verbally answers questions presented via the terminal. This allows the user's skills and experience to be collected as audio data. Users are expected to be as natural as possible during the interview, and creating a stress-free environment is crucial.

[0083] Examples of specific cases and prompt statements

[0084] As a concrete example, if a user is asked "Please introduce yourself," they might respond with "I have been working in software development for five years." This information is then analyzed sequentially to provide appropriate feedback.

[0085] An example of a prompt message might be, "Create questions to elicit self-introductions from applicants during interviews. These questions should be designed to facilitate the assessment of the applicant's skills and experience." This allows the system to present applicants with personalized questions, enabling more efficient interviews.
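One way such a prompt could be personalized is by appending applicant background to the base instruction. The sketch below is only an illustration: `ApplicantProfile`, its fields, and `build_question_prompt` are hypothetical names, and the resulting string would still need to be passed to whichever generative AI model is used.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ApplicantProfile:
    """Hypothetical applicant record used to personalize the prompt."""
    years_experience: int
    fields: List[str]


# Base instruction taken from the example prompt in the text.
BASE_INSTRUCTION = (
    "Create questions to elicit self-introductions from applicants during "
    "interviews. These questions should be designed to facilitate the "
    "assessment of the applicant's skills and experience."
)


def build_question_prompt(profile: ApplicantProfile) -> str:
    """Combine the base instruction with applicant-specific background."""
    background = (
        f"Applicant background: {profile.years_experience} years of experience "
        f"in {', '.join(profile.fields)}."
    )
    return f"{BASE_INSTRUCTION}\n{background}"


prompt = build_question_prompt(
    ApplicantProfile(years_experience=5, fields=["software development"])
)
```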

[0086] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0087] Step 1:

[0088] The server receives the applicant's data and inputs prompts into a generative AI model to generate personalized questions. Specifically, it generates prompts based on the applicant's history and work experience, and inputs them into the generative AI model to create the most appropriate questions. As output, a personalized set of questions is generated and used in the next step.

[0089] Step 2:

[0090] The server sends the generated set of questions to the terminal in real time. The terminal either displays them visually or presents them aloud using speech synthesis technology. For example, it might display the question "Please introduce yourself" on the screen and simultaneously read it aloud. In this step, the generated set of questions is used as input, and the output is a question in a format that the user can review.

[0091] Step 3:

[0092] The user verbally answers questions presented through the terminal. The terminal records these voice responses using a high-quality microphone and generates audio data. The input is the user's verbal responses, and the output is the generated audio data. This audio data is immediately sent to the server.

[0093] Step 4:

[0094] The server converts the received audio data into text using speech recognition software. Specifically, it passes the audio file to a service like Google Cloud Speech-to-Text and obtains text data as the conversion result. The input is audio data, and the output is parseable text data.

[0095] Step 5:

[0096] The server applies natural language processing techniques to the converted text data to evaluate the applicant's emotional state and skills. This analysis utilizes libraries such as NLTK and spaCy to analyze extracted keywords and context. The input is text data, and the output is a numerical evaluation result.

[0097] Step 6:

[0098] The server calculates the applicant's fit with the organization based on the analyzed information. Using a machine learning model, it calculates a fit score by comparing skills with required qualifications. The input is the evaluation result, and the output is the fit score. This step involves data manipulation to compare the applicant's skills with the company's hiring requirements.

[0099] Step 7:

[0100] The server generates the final evaluation results and provides feedback to the user via the terminal. Specifically, it generates an evaluation report and makes it available for viewing on the terminal. At this stage, the fit score is used as input, and a detailed feedback report is provided as output.
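The fit score calculation in Step 6 can be illustrated with a simple formula. The text says a machine learning model is used; the sketch below substitutes a plain weighted-coverage formula so that the comparison of skills against requirements is easy to follow. The function name `fit_score`, the skill levels, and the importance weights are all assumptions.

```python
from typing import Dict


def fit_score(
    applicant_skills: Dict[str, float],  # skill -> assessed level in [0, 1]
    required_skills: Dict[str, float],   # skill -> importance weight (> 0)
) -> float:
    """Weighted coverage of required skills, returned as a value in [0, 1]."""
    total_weight = sum(required_skills.values())
    if total_weight == 0:
        return 0.0
    covered = sum(
        # Clamp each assessed level to [0, 1]; missing skills count as 0.
        weight * min(max(applicant_skills.get(skill, 0.0), 0.0), 1.0)
        for skill, weight in required_skills.items()
    )
    return covered / total_weight


score = fit_score(
    {"software development": 1.0, "testing": 0.5},
    {"software development": 2.0, "testing": 1.0, "leadership": 1.0},
)
```

With these inputs the covered weight is 2.5 out of 4.0, so `score` is 0.625. A learned model could replace the formula while keeping the same inputs and output range.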

[0101] (Application Example 1)

[0102] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0103] Traditional interview systems require considerable time and effort to accurately assess an applicant's suitability, and the interviewer's subjectivity can influence the outcome. Furthermore, it is difficult to accurately grasp an applicant's emotions and psychological state, resulting in a problem where their fit with the organization cannot be properly determined.

[0104] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0105] In this invention, the server includes means for performing facial expression analysis using video equipment for the purpose of data visualization, means for performing voice analysis to evaluate the applicant's psychological state, and means for converting voice data into text in real time. This makes it possible to more accurately evaluate the applicant's emotions and aptitude, and to objectively and efficiently calculate their suitability for the organization.

[0106] "Data acquisition" refers to the process by which the system collects information related to interviews in real time.

[0107] "Audio data" refers to recordings of applicants' statements and saving the waveform of those sounds in digital format.

[0108] "Real-time text conversion" refers to the process of converting audio data into text data the moment it is received.

[0109] "Emotional state estimation" involves analyzing audio and image data to evaluate the applicant's emotions and psychological state.

[0110] "Analysis from audio and image data" refers to a method of measuring and analyzing data such as the applicant's voice tone and facial expressions.

[0111] "Fit" refers to the degree of compatibility or matching between an applicant and an organization, expressed using numerical values or indicators.

[0112] A "server" is a central device that manages and analyzes interview data.

[0113] "Data visualization" is a technique that displays collected data in the form of graphs, charts, and other diagrams to make the content easier to understand intuitively.

[0114] "Facial expression analysis using video equipment" is a method that uses devices such as cameras to analyze the facial expressions of applicants and evaluate their emotional state.

[0115] "Voice analysis" is the process of analyzing voice data in detail to extract characteristics of speech content and emotions.

[0116] This invention is a system that evaluates applicants' aptitude and psychological state through automated interviews. The server receives applicant voice data in real time and converts it to text using advanced speech recognition software. This process utilizes speech recognition technologies such as the Google Speech-to-Text API. Furthermore, the server analyzes the received text data through natural language processing to evaluate emotions and psychological state. For this purpose, an emotion analysis model using the Transformers library is employed.

[0117] The terminal functions as an interface with the user (applicant), and is equipped with a camera and microphone to capture the applicant's video and audio. During the interview, the terminal performs facial expression analysis based on the video data and sends the results to the server. This facial expression analysis uses image processing technology based on OpenCV. This allows the server to comprehensively analyze the data obtained from the audio and images and calculate the applicant's suitability.

[0118] For example, if a security company uses this system to interview new security guards, the interviewer can monitor the applicant's psychological state and confidence in their answers in real time through smart glasses, and ask appropriate follow-up questions. This makes it possible to objectively and efficiently evaluate the applicant's characteristics.

[0119] An example of a prompt for a generative AI model is, "Please tell me what kind of facial expression the applicant has, and whether it indicates stress or confidence." This prompt enables deeper emotional analysis.

[0120] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0121] Step 1:

[0122] The terminal activates its camera and microphone as soon as the user begins preparing for the interview. The user's voice and video data are acquired as input and transmitted to the server in real time. During this process, data compression is performed to optimize transmission to the server.

[0123] Step 2:

[0124] The server converts received audio data into text in real time using speech recognition software. It receives compressed audio data as input and generates text data of applicants' responses as output. In this process, it uses the Google Speech-to-Text API to convert the audio signal into text format.

[0125] Step 3:

[0126] The server processes the converted text data through a natural language processing model (e.g., Transformers) to perform sentiment analysis. The input is text data, and the output is the sentiment evaluation result. It evaluates the applicant's emotional state based on the text content and provides the evaluation result to the interviewer.

[0127] Step 4:

[0128] The terminal analyzes the captured video data in real time using OpenCV to recognize and analyze the user's facial expressions. It extracts facial features from the input video data and generates, as output, numerical data on the user's emotional state based on their facial expressions. In this process, it performs image processing to capture subtle changes in facial expressions.

[0129] Step 5:

[0130] The server integrates the results of sentiment analysis based on audio and video data to calculate an overall suitability score. The input is the results of each data analysis, and the output is an indicator of the applicant's suitability score. This process allows for a comprehensive assessment of the applicant's emotions and psychological state, and evaluates their compatibility with the organization.
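The integration in Step 5 could take many forms; the text does not specify one. The following sketch assumes a simple late fusion: a per-emotion weighted average of the text-based and facial-expression scores, followed by a mapping to a single value. The emotion labels "confidence" and "stress" follow the example prompt above, while the function names, the equal default weighting, and the mapping formula are assumptions.

```python
from typing import Dict


def fuse_emotion_scores(
    text_scores: Dict[str, float],
    face_scores: Dict[str, float],
    text_weight: float = 0.5,
) -> Dict[str, float]:
    """Late fusion: per-emotion weighted average of the two modalities."""
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must be in [0, 1]")
    emotions = sorted(set(text_scores) | set(face_scores))
    return {
        emotion: text_weight * text_scores.get(emotion, 0.0)
        + (1.0 - text_weight) * face_scores.get(emotion, 0.0)
        for emotion in emotions
    }


def overall_suitability(fused: Dict[str, float]) -> float:
    """Map fused confidence and stress to [0, 1] (illustrative formula only)."""
    raw = 0.5 + (fused.get("confidence", 0.0) - fused.get("stress", 0.0)) / 2.0
    return min(max(raw, 0.0), 1.0)


fused = fuse_emotion_scores(
    {"confidence": 0.75, "stress": 0.25},  # from text sentiment analysis
    {"confidence": 0.25, "stress": 0.25},  # from facial expression analysis
)
suitability = overall_suitability(fused)
```

In this example the fused confidence is 0.5 and the fused stress is 0.25, giving a suitability of 0.625.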

[0131] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0132] The present invention is implemented as an automated interview system incorporating an emotion engine. In addition to the configuration of a server, terminal, and user, the system has the function of recognizing the user's emotions using the emotion engine and integrating them into the analysis data.

[0133] 1. Server Functions and Roles

[0134] The server manages the entire interview process, particularly the analysis and integration of data obtained from the emotion engine. The server processes both audio and image data in real time, utilizing the emotion engine to infer the user's emotional state. Furthermore, it integrates this emotional data into existing analysis algorithms to provide a more comprehensive evaluation of the applicant's skills and suitability.

[0135] 2. Terminal Functions and Roles

[0136] The terminal acts as an interface, connecting the user and the system. It presents the user with questions sent from the server, records the user's responses, and sends them back to the server. It also captures the user's facial expressions through its built-in camera, providing data to the emotion engine in real time.

[0137] 3. Emotion Engine Functions and Roles

[0138] The emotion engine analyzes nonverbal information obtained from the user's voice and facial expressions to identify their emotional state. This engine utilizes facial recognition technology and acoustic analysis to quantify the user's tension, anxiety, excitement, and other emotional states. The identified emotional state is then adjusted so that it does not unduly affect the evaluation of the user's abilities, and is sent to the server.

[0139] 4. User Roles

[0140] Users respond to questions presented via their device and participate in the interview in a natural manner. The introduction of an emotion engine includes the user's emotional aspects in the interview evaluation; however, this is intended to complement the accurate assessment of abilities, ensuring that the user's true capabilities are fairly evaluated.

[0141] As a concrete example of the system, the server sends the question "What are your professional strengths?" to the terminal, and the user answers, "I have strong analytical skills." Subsequently, the emotion engine detects that the user is stating this answer with confidence and incorporates this information into the overall evaluation. This creates an interview environment where the user can relax and demonstrate their true worth.

[0142] The following describes the processing flow.

[0143] Step 1:

[0144] The server prepares to begin the interview session, selects the first question from a pre-configured list, and sends it to the terminal.

[0145] Step 2:

[0146] The terminal displays and audibly presents the questions received from the server to the user. It then activates voice input mode to allow the user to answer.

[0147] Step 3:

[0148] The user answers questions from the device verbally. The device records the user's voice in high quality and sends the audio data to the server.

[0149] Step 4:

[0150] The server processes the received audio data using a natural language processing engine to convert it into text. During this process, it prepares to analyze the keywords and content of the response.

[0151] Step 5:

[0152] The device captures the user's facial expressions with its built-in camera and sends that video data to an emotion engine. This allows for real-time analysis of emotional information.

[0153] Step 6:

[0154] The emotion engine analyzes the user's voice tone and facial expressions to identify their emotional state. This information is used to estimate the user's level of tension and confidence.

[0155] Step 7:

[0156] The server integrates text and sentiment data to calculate a comprehensive score that adjusts for the user's skill fit and the influence of emotions. This allows for an assessment of the applicant's match with the company's needs.

[0157] Step 8:

[0158] The server returns the analysis results to the terminal, which then displays the results to the user as feedback. If necessary, it displays the next interview question and continues the interview process.

[0159] Step 9:

[0160] Once the user's interview is complete, the server compiles all the analysis results and sends an overall evaluation of the applicant to the company. The company then uses this information to make hiring decisions.
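
The overall control flow of Steps 1 to 9 can be sketched as a simple loop. This is an illustrative outline under stated assumptions, not the claimed implementation: `run_interview`, the `ask` and `score` callables, and the averaging of per-question scores are hypothetical stand-ins for the terminal interaction, the emotion engine, and the scoring logic.

```python
from statistics import mean
from typing import Callable, Dict, List


def run_interview(
    questions: List[str],
    ask: Callable[[str], str],
    score: Callable[[str, str], float],
) -> Dict[str, object]:
    """Present each question, score each answer, and compile a final report."""
    details = []
    for question in questions:                # Steps 1-2: send and present
        answer = ask(question)                # Steps 3-4: answer, already as text
        details.append({                      # Steps 5-7: analyze and score
            "question": question,
            "answer": answer,
            "score": score(question, answer),
        })
    overall = mean(d["score"] for d in details) if details else 0.0
    return {"overall": overall, "details": details}   # Step 9: report


# Canned answers replace the real terminal; the scoring rule is a placeholder.
canned = {"What are your professional strengths?": "I have strong analytical skills."}
report = run_interview(
    list(canned),
    ask=lambda q: canned[q],
    score=lambda q, a: 1.0 if "analytical" in a else 0.5,
)
```

Passing the terminal and the scorer in as callables keeps the loop itself independent of any particular speech recognition or emotion analysis component.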

[0161] (Example 2)

[0162] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0163] Traditional interview methods have limitations in measuring applicants' skills and aptitudes, and it was particularly difficult to adequately assess their emotional aspects. Furthermore, the lack of a means to suggest the next steps based on the interview's progress made it difficult to properly evaluate candidates and achieve an efficient interview process.

[0164] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0165] In this invention, the server includes means for conducting automated interviews for the purpose of data acquisition, means for converting voice data into text in real time, means for analyzing the user's emotional state from voice and image data using an emotion engine, and means for integrating the analysis results into a comprehensive evaluation system. This enables a comprehensive evaluation that includes the applicant's emotions and also allows for suggestions for the next interview stage based on their aptitude.

[0166] "Automated interviews" refer to the process of conducting interviews with applicants using mechanical means and collecting data from them.

[0167] "Means of real-time text conversion" refers to technologies or devices that instantly convert audio data into written text.

[0168] An "emotion engine" is an algorithm or system that analyzes and identifies a user's emotional state based on audio and image data.

[0169] "Means for integrating analysis results into a comprehensive evaluation system" refers to a method or apparatus for unifying obtained emotional data and other evaluation data and incorporating them into a comprehensive evaluation.

[0170] "Means for comprehensively evaluating suitability" refers to a process or device that uses analytical data to comprehensively assess an applicant's abilities and suitability.

[0171] This invention relates to a system having an automated interview format. This system consists of a server, a terminal, an emotion engine, and a user. The roles of each component and specific embodiments are described below.

[0172] The server is the core of the system, responsible for preparing interview questions and analyzing data. Specifically, the server selects appropriate questions from the interview question database and sends them to the terminal. It also receives audio and image data sent from the terminal and analyzes the data using an emotion engine. The analysis results are integrated into the comprehensive evaluation system and used to evaluate the user's skills and aptitude from multiple perspectives.

[0173] The terminal functions as an interface connecting the user and the system. When the terminal receives a question from the server, it presents it to the user. It can record the user's responses and capture their facial expressions with its built-in camera. The collected data is transmitted to the server in real time.

[0174] The user answers questions presented by the device. This process allows the user to behave naturally, and the emotion engine employed captures the user's intonation and facial expressions to identify their emotional state.

[0175] The emotion engine is software that analyzes audio and image data. This engine utilizes facial recognition and acoustic analysis technologies to quantify the user's emotions, such as tension and confidence. This enables accurate emotion analysis, which is then reflected in the user's overall evaluation.

[0176] As a concrete example, consider a scenario where a server sends the question "What are your professional strengths?" to a terminal, and the user responds, "I have strong analytical skills." In this case, the emotion engine analyzes the user's confidence and integrates the results into the overall evaluation system. As a result, a more accurate evaluation is made that includes the user's emotions.

[0177] An example of a prompt message would be, "Please analyze what emotions the user is expressing," which would allow the generative AI model to assist in the analysis.

[0178] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0179] Step 1:

[0180] The server prepares the interview questions. The server selects the most appropriate questions from the interview question database based on the applied position. The input to this process is the applicant's profile information, and the output is the question data sent to the terminal.
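
Question selection in Step 1 can be pictured as a lookup keyed by the applied position. The sketch below assumes an in-memory dictionary in place of the interview question database; the positions, most of the questions, and the fallback list are hypothetical examples.

```python
from typing import Dict, List

# Hypothetical stand-in for the interview question database.
QUESTION_DB: Dict[str, List[str]] = {
    "software engineer": [
        "What are your professional strengths?",
        "Describe a difficult bug you resolved.",
    ],
    "sales": [
        "What are your professional strengths?",
        "How do you approach a new client?",
    ],
}
DEFAULT_QUESTIONS = ["Please introduce yourself."]


def select_questions(profile: dict, limit: int = 3) -> List[str]:
    """Return up to `limit` questions for the position in the applicant profile."""
    position = profile.get("position", "").strip().lower()
    return QUESTION_DB.get(position, DEFAULT_QUESTIONS)[:limit]


selected = select_questions({"position": "Software Engineer"}, limit=1)
```

Normalizing the position string before the lookup avoids misses caused by capitalization or stray whitespace in the profile data.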

[0181] Step 2:

[0182] The terminal presents the user with pre-prepared questions. The terminal displays the received question data in audio or text format to help the user understand the questions. The input is the question data from the server, and the output is the presentation of the questions to the user.

[0183] Step 3:

[0184] The user answers the presented questions. The user provides voice answers to the questions through the device. The user's facial expressions are also observed. In this step, the input is the questions from the device, and the output is the user's voice responses and facial expression information.

[0185] Step 4:

[0186] The device records the user's responses and facial expressions. The device records the user's voice using its built-in microphone and captures their facial expressions with its camera. This audio and visual data is transmitted to the server in real time. The input is the user's voice and facial expressions, and the output is the data transfer to the server.

[0187] Step 5:

[0188] The emotion engine operates on a server and analyzes received audio and facial expression data. It uses facial expression recognition algorithms and acoustic analysis techniques to quantify the user's emotional state. Input is audio and image data from the device, and output is data representing the identified emotional state.

[0189] Step 6:

[0190] The server integrates the analysis results into the comprehensive evaluation system. The server takes in the emotional data obtained from the emotion engine along with other evaluation data to form the final evaluation. The input is the output data from the emotion engine, and the output is a comprehensive applicant evaluation.

[0191] (Application Example 2)

[0192] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as a "server" and the smart device 14 as a "terminal".

[0193] In modern autonomous vehicles, it is difficult to grasp passengers' psychological comfort in real time and provide individualized feedback and adjustments. Furthermore, the inability to provide optimal travel environments and routes that respond to passengers' emotions limits the quality of the travel experience.

[0194] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0195] In this invention, the server includes means for performing automated screening for the purpose of data collection, means for converting acoustic data into text information in real time, means for analyzing audio and image data to infer emotional states, means for adjusting the environment to optimize vehicle control based on the passenger's emotional state, and means for proposing the optimal travel route to the passenger based on the analysis results. This makes it possible to provide a travel environment and route that matches the passenger's psychological state in real time.

[0196] "Data collection" refers to the act of collecting the information and data necessary to carry out automated screening.

[0197] "Acoustic data" refers to digital or analog data that contains audio signals.

[0198] "Textual information" refers to information expressed in text format.

[0199] "Emotional state" is a concept that refers to a person's psychological state or emotional tendencies.

[0200] "Analytical means" refers to methods or devices used to derive specific conclusions or information based on data.

[0201] "Mobile object control" refers to the act of managing the operation and movement of vehicles and other mobile objects.

[0202] "Environmental adjustment means" refers to technologies or devices for adjusting the physical or digital environment to match the psychological state of passengers.

[0203] A "travel route" refers to the optimal path or route from the starting point to the destination.

[0204] "Passenger" refers to a person who travels using an autonomous vehicle or other means of transportation.

[0205] The system that realizes this application consists of multiple components. The server collects and analyzes passenger data by converting acoustic data into text information in real time and performing analysis to identify emotional states. In the emotion analysis, a specific generative AI model is used to infer the passenger's psychological state from audio and image data, and based on the results, optimal vehicle control and environmental adjustments are made.

[0206] The terminal works in conjunction with the server to display the optimal travel route for passengers and adjusts environmental settings as needed. Specifically, it provides feedback and suggestions to passengers through the terminal's display and speaker. This enables appropriate interaction that responds to the passenger's emotions.

[0207] Passengers, as users, can customize their environment and select routes according to their preferences through the terminal's interface. This enables a personalized travel experience.

[0208] As a concrete example, when the server detects a passenger's level of stress, relaxation music could be played on the terminal. This series of actions would allow passengers to enjoy a more comfortable and reassuring travel experience.

[0209] Example of a prompt:

[0210] "The emotion recognition system captures passenger voice and facial expression data to determine whether the passenger is relaxed or stressed. Based on that determination, please suggest appropriate adjustments to the in-car environment."
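
A prompt of this kind could be assembled from the analysis results before being passed to the generative AI model. The following is a minimal sketch; the template wording beyond the example above, the 0-to-1 stress scale, and the `build_prompt` helper are assumptions made for illustration.

```python
PROMPT_TEMPLATE = (
    "The emotion recognition system captured passenger voice and facial "
    "expression data. Estimated stress level: {stress:.2f} "
    "(0 = relaxed, 1 = stressed). "
    'The passenger said: "{transcript}". '
    "Based on that determination, please suggest appropriate adjustments "
    "to the in-car environment."
)


def build_prompt(stress: float, transcript: str) -> str:
    """Fill the template with the current analysis results."""
    if not 0.0 <= stress <= 1.0:
        raise ValueError("stress must be within [0, 1]")
    return PROMPT_TEMPLATE.format(stress=stress, transcript=transcript.strip())


prompt = build_prompt(0.82, " It's a bit loud in here. ")
```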

[0211] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0212] Step 1:

[0213] The server receives passenger audio and image data from the terminal. This input data is collected via microphone and camera and sent to the server in real time. Once the data reaches the server, it is processed using a generative AI model to convert the audio into text information. This calculation outputs useful textual information from the acoustic data.

[0214] Step 2:

[0215] The server utilizes a generative AI model to analyze passengers' emotional states from audio and image data. Based on the input audio and image data, the server uses an analysis algorithm to quantify passengers' emotions and measure their levels of tension and relaxation. This analysis result is generated as the server's output.

[0216] Step 3:

[0217] The terminal receives emotion analysis results from the server and adjusts the environment based on the passenger's psychological state. Based on the analysis results received as input, the terminal's control program operates, for example, activating a speaker to play relaxation music. It also displays emotionally appropriate suggestions on the display.

[0218] Step 4:

[0219] Passengers, as users, can review feedback from the terminal and make choices to adjust their travel experience. By listening to the relaxation music and reviewing the route suggestions output from the terminal, passengers can feel more at ease. Further adjustments can be made based on new data as users re-enter their preferences.

[0220] Step 5:

[0221] The server receives new user input and repeats the preceding steps, continuously adjusting the environment and route accordingly. This cycle keeps the travel experience optimized as the passenger's state changes.
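
The adjustment cycle in Steps 3 to 5 can be sketched as a rule that maps each new stress reading to a set of actions. The thresholds, the action names, and the 0-to-1 stress scale below are hypothetical; the specification only gives relaxation music and route suggestions as examples.

```python
from typing import Iterable, List, Tuple


def choose_adjustments(stress: float) -> List[str]:
    """Map an estimated stress level (assumed range 0-1) to terminal actions."""
    if stress >= 0.7:
        return ["play_relaxation_music", "suggest_quieter_route"]
    if stress >= 0.4:
        return ["play_relaxation_music"]
    return []


def feedback_loop(readings: Iterable[float]) -> List[Tuple[float, List[str]]]:
    """Re-evaluate the adjustments each time a new reading arrives."""
    return [(stress, choose_adjustments(stress)) for stress in readings]


# Three successive readings as the passenger gradually relaxes.
log = feedback_loop([0.8, 0.5, 0.2])
```

In this sketch the actions taper off as the readings fall, which mirrors the repeated re-evaluation described in Step 5.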

[0222] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0223] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 is input with prompts containing instructions, and with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 infers from the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0224] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.

[0225] [Second Embodiment]

[0226] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.

[0227] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.

[0228] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0229] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.

[0230] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0231] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0232] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0233] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0234] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0235] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0236] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0237] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0238] This invention is implemented in the form of an automated interview system and consists of three main components: a server, a terminal, and a user. The functions performed by each component of the system and their specific roles are described below.

[0239] 1. Server Functions and Roles

[0240] The server acts as the central hub of the system, managing interviews, processing and analyzing data, and providing results. Specifically, the server prepares an optimal set of questions for each applicant and sends them to the terminal in real time. Furthermore, the server quickly converts the received audio data into text and analyzes the text using natural language processing technology. The emotional state is inferred by analyzing the tone and tempo of the voice and, if the interview is in video format, the facial expressions in the image data. This allows the server to quantify the applicant's suitability and skills and calculate their fit with the company.

[0241] 2. Terminal Functions and Roles

[0242] The terminal functions as an interface with the user. Specifically, the terminal presents the user with questions sent from the server and records the user's responses as audio data. It also sends the recorded audio data to the server in real time, maintaining a data pipeline to ensure the smooth progress of the interview. The terminal provides an environment where the user can behave naturally and without stress.

[0243] 3. User Roles

[0244] As the main participant in the interview process, the user verbally answers questions presented via a device. The user's responses are recorded in real time and sent to a server as audio data for analysis. Users are expected to present their skills and experience without feeling nervous.

[0245] These components work together organically to enable fair and efficient automated interviews. For example, the server sends a question to the terminal saying, "Please introduce yourself," and the user answers. Through this series of interactions, the system can evaluate the applicant's abilities and characteristics with high accuracy.

[0246] The following describes the processing flow.

[0247] Step 1:

[0248] The server initiates the interview session and selects the first question from a pre-configured list. The selected question is then sent to the terminal.

[0249] Step 2:

[0250] The terminal presents the user with questions received from the server, either verbally or in text. The interface is appropriately adjusted to ensure the user can comfortably answer the questions.

[0251] Step 3:

[0252] The user answers questions presented by the device verbally. The answers are recorded in real time by the device.

[0253] Step 4:

[0254] The device sends the recorded audio data to the server. The transmission is performed without delay, ensuring high-quality data.

[0255] Step 5:

[0256] The server inputs the received audio data into a natural language processing (NLP) engine and converts it into text. The converted text becomes the basis for analyzing interview responses.

[0257] Step 6:

[0258] The server analyzes text data, extracts keywords, and evaluates the content of the responses. Furthermore, it infers the user's emotional state through analysis of tone and pace.

[0259] Step 7:

[0260] Based on the analysis results, the server matches the user's skill set with the skills required by the company and calculates a suitability score.

[0261] Step 8:

[0262] The server sends the suitability score and analysis results to the terminal, while simultaneously selecting the next question and continuing the evaluation as needed.

[0263] Step 9:

[0264] The terminal displays the feedback and analysis results received from the server to the user and, if necessary, presents the next question.

[0265] Step 10:

[0266] Once the process is complete, the server sends the final evaluation results to the company and also provides the user with information about the interview results and the next steps.
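
The keyword extraction in Step 6 and the skill matching in Step 7 can be illustrated with a simple coverage ratio. This is a sketch under stated assumptions: the specification does not define the matching formula, so the fraction of required skills mentioned in the answer is used here, and the helper names and sample skills are hypothetical. Only single-word skill names are handled.

```python
import re
from typing import Set


def extract_keywords(text: str) -> Set[str]:
    """Lower-case the answer and split it into simple word tokens."""
    return set(re.findall(r"[a-z][a-z+#]*", text.lower()))


def suitability(answer_text: str, required_skills: Set[str]) -> float:
    """Fraction of the required single-word skills mentioned in the answer."""
    if not required_skills:
        return 0.0
    required = {skill.lower() for skill in required_skills}
    found = extract_keywords(answer_text) & required
    return len(found) / len(required)


score = suitability(
    "I have five years of Python and SQL experience.",
    {"Python", "SQL", "Docker"},
)
```

Here two of the three required skills appear in the answer, so the score is two thirds; a real system would also need to handle synonyms and multi-word skills.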

[0267] (Example 1)

[0268] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0269] In traditional interview processes, questions were standardized, making it difficult to evaluate applicants based on their individual characteristics. Furthermore, real-time data processing was insufficient, posing a challenge in realizing an efficient applicant evaluation system. Additionally, the assessment of applicants' emotional states and the calculation of suitability were cumbersome, making it difficult to achieve appropriate matching between companies and applicants.

[0270] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0271] In this invention, the server includes means for dynamically creating personalized questions based on generated information, means for transmitting the question set to a terminal in real time, and means for recording the user's responses as audio information and transmitting the information to the server. This enables an interview process optimized for each applicant and allows for real-time data analysis to evaluate suitability.

[0272] "A means of dynamically creating personalized questions based on generated information" refers to a method for generating appropriate questions in real time based on information that differs for each applicant, thereby providing a customized interview experience.

[0273] "A means of sending question sets to terminals in real time" refers to a method of instantly distributing questions generated on a server to terminals, enabling smooth interview progress.

[0274] "Means for recording the user's response as audio information and transmitting said information to a server" refers to means for saving the content of the applicant's verbal response as audio data and promptly transmitting that data to a server.

[0275] "Means for converting audio information into text information using automatic conversion means" refers to means that use speech recognition technology to convert recorded audio data into text.

[0276] "A means of analyzing textual information converted using natural language processing technology to evaluate emotional states" refers to a method of extracting emotions and intentions from interview content obtained as text using natural language processing technology, and evaluating the psychological state of the applicant.

[0277] "Methods for evaluating the suitability of applicants to the organization based on analyzed information" refer to methods for quantifying applicant characteristics and skills and calculating the degree of suitability by comparing them with the organization's requirements.

[0278] This invention is implemented as an automated interview system and mainly consists of three components: a server, a terminal, and a user. This system is designed to efficiently evaluate the suitability of each applicant, and its details are described below.

[0279] Server Functions

[0280] The server functions as the central device of the system. Based on the data from the applicant, it creates individualized questions using a generative AI model. For example, it dynamically generates questions based on the applicant's history using natural language generation technology. To do this, it uses deep learning libraries such as TensorFlow or PyTorch to run question generation.

[0281] The server also receives voice information sent from the user via the terminal and converts it into text information. In this conversion process, it uses speech recognition software such as the Google Cloud Speech-to-Text API. The converted text information is analyzed using natural language processing libraries such as nltk or spaCy. Through this analysis, the applicant's emotions and skills are extracted and quantified, and the degree of fit is calculated.

[0282] Functions of the terminal

[0283] The terminal serves as an interface connecting the user and the server. It visually displays the questions sent from the server so that the user can view them. Furthermore, it is also possible to present them audibly using Text-to-Speech technology. On the terminal, during the question-and-answer process, the user's responses are recorded using a high-quality microphone and sent to the server in real time as voice data. This communication is carried out in a stable Internet environment.

[0284] Role of the user

[0285] The user is the subject of the interview and verbally answers the questions presented through the terminal. As a result, the user's skills and experiences are collected as voice data. The user is required to approach the interview as naturally as possible, and creating a stress-free environment is important.

[0286] Examples of specific cases and prompt sentences

[0287] As a concrete example, if a user is asked "Please introduce yourself," they might respond with "I have been working in software development for five years." This information is then analyzed sequentially to provide appropriate feedback.

[0288] An example of a prompt message might be, "Create questions to elicit self-introductions from applicants during interviews. These questions should be designed to facilitate the assessment of the applicant's skills and experience." This allows the system to present applicants with personalized questions, enabling more efficient interviews.
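
To personalize such a prompt, the server could fill it in from the applicant's data. The sketch below is hypothetical: the profile fields (`position`, `history`), the default of three questions, and the `build_question_prompt` helper are assumptions, and the resulting string would still need to be passed to whichever generative AI model is used.

```python
def build_question_prompt(profile: dict, num_questions: int = 3) -> str:
    """Build a question-generation prompt from the applicant's profile data."""
    position = profile.get("position", "unspecified")
    history = "; ".join(profile.get("history", [])) or "not provided"
    return (
        f"Create {num_questions} interview questions for an applicant "
        f"to the position of {position}. "
        f"Work history: {history}. "
        "These questions should be designed to facilitate the assessment "
        "of the applicant's skills and experience."
    )


prompt = build_question_prompt(
    {"position": "software engineer", "history": ["5 years in software development"]}
)
```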

[0289] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0290] Step 1:

[0291] The server receives the applicant's data and inputs prompts into a generative AI model to generate personalized questions. Specifically, it generates prompts based on the applicant's history and work experience, and inputs them into the generative AI model to create the most appropriate questions. As output, a personalized set of questions is generated and used in the next step.
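The prompt construction in Step 1 can be sketched as follows. This is a minimal illustration, not the implementation described in the specification: the `ApplicantProfile` class, its fields, and the `build_question_prompt` function are hypothetical names introduced here, and the call to the generative AI model itself is left out.

```python
from dataclasses import dataclass, field


@dataclass
class ApplicantProfile:
    """Hypothetical container for the applicant data the server receives."""
    name: str
    position: str
    years_experience: int
    history: list[str] = field(default_factory=list)


def build_question_prompt(profile: ApplicantProfile, num_questions: int = 3) -> str:
    """Assemble a text prompt asking a generative model for tailored questions."""
    history = "; ".join(profile.history) if profile.history else "not provided"
    return (
        f"Create {num_questions} interview questions for an applicant "
        f"applying for the position of {profile.position}. "
        f"Work experience: {profile.years_experience} years. "
        f"History: {history}. "
        "The questions should make it easy to assess the applicant's "
        "skills and experience."
    )


profile = ApplicantProfile(
    name="A. Applicant",
    position="software developer",
    years_experience=5,
    history=["backend development", "team lead"],
)
prompt = build_question_prompt(profile)
print(prompt)
```

The resulting string would then be passed to whichever generative AI model the server uses.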

[0292] Step 2:

[0293] The server sends the generated set of questions to the terminal in real time. The terminal either displays them visually or presents them aloud using speech synthesis technology. For example, it might display the question "Please introduce yourself" on the screen and simultaneously read it aloud. In this step, the generated set of questions is used as input, and the output is a question in a format that the user can review.

[0294] Step 3:

[0295] The user verbally answers questions presented through the terminal. The terminal records these voice responses using a high-quality microphone and generates audio data. The input is the user's verbal responses, and the output is the generated audio data. This audio data is immediately sent to the server.

[0296] Step 4:

[0297] The server converts the received audio data into text using speech recognition software. Specifically, it passes the audio file to a service like Google Cloud Speech-to-Text and obtains text data as the conversion result. The input is audio data, and the output is parseable text data.

[0298] Step 5:

[0299] The server applies natural language processing techniques to the converted text data to evaluate the applicant's emotional state and skills. This analysis utilizes libraries such as nltk and spaCy to analyze extracted keywords and context. The input is text data, and the output is a numerical evaluation result.
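As a rough illustration of Step 5, the sketch below turns an answer into a numerical result using only the Python standard library. It is a simplified stand-in for the nltk or spaCy analysis named in the text: the keyword tables, the `evaluate_answer` function, and the sentiment formula are all assumptions made for this example.

```python
import re

# Hypothetical vocabularies; a real system would derive these from job requirements
# and would use a proper NLP library rather than word lists.
SKILL_KEYWORDS = {
    "software development": "software_development",
    "analytical": "analysis",
    "team": "teamwork",
}
POSITIVE_WORDS = {"strong", "confident", "enjoy", "successful"}
NEGATIVE_WORDS = {"worried", "difficult", "unsure", "nervous"}


def evaluate_answer(text: str) -> dict:
    """Return detected skill labels and a sentiment value in [-1, 1]."""
    lowered = text.lower()
    tokens = re.findall(r"[a-z']+", lowered)
    # A skill is detected when its key phrase occurs anywhere in the answer.
    skills = sorted(
        {label for phrase, label in SKILL_KEYWORDS.items() if phrase in lowered}
    )
    pos = sum(t in POSITIVE_WORDS for t in tokens)
    neg = sum(t in NEGATIVE_WORDS for t in tokens)
    total = pos + neg
    sentiment = 0.0 if total == 0 else (pos - neg) / total
    return {"skills": skills, "sentiment": sentiment}


result = evaluate_answer("I have been working in software development for five years.")
print(result)
```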

[0300] Step 6:

[0301] The server calculates the applicant's fit with the organization based on the analyzed information. Using a machine learning model, it calculates a fit score by comparing skills with required qualifications. The input is the evaluation result, and the output is the fit score. This step involves data manipulation to compare the applicant's skills with the company's hiring requirements.
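The comparison in Step 6 could look like the following sketch. Note that this is a plain weighted-overlap heuristic, not the machine learning model the text mentions; the function name `fit_score`, the skill names, and the weights are assumptions chosen for illustration.

```python
def fit_score(applicant_skills: dict[str, float], requirements: dict[str, float]) -> float:
    """Return a 0-100 score: the weighted share of requirements the applicant meets.

    applicant_skills maps a skill name to a level in [0, 1].
    requirements maps a skill name to an importance weight (> 0).
    """
    total_weight = sum(requirements.values())
    if total_weight <= 0:
        raise ValueError("requirements must have a positive total weight")
    matched = sum(
        # Missing skills count as 0; levels are clamped to [0, 1].
        weight * min(max(applicant_skills.get(skill, 0.0), 0.0), 1.0)
        for skill, weight in requirements.items()
    )
    return round(100.0 * matched / total_weight, 1)


# Hypothetical hiring requirements and applicant evaluation.
requirements = {"software_development": 3.0, "analysis": 2.0, "teamwork": 1.0}
applicant = {"software_development": 0.9, "analysis": 0.5}
score = fit_score(applicant, requirements)
print(score)
```

A trained model could replace the weighted sum while keeping the same inputs and the same 0-100 output range.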

[0302] Step 7:

[0303] The server generates the final evaluation results and provides feedback to the user via the terminal. Specifically, it generates an evaluation report and makes it available for viewing on the terminal. At this stage, the goodness-of-fit score is used as input, and a detailed feedback report is provided as output.
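One possible shape for the Step 7 report is sketched below. The function `build_feedback_report`, the score thresholds, and the wording of the suggested next steps are assumptions for illustration; the specification does not define them.

```python
def build_feedback_report(fit: float, strengths: list[str], gaps: list[str]) -> str:
    """Render a plain-text feedback report from a 0-100 fit score."""
    if not 0.0 <= fit <= 100.0:
        raise ValueError("fit must be between 0 and 100")
    # Thresholds are illustrative assumptions, not values from the text.
    if fit >= 75:
        next_step = "Proceed to the next interview stage."
    elif fit >= 50:
        next_step = "Schedule a follow-up interview on the listed gaps."
    else:
        next_step = "No further steps are suggested at this time."
    lines = [
        f"Fit score: {fit:.1f} / 100",
        "Strengths: " + (", ".join(strengths) or "none identified"),
        "Gaps: " + (", ".join(gaps) or "none identified"),
        "Suggested next step: " + next_step,
    ]
    return "\n".join(lines)


report = build_feedback_report(61.7, ["software development"], ["teamwork"])
print(report)
```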

[0304] (Application Example 1)

[0305] Next, Application Example 1 will be described. In the following description, the data processing device 12 is referred to as a "server", and the smart glasses 214 are referred to as a "terminal".

[0306] In a conventional interview system, a great deal of time and effort are required to accurately evaluate the suitability of applicants, and moreover, the subjective judgment of the interviewer may affect the results. In addition, it is difficult to accurately grasp the emotions and psychological states of applicants, and as a result, there is a problem that the degree of fit with the organization cannot be appropriately judged.

[0307] The specific processing by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following respective means.

[0308] In this invention, the server includes means for performing facial expression analysis using a video device for the purpose of data visualization, means for performing voice analysis to evaluate the psychological state of the applicant, and means for converting voice data into text in real time. Thereby, it becomes possible to more accurately evaluate the emotions and suitability of the applicant and objectively and efficiently calculate the degree of fit with the organization.

[0309] "Data acquisition" is a process in which the system collects information related to the interview in real time.

[0310] "Voice data" is obtained by recording the speech of the applicant and storing the waveform of the voice in digital format.

[0311] "Converting to text in real time" is a process of converting the voice into character data at the moment when the voice data is received.

[0312] "Inference of emotional state" is to analyze voice and image data and evaluate the emotions and psychological state of the applicant.

[0313] "Analysis from audio and image data" refers to a method of measuring and analyzing data such as the applicant's voice tone and facial expressions.

[0314] "Fit" refers to the degree of compatibility or matching between an applicant and an organization, expressed using numerical values or indicators.

[0315] A "server" is a central device that manages and analyzes interview data.

[0316] "Data visualization" is a technique that displays collected data in the form of graphs, charts, and other diagrams to make the content easier to understand intuitively.

[0317] "Facial expression analysis using video equipment" is a method that uses devices such as cameras to analyze the facial expressions of applicants and evaluate their emotional state.

[0318] "Voice analysis" is the process of analyzing voice data in detail to extract characteristics of speech content and emotions.

[0319] This invention is a system that evaluates applicants' aptitude and psychological state through automated interviews. The server receives applicant voice data in real time and converts it to text using advanced speech recognition software. This process utilizes speech recognition technologies such as the Google Speech-to-Text API. Furthermore, the server analyzes the received text data through natural language processing to evaluate emotions and psychological state. For this purpose, an emotion analysis model using the Transformers library is employed.

[0320] The terminal functions as an interface with the user (applicant), and is equipped with a camera and microphone to capture the applicant's video and audio. During the interview, the terminal performs facial expression analysis based on the video data and sends the results to the server. This facial expression analysis uses image processing technology based on OpenCV. This allows the server to comprehensively analyze the data obtained from the audio and images and calculate the applicant's suitability.

[0321] For example, if a security company uses this system to interview new security guards, the interviewer can monitor the applicant's psychological state and confidence in their answers in real time through smart glasses, and ask appropriate follow-up questions. This makes it possible to objectively and efficiently evaluate the applicant's characteristics.

[0322] An example of a prompt for a generative AI model is, "Please tell me what kind of facial expression the applicant has, and whether it indicates stress or confidence." This prompt enables deeper emotional analysis.

[0323] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0324] Step 1:

[0325] The terminal activates its camera and microphone as soon as the user begins preparing for the interview. The user's voice and video data are acquired as input and transmitted to the server in real time. During this process, data compression is performed to optimize transmission to the server.
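The compression mentioned in Step 1 is not specified further. As a minimal sketch, the example below applies lossless `zlib` compression from the Python standard library to one chunk of raw audio; this choice is an assumption, and a production system would more likely use a dedicated audio or video codec. The helper names `pack_chunk` and `unpack_chunk` are hypothetical.

```python
import zlib


def pack_chunk(raw: bytes, level: int = 6) -> bytes:
    """Compress one chunk of captured data before sending it to the server."""
    return zlib.compress(raw, level)


def unpack_chunk(packed: bytes) -> bytes:
    """Restore the original bytes on the server side."""
    return zlib.decompress(packed)


# 100 ms of silent 16 kHz, 16-bit mono PCM audio (16000 * 2 * 0.1 = 3200 bytes).
chunk = bytes(3200)
packed = pack_chunk(chunk)
print(len(chunk), len(packed))
```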

[0326] Step 2:

[0327] The server converts received audio data into text in real time using speech recognition software. It receives compressed audio data as input and generates text data of applicants' responses as output. In this process, it uses the Google Speech-to-Text API to convert the audio signal into text format.

[0328] Step 3:

[0329] The server processes the converted text data through a natural language processing model (e.g., Transformers) to perform sentiment analysis. The input is text data, and the output is the sentiment evaluation result. It evaluates the applicant's emotional state based on the text content and provides the evaluation result to the interviewer.

[0330] Step 4:

[0331] The terminal analyzes the captured video data in real time using OpenCV to recognize the user's facial expressions. It extracts facial features from the input video data and outputs numerical data representing the user's emotional state based on those expressions. In this process, it performs image processing to capture subtle changes in facial expressions.

[0332] Step 5:

[0333] The server integrates the results of sentiment analysis based on audio and video data to calculate an overall suitability score. The input is the results of each data analysis, and the output is an indicator of the applicant's suitability score. This process allows for a comprehensive assessment of the applicant's emotions and psychological state, and evaluates their compatibility with the organization.
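The integration in Step 5 can be illustrated with a weighted average over the per-modality results. The function `combine_scores`, the modality names, and the equal weights are assumptions for this sketch; the text does not state how the results are combined.

```python
def combine_scores(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-modality scores, each expected in [0, 1].

    Modalities missing from `scores` are skipped and the remaining
    weights are renormalized.
    """
    used = {name: w for name, w in weights.items() if name in scores and w > 0}
    if not used:
        raise ValueError("no scored modality has a positive weight")
    total = sum(used.values())
    return sum(scores[name] * w for name, w in used.items()) / total


# Hypothetical analysis results and weights.
weights = {"text_sentiment": 0.5, "facial_expression": 0.5}
suitability = combine_scores(
    {"text_sentiment": 0.8, "facial_expression": 0.6}, weights
)
print(round(suitability, 2))
```

Renormalizing the weights lets the same function handle an audio-only interview, where no facial expression score is available.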

[0334] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0335] The present invention is implemented as an automated interview system incorporating an emotion engine. In addition to the configuration of a server, terminal, and user, the system has the function of recognizing the user's emotions using the emotion engine and integrating them into the analysis data.

[0336] 1. Server Functions and Roles

[0337] The server manages the entire interview process, particularly the analysis and integration of data obtained from the emotion engine. The server processes both audio and image data in real time, utilizing the emotion engine to infer the user's emotional state. Furthermore, it integrates this emotional data into existing analysis algorithms to provide a more comprehensive evaluation of the applicant's skills and suitability.

[0338] 2. Terminal Functions and Roles

[0339] The terminal acts as an interface, connecting the user and the system. It presents the user with questions sent from the server, records the user's responses, and sends them back to the server. It also captures the user's facial expressions through its built-in camera, providing data to the emotion engine in real time.

[0340] 3. Emotion Engine Functions and Roles

[0341] The emotion engine analyzes nonverbal information obtained from the user's voice and facial expressions to identify their emotional state. This engine utilizes facial recognition technology and acoustic analysis to quantify the user's tension, anxiety, excitement, and other emotional states. The identified emotional state is then adjusted so that it does not unduly affect the user's performance evaluation, and is sent to the server.

[0342] 4. User Roles

[0343] Users respond to questions presented via their device and participate in the interview in a natural manner. The introduction of an emotion engine includes the user's emotional aspects in the interview evaluation; however, this is intended to complement the accurate assessment of abilities, ensuring that the user's true capabilities are fairly evaluated.

[0344] As a concrete example of the system, the server sends the question "What are your professional strengths?" to the terminal, and the user answers, "I have strong analytical skills." Subsequently, the emotion engine detects that the user is stating this answer with confidence and incorporates this information into the overall evaluation. This creates an interview environment where the user can relax and demonstrate their true worth.

[0345] The following describes the processing flow.

[0346] Step 1:

[0347] The server prepares to begin the interview session, selects the first question from a pre-configured list, and sends it to the terminal.

[0348] Step 2:

[0349] The terminal displays and audibly presents the questions received from the server to the user. It then activates voice input mode to allow the user to answer.

[0350] Step 3:

[0351] The user verbally answers the questions presented by the terminal. The terminal records the user's voice in high quality and sends the audio data to the server.

[0352] Step 4:

[0353] The server processes the received audio data using a natural language processing engine to convert it into text. During this process, it prepares to analyze the keywords and content of the response.

[0354] Step 5:

[0355] The terminal captures the user's facial expressions with its built-in camera and sends that video data to the emotion engine. This allows for real-time analysis of emotional information.

[0356] Step 6:

[0357] The emotion engine analyzes the user's voice tone and facial expressions to identify their emotional state. This information is used to estimate the user's level of tension and confidence.

[0358] Step 7:

[0359] The server integrates text and sentiment data to calculate a comprehensive score that adjusts for the user's skill fit and the influence of emotions. This allows for an assessment of the applicant's match with the company's needs.

[0360] Step 8:

[0361] The server returns the analysis results to the terminal, which then displays the results to the user as feedback. If necessary, it displays the next interview question and continues the interview process.

[0362] Step 9:

[0363] Once the user's interview is complete, the server compiles all the analysis results and sends an overall evaluation of the applicant to the company. The company then uses this information to make hiring decisions.
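The nine steps above can be summarized as a loop over questions. The sketch below is an assumption about how the pieces might fit together: `run_interview` and its callable parameters (`ask`, `transcribe`, `estimate_emotion`, `score`) are hypothetical placeholders for the terminal, the speech recognition component, the emotion engine, and the scoring logic, none of which are implemented here.

```python
from typing import Callable


def run_interview(
    questions: list[str],
    ask: Callable[[str], tuple[bytes, bytes]],
    transcribe: Callable[[bytes], str],
    estimate_emotion: Callable[[bytes, bytes], dict],
    score: Callable[[str, dict], float],
) -> dict:
    """Run one session and return per-question results plus an overall score."""
    results = []
    for question in questions:
        audio, video = ask(question)              # Steps 1-3 and 5: present, record, capture
        text = transcribe(audio)                  # Step 4: audio to text
        emotion = estimate_emotion(audio, video)  # Step 6: emotion engine
        results.append({
            "question": question,
            "answer": text,
            "emotion": emotion,
            "score": score(text, emotion),        # Step 7: integrated score
        })
    # Step 9: compile an overall evaluation (a simple mean in this sketch).
    overall = sum(r["score"] for r in results) / len(results) if results else 0.0
    return {"answers": results, "overall": overall}


# Stub components standing in for the real terminal and analysis services.
session = run_interview(
    ["What are your professional strengths?"],
    ask=lambda q: (b"audio-bytes", b"video-bytes"),
    transcribe=lambda audio: "I have strong analytical skills.",
    estimate_emotion=lambda audio, video: {"confidence": 0.8, "tension": 0.2},
    score=lambda text, emotion: 0.5 + 0.5 * emotion["confidence"],
)
print(session["overall"])
```

Passing the components in as callables keeps the loop independent of any particular speech recognition or emotion analysis service.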

[0364] (Example 2)

[0365] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0366] Traditional interview methods have limitations in measuring applicants' skills and aptitudes, and it was particularly difficult to adequately assess their emotional aspects. Furthermore, the lack of a means to suggest the next steps based on the interview's progress made it difficult to properly evaluate candidates and achieve an efficient interview process.

[0367] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0368] In this invention, the server includes means for conducting automated interviews for the purpose of data acquisition, means for converting voice data into text in real time, means for analyzing the user's emotional state from voice and image data using an emotion engine, and means for integrating the analysis results into a comprehensive evaluation system. This enables a comprehensive evaluation that includes the applicant's emotions and also allows for suggestions for the next interview stage based on their aptitude.

[0369] "Automated interviews" refer to the process of conducting interviews with applicants using mechanical means and collecting data from them.

[0370] "Means of real-time text conversion" refers to technologies or devices that instantly convert audio data into written text.

[0371] An "emotion engine" is an algorithm or system that analyzes and identifies a user's emotional state based on audio and image data.

[0372] "Means for integrating analysis results into a comprehensive evaluation system" refers to a method or apparatus for unifying obtained emotional data and other evaluation data and incorporating them into a comprehensive evaluation.

[0373] "Means for comprehensively evaluating suitability" refers to a process or device that uses analytical data to comprehensively assess an applicant's abilities and suitability.

[0374] This invention relates to a system having an automated interview format. This system consists of a server, a terminal, an emotion engine, and a user. The roles of each component and specific embodiments are described below.

[0375] The server is the core of the system, responsible for preparing interview questions and analyzing data. Specifically, the server selects appropriate questions from the interview question database and sends them to the terminal. It also receives audio and image data sent from the terminal and analyzes the data using an emotion engine. The analysis results are integrated into the comprehensive evaluation system and used to evaluate the user's skills and aptitude from multiple perspectives.

[0376] The terminal functions as an interface connecting the user and the system. When the terminal receives a question from the server, it presents it to the user. It can record the user's responses and capture their facial expressions with its built-in camera. The collected data is transmitted to the server in real time.

[0377] The user answers questions presented by the terminal. This process allows the user to behave naturally, while the emotion engine captures the user's intonation and facial expressions to identify their emotional state.

[0378] The emotion engine is software that analyzes audio and image data. This engine utilizes facial recognition and acoustic analysis technologies to quantify the user's emotions, such as tension and confidence. This enables accurate emotion analysis, which is then reflected in the user's overall evaluation.

[0379] As a concrete example, consider a scenario where a server sends the question "What are your professional strengths?" to a terminal, and the user responds, "I have strong analytical skills." In this case, the emotion engine analyzes the user's confidence and integrates the results into the overall evaluation system. As a result, a more accurate evaluation is made that includes the user's emotions.

[0380] An example of a prompt message would be, "Please analyze what emotions the user is expressing," which would allow the generative AI model to assist in the analysis.

[0381] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0382] Step 1:

[0383] The server prepares the interview questions. The server selects the most appropriate questions from the interview question database based on the applied position. The input to this process is the applicant's profile information, and the output is the question data sent to the terminal.
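Question selection in Step 1 might be sketched as a simple lookup by position. The in-memory `QUESTION_DB`, its record layout, the `"*"` wildcard for general questions, and the position-specific question texts are all assumptions for illustration; the specification does not describe the database schema.

```python
# Hypothetical question database; "*" marks questions that apply to every position.
QUESTION_DB = [
    {"text": "Please introduce yourself.", "positions": {"*"}},
    {"text": "What are your professional strengths?", "positions": {"*"}},
    {"text": "Describe a system you designed.", "positions": {"software developer"}},
    {"text": "How do you handle an incident on site?", "positions": {"security guard"}},
]


def select_questions(position: str, db: list[dict] = QUESTION_DB, limit: int = 3) -> list[str]:
    """Return up to `limit` questions that apply to the given position."""
    matches = [
        q["text"]
        for q in db
        if "*" in q["positions"] or position in q["positions"]
    ]
    return matches[:limit]


selected = select_questions("security guard")
print(selected)
```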

[0384] Step 2:

[0385] The terminal presents the user with pre-prepared questions. The terminal displays the received question data in audio or text format to help the user understand the questions. The input is the question data from the server, and the output is the presentation of the questions to the user.

[0386] Step 3:

[0387] The user answers the presented questions. The user provides voice answers to the questions through the terminal. The user's facial expressions are also observed. In this step, the input is the questions from the terminal, and the output is the user's voice responses and facial expression information.

[0388] Step 4:

[0389] The terminal records the user's responses and facial expressions. It records the user's voice using its built-in microphone and captures their facial expressions with its camera. This audio and visual data is transmitted to the server in real time. The input is the user's voice and facial expressions, and the output is the data transferred to the server.

[0390] Step 5:

[0391] The emotion engine operates on the server and analyzes the received audio and facial expression data. It uses facial expression recognition algorithms and acoustic analysis techniques to quantify the user's emotional state. The input is audio and image data from the terminal, and the output is data representing the identified emotional state.

[0392] Step 6:

[0393] The server integrates the analysis results into the comprehensive evaluation system. The server takes in the emotional data obtained from the emotion engine along with other evaluation data to form the final evaluation. The input is the output data from the emotion engine, and the output is a comprehensive applicant evaluation.

[0394] (Application Example 2)

[0395] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0396] In modern autonomous vehicles, it is difficult to grasp passengers' psychological comfort in real time and provide individualized feedback and adjustments. Furthermore, the inability to provide optimal travel environments and routes that respond to passengers' emotions limits the quality of the travel experience.

[0397] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0398] In this invention, the server includes means for performing automated screening for the purpose of data collection, means for converting acoustic data into text information in real time, means for analyzing audio and image data to infer emotional states, means for adjusting the environment to optimize vehicle control based on the passenger's emotional state, and means for proposing the optimal travel route to the passenger based on the analysis results. This makes it possible to provide a travel environment and route that matches the passenger's psychological state in real time.

[0399] "Data collection" refers to the act of collecting information and data necessary to implement automated review processes.

[0400] "Audio data" refers to digital or analog data information that includes audio signals.

[0401] "Textual information" refers to information expressed in text format.

[0402] "Emotional state" is a concept that refers to a person's psychological state or emotional tendencies.

[0403] "Analytical means" refers to methods or devices used to derive specific conclusions or information based on data.

[0404] "Mobile object control" refers to the act of managing the operation and movement of vehicles and other mobile objects.

[0405] "Environmental adjustment means" refers to technologies or devices for adjusting the physical or digital environment to match the psychological state of passengers.

[0406] A "travel route" refers to the optimal path or route from the starting point to the destination.

[0407] "Passenger" refers to a person who travels using an autonomous vehicle or other means of transportation.

[0408] The system that realizes this application consists of multiple components. The server collects and analyzes passenger data by converting acoustic data into text information in real time and performing analysis to identify emotional states. In the emotion analysis, a specific generative AI model is used to infer the passenger's psychological state from audio and image data, and based on the results, optimal vehicle control and environmental adjustments are made.

[0409] The terminal works in conjunction with the server to display the optimal travel route for passengers and adjusts environmental settings as needed. Specifically, it provides feedback and suggestions to passengers through the terminal's display and speaker. This enables appropriate interaction that responds to the passenger's emotions.

[0410] Passengers, as users, can customize their environment and select routes according to their preferences through the terminal's interface. This enables a personalized travel experience.

[0411] As a concrete example, when the server detects a passenger's level of stress, relaxation music could be played on the terminal. This series of actions would allow passengers to enjoy a more comfortable and reassuring travel experience.

[0412] Example of a prompt:

[0413] "The emotion recognition system captures passenger voice and facial expression data to determine whether the passenger is relaxed or stressed. Based on that determination, please suggest appropriate adjustments to the in-car environment."

[0414] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0415] Step 1:

[0416] The server receives passenger audio and image data from the terminal. This input data is collected via microphone and camera and sent to the server in real time. Once the data reaches the server, it is processed using a generative AI model to convert the audio into text information. This calculation outputs useful textual information from the acoustic data.

[0417] Step 2:

[0418] The server utilizes a generative AI model to analyze passengers' emotional states from audio and image data. Based on the input audio and image data, the server uses an analysis algorithm to quantify passengers' emotions and measure their levels of tension and relaxation. This analysis result is generated as the server's output.

[0419] Step 3:

[0420] The terminal receives emotion analysis results from the server and adjusts the environment based on the passenger's psychological state. Based on the analysis results received as input, the terminal's control program operates, for example, activating a speaker to play relaxation music. It also displays emotionally appropriate suggestions on the display.
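The terminal-side decision in Step 3 can be illustrated with a small rule table. The function `choose_adjustments`, the stress scale from 0 to 1, the thresholds, and the action names are assumptions for this sketch; only the relaxation music and the route suggestions are taken from the text.

```python
def choose_adjustments(stress: float) -> list[str]:
    """Map an estimated stress level in [0, 1] to environment adjustments."""
    if not 0.0 <= stress <= 1.0:
        raise ValueError("stress must be between 0 and 1")
    actions = []
    # Thresholds are illustrative assumptions, not values from the text.
    if stress >= 0.7:
        actions += ["play_relaxation_music", "suggest_calmer_route"]
    elif stress >= 0.4:
        actions.append("play_relaxation_music")
    return actions


print(choose_adjustments(0.8))
```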

[0421] Step 4:

[0422] Passengers, as users, can review feedback from the terminal and make choices to adjust their travel experience. The relaxation music and route suggestions output from the terminal help passengers feel more at ease. Further adjustments can be made based on new data as users re-enter their preferences.

[0423] Step 5:

[0424] The server receives new user input and repeats the preceding steps, continuously adjusting the environment and route accordingly. This cycle keeps the travel experience optimized as the passenger's state changes.

[0425] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0426] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives as input a prompt containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
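The input described here, a prompt plus one or more kinds of inference data, can be modeled as a small data structure. The `InferenceRequest` class and its methods are hypothetical names for illustration and do not correspond to the API of any particular generative AI service.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InferenceRequest:
    """Hypothetical input bundle for a data generation model."""
    prompt: str
    text: Optional[str] = None
    audio: Optional[bytes] = None
    image: Optional[bytes] = None

    def modalities(self) -> list[str]:
        """List which kinds of inference data are attached."""
        present = []
        if self.text is not None:
            present.append("text")
        if self.audio is not None:
            present.append("audio")
        if self.image is not None:
            present.append("image")
        return present

    def validate(self) -> None:
        """Reject requests without instructions or without any inference data."""
        if not self.prompt.strip():
            raise ValueError("prompt must not be empty")
        if not self.modalities():
            raise ValueError("at least one kind of inference data is required")


request = InferenceRequest(
    prompt="Please analyze what emotions the user is expressing.",
    text="I have strong analytical skills.",
)
request.validate()
print(request.modalities())
```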

[0427] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.

[0428] [Third Embodiment]

[0429] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.

[0430] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.

[0431] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0432] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.

[0433] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0434] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0435] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0436] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0437] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0438] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0439] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0440] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".

[0441] This invention is implemented in the form of an automated interview system and consists of three main components: a server, a terminal, and a user. The functions performed by each component of the system and their specific roles are described below.

[0442] 1. Server Functions and Roles

[0443] The server acts as the central hub of the system, managing interviews, processing and analyzing data, and providing results. Specifically, the server prepares an optimal set of questions for each applicant and sends them to the terminal in real time. Furthermore, the server quickly converts the received audio data into text and analyzes the text using natural language processing technology. Emotional state inference is achieved by analyzing the tone and tempo of the voice, as well as facial expressions from image data (if the interview is in video format). This allows the server to quantify the applicant's suitability and skills and calculate their fit with the company.

[0444] 2. Terminal Functions and Roles

[0445] The terminal functions as an interface with the user. Specifically, the terminal presents the user with questions sent from the server and records the user's responses as audio data. It also sends the recorded audio data to the server in real time, maintaining a data pipeline to ensure the smooth progress of the interview. The terminal provides an environment where the user can behave naturally and without stress.

[0446] 3. User Roles

[0447] As the main participant in the interview process, the user verbally answers questions presented via the terminal. The user's responses are recorded in real time and sent to the server as audio data for analysis. Users are expected to present their skills and experience without feeling nervous.

[0448] These components work together organically to enable fair and efficient automated interviews. For example, the server sends a question to the terminal saying, "Please introduce yourself," and the user answers. Through this series of interactions, the system can evaluate the applicant's abilities and characteristics with high accuracy.
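The question-and-answer exchange described above can be pictured as a pair of messages passed between the server and the terminal. The following is a minimal sketch, assuming a JSON wire format; the message classes, field names, and helper functions are hypothetical illustrations and are not specified in this disclosure.

```python
import base64
import json
from dataclasses import asdict, dataclass


@dataclass
class QuestionMessage:
    """Server -> terminal: one interview question (hypothetical format)."""
    session_id: str
    question_id: int
    text: str


@dataclass
class AnswerMessage:
    """Terminal -> server: recorded audio for one question (hypothetical format)."""
    session_id: str
    question_id: int
    audio_b64: str  # raw audio bytes, base64-encoded so they fit in JSON


def encode_question(msg: QuestionMessage) -> str:
    """Serialize a question for transmission to the terminal."""
    return json.dumps({"type": "question", **asdict(msg)})


def encode_answer(session_id: str, question_id: int, audio: bytes) -> str:
    """Wrap recorded audio bytes in an answer message."""
    audio_b64 = base64.b64encode(audio).decode("ascii")
    msg = AnswerMessage(session_id, question_id, audio_b64)
    return json.dumps({"type": "answer", **asdict(msg)})


def decode_answer_audio(payload: str) -> bytes:
    """Recover the raw audio bytes on the server side."""
    data = json.loads(payload)
    if data.get("type") != "answer":
        raise ValueError("not an answer message")
    return base64.b64decode(data["audio_b64"])
```

In this sketch the server would call `encode_question` with the text "Please introduce yourself", and the terminal would reply with `encode_answer` once the user's response has been recorded.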

[0449] The following describes the processing flow.

[0450] Step 1:

[0451] The server initiates the interview session and selects the first question from a pre-configured list. The selected question is then sent to the terminal.

[0452] Step 2:

[0453] The terminal presents the user with questions received from the server, either verbally or in text. The interface is appropriately adjusted to ensure the user can comfortably answer the questions.

[0454] Step 3:

[0455] The user verbally answers the questions presented by the terminal. The answers are recorded in real time by the terminal.

[0456] Step 4:

[0457] The terminal sends the recorded audio data to the server. The transmission is performed without delay, ensuring high-quality data.

[0458] Step 5:

[0459] The server inputs the received audio data into a natural language processing (NLP) engine and converts it into text. The converted text becomes the basis for analyzing interview responses.

[0460] Step 6:

[0461] The server analyzes text data, extracts keywords, and evaluates the content of the responses. Furthermore, it infers the user's emotional state through analysis of tone and pace.

[0462] Step 7:

[0463] Based on the analysis results, the server matches the user's skill set with the skills required by the company and calculates a suitability score.

[0464] Step 8:

[0465] The server sends the suitability score and analysis results to the terminal, while simultaneously selecting the next question and continuing the evaluation as needed.

[0466] Step 9:

[0467] The terminal displays the feedback and analysis results received from the server to the user and, if necessary, presents the next question.

[0468] Step 10:

[0469] Once the process is complete, the server sends the final evaluation results to the company and also provides the user with information about the interview results and the next steps.
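The ten steps above can be sketched as a single server-side loop. This is an illustrative outline only: the function name `run_interview`, the `ask` and `transcribe` callables, and the rule of scoring suitability as the fraction of required skills mentioned are all assumptions made for this sketch. Emotional-state inference from tone and pace is omitted here, and the score is computed once at the end rather than after every question.

```python
import re
from typing import Callable, Dict, List


def run_interview(
    questions: List[str],
    ask: Callable[[str], bytes],
    transcribe: Callable[[bytes], str],
    required_skills: List[str],
) -> Dict[str, object]:
    """Walk through the question list and return a final evaluation."""
    matched: set = set()
    transcript: List[str] = []
    for question in questions:             # Steps 1 and 8: select a question
        audio = ask(question)              # Steps 2-4: present, record, send
        text = transcribe(audio)           # Step 5: audio -> text
        transcript.append(text)
        # Step 6: naive keyword extraction on lower-cased word tokens
        words = set(re.findall(r"[a-z0-9+#]+", text.lower()))
        matched |= {s for s in required_skills if s.lower() in words}
    # Step 7: suitability as the fraction of required skills mentioned
    score = len(matched) / len(required_skills) if required_skills else 0.0
    # Step 10: final result for the company and the user
    return {"score": score, "matched": sorted(matched), "transcript": transcript}
```

A caller would supply `ask` as the round trip to the terminal and `transcribe` as the call into whichever speech recognition service is used.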

[0470] (Example 1)

[0471] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0472] In traditional interview processes, questions were standardized, making it difficult to evaluate applicants based on their individual characteristics. Furthermore, real-time data processing was insufficient, posing a challenge in realizing an efficient applicant evaluation system. Additionally, the assessment of applicants' emotional states and the calculation of suitability were cumbersome, making it difficult to achieve appropriate matching between companies and applicants.

[0473] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0474] In this invention, the server includes means for dynamically creating personalized questions based on generated information, means for transmitting the question set to a terminal in real time, and means for recording the user's responses as audio information and transmitting the information to the server. This enables an interview process optimized for each applicant and allows for real-time data analysis to evaluate suitability.

[0475] "A means of dynamically creating personalized questions based on generated information" refers to a method for generating appropriate questions in real time based on information that differs for each applicant, thereby providing a customized interview experience.

[0476] "A means of sending question sets to terminals in real time" refers to a method of instantly distributing questions generated on a server to terminals, enabling smooth interview progress.

[0477] "Means for recording the user's response as audio information and transmitting said information to a server" refers to means for saving the content of the applicant's verbal response as audio data and promptly transmitting that data to a server.

[0478] "Means for converting audio information into text information using automatic conversion means" refers to means that use speech recognition technology to convert recorded audio data into text.

[0479] "A means of analyzing textual information converted using natural language processing technology to evaluate emotional states" refers to a method of extracting emotions and intentions from interview content obtained as text using natural language processing technology, and evaluating the psychological state of the applicant.

[0480] "Methods for evaluating the suitability of applicants to the organization based on analyzed information" refer to methods for quantifying applicant characteristics and skills and calculating the degree of suitability by comparing them with the organization's requirements.

[0481] This invention is implemented as an automated interview system and mainly consists of three components: a server, a terminal, and a user. This system is designed to efficiently evaluate the suitability of each applicant, and its details are described below.

[0482] Server Functions

[0483] The server functions as the central device of the system. Based on data from applicants, it generates personalized questions using a generative AI model. For example, it dynamically generates questions based on the applicant's history using natural language generation technology. To achieve this, it uses machine learning frameworks such as TensorFlow and PyTorch to implement question generation.

[0484] The server also receives voice information transmitted from the user via the terminal and converts it into text. This conversion process uses speech recognition software such as the Google Cloud Speech-to-Text API. The converted text information is analyzed using natural language processing libraries such as NLTK and spaCy. This analysis extracts and quantifies the applicant's emotions and skills to calculate a degree of fit.

[0485] Terminal Functions

[0486] The terminal acts as an interface connecting the user and the server. It visually displays questions sent from the server for the user to review. Furthermore, it can also present questions audibly using Text-to-Speech technology. During the question-and-answer session, the terminal records the user's responses using a high-quality microphone and sends the audio data to the server in real time. This communication takes place in a stable internet environment.

[0487] User roles

[0488] The user is the subject of the interview, verbally answering questions presented via the terminal. This allows the user's skills and experience to be collected as audio data. Users are expected to be as natural as possible during the interview, and creating a stress-free environment is crucial.

[0489] Examples of specific cases and prompt statements

[0490] As a concrete example, if a user is asked "Please introduce yourself," they might respond with "I have been working in software development for five years." This information is then analyzed sequentially to provide appropriate feedback.

[0491] An example of a prompt message might be, "Create questions to elicit self-introductions from applicants during interviews. These questions should be designed to facilitate the assessment of the applicant's skills and experience." This allows the system to present applicants with personalized questions, enabling more efficient interviews.
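A prompt of this kind could be assembled from applicant data before being passed to the generative AI model. The sketch below is a minimal illustration; the function name `build_question_prompt` and the profile fields `position`, `years_experience`, and `skills` are hypothetical, and the call to the model itself is deliberately left out.

```python
from typing import Dict, List


def build_question_prompt(profile: Dict[str, object], num_questions: int = 3) -> str:
    """Assemble a question-generation prompt from applicant data.

    The profile keys used here are assumptions made for illustration.
    """
    skills: List[str] = list(profile.get("skills", []))
    skill_text = ", ".join(skills) if skills else "none listed"
    lines = [
        f"Create {num_questions} interview questions for an applicant to the "
        f"position of {profile.get('position', 'unspecified')}.",
        f"Years of experience: {profile.get('years_experience', 'unknown')}.",
        f"Skills listed by the applicant: {skill_text}.",
        "These questions should be designed to facilitate the assessment of "
        "the applicant's skills and experience.",
    ]
    return "\n".join(lines)
```

Keeping prompt construction in a plain function makes it easy to inspect exactly what applicant information is sent to the model.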

[0492] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0493] Step 1:

[0494] The server receives the applicant's data and inputs prompts into a generative AI model to generate personalized questions. Specifically, it generates prompts based on the applicant's history and work experience, and inputs them into the generative AI model to create the most appropriate questions. As output, a personalized set of questions is generated and used in the next step.

[0495] Step 2:

[0496] The server sends the generated set of questions to the terminal in real time. The terminal either displays them visually or presents them aloud using speech synthesis technology. For example, it might display the question "Please introduce yourself" on the screen and simultaneously read it aloud. In this step, the generated set of questions is used as input, and the output is a question in a format that the user can review.

[0497] Step 3:

[0498] The user verbally answers questions presented through the terminal. The terminal records these voice responses using a high-quality microphone and generates audio data. The input is the user's verbal responses, and the output is the generated audio data. This audio data is immediately sent to the server.

[0499] Step 4:

[0500] The server converts the received audio data into text using speech recognition software. Specifically, it passes the audio file to a service like Google Cloud Speech-to-Text and obtains text data as the conversion result. The input is audio data, and the output is parseable text data.

[0501] Step 5:

[0502] The server applies natural language processing techniques to the converted text data to evaluate the applicant's emotional state and skills. This analysis uses libraries such as NLTK and spaCy to extract keywords and analyze context. The input is text data, and the output is a numerical evaluation result.

[0503] Step 6:

[0504] The server calculates the applicant's fit with the organization based on the analyzed information. Using a machine learning model, it calculates a fit score by comparing skills with required qualifications. The input is the evaluation result, and the output is the fit score. This step involves data manipulation to compare the applicant's skills with the company's hiring requirements.

[0505] Step 7:

[0506] The server generates the final evaluation results and provides feedback to the user via the terminal. Specifically, it generates an evaluation report and makes it available for viewing on the terminal. At this stage, the fit score is used as input, and a detailed feedback report is provided as output.
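Steps 5 to 7 of Example 1 can be illustrated with a small sketch. Note that the text describes a machine learning model for the fit score; the code below swaps in a much simpler weighted-coverage rule purely for illustration. The function names, the requirement weights, and the report layout are assumptions, and keyword matching is done with a plain regular expression rather than NLTK or spaCy.

```python
import re
from typing import Dict, List


def extract_keywords(text: str, vocabulary: List[str]) -> List[str]:
    """Return the vocabulary terms that occur in the response text (Step 5)."""
    tokens = set(re.findall(r"[a-z0-9+#]+", text.lower()))
    return [term for term in vocabulary if term.lower() in tokens]


def fit_score(found: List[str], requirements: Dict[str, float]) -> float:
    """Weighted share of required qualifications covered, 0.0 to 1.0 (Step 6)."""
    total = sum(requirements.values())
    if total == 0:
        return 0.0
    covered = sum(w for skill, w in requirements.items() if skill in found)
    return covered / total


def feedback_report(found: List[str], requirements: Dict[str, float]) -> str:
    """Plain-text feedback listing covered and missing qualifications (Step 7)."""
    missing = [s for s in requirements if s not in found]
    score = fit_score(found, requirements)
    return (
        f"Fit score: {score:.2f}\n"
        f"Covered: {', '.join(found) or 'none'}\n"
        f"Not yet demonstrated: {', '.join(missing) or 'none'}"
    )
```

For instance, with requirements weighted `{"python": 2.0, "sql": 1.0, "docker": 1.0}`, a response mentioning Python and Docker covers three of the four weight units.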

[0507] (Application Example 1)

[0508] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0509] Traditional interview systems require considerable time and effort to accurately assess an applicant's suitability, and the interviewer's subjectivity can influence the outcome. Furthermore, it is difficult to accurately grasp an applicant's emotions and psychological state, resulting in a problem where their fit with the organization cannot be properly determined.

[0510] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0511] In this invention, the server includes means for performing facial expression analysis using video equipment for the purpose of data visualization, means for performing voice analysis to evaluate the applicant's psychological state, and means for converting voice data into text in real time. This makes it possible to more accurately evaluate the applicant's emotions and aptitude, and to objectively and efficiently calculate their suitability for the organization.

[0512] "Data acquisition" refers to the process by which the system collects information related to interviews in real time.

[0513] "Audio data" refers to recordings of applicants' statements, with the sound waveform saved in digital format.

[0514] "Real-time text conversion" refers to the process of converting audio data into text data the moment it is received.

[0515] "Emotional state estimation" involves analyzing audio and image data to evaluate the applicant's emotions and psychological state.

[0516] "Analysis from audio and image data" refers to a method of measuring and analyzing data such as the applicant's voice tone and facial expressions.

[0517] "Fit" refers to the degree of compatibility or matching between an applicant and an organization, expressed using numerical values or indicators.

[0518] A "server" is a central device that manages and analyzes interview data.

[0519] "Data visualization" is a technique that displays collected data in the form of graphs, charts, and other diagrams to make the content easier to understand intuitively.

[0520] "Facial expression analysis using video equipment" is a method that uses devices such as cameras to analyze the facial expressions of applicants and evaluate their emotional state.

[0521] "Voice analysis" is the process of analyzing voice data in detail to extract characteristics of speech content and emotions.

[0522] This invention is a system that evaluates applicants' aptitude and psychological state through automated interviews. The server receives applicant voice data in real time and converts it to text using speech recognition software. This process utilizes speech recognition technologies such as the Google Cloud Speech-to-Text API. Furthermore, the server analyzes the converted text data through natural language processing to evaluate emotions and psychological state. For this purpose, an emotion analysis model built with the Transformers library is employed.

[0523] The terminal functions as an interface with the user (applicant), and is equipped with a camera and microphone to capture the applicant's video and audio. During the interview, the terminal performs facial expression analysis based on the video data and sends the results to the server. This facial expression analysis uses image processing technology based on OpenCV. This allows the server to comprehensively analyze the data obtained from the audio and images and calculate the applicant's suitability.

[0524] For example, if a security company uses this system to interview new security guards, the interviewer can monitor the applicant's psychological state and confidence in their answers in real time through smart glasses, and ask appropriate follow-up questions. This makes it possible to objectively and efficiently evaluate the applicant's characteristics.

[0525] An example of a prompt for a generative AI model is, "Please tell me what kind of facial expression the applicant has, and whether it indicates stress or confidence." This prompt enables deeper emotional analysis.

[0526] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0527] Step 1:

[0528] The terminal activates its camera and microphone as soon as the user begins preparing for the interview. The user's voice and video data are acquired as input and transmitted to the server in real time. During this process, data compression is performed to optimize transmission to the server.

[0529] Step 2:

[0530] The server converts received audio data into text in real time using speech recognition software. It receives compressed audio data as input and generates text data of the applicant's responses as output. In this process, it uses the Google Cloud Speech-to-Text API to convert the audio signal into text format.

[0531] Step 3:

[0532] The server processes the converted text data through a natural language processing model (e.g., Transformers) to perform sentiment analysis. The input is text data, and the output is the sentiment evaluation result. It evaluates the applicant's emotional state based on the text content and provides the evaluation result to the interviewer.

[0533] Step 4:

[0534] The terminal analyzes the captured video data in real time using OpenCV to recognize and analyze the user's facial expressions. It extracts facial features from the input video data and generates numerical data of the user's emotional state based on their facial expressions as output. In this process, it performs image processing to capture subtle changes in facial expressions.

[0535] Step 5:

[0536] The server integrates the results of sentiment analysis based on audio and video data to calculate an overall suitability score. The input is the results of each data analysis, and the output is an indicator of the applicant's suitability score. This process allows for a comprehensive assessment of the applicant's emotions and psychological state, and evaluates their compatibility with the organization.
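The integration in Step 5 could, for example, take the form of a weighted average over per-modality scores. The sketch below assumes each analysis has already been reduced to a value between 0 and 1; the function name `fuse_scores`, the modality labels, and the choice of weights are hypothetical and are not taken from this disclosure.

```python
from typing import Dict


def fuse_scores(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-modality scores, each expected in [0, 1].

    Modalities missing from `scores` (for example, no video available)
    are skipped and the remaining weights are renormalized.
    """
    used = {m: w for m, w in weights.items() if m in scores and w > 0}
    total = sum(used.values())
    if total == 0:
        raise ValueError("no overlapping modalities with positive weight")
    for modality in used:
        if not 0.0 <= scores[modality] <= 1.0:
            raise ValueError(f"score for {modality!r} must be in [0, 1]")
    return sum(scores[m] * w for m, w in used.items()) / total
```

Renormalizing over the available modalities means an audio-only interview still yields a score on the same scale as one with video.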

[0537] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0538] The present invention is implemented as an automated interview system incorporating an emotion engine. In addition to the configuration of a server, terminal, and user, the system has the function of recognizing the user's emotions using the emotion engine and integrating them into the analysis data.

[0539] 1. Server Functions and Roles

[0540] The server manages the entire interview process, particularly the analysis and integration of data obtained from the emotion engine. The server processes both audio and image data in real time, utilizing the emotion engine to infer the user's emotional state. Furthermore, it integrates this emotional data into existing analysis algorithms to provide a more comprehensive evaluation of the applicant's skills and suitability.

[0541] 2. Terminal Functions and Roles

[0542] The terminal acts as an interface, connecting the user and the system. It presents the user with questions sent from the server, records the user's responses, and sends them back to the server. It also captures the user's facial expressions through its built-in camera, providing data to the emotion engine in real time.

[0543] 3. Emotion Engine Functions and Roles

[0544] The emotion engine analyzes nonverbal information obtained from the user's voice and facial expressions to identify their emotional state. This engine utilizes facial recognition technology and acoustic analysis to quantify the user's tension, anxiety, excitement, and other emotional states. The identified emotional state is then adjusted so that it does not unduly affect the user's performance evaluation, and is sent to the server.

[0545] 4. User Roles

[0546] Users respond to questions presented via their device and participate in the interview in a natural manner. The introduction of an emotion engine includes the user's emotional aspects in the interview evaluation; however, this is intended to complement the accurate assessment of abilities, ensuring that the user's true capabilities are fairly evaluated.

[0547] As a concrete example of the system, the server sends the question "What are your professional strengths?" to the terminal, and the user answers, "I have strong analytical skills." Subsequently, the emotion engine detects that the user is stating this answer with confidence and incorporates this information into the overall evaluation. This creates an interview environment where the user can relax and demonstrate their true worth.
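As a rough illustration of how an emotion engine might arrive at a statement such as "the user is answering with confidence", the sketch below combines two acoustic cues and one facial cue using fixed linear weights. Everything here is an assumption made for illustration: the feature names `pause_ratio`, `pitch_jitter`, and `smile_prob`, the weights, and the `EmotionEstimate` structure. A practical engine would obtain such features from trained acoustic and facial-expression models.

```python
from dataclasses import dataclass


@dataclass
class EmotionEstimate:
    confidence: float  # 0.0 (hesitant) to 1.0 (confident)
    tension: float     # 0.0 (relaxed) to 1.0 (tense)


def _clamp(x: float) -> float:
    """Restrict a value to the range [0, 1]."""
    return max(0.0, min(1.0, x))


def estimate_emotion(pause_ratio: float, pitch_jitter: float,
                     smile_prob: float) -> EmotionEstimate:
    """Toy emotion engine combining two acoustic cues and one facial cue.

    pause_ratio:  fraction of the answer spent in silence (0-1)
    pitch_jitter: normalized pitch instability (0-1)
    smile_prob:   probability of a relaxed expression from a face model (0-1)
    """
    tension = _clamp(0.5 * pitch_jitter + 0.5 * (1.0 - smile_prob))
    confidence = _clamp(1.0 - 0.5 * pause_ratio - 0.5 * tension)
    return EmotionEstimate(confidence=confidence, tension=tension)
```

With few pauses, a steady pitch, and a relaxed expression, this toy rule reports low tension and high confidence, which the server could then fold into the overall evaluation.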

[0548] The following describes the processing flow.

[0549] Step 1:

[0550] The server prepares to begin the interview session, selects the first question from a pre-configured list, and sends it to the terminal.

[0551] Step 2:

[0552] The terminal displays and audibly presents the questions received from the server to the user. It then activates voice input mode to allow the user to answer.

[0553] Step 3:

[0554] The user verbally answers the questions from the terminal. The terminal records the user's voice in high quality and sends the audio data to the server.

[0555] Step 4:

[0556] The server processes the received audio data using a natural language processing engine to convert it into text. During this process, it prepares to analyze the keywords and content of the response.

[0557] Step 5:

[0558] The terminal captures the user's facial expressions with its built-in camera and sends that video data to the emotion engine. This allows for real-time analysis of emotional information.

[0559] Step 6:

[0560] The emotion engine analyzes the user's voice tone and facial expressions to identify their emotional state. This information is used to estimate the user's level of tension and confidence.

[0561] Step 7:

[0562] The server integrates text and sentiment data to calculate a comprehensive score that adjusts for the user's skill fit and the influence of emotions. This allows for an assessment of the applicant's match with the company's needs.

[0563] Step 8:

[0564] The server returns the analysis results to the terminal, which then displays the results to the user as feedback. If necessary, it displays the next interview question and continues the interview process.

[0565] Step 9:

[0566] Once the user's interview is complete, the server compiles all the analysis results and sends an overall evaluation of the applicant to the company. The company then uses this information to make hiring decisions.
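Steps 7 and 9 above can be sketched as a per-answer adjustment followed by an aggregation over the whole session. The adjustment rule used here, which gives a small upward correction to answers delivered under high tension so that nervousness alone does not lower the evaluation, is one possible reading of "adjusts for the influence of emotions" and is an assumption of this sketch, as are the names `AnswerResult`, `adjusted_score`, `overall_evaluation`, and the `max_relief` parameter.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class AnswerResult:
    question: str
    skill_fit: float  # 0-1, from text analysis
    tension: float    # 0-1, from the emotion engine


def adjusted_score(r: AnswerResult, max_relief: float = 0.2) -> float:
    """Step 7: offset part of the gap to a perfect score when tension is high."""
    return min(1.0, r.skill_fit + max_relief * r.tension * (1.0 - r.skill_fit))


def overall_evaluation(results: List[AnswerResult]) -> Dict[str, float]:
    """Step 9: compile per-question results into a summary for the company."""
    if not results:
        raise ValueError("no answers recorded")
    return {
        "questions_answered": len(results),
        "mean_score": round(mean(adjusted_score(r) for r in results), 3),
        "mean_tension": round(mean(r.tension for r in results), 3),
    }
```

Reporting the mean tension alongside the adjusted score keeps the emotional signal visible to the reviewer instead of hiding it inside a single number.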

[0567] (Example 2)

[0568] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0569] Traditional interview methods have limitations in measuring applicants' skills and aptitudes, and it was particularly difficult to adequately assess their emotional aspects. Furthermore, the lack of a means to suggest the next steps based on the interview's progress made it difficult to properly evaluate candidates and achieve an efficient interview process.

[0570] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0571] In this invention, the server includes means for conducting automated interviews for the purpose of data acquisition, means for converting voice data into text in real time, means for analyzing the user's emotional state from voice and image data using an emotion engine, and means for integrating the analysis results into a comprehensive evaluation system. This enables a comprehensive evaluation that includes the applicant's emotions and also allows for suggestions for the next interview stage based on their aptitude.

[0572] "Automated interviews" refer to the process of conducting interviews with applicants using mechanical means and collecting data from them.

[0573] "Means of real-time text conversion" refers to technologies or devices that instantly convert audio data into written text.

[0574] An "emotion engine" is an algorithm or system that analyzes and identifies a user's emotional state based on audio and image data.

[0575] "Means for integrating analysis results into a comprehensive evaluation system" refers to a method or apparatus for unifying obtained emotional data and other evaluation data and incorporating them into a comprehensive evaluation.

[0576] "Means for comprehensively evaluating suitability" refers to a process or device that uses analytical data to comprehensively assess an applicant's abilities and suitability.

[0577] This invention relates to a system having an automated interview format. This system consists of a server, a terminal, an emotion engine, and a user. The roles of each component and specific embodiments are described below.

[0578] The server is the core of the system, responsible for preparing interview questions and analyzing data. Specifically, the server selects appropriate questions from the interview question database and sends them to the terminal. It also receives audio and image data sent from the terminal and analyzes the data using an emotion engine. The analysis results are integrated into the comprehensive evaluation system and used to evaluate the user's skills and aptitude from multiple perspectives.

[0579] The terminal functions as an interface connecting the user and the system. When the terminal receives a question from the server, it presents it to the user. It can record the user's responses and capture their facial expressions with its built-in camera. The collected data is transmitted to the server in real time.

[0580] The user answers questions presented by the terminal. This process allows the user to behave naturally, and the emotion engine captures the user's intonation and facial expressions to identify their emotional state.

[0581] The emotion engine is software that analyzes audio and image data. This engine utilizes facial recognition and acoustic analysis technologies to quantify the user's emotions, such as tension and confidence. This enables accurate emotion analysis, which is then reflected in the user's overall evaluation.

[0582] As a concrete example, consider a scenario where a server sends the question "What are your professional strengths?" to a terminal, and the user responds, "I have strong analytical skills." In this case, the emotion engine analyzes the user's confidence and integrates the results into the overall evaluation system. As a result, a more accurate evaluation is made that includes the user's emotions.

[0583] An example of a prompt message would be, "Please analyze what emotions the user is expressing," which would allow the generative AI model to assist in the analysis.

[0584] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0585] Step 1:

[0586] The server prepares the interview questions. The server selects the most appropriate questions from the interview question database based on the applied position. The input to this process is the applicant's profile information, and the output is the question data sent to the terminal.

[0587] Step 2:

[0588] The terminal presents the user with pre-prepared questions. The terminal displays the received question data in audio or text format to help the user understand the questions. The input is the question data from the server, and the output is the presentation of the questions to the user.

[0589] Step 3:

[0590] The user answers the presented questions. The user provides voice answers to the questions through the terminal. The user's facial expressions are also observed. In this step, the input is the questions from the terminal, and the output is the user's voice responses and facial expression information.

[0591] Step 4:

[0592] The terminal records the user's responses and facial expressions. The terminal records the user's voice using its built-in microphone and captures their facial expressions with its camera. This audio and visual data is transmitted to the server in real time. The input is the user's voice and facial expressions, and the output is the data transfer to the server.

[0593] Step 5:

[0594] The emotion engine operates on the server and analyzes the received audio and facial expression data. It uses facial expression recognition algorithms and acoustic analysis techniques to quantify the user's emotional state. The input is audio and image data from the terminal, and the output is data representing the identified emotional state.

[0595] Step 6:

[0596] The server integrates the analysis results into the comprehensive evaluation system. The server takes in the emotional data obtained from the emotion engine along with other evaluation data to form the final evaluation. The input is the output data from the emotion engine, and the output is a comprehensive applicant evaluation.
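Step 6, together with the next-stage suggestion mentioned for Example 2, might look like the following sketch. The blending weight, the thresholds of 0.75 and 0.5, the wording of the suggestions, and the function name `comprehensive_evaluation` are all hypothetical values chosen for illustration and would need to be set by the organization using the system.

```python
from typing import Dict


def comprehensive_evaluation(emotion: Dict[str, float], skill_score: float,
                             emotion_weight: float = 0.3) -> Dict[str, object]:
    """Blend emotion-engine output with a skill score and suggest a next stage.

    `emotion` is assumed to carry a "confidence" value in [0, 1]; a neutral
    0.5 is used when the emotion engine produced no estimate.
    """
    confidence = emotion.get("confidence", 0.5)
    total = (1.0 - emotion_weight) * skill_score + emotion_weight * confidence
    if total >= 0.75:
        next_step = "advance to the next interview stage"
    elif total >= 0.5:
        next_step = "schedule a follow-up interview"
    else:
        next_step = "refer to a human reviewer"
    return {"total": round(total, 3), "next_step": next_step}
```

Routing low scores to a human reviewer rather than to an automatic rejection is a design choice of this sketch, made so that the automated result supports the hiring decision instead of replacing it.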

[0597] (Application Example 2)

[0598] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0599] In modern autonomous vehicles, it is difficult to grasp passengers' psychological comfort in real time and provide individualized feedback and adjustments. Furthermore, the inability to provide optimal travel environments and routes that respond to passengers' emotions limits the quality of the travel experience.

[0600] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0601] In this invention, the server includes means for performing automated screening for the purpose of data collection, means for converting acoustic data into text information in real time, means for analyzing audio and image data to infer emotional states, means for adjusting the environment to optimize vehicle control based on the passenger's emotional state, and means for proposing the optimal travel route to the passenger based on the analysis results. This makes it possible to provide a travel environment and route that matches the passenger's psychological state in real time.

[0602] "Data collection" refers to the act of collecting the information and data necessary to implement automated screening.

[0603] "Audio data" refers to digital or analog data information that includes audio signals.

[0604] "Textual information" refers to information expressed in text format.

[0605] "Emotional state" is a concept that refers to a person's psychological state or emotional tendencies.

[0606] "Analytical means" refers to methods or devices used to derive specific conclusions or information based on data.

[0607] "Mobile object control" refers to the act of managing the operation and movement of vehicles and other mobile objects.

[0608] "Environmental adjustment means" refers to technologies or devices for adjusting the physical or digital environment to match the psychological state of passengers.

[0609] A "travel route" refers to the optimal path or route from the starting point to the destination.

[0610] "Passenger" refers to a person who travels using an autonomous vehicle or other means of transportation.

[0611] The system that realizes this application consists of multiple components. The server collects and analyzes passenger data by converting acoustic data into text information in real time and performing analysis to identify emotional states. In the emotion analysis, a specific generative AI model is used to infer the passenger's psychological state from audio and image data, and based on the results, optimal vehicle control and environmental adjustments are made.

[0612] The terminal works in conjunction with the server to display the optimal travel route for passengers and adjusts environmental settings as needed. Specifically, it provides feedback and suggestions to passengers through the terminal's display and speaker. This enables appropriate interaction that responds to the passenger's emotions.

[0613] Passengers, as users, can customize their environment and select routes according to their preferences through the terminal's interface. This enables a personalized travel experience.

[0614] As a concrete example, when the server detects a passenger's level of stress, relaxation music could be played on the terminal. This series of actions would allow passengers to enjoy a more comfortable and reassuring travel experience.

[0615] Example of a prompt:

[0616] "The emotion recognition system captures passenger voice and facial expression data to determine whether the passenger is relaxed or stressed. Based on that determination, please suggest appropriate adjustments to the in-car environment."

[0617] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0618] Step 1:

[0619] The server receives passenger audio and image data from the terminal. This input data is collected via microphone and camera and sent to the server in real time. Once the data reaches the server, it is processed using a generative AI model to convert the audio into text information. This conversion yields text information from the acoustic data for use in the subsequent analysis.

[0620] Step 2:

[0621] The server uses a generative AI model to analyze passengers' emotional states from audio and image data. Based on the input audio and image data, the server uses an analysis algorithm to quantify passengers' emotions and measure their levels of tension and relaxation. This analysis result is generated as the server's output.

[0622] Step 3:

[0623] The terminal receives emotion analysis results from the server and adjusts the environment based on the passenger's psychological state. Based on the analysis results received as input, the terminal's control program operates, for example, activating a speaker to play relaxation music. It also displays emotionally appropriate suggestions on the display.

[0624] Step 4:

[0625] Passengers, as users, can review feedback from the terminal and make choices to adjust their travel experience. By listening to the relaxation music and reviewing the route suggestions output from the terminal, passengers can feel more at ease. Further adjustments can be made based on new data when users re-enter their preferences.

[0626] Step 5:

[0627] The server receives new user input and repeats the same process as the previous step, continuously adjusting the environment and route accordingly. This cycle ensures that an optimized travel experience is always provided.
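
As a minimal sketch of Steps 2 and 3, assuming the server reduces its emotion analysis to a single stress value between 0 and 1, the terminal-side choice of adjustments could look like the following. The names `EmotionResult` and `choose_adjustments`, the action strings, and the 0.7 and 0.3 thresholds are hypothetical and do not appear in the specification.

```python
from dataclasses import dataclass


@dataclass
class EmotionResult:
    """Hypothetical server output: stress in [0, 1], higher means more tense."""
    stress: float


def choose_adjustments(result: EmotionResult) -> list[str]:
    """Map a quantified emotional state to terminal actions (illustrative thresholds)."""
    if not 0.0 <= result.stress <= 1.0:
        raise ValueError("stress must be within [0, 1]")
    actions = []
    if result.stress >= 0.7:
        # High tension: play relaxation music and propose a calmer route.
        actions.append("play_relaxation_music")
        actions.append("suggest_calmer_route")
    elif result.stress >= 0.3:
        # Moderate tension: only show a suggestion on the display.
        actions.append("show_suggestion_on_display")
    # Below 0.3 the passenger is treated as relaxed and nothing is changed.
    return actions
```

Repeating this call each time new analysis results arrive would correspond to the cycle described in Step 5.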

[0628] The specific processing unit 290 transmits the result of the specific processing to the headset terminal 314. In the headset terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0629] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. Prompts containing instructions are input to the data generation model 58 together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
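
The input and output relationship described above can be sketched as a small interface. This is only an illustration under assumptions: `InferenceInput`, `DataGenerationModel`, and `EchoModel` are hypothetical names, and `EchoModel` is a stand-in that performs no real inference.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class InferenceInput:
    """Prompt plus optional inference data (all field names are hypothetical)."""
    prompt: str
    text: Optional[str] = None
    audio: Optional[bytes] = None
    image: Optional[bytes] = None


class DataGenerationModel(Protocol):
    """Anything that turns a prompt and inference data into a text result."""

    def infer(self, data: InferenceInput) -> str: ...


class EchoModel:
    """Stand-in for a real generative model; it only reports what it was given."""

    def infer(self, data: InferenceInput) -> str:
        modalities = [
            name for name in ("text", "audio", "image")
            if getattr(data, name) is not None
        ]
        return f"{data.prompt} [{', '.join(modalities)}]"
```

A real model would replace `EchoModel` while keeping the same `infer` signature, so the calling code would not need to change.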

[0630] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset terminal 314.

[0631] [Fourth Embodiment]

[0632] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.

[0633] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.

[0634] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0635] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.

[0636] The microphone 238 receives instructions from the user 20 by capturing the user's voice. The microphone 238 converts the captured voice into audio data and outputs the audio data to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0637] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0638] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0639] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.

[0640] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0641] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0642] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0643] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0644] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0645] This invention is implemented in the form of an automated interview system and consists mainly of three components: a server, a terminal, and a user. The functions performed by each component of the system and their specific roles are described below.

[0646] 1. Server Functions and Roles

[0647] The server acts as the central hub of the system, managing interviews, processing and analyzing data, and providing results. Specifically, the server prepares an optimal set of questions for each applicant and sends them to the terminal in real time. Furthermore, the server quickly converts the received audio data into text and analyzes the text using natural language processing technology. The emotional state is inferred from the tone and tempo of the voice and, if the interview is in video format, from facial expressions in the image data. This allows the server to quantify the applicant's suitability and skills and calculate their fit with the company.

[0648] 2. Terminal Functions and Roles

[0649] The terminal functions as an interface with the user. Specifically, the terminal presents the user with questions sent from the server and records the user's responses as audio data. It also sends the recorded audio data to the server in real time, maintaining a data pipeline to ensure the smooth progress of the interview. The terminal provides an environment where the user can behave naturally and without stress.

[0650] 3. User Roles

[0651] As the main participant in the interview process, the user verbally answers questions presented via the terminal. The user's responses are recorded in real time and sent to the server as audio data for analysis. Users are expected to present their skills and experience without feeling nervous.

[0652] These components work together organically to enable fair and efficient automated interviews. For example, the server sends a question to the terminal saying, "Please introduce yourself," and the user answers. Through this series of interactions, the system can evaluate the applicant's abilities and characteristics with high accuracy.

[0653] The following describes the processing flow.

[0654] Step 1:

[0655] The server initiates the interview session and selects the first question from a pre-configured list. The selected question is then sent to the terminal.

[0656] Step 2:

[0657] The terminal presents the user with questions received from the server, either verbally or in text. The interface is appropriately adjusted to ensure the user can comfortably answer the questions.

[0658] Step 3:

[0659] The user verbally answers the questions presented by the terminal. The answers are recorded in real time by the terminal.

[0660] Step 4:

[0661] The terminal sends the recorded audio data to the server. The transmission is performed without delay so that data quality is preserved.

[0662] Step 5:

[0663] The server inputs the received audio data into a natural language processing (NLP) engine and converts it into text. The converted text becomes the basis for analyzing interview responses.

[0664] Step 6:

[0665] The server analyzes text data, extracts keywords, and evaluates the content of the responses. Furthermore, it infers the user's emotional state through analysis of tone and pace.

[0666] Step 7:

[0667] Based on the analysis results, the server matches the user's skill set with the skills required by the company and calculates a suitability score.

[0668] Step 8:

[0669] The server sends the suitability score and analysis results to the terminal, while simultaneously selecting the next question and continuing the evaluation as needed.

[0670] Step 9:

[0671] The terminal displays the feedback and analysis results received from the server to the user and, if necessary, presents the next question.

[0672] Step 10:

[0673] Once the process is complete, the server sends the final evaluation results to the company and also provides the user with information about the interview results and the next steps.
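
The ten steps above can be sketched as a single loop. This is a simplified illustration, not the specification's implementation: `run_interview` and its callback parameters are hypothetical, and recording, speech-to-text conversion, and scoring are passed in as functions so that the sketch stays independent of any particular service.

```python
from typing import Callable


def run_interview(
    questions: list[str],
    record_answer: Callable[[str], bytes],
    transcribe: Callable[[bytes], str],
    score_answer: Callable[[str], float],
) -> dict:
    """Run the question loop and return per-question scores and their mean."""
    results = []
    for question in questions:            # Steps 1 and 8: select the next question
        audio = record_answer(question)   # Steps 2 to 4: present, record, transmit
        text = transcribe(audio)          # Step 5: convert audio to text
        score = score_answer(text)        # Steps 6 and 7: analyze and score
        results.append({"question": question, "answer": text, "score": score})
    # Step 10: the final evaluation is taken here as the mean of all scores.
    final = sum(r["score"] for r in results) / len(results) if results else 0.0
    return {"answers": results, "final_score": final}
```

Averaging the per-question scores is one simple choice for the final evaluation; the specification does not state how the individual results are combined.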

[0674] (Example 1)

[0675] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0676] In traditional interview processes, questions were standardized, making it difficult to evaluate applicants based on their individual characteristics. Furthermore, real-time data processing was insufficient, posing a challenge in realizing an efficient applicant evaluation system. Additionally, the assessment of applicants' emotional states and the calculation of suitability were cumbersome, making it difficult to achieve appropriate matching between companies and applicants.

[0677] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0678] In this invention, the server includes means for dynamically creating personalized questions based on generated information, means for transmitting the question set to a terminal in real time, and means for recording the user's responses as audio information and transmitting the information to the server. This enables an interview process optimized for each applicant and allows for real-time data analysis to evaluate suitability.

[0679] "A means of dynamically creating personalized questions based on generated information" refers to a method for generating appropriate questions in real time based on information that differs for each applicant, thereby providing a customized interview experience.

[0680] "A means of sending question sets to terminals in real time" refers to a method of instantly distributing questions generated on a server to terminals, enabling smooth interview progress.

[0681] "Means for recording the user's response as audio information and transmitting said information to a server" refers to means for saving the content of the applicant's verbal response as audio data and promptly transmitting that data to a server.

[0682] "Means for converting audio information into text information using automatic conversion means" refers to means that use speech recognition technology to convert recorded audio data into text.

[0683] "A means of analyzing textual information converted using natural language processing technology to evaluate emotional states" refers to a method of extracting emotions and intentions from interview content obtained as text using natural language processing technology, and evaluating the psychological state of the applicant.

[0684] "Methods for evaluating the suitability of applicants to the organization based on analyzed information" refer to methods for quantifying applicant characteristics and skills and calculating the degree of suitability by comparing them with the organization's requirements.

[0685] This invention is implemented as an automated interview system and mainly consists of three components: a server, a terminal, and a user. This system is designed to efficiently evaluate the suitability of each applicant, and its details are described below.

[0686] Server Functions

[0687] The server functions as the central device of the system. Based on data from applicants, it generates personalized questions using a generative AI model. For example, it dynamically generates questions based on the applicant's history using natural language generation technology. To achieve this, it uses machine learning frameworks such as TensorFlow and PyTorch to perform question generation.

[0688] The server also receives voice information transmitted from the user via the terminal and converts it into text. This conversion process uses speech recognition software such as the Google Cloud Speech-to-Text API. The converted text information is analyzed using natural language processing libraries such as NLTK and spaCy. This analysis extracts and quantifies the applicant's emotions and skills to calculate a degree of fit.

[0689] Terminal Functions

[0690] The terminal acts as an interface connecting the user and the server. It visually displays questions sent from the server for the user to review. Furthermore, it can also present questions audibly using Text-to-Speech technology. During the question-and-answer session, the terminal records the user's responses using a high-quality microphone and sends the audio data to the server in real time. This communication takes place in a stable internet environment.

[0691] User roles

[0692] The user is the subject of the interview and verbally answers the questions presented via the terminal. This allows the user's skills and experience to be collected as audio data. Users are expected to be as natural as possible during the interview, and creating a stress-free environment is crucial.

[0693] Examples of specific cases and prompt statements

[0694] As a concrete example, if a user is asked "Please introduce yourself," they might respond with "I have been working in software development for five years." This information is then analyzed sequentially to provide appropriate feedback.

[0695] An example of a prompt message might be, "Create questions to elicit self-introductions from applicants during interviews. These questions should be designed to facilitate the assessment of the applicant's skills and experience." This allows the system to present applicants with personalized questions, enabling more efficient interviews.
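
One way to assemble such a prompt from applicant data is sketched below. The function name `build_question_prompt`, its parameters, and the exact wording of the generated prompt are assumptions made for illustration; the specification only gives the example prompt quoted above.

```python
def build_question_prompt(history: list[str], num_questions: int = 3) -> str:
    """Build a question-generation prompt from an applicant's background entries."""
    if num_questions < 1:
        raise ValueError("num_questions must be at least 1")
    # Join the background entries; fall back to a neutral phrase if none exist.
    background = "; ".join(history) if history else "no background provided"
    return (
        f"Create {num_questions} interview questions for an applicant "
        f"with the following background: {background}. "
        "The questions should make it easy to assess the applicant's "
        "skills and experience."
    )
```

The returned string would then be passed to the generative AI model as the instruction part of its input.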

[0696] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0697] Step 1:

[0698] The server receives the applicant's data and inputs prompts into a generative AI model to generate personalized questions. Specifically, it generates prompts based on the applicant's history and work experience, and inputs them into the generative AI model to create the most appropriate questions. As output, a personalized set of questions is generated and used in the next step.

[0699] Step 2:

[0700] The server sends the generated set of questions to the terminal in real time. The terminal either displays them visually or presents them aloud using speech synthesis technology. For example, it might display the question "Please introduce yourself" on the screen and simultaneously read it aloud. In this step, the generated set of questions is used as input, and the output is a question in a format that the user can review.

[0701] Step 3:

[0702] The user verbally answers the questions presented through the terminal. The terminal records these voice responses using a high-quality microphone and generates audio data. The input is the user's verbal responses, and the output is the generated audio data. This audio data is immediately sent to the server.

[0703] Step 4:

[0704] The server converts the received audio data into text using speech recognition software. Specifically, it passes the audio file to a service like Google Cloud Speech-to-Text and obtains text data as the conversion result. The input is audio data, and the output is parseable text data.

[0705] Step 5:

[0706] The server applies natural language processing techniques to the converted text data to evaluate the applicant's emotional state and skills. This analysis uses libraries such as NLTK and spaCy to analyze the extracted keywords and their context. The input is text data, and the output is a numerical evaluation result.

[0707] Step 6:

[0708] The server calculates the applicant's fit with the organization based on the analyzed information. Using a machine learning model, it calculates a fit score by comparing skills with required qualifications. The input is the evaluation result, and the output is the fit score. This step involves data manipulation to compare the applicant's skills with the company's hiring requirements.

[0709] Step 7:

[0710] The server generates the final evaluation results and provides feedback to the user via the terminal. Specifically, it generates an evaluation report and makes it available for viewing on the terminal. At this stage, the fit score is used as input, and a detailed feedback report is provided as output.
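
Steps 5 and 6 can be illustrated with a deliberately simple stand-in. The specification mentions a machine learning model for the fit score; the sketch below instead substitutes a plain keyword-overlap ratio so that the comparison between applicant skills and required qualifications is easy to follow. The names `extract_skills` and `fit_score` and the single-word skill vocabulary are hypothetical.

```python
import re


def extract_skills(text: str, vocabulary: set[str]) -> set[str]:
    """Return the vocabulary entries that occur as whole words in the answer text.

    Only single-token skills are matched; multi-word skills would need phrase matching.
    """
    tokens = set(re.findall(r"[a-z0-9+#]+", text.lower()))
    return {skill for skill in vocabulary if skill.lower() in tokens}


def fit_score(applicant_skills: set[str], required: set[str]) -> float:
    """Fraction of required skills that the applicant mentioned, in [0, 1]."""
    if not required:
        return 0.0
    return len(applicant_skills & required) / len(required)
```

For example, an answer mentioning two of three required skills yields a score of about 0.67 under this simplified measure.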

[0711] (Application Example 1)

[0712] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0713] Traditional interview systems require considerable time and effort to accurately assess an applicant's suitability, and the interviewer's subjectivity can influence the outcome. Furthermore, it is difficult to accurately grasp an applicant's emotions and psychological state, resulting in a problem where their fit with the organization cannot be properly determined.

[0714] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0715] In this invention, the server includes means for performing facial expression analysis using video equipment for the purpose of data visualization, means for performing voice analysis to evaluate the applicant's psychological state, and means for converting voice data into text in real time. This makes it possible to more accurately evaluate the applicant's emotions and aptitude, and to objectively and efficiently calculate their suitability for the organization.

[0716] "Data acquisition" refers to the process by which the system collects information related to interviews in real time.

[0717] "Audio data" refers to recordings of applicants' statements, with the sound waveform saved in digital format.

[0718] "Real-time text conversion" refers to the process of converting audio data into text data the moment it is received.

[0719] "Emotional state estimation" involves analyzing audio and image data to evaluate the applicant's emotions and psychological state.

[0720] "Analysis from audio and image data" refers to a method of measuring and analyzing data such as the applicant's voice tone and facial expressions.

[0721] "Fit" refers to the degree of compatibility or matching between an applicant and an organization, expressed using numerical values or indicators.

[0722] A "server" is a central device that manages and analyzes interview data.

[0723] "Data visualization" is a technique that displays collected data in the form of graphs, charts, and other diagrams to make the content easier to understand intuitively.

[0724] "Facial expression analysis using video equipment" is a method that uses devices such as cameras to analyze the facial expressions of applicants and evaluate their emotional state.

[0725] "Voice analysis" is the process of analyzing voice data in detail to extract characteristics of speech content and emotions.

[0726] This invention is a system that evaluates applicants' aptitude and psychological state through automated interviews. The server receives applicant voice data in real time and converts it to text using speech recognition software, for example the Google Cloud Speech-to-Text API. Furthermore, the server analyzes the converted text data through natural language processing to evaluate emotions and psychological state. For this purpose, an emotion analysis model using the Transformers library is employed.

[0727] The terminal functions as an interface with the user (applicant), and is equipped with a camera and microphone to capture the applicant's video and audio. During the interview, the terminal performs facial expression analysis based on the video data and sends the results to the server. This facial expression analysis uses image processing technology based on OpenCV. This allows the server to comprehensively analyze the data obtained from the audio and images and calculate the applicant's suitability.

[0728] For example, if a security company uses this system to interview new security guards, the interviewer can monitor the applicant's psychological state and confidence in their answers in real time through smart glasses, and ask appropriate follow-up questions. This makes it possible to objectively and efficiently evaluate the applicant's characteristics.

[0729] An example of a prompt for a generative AI model is, "Please tell me what kind of facial expression the applicant has, and whether it indicates stress or confidence." This prompt enables deeper emotional analysis.

[0730] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0731] Step 1:

[0732] The terminal activates its camera and microphone as soon as the user begins preparing for the interview. The user's voice and video data are acquired as input and transmitted to the server in real time. During this process, data compression is performed to optimize transmission to the server.

[0733] Step 2:

[0734] The server converts received audio data into text in real time using speech recognition software. It receives compressed audio data as input and generates text data of applicants' responses as output. In this process, it uses the Google Cloud Speech-to-Text API to convert the audio signal into text format.

[0735] Step 3:

[0736] The server processes the converted text data through a natural language processing model (e.g., a model from the Transformers library) to perform sentiment analysis. The input is text data, and the output is the sentiment evaluation result. It evaluates the applicant's emotional state based on the text content and provides the evaluation result to the interviewer.

[0737] Step 4:

[0738] The terminal analyzes the acquired video data in real time using OpenCV to recognize and analyze the user's facial expressions. It extracts facial features from the input video data and generates numerical data of the user's emotional state based on their facial expressions as output. In this process, it performs image processing to capture subtle changes in facial expressions.

[0739] Step 5:

[0740] The server integrates the results of sentiment analysis based on audio and video data to calculate an overall suitability score. The input is the results of each data analysis, and the output is an indicator of the applicant's suitability score. This process allows for a comprehensive assessment of the applicant's emotions and psychological state, and evaluates their compatibility with the organization.
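
The integration in Step 5 is not specified in detail. A minimal sketch, assuming both analysis results are already normalized to the range 0 to 1 and are combined by a weighted average, is shown below. The function name `suitability_score` and the default weight of 0.6 are hypothetical.

```python
def suitability_score(
    text_sentiment: float,
    facial_score: float,
    text_weight: float = 0.6,
) -> float:
    """Weighted average of the text-based and facial-expression-based scores.

    All arguments are expected in [0, 1]; the result is also in [0, 1].
    """
    for name, value in (
        ("text_sentiment", text_sentiment),
        ("facial_score", facial_score),
        ("text_weight", text_weight),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be within [0, 1]")
    # The facial-expression score receives the remaining weight.
    return text_weight * text_sentiment + (1.0 - text_weight) * facial_score
```

A weighted average keeps the contribution of each modality explicit and adjustable; other fusion methods would fit the description equally well.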

[0741] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0742] The present invention is implemented as an automated interview system incorporating an emotion engine. In addition to the configuration of a server, terminal, and user, the system has the function of recognizing the user's emotions using the emotion engine and integrating them into the analysis data.

[0743] 1. Server Functions and Roles

[0744] The server manages the entire interview process, particularly the analysis and integration of data obtained from the emotion engine. The server processes both audio and image data in real time, utilizing the emotion engine to infer the user's emotional state. Furthermore, it integrates this emotional data into existing analysis algorithms to provide a more comprehensive evaluation of the applicant's skills and suitability.

[0745] 2. Terminal Functions and Roles

[0746] The terminal acts as an interface, connecting the user and the system. It presents the user with questions sent from the server, records the user's responses, and sends them back to the server. It also captures the user's facial expressions through its built-in camera, providing data to the emotion engine in real time.

[0747] 3. Emotion Engine Functions and Roles

[0748] The emotion engine analyzes nonverbal information obtained from the user's voice and facial expressions to identify their emotional state. This engine utilizes facial recognition technology and acoustic analysis to quantify the user's tension, anxiety, excitement, and other emotional states. The identified emotional state is then adjusted so that it does not unduly affect the user's performance evaluation, and is sent to the server.

[0749] 4. User Roles

[0750] Users respond to questions presented via the terminal and participate in the interview in a natural manner. The introduction of an emotion engine includes the user's emotional aspects in the interview evaluation; however, this is intended to complement the accurate assessment of abilities, ensuring that the user's true capabilities are fairly evaluated.

[0751] As a concrete example of the system, the server sends the question "What are your professional strengths?" to the terminal, and the user answers, "I have strong analytical skills." Subsequently, the emotion engine detects that the user is stating this answer with confidence and incorporates this information into the overall evaluation. This creates an interview environment where the user can relax and demonstrate their true worth.

[0752] The following describes the processing flow.

[0753] Step 1:

[0754] The server prepares to begin the interview session, selects the first question from a pre-configured list, and sends it to the terminal.

[0755] Step 2:

[0756] The terminal displays and audibly presents the questions received from the server to the user. It then activates voice input mode to allow the user to answer.

[0757] Step 3:

[0758] The user verbally answers the questions from the terminal. The terminal records the user's voice in high quality and sends the audio data to the server.

[0759] Step 4:

[0760] The server processes the received audio data using a natural language processing engine to convert it into text. During this process, it prepares to analyze the keywords and content of the response.

[0761] Step 5:

[0762] The terminal captures the user's facial expressions with its built-in camera and sends that video data to the emotion engine. This allows for real-time analysis of emotional information.

[0763] Step 6:

[0764] The emotion engine analyzes the user's voice tone and facial expressions to identify their emotional state. This information is used to estimate the user's level of tension and confidence.

[0765] Step 7:

[0766] The server integrates text and sentiment data to calculate a comprehensive score that adjusts for the user's skill fit and the influence of emotions. This allows for an assessment of the applicant's match with the company's needs.

[0767] Step 8:

[0768] The server returns the analysis results to the terminal, which then displays the results to the user as feedback. If necessary, it displays the next interview question and continues the interview process.

[0769] Step 9:

[0770] Once the user's interview is complete, the server compiles all the analysis results and sends an overall evaluation of the applicant to the company. The company then uses this information to make hiring decisions.
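Steps 1 to 9 above can be sketched as a simple loop. The following is a minimal illustration only, not the disclosed implementation: the function names (`keyword_fit`, `combined_score`, `run_interview`), the emotion keys `confidence` and `tension`, the keyword-matching measure of skill fit, and the 0.3 emotion weight are all assumptions introduced for this example. Speech-to-text conversion and emotion analysis are passed in as callables because the specification does not fix their implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class AnswerResult:
    question: str
    transcript: str
    emotion: Dict[str, float]  # hypothetical keys, e.g. "confidence", "tension"
    score: float


def keyword_fit(transcript: str, keywords: List[str]) -> float:
    """Fraction of expected keywords found in the transcript (assumed fit measure)."""
    if not keywords:
        return 0.0
    text = transcript.lower()
    return sum(1 for k in keywords if k.lower() in text) / len(keywords)


def combined_score(fit: float, emotion: Dict[str, float],
                   emotion_weight: float = 0.3) -> float:
    """Blend skill fit with an emotion term (Step 7); the weighting is an assumption."""
    emotion_term = emotion.get("confidence", 0.0) * (1.0 - emotion.get("tension", 0.0))
    return (1.0 - emotion_weight) * fit + emotion_weight * emotion_term


def run_interview(
    questions: List[Tuple[str, List[str]]],
    capture: Callable[[str], Tuple[bytes, bytes]],
    transcribe: Callable[[bytes], str],
    analyze_emotion: Callable[[bytes, bytes], Dict[str, float]],
) -> Tuple[List[AnswerResult], float]:
    results: List[AnswerResult] = []
    for question, keywords in questions:            # Steps 1-2: send and present
        audio, video = capture(question)            # Steps 3 and 5: record answer
        transcript = transcribe(audio)              # Step 4: audio to text
        emotion = analyze_emotion(audio, video)     # Step 6: emotional state
        score = combined_score(keyword_fit(transcript, keywords), emotion)  # Step 7
        results.append(AnswerResult(question, transcript, emotion, score))  # Step 8
    # Step 9: compile an overall evaluation (here, a plain mean of the scores).
    overall = sum(r.score for r in results) / len(results) if results else 0.0
    return results, overall
```

In practice the per-answer results would be returned to the terminal as feedback after each question, and only the compiled evaluation would be sent to the company.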

[0771] (Example 2)

[0772] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0773] Traditional interview methods have limitations in measuring applicants' skills and aptitudes, and it has been particularly difficult to adequately assess emotional aspects. Furthermore, the lack of a means to suggest next steps based on the interview's progress has made it difficult to evaluate candidates properly and to run an efficient interview process.

[0774] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0775] In this invention, the server includes means for conducting automated interviews for the purpose of data acquisition, means for converting voice data into text in real time, means for analyzing the user's emotional state from voice and image data using an emotion engine, and means for integrating the analysis results into a comprehensive evaluation system. This enables a comprehensive evaluation that includes the applicant's emotions and also allows for suggestions for the next interview stage based on their aptitude.

[0776] "Automated interviews" refer to the process of conducting interviews with applicants using mechanical means and collecting data from them.

[0777] "Means of real-time text conversion" refers to technologies or devices that instantly convert audio data into written text.

[0778] An "emotion engine" is an algorithm or system that analyzes and identifies a user's emotional state based on audio and image data.

[0779] "Means for integrating analysis results into a comprehensive evaluation system" refers to a method or apparatus for unifying obtained emotional data and other evaluation data and incorporating them into a comprehensive evaluation.

[0780] "Means for comprehensively evaluating suitability" refers to a process or device that uses analytical data to comprehensively assess an applicant's abilities and suitability.

[0781] This invention relates to a system having an automated interview format. This system consists of a server, a terminal, an emotion engine, and a user. The roles of each component and specific embodiments are described below.

[0782] The server is the core of the system, responsible for preparing interview questions and analyzing data. Specifically, the server selects appropriate questions from the interview question database and sends them to the terminal. It also receives audio and image data sent from the terminal and analyzes the data using an emotion engine. The analysis results are integrated into the comprehensive evaluation system and used to evaluate the user's skills and aptitude from multiple perspectives.

[0783] The terminal functions as an interface connecting the user and the system. When the terminal receives a question from the server, it presents it to the user. It can record the user's responses and capture their facial expressions with its built-in camera. The collected data is transmitted to the server in real time.

[0784] The user answers the questions presented by the terminal. This process allows the user to behave naturally, while the emotion engine captures the user's intonation and facial expressions to identify their emotional state.

[0785] The emotion engine is software that analyzes audio and image data. This engine utilizes facial recognition and acoustic analysis technologies to quantify the user's emotions, such as tension and confidence. This enables accurate emotion analysis, which is then reflected in the user's overall evaluation.

[0786] As a concrete example, consider a scenario where a server sends the question "What are your professional strengths?" to a terminal, and the user responds, "I have strong analytical skills." In this case, the emotion engine analyzes the user's confidence and integrates the results into the overall evaluation system. As a result, a more accurate evaluation is made that includes the user's emotions.

[0787] An example of a prompt message would be, "Please analyze what emotions the user is expressing," which would allow the generative AI model to assist in the analysis.

[0788] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0789] Step 1:

[0790] The server prepares the interview questions. The server selects the most appropriate questions from the interview question database based on the applied position. The input to this process is the applicant's profile information, and the output is the question data sent to the terminal.

[0791] Step 2:

[0792] The terminal presents the user with pre-prepared questions. The terminal displays the received question data in audio or text format to help the user understand the questions. The input is the question data from the server, and the output is the presentation of the questions to the user.

[0793] Step 3:

[0794] The user answers the presented questions by voice through the terminal. The user's facial expressions are also observed. In this step, the input is the questions from the terminal, and the output is the user's voice responses and facial expression information.

[0795] Step 4:

[0796] The terminal records the user's responses and facial expressions. It records the user's voice using its built-in microphone and captures their facial expressions with its camera. This audio and image data is transmitted to the server in real time. The input is the user's voice and facial expressions, and the output is the data transferred to the server.

[0797] Step 5:

[0798] The emotion engine operates on the server and analyzes the received audio and facial expression data. It uses facial expression recognition algorithms and acoustic analysis techniques to quantify the user's emotional state. The input is the audio and image data from the terminal, and the output is data representing the identified emotional state.

[0799] Step 6:

[0800] The server integrates the analysis results into the comprehensive evaluation system. The server takes in the emotional data obtained from the emotion engine along with other evaluation data to form the final evaluation. The input is the output data from the emotion engine, and the output is a comprehensive applicant evaluation.
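Two of the steps above lend themselves to a short illustration: the selection of questions according to the applied position (Step 1) and the integration of emotion data with other evaluation data (Step 6). The sketch below is hedged accordingly: the in-memory `QUESTION_DB`, the `position` profile field, the evaluation item names, and the weighted-mean integration are hypothetical choices made for this example, since the specification does not state how the question database is organized or how the comprehensive evaluation is computed.

```python
from typing import Dict, List

# Hypothetical question database keyed by applied position (assumption).
QUESTION_DB: Dict[str, List[str]] = {
    "engineer": ["Describe a system you designed.",
                 "What are your professional strengths?"],
    "sales": ["Tell us about a deal you closed."],
}
DEFAULT_QUESTIONS: List[str] = ["What are your professional strengths?"]


def select_questions(profile: Dict[str, str], limit: int = 2) -> List[str]:
    """Step 1: pick questions for the applied position, else fall back to defaults."""
    position = profile.get("position", "").lower()
    return QUESTION_DB.get(position, DEFAULT_QUESTIONS)[:limit]


def integrate(emotion: Dict[str, float], other: Dict[str, float],
              weights: Dict[str, float]) -> float:
    """Step 6: weighted mean over emotion values and other evaluation data.

    Items without a weight are ignored; the weighting scheme is an assumption.
    """
    merged = {**other, **emotion}
    total_weight = sum(weights.get(name, 0.0) for name in merged)
    if total_weight == 0:
        return 0.0
    weighted = sum(value * weights.get(name, 0.0) for name, value in merged.items())
    return weighted / total_weight
```

For instance, `integrate({"confidence": 0.8}, {"skill": 0.6}, {"confidence": 1, "skill": 3})` weights the skill evaluation three times as heavily as the emotion value.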

[0801] (Application Example 2)

[0802] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0803] In modern autonomous vehicles, it is difficult to grasp passengers' psychological comfort in real time and provide individualized feedback and adjustments. Furthermore, the inability to provide optimal travel environments and routes that respond to passengers' emotions limits the quality of the travel experience.

[0804] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0805] In this invention, the server includes means for performing automated screening for the purpose of data collection, means for converting acoustic data into text information in real time, means for analyzing audio and image data to infer emotional states, means for adjusting the environment to optimize vehicle control based on the passenger's emotional state, and means for proposing the optimal travel route to the passenger based on the analysis results. This makes it possible to provide a travel environment and route that matches the passenger's psychological state in real time.

[0806] "Data collection" refers to the act of collecting information and data necessary to implement automated review processes.

[0807] "Audio data" refers to digital or analog data that includes audio signals.

[0808] "Textual information" refers to information expressed in text format.

[0809] "Emotional state" is a concept that refers to a person's psychological state or emotional tendencies.

[0810] "Analytical means" refers to methods or devices used to derive specific conclusions or information based on data.

[0811] "Mobile object control" refers to the act of managing the operation and movement of vehicles and other mobile objects.

[0812] "Environmental adjustment means" refers to technologies or devices for adjusting the physical or digital environment to match the psychological state of passengers.

[0813] A "travel route" refers to the optimal path or route from the starting point to the destination.

[0814] "Passenger" refers to a person who travels using an autonomous vehicle or other means of transportation.

[0815] The system that realizes this application consists of multiple components. The server collects and analyzes passenger data by converting acoustic data into text information in real time and performing analysis to identify emotional states. In the emotion analysis, a specific generative AI model is used to infer the passenger's psychological state from audio and image data, and based on the results, optimal vehicle control and environmental adjustments are made.

[0816] The terminal works in conjunction with the server to display the optimal travel route for passengers and adjusts environmental settings as needed. Specifically, it provides feedback and suggestions to passengers through the terminal's display and speaker. This enables appropriate interaction that responds to the passenger's emotions.

[0817] Passengers, as users, can customize their environment and select routes according to their preferences through the terminal's interface. This enables a personalized travel experience.

[0818] As a concrete example, when the server detects a passenger's level of stress, relaxation music could be played on the terminal. This series of actions would allow passengers to enjoy a more comfortable and reassuring travel experience.

[0819] Example of a prompt:

[0820] "The emotion recognition system captures passenger voice and facial expression data to determine whether the passenger is relaxed or stressed. Based on that determination, please suggest appropriate adjustments to the in-car environment."
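As an illustration of how such a prompt might be combined with inference data before being passed to the generative AI model, the following sketch assembles a single text prompt. The function name `build_environment_prompt`, its parameters, and the layout of the appended data are assumptions made for this example; the specification gives only the instruction text itself.

```python
from typing import Dict


def build_environment_prompt(transcript: str, expression: str,
                             cabin: Dict[str, str]) -> str:
    """Combine the instruction with inference data into one text prompt.

    The instruction is the prompt quoted in the specification; the data
    sections appended after it are a hypothetical layout.
    """
    instruction = (
        "The emotion recognition system captures passenger voice and facial "
        "expression data to determine whether the passenger is relaxed or "
        "stressed. Based on that determination, please suggest appropriate "
        "adjustments to the in-car environment."
    )
    # Sort the settings so the prompt is deterministic for identical inputs.
    settings = "\n".join(f"- {name}: {value}" for name, value in sorted(cabin.items()))
    return (
        f"{instruction}\n\n"
        f"Passenger speech (transcribed): {transcript}\n"
        f"Facial expression (summarized): {expression}\n"
        f"Current cabin settings:\n{settings}"
    )
```

The returned string would then be sent to whichever generative AI model the system uses; that call is deliberately omitted here because no particular model interface is specified.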

[0821] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0822] Step 1:

[0823] The server receives passenger audio and image data from the terminal. This input data is collected via microphone and camera and sent to the server in real time. Once the data reaches the server, it is processed using a generative AI model to convert the audio into text information. This calculation outputs useful textual information from the acoustic data.

[0824] Step 2:

[0825] The server uses a generative AI model to analyze passengers' emotional states from audio and image data. Based on the input audio and image data, the server uses an analysis algorithm to quantify passengers' emotions and measure their levels of tension and relaxation. This analysis result is generated as the server's output.

[0826] Step 3:

[0827] The terminal receives emotion analysis results from the server and adjusts the environment based on the passenger's psychological state. Based on the analysis results received as input, the terminal's control program operates, for example, activating a speaker to play relaxation music. It also displays emotionally appropriate suggestions on the display.

[0828] Step 4:

[0829] Passengers, as users, can review the feedback from the terminal and make choices to adjust their travel experience. By listening to the relaxation music and viewing the route suggestions output by the terminal, passengers can feel more at ease. When users re-enter their preferences, further adjustments can be made based on the new data.

[0830] Step 5:

[0831] The server receives the new user input and repeats the preceding steps, continuously adjusting the environment and route accordingly. This cycle keeps the travel experience optimized as the passenger's state changes.
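The cycle of Steps 3 to 5 can be sketched as a small control loop. This is an illustration under stated assumptions rather than the disclosed control program: the `CabinState` fields, the `stress` key, the 0.6 threshold, and the suggestion text are hypothetical, and the emotion values are assumed to arrive already quantified by the server (Steps 1 and 2).

```python
from dataclasses import dataclass, replace
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class CabinState:
    music: str = "off"
    route: str = "fastest"
    suggestions: Tuple[str, ...] = ()


def adjust_environment(state: CabinState, emotion: Dict[str, float],
                       stress_threshold: float = 0.6) -> CabinState:
    """Step 3: react to the analysis result (threshold and actions are assumptions)."""
    if emotion.get("stress", 0.0) >= stress_threshold:
        return replace(state, music="relaxation",
                       suggestions=("Switch to a quieter route?",))
    return replace(state, suggestions=())


def apply_preferences(state: CabinState, preferences: Dict[str, str]) -> CabinState:
    """Step 4: the passenger's own choices override the automatic adjustment."""
    allowed = {k: v for k, v in preferences.items() if k in ("music", "route")}
    return replace(state, **allowed)


def run_cycle(state: CabinState,
              readings: Iterable[Tuple[Dict[str, float], Dict[str, str]]]) -> CabinState:
    """Step 5: repeat adjustment and preference handling for each new reading."""
    for emotion, preferences in readings:
        state = apply_preferences(adjust_environment(state, emotion), preferences)
    return state
```

Applying passenger preferences after the automatic adjustment is a design choice of this sketch, reflecting the statement that passengers can customize the environment and route themselves.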

[0832] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input in response to the result of the specific processing. The control unit 46A transmits the audio data indicating the user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0833] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. Prompts containing instructions are input to the data generation model 58, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0834] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.

[0835] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping, namely an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.

[0836] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions located there. Toward the outside of the concentric circles are emotions representing states and actions arising from mental states. Emotion here is a concept that includes feelings and mental states. On the left side of the concentric circles are emotions that are generally generated from reactions occurring in the brain. On the right side of the concentric circles are emotions that are generally induced by situational judgment. At the top and bottom of the concentric circles are emotions that are generally both generated from reactions occurring in the brain and induced by situational judgment. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.

[0837] These emotions are distributed at the 3 o'clock position on the emotion map 400, and usually fluctuate between feelings of security and anxiety. In the right half of the emotion map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.

[0838] The inside of the emotion map 400 represents inner thoughts, while the outside represents actions. Therefore, the further toward the outside of the emotion map 400 an emotion is located, the more visible it becomes (that is, the more it is expressed in actions).

[0839] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https://ci.nii.ac.jp/naid/500000375379). The left half of the emotion map contains emotions belonging to a region called "response," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.
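The balance-based view of pleasure and discomfort described above can be illustrated with a small calculation. The following is only one possible reading: the function name `valence`, the normalization by a per-balance tolerance, and the mapping onto the range -1 to 1 are assumptions introduced for this example and are not taken from the specification or from the cited dissertation.

```python
from typing import Dict


def valence(readings: Dict[str, float], ideals: Dict[str, float],
            tolerances: Dict[str, float]) -> float:
    """Return a value in [-1, 1]: near +1 when every balance is at its ideal
    (pleasure), near -1 when balances deviate by a full tolerance or more
    (discomfort). The linear form is an assumption made for illustration.
    """
    if not readings:
        return 0.0
    deviations = [
        min(abs(readings[name] - ideals[name]) / tolerances[name], 1.0)
        for name in readings
    ]
    mean_deviation = sum(deviations) / len(deviations)
    return 1.0 - 2.0 * mean_deviation
```

For a robot, `readings` might hold hypothetical balances such as `{"battery": 0.9, "tilt": 0.0}`, with `ideals` and `tolerances` supplied by the designer.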

[0840] The emotion map defines two emotions that promote learning. One is the emotion around the middle of the negative "repentance" and "reflection" on the situation side. In other words, it is when the robot experiences negative emotions such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the emotion around the positive "desire" on the reaction side. In other words, it is when the robot has positive feelings such as "I want more" or "I want to know more."

[0841] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
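One way to obtain training targets in which emotions located close together have similar values is to let the target value decay with distance on the map. The sketch below illustrates that idea only; the coordinates in `EMOTION_POSITIONS`, the Gaussian decay, and the width `sigma` are hypothetical and are not taken from Figure 9 or Figure 10, and the neural network itself is not shown.

```python
import math
from typing import Dict, Tuple

# Hypothetical (x, y) positions on an emotion map; not taken from the figures.
EMOTION_POSITIONS: Dict[str, Tuple[float, float]] = {
    "reassured": (0.5, 0.1),
    "calm": (0.6, 0.0),
    "confident": (0.55, 0.2),
    "anxious": (0.5, -0.6),
}


def soft_targets(label: str, sigma: float = 0.3) -> Dict[str, float]:
    """Emotion values for one labeled sample: 1.0 at the labeled emotion,
    decaying with squared map distance so that neighbors get similar values."""
    cx, cy = EMOTION_POSITIONS[label]
    return {
        name: math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
        for name, (x, y) in EMOTION_POSITIONS.items()
    }


def dominant_emotion(values: Dict[str, float]) -> str:
    """Determine the emotion from a set of emotion values (largest value wins)."""
    return max(values, key=lambda name: values[name])
```

With these assumed positions, a sample labeled "calm" yields high target values for "reassured" and "confident" as well, and a low value for "anxious".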

[0842] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.

[0843] In the above embodiment, an example was given in which a specific process is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and a distributed processing of the specific process may be performed by multiple computers, including computer 22. For example, a data generation model 58 may be provided in an external device of the data processing device 12, and the external device may generate data according to the input data.

[0844] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes the specific processing according to the specific processing program 56.

[0845] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.

[0846] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.

[0847] The following types of processors can be used as hardware resources to perform specific processing. Examples of processors include a CPU, a general-purpose processor that functions as a hardware resource to perform specific processing by executing software, i.e., a program. Other examples of processors include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which have circuit configurations specifically designed to perform specific processing. All of these processors have built-in or connected memory, and all of them perform specific processing by using memory.

[0848] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.

[0849] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.

[0850] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.

[0851] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of those aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that unnecessary parts may be deleted, new elements may be added, or elements may be replaced in the descriptions and illustrations presented above, as long as this does not deviate from the essence of the technical aspects of this disclosure. Furthermore, in order to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge and the like that do not require special explanation to enable the implementation of the technical aspects of this disclosure have been omitted from the descriptions and illustrations presented above.

[0852] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted to be incorporated by reference.

[0853] The following is further disclosed regarding the embodiments described above.

[0854] (Claim 1)

[0855] A means of conducting automated interviews for the purpose of data acquisition,

[0856] A means of converting audio data to text in real time,

[0857] means for analyzing audio and image data to infer emotional states,

[0858] A means for calculating the degree of fit between the applicant and the organization based on the analysis results,

[0859] A system that includes this.

[0860] (Claim 2)

[0861] The system according to claim 1, comprising means for analyzing interview results and providing feedback to interview participants.

[0862] (Claim 3)

[0863] The system according to claim 1, comprising means for proposing the next interview stage based on the calculation of the degree of fit.

[0864] "Example 1"

[0865] (Claim 1)

[0866] A means of dynamically creating personalized questions based on generated information,

[0867] A means of sending the question set to the terminal in real time,

[0868] A means for recording the user's response as audio information and transmitting said information to a server,

[0869] A means for converting audio information into text information using an automatic conversion means,

[0870] A means of analyzing textual information converted using natural language processing technology and evaluating emotional states,

[0871] A means of evaluating the suitability of applicants to the organization based on the analyzed information,

[0872] A system that includes this.

[0873] (Claim 2)

[0874] The system according to claim 1, comprising means for analyzing interview results and providing evaluation results to the interview participants.

[0875] (Claim 3)

[0876] The system according to claim 1, comprising means for determining the next selection stage based on the evaluation of suitability.

[0877] "Application Example 1"

[0878] (Claim 1)

[0879] A means of conducting automated interviews for the purpose of data acquisition,

[0880] A means of converting audio data to text in real time,

[0881] means for analyzing audio and image data to infer emotional states,

[0882] A means for calculating the degree of fit between the applicant and the organization based on the analysis results,

[0883] A means of performing facial expression analysis using video equipment for the purpose of data visualization,

[0884] A method for evaluating the psychological state of applicants by performing voice analysis,

[0885] A system that includes this.

[0886] (Claim 2)

[0887] The system according to claim 1, comprising means for analyzing interview results and providing feedback to interview participants.

[0888] (Claim 3)

[0889] The system according to claim 1, comprising means for proposing the next interview stage based on the calculation of the degree of fit.

[0890] "Example 2 of combining an emotion engine"

[0891] (Claim 1)

[0892] A means of conducting automated interviews for the purpose of data acquisition,

[0893] A means of converting audio data to text in real time,

[0894] A means for analyzing a user's emotional state from audio and image data using an emotion engine,

[0895] A means of integrating the analysis results into a comprehensive evaluation system,

[0896] In order to comprehensively evaluate the suitability of applicants, a means of incorporating analytical data into existing evaluation methods,

[0897] A system that includes this.

[0898] (Claim 2)

[0899] The system according to claim 1, comprising means for analyzing interview results and providing feedback to interview participants.

[0900] (Claim 3)

[0901] The system according to claim 1, comprising means for proposing the next interview stage based on the calculation of the degree of fit.

[0902] "Application example 2 when combining with an emotional engine"

[0903] (Claim 1)

[0904] A means of conducting an automated review for the purpose of collecting data,

[0905] A means of converting acoustic data into text information in real time,

[0906] means for analyzing audio and image data to infer emotional states,

[0907] An environmental adjustment means for optimizing vehicle control based on the emotional state of passengers,

[0908] A means for proposing the optimal travel route for passengers based on the analysis results,

[0909] A system that includes this.

[0910] (Claim 2)

[0911] The system according to claim 1, comprising means for providing suggestions to improve the travel experience based on the results of emotion analysis.

[0912] (Claim 3)

[0913] The system according to claim 1, comprising means for optimizing the travel path according to the emotional state.

[Explanation of Symbols]

[0914]
10, 210, 310, 410 Data Processing Systems
12 Data Processing Device
14 Smart Device
214 Smart Glasses
314 Headset-type Terminal
414 Robot

Claims

1. A means of conducting automated interviews for the purpose of data acquisition, A means of converting audio data to text in real time, means for analyzing audio and image data to infer emotional states, A means for calculating the degree of fit between the applicant and the organization based on the analysis results, A means of performing facial expression analysis using video equipment for the purpose of data visualization, A method for evaluating the psychological state of applicants by performing voice analysis, A system that includes this.

2. The system according to claim 1, comprising means for analyzing interview results and providing feedback to interview participants.

3. The system according to claim 1, comprising means for proposing the next interview stage based on the calculation of the degree of fit.