system
The system automates presentation evaluation using speech and image recognition, coupled with generative AI, to provide comprehensive feedback, enhancing presentation quality without human intervention.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Applications
- Current Assignee / Owner
- SOFTBANK GROUP CORP
- Filing Date
- 2024-12-16
- Publication Date
- 2026-06-26
AI Technical Summary
Existing methods for improving presentation quality require human cooperation and are limited in providing multifaceted feedback, making it difficult to prepare high-quality presentations efficiently.
A system that utilizes speech recognition, natural language processing, and generative artificial intelligence to analyze presentation materials and audio information, providing automated feedback for improvement without human intervention.
Enables users to independently enhance the quality of their presentations by receiving detailed and actionable feedback on speaking style and material design, reducing the need for external assistance.
Smart Images

Figure 2026105315000001_ABST
Abstract
Description
Technical Field
[0001] The technology of the present disclosure relates to a system.
Background Art
[0002] Patent Document 1 discloses a method for controlling a persona chatbot, which is performed by at least one processor, including steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a chatbot character, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance in response to the user utterance.
Prior Art Documents
Patent Documents
[0003]
Patent Document 1
Summary of the Invention
Problems to be Solved by the Invention
[0004] Conventionally, feedback for improving the quality of a presentation requires cooperation from others, and there is a problem that it is difficult to request such cooperation in a busy environment. In addition, since means for obtaining multi-faceted feedback on the way of speaking and the slide composition are limited, there is a problem that it is difficult to prepare a high-quality presentation.
Means for Solving the Problems
[0005] This invention provides a means for receiving presentation materials and audio information from a user and converting them into text data using speech recognition technology. Subsequently, natural language processing technology and image recognition technology are used to analyze the text data and presentation materials to evaluate the characteristics of the speaking style and the structure of the presentation materials. Furthermore, by generating feedback based on the analysis results using generative artificial intelligence and presenting it to the user, the invention provides a system that offers multifaceted improvement suggestions for presentations without requiring human intervention.
[0006] "User" refers to the entity that inputs presentation materials and audio information.
[0007] "Presentation materials" refers to digital files such as slides and text used during a presentation.
[0008] "Audio information" refers to audio data of presentations recorded by the user.
[0009] "Means of receiving" refers to the function for incorporating presentation materials and audio information provided by users into the system.
[0010] "Speech recognition technology" refers to technology that analyzes speech information and converts it into text data.
[0011] "Text data" refers to the written information of the presentation content converted using speech recognition technology.
[0012] "Means of analysis" refers to functions for evaluating text data and presentation materials and extracting information about speaking style and structure.
[0013] "Means of evaluation" refers to the process of analyzing the quality of a presentation based on past data and criteria.
[0014] "Generative artificial intelligence" refers to AI technology that generates proposals and feedback in a way similar to human thinking and experience based on the analyzed information.
[0015] "Presentation means" refers to a function for providing the generated feedback in a form that can be confirmed by the user.
Brief Explanation of Drawings
[0016] [Figure 1] It is a conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment. [Figure 2] It is a conceptual diagram showing an example of the main functions of a data processing device and a smart device according to the first embodiment. [Figure 3] It is a conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment. [Figure 4] It is a conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment. [Figure 5] It is a conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment. [Figure 6] It is a conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment. [Figure 7] It is a conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment. [Figure 8] It is a conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment. [Figure 9] It shows an emotion map to which multiple emotions are mapped. [Figure 10] It shows an emotion map to which multiple emotions are mapped. [Figure 11] It is a sequence diagram showing the processing flow of the data processing system in Example 1. [Figure 12] It is a sequence diagram showing the processing flow of the data processing system in Application Example 1. [Figure 13]It is a sequence diagram showing the processing flow of the data processing system in Embodiment 2 when the emotion engine is combined. [Figure 14] It is a sequence diagram showing the processing flow of the data processing system in Application Example 2 when the emotion engine is combined.
Mode for Carrying Out the Invention
[0017] Hereinafter, an example of an embodiment of the system according to the technology of the present disclosure will be described with reference to the accompanying drawings.
[0018] First, the terms used in the following description will be explained.
[0019] In the following embodiments, the numbered processor (hereinafter simply referred to as "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. Also, the processor may be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), an APU (Accelerated Processing Unit), and the like.
[0020] In the following embodiments, the numbered RAM (Random Access Memory) is a memory in which information is temporarily stored and is used as a work memory by the processor.
[0021] In the following embodiments, the numbered storage is one or more non-volatile storage devices that store various programs and various parameters, etc. Examples of non-volatile storage devices include flash memory (SSD (Solid State Drive)), magnetic disks (e.g., hard disks), or magnetic tapes, etc.
[0022] In the following embodiments, the signed communication interface (I / F) is an interface that includes a communication processor and an antenna, etc. The communication interface manages communication between multiple computers. Examples of communication standards applicable to the communication interface include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), or Bluetooth (registered trademark).
[0023] In the following embodiments, "A and / or B" is synonymous with "at least one of A and B." That is, "A and / or B" means that it may be A alone, or B alone, or a combination of A and B. Furthermore, in this specification, the same concept as "A and / or B" applies when expressing three or more things linked by "and / or."
[0024] [First Embodiment]
[0025] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.
[0026] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.
[0027] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0028] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.
[0029] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.
[0030] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and / or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.
[0031] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.
[0032] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.
[0033] As shown in Figure 2, in the data processing device 12, a specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.
[0034] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the identification processing unit 290.
[0035] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used in conjunction with a specific processing program 56 by the data processing system 10. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0036] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".
[0037] This invention is a system that automates the proofreading of presentations, providing a method for users to efficiently prepare high-quality presentations. The system starts operating when the user sends their presentation materials and related audio information to a server via a terminal.
[0038] Users install a dedicated application on their device and upload presentation materials (e.g., in PDF or PowerPoint format) to the server through that application. Furthermore, they use their device's microphone to record audio in a simulated presentation format and import the audio file into the system.
[0039] The server converts the received audio information into text data using speech recognition technology. This transcribed information from the audio data forms the basis for organizing the flow of the presentation into a document and analyzing the characteristics of the speaking style. The speech recognition process utilizes natural language processing techniques to identify intonation, speed, and repeated words.
[0040] Meanwhile, the server applies image recognition technology to the uploaded presentation materials, analyzing the visual elements within the slides in detail. In particular, it evaluates the entire material based on criteria such as text font size, color coordination, and image resolution. This allows for an assessment of whether the material is effective and consistent.
[0041] Subsequently, the server uses generative artificial intelligence to generate useful feedback for improvement from the analysis results of the audio and materials. The feedback is based on best practices for general presentations and suggests specific areas for correction and improvement to the user.
[0042] Ultimately, the device displays the feedback sent from the server in a user-friendly format. This feedback includes detailed explanations of areas for improvement and specific ways to enhance the presentation. This allows users to independently improve the quality of their presentations without needing help from others.
[0043] This system configuration improves the efficiency of presentation preparation and reduces the burden on users.
[0044] The following describes the processing flow.
[0045] Step 1:
[0046] The user launches a dedicated application on their device, selects a presentation file, and uploads it to the server. Furthermore, they record the presentation's audio using their device's microphone and upload that audio file to the system.
[0047] Step 2:
[0048] The server passes the uploaded audio file to the speech recognition engine, which converts the audio data into text data. This process uses a specific, dedicated dictionary to accurately translate technical terms and custom phrases.
[0049] Step 3:
[0050] The server analyzes the text data obtained through speech recognition and analyzes the characteristics of speech. In particular, it compares the speed, volume, and intonation of the speech with standard speech models and generates feedback.
[0051] Step 4:
[0052] The server analyzes uploaded presentation materials using image recognition technology. It evaluates the design elements of each slide, such as fonts, layout, and color contrast, to determine visual consistency and effectiveness.
[0053] Step 5:
[0054] The server integrates the analysis results and uses generative artificial intelligence to automatically generate feedback on improving the quality of the audio and materials. This feedback includes specific correction suggestions and detailed points for improvement.
[0055] Step 6:
[0056] The server sends the generated feedback report to the terminal.
[0057] Step 7:
[0058] The device displays received feedback to the user in a visually easy-to-understand format. The user can use this information to improve their presentation.
[0059] (Example 1)
[0060] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."
[0061] There is a need for a system that automatically evaluates the quality of presentation materials and content, and provides users with specific improvement suggestions. Existing methods do not sufficiently automate the evaluation of material design or the analysis of presentation style, resulting in the challenge that users must spend considerable time and effort to improve the quality of their presentations.
[0062] The identification process performed by the identification processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0063] In this invention, the server includes means for receiving material data and acoustic information from users, means for converting the acoustic information into text information using speech recognition technology, and means for analyzing the text information and material data to evaluate the characteristics of speech and the structure of the material data. This enables users to quickly and efficiently obtain useful feedback to improve the quality of their presentations.
[0064] A "user" is someone who uses the system to provide material data and audio information and to receive feedback.
[0065] "Document data" refers to files containing information in document or slide format used in presentations.
[0066] "Acoustic information" refers to the audio data spoken during a presentation, and is the data that is converted into text information.
[0067] "Speech recognition technology" is a technical method used to convert acoustic information into textual information.
[0068] "Textual information" refers to text data extracted from acoustic information using speech recognition technology.
[0069] "Speech characteristics" refer to the distinctive features of a speaker's delivery during a presentation, and specific examples include vocal intonation, speaking speed, and repeated phrases.
[0070] "Generative artificial intelligence technology" refers to artificial intelligence technology used to automatically generate feedback and suggestions based on analysis results.
[0071] "Improvement suggestions" refer to feedback provided to users that indicates specific areas and methods for improving presentation materials and speaking style.
[0072] "Design elements" refer to the visual components included in document data, such as fonts, colors, and image placement.
[0073] "Image recognition technology" is a technical method used to extract and evaluate design elements from document data.
[0074] A "server" refers to a central information processing unit that handles all data processing, analysis, and feedback generation.
[0075] This invention is a system that automatically analyzes presentation materials and audio information and provides improvement suggestions to the user. The user installs a dedicated application on their device and uses this application to prepare presentation materials (PDF or slide format). The user then uses the device's microphone function to conduct a mock presentation and records the audio. The device then transmits this material data and audio information to a server.
[0076] The server converts acoustic information into text information using speech recognition technology. This process utilizes commonly used speech recognition APIs and libraries (e.g., open-source speech recognition software). Next, the server analyzes the text information using natural language processing technology to extract speech characteristics. Specifically, analysis libraries such as NLTK and spaCy may be used.
[0077] Meanwhile, the server uses image recognition technology to extract design elements from the document data and compares them to standard design guidelines. Image analysis tools such as OpenCV and Tesseract OCR are used. This evaluates the consistency of the document's fonts, colors, and constituent elements.
[0078] Next, the server uses generative artificial intelligence technology to generate improvement suggestions based on the analysis results. Generative models such as GPT-4 (registered trademark) and BERT are used here. The generated feedback provides users with specific improvement methods and suggestions.
[0079] Ultimately, the terminal presents the improvement suggestions received from the server to the user. This presentation is in a format that the user can easily understand. For example, the improvements may be listed in bullet points or displayed as an infographic that is easy to understand visually.
[0080] For example, if a user submits a "presentation on corporate strategy," the system analyzes the audio data to determine that the user is "speaking too fast" and generates a suggestion that the "font size should be increased" for the slide design.
[0081] An example of a prompt message might be, "What are the key points to emphasize in the introduction of the presentation?" Through this example, the system can provide the user with objective and specific directions for improvement.
[0082] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0083] Step 1:
[0084] The user prepares presentation materials on their device and uploads them to the server using a dedicated application. The user also uses the device's microphone to perform a mock presentation and record the audio. The input consists of presentation materials (PDF or slide format) and audio information (audio file), and the output is the transmission of this data to the server.
[0085] Step 2:
[0086] The server converts received acoustic information into text information using speech recognition technology. Specifically, it analyzes the audio waveform using a speech recognition API and generates text data. The input is acoustic information, and the output is text information.
[0087] Step 3:
[0088] The server analyzes textual information using natural language processing techniques to extract speech characteristics. This process uses libraries such as NLTK and spaCy to analyze text data, identifying, for example, word repetitions and intonation tendencies. The input is textual information, and the output is data related to speech characteristics.
[0089] Step 4:
[0090] The server uses image recognition technology on uploaded presentation materials. Using OpenCV or Tesseract OCR, it extracts design elements from the materials and compares them to standard design criteria. The input is the presentation material, and the output is evaluation information of the design elements.
[0091] Step 5:
[0092] The server uses generative artificial intelligence technology to generate improvement suggestions based on evaluation information of speech characteristics and design elements. This process utilizes models such as GPT-4 and BERT to construct specific feedback. The input is evaluation data of speech and design, and the output is improvement suggestions.
[0093] Step 6:
[0094] The terminal presents improvement suggestions received from the server to the user. These suggestions are displayed in a format that the user can easily understand and implement. For example, this could include presenting improvements as a bulleted list or as a visually organized infographic. The input is the improvement suggestions, and the output is the feedback display to the user.
[0095] (Application Example 1)
[0096] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."
[0097] In today's commercial environment, there is a need for technological means to improve the quality of presentations that store staff give to customers and to perform their duties efficiently. However, traditional methods result in staff being unable to prepare presentations effectively, leading to a lack of appeal to customers. To solve this problem, a system is needed that automates the editing of presentations and provides expert feedback.
[0098] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0099] In this invention, the server includes means for receiving presentation materials and audio information from a user using an information terminal, means for converting the audio information into text data using speech recognition technology, means for generating feedback based on the analysis results using generative artificial intelligence technology, and means for providing improvement instructions to the user, assuming store operations. This makes it possible to improve the quality of presentations that store staff give to customers.
[0100] An "information terminal" is a device used by users to input presentation materials and audio information, and specifically refers to smartphones, tablets, and similar devices.
[0101] "Speech recognition technology" is a technology that converts speech information into text data that a computer can understand.
[0102] "Text data" refers to character information converted by speech recognition technology and is used to analyze the characteristics of the user's speech.
[0103] "Generative artificial intelligence technology" is a field of artificial intelligence that performs inferences and makes suggestions based on analyzed data, and is used to generate feedback for users.
[0104] "Feedback" refers to information that includes evaluations and suggestions for improvement regarding the content of presentation materials and audio data provided by users.
[0105] "Improvement instructions for users based on store operations" refers to a means of providing specific suggestions and feedback aimed at improving the quality of presentations, with store operations in mind.
[0106] "Users" refers to the store staff giving the presentation and their associates.
[0107] To implement this invention, first, a smartphone or tablet must be prepared as an information terminal, and the user must transmit presentation materials and audio information to the server via that terminal. A dedicated application must be installed on the terminal, and the user uploads the presentation materials using that application.
[0108] The server converts the received audio information into text data using speech recognition technology. Specifically, it uses Google's speech_recognition library to convert the audio data into reliable text data, and then performs analysis using natural language processing technology.
[0109] Furthermore, the server applies image recognition technology to the presentation materials to evaluate them. It uses the python-pptx library to extract visual elements from the slides and checks whether their design conforms to standard guidelines.
[0110] Based on the analysis results of the text data and materials generated by the server, specific feedback is generated using generative artificial intelligence technology. The OpenAI® API can be used for this process. The AI evaluates the presentation content, points out areas for improvement to the user, and provides helpful instructions tailored to the user's store operations.
[0111] Ultimately, the terminal displays feedback sent from the server. This feedback is presented in a user-friendly format and designed to be immediately applicable to presentations in real-world business situations.
[0112] An example of a prompt could be: "Please provide suggestions for improvement and specific advice regarding the content of the following presentation. The goal of the presentation is to effectively communicate the benefits of the new product." This allows the generative AI model to provide feedback that aligns with the user's needs.
[0113] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0114] Step 1:
[0115] Users use their information terminals to send presentation materials and audio information to the server via a dedicated application. Input data consists of PDF and PowerPoint documents and audio data, which are uploaded to the server for processing.
[0116] Step 2:
[0117] The server converts uploaded audio data into text data using the speech_recognition library. The input is audio data, and the output is the converted text data. Data processing involves analyzing the audio waveform and converting it into a string using a language model.
[0118] Step 3:
[0119] The server analyzes the converted text data using natural language processing techniques to extract characteristics of the user's speech and specific recurring phrases. The input is text data, and the output is the result of the speech analysis. This process analyzes word frequency and structure.
[0120] Step 4:
[0121] The server uses the python-pptx library to perform image recognition on presentation materials, extracting and evaluating design elements. The input is the presentation material, and the output is the design evaluation result. In particular, the entire material is evaluated based on criteria such as font size and color coordination.
[0122] Step 5:
[0123] The server uses generative artificial intelligence technology to generate feedback based on text analysis results and design evaluation results. The input is the analysis and evaluation results, and the output is feedback information for improvement. The generative AI model develops useful improvement suggestions based on historical data and best practices.
[0124] Step 6:
[0125] The terminal visually presents feedback generated from the server to the user. The input is feedback information, and the output is specific improvement suggestions displayed on the user's screen. The terminal uses visual elements such as text and graphs to convey information in an easily understandable way for the user.
[0126] An example of a prompt sentence to be input into the generating AI model is: "Please provide suggestions for improvement and specific advice regarding the content of the following presentation. The goal of the presentation is to effectively communicate the benefits of the new product."
[0127] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the identification processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform identification processing using the user's emotions.
[0128] This invention is an automated system for improving the quality of user presentations. In addition to analyzing user voice information and presentation materials and generating feedback, it incorporates an emotion engine to evaluate the user's emotional expression.
[0129] Users launch a dedicated application on their device and upload their presentation materials to the server. They also use their device's microphone and camera to record the presentation's audio and video, and send this recording to the server. This data serves as crucial information for communicating the presentation's preparation status to the system.
[0130] The server first uses speech recognition technology to convert the audio data into text data. This text documents the content of the presentation and forms the basis for analyzing speaking speed, repetition, and key phrases. Simultaneously, the audio data is input into an emotion engine, which estimates emotions from the intonation and tone of the user's voice and analyzes the emotional impact of the presentation.
[0131] Next, the server uses image recognition technology to analyze the uploaded presentation materials and extract design elements. It evaluates the font size, color contrast, and layout consistency of each slide to determine the overall visual quality of the material. In addition, by analyzing video data, it uses an emotion engine to evaluate the user's facial expressions and gestures during the presentation and provides feedback based on emotional expression.
[0132] All generated analysis results are integrated by generative artificial intelligence and compiled into concrete feedback. This feedback includes improvements based on audio information, document structure, and emotional expression, and presents a specific action plan for the user.
[0133] The terminal displays feedback reports sent from the server on its user interface. Through this detailed feedback, users can improve the overall quality of their presentations, including not only the content but also the presentation style and emotional expression.
[0134] This system allows users to objectively evaluate their own performance and prepare themselves to approach presentations with confidence.
[0135] The following describes the processing flow.
[0136] Step 1:
[0137] The user launches a dedicated application on their device, selects a presentation file, and uploads it to the server. They also use their device's microphone and camera to record the audio and video of the presentation. Finally, they send the recorded audio and video data to the server.
[0138] Step 2:
[0139] When the server receives audio data, it uses a speech recognition engine to convert the audio into text data. This text data is used as foundational data for analyzing the content of the presentation.
[0140] Step 3:
[0141] The server inputs voice data into an emotion engine and recognizes emotions from the characteristics of the user's voice. Specifically, it analyzes the tone, pitch, and speed of the voice to estimate the expressed emotional state.
[0142] Step 4:
[0143] The server applies image recognition technology to the uploaded presentation materials to extract design elements from the slides. It analyzes font size, color usage, image placement, etc., to evaluate the visual completeness of the materials.
[0144] Step 5:
[0145] The server analyzes video data and evaluates the user's facial expressions and gestures using an emotion engine. It identifies nonverbal emotional expressions derived from the video and prepares feedback on the emotional aspects of the presentation.
[0146] Step 6:
[0147] The server uses generative artificial intelligence to combine text data, presentation analysis results, and sentiment analysis results from audio and video to generate comprehensive feedback. This feedback details areas for improvement and suggestions for revising the presentation.
[0148] Step 7:
[0149] The device receives feedback from the server and displays it in the user interface. The user reviews this feedback and uses it to improve their presentation.
[0150] (Example 2)
[0151] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".
[0152] Traditional presentation evaluation methods often rely on subjective evaluations by the users themselves, making objective assessment difficult. Furthermore, the lack of comprehensive feedback on the impact of vocal intonation and visual design on the audience made it challenging to improve the overall quality of presentations.
[0153] The identification process performed by the identification processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0154] In this invention, the server is a device for acquiring information from a user, and includes means for acquiring audio and video information using a microphone and a camera, means for converting the audio information into text information using a speech recognition algorithm, and means for analyzing the text information and video information to evaluate the characteristics of the speaking style and the composition of the visual materials. This enables a comprehensive evaluation and feedback of speaking style, emotions, and visual information in a presentation.
[0155] A "user" refers to an individual or organization that uses the system to receive feedback on their presentation.
[0156] "Device for acquiring information" refers to equipment equipped with a microphone and camera for collecting audio and video information from users.
[0157] "Audio information" refers to data related to the user's voice and spoken language during a presentation.
[0158] "Video information" refers to data about the user's appearance, facial expressions, and gestures during the presentation.
[0159] A "speech recognition algorithm" refers to a method of analysis used to convert speech data into text data.
[0160] "Textual information" refers to text data converted from audio information by a speech recognition algorithm.
[0161] "Analysis" refers to the process of evaluating textual and visual information to understand the speaking style and material structure of a presentation.
[0162] "Speaking style characteristics" refers to the characteristics of the user's voice during a presentation, such as speed and intonation.
[0163] "Visual material composition" refers to the layout and design elements of slides and graphics used in a presentation.
[0164] "Evaluation" refers to the act of judging the quality of a presentation based on data obtained through analysis.
[0165] A "generative knowledge processing device" refers to a device that includes artificial intelligence used to generate evaluation information based on analysis results.
[0166] "Evaluation information" refers to information provided by generative knowledge processing devices, including feedback on areas for improvement and effectiveness of presentations.
[0167] A "device for displaying information to the user" refers to equipment equipped with a screen or display for visually presenting the generated evaluation information.
[0168] This invention provides a system that allows users to efficiently evaluate their own presentations and clearly identify areas for improvement. Users install a dedicated application on their device and use this application to prepare their presentation materials. The device is equipped with a microphone and camera, which can be used to collect audio and video information from the presentation.
[0169] When a user begins a presentation, the device collects audio and video data and sends it to the server. The server uses a speech recognition algorithm to convert the audio information into text. A commonly used speech recognition library can be used for this purpose. The converted text information then becomes data for further analysis of the speaker's characteristics.
[0170] In parallel, video information is analyzed using image processing technology to extract the user's facial expressions and gestures during the presentation. This allows for an evaluation of how the composition of the visual materials and the user's gestures visually and emotionally influence the presentation.
[0171] A generative knowledge processing device integrates this data to generate specific evaluation information. This evaluation information includes areas for improvement in the auditory, visual, and emotional aspects of the presentation. Users can visually review the evaluation information on their device's display and use it to improve their presentations.
[0172] For example, if a user provides feedback such as "the audio is a little monotonous" or "the text on the slides is hard to read," the tempo can be adjusted or the slide design can be improved.
[0173] An example of a prompt is, "Analyze the audio, slides, and video, and generate feedback based on them."
[0174] This system allows users to objectively evaluate their own presentations and efficiently improve them.
[0175] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0176] Step 1:
[0177] The user launches a dedicated application on their device and prepares their presentation materials. The user uses the device's microphone and camera to acquire audio and video information for the presentation. The input audio and video data is collected and sent to the server. This step includes the user recording their own presentation.
[0178] Step 2:
[0179] The server processes the received audio data using a speech recognition algorithm and converts it into text information. The input is audio data, and the output is text information. The server analyzes the tempo and intonation of speech from the audio data and extracts important keywords and phrases. In this step, the information obtained from the audio is processed again as text data.
[0180] Step 3:
[0181] The server processes the received video data using image processing technology. The input is video data, and the output is data related to facial expressions and gestures. The server analyzes this data to estimate emotional influences from facial expressions and gestures. This step involves a detailed analysis of the video information.
[0182] Step 4:
[0183] The server inputs the analyzed text information and video data into the generating AI model. The input consists of text data and video analysis data, and the output is integrated evaluation information. Based on the analysis results, the generating AI model generates specific evaluation information and feedback. This enables a comprehensive evaluation of the presentation. This step includes the integration of data and the generation of feedback.
[0184] Step 5:
[0185] The terminal receives evaluation information sent from the server and displays it on the user interface. The input is evaluation information, and the output is user-confirmable feedback. The user reviews the provided feedback and understands areas for improvement in their presentation. This step involves visualizing the evaluation information.
[0186] (Application Example 2)
[0187] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as a "server" and the smart device 14 as a "terminal".
[0188] While systems exist to improve presentation quality, there were insufficient systems that could comprehensively evaluate and improve not only the user's speaking style and presentation material structure, but also their emotional expression. Furthermore, there was a lack of mechanisms to provide users with concrete action plans to improve their own expressive abilities based on the feedback they received.
[0189] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0190] In this invention, the server includes means for receiving expressive materials and audio information from the user, means for converting the audio information into text data using speech recognition technology, and means for analyzing the text data and expressive materials to evaluate the characteristics of the speaking style, the structure of the presentation materials, and the expression of emotion. This enables the user to obtain specific actionable guidelines for improving their expressive abilities by utilizing the feedback.
[0191] A "user" is an individual or group that uses the system to improve the quality of their presentations.
[0192] "Presentation materials" refer to materials, including slides and visual content, used in presentations.
[0193] "Audio information" refers to recordings of the user's voice during a presentation.
[0194] "Speech recognition technology" is a technology that analyzes speech data and converts it into text data.
[0195] "Character data" refers to text-formatted data converted using speech recognition technology.
[0196] An "emotion engine" is a system that estimates and evaluates a user's emotions from their voice and images.
[0197] "Image recognition technology" is a technology used to analyze elements contained in visual content.
[0198] "Generative artificial intelligence technology" is a technology that generates feedback and suggestions based on analyzed data.
[0199] "Feedback" refers to information and advice that helps users improve their presentations.
[0200] "Action guidelines" refer to specific steps or suggestions that should be taken to improve one's expressive abilities.
[0201] The expression practice support system based on this invention provides feedback to users aiming to improve their presentation skills through the analysis of audio and materials. The system mainly consists of a server and user terminals.
[0202] The device is equipped with a microphone and camera for capturing audio and video data. Users can use this device to practice presentations and record the video and audio. The recorded data is then uploaded to private cloud storage.
[0203] The server uses the Google Cloud Speech-to-Text API to convert speech data into text data. This information is then used to analyze the repetition of specific words and the tempo of speech, leveraging natural language processing techniques. Additionally, Microsoft® Azure® sentiment analysis APIs are used to evaluate the emotions inferred from the speech data.
[0204] The OpenCV library is used for image recognition, and by analyzing the design elements of the user's presentation materials, feedback is generated based on standard design guidelines. The server then uses the generated AI model to integrate all analysis results into a comprehensive feedback.
[0205] On the user's device, aggregated feedback can be received through a visually accessible UI. This feedback is presented in common language, outlining areas for improvement and specific action suggestions.
[0206] As a concrete example, when a user practices their graduation thesis presentation using this system, the feedback generated by the server includes suggestions for improvement in speaking style and presentation materials. A typical prompt would be, "Analyze the audio data of the presentation and generate feedback on areas for improvement in speed and intonation." This allows users to efficiently improve their expressive abilities.
[0207] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0208] Step 1:
[0209] The device uses a microphone and camera to record audio and video data so that the user can begin practicing their presentation. The input at this time is the user's voice and video, and the output is audio and video files saved to local storage.
[0210] Step 2:
[0211] The terminal uploads recorded audio and video data to the server. The input is a data file stored in local storage, and the output is backup data stored in cloud storage on the server. This process ensures that the data is sent to the server in an analyzable state.
[0212] Step 3:
[0213] The server uses the Google Cloud Speech-to-Text API to convert audio data into text data. The input is the audio data uploaded to the server, and the output is the converted text data. In this step, the audio data becomes parseable as string data.
[0214] Step 4:
[0215] The server uses Microsoft Azure's sentiment analysis API to analyze the converted text data and original audio data and perform sentiment evaluation. The input is text data and audio intonation information, and the output is the sentiment evaluation result. This analysis result quantifies the user's emotional expression.
[0216] Step 5:
[0217] The server uses OpenCV to extract and analyze design elements from video data of a presentation material. The input is video data, and the output is data related to design evaluation. This allows for the evaluation of the visual characteristics of the material.
[0218] Step 6:
[0219] The server applies a generative AI model, integrates all analysis results, and generates feedback. The input is all the analysis data, and the output is an integrated feedback document. This allows the user to obtain comprehensive improvement suggestions.
[0220] Step 7:
[0221] The user's device receives feedback from the server and presents it through the interface. The input is a feedback document, and the output is the feedback displayed on the user interface. Based on this information, the user can easily decide on actions to improve their expressive skills.
[0222] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0223] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). An example of data generation model 58 is ChatGPT (registered trademark) (Internet search).<URL: https: / / openai.com / blog / chatgpt> ), Gemini (registered trademark) (Internet search) <url: https: gemini.google.com ?hl="ja">Examples of generative AI include the following. The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 is input with prompts containing instructions, and with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 infers from the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and / or summarization.
[0224] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.
[0225] [Second Embodiment]
[0226] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.
[0227] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.
[0228] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0229] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.
[0230] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.
[0231] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0232] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0233] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0234] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0235] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the identification processing unit 290.
[0236] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0237] Next, the identification processing performed by the identification processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".
[0238] This invention is a system that automates the proofreading of presentations, providing a method for users to efficiently prepare high-quality presentations. The system starts operating when the user sends their presentation materials and related audio information to a server via a terminal.
[0239] Users install a dedicated application on their device and upload presentation materials (e.g., in PDF or PowerPoint format) to the server through that application. Furthermore, they use their device's microphone to record audio in a simulated presentation format and import the audio file into the system.
[0240] The server converts the received audio information into text data using speech recognition technology. This transcribed information from the audio data forms the basis for organizing the flow of the presentation into a document and analyzing the characteristics of the speaking style. The speech recognition process utilizes natural language processing techniques to identify intonation, speed, and repeated words.
[0241] Meanwhile, the server applies image recognition technology to the uploaded presentation materials, analyzing the visual elements within the slides in detail. In particular, it evaluates the entire material based on criteria such as text font size, color coordination, and image resolution. This allows for an assessment of whether the material is effective and consistent.
[0242] Subsequently, the server uses generative artificial intelligence to generate useful feedback for improvement from the analysis results of the audio and materials. The feedback is based on best practices for general presentations and suggests specific areas for correction and improvement to the user.
[0243] Ultimately, the device displays the feedback sent from the server in a user-friendly format. This feedback includes detailed explanations of areas for improvement and specific ways to enhance the presentation. This allows users to independently improve the quality of their presentations without needing help from others.
[0244] This system configuration improves the efficiency of presentation preparation and reduces the burden on users.
[0245] The following describes the processing flow.
[0246] Step 1:
[0247] The user launches a dedicated application on their device, selects a presentation file, and uploads it to the server. Furthermore, they record the presentation's audio using their device's microphone and upload that audio file to the system.
[0248] Step 2:
[0249] The server passes the uploaded audio file to the speech recognition engine, which converts the audio data into text data. This process uses a specific, dedicated dictionary to accurately translate technical terms and custom phrases.
[0250] Step 3:
[0251] The server analyzes the text data obtained through speech recognition and analyzes the characteristics of speech. In particular, it compares the speed, volume, and intonation of the speech with standard speech models and generates feedback.
[0252] Step 4:
[0253] The server analyzes uploaded presentation materials using image recognition technology. It evaluates the design elements of each slide, such as fonts, layout, and color contrast, to determine visual consistency and effectiveness.
[0254] Step 5:
[0255] The server integrates the analysis results and uses generative artificial intelligence to automatically generate feedback on improving the quality of the audio and materials. This feedback includes specific correction suggestions and detailed points for improvement.
[0256] Step 6:
[0257] The server sends the generated feedback report to the terminal.
[0258] Step 7:
[0259] The device displays received feedback to the user in a visually easy-to-understand format. The user can use this information to improve their presentation.
[0260] (Example 1)
[0261] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."
[0262] There is a need for a system that automatically evaluates the quality of presentation materials and content, and provides users with specific improvement suggestions. Existing methods do not sufficiently automate the evaluation of material design or the analysis of presentation style, resulting in the challenge that users must spend considerable time and effort to improve the quality of their presentations.
[0263] The identification process performed by the identification processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0264] In this invention, the server includes means for receiving material data and acoustic information from users, means for converting the acoustic information into text information using speech recognition technology, and means for analyzing the text information and material data to evaluate the characteristics of speech and the structure of the material data. This enables users to quickly and efficiently obtain useful feedback to improve the quality of their presentations.
[0265] A "user" is someone who uses the system to provide material data and audio information and to receive feedback.
[0266] "Document data" refers to files containing information in document or slide format used in presentations.
[0267] "Acoustic information" refers to the audio data spoken during a presentation, and is the data that is converted into text information.
[0268] "Speech recognition technology" is a technical method used to convert acoustic information into textual information.
[0269] "Textual information" refers to text data extracted from acoustic information using speech recognition technology.
[0270] "Speech characteristics" refer to the distinctive features of a speaker's delivery during a presentation, and specific examples include vocal intonation, speaking speed, and repeated phrases.
[0271] "Generative artificial intelligence technology" refers to artificial intelligence technology used to automatically generate feedback and suggestions based on analysis results.
[0272] "Improvement suggestions" refer to feedback provided to users that indicates specific areas and methods for improving presentation materials and speaking style.
[0273] "Design elements" refer to the visual components included in document data, such as fonts, colors, and image placement.
[0274] "Image recognition technology" is a technical method used to extract and evaluate design elements from document data.
[0275] A "server" refers to a central information processing unit that handles all data processing, analysis, and feedback generation.
[0276] This invention is a system that automatically analyzes presentation materials and audio information and provides improvement suggestions to the user. The user installs a dedicated application on their device and uses this application to prepare presentation materials (PDF or slide format). The user then uses the device's microphone function to conduct a mock presentation and records the audio. The device then transmits this material data and audio information to a server.
[0277] The server converts acoustic information into text information using speech recognition technology. This process utilizes commonly used speech recognition APIs and libraries (e.g., open-source speech recognition software). Next, the server analyzes the text information using natural language processing technology to extract speech characteristics. Specifically, analysis libraries such as NLTK and spaCy may be used.
[0278] Meanwhile, the server uses image recognition technology to extract design elements from the document data and compares them to standard design guidelines. Image analysis tools such as OpenCV and Tesseract OCR are used. This evaluates the consistency of the document's fonts, colors, and constituent elements.
[0279] Next, the server uses generative artificial intelligence technology to generate improvement suggestions based on the analysis results. Generative models such as GPT-4 and BERT are used here. The generated feedback provides users with specific improvement methods and suggestions.
[0280] Ultimately, the terminal presents the improvement suggestions received from the server to the user. This presentation is in a format that the user can easily understand. For example, the improvements may be listed in bullet points or displayed as an infographic that is easy to understand visually.
[0281] For example, if a user submits a "presentation on corporate strategy," the system analyzes the audio data to determine that the user is "speaking too fast" and generates a suggestion that the "font size should be increased" for the slide design.
[0282] An example of a prompt message might be, "What are the key points to emphasize in the introduction of the presentation?" Through this example, the system can provide the user with objective and specific directions for improvement.
[0283] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0284] Step 1:
[0285] The user prepares the presentation materials on the terminal and uploads them to the server using the dedicated application. Also, the user uses the microphone of the terminal to conduct a simulated presentation and records the voice. The input is the presentation materials (in PDF or slide format) and acoustic information (voice file), and the output is that this data is sent to the server.
[0286] Step 2:
[0287] The server converts the received acoustic information into character information using speech recognition technology. Specifically, it analyzes the voice waveform using the speech recognition API and generates text data. The input is acoustic information, and the output is character information.
[0288] Step 3:
[0289] The server analyzes the character information using natural language processing technology and extracts the characteristics of the speech pattern. In this process, libraries such as NLTK and spaCy are used to analyze the text data to identify, for example, the repetition of phrases and the tendency of intonation. The input is character information, and the output is data related to the characteristics of the speech pattern.
[0290] Step 4:
[0291] The server uses image recognition technology for the uploaded presentation materials. It utilizes OpenCV and Tesseract OCR to extract the design elements in the materials and compare them with the standard design criteria. The input is the presentation materials, and the output is the evaluation information of the design elements.
[0292] Step 5:
[0293] The server uses generative artificial intelligence technology to generate improvement suggestions based on evaluation information of speech characteristics and design elements. This process utilizes models such as GPT-4 and BERT to construct specific feedback. The input is evaluation data of speech and design, and the output is improvement suggestions.
[0294] Step 6:
[0295] The terminal presents improvement suggestions received from the server to the user. These suggestions are displayed in a format that the user can easily understand and implement. For example, this could include presenting improvements as a bulleted list or as a visually organized infographic. The input is the improvement suggestions, and the output is the feedback display to the user.
[0296] (Application Example 1)
[0297] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."
[0298] In today's commercial environment, there is a need for technological means to improve the quality of presentations that store staff give to customers and to perform their duties efficiently. However, traditional methods result in staff being unable to prepare presentations effectively, leading to a lack of appeal to customers. To solve this problem, a system is needed that automates the editing of presentations and provides expert feedback.
[0299] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0300] In this invention, the server includes means for receiving presentation materials and voice information from a user using an information terminal, means for converting the voice information into text data using voice recognition technology, means for generating feedback based on the analysis results using generative artificial intelligence technology, and means for providing improvement instructions to users assuming store operations. This makes it possible to improve the quality of presentations made by store staff to customers.
[0301] The "information terminal" is a device for a user to input presentation materials and voice information, specifically referring to smartphones, tablets, etc.
[0302] "Voice recognition technology" is a technology for converting voice information into text data that can be understood by a computer.
[0303] "Text data" refers to character information converted by voice recognition technology and is used to analyze the characteristics of a user's speaking style.
[0304] "Generative artificial intelligence technology" is a field of artificial intelligence that makes inferences and proposals based on the analyzed data and is used to generate feedback to the user.
[0305] "Feedback" refers to information including evaluations and improvement points regarding the content of presentation materials and voice data provided by the user.
[0306] "Improvement instructions to users assuming store operations" are means for providing specific points and proposals aimed at improving the quality of presentations with store business activities in mind.
[0307] The "user" refers to store staff who conduct presentations and their related personnel.
[0308] To implement this invention, first, a smartphone or tablet must be prepared as an information terminal, and the user must transmit presentation materials and audio information to the server via that terminal. A dedicated application must be installed on the terminal, and the user uploads the presentation materials using that application.
[0309] The server converts the received audio information into text data using speech recognition technology. Specifically, it uses Google's speech_recognition library to convert the audio data into reliable text data, and then performs analysis using natural language processing technology.
[0310] Furthermore, the server applies image recognition technology to the presentation materials to evaluate them. It uses the python-pptx library to extract visual elements from the slides and checks whether their design conforms to standard guidelines.
[0311] Based on the analysis results of text data and materials generated by the server, specific feedback is generated using generative artificial intelligence technology. OpenAI's API can be used for this process. The AI evaluates the presentation content, suggests areas for improvement to the user, and provides helpful instructions tailored to the user's store operations.
[0312] Ultimately, the terminal displays feedback sent from the server. This feedback is presented in a user-friendly format and designed to be immediately applicable to presentations in real-world business situations.
[0313] An example of a prompt could be: "Please provide suggestions for improvement and specific advice regarding the content of the following presentation. The goal of the presentation is to effectively communicate the benefits of the new product." This allows the generative AI model to provide feedback that aligns with the user's needs.
[0314] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0315] Step 1:
[0316] Users use their information terminals to send presentation materials and audio information to the server via a dedicated application. Input data consists of PDF and PowerPoint documents and audio data, which are uploaded to the server for processing.
[0317] Step 2:
[0318] The server converts uploaded audio data into text data using the speech_recognition library. The input is audio data, and the output is the converted text data. Data processing involves analyzing the audio waveform and converting it into a string using a language model.
[0319] Step 3:
[0320] The server analyzes the converted text data using natural language processing techniques to extract characteristics of the user's speech and specific recurring phrases. The input is text data, and the output is the result of the speech analysis. This process analyzes word frequency and structure.
[0321] Step 4:
[0322] The server uses the python-pptx library to perform image recognition on presentation materials, extracting and evaluating design elements. The input is the presentation material, and the output is the design evaluation result. In particular, the entire material is evaluated based on criteria such as font size and color coordination.
[0323] Step 5:
[0324] The server uses generative artificial intelligence technology to generate feedback based on text analysis results and design evaluation results. The input is the analysis and evaluation results, and the output is feedback information for improvement. The generative AI model develops useful improvement suggestions based on historical data and best practices.
[0325] Step 6:
[0326] The terminal visually presents feedback generated from the server to the user. The input is feedback information, and the output is specific improvement suggestions displayed on the user's screen. The terminal uses visual elements such as text and graphs to convey information in an easily understandable way for the user.
[0327] An example of a prompt sentence to be input into the generating AI model is: "Please provide suggestions for improvement and specific advice regarding the content of the following presentation. The goal of the presentation is to effectively communicate the benefits of the new product."
[0328] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the identification processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform identification processing using the user's emotions.
[0329] This invention is an automated system for improving the quality of user presentations. In addition to analyzing user voice information and presentation materials and generating feedback, it incorporates an emotion engine to evaluate the user's emotional expression.
[0330] Users launch a dedicated application on their device and upload their presentation materials to the server. They also use their device's microphone and camera to record the presentation's audio and video, and send this recording to the server. This data serves as crucial information for communicating the presentation's preparation status to the system.
[0331] The server first uses speech recognition technology to convert the audio data into text data. This text documents the content of the presentation and forms the basis for analyzing speaking speed, repetition, and key phrases. Simultaneously, the audio data is input into an emotion engine, which estimates emotions from the intonation and tone of the user's voice and analyzes the emotional impact of the presentation.
[0332] Next, the server uses image recognition technology to analyze the uploaded presentation materials and extract design elements. It evaluates the font size, color contrast, and layout consistency of each slide to determine the overall visual quality of the material. In addition, by analyzing video data, it uses an emotion engine to evaluate the user's facial expressions and gestures during the presentation and provides feedback based on emotional expression.
[0333] All generated analysis results are integrated by generative artificial intelligence and compiled into concrete feedback. This feedback includes improvements based on audio information, document structure, and emotional expression, and presents a specific action plan for the user.
[0334] The terminal displays feedback reports sent from the server on its user interface. Through this detailed feedback, users can improve the overall quality of their presentations, including not only the content but also the presentation style and emotional expression.
[0335] This system allows users to objectively evaluate their own performance and prepare themselves to approach presentations with confidence.
[0336] The following describes the processing flow.
[0337] Step 1:
[0338] The user launches a dedicated application on their device, selects a presentation file, and uploads it to the server. They also use their device's microphone and camera to record the audio and video of the presentation. Finally, they send the recorded audio and video data to the server.
[0339] Step 2:
[0340] When the server receives audio data, it uses a speech recognition engine to convert the audio into text data. This text data is used as foundational data for analyzing the content of the presentation.
[0341] Step 3:
[0342] The server inputs voice data into an emotion engine and recognizes emotions from the characteristics of the user's voice. Specifically, it analyzes the tone, pitch, and speed of the voice to estimate the expressed emotional state.
[0343] Step 4:
[0344] The server applies image recognition technology to the uploaded presentation materials to extract design elements from the slides. It analyzes font size, color usage, image placement, etc., to evaluate the visual completeness of the materials.
[0345] Step 5:
[0346] The server analyzes video data and evaluates the user's facial expressions and gestures using an emotion engine. It identifies nonverbal emotional expressions derived from the video and prepares feedback on the emotional aspects of the presentation.
[0347] Step 6:
[0348] The server uses generative artificial intelligence to combine text data, presentation analysis results, and sentiment analysis results from audio and video to generate comprehensive feedback. This feedback details areas for improvement and suggestions for revising the presentation.
[0349] Step 7:
[0350] The device receives feedback from the server and displays it in the user interface. The user reviews this feedback and uses it to improve their presentation.
[0351] (Example 2)
[0352] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".
[0353] Traditional presentation evaluation methods often rely on subjective evaluations by the users themselves, making objective assessment difficult. Furthermore, the lack of comprehensive feedback on the impact of vocal intonation and visual design on the audience made it challenging to improve the overall quality of presentations.
[0354] The identification process performed by the identification processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0355] In this invention, the server is a device for acquiring information from a user, and includes means for acquiring audio and video information using a microphone and a camera, means for converting the audio information into text information using a speech recognition algorithm, and means for analyzing the text information and video information to evaluate the characteristics of the speaking style and the composition of the visual materials. This enables a comprehensive evaluation and feedback of speaking style, emotions, and visual information in a presentation.
[0356] A "user" refers to an individual or organization that uses the system to receive feedback on their presentation.
[0357] "Device for acquiring information" refers to equipment equipped with a microphone and camera for collecting audio and video information from users.
[0358] "Audio information" refers to data related to the user's voice and spoken language during a presentation.
[0359] "Video information" refers to data about the user's appearance, facial expressions, and gestures during the presentation.
[0360] A "speech recognition algorithm" refers to a method of analysis used to convert speech data into text data.
[0361] "Textual information" refers to text data converted from audio information by a speech recognition algorithm.
[0362] "Analysis" refers to the process of evaluating textual and visual information to understand the speaking style and material structure of a presentation.
[0363] "Speaking style characteristics" refers to the characteristics of the user's voice during a presentation, such as speed and intonation.
[0364] "Visual material composition" refers to the layout and design elements of slides and graphics used in a presentation.
[0365] "Evaluation" refers to the act of judging the quality of a presentation based on data obtained through analysis.
[0366] A "generative knowledge processing device" refers to a device that includes artificial intelligence used to generate evaluation information based on analysis results.
[0367] "Evaluation information" refers to information provided by generative knowledge processing devices, including feedback on areas for improvement and effectiveness of presentations.
[0368] A "device for displaying information to the user" refers to equipment equipped with a screen or display for visually presenting the generated evaluation information.
[0369] This invention provides a system that allows users to efficiently evaluate their own presentations and clearly identify areas for improvement. Users install a dedicated application on their device and use this application to prepare their presentation materials. The device is equipped with a microphone and camera, which can be used to collect audio and video information from the presentation.
[0370] When a user begins a presentation, the device collects audio and video data and sends it to the server. The server uses a speech recognition algorithm to convert the audio information into text. A commonly used speech recognition library can be used for this purpose. The converted text information then becomes data for further analysis of the speaker's characteristics.
[0371] In parallel, video information is analyzed using image processing technology to extract the user's facial expressions and gestures during the presentation. This allows for an evaluation of how the composition of the visual materials and the user's gestures visually and emotionally influence the presentation.
[0372] A generative knowledge processing device integrates this data to generate specific evaluation information. This evaluation information includes areas for improvement in the auditory, visual, and emotional aspects of the presentation. Users can visually review the evaluation information on their device's display and use it to improve their presentations.
[0373] For example, if a user provides feedback such as "the audio is a little monotonous" or "the text on the slides is hard to read," the tempo can be adjusted or the slide design can be improved.
[0374] An example of a prompt is, "Analyze the audio, slides, and video, and generate feedback based on them."
[0375] This system allows users to objectively evaluate their own presentations and efficiently improve them.
[0376] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0377] Step 1:
[0378] The user launches a dedicated application on their device and prepares their presentation materials. The user uses the device's microphone and camera to acquire audio and video information for the presentation. The input audio and video data is collected and sent to the server. This step includes the user recording their own presentation.
[0379] Step 2:
[0380] The server processes the received audio data using a speech recognition algorithm and converts it into text information. The input is audio data, and the output is text information. The server analyzes the tempo and intonation of speech from the audio data and extracts important keywords and phrases. In this step, the information obtained from the audio is processed again as text data.
[0381] Step 3:
[0382] The server processes the received video data using image processing technology. The input is video data, and the output is data related to facial expressions and gestures. The server analyzes this data to estimate emotional influences from facial expressions and gestures. This step involves a detailed analysis of the video information.
[0383] Step 4:
[0384] The server inputs the analyzed text information and video data into the generating AI model. The input consists of text data and video analysis data, and the output is integrated evaluation information. Based on the analysis results, the generating AI model generates specific evaluation information and feedback. This enables a comprehensive evaluation of the presentation. This step includes the integration of data and the generation of feedback.
[0385] Step 5:
[0386] The terminal receives evaluation information sent from the server and displays it on the user interface. The input is evaluation information, and the output is user-confirmable feedback. The user reviews the provided feedback and understands areas for improvement in their presentation. This step involves visualizing the evaluation information.
[0387] (Application Example 2)
[0388] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."
[0389] While systems exist to improve presentation quality, there were insufficient systems that could comprehensively evaluate and improve not only the user's speaking style and presentation material structure, but also their emotional expression. Furthermore, there was a lack of mechanisms to provide users with concrete action plans to improve their own expressive abilities based on the feedback they received.
[0390] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0391] In this invention, the server includes means for receiving expressive materials and audio information from the user, means for converting the audio information into text data using speech recognition technology, and means for analyzing the text data and expressive materials to evaluate the characteristics of the speaking style, the structure of the presentation materials, and the expression of emotion. This enables the user to obtain specific actionable guidelines for improving their expressive abilities by utilizing the feedback.
[0392] A "user" is an individual or group that uses the system to improve the quality of their presentations.
[0393] "Presentation materials" refer to materials, including slides and visual content, used in presentations.
[0394] "Audio information" refers to recordings of the user's voice during a presentation.
[0395] "Speech recognition technology" is a technology that analyzes speech data and converts it into text data.
[0396] "Character data" refers to text-formatted data converted using speech recognition technology.
[0397] An "emotion engine" is a system that estimates and evaluates a user's emotions from their voice and images.
[0398] "Image recognition technology" is a technology used to analyze elements contained in visual content.
[0399] "Generative artificial intelligence technology" is a technology that generates feedback and suggestions based on analyzed data.
[0400] "Feedback" refers to information and advice that helps users improve their presentations.
[0401] "Action guidelines" refer to specific steps or suggestions that should be taken to improve one's expressive abilities.
[0402] The expression practice support system based on this invention provides feedback to users aiming to improve their presentation skills through the analysis of audio and materials. The system mainly consists of a server and user terminals.
[0403] The device is equipped with a microphone and camera for capturing audio and video data. Users can use this device to practice presentations and record the video and audio. The recorded data is then uploaded to private cloud storage.
[0404] The server uses the Google Cloud Speech-to-Text API to convert audio data into text data. This information is then used to analyze the repetition of specific words and the tempo of speech, leveraging natural language processing techniques. Additionally, Microsoft Azure's Sentiment Analysis API is used to evaluate the emotions inferred from the audio data.
[0405] The OpenCV library is used for image recognition, and by analyzing the design elements of the user's presentation materials, feedback is generated based on standard design guidelines. The server then uses the generated AI model to integrate all analysis results into a comprehensive feedback.
[0406] On the user's device, aggregated feedback can be received through a visually accessible UI. This feedback is presented in common language, outlining areas for improvement and specific action suggestions.
[0407] As a concrete example, when a user practices their graduation thesis presentation using this system, the feedback generated by the server includes suggestions for improvement in speaking style and presentation materials. A typical prompt would be, "Analyze the audio data of the presentation and generate feedback on areas for improvement in speed and intonation." This allows users to efficiently improve their expressive abilities.
[0408] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0409] Step 1:
[0410] The device uses a microphone and camera to record audio and video data so that the user can begin practicing their presentation. The input at this time is the user's voice and video, and the output is audio and video files saved to local storage.
[0411] Step 2:
[0412] The terminal uploads recorded audio and video data to the server. The input is a data file stored in local storage, and the output is backup data stored in cloud storage on the server. This process ensures that the data is sent to the server in an analyzable state.
[0413] Step 3:
[0414] The server uses the Google Cloud Speech-to-Text API to convert audio data into text data. The input is the audio data uploaded to the server, and the output is the converted text data. In this step, the audio data becomes parseable as string data.
[0415] Step 4:
[0416] The server uses Microsoft Azure's sentiment analysis API to analyze the converted text data and original audio data and perform sentiment evaluation. The input is text data and audio intonation information, and the output is the sentiment evaluation result. This analysis result quantifies the user's emotional expression.
[0417] Step 5:
[0418] The server uses OpenCV to extract and analyze design elements from video data of a presentation material. The input is video data, and the output is data related to design evaluation. This allows for the evaluation of the visual characteristics of the material.
[0419] Step 6:
[0420] The server applies a generative AI model, integrates all analysis results, and generates feedback. The input is all the analysis data, and the output is an integrated feedback document. This allows the user to obtain comprehensive improvement suggestions.
[0421] Step 7:
[0422] The user's device receives feedback from the server and presents it through the interface. The input is a feedback document, and the output is the feedback displayed on the user interface. Based on this information, the user can easily decide on actions to improve their expressive skills.
[0423] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing unit 12. In the data processing unit 12, the specific processing unit 290 acquires the audio data.
[0424] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). One example of data generation model 58 is ChatGPT (Internet search<URL: https: / / openai.com / blog / chatgpt> ), Gemini (Internet search) <url: https: gemini.google.com ?hl="ja">Examples of generative AI include the following. The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 is input with prompts containing instructions, and with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 infers from the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and / or summarization.
[0425] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.
[0426] [Third Embodiment]
[0427] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.
[0428] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.
[0429] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0430] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.
[0431] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.
[0432] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0433] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0434] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0435] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0436] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the identification processing unit 290.
[0437] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0438] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".
[0439] This invention is a system that automates the proofreading of presentations, providing a method for users to efficiently prepare high-quality presentations. The system starts operating when the user sends their presentation materials and related audio information to a server via a terminal.
[0440] Users install a dedicated application on their device and upload presentation materials (e.g., in PDF or PowerPoint format) to the server through that application. Furthermore, they use their device's microphone to record audio in a simulated presentation format and import the audio file into the system.
[0441] The server converts the received audio information into text data using speech recognition technology. This transcribed information from the audio data forms the basis for organizing the flow of the presentation into a document and analyzing the characteristics of the speaking style. The speech recognition process utilizes natural language processing techniques to identify intonation, speed, and repeated words.
[0442] Meanwhile, the server applies image recognition technology to the uploaded presentation materials, analyzing the visual elements within the slides in detail. In particular, it evaluates the entire material based on criteria such as text font size, color coordination, and image resolution. This allows for an assessment of whether the material is effective and consistent.
[0443] Subsequently, the server uses generative artificial intelligence to generate useful feedback for improvement from the analysis results of the audio and materials. The feedback is based on best practices for general presentations and suggests specific areas for correction and improvement to the user.
[0444] Ultimately, the device displays the feedback sent from the server in a user-friendly format. This feedback includes detailed explanations of areas for improvement and specific ways to enhance the presentation. This allows users to independently improve the quality of their presentations without needing help from others.
[0445] This system configuration improves the efficiency of presentation preparation and reduces the burden on users.
[0446] The following describes the processing flow.
[0447] Step 1:
[0448] The user launches a dedicated application on their device, selects a presentation file, and uploads it to the server. Furthermore, they record the presentation's audio using their device's microphone and upload that audio file to the system.
[0449] Step 2:
[0450] The server passes the uploaded audio file to the speech recognition engine, which converts the audio data into text data. This process uses a specific, dedicated dictionary to accurately translate technical terms and custom phrases.
[0451] Step 3:
[0452] The server analyzes the text data obtained through speech recognition and analyzes the characteristics of speech. In particular, it compares the speed, volume, and intonation of the speech with standard speech models and generates feedback.
[0453] Step 4:
[0454] The server analyzes uploaded presentation materials using image recognition technology. It evaluates the design elements of each slide, such as fonts, layout, and color contrast, to determine visual consistency and effectiveness.
[0455] Step 5:
[0456] The server integrates the analysis results and uses generative artificial intelligence to automatically generate feedback on improving the quality of the audio and materials. This feedback includes specific correction suggestions and detailed points for improvement.
[0457] Step 6:
[0458] The server sends the generated feedback report to the terminal.
[0459] Step 7:
[0460] The device displays received feedback to the user in a visually easy-to-understand format. The user can use this information to improve their presentation.
[0461] (Example 1)
[0462] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0463] There is a need for a system that automatically evaluates the quality of presentation materials and content, and provides users with specific improvement suggestions. Existing methods do not sufficiently automate the evaluation of material design or the analysis of presentation style, resulting in the challenge that users must spend considerable time and effort to improve the quality of their presentations.
[0464] The identification process performed by the identification processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0465] In this invention, the server includes means for receiving material data and acoustic information from users, means for converting the acoustic information into text information using speech recognition technology, and means for analyzing the text information and material data to evaluate the characteristics of speech and the structure of the material data. This enables users to quickly and efficiently obtain useful feedback to improve the quality of their presentations.
[0466] A "user" is someone who uses the system to provide material data and audio information and to receive feedback.
[0467] "Document data" refers to files containing information in document or slide format used in presentations.
[0468] "Acoustic information" refers to the audio data spoken during a presentation, and is the data that is converted into text information.
[0469] "Speech recognition technology" is a technical method used to convert acoustic information into textual information.
[0470] "Textual information" refers to text data extracted from acoustic information using speech recognition technology.
[0471] "Speech characteristics" refer to the distinctive features of a speaker's delivery during a presentation, and specific examples include vocal intonation, speaking speed, and repeated phrases.
[0472] "Generative artificial intelligence technology" refers to artificial intelligence technology used to automatically generate feedback and suggestions based on analysis results.
[0473] "Improvement suggestions" refer to feedback provided to users that indicates specific areas and methods for improving presentation materials and speaking style.
[0474] "Design elements" refer to the visual components included in document data, such as fonts, colors, and image placement.
[0475] "Image recognition technology" is a technical method used to extract and evaluate design elements from document data.
[0476] A "server" refers to a central information processing unit that handles all data processing, analysis, and feedback generation.
[0477] This invention is a system that automatically analyzes presentation materials and audio information and provides improvement suggestions to the user. The user installs a dedicated application on their device and uses this application to prepare presentation materials (PDF or slide format). The user then uses the device's microphone function to conduct a mock presentation and records the audio. The device then transmits this material data and audio information to a server.
[0478] The server converts acoustic information into text information using speech recognition technology. This process utilizes commonly used speech recognition APIs and libraries (e.g., open-source speech recognition software). Next, the server analyzes the text information using natural language processing technology to extract speech characteristics. Specifically, analysis libraries such as NLTK and spaCy may be used.
[0479] Meanwhile, the server uses image recognition technology to extract design elements from the document data and compares them to standard design guidelines. Image analysis tools such as OpenCV and Tesseract OCR are used. This evaluates the consistency of the document's fonts, colors, and constituent elements.
[0480] Next, the server uses generative artificial intelligence technology to generate improvement suggestions based on the analysis results. Generative models such as GPT-4 and BERT are used here. The generated feedback provides users with specific improvement methods and suggestions.
[0481] Ultimately, the terminal presents the improvement suggestions received from the server to the user. This presentation is in a format that the user can easily understand. For example, the improvements may be listed in bullet points or displayed as an infographic that is easy to understand visually.
[0482] For example, if a user submits a "presentation on corporate strategy," the system analyzes the audio data to determine that the user is "speaking too fast" and generates a suggestion that the "font size should be increased" for the slide design.
[0483] An example of a prompt message might be, "What are the key points to emphasize in the introduction of the presentation?" Through this example, the system can provide the user with objective and specific directions for improvement.
[0484] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0485] Step 1:
[0486] The user prepares presentation materials on their device and uploads them to the server using a dedicated application. The user also uses the device's microphone to perform a mock presentation and record the audio. The input consists of presentation materials (PDF or slide format) and audio information (audio file), and the output is the transmission of this data to the server.
[0487] Step 2:
[0488] The server converts received acoustic information into text information using speech recognition technology. Specifically, it analyzes the audio waveform using a speech recognition API and generates text data. The input is acoustic information, and the output is text information.
[0489] Step 3:
[0490] The server analyzes textual information using natural language processing techniques to extract speech characteristics. This process uses libraries such as NLTK and spaCy to analyze text data, identifying, for example, word repetitions and intonation tendencies. The input is textual information, and the output is data related to speech characteristics.
[0491] Step 4:
[0492] The server uses image recognition technology on uploaded presentation materials. Using OpenCV or Tesseract OCR, it extracts design elements from the materials and compares them to standard design criteria. The input is the presentation material, and the output is evaluation information of the design elements.
[0493] Step 5:
[0494] The server uses generative artificial intelligence technology to generate improvement suggestions based on evaluation information of speech characteristics and design elements. This process utilizes models such as GPT-4 and BERT to construct specific feedback. The input is evaluation data of speech and design, and the output is improvement suggestions.
[0495] Step 6:
[0496] The terminal presents improvement suggestions received from the server to the user. These suggestions are displayed in a format that the user can easily understand and implement. For example, this could include presenting improvements as a bulleted list or as a visually organized infographic. The input is the improvement suggestions, and the output is the feedback display to the user.
[0497] (Application Example 1)
[0498] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0499] In today's commercial environment, there is a need for technological means to improve the quality of presentations that store staff give to customers and to perform their duties efficiently. However, traditional methods result in staff being unable to prepare presentations effectively, leading to a lack of appeal to customers. To solve this problem, a system is needed that automates the editing of presentations and provides expert feedback.
[0500] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0501] In this invention, the server includes means for receiving presentation materials and audio information from a user using an information terminal, means for converting the audio information into text data using speech recognition technology, means for generating feedback based on the analysis results using generative artificial intelligence technology, and means for providing improvement instructions to the user, assuming store operations. This makes it possible to improve the quality of presentations that store staff give to customers.
[0502] An "information terminal" is a device used by users to input presentation materials and audio information, and specifically refers to smartphones, tablets, and similar devices.
[0503] "Speech recognition technology" is a technology that converts speech information into text data that a computer can understand.
[0504] "Text data" refers to character information converted by speech recognition technology and is used to analyze the characteristics of the user's speech.
[0505] "Generative artificial intelligence technology" is a field of artificial intelligence that performs inferences and makes suggestions based on analyzed data, and is used to generate feedback for users.
[0506] "Feedback" refers to information that includes evaluations and suggestions for improvement regarding the content of presentation materials and audio data provided by users.
[0507] "Improvement instructions for users based on store operations" refers to a means of providing specific suggestions and feedback aimed at improving the quality of presentations, with store operations in mind.
[0508] "Users" refers to the store staff giving the presentation and their associates.
[0509] To implement this invention, first, a smartphone or tablet must be prepared as an information terminal, and the user must transmit presentation materials and audio information to the server via that terminal. A dedicated application must be installed on the terminal, and the user uploads the presentation materials using that application.
[0510] The server converts the received audio information into text data using speech recognition technology. Specifically, it uses Google's speech_recognition library to convert the audio data into reliable text data, and then performs analysis using natural language processing technology.
[0511] Furthermore, the server applies image recognition technology to the presentation materials to evaluate them. It uses the python-pptx library to extract visual elements from the slides and checks whether their design conforms to standard guidelines.
[0512] Based on the analysis results of text data and materials generated by the server, specific feedback is generated using generative artificial intelligence technology. OpenAI's API can be used for this process. The AI evaluates the presentation content, suggests areas for improvement to the user, and provides helpful instructions tailored to the user's store operations.
[0513] Ultimately, the terminal displays feedback sent from the server. This feedback is presented in a user-friendly format and designed to be immediately applicable to presentations in real-world business situations.
[0514] An example of a prompt could be: "Please provide suggestions for improvement and specific advice regarding the content of the following presentation. The goal of the presentation is to effectively communicate the benefits of the new product." This allows the generative AI model to provide feedback that aligns with the user's needs.
[0515] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0516] Step 1:
[0517] Users use their information terminals to send presentation materials and audio information to the server via a dedicated application. Input data consists of PDF and PowerPoint documents and audio data, which are uploaded to the server for processing.
[0518] Step 2:
[0519] The server converts uploaded audio data into text data using the speech_recognition library. The input is audio data, and the output is the converted text data. Data processing involves analyzing the audio waveform and converting it into a string using a language model.
[0520] Step 3:
[0521] The server analyzes the converted text data using natural language processing techniques to extract characteristics of the user's speech and specific recurring phrases. The input is text data, and the output is the result of the speech analysis. This process analyzes word frequency and structure.
[0522] Step 4:
[0523] The server uses the python-pptx library to perform image recognition on presentation materials, extracting and evaluating design elements. The input is the presentation material, and the output is the design evaluation result. In particular, the entire material is evaluated based on criteria such as font size and color coordination.
[0524] Step 5:
[0525] The server uses generative artificial intelligence technology to generate feedback based on text analysis results and design evaluation results. The input is the analysis and evaluation results, and the output is feedback information for improvement. The generative AI model develops useful improvement suggestions based on historical data and best practices.
[0526] Step 6:
[0527] The terminal visually presents feedback generated from the server to the user. The input is feedback information, and the output is specific improvement suggestions displayed on the user's screen. The terminal uses visual elements such as text and graphs to convey information in an easily understandable way for the user.
[0528] An example of a prompt sentence to be input into the generating AI model is: "Please provide suggestions for improvement and specific advice regarding the content of the following presentation. The goal of the presentation is to effectively communicate the benefits of the new product."
[0529] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the identification processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform identification processing using the user's emotions.
[0530] This invention is an automated system for improving the quality of user presentations. In addition to analyzing user voice information and presentation materials and generating feedback, it incorporates an emotion engine to evaluate the user's emotional expression.
[0531] Users launch a dedicated application on their device and upload their presentation materials to the server. They also use their device's microphone and camera to record the presentation's audio and video, and send this recording to the server. This data serves as crucial information for communicating the presentation's preparation status to the system.
[0532] The server first uses speech recognition technology to convert the audio data into text data. This text documents the content of the presentation and forms the basis for analyzing speaking speed, repetition, and key phrases. Simultaneously, the audio data is input into an emotion engine, which estimates emotions from the intonation and tone of the user's voice and analyzes the emotional impact of the presentation.
[0533] Next, the server uses image recognition technology to analyze the uploaded presentation materials and extract design elements. It evaluates the font size, color contrast, and layout consistency of each slide to determine the overall visual quality of the material. In addition, by analyzing video data, it uses an emotion engine to evaluate the user's facial expressions and gestures during the presentation and provides feedback based on emotional expression.
[0534] All generated analysis results are integrated by generative artificial intelligence and compiled into concrete feedback. This feedback includes improvements based on audio information, document structure, and emotional expression, and presents a specific action plan for the user.
[0535] The terminal displays feedback reports sent from the server on its user interface. Through this detailed feedback, users can improve the overall quality of their presentations, including not only the content but also the presentation style and emotional expression.
[0536] This system allows users to objectively evaluate their own performance and prepare themselves to approach presentations with confidence.
[0537] The following describes the processing flow.
[0538] Step 1:
[0539] The user launches a dedicated application on their device, selects a presentation file, and uploads it to the server. They also use their device's microphone and camera to record the audio and video of the presentation. Finally, they send the recorded audio and video data to the server.
[0540] Step 2:
[0541] When the server receives audio data, it uses a speech recognition engine to convert the audio into text data. This text data is used as foundational data for analyzing the content of the presentation.
[0542] Step 3:
[0543] The server inputs voice data into an emotion engine and recognizes emotions from the characteristics of the user's voice. Specifically, it analyzes the tone, pitch, and speed of the voice to estimate the expressed emotional state.
[0544] Step 4:
[0545] The server applies image recognition technology to the uploaded presentation materials to extract design elements from the slides. It analyzes font size, color usage, image placement, etc., to evaluate the visual completeness of the materials.
[0546] Step 5:
[0547] The server analyzes video data and evaluates the user's facial expressions and gestures using an emotion engine. It identifies nonverbal emotional expressions derived from the video and prepares feedback on the emotional aspects of the presentation.
[0548] Step 6:
[0549] The server uses generative artificial intelligence to combine text data, presentation analysis results, and sentiment analysis results from audio and video to generate comprehensive feedback. This feedback details areas for improvement and suggestions for revising the presentation.
[0550] Step 7:
[0551] The device receives feedback from the server and displays it in the user interface. The user reviews this feedback and uses it to improve their presentation.
[0552] (Example 2)
[0553] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0554] Traditional presentation evaluation methods often rely on subjective evaluations by the users themselves, making objective assessment difficult. Furthermore, the lack of comprehensive feedback on the impact of vocal intonation and visual design on the audience made it challenging to improve the overall quality of presentations.
[0555] The identification process performed by the identification processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0556] In this invention, the server is a device for acquiring information from a user, and includes means for acquiring audio and video information using a microphone and a camera, means for converting the audio information into text information using a speech recognition algorithm, and means for analyzing the text information and video information to evaluate the characteristics of the speaking style and the composition of the visual materials. This enables a comprehensive evaluation and feedback of speaking style, emotions, and visual information in a presentation.
[0557] A "user" refers to an individual or organization that uses the system to receive feedback on their presentation.
[0558] "Device for acquiring information" refers to equipment equipped with a microphone and camera for collecting audio and video information from users.
[0559] "Audio information" refers to data related to the user's voice and spoken language during a presentation.
[0560] "Video information" refers to data about the user's appearance, facial expressions, and gestures during the presentation.
[0561] A "speech recognition algorithm" refers to a method of analysis used to convert speech data into text data.
[0562] "Textual information" refers to text data converted from audio information by a speech recognition algorithm.
[0563] "Analysis" refers to the process of evaluating textual and visual information to understand the speaking style and material structure of a presentation.
[0564] "Speaking style characteristics" refers to the characteristics of the user's voice during a presentation, such as speed and intonation.
[0565] "Visual material composition" refers to the layout and design elements of slides and graphics used in a presentation.
[0566] "Evaluation" refers to the act of judging the quality of a presentation based on data obtained through analysis.
[0567] A "generative knowledge processing device" refers to a device that includes artificial intelligence used to generate evaluation information based on analysis results.
[0568] "Evaluation information" refers to information provided by generative knowledge processing devices, including feedback on areas for improvement and effectiveness of presentations.
[0569] A "device for displaying information to the user" refers to equipment equipped with a screen or display for visually presenting the generated evaluation information.
[0570] This invention provides a system that allows users to efficiently evaluate their own presentations and clearly identify areas for improvement. Users install a dedicated application on their device and use this application to prepare their presentation materials. The device is equipped with a microphone and camera, which can be used to collect audio and video information from the presentation.
[0571] When a user begins a presentation, the device collects audio and video data and sends it to the server. The server uses a speech recognition algorithm to convert the audio information into text. A commonly used speech recognition library can be used for this purpose. The converted text information then becomes data for further analysis of the speaker's characteristics.
[0572] In parallel, video information is analyzed using image processing technology to extract the user's facial expressions and gestures during the presentation. This allows for an evaluation of how the composition of the visual materials and the user's gestures visually and emotionally influence the presentation.
[0573] A generative knowledge processing device integrates this data to generate specific evaluation information. This evaluation information includes areas for improvement in the auditory, visual, and emotional aspects of the presentation. Users can visually review the evaluation information on their device's display and use it to improve their presentations.
[0574] For example, if a user provides feedback such as "the audio is a little monotonous" or "the text on the slides is hard to read," the tempo can be adjusted or the slide design can be improved.
[0575] An example of a prompt is, "Analyze the audio, slides, and video, and generate feedback based on them."
[0576] This system allows users to objectively evaluate their own presentations and efficiently improve them.
[0577] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0578] Step 1:
[0579] The user launches a dedicated application on their device and prepares their presentation materials. The user uses the device's microphone and camera to acquire audio and video information for the presentation. The input audio and video data is collected and sent to the server. This step includes the user recording their own presentation.
[0580] Step 2:
[0581] The server processes the received audio data using a speech recognition algorithm and converts it into text information. The input is audio data, and the output is text information. The server analyzes the tempo and intonation of speech from the audio data and extracts important keywords and phrases. In this step, the information obtained from the audio is processed again as text data.
[0582] Step 3:
[0583] The server processes the received video data using image processing technology. The input is video data, and the output is data related to facial expressions and gestures. The server analyzes this data to estimate emotional influences from facial expressions and gestures. This step involves a detailed analysis of the video information.
[0584] Step 4:
[0585] The server inputs the analyzed text information and video data into the generating AI model. The input consists of text data and video analysis data, and the output is integrated evaluation information. Based on the analysis results, the generating AI model generates specific evaluation information and feedback. This enables a comprehensive evaluation of the presentation. This step includes the integration of data and the generation of feedback.
[0586] Step 5:
[0587] The terminal receives evaluation information sent from the server and displays it on the user interface. The input is evaluation information, and the output is user-confirmable feedback. The user reviews the provided feedback and understands areas for improvement in their presentation. This step involves visualizing the evaluation information.
[0588] (Application Example 2)
[0589] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0590] While systems exist to improve presentation quality, there were insufficient systems that could comprehensively evaluate and improve not only the user's speaking style and presentation material structure, but also their emotional expression. Furthermore, there was a lack of mechanisms to provide users with concrete action plans to improve their own expressive abilities based on the feedback they received.
[0591] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0592] In this invention, the server includes means for receiving expressive materials and audio information from the user, means for converting the audio information into text data using speech recognition technology, and means for analyzing the text data and expressive materials to evaluate the characteristics of the speaking style, the structure of the presentation materials, and the expression of emotion. This enables the user to obtain specific actionable guidelines for improving their expressive abilities by utilizing the feedback.
[0593] A "user" is an individual or group that uses the system to improve the quality of their presentations.
[0594] "Presentation materials" refer to materials, including slides and visual content, used in presentations.
[0595] "Audio information" refers to recordings of the user's voice during a presentation.
[0596] "Speech recognition technology" is a technology that analyzes speech data and converts it into text data.
[0597] "Character data" refers to text-formatted data converted using speech recognition technology.
[0598] An "emotion engine" is a system that estimates and evaluates a user's emotions from their voice and images.
[0599] "Image recognition technology" is a technology used to analyze elements contained in visual content.
[0600] "Generative artificial intelligence technology" is a technology that generates feedback and suggestions based on analyzed data.
[0601] "Feedback" refers to information and advice that helps users improve their presentations.
[0602] "Action guidelines" refer to specific steps or suggestions that should be taken to improve one's expressive abilities.
[0603] The expression practice support system based on this invention provides feedback to users aiming to improve their presentation skills through the analysis of audio and materials. The system mainly consists of a server and user terminals.
[0604] The device is equipped with a microphone and camera for capturing audio and video data. Users can use this device to practice presentations and record the video and audio. The recorded data is then uploaded to private cloud storage.
[0605] The server uses the Google Cloud Speech-to-Text API to convert audio data into text data. This information is then used to analyze the repetition of specific words and the tempo of speech, leveraging natural language processing techniques. Additionally, Microsoft Azure's Sentiment Analysis API is used to evaluate the emotions inferred from the audio data.
[0606] The OpenCV library is used for image recognition, and by analyzing the design elements of the user's presentation materials, feedback is generated based on standard design guidelines. The server then uses the generated AI model to integrate all analysis results into a comprehensive feedback.
[0607] On the user's device, aggregated feedback can be received through a visually accessible UI. This feedback is presented in common language, outlining areas for improvement and specific action suggestions.
[0608] As a concrete example, when a user practices their graduation thesis presentation using this system, the feedback generated by the server includes suggestions for improvement in speaking style and presentation materials. A typical prompt would be, "Analyze the audio data of the presentation and generate feedback on areas for improvement in speed and intonation." This allows users to efficiently improve their expressive abilities.
[0609] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0610] Step 1:
[0611] The device uses a microphone and camera to record audio and video data so that the user can begin practicing their presentation. The input at this time is the user's voice and video, and the output is audio and video files saved to local storage.
[0612] Step 2:
[0613] The terminal uploads recorded audio and video data to the server. The input is a data file stored in local storage, and the output is backup data stored in cloud storage on the server. This process ensures that the data is sent to the server in an analyzable state.
[0614] Step 3:
[0615] The server uses the Google Cloud Speech-to-Text API to convert audio data into text data. The input is the audio data uploaded to the server, and the output is the converted text data. In this step, the audio data becomes parseable as string data.
[0616] Step 4:
[0617] The server uses Microsoft Azure's sentiment analysis API to analyze the converted text data and original audio data and perform sentiment evaluation. The input is text data and audio intonation information, and the output is the sentiment evaluation result. This analysis result quantifies the user's emotional expression.
[0618] Step 5:
[0619] The server uses OpenCV to extract and analyze design elements from video data of a presentation material. The input is video data, and the output is data related to design evaluation. This allows for the evaluation of the visual characteristics of the material.
[0620] Step 6:
[0621] The server applies a generative AI model, integrates all analysis results, and generates feedback. The input is all the analysis data, and the output is an integrated feedback document. This allows the user to obtain comprehensive improvement suggestions.
[0622] Step 7:
[0623] The user's device receives feedback from the server and presents it through the interface. The input is a feedback document, and the output is the feedback displayed on the user interface. Based on this information, the user can easily decide on actions to improve their expressive skills.
[0624] The specific processing unit 290 transmits the result of the specific processing to the headset terminal 314. In the headset terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing unit 12. In the data processing unit 12, the specific processing unit 290 acquires the audio data.
[0625] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). One example of data generation model 58 is ChatGPT (Internet search<URL: https: / / openai.com / blog / chatgpt> ), Gemini (Internet search) <url: https: gemini.google.com ?hl="ja">Examples of generative AI include the following. The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 is input with prompts containing instructions, and with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 infers from the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and / or summarization.
[0626] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset terminal 314.
[0627] [Fourth Embodiment]
[0628] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.
[0629] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.
[0630] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0631] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.
[0632] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.
[0633] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0634] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0635] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.
[0636] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0637] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0638] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the identification processing unit 290.
[0639] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0640] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0641] This invention is a system that automates the proofreading of presentations, providing a method for users to efficiently prepare high-quality presentations. The system starts operating when the user sends their presentation materials and related audio information to a server via a terminal.
[0642] Users install a dedicated application on their device and upload presentation materials (e.g., in PDF or PowerPoint format) to the server through that application. Furthermore, they use their device's microphone to record audio in a simulated presentation format and import the audio file into the system.
[0643] The server converts the received audio information into text data using speech recognition technology. This transcribed information from the audio data forms the basis for organizing the flow of the presentation into a document and analyzing the characteristics of the speaking style. The speech recognition process utilizes natural language processing techniques to identify intonation, speed, and repeated words.
[0644] Meanwhile, the server applies image recognition technology to the uploaded presentation materials, analyzing the visual elements within the slides in detail. In particular, it evaluates the entire material based on criteria such as text font size, color coordination, and image resolution. This allows for an assessment of whether the material is effective and consistent.
[0645] Subsequently, the server uses generative artificial intelligence to generate useful feedback for improvement from the analysis results of the audio and materials. The feedback is based on best practices for general presentations and suggests specific areas for correction and improvement to the user.
[0646] Ultimately, the device displays the feedback sent from the server in a user-friendly format. This feedback includes detailed explanations of areas for improvement and specific ways to enhance the presentation. This allows users to independently improve the quality of their presentations without needing help from others.
[0647] This system configuration improves the efficiency of presentation preparation and reduces the burden on users.
[0648] The following describes the processing flow.
[0649] Step 1:
[0650] The user launches a dedicated application on their device, selects a presentation file, and uploads it to the server. Furthermore, they record the presentation's audio using their device's microphone and upload that audio file to the system.
[0651] Step 2:
[0652] The server passes the uploaded audio file to the speech recognition engine, which converts the audio data into text data. This process uses a specific, dedicated dictionary to accurately translate technical terms and custom phrases.
[0653] Step 3:
[0654] The server analyzes the text data obtained through speech recognition and analyzes the characteristics of speech. In particular, it compares the speed, volume, and intonation of the speech with standard speech models and generates feedback.
[0655] Step 4:
[0656] The server analyzes uploaded presentation materials using image recognition technology. It evaluates the design elements of each slide, such as fonts, layout, and color contrast, to determine visual consistency and effectiveness.
[0657] Step 5:
[0658] The server integrates the analysis results and uses generative artificial intelligence to automatically generate feedback on improving the quality of the audio and materials. This feedback includes specific correction suggestions and detailed points for improvement.
[0659] Step 6:
[0660] The server sends the generated feedback report to the terminal.
[0661] Step 7:
[0662] The device displays received feedback to the user in a visually easy-to-understand format. The user can use this information to improve their presentation.
[0663] (Example 1)
[0664] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0665] There is a need for a system that automatically evaluates the quality of presentation materials and content, and provides users with specific improvement suggestions. Existing methods do not sufficiently automate the evaluation of material design or the analysis of presentation style, resulting in the challenge that users must spend considerable time and effort to improve the quality of their presentations.
[0666] The identification process performed by the identification processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0667] In this invention, the server includes means for receiving material data and acoustic information from users, means for converting the acoustic information into text information using speech recognition technology, and means for analyzing the text information and material data to evaluate the characteristics of speech and the structure of the material data. This enables users to quickly and efficiently obtain useful feedback to improve the quality of their presentations.
[0668] A "user" is someone who uses the system to provide material data and audio information and to receive feedback.
[0669] "Document data" refers to files containing information in document or slide format used in presentations.
[0670] "Acoustic information" refers to the audio data spoken during a presentation, and is the data that is converted into text information.
[0671] "Speech recognition technology" is a technical method used to convert acoustic information into textual information.
[0672] "Textual information" refers to text data extracted from acoustic information using speech recognition technology.
[0673] "Speech characteristics" refer to the distinctive features of a speaker's delivery during a presentation, and specific examples include vocal intonation, speaking speed, and repeated phrases.
[0674] "Generative artificial intelligence technology" refers to artificial intelligence technology used to automatically generate feedback and suggestions based on analysis results.
[0675] "Improvement suggestions" refer to feedback provided to users that indicates specific areas and methods for improving presentation materials and speaking style.
[0676] "Design elements" refer to the visual components included in document data, such as fonts, colors, and image placement.
[0677] "Image recognition technology" is a technical method used to extract and evaluate design elements from document data.
[0678] A "server" refers to a central information processing unit that handles all data processing, analysis, and feedback generation.
[0679] This invention is a system that automatically analyzes presentation materials and audio information and provides improvement suggestions to the user. The user installs a dedicated application on their device and uses this application to prepare presentation materials (PDF or slide format). The user then uses the device's microphone function to conduct a mock presentation and records the audio. The device then transmits this material data and audio information to a server.
[0680] The server converts acoustic information into text information using speech recognition technology. This process utilizes commonly used speech recognition APIs and libraries (e.g., open-source speech recognition software). Next, the server analyzes the text information using natural language processing technology to extract speech characteristics. Specifically, analysis libraries such as NLTK and spaCy may be used.
[0681] Meanwhile, the server uses image recognition technology to extract design elements from the document data and compares them to standard design guidelines. Image analysis tools such as OpenCV and Tesseract OCR are used. This evaluates the consistency of the document's fonts, colors, and constituent elements.
[0682] Next, the server uses generative artificial intelligence technology to generate improvement suggestions based on the analysis results. Generative models such as GPT-4 and BERT are used here. The generated feedback provides users with specific improvement methods and suggestions.
[0683] Ultimately, the terminal presents the improvement suggestions received from the server to the user. This presentation is in a format that the user can easily understand. For example, the improvements may be listed in bullet points or displayed as an infographic that is easy to understand visually.
[0684] For example, if a user submits a "presentation on corporate strategy," the system analyzes the audio data to determine that the user is "speaking too fast" and generates a suggestion that the "font size should be increased" for the slide design.
[0685] An example of a prompt message might be, "What are the key points to emphasize in the introduction of the presentation?" Through this example, the system can provide the user with objective and specific directions for improvement.
[0686] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0687] Step 1:
[0688] The user prepares presentation materials on their device and uploads them to the server using a dedicated application. The user also uses the device's microphone to perform a mock presentation and record the audio. The input consists of presentation materials (PDF or slide format) and audio information (audio file), and the output is the transmission of this data to the server.
[0689] Step 2:
[0690] The server converts received acoustic information into text information using speech recognition technology. Specifically, it analyzes the audio waveform using a speech recognition API and generates text data. The input is acoustic information, and the output is text information.
[0691] Step 3:
[0692] The server analyzes textual information using natural language processing techniques to extract speech characteristics. This process uses libraries such as NLTK and spaCy to analyze text data, identifying, for example, word repetitions and intonation tendencies. The input is textual information, and the output is data related to speech characteristics.
[0693] Step 4:
[0694] The server uses image recognition technology on uploaded presentation materials. Using OpenCV or Tesseract OCR, it extracts design elements from the materials and compares them to standard design criteria. The input is the presentation material, and the output is evaluation information of the design elements.
[0695] Step 5:
[0696] The server uses generative artificial intelligence technology to generate improvement suggestions based on evaluation information of speech characteristics and design elements. This process utilizes models such as GPT-4 and BERT to construct specific feedback. The input is evaluation data of speech and design, and the output is improvement suggestions.
[0697] Step 6:
[0698] The terminal presents improvement suggestions received from the server to the user. These suggestions are displayed in a format that the user can easily understand and implement. For example, this could include presenting improvements as a bulleted list or as a visually organized infographic. The input is the improvement suggestions, and the output is the feedback display to the user.
[0699] (Application Example 1)
[0700] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0701] In today's commercial environment, there is a need for technological means to improve the quality of presentations that store staff give to customers and to perform their duties efficiently. However, traditional methods result in staff being unable to prepare presentations effectively, leading to a lack of appeal to customers. To solve this problem, a system is needed that automates the editing of presentations and provides expert feedback.
[0702] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0703] In this invention, the server includes means for receiving presentation materials and audio information from a user using an information terminal, means for converting the audio information into text data using speech recognition technology, means for generating feedback based on the analysis results using generative artificial intelligence technology, and means for providing improvement instructions to the user, assuming store operations. This makes it possible to improve the quality of presentations that store staff give to customers.
[0704] An "information terminal" is a device used by users to input presentation materials and audio information, and specifically refers to smartphones, tablets, and similar devices.
[0705] "Speech recognition technology" is a technology that converts speech information into text data that a computer can understand.
[0706] "Text data" refers to character information converted by speech recognition technology and is used to analyze the characteristics of the user's speech.
[0707] "Generative artificial intelligence technology" is a field of artificial intelligence that performs inferences and makes suggestions based on analyzed data, and is used to generate feedback for users.
[0708] "Feedback" refers to information that includes evaluations and suggestions for improvement regarding the content of presentation materials and audio data provided by users.
[0709] "Improvement instructions for users based on store operations" refers to a means of providing specific suggestions and feedback aimed at improving the quality of presentations, with store operations in mind.
[0710] "Users" refers to the store staff giving the presentation and their associates.
[0711] To implement this invention, first, a smartphone or tablet must be prepared as an information terminal, and the user must transmit presentation materials and audio information to the server via that terminal. A dedicated application must be installed on the terminal, and the user uploads the presentation materials using that application.
[0712] The server converts the received audio information into text data using speech recognition technology. Specifically, it uses Google's speech_recognition library to convert the audio data into reliable text data, and then performs analysis using natural language processing technology.
[0713] Furthermore, the server applies image recognition technology to the presentation materials to evaluate them. It uses the python-pptx library to extract visual elements from the slides and checks whether their design conforms to standard guidelines.
[0714] Based on the analysis results of text data and materials generated by the server, specific feedback is generated using generative artificial intelligence technology. OpenAI's API can be used for this process. The AI evaluates the presentation content, suggests areas for improvement to the user, and provides helpful instructions tailored to the user's store operations.
[0715] Ultimately, the terminal displays feedback sent from the server. This feedback is presented in a user-friendly format and designed to be immediately applicable to presentations in real-world business situations.
[0716] An example of a prompt could be: "Please provide suggestions for improvement and specific advice regarding the content of the following presentation. The goal of the presentation is to effectively communicate the benefits of the new product." This allows the generative AI model to provide feedback that aligns with the user's needs.
[0717] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0718] Step 1:
[0719] Users use their information terminals to send presentation materials and audio information to the server via a dedicated application. Input data consists of PDF and PowerPoint documents and audio data, which are uploaded to the server for processing.
[0720] Step 2:
[0721] The server converts uploaded audio data into text data using the speech_recognition library. The input is audio data, and the output is the converted text data. Data processing involves analyzing the audio waveform and converting it into a string using a language model.
[0722] Step 3:
[0723] The server analyzes the converted text data using natural language processing techniques to extract characteristics of the user's speech and specific recurring phrases. The input is text data, and the output is the result of the speech analysis. This process analyzes word frequency and structure.
[0724] Step 4:
[0725] The server uses the python-pptx library to perform image recognition on presentation materials, extracting and evaluating design elements. The input is the presentation material, and the output is the design evaluation result. In particular, the entire material is evaluated based on criteria such as font size and color coordination.
[0726] Step 5:
[0727] The server uses generative artificial intelligence technology to generate feedback based on text analysis results and design evaluation results. The input is the analysis and evaluation results, and the output is feedback information for improvement. The generative AI model develops useful improvement suggestions based on historical data and best practices.
[0728] Step 6:
[0729] The terminal visually presents feedback generated from the server to the user. The input is feedback information, and the output is specific improvement suggestions displayed on the user's screen. The terminal uses visual elements such as text and graphs to convey information in an easily understandable way for the user.
[0730] An example of a prompt sentence to be input into the generating AI model is: "Please provide suggestions for improvement and specific advice regarding the content of the following presentation. The goal of the presentation is to effectively communicate the benefits of the new product."
[0731] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the identification processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform identification processing using the user's emotions.
[0732] This invention is an automated system for improving the quality of user presentations. In addition to analyzing user voice information and presentation materials and generating feedback, it incorporates an emotion engine to evaluate the user's emotional expression.
[0733] Users launch a dedicated application on their device and upload their presentation materials to the server. They also use their device's microphone and camera to record the presentation's audio and video, and send this recording to the server. This data serves as crucial information for communicating the presentation's preparation status to the system.
[0734] The server first uses speech recognition technology to convert the audio data into text data. This text documents the content of the presentation and forms the basis for analyzing speaking speed, repetition, and key phrases. Simultaneously, the audio data is input into an emotion engine, which estimates emotions from the intonation and tone of the user's voice and analyzes the emotional impact of the presentation.
[0735] Next, the server uses image recognition technology to analyze the uploaded presentation materials and extract design elements. It evaluates the font size, color contrast, and layout consistency of each slide to determine the overall visual quality of the material. In addition, by analyzing video data, it uses an emotion engine to evaluate the user's facial expressions and gestures during the presentation and provides feedback based on emotional expression.
[0736] All generated analysis results are integrated by generative artificial intelligence and compiled into concrete feedback. This feedback includes improvements based on audio information, document structure, and emotional expression, and presents a specific action plan for the user.
[0737] The terminal displays feedback reports sent from the server on its user interface. Through this detailed feedback, users can improve the overall quality of their presentations, including not only the content but also the presentation style and emotional expression.
[0738] This system allows users to objectively evaluate their own performance and prepare themselves to approach presentations with confidence.
[0739] The following describes the processing flow.
[0740] Step 1:
[0741] The user launches a dedicated application on their device, selects a presentation file, and uploads it to the server. They also use their device's microphone and camera to record the audio and video of the presentation. Finally, they send the recorded audio and video data to the server.
[0742] Step 2:
[0743] When the server receives audio data, it uses a speech recognition engine to convert the audio into text data. This text data is used as foundational data for analyzing the content of the presentation.
[0744] Step 3:
[0745] The server inputs voice data into an emotion engine and recognizes emotions from the characteristics of the user's voice. Specifically, it analyzes the tone, pitch, and speed of the voice to estimate the expressed emotional state.
[0746] Step 4:
[0747] The server applies image recognition technology to the uploaded presentation materials to extract design elements from the slides. It analyzes font size, color usage, image placement, etc., to evaluate the visual completeness of the materials.
[0748] Step 5:
[0749] The server analyzes video data and evaluates the user's facial expressions and gestures using an emotion engine. It identifies nonverbal emotional expressions derived from the video and prepares feedback on the emotional aspects of the presentation.
[0750] Step 6:
[0751] The server uses generative artificial intelligence to combine text data, presentation analysis results, and sentiment analysis results from audio and video to generate comprehensive feedback. This feedback details areas for improvement and suggestions for revising the presentation.
[0752] Step 7:
[0753] The device receives feedback from the server and displays it in the user interface. The user reviews this feedback and uses it to improve their presentation.
[0754] (Example 2)
[0755] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0756] Traditional presentation evaluation methods often rely on subjective evaluations by the users themselves, making objective assessment difficult. Furthermore, the lack of comprehensive feedback on the impact of vocal intonation and visual design on the audience made it challenging to improve the overall quality of presentations.
[0757] The identification process performed by the identification processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0758] In this invention, the server is a device for acquiring information from a user, and includes means for acquiring audio and video information using a microphone and a camera, means for converting the audio information into text information using a speech recognition algorithm, and means for analyzing the text information and video information to evaluate the characteristics of the speaking style and the composition of the visual materials. This enables a comprehensive evaluation and feedback of speaking style, emotions, and visual information in a presentation.
[0759] A "user" refers to an individual or organization that uses the system to receive feedback on their presentation.
[0760] "Device for acquiring information" refers to equipment equipped with a microphone and camera for collecting audio and video information from users.
[0761] "Audio information" refers to data related to the user's voice and spoken language during a presentation.
[0762] "Video information" refers to data about the user's appearance, facial expressions, and gestures during the presentation.
[0763] A "speech recognition algorithm" refers to a method of analysis used to convert speech data into text data.
[0764] "Textual information" refers to text data converted from audio information by a speech recognition algorithm.
[0765] "Analysis" refers to the process of evaluating textual and visual information to understand the speaking style and material structure of a presentation.
[0766] "Speaking style characteristics" refers to the characteristics of the user's voice during a presentation, such as speed and intonation.
[0767] "Visual material composition" refers to the layout and design elements of slides and graphics used in a presentation.
[0768] "Evaluation" refers to the act of judging the quality of a presentation based on data obtained through analysis.
[0769] A "generative knowledge processing device" refers to a device that includes artificial intelligence used to generate evaluation information based on analysis results.
[0770] "Evaluation information" refers to information provided by generative knowledge processing devices, including feedback on areas for improvement and effectiveness of presentations.
[0771] A "device for displaying information to the user" refers to equipment equipped with a screen or display for visually presenting the generated evaluation information.
[0772] This invention provides a system that allows users to efficiently evaluate their own presentations and clearly identify areas for improvement. Users install a dedicated application on their device and use this application to prepare their presentation materials. The device is equipped with a microphone and camera, which can be used to collect audio and video information from the presentation.
[0773] When a user begins a presentation, the device collects audio and video data and sends it to the server. The server uses a speech recognition algorithm to convert the audio information into text. A commonly used speech recognition library can be used for this purpose. The converted text information then becomes data for further analysis of the speaker's characteristics.
[0774] In parallel, video information is analyzed using image processing technology to extract the user's facial expressions and gestures during the presentation. This allows for an evaluation of how the composition of the visual materials and the user's gestures visually and emotionally influence the presentation.
[0775] A generative knowledge processing device integrates this data to generate specific evaluation information. This evaluation information includes areas for improvement in the auditory, visual, and emotional aspects of the presentation. Users can visually review the evaluation information on their device's display and use it to improve their presentations.
[0776] For example, if a user provides feedback such as "the audio is a little monotonous" or "the text on the slides is hard to read," the tempo can be adjusted or the slide design can be improved.
[0777] An example of a prompt is, "Analyze the audio, slides, and video, and generate feedback based on them."
[0778] This system allows users to objectively evaluate their own presentations and efficiently improve them.
[0779] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0780] Step 1:
[0781] The user launches a dedicated application on their device and prepares their presentation materials. The user uses the device's microphone and camera to acquire audio and video information for the presentation. The input audio and video data is collected and sent to the server. This step includes the user recording their own presentation.
[0782] Step 2:
[0783] The server processes the received audio data using a speech recognition algorithm and converts it into text information. The input is audio data, and the output is text information. The server analyzes the tempo and intonation of speech from the audio data and extracts important keywords and phrases. In this step, the information obtained from the audio is processed again as text data.
[0784] Step 3:
[0785] The server processes the received video data using image processing technology. The input is video data, and the output is data related to facial expressions and gestures. The server analyzes this data to estimate emotional influences from facial expressions and gestures. This step involves a detailed analysis of the video information.
[0786] Step 4:
[0787] The server inputs the analyzed text information and video data into the generating AI model. The input consists of text data and video analysis data, and the output is integrated evaluation information. Based on the analysis results, the generating AI model generates specific evaluation information and feedback. This enables a comprehensive evaluation of the presentation. This step includes the integration of data and the generation of feedback.
[0788] Step 5:
[0789] The terminal receives evaluation information sent from the server and displays it on the user interface. The input is evaluation information, and the output is user-confirmable feedback. The user reviews the provided feedback and understands areas for improvement in their presentation. This step involves visualizing the evaluation information.
[0790] (Application Example 2)
[0791] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0792] While systems exist to improve presentation quality, there were insufficient systems that could comprehensively evaluate and improve not only the user's speaking style and presentation material structure, but also their emotional expression. Furthermore, there was a lack of mechanisms to provide users with concrete action plans to improve their own expressive abilities based on the feedback they received.
[0793] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0794] In this invention, the server includes means for receiving expressive materials and audio information from the user, means for converting the audio information into text data using speech recognition technology, and means for analyzing the text data and expressive materials to evaluate the characteristics of the speaking style, the structure of the presentation materials, and the expression of emotion. This enables the user to obtain specific actionable guidelines for improving their expressive abilities by utilizing the feedback.
[0795] A "user" is an individual or group that uses the system to improve the quality of their presentations.
[0796] "Presentation materials" refer to materials, including slides and visual content, used in presentations.
[0797] "Audio information" refers to recordings of the user's voice during a presentation.
[0798] "Speech recognition technology" is a technology that analyzes speech data and converts it into text data.
[0799] "Character data" refers to text-formatted data converted using speech recognition technology.
[0800] An "emotion engine" is a system that estimates and evaluates a user's emotions from their voice and images.
[0801] "Image recognition technology" is a technology used to analyze elements contained in visual content.
[0802] "Generative artificial intelligence technology" is a technology that generates feedback and suggestions based on analyzed data.
[0803] "Feedback" refers to information and advice that helps users improve their presentations.
[0804] "Action guidelines" refer to specific steps or suggestions that should be taken to improve one's expressive abilities.
[0805] The expression practice support system based on this invention provides feedback to users aiming to improve their presentation skills through the analysis of audio and materials. The system mainly consists of a server and user terminals.
[0806] The device is equipped with a microphone and camera for capturing audio and video data. Users can use this device to practice presentations and record the video and audio. The recorded data is then uploaded to private cloud storage.
[0807] The server uses the Google Cloud Speech-to-Text API to convert audio data into text data. This information is then used to analyze the repetition of specific words and the tempo of speech, leveraging natural language processing techniques. Additionally, Microsoft Azure's Sentiment Analysis API is used to evaluate the emotions inferred from the audio data.
[0808] The OpenCV library is used for image recognition, and by analyzing the design elements of the user's presentation materials, feedback is generated based on standard design guidelines. The server then uses the generated AI model to integrate all analysis results into a comprehensive feedback.
[0809] On the user's device, aggregated feedback can be received through a visually accessible UI. This feedback is presented in common language, outlining areas for improvement and specific action suggestions.
[0810] As a concrete example, when a user practices their graduation thesis presentation using this system, the feedback generated by the server includes suggestions for improvement in speaking style and presentation materials. A typical prompt would be, "Analyze the audio data of the presentation and generate feedback on areas for improvement in speed and intonation." This allows users to efficiently improve their expressive abilities.
[0811] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0812] Step 1:
[0813] The device uses a microphone and camera to record audio and video data so that the user can begin practicing their presentation. The input at this time is the user's voice and video, and the output is audio and video files saved to local storage.
[0814] Step 2:
[0815] The terminal uploads recorded audio and video data to the server. The input is a data file stored in local storage, and the output is backup data stored in cloud storage on the server. This process ensures that the data is sent to the server in an analyzable state.
[0816] Step 3:
[0817] The server uses the Google Cloud Speech-to-Text API to convert audio data into text data. The input is the audio data uploaded to the server, and the output is the converted text data. In this step, the audio data becomes parseable as string data.
[0818] Step 4:
[0819] The server uses Microsoft Azure's sentiment analysis API to analyze the converted text data and original audio data and perform sentiment evaluation. The input is text data and audio intonation information, and the output is the sentiment evaluation result. This analysis result quantifies the user's emotional expression.
[0820] Step 5:
[0821] The server uses OpenCV to extract and analyze design elements from video data of a presentation material. The input is video data, and the output is data related to design evaluation. This allows for the evaluation of the visual characteristics of the material.
[0822] Step 6:
[0823] The server applies a generative AI model, integrates all analysis results, and generates feedback. The input is all the analysis data, and the output is an integrated feedback document. This allows the user to obtain comprehensive improvement suggestions.
[0824] Step 7:
[0825] The user's device receives feedback from the server and presents it through the interface. The input is a feedback document, and the output is the feedback displayed on the user interface. Based on this information, the user can easily decide on actions to improve their expressive skills.
[0826] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing unit 12. In the data processing unit 12, the specific processing unit 290 acquires the audio data.
[0827] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). One example of data generation model 58 is ChatGPT (Internet search<URL: https: / / openai.com / blog / chatgpt> ), Gemini (Internet search) <url: https: gemini.google.com ?hl="ja">Examples of generative AI include the following. The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 is input with prompts containing instructions, and with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 infers from the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and / or summarization.
[0828] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.
[0829] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping. Specifically, the emotion identification model 59 may determine the user's emotion according to a specific mapping, which is an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the identification processing unit 290 may perform identification processing using the robot's emotion.
[0830] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. Above and below the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.
[0831] These emotions are distributed at the 3 o'clock position on the Emotion Map 400, and usually fluctuate between feelings of security and anxiety. In the right half of the Emotion Map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.
[0832] The inside of the Emotion Map 400 represents inner thoughts, while the outside represents actions. Therefore, the further you go from the outside of the Emotion Map 400, the more visible (expressed in actions) your emotions become.
[0833] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https: / / ci.nii.ac.jp / naid / 500000375379). The left half of the emotion map contains emotions belonging to a region called "response," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.
[0834] The emotion map defines two emotions that promote learning. One is the emotion around the middle of the negative "repentance" and "reflection" on the situation side. In other words, it is when the robot experiences negative emotions such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the emotion around the positive "desire" on the reaction side. In other words, it is when the robot has positive feelings such as "I want more" or "I want to know more."
[0835] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
[0836] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.
[0837] In the above embodiment, an example was given in which a specific process is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and a distributed processing of the specific process may be performed by multiple computers, including computer 22. For example, a data generation model 58 may be provided in an external device of the data processing device 12, and the external device may generate data according to the input data.
[0838] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-temporary storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-temporary storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes specific processing according to the specific processing program 56.
[0839] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.
[0840] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.
[0841] The following types of processors can be used as hardware resources to perform specific processing. Examples of processors include a CPU, a general-purpose processor that functions as a hardware resource to perform specific processing by executing software, i.e., a program. Other examples of processors include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which have circuit configurations specifically designed to perform specific processing. All of these processors have built-in or connected memory, and all of them perform specific processing by using memory.
[0842] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.
[0843] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.
[0844] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.
[0845] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of the technical aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that you may delete unnecessary parts, add new elements, or replace elements in the descriptions and illustrations presented above, as long as you do not deviate from the essence of the technical aspects of this disclosure. Furthermore, in order to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge and the like that do not require special explanation to enable the implementation of the technical aspects of this disclosure have been omitted from the descriptions and illustrations presented above.
[0846] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted to be incorporated by reference.
[0847] The following is further disclosed regarding the embodiments described above.
[0848] (Claim 1)
[0849] A means for receiving presentation materials and audio information from users,
[0850] A means of converting speech information into text data using speech recognition technology,
[0851] A means for analyzing text data and presentation materials to evaluate speaking style characteristics and presentation material structure,
[0852] A means for generating feedback based on analysis results using generative artificial intelligence technology,
[0853] A system including means for presenting the aforementioned feedback to the user.
[0854] (Claim 2)
[0855] The system according to claim 1, which analyzes text data obtained by speech recognition technology using natural language processing technology to identify repetitions of specific words or phrases.
[0856] (Claim 3)
[0857] The system according to claim 1, which extracts design elements contained in each slide of a presentation using image recognition technology and compares them with standard design guidelines.
[0858] "Example 1"
[0859] (Claim 1)
[0860] A means for receiving data and audio information from users,
[0861] A means of converting acoustic information into text information using speech recognition technology,
[0862] A means for analyzing textual information and document data to evaluate the characteristics of speech and the structure of the document data,
[0863] A means for generating improvement suggestions based on analysis results using generative artificial intelligence technology,
[0864] A system including means for providing the aforementioned improvement suggestions to users.
[0865] (Claim 2)
[0866] The system according to claim 1, which analyzes textual information obtained by speech recognition technology using language processing technology to identify repetitions of specific words or phrases.
[0867] (Claim 3)
[0868] The system according to claim 1, which extracts design elements contained in each section of data using image recognition technology and compares them with general design standards.
[0869] "Application Example 1"
[0870] (Claim 1)
[0871] A means of receiving presentation materials and audio information from users using an information terminal,
[0872] A means of converting speech information into text data using speech recognition technology,
[0873] A means for analyzing text data and presentation materials to evaluate speaking style characteristics and presentation material structure,
[0874] A means for generating feedback based on analysis results using generative artificial intelligence technology,
[0875] A means of providing improvement instructions to users, assuming store operations,
[0876] A means for presenting the aforementioned feedback to the user,
[0877] A system that includes this.
[0878] (Claim 2)
[0879] The system according to claim 1, which analyzes text data obtained by speech recognition technology using natural language processing technology to identify repetitions of specific words or phrases.
[0880] (Claim 3)
[0881] The system according to claim 1, which extracts design elements contained in each slide of a presentation using image recognition technology and compares them with standard design guidelines.
[0882] "Example 2 of combining an emotion engine"
[0883] (Claim 1)
[0884] A device for acquiring information from a user, comprising means for acquiring audio and video information using a microphone and a camera,
[0885] Means for converting the aforementioned audio information into text information using a speech recognition algorithm,
[0886] A means for analyzing the aforementioned textual and visual information and evaluating the characteristics of speech and the structure of visual materials,
[0887] A means for generating evaluation information using a generative knowledge processing device with the results of the analysis as input,
[0888] A system including a device for displaying the aforementioned evaluation information to the user.
[0889] (Claim 2)
[0890] The system according to claim 1, which identifies the speed and repetitive elements of speech from textual information and evaluates emotional responses using emotion analysis means.
[0891] (Claim 3)
[0892] The system according to claim 1, which extracts design elements from visual materials using image processing technology and performs structural consistency and emotional evaluation based on gaze and movement.
[0893] "Application example 2 when combining with an emotional engine"
[0894] (Claim 1)
[0895] A means for receiving expressive materials and audio information from users,
[0896] A means of converting speech information into text data using speech recognition technology,
[0897] A means for analyzing text data and presentation materials to evaluate speaking style characteristics, presentation material structure, and emotional expression,
[0898] A means for generating feedback based on analysis results using generative artificial intelligence technology and presenting it to the interface of an audiovisual device,
[0899] Applications that allow users to improve their self-expression abilities by utilizing feedback,
[0900] A system that includes this.
[0901] (Claim 2)
[0902] The system according to claim 1, which analyzes text data obtained by speech recognition technology using natural language processing technology to identify repetitions of specific words or phrases, and evaluates emotional expressions using an emotion engine.
[0903] (Claim 3)
[0904] The system according to claim 1, which extracts design elements contained in each page of a presentation material using image recognition technology, compares them with standard design guidelines, and suggests areas for visual improvement. [Explanation of Symbols]
[0905] 10, 210, 310, 410 Data Processing Systems 12 Data Processing Devices 14 Smart Devices 214 Smart Glasses 314 Headset-type terminal 414 Robots< / url:> < / url:> < / url:> < / url:>
Claims
1. A means of receiving presentation materials and audio information from users using an information terminal, A means of converting speech information into text data using speech recognition technology, A means for analyzing text data and presentation materials to evaluate speaking style characteristics and presentation material structure, A means for generating feedback based on analysis results using generative artificial intelligence technology, A means of providing improvement instructions to users, assuming store operations, A means for presenting the aforementioned feedback to the user, A system that includes this.
2. The system according to claim 1, which analyzes text data obtained by speech recognition technology using natural language processing technology to identify repetitions of specific words or phrases.
3. The system according to claim 1, which extracts design elements contained in each slide of a presentation using image recognition technology and compares them with standard design guidelines.