system

The presentation editing system uses speech and image recognition to provide users with detailed feedback on speech and material quality, addressing the challenge of self-improvement in presentations.

JP2026100526A · Pending · Publication Date: 2026-06-19 · SOFTBANK GROUP CORP

Patent Information

Authority / Receiving Office
JP · JP
Patent Type
Applications
Current Assignee / Owner
SOFTBANK GROUP CORP
Filing Date
2024-12-09
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Business people, students, and researchers struggle to prepare effective presentations without relying on others for feedback. In particular, it is difficult to objectively analyze aspects such as the appearance and composition of materials and the tempo of speech, and to obtain multifaceted improvement points.

Method used

A presentation editing system utilizing speech recognition technology to analyze speech flow and tempo, and image recognition to evaluate presentation materials, providing comprehensive feedback on areas for improvement.

Benefits of technology

Enables users to autonomously refine their presentations by receiving detailed feedback on speech and material quality, enhancing their presentation skills effectively.



Abstract

[Problem] To provide a system. [Solution] A system including: a speech recognition means that receives audio data and converts the audio into text; a means for analyzing the flow, tempo, and timing of speech from the text; a means for receiving presentation materials and analyzing the layout and content of the materials using image recognition technology; a means for integrating the results of the audio analysis and the results of the material analysis to generate feedback; and a means for sending the generated feedback to a terminal.

Description

Technical Field

[0001] The technology of the present disclosure relates to a system.

Background Art

[0002] Patent Document 1 discloses a persona chatbot control method performed by at least one processor, including steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a chatbot character, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance in response to the user utterance.

Prior Art Documents

Patent Documents

[0003]

Patent Document 1

Summary of the Invention

Problems to be Solved by the Invention

[0004] For business people, students, researchers, and others to prepare presentations efficiently and effectively, it is important to receive opinions and revisions from others. However, there is a need for a way to improve the quality of presentations autonomously, without relying on others such as superiors and colleagues. In particular, the challenge is to objectively analyze multifaceted improvement points, such as the appearance and composition of materials and the tempo of speech, and to obtain feedback on them.

Means for Solving the Problems

[0005] To solve this problem, the present invention provides a presentation editing system using speech recognition technology and image recognition technology. When a user inputs audio data and presentation materials via a terminal, the server converts the audio into text and analyzes the flow and tempo of the speech. In addition, image recognition is used to analyze the layout and content of the materials and generate comprehensive feedback on the audio and materials. The generated feedback is sent to the terminal and an interface is provided to the user that shows areas for improvement, enabling effective presentation preparation without the need for assistance from others.

[0006] "Speech recognition" is a technology that analyzes speech data and converts it into text data.

[0007] "Image recognition" is a technology that detects and analyzes specific information from images and videos.

[0008] "Feedback" refers to suggestions for improvement or correction provided based on the analysis results.

[0009] "Terminal" refers to electronic devices such as computers and smartphones used by users.

[0010] A "server" is a computer system that provides data processing and storage over a network.

[0011] "Presentation materials" refer to information media such as slides and documents used in a presentation.

[0012] "Analysis" is the process of examining and breaking down data to find meaning and patterns.

[0013] An "interface" is an operating screen or means that enables the exchange of information and instructions between a user and a system.

Brief Description of the Drawings

[0014]
[Figure 1] A conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.
[Figure 2] A conceptual diagram showing an example of the main functions of a data processing device and a smart device according to the first embodiment.
[Figure 3] A conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.
[Figure 4] A conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
[Figure 5] A conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.
[Figure 6] A conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
[Figure 7] A conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.
[Figure 8] A conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
[Figure 9] An emotion map to which a plurality of emotions are mapped.
[Figure 10] An emotion map to which a plurality of emotions are mapped.
[Figure 11] A sequence diagram showing the processing flow of the data processing system in Example 1.
[Figure 12] A sequence diagram showing the processing flow of the data processing system in Application Example 1.
[Figure 13] A sequence diagram showing the processing flow of the data processing system in Example 2 when an emotion engine is combined.
[Figure 14] A sequence diagram showing the processing flow of the data processing system in Application Example 2 when an emotion engine is combined.

MODE FOR CARRYING OUT THE INVENTION

[0015] Hereinafter, an example of an embodiment of a system according to the technology of the present disclosure will be described with reference to the accompanying drawings.

[0016] First, the terms used in the following description will be explained.

[0017] In the following embodiments, a processor denoted by a reference numeral (hereinafter simply referred to as "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. The processor may also be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), and an APU (Accelerated Processing Unit).

[0018] In the following embodiments, a RAM (Random Access Memory) denoted by a reference numeral is a memory in which information is temporarily stored and which is used as a work memory by the processor.

[0019] In the following embodiments, a storage denoted by a reference numeral is one or more non-volatile storage devices that store various programs, various parameters, and the like. Examples of non-volatile storage devices include flash memory (e.g., an SSD (Solid State Drive)), magnetic disks (e.g., hard disks), and magnetic tapes.

[0020] In the following embodiments, a communication I/F (Interface) denoted by a reference numeral is an interface that includes a communication processor, an antenna, and the like. The communication I/F controls communication between multiple computers. Examples of communication standards applied to the communication I/F include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), and Bluetooth (registered trademark).

[0021] In the following embodiments, "A and/or B" is synonymous with "at least one of A and B." That is, "A and/or B" means that it may be A alone, B alone, or a combination of A and B. Furthermore, in this specification, the same concept applies when three or more items are linked by "and/or."

[0022] [First Embodiment]

[0023] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.

[0024] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.

[0025] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0026] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.

[0027] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.

[0028] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and/or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.

[0029] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.

[0030] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.

[0031] As shown in Figure 2, in the data processing device 12, a specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.

[0032] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0033] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used in conjunction with a specific processing program 56 by the data processing system 10. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0034] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0035] This invention is a presentation editing system that uses speech recognition and image recognition technology. The system configuration and specific embodiments for carrying out the invention are described below.

[0036] System Configuration

[0037] This system consists of a user terminal, a server that processes data, and a network that connects them. The user uses the terminal to input audio data and materials for their presentation and sends them to the server. The server analyzes this data, generates feedback, and then sends it back to the terminal for the user to see.

[0038] Method for carrying out the invention

[0039] 1. Data entry and transmission

[0040] Users practice their presentations on their own devices and record the audio. They also prepare presentation materials (slides and documents) as electronic files. These audio data and material files are then sent to the server via a dedicated application.

[0041] 2. Voice Analysis

[0042] The server performs speech recognition using the received audio data. This speech recognition engine converts the audio into text and analyzes the way the speaker speaks, their tone, and their pacing. For example, it generates specific feedback such as, "You're speaking too fast in the introduction."
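
As a non-limiting illustration, the pacing check described above could be sketched as follows. The sketch assumes that the speech recognition engine returns a start and end time for each recognized word, which this disclosure does not state. The names `Word`, `words_per_minute`, and `pace_feedback`, the 30-second introduction window, and the 170 words-per-minute threshold are all hypothetical values chosen for the example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the recording
    end: float    # seconds from the start of the recording


def words_per_minute(words: list[Word]) -> float:
    """Average speaking rate over the time span covered by `words`."""
    if not words:
        return 0.0
    duration = words[-1].end - words[0].start
    if duration <= 0:
        return 0.0
    return len(words) * 60.0 / duration


def pace_feedback(words: list[Word], intro_seconds: float = 30.0,
                  fast_wpm: float = 170.0) -> list[str]:
    """Flag the introduction when it is spoken faster than `fast_wpm`.

    Both thresholds are assumptions, not values from the disclosure.
    """
    intro = [w for w in words if w.start < intro_seconds]
    rate = words_per_minute(intro)
    if rate > fast_wpm:
        return [
            "You're speaking too fast in the introduction "
            f"({rate:.0f} words/min)."
        ]
    return []
```

Counting whitespace-delimited words suits languages such as English; for Japanese speech a per-character or per-mora rate would likely be a better unit.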

[0043] 3. Material analysis

[0044] The server analyzes the received presentation materials using image recognition technology. It evaluates the structure, layout, text readability, and placement of visual elements of the materials, and generates suggestions such as, "Slide 3 contains too much information and needs to be rearranged for clarity."
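
As a non-limiting illustration, one of the checks mentioned above (too much information on a slide) could be approximated by counting the text already extracted from each slide. The function name `slide_density_feedback` and the limits of 40 words and 7 lines are hypothetical; the disclosure does not specify how "too much information" is measured.

```python
from __future__ import annotations


def slide_density_feedback(slide_texts: list[str], max_words: int = 40,
                           max_lines: int = 7) -> list[str]:
    """Return one suggestion per slide whose text exceeds the limits.

    `slide_texts[i]` is the text extracted from slide i + 1. The limits
    are illustrative assumptions.
    """
    feedback = []
    for number, text in enumerate(slide_texts, start=1):
        # Ignore blank lines so that spacing does not count as content.
        lines = [line for line in text.splitlines() if line.strip()]
        words = sum(len(line.split()) for line in lines)
        if words > max_words or len(lines) > max_lines:
            feedback.append(
                f"Slide {number} contains too much information "
                f"({words} words, {len(lines)} lines) and needs to be "
                "rearranged for clarity."
            )
    return feedback
```

A word count is only a rough proxy for density; layout and the placement of visual elements would need image-level analysis.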

[0045] 4. Feedback generation and display

[0046] The server integrates the results of audio and document analysis to create comprehensive feedback. This feedback is sent to the terminal in the form of multiple areas for improvement, providing users with information that allows them to efficiently improve their presentations.

[0047] 5. User modification and verification

[0048] Users review the feedback displayed on their devices and revise the content and delivery of their presentations. Afterward, they re-enter data into the system as needed to receive further feedback, thereby improving the quality of their presentations.

[0049] Specific example

[0050] For example, suppose a user is preparing a five-minute presentation on a marketing strategy. The user uploads a recording of their presentation and the presentation materials to the system. The server provides specific suggestions for improving the presentation style, such as "slow down the pace of the introduction and take longer pauses when stating the problem," as well as feedback on the materials, such as "the market analysis chart on the second slide is too complex; we suggest simplifying it visually."

[0051] This format allows users to receive multifaceted feedback and autonomously refine their presentations.

[0052] The following describes the processing flow.

[0053] Step 1:

[0054] The user uses their device to prepare the audio data and document files for the presentation and opens a dedicated application. Through the application, they upload the audio data and document files to the server.

[0055] Step 2:

[0056] The device packages the uploaded data and sends it to the server using a secure communication protocol. Additional information, such as the presentation topic and time constraints specified by the user, is also sent at the same time.
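
As a non-limiting illustration, the packaging in Step 2 could be sketched as a single JSON document that carries the files together with the user-specified topic and time limit. The field names, the use of Base64 and SHA-256, and the function name `build_upload_payload` are assumptions made for this example. The secure transport itself (for example, HTTPS) is outside the sketch.

```python
from __future__ import annotations

import base64
import hashlib
import json


def build_upload_payload(audio: bytes, materials: dict[str, bytes],
                         topic: str, time_limit_seconds: int) -> str:
    """Bundle audio, material files, and metadata into one JSON string."""

    def encode(blob: bytes) -> dict:
        # The checksum lets the server detect corruption after decoding.
        return {
            "sha256": hashlib.sha256(blob).hexdigest(),
            "data": base64.b64encode(blob).decode("ascii"),
        }

    payload = {
        "metadata": {
            "topic": topic,
            "time_limit_seconds": time_limit_seconds,
        },
        "audio": encode(audio),
        "materials": {name: encode(blob) for name, blob in materials.items()},
    }
    return json.dumps(payload, ensure_ascii=False)
```

Base64 inflates the payload by roughly one third, so a multipart upload may be preferable for long recordings.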

[0057] Step 3:

[0058] The server places the received audio data into an analysis queue and activates the speech recognition module. This module converts the audio data into text and extracts speech characteristics from it. These characteristics include speed, intonation, and pauses.
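
As a non-limiting illustration, the analysis queue and the extraction of pauses in Step 3 could be sketched as follows. The sketch assumes that each job carries per-word (start, end) times in seconds, and that a gap of at least 0.5 seconds counts as a pause. The job format, the threshold, and the names `extract_pauses` and `drain_analysis_queue` are hypothetical.

```python
from __future__ import annotations

import queue


def extract_pauses(word_times: list[tuple[float, float]],
                   min_pause: float = 0.5) -> list[tuple[float, float]]:
    """Return (start, length) for each silent gap of at least `min_pause`."""
    pauses = []
    # Compare the end of each word with the start of the next one.
    for (_, prev_end), (next_start, _) in zip(word_times, word_times[1:]):
        gap = next_start - prev_end
        if gap >= min_pause:
            pauses.append((prev_end, gap))
    return pauses


def drain_analysis_queue(
        jobs: queue.Queue) -> dict[str, list[tuple[float, float]]]:
    """Process queued (job_id, word_times) jobs in arrival order."""
    results = {}
    while True:
        try:
            job_id, word_times = jobs.get_nowait()
        except queue.Empty:
            break  # nothing left to analyze
        results[job_id] = extract_pauses(word_times)
        jobs.task_done()
    return results
```

Intonation, the third characteristic named in Step 3, would require pitch analysis of the waveform and is not covered here.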

[0059] Step 4:

[0060] The server uses an image recognition module to analyze presentation files. This module evaluates the placement of text, shapes, and images within the slides, checking for visual consistency and readability of the materials.

[0061] Step 5:

[0062] The server compares the results of speech recognition and image recognition to generate integrated feedback. This feedback includes specific areas for improvement and recommended corrections.
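
As a non-limiting illustration, the integration in Step 5 could be sketched as merging the findings of both analyses into one prioritized list. The `Finding` record, its 1-to-3 severity scale, and the cap of five items are assumptions for this example; the disclosure does not define how the results are compared or ranked.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    source: str      # "speech" or "materials"
    location: str    # e.g. "introduction" or "slide 3"
    issue: str       # what was observed
    suggestion: str  # recommended correction
    severity: int    # 1 (minor) to 3 (major); the scale is an assumption


def integrate_feedback(speech: list[Finding], materials: list[Finding],
                       limit: int = 5) -> list[Finding]:
    """Merge both analyses, drop exact duplicates, most severe first."""
    # dict.fromkeys removes duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(speech + materials))
    # The sort is stable, so findings of equal severity keep their order.
    unique.sort(key=lambda finding: -finding.severity)
    return unique[:limit]
```

Capping the list keeps the feedback actionable; the remaining findings can be surfaced on a later iteration.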

[0063] Step 6:

[0064] The server repackages the generated feedback and sends it to the terminal. A secure protocol is used for transmission, ensuring the confidentiality of the data.

[0065] Step 7:

[0066] The device analyzes the received feedback and displays it in a format that is easy for the user to understand. The interface includes features that provide detailed explanations for each area for improvement, making it clear to the user which parts of the presentation need to be revised.

[0067] Step 8:

[0068] Users can revise their presentation materials and delivery based on the feedback provided, and use the system again to obtain new feedback as needed. This allows for a gradual improvement in the quality of the presentation.

[0069] (Example 1)

[0070] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0071] In today's business environment, the quality of presentations and talks is a crucial factor in success. However, there is a lack of effective tools for objectively evaluating and improving one's speaking style and the content of presentation materials. Therefore, there is a need for support in improving one's presentation skills by comprehensively analyzing audio and visual information and providing specific areas for improvement.

[0072] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0073] In this invention, the server includes recognition means for receiving acoustic information and converting that acoustic information into textual information, means for analyzing the flow, speed, and intervals of speech from the textual information, and means for receiving visual materials and analyzing the structure and content of the materials using image processing technology. This makes it possible to comprehensively analyze acoustic and visual information and provide the user with effective suggestions for improvement.

[0074] "Acoustic information" refers to audio data and related sound information, and is primarily the data targeted for speech recognition.

[0075] "Textual information" refers to text data generated by speech recognition based on acoustic information, and is information expressed in a format that is easily understood visually by humans.

[0076] "Recognition means" refers to the technologies and devices used to process acoustic information and convert it into textual information, and usually includes speech recognition algorithms.

[0077] "Speech flow" refers to the structure and order of spoken language, and is an element that should be considered in order to maintain the natural flow of speech.

[0078] "Speech rate" refers to the tempo of speech and is usually measured by the amount of sound uttered per unit of time.

[0079] "Pauses" refer to pauses or breaks that a speaker creates during utterances, and are an important element for improving the clarity of speech.

[0080] "Visual materials" refer to media used to present information visually, such as slides and documents created for presentations.

[0081] "Image processing technology" refers to the technology of using computers to analyze and transform image data and extract useful information.

[0082] "Composition" refers to the arrangement and combination of text and image elements in visual materials, and is important for effectively conveying information.

[0083] "Content" refers to the information that visual materials attempt to convey, and plays a central role in expressing the theme and message of the material.

[0084] "Areas for improvement" refers to specific suggestions and advice for the user to improve their presentation, derived from the analysis results of acoustic and visual information.

[0085] This invention is a system for analyzing acoustic and visual information and providing users with suggestions for improvement. The following describes how to implement the system in detail.

[0086] The server receives acoustic information and uses a speech recognition engine to convert it into text. For example, it uses speech recognition services such as "Google® Speech-to-Text" and "Amazon Transcribe." The server analyzes the converted text and evaluates the flow, speed, and pauses of the speech. In this process, it uses a generative AI model to generate specific feedback regarding the emphasis of particular words and phrases, as well as the tempo of the speech.

[0087] The terminal is responsible for transmitting audio information and visual materials recorded by the user to the server. Visual materials include slides and documents, and the server analyzes the received materials using image processing technology. By using image recognition tools such as OpenCV and Tesseract, the structure and content of the materials are evaluated, and improvement suggestions are generated.

[0088] Users can improve the quality of their presentations by receiving feedback from the server and repeatedly practicing. Based on the improvements, users can record new practice sessions and input them into the system again.

[0089] As a concrete example of implementation, let's say a user is preparing a 5-minute presentation introducing a new product. The user uploads a recording of their presentation and the presentation materials to the system. The server provides specific suggestions for improvement, such as "improve your delivery in the introduction and simplify the design of slide 2." Based on these suggestions, the user can practice repeatedly.

[0090] An example of a prompt message that could be sent to the server is: "Analyze the audio data and materials of the presentation and generate specific areas for improvement. The audio data is what the user said, and the materials include slides and documents."

[0091] This embodiment of the invention provides a concrete method for efficiently improving the quality of presentations through collaboration between servers, terminals, and users.

[0092] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0093] Step 1:

[0094] Users practice their presentations on their own devices and record the audio information (audio data). They also prepare visual materials (slides and documents) for their presentations. The input consists of audio information and visual materials, which are then sent to the server using a dedicated application.

[0095] Step 2:

[0096] The server receives acoustic information sent by the user and converts it into text information using a speech recognition engine. The input is acoustic information, and the server analyzes the waveform of the voice to obtain output as a string of characters. In this process, the server uses technologies such as "Google Speech-to-Text" or "Amazon Transcribe".

[0097] Step 3:

[0098] The server analyzes the speed, flow, and pauses of speech from the converted text information. The text information is the input, and a generative AI model is used to evaluate the speaking style and tempo and to output feedback for improvement. The server specifically analyzes certain words and speech patterns to provide appropriate suggestions.

[0099] Step 4:

[0100] In parallel, the server receives the visual materials and analyzes their structure and content using image processing technology. The input for this step is the visual materials, and the layout and text legibility are checked automatically. The output is specific suggestions for improving the materials. Image recognition is performed using OpenCV or Tesseract.
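
As a non-limiting illustration, a simple layout check on a slide image could be sketched as follows. The sketch uses OpenCV only to load the image as grayscale, then measures the fraction of dark pixels as a rough proxy for crowding. This metric, the assumption of a light slide background, the 35% threshold, and all function names are hypothetical and are not the analysis method of this disclosure.

```python
from __future__ import annotations


def ink_ratio(gray_rows: list[list[int]], dark_below: int = 128) -> float:
    """Fraction of pixels darker than `dark_below` (0 black, 255 white)."""
    total = sum(len(row) for row in gray_rows)
    if total == 0:
        return 0.0
    dark = sum(1 for row in gray_rows for value in row if value < dark_below)
    return dark / total


def load_gray_rows(path: str) -> list[list[int]]:
    """Read a slide image as rows of grayscale values using OpenCV."""
    import cv2  # imported lazily; requires the opencv-python package

    image = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if image is None:  # cv2.imread returns None when the file is unreadable
        raise FileNotFoundError(f"could not read image: {path}")
    return image.tolist()


def layout_feedback(gray_rows: list[list[int]], slide_number: int,
                    max_ink: float = 0.35) -> list[str]:
    """Suggest simplifying a slide whose dark-pixel share exceeds `max_ink`."""
    ratio = ink_ratio(gray_rows)
    if ratio > max_ink:
        return [
            f"Slide {slide_number} looks crowded: {ratio:.0%} of the area "
            "is covered by dark content."
        ]
    return []
```

For slides with dark backgrounds the comparison would need to be inverted, and pure-Python pixel loops are slow for large images.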

[0101] Step 5:

[0102] The server integrates the feedback on the acoustic information and the visual materials to generate comprehensive improvement suggestions. The output is detailed feedback, including areas for improvement, that helps the user enhance their presentation skills from multiple perspectives.

[0103] Step 6:

[0104] The server sends feedback to the user's device. The user uses this feedback to improve their presentation. By reviewing the content displayed on the device and practicing again, it is possible to improve the quality of the presentation.

[0105] (Application Example 1)

[0106] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0107] In modern content distribution platforms, providing high-quality presentations and content is crucial, but it's not easy for individual users to improve their own presentation skills and the quality of their materials. Furthermore, there's a lack of systems that effectively analyze user-recorded audio and slides and provide specific suggestions for improvement. There's also a need for support in enhancing the quality of content, including presentations, through the input of prompts using generative AI models.

[0108] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0109] In this invention, the server includes a speech recognition means for receiving audio data and converting the audio into text, a means for analyzing the flow, tempo, and pauses of speech from the text, and a means for receiving presentation materials and analyzing the layout and content of the materials using image recognition technology. This makes it possible to analyze audio files and slide images recorded by the user and use the feedback to qualitatively improve the materials for the content distribution platform. Furthermore, by generating prompt sentences and inputting them into a generative AI model, it is possible to provide feedback that improves the overall quality of the content.

[0110] "Speech recognition means" refers to a device or software that has the function of analyzing speech data and converting it into text.

[0111] "Image recognition technology" is a technology that analyzes image data and determines its content and structure.

[0112] A "feedback generation method" is a function that provides users with suggestions for improvement based on the results of processed audio and image data.

[0113] A "terminal" is an electronic device that a user operates to input voice and image data and receive feedback.

[0114] A "content distribution platform" is an online service that provides users with various digital content and enables interaction.

[0115] "Means of qualitatively improving materials" refer to methods and techniques for evaluating provided materials and presentations and improving their content.

[0116] A "generative AI model" is an artificial intelligence algorithm that generates new information or results based on given data.

[0117] A "prompt statement" is an instruction given to a generative AI model, and it is a phrase used to obtain a specific output.

[0118] The system for implementing this invention consists of a user terminal, a data processing server, and a network that connects them. The user inputs audio data and presentation materials via the terminal and sends them to the server. At this time, the terminal records the audio and uploads the presentation materials as image files to the server.

[0119] The server converts the received audio data into text using speech recognition technology and analyzes the flow and tempo of the speech. The "speech_recognition" library is used for this process. Furthermore, for presentation materials, image recognition technology is used to analyze the content and layout, and the readability and visual arrangement of the information are evaluated. The "PIL" and "pytesseract" libraries are used at this stage.

[0120] The analysis results are integrated, and a feedback generation mechanism generates improvement suggestions for the user. These suggestions, based on both audio and document data, help the user prepare qualitatively improved material for the content distribution platform. The feedback is input into the AI model as prompts, resulting in an optimized presentation. The user can receive the feedback on their device and use it to improve their presentation.

[0121] For example, when a user prepares an online lecture, they upload audio and slide materials to the system, and the server provides specific feedback such as "adjust the tempo of the audio introduction" or "simplify the slide structure." This allows the user to significantly improve the overall quality of the delivered content. An example of a prompt to the generative AI model would be, "Analyze the audio and materials of the presentation and point out specific areas for improvement. For example, include feedback on the tempo of speaking and the visual elements of the slides."

[0122] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0123] Step 1:

[0124] The user uses a device to record the audio of the presentation and prepare the materials as image files. The input consists of audio and image files, which serve as the basic data for subsequent analysis.

[0125] Step 2:

[0126] The terminal sends the prepared audio and image files to the server. At this point, the input is the audio and image files, and the output is the transmission of data to the server. Once the data reaches the server, the analysis process is ready to begin.

[0127] Step 3:

[0128] The server uses speech recognition to convert audio files into text data. The input is an audio file, and the output is the converted text. Here, the "speech_recognition" library is used to convert audio data into text.
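
As a non-limiting illustration, Step 3 could be sketched with the SpeechRecognition package, which is imported as `speech_recognition`. The sketch assumes a WAV, AIFF, or FLAC input file and uses the package's `recognize_google` method, which calls a free web API. The choice of backend, the function name `transcribe_audio`, and the `sr_module` parameter (a hook for offline testing) are assumptions for this example.

```python
from __future__ import annotations


def transcribe_audio(path: str, language: str = "en-US",
                     sr_module=None) -> str:
    """Convert a recorded audio file into text.

    `sr_module` is a hypothetical hook for passing a stand-in module in
    tests; by default the real `speech_recognition` package is imported.
    """
    if sr_module is None:
        import speech_recognition as sr_module  # pip: SpeechRecognition

    recognizer = sr_module.Recognizer()
    with sr_module.AudioFile(path) as source:
        audio = recognizer.record(source)  # read the entire file
    try:
        return recognizer.recognize_google(audio, language=language)
    except sr_module.UnknownValueError:
        # Raised when the speech could not be understood.
        return ""
```

A call such as `transcribe_audio("practice.wav", language="ja-JP")` would return the transcript as a single string. Network failures surface as the package's `RequestError`, which this sketch deliberately leaves to the caller.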

[0129] Step 4:

[0130] The server analyzes the text data and evaluates the flow, tempo, and pauses of the speech. The input is text data, and the output is feedback data as a result of the analysis. It specifically extracts characteristics of the user's speech and indicates which areas can be improved.

[0131] Step 5:

[0132] The server uses image recognition technology to analyze presentation materials. The input is an image file, and the output is an evaluation of the layout and content. Here, the "PIL" and "pytesseract" libraries are used to extract text information from the slides and evaluate the design.
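
As a non-limiting illustration, Step 5 could be sketched with Pillow and pytesseract as follows. `pytesseract.image_to_data` returns word-level boxes, and the sketch flags words whose box height is small relative to the slide as a rough readability check. The 3% height ratio, the confidence cutoff of 60, and the function names are assumptions for this example, and the Tesseract binary must be installed for `ocr_slide` to run.

```python
from __future__ import annotations


def ocr_slide(path: str, lang: str = "eng") -> dict:
    """Run Tesseract on one slide image and return word-level boxes."""
    from PIL import Image  # Pillow, imported lazily
    import pytesseract     # wrapper around the Tesseract binary

    with Image.open(path) as image:
        return pytesseract.image_to_data(
            image, lang=lang, output_type=pytesseract.Output.DICT
        )


def small_text_words(ocr: dict, image_height: int,
                     min_height_ratio: float = 0.03,
                     min_conf: float = 60.0) -> list[str]:
    """Words whose box is shorter than `min_height_ratio` of the slide.

    `ocr` uses the "text", "conf", and "height" keys produced by
    `image_to_data`; the thresholds are illustrative assumptions.
    """
    small = []
    for text, conf, height in zip(ocr["text"], ocr["conf"], ocr["height"]):
        # Skip empty boxes and detections Tesseract is unsure about.
        if not text.strip() or float(conf) < min_conf:
            continue
        if height / image_height < min_height_ratio:
            small.append(text)
    return small
```

The extracted words can also feed a text-density check, while design aspects such as color and alignment need separate image analysis.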

[0133] Step 6:

[0134] The server integrates the analysis results of audio and image data and uses a generative AI model to generate optimal feedback. The input is the analysis results data, and the output is feedback that includes detailed improvement suggestions. Prompt sentences are also generated during this process and input into the generative AI model.
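
As a non-limiting illustration, the prompt generation in Step 6 could be sketched as follows. The instruction text reuses the example prompt given for this application example; the layout of the appended analysis results and the function name `build_feedback_prompt` are assumptions. The call to the generative AI model is omitted because the disclosure does not name a specific model or API.

```python
from __future__ import annotations


def build_feedback_prompt(speech_findings: list[str],
                          material_findings: list[str]) -> str:
    """Combine both analysis results into one prompt sentence block."""

    def bullets(items: list[str]) -> str:
        # An explicit "(none)" tells the model that a section is empty.
        return "\n".join(f"- {item}" for item in items) or "- (none)"

    return (
        "Analyze the audio and materials of the presentation and point out "
        "specific areas for improvement. For example, include feedback on "
        "the tempo of speaking and the visual elements of the slides.\n\n"
        f"Speech analysis results:\n{bullets(speech_findings)}\n\n"
        f"Material analysis results:\n{bullets(material_findings)}"
    )
```

Keeping the speech and material results in separate labeled sections makes it easier for the model to attribute each suggestion to its source.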

[0135] Step 7:

[0136] The server sends the generated feedback to the terminal. The input is feedback data, and the output is a feedback presentation to the user. The suggestions are displayed on the terminal, and the user can use them to improve the quality of their presentation.

[0137] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0138] This invention is a presentation editing system that combines speech recognition and image recognition technologies with an emotion engine that recognizes the user's emotions. The specific system configuration and embodiments of the invention are shown below.

[0139] System Configuration

[0140] The system consists of a user-operated terminal, a server that performs data analysis, and a communication network connecting them. The user inputs audio and presentation data through the terminal and sends it to the server. The server performs speech recognition, image recognition, and emotion recognition, generates feedback based on the analysis results, and sends it back to the terminal.

[0141] Method for carrying out the invention

[0142] 1. Data entry and transmission

[0143] When preparing a presentation, users record audio data on their device and upload presentation materials as electronic files. This data is sent to the server via the application.

[0144] 2. Voice Analysis and Emotion Recognition

[0145] The server converts the received audio data into text using a speech recognition engine and simultaneously analyzes it using an emotion engine. The emotion engine recognizes the user's emotions (e.g., nervousness, anxiety, confidence) from factors such as tone of voice and speaking speed. This makes it possible to evaluate the effectiveness of emotional expression in a presentation.

[0146] 3. Material analysis

[0147] The server uses an image recognition engine to analyze the structure, design, and consistency of the presentation materials. It evaluates whether the materials are easy to understand and visually effective, and suggests areas for improvement as needed.

[0148] 4. Feedback generation and display

[0149] The server integrates voice and sentiment analysis results with document analysis results to generate feedback. This feedback includes specific areas for improvement in speaking style during presentations, guidelines for emotional expression, and suggestions for document revisions. The generated feedback is sent to the terminal and displayed to the user in an easy-to-understand format.

[0150] 5. User improvements and verification

[0151] Users can revise each element of their presentation based on the generated feedback. If necessary, they can re-enter data, receive new feedback, and further improve the quality of their presentation.

[0152] Specific example

[0153] Consider a scenario where a user is giving a presentation introducing a new product. The user uploads the audio and materials of their presentation to the system, and receives feedback from the server such as, "You sound nervous in the introduction; try to relax more," or "Slide 2 has too many graphs and is difficult to read; highlight the important data." This provides specific guidance for improving the presentation.

[0154] The following describes the processing flow.

[0155] Step 1:

[0156] The user records the audio of the presentation on their device and prepares the presentation materials as electronic files. Next, they launch a dedicated application and perform the input procedure to upload these audio data and material files to the server.

[0157] Step 2:

[0158] The terminal packages the uploaded audio data and document files into a data package and prepares it for transmission to the server. It sends the data to the server using a secure protocol (e.g., HTTPS).
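The packaging described in Step 2 might look like the sketch below, which bundles the two files and a small manifest into a single ZIP archive held in memory. The archive format, the manifest layout, and the function name `build_upload_package` are assumptions made for illustration; the actual HTTPS transmission is omitted.

```python
import hashlib
import io
import json
import zipfile


def build_upload_package(audio_name: str, audio_bytes: bytes,
                         material_name: str, material_bytes: bytes) -> bytes:
    """Bundle the audio and material files with a checksum manifest."""
    files = {audio_name: audio_bytes, material_name: material_bytes}
    manifest = {
        "files": [
            {"name": name, "size": len(data), "sha256": hashlib.sha256(data).hexdigest()}
            for name, data in files.items()
        ]
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, data in files.items():
            archive.writestr(name, data)
        archive.writestr("manifest.json", json.dumps(manifest))
    return buffer.getvalue()
```

Including checksums lets the server confirm that both files arrived intact before analysis begins.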

[0159] Step 3:

[0160] The server adds the received audio data to the analysis queue and starts the speech recognition engine. The engine converts the audio data into text data, making the content of the presentation available in written form.
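A rough sketch of the analysis queue in Step 3 is shown below, using a single worker thread. The `recognize` callable stands in for the speech recognition engine and is supplied by the caller; the function name `run_analysis_queue` and the `(job_id, audio)` job shape are hypothetical.

```python
import queue
import threading
from typing import Callable, Dict, Iterable, Tuple


def run_analysis_queue(jobs: Iterable[Tuple[str, bytes]],
                       recognize: Callable[[bytes], str]) -> Dict[str, str]:
    """Process queued audio jobs in order and collect the transcripts."""
    pending: "queue.Queue" = queue.Queue()
    results: Dict[str, str] = {}

    def worker() -> None:
        while True:
            job = pending.get()
            if job is None:  # sentinel: no more work
                pending.task_done()
                break
            job_id, audio = job
            results[job_id] = recognize(audio)
            pending.task_done()

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    for job in jobs:
        pending.put(job)
    pending.put(None)
    pending.join()
    thread.join()
    return results
```

A queue of this kind lets the server accept uploads immediately and run recognition as capacity allows.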

[0161] Step 4:

[0162] The server continuously feeds the audio data into the emotion recognition engine. The emotion recognition engine analyzes emotional patterns from the tone, pitch, and speed of the voice to identify the emotional state during the presentation (tension, joy, calmness).

[0163] Step 5:

[0164] The server passes the presentation file to an image recognition engine for analysis. It evaluates the layout, design, and information placement within the document, and analyzes the visual consistency and appropriateness of the amount of information.

[0165] Step 6:

[0166] The server integrates the results of speech recognition, emotion recognition, and document analysis to generate feedback. This feedback includes suggestions for improving the presentation's overall quality and places particular emphasis on areas for improvement in emotional expression.

[0167] Step 7:

[0168] The server packages the generated feedback and sends it to the terminal. The terminal receives the feedback and launches an interface that displays detailed error reports and success evaluations to the user.

[0169] Step 8:

[0170] Based on the feedback displayed, users revise the content, delivery, and emotional expression of their presentations. If necessary, they re-enter the data into the system to obtain further feedback for improvement.

[0171] (Example 2)

[0172] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0173] In today's presentation environment, presenters spend a great deal of time preparing their speaking style and materials, but they often struggle to identify specific areas for improvement and receive effective feedback. Furthermore, while emotional expression significantly impacts the success of a presentation, it's difficult for presenters to evaluate this aspect themselves.

[0174] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0175] In this invention, the server includes a speech recognition device that receives audio data and converts the audio into text, an emotion analysis device that detects the speaker's emotional state from the audio data, and a device that receives presentation materials and evaluates the structure and content of the materials using image recognition technology. This allows presenters to receive specific, actionable feedback on each element of their presentation: audio, emotion, and materials.

[0176] "Audio data" refers to information that represents the waveform of sound acquired through an input device such as a microphone in digital format.

[0177] A "speech recognition device" refers to a technology and device that receives speech data as input, analyzes that speech, and converts it into corresponding text information.

[0178] An "emotion analysis device" refers to a technology and device that analyzes a speaker's emotional state from voice data and its associated information, and identifies the type and intensity of that emotion.

[0179] "Emotional state" refers to the speaker's psychological state inferred from the characteristics of their voice, and includes, for example, joy, sadness, and tension.

[0180] "Image recognition technology" refers to technologies that analyze image data and recognize its content and structure, with examples including layout analysis and visual content recognition.

[0181] "Presentation materials" refer to visual content used during presentations or explanations, and may be provided in the form of slides, graphs, and other visual aids.

[0182] "Information equipment" refers to devices used for inputting and displaying data, and includes computers and smartphones.

[0183] A "user interface" refers to the screens and operating methods that operate on information devices and allow users to interact with the system.

[0184] This invention is realized through a system that combines speech and image recognition technology, as well as emotion analysis technology, to improve the quality of presentations. This system consists of a terminal operated by the user, a server that analyzes data, and a network that connects them.

[0185] Users prepare presentation audio and materials using their devices. Audio data is recorded using a microphone and saved as an electronic file. Material data is uploaded in electronic file formats such as PDF and PPT. This data is transmitted from the device to the server via the internet.

[0186] The server is equipped with hardware and software that perform multiple processes. Audio data is converted to text by speech recognition software (e.g., Google Speech-to-Text API or IBM Watson®). Simultaneously, sentiment analysis software analyzes the emotional state from the audio data. This analysis includes elements such as voice tone, pitch, and speaking speed.

[0187] For the document data, image recognition software (e.g., OpenCV or Tesseract) is used to analyze the layout, consistency, and visual effect of the document. This allows us to evaluate whether the document is clearly structured and easy for viewers to understand.

[0188] The server integrates the analysis results of voice, emotion, and materials to generate specific feedback. This feedback includes suggestions for improving speaking style and revisions to materials. The generated feedback is sent back to the terminal via the internet. The feedback is then displayed in the application on the terminal, presented in a format that is easy for the user to understand.

[0189] One concrete example is a scenario where a user gives a presentation introducing a new product. When a user uploads the audio and materials of their presentation, they receive feedback from the server such as, "You sound nervous in the introduction, so try to relax when you speak," or "There are too many graphs on slide 2, making it difficult to read, so you should emphasize the important points." This feedback allows the user to improve their presentation.

[0190] Using a generative AI model, an example of a prompt message could be: "When a user records a presentation introducing a new product and sends the materials to the server, analyze the user's emotions from the audio data and evaluate the effectiveness of the materials. The feedback should include specific guidance regarding the user's proficiency."

[0191] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0192] Step 1:

[0193] The user launches a dedicated application on their device and simultaneously records the presentation audio using a microphone while preparing the presentation materials as electronic files (PDF or PPT). The device saves this audio and material data and sends them to the server via Wi-Fi or a mobile network. The input consists of the audio and material files, and the output is the transmission of data to the server.

[0194] Step 2:

[0195] The server converts the received audio file into text using a speech recognition device. Specifically, it uses software such as the Google Speech-to-Text API or IBM Watson to analyze the audio data and convert it into corresponding text. The input to this process is audio data, and the output is text data.

[0196] Step 3:

[0197] The server uses an emotion analysis device to analyze voice data and detect the user's emotional state based on indicators such as voice tone, pitch, and speaking speed. The input to this process is voice data, and the output is the detected emotion information. The emotion analysis uses an algorithm to determine the speaker's psychological state.
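To connect the detected emotion information with feedback such as "You sound nervous in the introduction," the per-segment output of an emotion analysis device could be aggregated over a time window, as in the sketch below. The segment format `(start, end, label)` and the function name `dominant_emotion` are assumptions; the emotion labels themselves are taken as given from whatever model the server uses, and no emotion estimation is performed here.

```python
from collections import defaultdict
from typing import List, Optional, Tuple

Segment = Tuple[float, float, str]  # (start_sec, end_sec, emotion_label), assumed format


def dominant_emotion(segments: List[Segment], start: float, end: float) -> Optional[str]:
    """Return the label that overlaps the window [start, end) the longest."""
    totals = defaultdict(float)
    for seg_start, seg_end, label in segments:
        overlap = min(seg_end, end) - max(seg_start, start)
        if overlap > 0:
            totals[label] += overlap
    if not totals:
        return None
    return max(totals, key=totals.get)
```

For instance, evaluating the first 30 seconds separately from the whole recording allows the feedback to point at the introduction rather than at the presentation in general.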

[0198] Step 4:

[0199] The server analyzes the received document files using image recognition technology. Specifically, it uses OpenCV and Tesseract to evaluate the layout, consistency, and visual effect of the documents. The input is the document file, and the output is the analysis result of the document.

[0200] Step 5:

[0201] The server integrates speech-to-text, sentiment information, and document analysis results to generate user-facing feedback. This feedback includes suggestions for improving speaking style, guidance on emotional expression, and slide revisions. The input consists of the results of each analysis, while the output is the integrated feedback.
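One way to represent the integrated feedback of Step 5 is a small structured record serialized as JSON, sketched below. The three field names mirror the categories named in the text (speaking style, emotional expression, slide revisions), but the `IntegratedFeedback` type, the `integrate` function, and the JSON layout are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class IntegratedFeedback:
    speaking_style: List[str] = field(default_factory=list)
    emotional_expression: List[str] = field(default_factory=list)
    slide_revisions: List[str] = field(default_factory=list)


def integrate(transcript_notes: List[str], emotion_notes: List[str],
              document_notes: List[str]) -> str:
    """Merge the three analysis results into one JSON feedback payload."""
    feedback = IntegratedFeedback(
        speaking_style=list(transcript_notes),
        emotional_expression=list(emotion_notes),
        slide_revisions=list(document_notes),
    )
    return json.dumps(asdict(feedback), ensure_ascii=False, indent=2)
```

Keeping the categories separate in the payload allows the terminal to display each kind of suggestion in its own section.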

[0202] Step 6:

[0203] The server sends the generated feedback to the terminal. The terminal displays this feedback within the application, presenting it in a format easily understandable to the user. The input is the generated feedback, and the output is the information displayed to the user.

[0204] Step 7:

[0205] The user modifies each element of the presentation based on the displayed feedback. If necessary, the user can resend the newly modified data from their device to the server to receive additional feedback. The input is the feedback-based modification process, and the output is the improved presentation.

[0206] (Application Example 2)

[0207] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as a "server" and the smart device 14 as a "terminal".

[0208] In today's commercial environment, interactive communication directly impacts customer satisfaction, making the improvement of store staff's customer service skills a crucial issue. However, traditional customer service training methods struggle to accurately evaluate individual staff members' emotional expressions and conversational flow. This creates a problem in providing specific and effective feedback tailored to each staff member's abilities.

[0209] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0210] In this invention, the server includes an acoustic recognition means for converting audio data into text, a method for analyzing the layout and content of materials, an algorithm for integrating the analysis results of audio and materials to generate feedback, and a function for evaluating emotions and providing guidance for improving customer service skills. This makes it possible to improve customer service skills by providing specific areas for improvement tailored to each individual staff member.

[0211] "Acoustic recognition means" refers to a function that converts audio data into text, and is a technology that enables the processing of audio information as textual information.

[0212] "Structure" refers to a mechanism for analyzing the flow, tempo, and pacing of speech from text, and is an analytical method for evaluating the characteristics of linguistic expression.

[0213] A "method" is a way of analyzing the layout and content of a document using image recognition technology, and is a process for evaluating the structure and effectiveness of a document based on visual information.

[0214] An "algorithm" is a series of processing steps for integrating the results of audio and data analysis to generate feedback, and is a computational method for combining data to produce useful information.

[0215] A "device" is hardware or software equipped with the functionality to transmit generated feedback to a terminal, and is a physical or virtual interface for enabling information transmission.

[0216] "Function" refers to the ability to evaluate emotions and provide guidance for improving customer service skills; it is a system for analyzing individual emotional expressions and providing appropriate advice.

[0217] "Applied technology" refers to techniques that recognize emotions, integrate the results, and enhance feedback, representing innovative technologies for improving information delivery by utilizing diverse data.

[0218] A "generative AI model" is a machine learning model that uses prompt sentences based on training data to suggest effective ways to improve conversations, and is an artificial intelligence technology for automated knowledge processing.

[0219] A "prompt statement" is an input phrase used to elicit a specific response from a generative AI model, and is an instruction statement that facilitates interaction with the AI.

[0220] The system for realizing this application consists of an application installed on a terminal and a server. The user uses the terminal to record audio data and uploads related materials to the server. The server converts the received audio data into text using acoustic recognition. It also utilizes emotion recognition to analyze the user's emotions from the audio data and generates appropriate feedback based on that information.

[0221] Regarding the materials, image recognition technology is used to analyze the consistency and design of the layout and to make suggestions for enhancing the visual effect. The server integrates these results to generate feedback. This feedback includes specific guidelines for improving customer service skills and is sent to the terminal.

[0222] The specific hardware used will be smartphones and tablets operated by the user. For the software, speech recognition and sentiment analysis processing will be implemented using Python, while machine learning frameworks such as TensorFlow® will be used for image recognition. AWS® Lambda will be used as the server backend, enabling serverless data processing.
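A serverless entry point of the kind described could be sketched as the Python handler below. The `(event, context)` signature is the standard form for AWS Lambda Python functions; everything else is an assumption for illustration, including the JSON body carried in `event["body"]`, the field names `audio_key` and `material_key`, and the placeholder response. The actual analysis is not shown.

```python
import json


def lambda_handler(event, context):
    """Validate an upload notification and acknowledge it (analysis omitted)."""
    try:
        body = json.loads(event.get("body") or "{}")
    except json.JSONDecodeError:
        return {"statusCode": 400, "body": json.dumps({"error": "invalid JSON"})}

    # Hypothetical field names identifying the uploaded files.
    missing = [key for key in ("audio_key", "material_key") if key not in body]
    if missing:
        return {"statusCode": 400,
                "body": json.dumps({"error": "missing fields", "fields": missing})}

    # Placeholder: speech, emotion, and material analysis would run here.
    result = {"status": "accepted",
              "audio_key": body["audio_key"],
              "material_key": body["material_key"]}
    return {"statusCode": 200, "body": json.dumps(result)}
```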

[0223] For example, if a user uploads an audio recording of themselves explaining the features of a new product to a customer in a store, the server will generate feedback such as, "Your explanation is too fast; please speak a little slower." In this process, a generative AI model is used, and an example of a prompt might be, "Please tell me some phrases that will effectively introduce the product to the customer." This allows users to receive specific advice to improve their customer service skills.

[0224] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0225] Step 1:

[0226] The user records audio data using their device and prepares related materials in electronic file format. The audio data is saved on the device as an audio file, and the material data is formatted as an electronic file. The user uploads this data to the server via an application on their device.

[0227] Step 2:

[0228] The server converts the received audio data into text using acoustic recognition. Specifically, it analyzes the audio data using the Google Speech-to-Text API and outputs the results in text format. This process stores the linguistic content of the audio data as text in the database.
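Storing the recognized text in a database, as Step 2 describes, could be sketched with SQLite as below. The disclosure does not name a database product or schema, so the `transcripts` table, its columns, and the two helper functions are hypothetical.

```python
import sqlite3
from typing import Optional


def store_transcript(conn: sqlite3.Connection, recording_id: str, text: str) -> None:
    """Save (or overwrite) the transcript for one recording."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS transcripts ("
        "recording_id TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO transcripts (recording_id, body) VALUES (?, ?)",
        (recording_id, text),
    )
    conn.commit()


def load_transcript(conn: sqlite3.Connection, recording_id: str) -> Optional[str]:
    """Return the stored transcript, or None if the recording is unknown."""
    row = conn.execute(
        "SELECT body FROM transcripts WHERE recording_id = ?", (recording_id,)
    ).fetchone()
    return row[0] if row else None
```

Keeping the transcript keyed by recording lets the later emotion analysis and feedback steps read the same text without repeating recognition.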

[0229] Step 3:

[0230] The server uses emotion recognition capabilities to evaluate emotions from transcribed speech. The input text data is passed through an emotion analysis algorithm, and the user's emotional state (e.g., tension, anxiety, confidence) is analyzed and output. This output is then used to generate feedback.

[0231] Step 4:

[0232] The server applies image recognition technology to analyze the layout and design of the materials. It uses machine learning frameworks such as TensorFlow to analyze the material data, and based on the analysis results, it evaluates the visual effects and identifies areas for improvement. These results provide information to determine how the materials will be received by users.

[0233] Step 5:

[0234] The server integrates the results of voice analysis and document analysis to generate feedback. Using a generative AI model, it generates feedback that suggests specific improvement measures and guidelines, taking prompts into consideration, and outputs this feedback in text format. The prompt may include specific instructions such as, "Please tell me some phrases that will effectively introduce the product to customers."

[0235] Step 6:

[0236] The feedback is sent to the device and displayed to the user. Based on this feedback, the user takes action to improve their skills, and can re-enter data to make further improvements.

[0237] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0238] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives prompts containing instructions together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0239] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.

[0240] [Second Embodiment]

[0241] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.

[0242] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.

[0243] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0244] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.

[0245] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0246] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0247] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0248] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0249] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0250] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0251] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0252] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0253] This invention is a presentation editing system that uses speech recognition and image recognition technology. The system configuration and specific embodiments for carrying out the invention are described below.

[0254] System Configuration

[0255] This system consists of a user terminal, a server that processes data, and a network that connects them. The user uses the terminal to input audio data and materials for their presentation and sends them to the server. The server analyzes this data, generates feedback, and then sends it back to the terminal for the user to see.

[0256] Method for carrying out the invention

[0257] 1. Data entry and transmission

[0258] Users practice their presentations on their own devices and record the audio. They also prepare presentation materials (slides and documents) as electronic files. These audio data and material files are then sent to the server via a dedicated application.

[0259] 2. Voice Analysis

[0260] The server runs a speech recognition engine on the received audio data. The engine converts the audio into text and analyzes the speaker's manner of speaking, tone, and pacing. For example, it generates specific feedback such as, "You're speaking too fast in the introduction."

[0261] 3. Material analysis

[0262] The server analyzes the received presentation materials using image recognition technology. It evaluates the structure, layout, text readability, and placement of visual elements of the materials, and generates suggestions such as, "Slide 3 contains too much information and needs to be rearranged for clarity."

[0263] 4. Feedback generation and display

[0264] The server integrates the results of audio and document analysis to create comprehensive feedback. This feedback is sent to the terminal in the form of multiple areas for improvement, providing users with information that allows them to efficiently improve their presentations.

[0265] 5. User modification and verification

[0266] Users review the feedback displayed on their devices and revise the content and delivery of their presentations. Afterward, they re-enter data into the system as needed to receive further feedback, thereby improving the quality of their presentations.

[0267] Specific example

[0268] For example, suppose a user is preparing a five-minute presentation on a marketing strategy. The user uploads a recording of their presentation and the presentation materials to the system. The server provides specific suggestions for improving the presentation style, such as "slow down the pace of the introduction and take longer pauses when stating the problem," as well as feedback on the materials, such as "the market analysis chart on the second slide is too complex; we suggest simplifying it visually."

[0269] This format allows users to receive multifaceted feedback and autonomously refine their presentations.

[0270] The following describes the processing flow.

[0271] Step 1:

[0272] The user uses their device to prepare the audio data and document files for the presentation and opens a dedicated application. Through the application, they upload the audio data and document files to the server.

[0273] Step 2:

[0274] The device packages the uploaded data and sends it to the server using a secure communication protocol. Additional information, such as the presentation topic and time constraints specified by the user, is also sent at the same time.

[0275] Step 3:

[0276] The server puts the received voice data into the analysis queue and activates the speech recognition module. This module converts the voice data into text and extracts the speaking style features from it. The features include speed, intonation, and speaking pattern.

[0277] Step 4:

[0278] The server uses the image recognition module to analyze the presentation material file. In this module, the character information, graphics, and image layout in the slides are evaluated, and the visual consistency and readability of the material are checked.
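The readability check in Step 4 might be approximated with simple measures computed from word bounding boxes, as in the sketch below. It assumes that an OCR or layout step has already produced one box per word, and the `Box` format, the function name `slide_density`, and both thresholds are illustrative placeholders rather than values from the disclosure. Overlapping boxes are not merged, so the coverage figure is only an approximation.

```python
from typing import List, Tuple

Box = Tuple[str, int, int, int, int]  # (text, left, top, width, height) in pixels


def slide_density(boxes: List[Box], slide_w: int, slide_h: int,
                  max_words: int = 40, max_coverage: float = 0.35) -> dict:
    """Flag slides that carry too many words or too much text area."""
    word_count = len(boxes)
    text_area = sum(w * h for _, _, _, w, h in boxes)
    coverage = text_area / (slide_w * slide_h)
    issues = []
    if word_count > max_words:
        issues.append(f"too many words ({word_count} > {max_words})")
    if coverage > max_coverage:
        issues.append(f"text covers {coverage:.0%} of the slide")
    return {"word_count": word_count, "coverage": round(coverage, 3), "issues": issues}
```

A non-empty `issues` list could feed a suggestion such as asking the user to reduce the amount of information on that slide.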

[0279] Step 5:

[0280] The server compares the results of speech recognition and image recognition and generates integrated feedback. This feedback includes specific improvement points and recommended amendments.

[0281] Step 6:

[0282] The server packages the generated feedback and sends it to the terminal. A secure protocol is used for transmission to protect the confidentiality of the data.

[0283] Step 7:

[0284] The terminal analyzes the received feedback and displays it in a form that is easily understandable to the user. The interface includes a function to provide a detailed explanation for each improvement point, clarifying which parts of the presentation the user should correct.
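The display step on the terminal could be sketched as a function that groups feedback items by the part of the presentation they concern and attaches the detailed explanation under each point. The item fields `target`, `summary`, and `detail`, and the function name `render_feedback`, are hypothetical; the disclosure does not specify the feedback format.

```python
from typing import Dict, List


def render_feedback(items: List[Dict[str, str]]) -> str:
    """Render feedback items as a plain-text list grouped by target."""
    sections: Dict[str, List[Dict[str, str]]] = {}
    for item in items:
        sections.setdefault(item["target"], []).append(item)

    lines: List[str] = []
    for target in sorted(sections):
        lines.append(f"[{target}]")
        for number, item in enumerate(sections[target], start=1):
            lines.append(f"  {number}. {item['summary']}")
            if item.get("detail"):
                lines.append(f"     {item['detail']}")
    return "\n".join(lines)
```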

[0285] Step 8:

[0286] The user modifies the presentation material and speaking style based on the presented feedback and, if necessary, uses the system again to obtain new feedback. This can gradually improve the quality of the presentation.

[0287] (Example 1)

[0288] Next, Example 1 will be described. In the following description, the data processing device 12 is referred to as a "server", and the smart glasses 214 are referred to as a "terminal".

[0289] In the modern business environment, the quality of presentations and exhibitions has become an important factor in success. However, in preparing for a presentation, there is a lack of effective tools for objectively evaluating and improving one's speaking style and the content of materials. Therefore, there is a need for support to help users improve their presentation skills by comprehensively analyzing acoustic information and visual materials and presenting specific points for improvement.

[0290] The specific processing by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0291] In this invention, the server includes recognition means for receiving acoustic information and converting the sound into character information, means for analyzing the flow, speed, and pauses of speech from the character information, and means for receiving visual materials and analyzing the composition and content of the materials using image processing technology. This makes it possible to comprehensively analyze acoustic and visual information and provide effective points for improvement to the user.

[0292] "Acoustic information" refers to voice data and related sound information, and is mainly data that is the target of speech recognition.

[0293] "Character information" refers to text data generated by speech recognition based on acoustic information, and is information expressed in a form that is easy for humans to visually understand.

[0294] "Recognition means" refers to the technology and devices used to process acoustic information and convert it into character information, and usually includes a speech recognition algorithm.

[0295] "Speech flow" refers to the structure and order of spoken language, and is an element that should be considered in order to maintain the natural flow of speech.

[0296] "Speech rate" refers to the tempo of speech and is usually measured by the amount of speech (for example, the number of words or characters) uttered per unit of time.

[0297] "Pauses" refer to the silences or breaks that a speaker leaves during utterances, and are an important element for improving the clarity of speech.

[0298] "Visual materials" refer to media used to present information visually, such as slides and documents created for presentations.

[0299] "Image processing technology" refers to the technology of using computers to analyze and transform image data and extract useful information.

[0300] "Composition" refers to the arrangement and combination of text and image elements in visual materials, and is important for effectively conveying information.

[0301] "Content" refers to the information that visual materials attempt to convey, and plays a central role in expressing the theme and message of the material.

[0302] "Areas for improvement" refers to specific suggestions and advice for the user to improve their presentation, derived from the analysis results of acoustic and visual information.

[0303] This invention is a system for analyzing acoustic and visual information and providing users with suggestions for improvement. The following describes how to implement the system in detail.

[0304] The server receives acoustic information and uses a speech recognition engine to convert the audio into character information. For example, it utilizes speech recognition technologies such as "Google Speech-to-Text" or "Amazon Transcribe". The server analyzes the resulting character information and evaluates the flow, speed, and pauses of the speech. At this time, a generative AI model is used to generate specific feedback regarding the emphasis of specific phrases and the tempo of the speech.
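The evaluation of speech speed and pauses described above can be sketched as follows. This example assumes that the speech recognition engine returns word-level timestamps; the `Word` structure, the 0.5-second pause threshold, and the sample data are hypothetical and are not taken from the description.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start: float  # start time in seconds
    end: float    # end time in seconds

def speech_rate_wpm(words):
    """Words per minute over the span from the first word to the last."""
    if not words:
        return 0.0
    duration = words[-1].end - words[0].start
    return len(words) / duration * 60.0 if duration > 0 else 0.0

def find_pauses(words, min_gap=0.5):
    """Return (previous word, next word, gap in seconds) for each long gap."""
    pauses = []
    for prev, cur in zip(words, words[1:]):
        gap = cur.start - prev.end
        if gap >= min_gap:
            pauses.append((prev.text, cur.text, round(gap, 2)))
    return pauses

# Hypothetical recognition output for a short utterance.
words = [
    Word("Today", 0.0, 0.4),
    Word("we", 0.5, 0.6),
    Word("introduce", 0.6, 1.2),
    Word("our", 2.2, 2.4),
    Word("product", 2.4, 3.0),
]
rate = speech_rate_wpm(words)
pauses = find_pauses(words)
```

Numeric values such as these could be passed to the generative AI model alongside the transcript so that its feedback on tempo rests on measured figures.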

[0305] The terminal is responsible for transmitting the acoustic information and visual materials recorded by the user to the server. The visual materials include slides and documents, and the server analyzes the received materials using image processing technologies. By using image recognition tools such as OpenCV and Tesseract, the composition and content of the materials are evaluated, and improvement suggestions are generated.
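A minimal sketch of the kind of readability check described above is shown below. It assumes that an OCR step (for example, one based on Tesseract) has already produced text boxes with pixel coordinates. The `TextBox` structure and both thresholds are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class TextBox:
    text: str
    left: int    # pixels from the left edge of the slide
    top: int     # pixels from the top edge of the slide
    width: int
    height: int  # approximate text height in pixels

def check_slide(boxes, slide_height, max_words=40, min_height_ratio=0.03):
    """Return a list of readability issues found on one slide."""
    issues = []
    word_count = sum(len(box.text.split()) for box in boxes)
    if word_count > max_words:
        issues.append(f"too much text ({word_count} words)")
    small = [box.text for box in boxes if box.height / slide_height < min_height_ratio]
    if small:
        issues.append(f"text may be too small: {small[0]!r}")
    return issues

# Hypothetical OCR result for a 720-pixel-high slide image.
boxes = [
    TextBox("Quarterly results", 50, 40, 600, 60),
    TextBox("footnote in tiny type", 50, 700, 300, 12),
]
issues = check_slide(boxes, slide_height=720)
```

Expressing the text height as a ratio of the slide height keeps the check independent of the resolution at which the slide was rendered.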

[0306] The user can receive feedback from the server and improve the quality by repeating the practice of the presentation. Based on the improvement points, the user can newly record the practice content and input it into the system again.

[0307] As a specific example of implementation, assume that the user is preparing a five-minute presentation titled "Introduction of New Products". The user uploads the recording of their own speech and the presentation materials to the system. The server provides specific improvement points such as "Improve the way of speaking in the intro part and simplify the design of Slide 2". Based on these improvement points, the user can repeat the practice as many times as needed.

[0308] As an example of the prompt text, it is possible to send the text "Analyze the speech data and materials of the presentation and generate specific improvement points. The speech data is the content spoken by the user, and the materials include slides and documents." to the server.
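One way the example prompt above might be combined with the actual analysis inputs is sketched below. The function name `build_prompt` and the bracketed section labels are assumptions made for illustration; only the instruction text is taken from the description.

```python
BASE_INSTRUCTION = (
    "Analyze the speech data and materials of the presentation and generate "
    "specific improvement points. The speech data is the content spoken by "
    "the user, and the materials include slides and documents."
)

def build_prompt(transcript, slide_texts):
    """Append the transcript and the text of each slide to the instruction."""
    sections = [BASE_INSTRUCTION, "", "[Speech transcript]", transcript.strip()]
    for number, text in enumerate(slide_texts, start=1):
        sections += ["", f"[Slide {number}]", text.strip()]
    return "\n".join(sections)

prompt = build_prompt("Hello everyone.", ["Title slide", "Sales graph"])
```

Labeling each slide by number lets the model refer to a specific slide in its answer, as in the "Slide 2" feedback example.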

[0309] In this way, the server, terminal, and user cooperate to provide a specific method for efficiently improving the quality of the presentation, which is the embodiment of this invention.

[0310] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0311] Step 1:

[0312] Users practice their presentations on their own devices and record the audio information (audio data). They also prepare visual materials (slides and documents) for their presentations. The input consists of audio information and visual materials, which are then sent to the server using a dedicated application.

[0313] Step 2:

[0314] The server receives acoustic information sent by the user and converts it into text information using a speech recognition engine. The input is acoustic information, and the server analyzes the waveform of the voice to obtain output as a string of characters. In this process, the server uses technologies such as "Google Speech-to-Text" or "Amazon Transcribe".

[0315] Step 3:

[0316] The server analyzes the speed, flow, and pauses of the speech from the converted text information. The input is the text information, and a generative AI model is used to evaluate the speaking style and tempo and to output feedback for improvement. In particular, the server analyzes specific words and speech patterns to provide appropriate suggestions.

[0317] Step 4:

[0318] Simultaneously, the server receives visual materials and analyzes their structure and content using image processing techniques. The input for this step is the visual materials, and the layout and text legibility are automatically checked. The output is a set of specific suggestions for improving the materials. Technically, image recognition is performed using OpenCV or Tesseract.

[0319] Step 5:

[0320] The server integrates the feedback obtained from the acoustic information and from the visual materials to generate comprehensive improvement suggestions. The output is detailed feedback, including areas for improvement, that helps the user enhance their presentation skills from multiple perspectives.

[0321] Step 6:

[0322] The server sends feedback to the user's device. The user uses this feedback to improve their presentation. By reviewing the content displayed on the device and practicing again, it is possible to improve the quality of the presentation.
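Step 3 above mentions analyzing specific words and speech patterns. A minimal sketch of one such analysis, counting filler words and finding the longest sentence in the transcript, is given below. The filler list and the choice of these two measures are assumptions for illustration; the description does not state which words or patterns are examined.

```python
import re
from collections import Counter

# Illustrative filler list; an assumption, not taken from the description.
FILLERS = {"um", "uh", "er", "basically"}

def filler_counts(transcript):
    """Count how often each filler word appears in the transcript."""
    tokens = re.findall(r"[a-z']+", transcript.lower())
    return Counter(token for token in tokens if token in FILLERS)

def longest_sentence_words(transcript):
    """Return the word count of the longest sentence in the transcript."""
    sentences = [s for s in re.split(r"[.!?]+", transcript) if s.strip()]
    return max((len(s.split()) for s in sentences), default=0)

transcript = "Um, today we, uh, introduce the product. Um, it is basically new."
fillers = filler_counts(transcript)
longest = longest_sentence_words(transcript)
```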

[0323] (Application Example 1)

[0324] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0325] In modern content distribution platforms, providing high-quality presentations and content is crucial, but it's not easy for individual users to improve their own presentation skills and the quality of their materials. Furthermore, there's a lack of systems that effectively analyze user-recorded audio and slides and provide specific suggestions for improvement. There's also a need for support in enhancing the quality of content, including presentations, through the input of prompts using generative AI models.

[0326] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0327] In this invention, the server includes a speech recognition means for receiving audio data and converting the audio into text, a means for analyzing the flow, tempo, and pauses of speech from the text, and a means for receiving presentation materials and analyzing the layout and content of the materials using image recognition technology. This makes it possible to analyze audio files and slide images recorded by the user and use the feedback to qualitatively improve the materials for the content distribution platform. Furthermore, by generating prompt sentences and inputting them into a generative AI model, it is possible to provide feedback that improves the overall quality of the content.

[0328] "Speech recognition means" refers to a device or software that has the function of analyzing speech data and converting it into text.

[0329] "Image recognition technology" is a technology that analyzes image data and determines its content and structure.

[0330] A "feedback generation method" is a function that provides users with suggestions for improvement based on the results of processed audio and image data.

[0331] A "terminal" is an electronic device that a user operates to input voice and image data and receive feedback.

[0332] A "content distribution platform" is an online service that provides users with various digital content and enables interaction.

[0333] "Means of qualitatively improving materials" refer to methods and techniques for evaluating provided materials and presentations and improving their content.

[0334] A "generative AI model" is an artificial intelligence algorithm that generates new information or results based on given data.

[0335] A "prompt statement" is an instruction given to a generative AI model, and it is a phrase used to obtain a specific output.

[0336] The system for implementing this invention consists of a user terminal, a data processing server, and a network that connects them. The user inputs audio data and presentation materials via the terminal and sends them to the server. At this time, the terminal records the audio and uploads the presentation materials as image files to the server.

[0337] The server converts the received audio data into text using speech recognition technology and analyzes the flow and tempo of the speech. The "speech_recognition" library is used for this process. Furthermore, for presentation materials, image recognition technology is used to analyze the content and layout, and the readability and visual arrangement of the information are evaluated. The "PIL" and "pytesseract" libraries are used at this stage.

[0338] The analysis results are integrated, and a feedback generation mechanism generates improvement suggestions for the user. These suggestions, based on both audio and document data, help the user prepare qualitatively improved material for the content distribution platform. The feedback is input into the AI model as prompts, resulting in an optimized presentation. The user can receive the feedback on their device and use it to improve their presentation.
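The integration of the two analysis results described above can be sketched as a merge of findings ranked by severity. The `Finding` structure and the three-level severity scale are hypothetical; the description does not say how suggestions are prioritized.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    source: str    # "audio" or "material"
    target: str    # part of the presentation, e.g. "introduction" or "slide 2"
    message: str
    severity: int  # 1 (minor) to 3 (major); an assumed scale

def integrate(audio_findings, material_findings, limit=5):
    """Merge both result sets and keep the most severe findings first."""
    merged = audio_findings + material_findings
    # list.sort is stable, so findings of equal severity keep their input order.
    merged.sort(key=lambda finding: -finding.severity)
    return merged[:limit]

audio = [Finding("audio", "introduction", "Adjust the tempo.", 2)]
material = [
    Finding("material", "slide 2", "Simplify the slide structure.", 3),
    Finding("material", "slide 4", "Enlarge the axis labels.", 2),
]
ranked = integrate(audio, material)
```

Capping the list with `limit` keeps the feedback short enough to act on in one practice round.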

[0339] For example, when a user prepares an online lecture, they upload audio and slide materials to the system, and the server provides specific feedback such as "adjust the tempo of the audio introduction" or "simplify the slide structure." This allows the user to significantly improve the overall quality of the delivered content. An example of a prompt to the generative AI model would be, "Analyze the audio and materials of the presentation and point out specific areas for improvement. For example, include feedback on the tempo of speaking and the visual elements of the slides."

[0340] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0341] Step 1:

[0342] The user uses a device to record the audio of the presentation and prepare the materials as image files. The input consists of audio and image files, which serve as the basic data for subsequent analysis.

[0343] Step 2:

[0344] The terminal sends the prepared audio and image files to the server. At this point, the input is the audio and image files, and the output is the transmission of data to the server. Once the data reaches the server, the analysis process is ready to begin.

[0345] Step 3:

[0346] The server uses speech recognition to convert audio files into text data. The input is an audio file, and the output is the converted text. Here, the "speech_recognition" library is used to convert audio data into text.

[0347] Step 4:

[0348] The server analyzes the text data and evaluates the flow, tempo, and pauses of the speech. The input is the text data, and the output is feedback data obtained as a result of the analysis. Specifically, the server extracts characteristics of the user's speech and indicates which areas can be improved.

[0349] Step 5:

[0350] The server uses image recognition technology to analyze presentation materials. The input is an image file, and the output is an evaluation of the layout and content. Here, the "PIL" and "pytesseract" libraries are used to extract text information from the slides and evaluate the design.

[0351] Step 6:

[0352] The server integrates the analysis results of audio and image data and uses a generative AI model to generate optimal feedback. The input is the analysis results data, and the output is feedback that includes detailed improvement suggestions. Prompt sentences are also generated during this process and input into the generative AI model.

[0353] Step 7:

[0354] The server sends the generated feedback to the terminal. The input is feedback data, and the output is a feedback presentation to the user. The suggestions are displayed on the terminal, and the user can use them to improve the quality of their presentation.
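The server-side portion of this flow (Steps 3 to 6) can be sketched as a single function that chains the stages. Each engine is passed in as a callable so that the sketch runs without any external service. The function name `run_pipeline`, its parameters, and the stub engines are hypothetical.

```python
def run_pipeline(audio, slides, transcribe, analyze_speech, analyze_slide, generate):
    """Run Steps 3 to 6 in order; each engine is passed in as a callable."""
    text = transcribe(audio)                                   # Step 3
    speech_notes = analyze_speech(text)                        # Step 4
    slide_notes = [analyze_slide(image) for image in slides]   # Step 5
    prompt_lines = [
        "Point out specific areas for improvement.",
        f"Speech analysis: {speech_notes}",
    ]
    for number, notes in enumerate(slide_notes, start=1):
        prompt_lines.append(f"Slide {number} analysis: {notes}")
    return generate("\n".join(prompt_lines))                   # Step 6

# Stub engines standing in for speech recognition, OCR, and the generative model.
feedback = run_pipeline(
    audio=b"raw-audio-bytes",
    slides=[b"slide-1-image", b"slide-2-image"],
    transcribe=lambda audio: "hello and welcome",
    analyze_speech=lambda text: f"{len(text.split())} words",
    analyze_slide=lambda image: f"{len(image)} bytes",
    generate=lambda prompt: prompt.upper(),
)
```

Passing the engines as arguments makes it easy to swap one recognition library for another or to test the flow with stubs, as done here.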

[0355] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0356] This invention is a presentation editing system that combines speech recognition and image recognition technologies with an emotion engine that recognizes the user's emotions. The specific system configuration and embodiments of the invention are shown below.

[0357] System Configuration

[0358] The system consists of a user-operated terminal, a server that performs data analysis, and a communication network connecting them. The user inputs audio and presentation data through the terminal and sends it to the server. The server performs speech recognition, image recognition, and emotion recognition, generates feedback based on the analysis results, and sends it back to the terminal.

[0359] Method for carrying out the invention

[0360] 1. Data entry and transmission

[0361] When preparing a presentation, users record audio data on their device and upload presentation materials as electronic files. This data is sent to the server via the application.

[0362] 2. Voice Analysis and Emotion Recognition

[0363] The server converts the received audio data into text using a speech recognition engine and simultaneously analyzes it using an emotion engine. The emotion engine recognizes the user's emotions (e.g., nervousness, anxiety, confidence) from factors such as tone of voice and speaking speed. This makes it possible to evaluate the effectiveness of emotional expression in a presentation.

[0364] 3. Material analysis

[0365] The server uses an image recognition engine to analyze the structure, design, and consistency of the presentation materials. It evaluates whether the materials are easy to understand and visually effective, and suggests areas for improvement as needed.

[0366] 4. Feedback generation and display

[0367] The server integrates voice and sentiment analysis results with document analysis results to generate feedback. This feedback includes specific areas for improvement in speaking style during presentations, guidelines for emotional expression, and suggestions for document revisions. The generated feedback is sent to the terminal and displayed to the user in an easy-to-understand format.

[0368] 5. User improvements and verification

[0369] Users can revise each element of their presentation based on the generated feedback. If necessary, they can re-enter data, receive new feedback, and further improve the quality of their presentation.

[0370] Specific example

[0371] Consider a scenario where a user is giving a presentation introducing a new product. The user uploads the audio and materials of their presentation to the system, and receives feedback from the server such as, "You sound nervous in the introduction; try to relax more," or "Slide 2 has too many graphs and is difficult to read; highlight the important data." This provides specific guidance for improving the presentation.

[0372] The following describes the processing flow.

[0373] Step 1:

[0374] The user records the audio of the presentation on their device and prepares the presentation materials as electronic files. Next, they launch a dedicated application and perform the input procedure to upload these audio data and material files to the server.

[0375] Step 2:

[0376] The terminal packages the uploaded audio data and document files into a data package and prepares it for transmission to the server. It sends the data to the server using a secure protocol (e.g., HTTPS).

[0377] Step 3:

[0378] The server adds the received audio data to the analysis queue and starts the speech recognition engine. The engine converts the audio data into text data, making the content of the presentation available in written form.

[0379] Step 4:

[0380] The server then feeds the audio data into the emotion recognition engine. The emotion recognition engine analyzes emotional patterns from the tone, pitch, and speed of the voice to identify the emotional state during the presentation (for example, tension, joy, or calmness).

[0381] Step 5:

[0382] The server passes the presentation file to an image recognition engine for analysis. It evaluates the layout, design, and information placement within the document, and analyzes the visual consistency and appropriateness of the amount of information.

[0383] Step 6:

[0384] The server integrates the results of speech recognition, emotion recognition, and document analysis to generate feedback. This feedback includes suggestions for improving the overall quality of the presentation, as well as areas for improvement in emotional expression that deserve particular attention.

[0385] Step 7:

[0386] The server packages the generated feedback and sends it to the terminal. The terminal receives the feedback and launches an interface that displays detailed error reports and success evaluations to the user.

[0387] Step 8:

[0388] Based on the feedback displayed, users revise the content, delivery, and emotional expression of their presentations. If necessary, they re-enter the data into the system to obtain further feedback for improvement.

[0389] (Example 2)

[0390] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0391] In today's presentation environment, presenters spend a great deal of time preparing their speaking style and materials, but they often struggle to identify specific areas for improvement and receive effective feedback. Furthermore, while emotional expression significantly impacts the success of a presentation, it's difficult for presenters to evaluate this aspect themselves.

[0392] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0393] In this invention, the server includes a speech recognition device that receives audio data and converts the audio into text, an emotion analysis device that detects the speaker's emotional state from the audio data, and a device that receives presentation materials and evaluates the structure and content of the materials using image recognition technology. This allows presenters to receive specific, actionable feedback on each element of their presentation: audio, emotion, and materials.

[0394] "Audio data" refers to information that represents the waveform of sound acquired through an input device such as a microphone in digital format.

[0395] A "speech recognition device" refers to a technology and device that receives speech data as input, analyzes that speech, and converts it into corresponding text information.

[0396] An "emotion analysis device" refers to a technology and device that analyzes a speaker's emotional state from voice data and its associated information, and identifies the type and intensity of that emotion.

[0397] "Emotional state" refers to the speaker's psychological state inferred from the characteristics of their voice, and includes, for example, joy, sadness, and tension.

[0398] "Image recognition technology" refers to technologies that analyze image data and recognize its content and structure, with examples including layout analysis and visual content recognition.

[0399] "Presentation materials" refer to visual content used during presentations or explanations, and may be provided in the form of slides, graphs, and other visual aids.

[0400] "Information equipment" refers to devices used for inputting and displaying data, and includes computers and smartphones.

[0401] A "user interface" refers to the screens and operating methods that operate on information devices and allow users to interact with the system.

[0402] This invention is realized through a system that combines speech and image recognition technology, as well as emotion analysis technology, to improve the quality of presentations. This system consists of a terminal operated by the user, a server that analyzes data, and a network that connects them.

[0403] Users prepare presentation audio and materials using their devices. Audio data is recorded using a microphone and saved as an electronic file. Material data is uploaded in electronic file formats such as PDF and PPT. This data is transmitted from the device to the server via the internet.

[0404] The server consists of hardware and software that perform multiple processes. Audio data is converted to text by speech recognition software (e.g., Google Speech-to-Text API or IBM Watson). Simultaneously, sentiment analysis software analyzes the emotional state from the audio data. This analysis includes elements such as voice tone, pitch, and speaking speed.
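To illustrate how elements such as pitch and speaking speed might feed an emotion estimate, a deliberately simple rule-based sketch follows. It is a toy example only. A practical emotion analysis would normally rely on a trained model, and the `VoiceFeatures` structure, the thresholds, and the output labels here are all assumptions rather than details from the description.

```python
from dataclasses import dataclass

@dataclass
class VoiceFeatures:
    pitch_std_hz: float      # how much the pitch varies, in hertz
    words_per_minute: float  # speaking speed

def estimate_state(features, fast_wpm=170.0, unstable_pitch_hz=40.0):
    """Toy rule: fast speech with unstable pitch is labeled as tension."""
    fast = features.words_per_minute > fast_wpm
    unstable = features.pitch_std_hz > unstable_pitch_hz
    if fast and unstable:
        return "tension"
    if not fast and not unstable:
        return "calmness"
    return "undetermined"

# Hypothetical feature values for the introduction of a presentation.
intro_state = estimate_state(VoiceFeatures(pitch_std_hz=55.0, words_per_minute=190.0))
```

Returning "undetermined" for mixed signals avoids giving the user feedback about an emotion that the features do not clearly support.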

[0405] For the document data, image recognition software (e.g., OpenCV or Tesseract) is used to analyze the layout, consistency, and visual effect of the document. This allows us to evaluate whether the document is clearly structured and easy for viewers to understand.
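One simple form of the consistency analysis mentioned above is to compare a layout measurement across slides and flag the outliers. The sketch below does this for the left margin of each slide. The input list, the 10-pixel tolerance, and the function name are hypothetical, and the margins are assumed to have been measured by a prior image recognition step.

```python
from statistics import median

def inconsistent_slides(left_margins, tolerance=10):
    """Return 1-based slide numbers whose left margin is far from the median."""
    middle = median(left_margins)
    return [
        number
        for number, margin in enumerate(left_margins, start=1)
        if abs(margin - middle) > tolerance
    ]

# Hypothetical left margins, in pixels, measured on five slides.
flagged = inconsistent_slides([50, 52, 48, 90, 50])
```

Using the median rather than the mean keeps one badly aligned slide from shifting the reference value.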

[0406] The server integrates the analysis results of voice, emotion, and materials to generate specific feedback. This feedback includes suggestions for improving speaking style and revisions to materials. The generated feedback is sent back to the terminal via the internet. The feedback is then displayed in the application on the terminal, presented in a format that is easy for the user to understand.

[0407] One concrete example is a scenario where a user gives a presentation introducing a new product. When a user uploads the audio and materials of their presentation, they receive feedback from the server such as, "You sound nervous in the introduction, so try to relax when you speak," or "There are too many graphs on slide 2, making it difficult to read, so you should emphasize the important points." This feedback allows the user to improve their presentation.

[0408] Using a generative AI model, an example of a prompt message could be: "When a user records a presentation introducing a new product and sends the materials to the server, analyze the user's emotions from the audio data and evaluate the effectiveness of the materials. The feedback should include specific guidance regarding the user's proficiency."

[0409] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0410] Step 1:

[0411] The user launches a dedicated application on their device and simultaneously records the presentation audio using a microphone while preparing the presentation materials as electronic files (PDF or PPT). The device saves this audio and material data and sends them to the server via Wi-Fi or a mobile network. The input consists of the audio and material files, and the output is the transmission of data to the server.

[0412] Step 2:

[0413] The server converts the received audio file into text using a speech recognition device. Specifically, it uses software such as the Google Speech-to-Text API or IBM Watson to analyze the audio data and convert it into corresponding text. The input to this process is audio data, and the output is text data.

[0414] Step 3:

[0415] The server uses an emotion analysis device to analyze voice data and detect the user's emotional state based on indicators such as voice tone, pitch, and speaking speed. The input to this process is voice data, and the output is the detected emotion information. The emotion analysis uses an algorithm to determine the speaker's psychological state.

[0416] Step 4:

[0417] The server analyzes the received document files using image recognition technology. Specifically, it uses OpenCV and Tesseract to evaluate the layout, consistency, and visual effect of the documents. The input is the document file, and the output is the analysis result of the document.

[0418] Step 5:

[0419] The server integrates speech-to-text, sentiment information, and document analysis results to generate user-facing feedback. This feedback includes suggestions for improving speaking style, guidance on emotional expression, and slide revisions. The input consists of the results of each analysis, while the output is the integrated feedback.

[0420] Step 6:

[0421] The server sends the generated feedback to the terminal. The terminal displays this feedback within the application, presenting it in a format easily understandable to the user. The input is the generated feedback, and the output is the information displayed to the user.

[0422] Step 7:

[0423] The user modifies each element of the presentation based on the displayed feedback. If necessary, the user can resend the newly modified data from their device to the server to receive additional feedback. The input is the feedback-based modification process, and the output is the improved presentation.

[0424] (Application Example 2)

[0425] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0426] In today's commercial environment, interactive communication directly impacts customer satisfaction, making the improvement of store staff's customer service skills a crucial issue. However, traditional customer service training methods struggle to accurately evaluate individual staff members' emotional expressions and conversational flow. This creates a problem in providing specific and effective feedback tailored to each staff member's abilities.

[0427] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0428] In this invention, the server includes an acoustic recognition means for converting audio data into text, a method for analyzing the layout and content of materials, an algorithm for integrating the analysis results of audio and materials to generate feedback, and a function for evaluating emotions and providing guidance for improving customer service skills. This makes it possible to improve customer service skills by providing specific areas for improvement tailored to each individual staff member.

[0429] "Acoustic recognition means" refers to a function that converts audio data into text, and is a technology that enables the processing of audio information as textual information.

[0430] "Structure" refers to a mechanism for analyzing the flow, tempo, and pacing of speech from text, and is an analytical method for evaluating the characteristics of linguistic expression.

[0431] A "method" is a way of analyzing the layout and content of a document using image recognition technology, and is a process for evaluating the structure and effectiveness of a document based on visual information.

[0432] An "algorithm" is a series of processing steps for integrating the results of audio and material analysis to generate feedback, and is a computational method for combining data to produce useful information.

[0433] A "device" is hardware or software equipped with the functionality to transmit generated feedback to a terminal, and is a physical or virtual interface for enabling information transmission.

[0434] "Function" refers to the ability to evaluate emotions and provide guidance for improving customer service skills; it is a system for analyzing individual emotional expressions and providing appropriate advice.

[0435] "Applied technology" refers to techniques that recognize emotions, integrate the results, and enhance feedback, representing innovative technologies for improving information delivery by utilizing diverse data.

[0436] A "generative AI model" is a machine learning model that uses prompt sentences based on training data to suggest effective ways to improve conversations, and is an artificial intelligence technology for automated knowledge processing.

[0437] A "prompt statement" is an input phrase used to elicit a specific response from a generative AI model, and is an instruction statement that facilitates interaction with the AI.

[0438] The system for realizing this application consists of an application installed on a terminal and a server. The user uses the terminal to record audio data and uploads related materials to the server. The server converts the received audio data into text using acoustic recognition. It also utilizes emotion recognition to analyze the user's emotions from the audio data and generates appropriate feedback based on that information.

[0439] Regarding the materials, image recognition technology is used to analyze the consistency and design of the layout and to make suggestions for enhancing the visual effect. The server integrates these results to generate feedback. This feedback includes specific guidelines for improving customer service skills and is sent to the terminal.

[0440] The specific hardware used will be smartphones and tablets operated by the user. For the software, speech recognition and sentiment analysis will be implemented using Python, while machine learning frameworks such as TensorFlow will be used for image recognition. AWS Lambda will be used as the server backend, enabling serverless data processing.
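A serverless backend of the kind mentioned above could expose the analysis through a handler function. The sketch below uses the conventional Python handler signature for AWS Lambda and assumes that the request arrives with a JSON string in the `body` field, as with an API Gateway proxy integration. The `transcript` field and the placeholder `analyze_transcript` function are hypothetical.

```python
import json

def analyze_transcript(text):
    """Placeholder analysis; a stand-in for the real analysis modules."""
    return {"word_count": len(text.split())}

def lambda_handler(event, context):
    """Parse the request body, run the analysis, and return a JSON response."""
    try:
        body = json.loads(event.get("body") or "{}")
        transcript = body["transcript"]
    except (KeyError, TypeError, json.JSONDecodeError):
        error = {"error": "transcript is required"}
        return {"statusCode": 400, "body": json.dumps(error)}
    return {"statusCode": 200, "body": json.dumps(analyze_transcript(transcript))}

# Local invocation with a hypothetical event; no AWS account is needed.
event = {"body": json.dumps({"transcript": "hello there world"})}
response = lambda_handler(event, None)
```

Because the handler is a plain function, it can be exercised locally as shown before being deployed.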

[0441] For example, if a user uploads an audio recording of themselves explaining the features of a new product to a customer in a store, the server will generate feedback such as, "Your explanation is too fast; please speak a little slower." In this process, a generative AI model is used, and an example of a prompt might be, "Please tell me some phrases that will effectively introduce the product to the customer." This allows users to receive specific advice to improve their customer service skills.

[0442] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0443] Step 1:

[0444] The user records audio data using their device and prepares related materials in electronic file format. The audio data is saved on the device as an audio file, and the material data is formatted as an electronic file. The user uploads this data to the server via an application on their device.

[0445] Step 2:

[0446] The server converts the received audio data into text using acoustic recognition. Specifically, it analyzes the audio data using the Google Speech-to-Text API and outputs the results in text format. This process stores the linguistic content of the audio data as text in the database.

[0447] Step 3:

[0448] The server uses emotion recognition capabilities to evaluate emotions from transcribed speech. The input text data is passed through an emotion analysis algorithm, and the user's emotional state (e.g., tension, anxiety, confidence) is analyzed and output. This output is then used to generate feedback.

[0449] Step 4:

[0450] The server applies image recognition technology to analyze the layout and design of the materials. It uses machine learning frameworks such as TensorFlow to analyze the material data, and based on the analysis results, it evaluates the visual effects and identifies areas for improvement. These results provide information to determine how the materials will be received by users.

[0451] Step 5:

[0452] The server integrates the results of voice analysis and document analysis to generate feedback. Using a generative AI model, it generates feedback that suggests specific improvement measures and guidelines, taking prompts into consideration, and outputs this feedback in text format. The prompt may include specific instructions such as, "Please tell me some phrases that will effectively introduce the product to customers."

[0453] Step 6:

[0454] The feedback is sent to the device and displayed to the user. Based on this feedback, the user takes action to improve their skills. The user can then re-enter data and make further improvements.

[0455] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0456] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. A prompt containing instructions and inference data, such as audio data representing speech, text data representing text, and image data representing images, are input to the data generation model 58. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0457] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.

[0458] [Third Embodiment]

[0459] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.

[0460] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.

[0461] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0462] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.

[0463] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0464] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0465] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0466] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0467] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0468] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0469] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0470] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".

[0471] This invention is a presentation editing system that uses speech recognition and image recognition technology. The system configuration and specific embodiments for carrying out the invention are described below.

[0472] System Configuration

[0473] This system consists of a user terminal, a server that processes data, and a network that connects them. The user uses the terminal to input audio data and materials for their presentation and sends them to the server. The server analyzes this data, generates feedback, and then sends it back to the terminal for the user to see.

[0474] Method for carrying out the invention

[0475] 1. Data entry and transmission

[0476] Users practice their presentations on their own devices and record the audio. They also prepare presentation materials (slides and documents) as electronic files. These audio data and material files are then sent to the server via a dedicated application.

[0477] 2. Voice Analysis

[0478] The server performs speech recognition using the received audio data. This speech recognition engine converts the audio into text and analyzes the way the speaker speaks, their tone, and their pacing. For example, it generates specific feedback such as, "You're speaking too fast in the introduction."
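As a non-limiting illustration of how pacing feedback of this kind might be derived, the following sketch computes the speaking rate for named sections of a recording. It assumes that the speech recognition engine can supply a start and end time for each recognized word; the `Word` type, the section boundaries, and the threshold of 160 words per minute are hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the beginning of the recording
    end: float


def words_per_minute(words, t0, t1):
    """Speaking rate over [t0, t1), counting words whose start time falls in the window."""
    if t1 <= t0:
        raise ValueError("window must have positive length")
    count = sum(1 for w in words if t0 <= w.start < t1)
    return count * 60.0 / (t1 - t0)


def pace_feedback(words, sections, max_wpm=160.0):
    """Return one message per section whose rate exceeds max_wpm (hypothetical threshold)."""
    messages = []
    for name, t0, t1 in sections:
        rate = words_per_minute(words, t0, t1)
        if rate > max_wpm:
            messages.append(
                f"You're speaking too fast in the {name} ({rate:.0f} words per minute)."
            )
    return messages
```

For example, `pace_feedback(words, [("introduction", 0, 10), ("body", 10, 20)])` returns a message only for the sections whose rate is above the threshold.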

[0479] 3. Material analysis

[0480] The server analyzes the received presentation materials using image recognition technology. It evaluates the structure, layout, text readability, and placement of visual elements of the materials, and generates suggestions such as, "Slide 3 contains too much information and needs to be rearranged for clarity."

[0481] 4. Feedback generation and display

[0482] The server integrates the results of audio and document analysis to create comprehensive feedback. This feedback is sent to the terminal in the form of multiple areas for improvement, providing users with information that allows them to efficiently improve their presentations.

[0483] 5. User modification and verification

[0484] Users review the feedback displayed on their devices and revise the content and delivery of their presentations. Afterward, they re-enter data into the system as needed to receive further feedback, thereby improving the quality of their presentations.

[0485] Specific example

[0486] For example, suppose a user is preparing a five-minute presentation on a marketing strategy. The user uploads a recording of their presentation and the presentation materials to the system. The server provides specific suggestions for improving the presentation style, such as "slow down the pace of the introduction and take longer pauses when stating the problem," as well as feedback on the materials, such as "the market analysis chart on the second slide is too complex; we suggest simplifying it visually."

[0487] This format allows users to receive multifaceted feedback and autonomously refine their presentations.

[0488] The following describes the processing flow.

[0489] Step 1:

[0490] The user uses their device to prepare the audio data and document files for the presentation and opens a dedicated application. Through the application, they upload the audio data and document files to the server.

[0491] Step 2:

[0492] The device packages the uploaded data and sends it to the server using a secure communication protocol. Additional information, such as the presentation topic and time constraints specified by the user, is also sent at the same time.
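As a non-limiting illustration of the packaging described in this step, the following sketch bundles the audio data, the material files, and the user-specified additional information into a single ZIP archive with a JSON manifest. The archive layout, the file name `audio.wav`, and the manifest field names are assumptions made for this example.

```python
import io
import json
import zipfile


def build_upload_package(audio_bytes, materials, topic, time_limit_seconds):
    """Bundle audio, material files, and user-specified metadata into one ZIP archive.

    `materials` maps a file name to its bytes. The manifest keys are hypothetical.
    """
    manifest = {
        "topic": topic,
        "time_limit_seconds": time_limit_seconds,
        "audio": "audio.wav",
        "materials": sorted(materials),
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("manifest.json", json.dumps(manifest))
        archive.writestr("audio.wav", audio_bytes)
        for name, data in materials.items():
            archive.writestr(f"materials/{name}", data)
    return buffer.getvalue()
```

Keeping the topic and time constraint in a manifest inside the same archive means the server receives the additional information together with the data it describes.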

[0493] Step 3:

[0494] The server places the received audio data into an analysis queue and activates the speech recognition module. This module converts the audio data into text and extracts speech characteristics from it. These characteristics include speed, intonation, and pauses.
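As a non-limiting illustration of the analysis queue and the extraction of pauses, the following sketch processes queued jobs in arrival order and reports the silences between consecutive words. It assumes that each job carries a list of (start, end) times per recognized word; the job format and the minimum gap of 0.5 seconds are hypothetical, and intonation analysis, which requires the audio signal itself, is not covered here.

```python
import queue


def extract_pauses(word_times, min_gap=0.5):
    """Return (pause_start, duration) for each silence of at least min_gap seconds."""
    pauses = []
    for (_, prev_end), (next_start, _) in zip(word_times, word_times[1:]):
        gap = next_start - prev_end
        if gap >= min_gap:
            pauses.append((prev_end, gap))
    return pauses


def drain_analysis_queue(jobs):
    """Process every queued (job_id, word_times) item in arrival order."""
    results = {}
    while True:
        try:
            job_id, word_times = jobs.get_nowait()
        except queue.Empty:
            break
        pauses = extract_pauses(word_times)
        results[job_id] = {
            "pause_count": len(pauses),
            "total_pause_seconds": sum(duration for _, duration in pauses),
            "pauses": pauses,
        }
    return results
```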

[0495] Step 4:

[0496] The server uses an image recognition module to analyze presentation files. This module evaluates the placement of text, shapes, and images within the slides, checking for visual consistency and readability of the materials.
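As a non-limiting illustration of a placement check of this kind, the following sketch takes the bounding boxes of the elements detected on one slide and reports elements that overlap or extend into the slide margin. How the boxes are detected is outside the sketch; the `Box` type and the margin of 5% of the slide size are assumptions made for this example.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Box:
    label: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


def overlaps(a, b):
    """True when the two rectangles share interior area."""
    return a.x < b.x + b.w and b.x < a.x + a.w and a.y < b.y + b.h and b.y < a.y + a.h


def layout_issues(boxes, slide_w, slide_h, margin=0.05):
    """List margin and overlap problems for one slide (margin is a hypothetical ratio)."""
    issues = []
    mx, my = slide_w * margin, slide_h * margin
    for box in boxes:
        outside = (
            box.x < mx
            or box.y < my
            or box.x + box.w > slide_w - mx
            or box.y + box.h > slide_h - my
        )
        if outside:
            issues.append(f"{box.label} extends into the slide margin")
    for a, b in combinations(boxes, 2):
        if overlaps(a, b):
            issues.append(f"{a.label} overlaps {b.label}")
    return issues
```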

[0497] Step 5:

[0498] The server compares the results of speech recognition and image recognition to generate integrated feedback. This feedback includes specific areas for improvement and recommended corrections.
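As a non-limiting illustration of the integration step, the following sketch merges findings from the speech analysis and the material analysis into a single list ordered by severity, each with an area for improvement and a recommended correction. The `Finding` type and the severity scale of 1 to 3 are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    source: str      # "speech" or "materials"
    location: str    # e.g. "introduction" or "slide 3"
    severity: int    # 1 (minor) to 3 (major); the scale is hypothetical
    issue: str
    suggestion: str


def integrate_feedback(speech_findings, material_findings, limit=5):
    """Merge both result sets, most severe first, keeping at most `limit` items."""
    merged = sorted(
        list(speech_findings) + list(material_findings),
        key=lambda f: -f.severity,  # stable sort: ties keep their input order
    )
    return [
        {
            "area": f"{f.source}: {f.location}",
            "issue": f.issue,
            "recommendation": f.suggestion,
        }
        for f in merged[:limit]
    ]
```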

[0499] Step 6:

[0500] The server repackages the generated feedback and sends it to the terminal. A secure protocol is used for transmission, ensuring the confidentiality of the data.

[0501] Step 7:

[0502] The device analyzes the received feedback and displays it in a format that is easy for the user to understand. The interface includes features that provide detailed explanations for each area of improvement, making it clear to the user which parts of the presentation need to be revised.

[0503] Step 8:

[0504] Users can revise their presentation materials and delivery based on the feedback provided, and use the system again to obtain new feedback as needed. This allows for a gradual improvement in the quality of the presentation.

[0505] (Example 1)

[0506] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0507] In today's business environment, the quality of presentations is a crucial factor in success. However, there is a lack of effective tools for objectively evaluating and improving one's speaking style and the content of presentation materials. Therefore, there is a need for support in improving one's presentation skills by comprehensively analyzing audio and visual information and providing specific areas for improvement.

[0508] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0509] In this invention, the server includes recognition means for receiving acoustic information and converting that acoustic information into textual information, means for analyzing the flow, rate, and pauses of speech from the textual information, and means for receiving visual materials and analyzing the composition and content of the materials using image processing technology. This makes it possible to comprehensively analyze acoustic and visual information and provide the user with effective suggestions for improvement.

[0510] "Acoustic information" refers to audio data and related sound information, and is primarily the data targeted for speech recognition.

[0511] "Textual information" refers to text data generated by speech recognition based on acoustic information, and is information expressed in a format that is easily understood visually by humans.

[0512] "Recognition means" refers to the technologies and devices used to process acoustic information and convert it into textual information, and usually includes speech recognition algorithms.

[0513] "Speech flow" refers to the structure and order of spoken language, and is an element that should be considered in order to maintain the natural flow of speech.

[0514] "Speech rate" refers to the tempo of speech and is usually measured by the amount of sound uttered per unit of time.

[0515] "Pauses" refers to the silences or breaks that a speaker inserts between utterances, and they are an important element for improving the clarity of speech.

[0516] "Visual materials" refer to media used to present information visually, such as slides and documents created for presentations.

[0517] "Image processing technology" refers to the technology of using computers to analyze and transform image data and extract useful information.

[0518] "Composition" refers to the arrangement and combination of text and image elements in visual materials, and is important for effectively conveying information.

[0519] "Content" refers to the information that visual materials attempt to convey, and plays a central role in expressing the theme and message of the material.

[0520] "Areas for improvement" refers to specific suggestions and advice for the user to improve their presentation, derived from the analysis results of acoustic and visual information.

[0521] This invention is a system for analyzing acoustic and visual information and providing users with suggestions for improvement. The following describes how to implement the system in detail.

[0522] The server receives acoustic information and uses a speech recognition engine to convert it into text. For example, it uses a speech recognition service such as "Google Speech-to-Text" or "Amazon Transcribe." The server analyzes the converted text and evaluates the flow, rate, and pauses of the speech. In this process, it uses a generative AI model to generate specific feedback regarding the emphasis placed on particular words and phrases, as well as the pace of the speech.

[0523] The terminal is responsible for transmitting audio information and visual materials recorded by the user to the server. Visual materials include slides and documents, and the server analyzes the received materials using image processing technology. By using image recognition tools such as OpenCV and Tesseract, the structure and content of the materials are evaluated, and improvement suggestions are generated.

[0524] Users can improve the quality of their presentations by receiving feedback from the server and repeatedly practicing. Based on the improvements, users can record new practice sessions and input them into the system again.

[0525] As a concrete example of implementation, let's say a user is preparing a 5-minute presentation introducing a new product. The user uploads a recording of their presentation and the presentation materials to the system. The server provides specific suggestions for improvement, such as "improve your delivery in the introduction and simplify the design of slide 2." Based on these suggestions, the user can practice repeatedly.

[0526] An example of a prompt message that could be sent to the server is: "Analyze the audio data and materials of the presentation and generate specific areas for improvement. The audio data is what the user said, and the materials include slides and documents."
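As a non-limiting illustration of how such a prompt might be assembled before it is passed to the generative AI model, the following sketch combines the instruction quoted above with the transcript and the text extracted from each slide. The function name `build_prompt`, the optional topic line, and the section labels are assumptions made for this example.

```python
def build_prompt(transcript, slide_texts, topic=None):
    """Assemble one prompt string from the transcript and the per-slide text."""
    lines = [
        "Analyze the audio data and materials of the presentation and "
        "generate specific areas for improvement.",
        "The audio data is what the user said, and the materials include "
        "slides and documents.",
    ]
    if topic:
        lines.append(f"Topic: {topic}")  # optional, user-specified
    lines.append("")
    lines.append("Transcript:")
    lines.append(transcript.strip())
    for number, text in enumerate(slide_texts, start=1):
        lines.append("")
        lines.append(f"Slide {number}:")
        lines.append(text.strip())
    return "\n".join(lines)
```

The resulting string can be supplied as the prompt, with the audio and image data supplied separately as inference data where the model accepts them.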

[0527] This embodiment of the invention provides a concrete method for efficiently improving the quality of presentations through collaboration between servers, terminals, and users.

[0528] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0529] Step 1:

[0530] Users practice their presentations on their own devices and record the audio information (audio data). They also prepare visual materials (slides and documents) for their presentations. The input consists of audio information and visual materials, which are then sent to the server using a dedicated application.

[0531] Step 2:

[0532] The server receives acoustic information sent by the user and converts it into text information using a speech recognition engine. The input is acoustic information, and the server analyzes the waveform of the voice to obtain output as a string of characters. In this process, the server uses technologies such as "Google Speech-to-Text" or "Amazon Transcribe".

[0533] Step 3:

[0534] The server analyzes the flow, rate, and pauses of speech from the converted text information. The text information is the input, and a generative AI model is used to evaluate the speaking style and tempo and to output feedback for improvement. The server analyzes particular words and speech patterns to provide appropriate suggestions.

[0535] Step 4:

[0536] In parallel, the server receives the visual materials and analyzes their composition and content using image processing technology. The input for this step is the visual materials, and the layout and text legibility are checked automatically. The output is specific suggestions for improving the materials. Technically, the analysis is performed using OpenCV or Tesseract.
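As a non-limiting illustration of a legibility check on the text of one slide, the following sketch consumes word-level OCR output in the dictionary layout returned by `pytesseract.image_to_data` with `output_type=pytesseract.Output.DICT` (parallel lists under keys such as `text`, `height`, and `conf`). The thresholds for text height, recognition confidence, and word count are hypothetical values chosen for the example, and low OCR confidence is used here only as a rough proxy for text that is hard to read.

```python
def legibility_report(ocr, min_height_px=18, min_conf=60.0, max_words=60):
    """Summarize word count, small text, and low-confidence words for one slide image.

    `ocr` is assumed to have the shape produced by, for example:
        pytesseract.image_to_data(Image.open(path), output_type=pytesseract.Output.DICT)
    All thresholds are hypothetical.
    """
    words = []
    for text, height, conf in zip(ocr["text"], ocr["height"], ocr["conf"]):
        # Rows with empty text or a confidence of -1 describe layout blocks, not words.
        if text.strip() and float(conf) >= 0:
            words.append((text, int(height), float(conf)))

    small = [t for t, h, _ in words if h < min_height_px]
    unclear = [t for t, _, c in words if c < min_conf]

    suggestions = []
    if len(words) > max_words:
        suggestions.append("The slide contains a large amount of text; consider splitting it.")
    if small:
        suggestions.append("Some text is small; consider a larger font size.")
    if unclear:
        suggestions.append("Some text was hard to recognize; check contrast and font.")
    return {
        "word_count": len(words),
        "small_text": small,
        "low_confidence": unclear,
        "suggestions": suggestions,
    }
```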

[0537] Step 5:

[0538] The server integrates feedback from acoustic and visual materials to generate comprehensive improvement suggestions. The output includes detailed feedback, including areas for improvement, providing information to enhance the user's presentation skills from multiple perspectives.

[0539] Step 6:

[0540] The server sends feedback to the user's device. The user uses this feedback to improve their presentation. By reviewing the content displayed on the device and practicing again, it is possible to improve the quality of the presentation.

[0541] (Application Example 1)

[0542] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0543] In modern content distribution platforms, providing high-quality presentations and content is crucial, but it's not easy for individual users to improve their own presentation skills and the quality of their materials. Furthermore, there's a lack of systems that effectively analyze user-recorded audio and slides and provide specific suggestions for improvement. There's also a need for support in enhancing the quality of content, including presentations, through the input of prompts using generative AI models.

[0544] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0545] In this invention, the server includes a speech recognition means for receiving audio data and converting the audio into text, a means for analyzing the flow, tempo, and pauses of speech from the text, and a means for receiving presentation materials and analyzing the layout and content of the materials using image recognition technology. This makes it possible to analyze audio files and slide images recorded by the user and use the feedback to qualitatively improve the materials for the content distribution platform. Furthermore, by generating prompt sentences and inputting them into a generative AI model, it is possible to provide feedback that improves the overall quality of the content.

[0546] "Speech recognition means" refers to a device or software that has the function of analyzing speech data and converting it into text.

[0547] "Image recognition technology" is a technology that analyzes image data and determines its content and structure.

[0548] A "feedback generation method" is a function that provides users with suggestions for improvement based on the results of processed audio and image data.

[0549] A "terminal" is an electronic device that a user operates to input voice and image data and receive feedback.

[0550] A "content distribution platform" is an online service that provides users with various digital content and enables interaction.

[0551] "Means of qualitatively improving materials" refer to methods and techniques for evaluating provided materials and presentations and improving their content.

[0552] A "generative AI model" is an artificial intelligence algorithm that generates new information or results based on given data.

[0553] A "prompt statement" is an instruction given to a generative AI model, and it is a phrase used to obtain a specific output.

[0554] The system for implementing this invention consists of a user terminal, a data processing server, and a network that connects them. The user inputs audio data and presentation materials via the terminal and sends them to the server. At this time, the terminal records the audio and uploads the presentation materials as image files to the server.

[0555] The server converts the received audio data into text using speech recognition technology and analyzes the flow and tempo of the speech. The "speech_recognition" library is used for this process. Furthermore, for presentation materials, image recognition technology is used to analyze the content and layout, and the readability and visual arrangement of the information are evaluated. The "PIL" and "pytesseract" libraries are used at this stage.

[0556] The analysis results are integrated, and a feedback generation mechanism generates improvement suggestions for the user. These suggestions, based on both audio and document data, help the user prepare qualitatively improved material for the content distribution platform. The feedback is input into the AI model as prompts, resulting in an optimized presentation. The user can receive the feedback on their device and use it to improve their presentation.

[0557] For example, when a user prepares an online lecture, they upload audio and slide materials to the system, and the server provides specific feedback such as "adjust the tempo of the audio introduction" or "simplify the slide structure." This allows the user to significantly improve the overall quality of the delivered content. An example of a prompt to the generative AI model would be, "Analyze the audio and materials of the presentation and point out specific areas for improvement. For example, include feedback on the tempo of speaking and the visual elements of the slides."

[0558] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0559] Step 1:

[0560] The user uses a device to record the audio of the presentation and prepare the materials as image files. The input consists of audio and image files, which serve as the basic data for subsequent analysis.

[0561] Step 2:

[0562] The terminal sends the prepared audio and image files to the server. At this point, the input is the audio and image files, and the output is the transmission of data to the server. Once the data reaches the server, the analysis process is ready to begin.

[0563] Step 3:

[0564] The server uses speech recognition to convert audio files into text data. The input is an audio file, and the output is the converted text. Here, the "speech_recognition" library is used to convert audio data into text.
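As a non-limiting illustration of this conversion, the following sketch uses the `Recognizer`, `AudioFile`, `record`, and `recognize_google` interfaces of the third-party `speech_recognition` package (distributed as SpeechRecognition). The function name `transcribe_wav` and the `sr_module` parameter, which allows a substitute module to be supplied for testing, are assumptions made for this example; the choice of the Google Web Speech backend is likewise only one possibility.

```python
def transcribe_wav(path, sr_module=None, language="en-US"):
    """Convert an audio file to text with the speech_recognition package.

    `sr_module` is a hypothetical injection point; when omitted, the real
    package is imported (pip install SpeechRecognition).
    """
    if sr_module is None:
        import speech_recognition as sr_module

    recognizer = sr_module.Recognizer()
    with sr_module.AudioFile(path) as source:
        audio = recognizer.record(source)  # read the whole file into memory
    # Sends the audio to the Google Web Speech API and returns the transcript.
    return recognizer.recognize_google(audio, language=language)
```

In actual use, a call such as `transcribe_wav("practice.wav")` requires network access, and recognition errors raised by the package would need to be handled by the caller.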

[0565] Step 4:

[0566] The server analyzes text data and evaluates the flow, tempo, and pacing of speech. The input is text data, and the output is feedback data as a result of the analysis. It specifically extracts characteristics of the user's speech and indicates which areas can be improved.
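As a non-limiting illustration of characteristics that can be extracted from the text data alone, the following sketch computes the average sentence length and the proportion of filler words in a transcript. The filler-word list is hypothetical, and because plain text carries no timing information, a tempo measured in words per minute would additionally require the duration of the recording.

```python
import re

# Hypothetical list of filler words; a real deployment would tune this per language.
FILLERS = {"um", "uh", "er", "like", "basically", "actually"}


def flow_metrics(transcript):
    """Return simple text-only indicators of speech flow."""
    sentences = [s for s in re.split(r"[.!?]+", transcript) if s.strip()]
    words = re.findall(r"[a-z']+", transcript.lower())
    filler_count = sum(1 for w in words if w in FILLERS)
    return {
        "sentence_count": len(sentences),
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        "filler_ratio": filler_count / len(words) if words else 0.0,
    }
```

Note that a word such as "like" is counted as a filler even when it is used in its ordinary sense, so the ratio is only a rough indicator.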

[0567] Step 5:

[0568] The server uses image recognition technology to analyze presentation materials. The input is an image file, and the output is an evaluation of the layout and content. Here, the "PIL" and "pytesseract" libraries are used to extract text information from the slides and evaluate the design.

[0569] Step 6:

[0570] The server integrates the analysis results of audio and image data and uses a generative AI model to generate optimal feedback. The input is the analysis results data, and the output is feedback that includes detailed improvement suggestions. Prompt sentences are also generated during this process and input into the generative AI model.

[0571] Step 7:

[0572] The server sends the generated feedback to the terminal. The input is feedback data, and the output is a feedback presentation to the user. The suggestions are displayed on the terminal, and the user can use them to improve the quality of their presentation.

[0573] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0574] This invention is a presentation editing system that combines speech recognition and image recognition technologies with an emotion engine that recognizes the user's emotions. The specific system configuration and embodiments of the invention are shown below.

[0575] System Configuration

[0576] The system consists of a user-operated terminal, a server that performs data analysis, and a communication network connecting them. The user inputs audio and presentation data through the terminal and sends it to the server. The server performs speech recognition, image recognition, and emotion recognition, generates feedback based on the analysis results, and sends it back to the terminal.

[0577] Method for carrying out the invention

[0578] 1. Data entry and transmission

[0579] When preparing a presentation, users record audio data on their device and upload presentation materials as electronic files. This data is sent to the server via the application.

[0580] 2. Voice Analysis and Emotion Recognition

[0581] The server converts the received audio data into text using a speech recognition engine and simultaneously analyzes it using an emotion engine. The emotion engine recognizes the user's emotions (e.g., nervousness, anxiety, confidence) from factors such as tone of voice and speaking speed. This makes it possible to evaluate the effectiveness of emotional expression in a presentation.

[0582] 3. Material analysis

[0583] The server uses an image recognition engine to analyze the structure, design, and consistency of the presentation materials. It evaluates whether the materials are easy to understand and visually effective, and suggests areas for improvement as needed.

[0584] 4. Feedback generation and display

[0585] The server integrates voice and sentiment analysis results with document analysis results to generate feedback. This feedback includes specific areas for improvement in speaking style during presentations, guidelines for emotional expression, and suggestions for document revisions. The generated feedback is sent to the terminal and displayed to the user in an easy-to-understand format.

[0586] 5. User improvements and verification

[0587] Users can revise each element of their presentation based on the generated feedback. If necessary, they can re-enter data, receive new feedback, and further improve the quality of their presentation.

[0588] Specific example

[0589] Consider a scenario where a user is giving a presentation introducing a new product. The user uploads the audio and materials of their presentation to the system, and receives feedback from the server such as, "You sound nervous in the introduction; try to relax more," or "Slide 2 has too many graphs and is difficult to read; highlight the important data." This provides specific guidance for improving the presentation.

[0590] The following describes the processing flow.

[0591] Step 1:

[0592] The user records the audio of the presentation on their device and prepares the presentation materials as electronic files. Next, they launch a dedicated application and perform the input procedure to upload these audio data and material files to the server.

[0593] Step 2:

[0594] The terminal packages the uploaded audio data and document files into a data package and prepares it for transmission to the server. It sends the data to the server using a secure protocol (e.g., HTTPS).
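As a non-limiting illustration of preparing the transmission over a secure protocol, the following sketch builds an HTTP POST request with the Python standard library and refuses any destination that is not HTTPS. The function name `build_upload_request` and the content type `application/zip` are assumptions made for this example; the request is only constructed, not sent.

```python
import urllib.request
from urllib.parse import urlparse


def build_upload_request(url, package_bytes):
    """Create (but do not send) a POST request; plain HTTP destinations are rejected."""
    if urlparse(url).scheme != "https":
        raise ValueError("uploads must use HTTPS")
    return urllib.request.Request(
        url,
        data=package_bytes,
        method="POST",
        headers={"Content-Type": "application/zip"},  # assumed package format
    )
```

The request object returned here could then be passed to `urllib.request.urlopen` to perform the actual upload.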

[0595] Step 3:

[0596] The server adds the received audio data to the analysis queue and starts the speech recognition engine. The engine converts the audio data into text data, making the content of the presentation available as text.

[0597] Step 4:

[0598] The server then feeds the audio data into the emotion recognition engine. The emotion recognition engine analyzes emotional patterns from the tone, pitch, and speed of the voice to identify the emotional state during the presentation (for example, tension, joy, or calmness).
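As a non-limiting illustration of mapping voice features to an emotional state, the following toy sketch assigns the label whose reference profile is nearest to the measured pitch and speaking rate. It is a simplified stand-in for a trained emotion recognition model: the feature set, the reference profiles, and the scaling factors are all hypothetical placeholder values with no empirical basis.

```python
import math

# Hypothetical reference profiles:
# (mean pitch in Hz, pitch variability in Hz, speaking rate in words per minute).
PROFILES = {
    "tension": (220.0, 45.0, 185.0),
    "joy": (210.0, 60.0, 150.0),
    "calmness": (170.0, 25.0, 130.0),
}
# Hypothetical per-feature scales so that no single feature dominates the distance.
SCALES = (50.0, 20.0, 30.0)


def classify_emotion(pitch_mean, pitch_std, wpm):
    """Return the label of the nearest reference profile (scaled Euclidean distance)."""
    features = (pitch_mean, pitch_std, wpm)

    def distance(profile):
        return math.sqrt(
            sum(((f - p) / s) ** 2 for f, p, s in zip(features, profile, SCALES))
        )

    return min(PROFILES, key=lambda name: distance(PROFILES[name]))
```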

[0599] Step 5:

[0600] The server passes the presentation file to an image recognition engine for analysis. It evaluates the layout, design, and information placement within the document, and analyzes the visual consistency and appropriateness of the amount of information.

[0601] Step 6:

[0602] The server integrates the results of speech recognition, emotion recognition, and document analysis to generate feedback. This feedback includes suggestions for improving the overall quality of the presentation and, in particular, highlights areas for improvement in emotional expression.

[0603] Step 7:

[0604] The server packages the generated feedback and sends it to the terminal. The terminal receives the feedback and launches an interface that displays detailed error reports and success evaluations to the user.

[0605] Step 8:

[0606] Based on the feedback displayed, users revise the content, delivery, and emotional expression of their presentations. If necessary, they re-enter the data into the system to obtain further feedback for improvement.

[0607] (Example 2)

[0608] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0609] In today's presentation environment, presenters spend a great deal of time preparing their speaking style and materials, but they often struggle to identify specific areas for improvement and receive effective feedback. Furthermore, while emotional expression significantly impacts the success of a presentation, it's difficult for presenters to evaluate this aspect themselves.

[0610] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0611] In this invention, the server includes a speech recognition device that receives audio data and converts the audio into text, an emotion analysis device that detects the speaker's emotional state from the audio data, and a device that receives presentation materials and evaluates the structure and content of the materials using image recognition technology. This allows presenters to receive specific, actionable feedback on each element of their presentation: audio, emotion, and materials.

[0612] "Audio data" refers to information that represents the waveform of sound acquired through an input device such as a microphone in digital format.

[0613] A "speech recognition device" refers to a technology and device that receives speech data as input, analyzes that speech, and converts it into corresponding text information.

[0614] An "emotion analysis device" refers to a technology and device that analyzes a speaker's emotional state from voice data and its associated information, and identifies the type and intensity of that emotion.

[0615] "Emotional state" refers to the speaker's psychological state inferred from the characteristics of their voice, and includes, for example, joy, sadness, and tension.

[0616] "Image recognition technology" refers to technologies that analyze image data and recognize its content and structure, with examples including layout analysis and visual content recognition.

[0617] "Presentation materials" refer to visual content used during presentations or explanations, and may be provided in the form of slides, graphs, and other visual aids.

[0618] "Information equipment" refers to devices used for inputting and displaying data, and includes computers and smartphones.

[0619] A "user interface" refers to the screens and operating methods that operate on information devices and allow users to interact with the system.

[0620] This invention is realized through a system that combines speech and image recognition technology, as well as emotion analysis technology, to improve the quality of presentations. This system consists of a terminal operated by the user, a server that analyzes data, and a network that connects them.

[0621] Users prepare presentation audio and materials using their devices. Audio data is recorded using a microphone and saved as an electronic file. Material data is uploaded in electronic file formats such as PDF and PPT. This data is transmitted from the device to the server via the internet.

[0622] The server consists of hardware and software that perform multiple processes. Audio data is converted to text by speech recognition software (e.g., Google Speech-to-Text API or IBM Watson). Simultaneously, emotion analysis software analyzes the emotional state from the audio data. This analysis covers elements such as voice tone, pitch, and speaking speed.

[0623] For the document data, image recognition software (e.g., OpenCV or Tesseract) is used to analyze the layout, consistency, and visual effect of the document. This allows us to evaluate whether the document is clearly structured and easy for viewers to understand.

[0624] The server integrates the analysis results of voice, emotion, and materials to generate specific feedback. This feedback includes suggestions for improving speaking style and revisions to materials. The generated feedback is sent back to the terminal via the internet. The feedback is then displayed in the application on the terminal, presented in a format that is easy for the user to understand.

[0625] One concrete example is a scenario where a user gives a presentation introducing a new product. When a user uploads the audio and materials of their presentation, they receive feedback from the server such as, "You sound nervous in the introduction, so try to relax when you speak," or "There are too many graphs on slide 2, making it difficult to read, so you should emphasize the important points." This feedback allows the user to improve their presentation.
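As one non-limiting illustration, feedback of the kind quoted above could be produced from the analysis results by simple rules. The following is a minimal sketch only: the data classes, the 0.0 to 1.0 intensity scale, and the threshold values are hypothetical and do not appear in the description.

```python
from dataclasses import dataclass


@dataclass
class EmotionSegment:
    section: str      # e.g. "introduction"
    emotion: str      # e.g. "tension", "joy", "calmness"
    intensity: float  # 0.0 to 1.0 (hypothetical scale)


@dataclass
class SlideSummary:
    number: int
    graph_count: int


def generate_feedback(segments, slides, tension_threshold=0.6, max_graphs=2):
    """Turn analysis results into feedback sentences using simple rules.

    Both thresholds are illustrative placeholders.
    """
    messages = []
    for seg in segments:
        if seg.emotion == "tension" and seg.intensity >= tension_threshold:
            messages.append(
                f"You sound nervous in the {seg.section}, "
                "so try to relax when you speak."
            )
    for slide in slides:
        if slide.graph_count > max_graphs:
            messages.append(
                f"There are too many graphs on slide {slide.number}; "
                "emphasize the important points."
            )
    return messages
```

In practice the wording of each message could instead be delegated to a generative AI model, with the rules used only to select which findings to report.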

[0626] Using a generative AI model, an example of a prompt message could be: "When a user records a presentation introducing a new product and sends the materials to the server, analyze the user's emotions from the audio data and evaluate the effectiveness of the materials. The feedback should include specific guidance regarding the user's proficiency."

[0627] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0628] Step 1:

[0629] The user launches a dedicated application on their device, records the presentation audio using a microphone, and prepares the presentation materials as electronic files (PDF or PPT). The device saves the audio and material data and sends them to the server via Wi-Fi or a mobile network. The input consists of the audio and material files, and the output is the data transmitted to the server.

[0630] Step 2:

[0631] The server converts the received audio file into text using a speech recognition device. Specifically, it uses software such as the Google Speech-to-Text API or IBM Watson to analyze the audio data and convert it into corresponding text. The input to this process is audio data, and the output is text data.

[0632] Step 3:

[0633] The server uses an emotion analysis device to analyze voice data and detect the user's emotional state based on indicators such as voice tone, pitch, and speaking speed. The input to this process is voice data, and the output is the detected emotion information. The emotion analysis uses an algorithm to determine the speaker's psychological state.

[0634] Step 4:

[0635] The server analyzes the received document files using image recognition technology. Specifically, it uses OpenCV and Tesseract to evaluate the layout, consistency, and visual effect of the documents. The input is the document file, and the output is the analysis result of the document.

[0636] Step 5:

[0637] The server integrates the transcribed text, the emotion information, and the document analysis results to generate user-facing feedback. This feedback includes suggestions for improving speaking style, guidance on emotional expression, and slide revisions. The input consists of the results of each analysis, while the output is the integrated feedback.

[0638] Step 6:

[0639] The server sends the generated feedback to the terminal. The terminal displays this feedback within the application, presenting it in a format easily understandable to the user. The input is the generated feedback, and the output is the information displayed to the user.

[0640] Step 7:

[0641] The user modifies each element of the presentation based on the displayed feedback. If necessary, the user can resend the newly modified data from their device to the server to receive additional feedback. The input is the feedback-based modification process, and the output is the improved presentation.
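As one non-limiting illustration of Step 3 above, an emotional state could be estimated from pitch and speaking speed with a toy rule. The description does not disclose the algorithm actually used, so the rule below, its thresholds, and the mapping from features to the labels "tension", "joy", and "calmness" are assumptions made purely for this sketch and are not validated values.

```python
from statistics import mean, pstdev


def classify_emotion(pitch_hz, words_per_minute,
                     fast_wpm=170.0, high_pitch_var=0.25):
    """Toy rule: label speech as 'tension', 'joy', or 'calmness'.

    pitch_hz: per-frame fundamental-frequency estimates (voiced frames only).
    Thresholds are illustrative placeholders, not validated values.
    """
    if not pitch_hz:
        raise ValueError("pitch_hz must contain at least one value")
    # Coefficient of variation: pitch spread relative to the mean pitch.
    variability = pstdev(pitch_hz) / mean(pitch_hz)
    fast = words_per_minute >= fast_wpm
    varied = variability >= high_pitch_var
    if fast and varied:
        return "tension"
    if varied:
        return "joy"
    return "calmness"
```

A production emotion analysis device would more plausibly use a trained model; the sketch only shows how the indicators named in Step 3 can serve as inputs that map to a detected emotion as output.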

[0642] (Application Example 2)

[0643] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0644] In today's commercial environment, interactive communication directly impacts customer satisfaction, making the improvement of store staff's customer service skills a crucial issue. However, traditional customer service training methods struggle to accurately evaluate individual staff members' emotional expressions and conversational flow. This creates a problem in providing specific and effective feedback tailored to each staff member's abilities.

[0645] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0646] In this invention, the server includes an acoustic recognition means for converting audio data into text, a method for analyzing the layout and content of materials, an algorithm for integrating the analysis results of audio and materials to generate feedback, and a function for evaluating emotions and providing guidance for improving customer service skills. This makes it possible to improve customer service skills by providing specific areas for improvement tailored to each individual staff member.

[0647] "Acoustic recognition means" refers to a function that converts audio data into text, and is a technology that enables the processing of audio information as textual information.

[0648] "Structure" refers to a mechanism for analyzing the flow, tempo, and pacing of speech from text, and is an analytical method for evaluating the characteristics of linguistic expression.

[0649] A "method" is a way of analyzing the layout and content of a document using image recognition technology, and is a process for evaluating the structure and effectiveness of a document based on visual information.

[0650] An "algorithm" is a series of processing steps for integrating the analysis results of audio and materials to generate feedback, and is a computational method for combining data to produce useful information.

[0651] A "device" is hardware or software equipped with the functionality to transmit generated feedback to a terminal, and is a physical or virtual interface for enabling information transmission.

[0652] "Function" refers to the ability to evaluate emotions and provide guidance for improving customer service skills; it is a system for analyzing individual emotional expressions and providing appropriate advice.

[0653] "Applied technology" refers to techniques that recognize emotions, integrate the results, and enhance feedback, representing innovative technologies for improving information delivery by utilizing diverse data.

[0654] A "generative AI model" is a machine learning model that uses prompt sentences based on training data to suggest effective ways to improve conversations, and is an artificial intelligence technology for automated knowledge processing.

[0655] A "prompt statement" is an input phrase used to elicit a specific response from a generative AI model, and is an instruction statement that facilitates interaction with the AI.

[0656] The system for realizing this application consists of an application installed on a terminal and a server. The user uses the terminal to record audio data and uploads related materials to the server. The server converts the received audio data into text using acoustic recognition. It also utilizes emotion recognition to analyze the user's emotions from the audio data and generates appropriate feedback based on that information.

[0657] Regarding the materials, image recognition technology is used to analyze the consistency and design of the layout and to make suggestions for enhancing the visual effect. The server integrates these results to generate feedback. This feedback includes specific guidelines for improving customer service skills and is sent to the terminal.

[0658] The specific hardware used will be smartphones and tablets operated by the user. For the software, speech recognition and sentiment analysis will be implemented using Python, while machine learning frameworks such as TensorFlow will be used for image recognition. AWS Lambda will be used as the server backend, enabling serverless data processing.
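As one non-limiting illustration, a serverless entry point on AWS Lambda could take the following shape in Python. This is a sketch under stated assumptions: the event is assumed to be an API Gateway proxy-style event whose "body" is a JSON string, and the "transcript" field and the `analyze` placeholder are hypothetical.

```python
import json


def analyze(transcript: str) -> list:
    """Placeholder for the real analysis stages (hypothetical)."""
    words = len(transcript.split())
    return [f"Transcript contains {words} words."]


def lambda_handler(event, context):
    """Entry point in the (event, context) form used by Python Lambda handlers.

    Assumes a proxy-style event with a JSON string under "body".
    """
    try:
        payload = json.loads(event.get("body") or "{}")
    except json.JSONDecodeError:
        return {"statusCode": 400, "body": json.dumps({"error": "invalid JSON"})}
    if not isinstance(payload, dict) or not isinstance(payload.get("transcript"), str):
        return {
            "statusCode": 400,
            "body": json.dumps({"error": "transcript required"}),
        }
    feedback = analyze(payload["transcript"])
    return {"statusCode": 200, "body": json.dumps({"feedback": feedback})}
```

Because the handler is an ordinary function, it can be exercised locally by passing a dictionary as the event and `None` as the context.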

[0659] For example, if a user uploads an audio recording of themselves explaining the features of a new product to a customer in a store, the server will generate feedback such as, "Your explanation is too fast; please speak a little slower." In this process, a generative AI model is used, and an example of a prompt might be, "Please tell me some phrases that will effectively introduce the product to the customer." This allows users to receive specific advice to improve their customer service skills.

[0660] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0661] Step 1:

[0662] The user records audio data using their device and prepares related materials in electronic file format. The audio data is saved on the device as an audio file, and the material data is formatted as an electronic file. The user uploads this data to the server via an application on their device.

[0663] Step 2:

[0664] The server converts the received audio data into text using acoustic recognition. Specifically, it analyzes the audio data using the Google Speech-to-Text API and outputs the results in text format. This process stores the linguistic content of the audio data as text in the database.

[0665] Step 3:

[0666] The server uses emotion recognition capabilities to evaluate emotions from transcribed speech. The input text data is passed through an emotion analysis algorithm, and the user's emotional state (e.g., tension, anxiety, confidence) is analyzed and output. This output is then used to generate feedback.

[0667] Step 4:

[0668] The server applies image recognition technology to analyze the layout and design of the materials. It uses machine learning frameworks such as TensorFlow to analyze the material data, and based on the analysis results, it evaluates the visual effects and identifies areas for improvement. These results provide information to determine how the materials will be received by users.

[0669] Step 5:

[0670] The server integrates the results of voice analysis and document analysis to generate feedback. Using a generative AI model and taking the prompt into consideration, it generates feedback that suggests specific improvement measures and guidelines, and outputs this feedback in text format. The prompt may include a specific instruction such as, "Please tell me some phrases that will effectively introduce the product to customers."

[0671] Step 6:

[0672] The feedback is sent to the device and displayed to the user. Based on this feedback, the user takes action to improve their skills and, if necessary, re-enters data to make further improvements.
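As one non-limiting illustration, the server-side portion of the above flow (Steps 2 to 5) could be orchestrated as below. Each stage is passed in as a callable so that the sketch does not depend on any particular speech recognition service, framework, or generative AI model; the function name `run_pipeline` and the shape of the returned dictionary are assumptions.

```python
from typing import Callable, Dict, List


def run_pipeline(
    audio: bytes,
    material: bytes,
    transcribe: Callable[[bytes], str],
    detect_emotion: Callable[[str], str],
    analyze_material: Callable[[bytes], List[str]],
    generate_feedback: Callable[[str, str, List[str]], str],
) -> Dict[str, object]:
    """Run Steps 2 to 5 in order; each stage is supplied by the caller."""
    transcript = transcribe(audio)         # Step 2: audio to text
    emotion = detect_emotion(transcript)   # Step 3: emotion from the text
    findings = analyze_material(material)  # Step 4: layout and design findings
    # Step 5: integrate all results into text feedback.
    feedback = generate_feedback(transcript, emotion, findings)
    return {
        "transcript": transcript,
        "emotion": emotion,
        "findings": findings,
        "feedback": feedback,
    }
```

Injecting the stages in this way also allows each one to be replaced by a stub during testing.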

[0673] The specific processing unit 290 transmits the result of the specific processing to the headset terminal 314. In the headset terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0674] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 is input with prompts containing instructions, and with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 infers from the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0675] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset terminal 314.

[0676] [Fourth Embodiment]

[0677] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.

[0678] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.

[0679] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0680] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.

[0681] The microphone 238 receives instructions from the user 20 by capturing the voice of the user 20. The microphone 238 converts the captured voice into audio data and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0682] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0683] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0684] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.

[0685] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0686] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0687] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0688] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0689] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0690] This invention is a presentation editing system that uses speech recognition and image recognition technology. The system configuration and specific embodiments for carrying out the invention are described below.

[0691] System Configuration

[0692] This system consists of a user terminal, a server that processes data, and a network that connects them. The user uses the terminal to input audio data and materials for their presentation and sends them to the server. The server analyzes this data, generates feedback, and then sends it back to the terminal for the user to see.

[0693] Method for carrying out the invention

[0694] 1. Data entry and transmission

[0695] Users practice their presentations on their own devices and record the audio. They also prepare presentation materials (slides and documents) as electronic files. These audio data and material files are then sent to the server via a dedicated application.

[0696] 2. Voice Analysis

[0697] The server performs speech recognition on the received audio data. The speech recognition engine converts the audio into text and analyzes the speaker's way of speaking, tone, and pacing. For example, it generates specific feedback such as, "You're speaking too fast in the introduction."

[0698] 3. Material analysis

[0699] The server analyzes the received presentation materials using image recognition technology. It evaluates the structure, layout, text readability, and placement of visual elements of the materials, and generates suggestions such as, "Slide 3 contains too much information and needs to be rearranged for clarity."

[0700] 4. Feedback generation and display

[0701] The server integrates the results of audio and document analysis to create comprehensive feedback. This feedback is sent to the terminal in the form of multiple areas for improvement, providing users with information that allows them to efficiently improve their presentations.

[0702] 5. User modification and verification

[0703] Users review the feedback displayed on their devices and revise the content and delivery of their presentations. Afterward, they re-enter data into the system as needed to receive further feedback, thereby improving the quality of their presentations.

[0704] Specific example

[0705] For example, suppose a user is preparing a five-minute presentation on a marketing strategy. The user uploads a recording of their presentation and the presentation materials to the system. The server provides specific suggestions for improving the presentation style, such as "slow down the pace of the introduction and take longer pauses when stating the problem," as well as feedback on the materials, such as "the market analysis chart on the second slide is too complex; we suggest simplifying it visually."

[0706] This format allows users to receive multifaceted feedback and autonomously refine their presentations.
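As one non-limiting illustration, a suggestion such as "Slide 3 contains too much information" could be triggered by a simple density check on text already extracted from each slide. This is a minimal sketch: the function name `flag_dense_slides` and the 40-word limit are arbitrary assumptions, and the text extraction itself (for example by OCR) is outside the sketch.

```python
def flag_dense_slides(slide_texts, max_words=40):
    """Return (slide_number, word_count) for slides exceeding max_words.

    slide_texts maps slide numbers to text already extracted from each slide;
    max_words is an arbitrary illustrative limit.
    """
    flagged = []
    for number in sorted(slide_texts):
        count = len(slide_texts[number].split())
        if count > max_words:
            flagged.append((number, count))
    return flagged
```

A word count is only a rough proxy for the amount of information; an actual implementation would likely also weigh the number and placement of charts and images.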

[0707] The following describes the processing flow.

[0708] Step 1:

[0709] The user uses their device to prepare the audio data and document files for the presentation and opens a dedicated application. Through the application, they upload the audio data and document files to the server.

[0710] Step 2:

[0711] The device packages the uploaded data and sends it to the server using a secure communication protocol. Additional information, such as the presentation topic and time constraints specified by the user, is also sent at the same time.

[0712] Step 3:

[0713] The server places the received audio data into an analysis queue and activates the speech recognition module. This module converts the audio data into text and extracts speech characteristics from it. These characteristics include speed, intonation, and pauses.

[0714] Step 4:

[0715] The server uses an image recognition module to analyze presentation files. This module evaluates the placement of text, shapes, and images within the slides, checking for visual consistency and readability of the materials.

[0716] Step 5:

[0717] The server combines the results of speech recognition and image recognition to generate integrated feedback. This feedback includes specific areas for improvement and recommended corrections.

[0718] Step 6:

[0719] The server repackages the generated feedback and sends it to the terminal. A secure protocol is used for transmission, ensuring the confidentiality of the data.

[0720] Step 7:

[0721] The device analyzes the received feedback and displays it in a format that is easy for the user to understand. The interface includes features that provide detailed explanations for each area for improvement, making it clear to the user which parts of the presentation need to be revised.

[0722] Step 8:

[0723] Users can revise their presentation materials and delivery based on the feedback provided, and use the system again to obtain new feedback as needed. This allows for a gradual improvement in the quality of the presentation.
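As one non-limiting illustration of the speech characteristics extracted in Step 3 above, speaking speed and pauses could be derived from word-level timestamps. The sketch assumes that the speech recognition module returns a start and end time for each word; the `Word` class and the 0.5-second pause threshold are hypothetical. Intonation is not covered here.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the beginning of the recording
    end: float    # seconds from the beginning of the recording


def speech_features(words, pause_threshold=0.5):
    """Compute words per minute and the pauses between consecutive words."""
    if not words:
        return {"words_per_minute": 0.0, "pauses": []}
    duration = words[-1].end - words[0].start
    wpm = len(words) / duration * 60.0 if duration > 0 else 0.0
    pauses = []
    for prev, cur in zip(words, words[1:]):
        gap = cur.start - prev.end
        if gap >= pause_threshold:
            # Record where the pause begins and how long it lasts.
            pauses.append((prev.end, gap))
    return {"words_per_minute": wpm, "pauses": pauses}
```

Computing the same features separately for the introduction and for later sections would support feedback such as "You're speaking too fast in the introduction."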

[0724] (Example 1)

[0725] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0726] In today's business environment, the quality of presentations is a crucial factor in success. However, there is a lack of effective tools for objectively evaluating and improving one's speaking style and the content of presentation materials. Therefore, there is a need for support in improving one's presentation skills by comprehensively analyzing audio and visual information and providing specific areas for improvement.

[0727] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0728] In this invention, the server includes recognition means for receiving acoustic information and converting that acoustic information into textual information, means for analyzing the flow, speed, and intervals of speech from the textual information, and means for receiving visual materials and analyzing the structure and content of the materials using image processing technology. This makes it possible to comprehensively analyze acoustic and visual information and provide the user with effective suggestions for improvement.

[0729] "Acoustic information" refers to audio data and related sound information, and is primarily the data targeted for speech recognition.

[0730] "Textual information" refers to text data generated by speech recognition based on acoustic information, and is information expressed in a format that is easily understood visually by humans.

[0731] "Recognition means" refers to the technologies and devices used to process acoustic information and convert it into textual information, and usually includes speech recognition algorithms.

[0732] "Speech flow" refers to the structure and order of spoken language, and is an element that should be considered in order to maintain the natural flow of speech.

[0733] "Speech rate" refers to the tempo of speech and is usually measured by the number of words or syllables uttered per unit of time.

[0734] "Pauses" refer to the silent intervals or breaks that a speaker leaves between utterances, and are an important element for improving the clarity of speech.

[0735] "Visual materials" refer to media used to present information visually, such as slides and documents created for presentations.

[0736] "Image processing technology" refers to the technology of using computers to analyze and transform image data and extract useful information.

[0737] "Composition" refers to the arrangement and combination of text and image elements in visual materials, and is important for effectively conveying information.

[0738] "Content" refers to the information that visual materials attempt to convey, and plays a central role in expressing the theme and message of the material.

[0739] "Areas for improvement" refers to specific suggestions and advice for the user to improve their presentation, derived from the analysis results of acoustic and visual information.

[0740] This invention is a system for analyzing acoustic and visual information and providing users with suggestions for improvement. The following describes how to implement the system in detail.

[0741] The server receives acoustic information and uses a speech recognition engine to convert that sound into text. For example, it utilizes speech recognition technologies such as "Google Speech-to-Text" or "Amazon Transcribe." The server analyzes the converted text and evaluates the flow, speed, and spacing of the speech. In this process, it uses a generative AI model to generate specific feedback regarding the emphasis of particular words and phrases, as well as the pace of the speech.

[0742] The terminal is responsible for transmitting audio information and visual materials recorded by the user to the server. Visual materials include slides and documents, and the server analyzes the received materials using image processing technology. By using image recognition tools such as OpenCV and Tesseract, the structure and content of the materials are evaluated, and improvement suggestions are generated.

[0743] Users can improve the quality of their presentations by receiving feedback from the server and repeatedly practicing. Based on the improvements, users can record new practice sessions and input them into the system again.

[0744] As a concrete example of implementation, let's say a user is preparing a 5-minute presentation introducing a new product. The user uploads a recording of their presentation and the presentation materials to the system. The server provides specific suggestions for improvement, such as "improve your delivery in the introduction and simplify the design of slide 2." Based on these suggestions, the user can practice repeatedly.

[0745] An example of a prompt message that could be sent to the server is: "Analyze the audio data and materials of the presentation and generate specific areas for improvement. The audio data is what the user said, and the materials include slides and documents."
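As one non-limiting illustration, a prompt of the above kind could be assembled from the transcribed speech and per-slide notes before being passed to a generative AI model. The function `build_prompt` and the layout of the prompt text are assumptions; the call to the model itself is omitted.

```python
def build_prompt(transcript: str, slide_notes: list) -> str:
    """Combine the instruction, the transcript, and slide notes into a prompt."""
    instruction = (
        "Analyze the audio data and materials of the presentation and "
        "generate specific areas for improvement."
    )
    # Number the slides from 1 so that feedback can refer to them directly.
    slides = "\n".join(
        f"- Slide {i}: {note}" for i, note in enumerate(slide_notes, start=1)
    )
    return (
        f"{instruction}\n\n"
        f"Transcript of what the user said:\n{transcript.strip()}\n\n"
        f"Materials:\n{slides}"
    )
```

Keeping the instruction separate from the inserted data makes it straightforward to reuse the same template for different presentations.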

[0746] This embodiment of the invention provides a concrete method for efficiently improving the quality of presentations through collaboration between servers, terminals, and users.

[0747] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0748] Step 1:

[0749] Users practice their presentations on their own devices and record the audio information (audio data). They also prepare visual materials (slides and documents) for their presentations. The input consists of audio information and visual materials, which are then sent to the server using a dedicated application.

[0750] Step 2:

[0751] The server receives acoustic information sent by the user and converts it into text information using a speech recognition engine. The input is acoustic information, and the server analyzes the waveform of the voice to obtain output as a string of characters. In this process, the server uses technologies such as "Google Speech-to-Text" or "Amazon Transcribe".

[0752] Step 3:

[0753] The server analyzes the speed, flow, and spacing of speech from the converted text information. Text information is input, and a generative AI model is used to evaluate the speaking style and tempo, outputting feedback for improvement. The server specifically analyzes certain words and speech patterns to provide appropriate suggestions.
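As a rough illustration of the tempo analysis in this step, the following sketch derives a speaking rate and a filler-word count from a transcript. It assumes the recording duration is known and the transcript is space-delimited English. The name `analyze_tempo`, the filler list, and the 110 to 160 words-per-minute band are illustrative assumptions and do not come from this disclosure, which leaves the evaluation to a generative AI model.

```python
import re
from dataclasses import dataclass

# Hypothetical filler list and pace band; neither comes from the disclosure.
FILLERS = {"um", "uh", "er", "like"}
PACE_BAND_WPM = (110, 160)


@dataclass
class TempoReport:
    words_per_minute: float
    filler_count: int
    suggestions: list


def analyze_tempo(transcript: str, duration_seconds: float) -> TempoReport:
    """Estimate speaking rate and filler usage from a transcript."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    words = re.findall(r"[a-z']+", transcript.lower())
    wpm = len(words) / (duration_seconds / 60.0)
    fillers = sum(1 for w in words if w in FILLERS)

    suggestions = []
    low, high = PACE_BAND_WPM
    if wpm > high:
        suggestions.append("Speaking pace is fast; consider slowing down.")
    elif wpm < low:
        suggestions.append("Speaking pace is slow; consider tightening pauses.")
    if fillers:
        suggestions.append(f"Reduce filler words ({fillers} found).")
    return TempoReport(wpm, fillers, suggestions)
```

In practice the numeric report could be passed to the generative AI model as context rather than shown to the user directly.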

[0754] Step 4:

[0755] Simultaneously, the server receives the visual materials and analyzes their structure and content using image processing techniques. The input for this step is the visual materials, and the layout and text legibility are checked automatically. The output is a set of specific improvement suggestions for the materials. Image recognition is performed using tools such as OpenCV or Tesseract.
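The legibility check in this step can be sketched as follows, assuming an OCR tool has already produced one bounding box per word. The `WordBox` type, the function name, and both thresholds are hypothetical and are not specified in this disclosure.

```python
from dataclasses import dataclass


@dataclass
class WordBox:
    text: str
    height_px: int  # height of the word's bounding box in pixels


def check_slide_legibility(boxes, slide_height_px,
                           max_words=40, min_height_ratio=0.03):
    """Return improvement suggestions for one slide.

    max_words and min_height_ratio are illustrative thresholds only.
    """
    suggestions = []
    if len(boxes) > max_words:
        suggestions.append(
            f"Slide has {len(boxes)} words; consider trimming to "
            f"{max_words} or fewer."
        )
    # Words whose height is a small fraction of the slide are flagged.
    small = [b.text for b in boxes
             if b.height_px / slide_height_px < min_height_ratio]
    if small:
        suggestions.append(
            f"{len(small)} word(s) may be too small to read, "
            f"e.g. '{small[0]}'."
        )
    return suggestions
```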

[0756] Step 5:

[0757] The server integrates feedback from acoustic and visual materials to generate comprehensive improvement suggestions. The output includes detailed feedback, including areas for improvement, providing information to enhance the user's presentation skills from multiple perspectives.
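One possible way to integrate the two feedback streams is sketched below. The `Finding` record and its 1 to 3 severity scale are assumptions made for illustration; the disclosure does not define a data format for the integrated feedback.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    source: str    # "audio" or "materials"
    location: str  # e.g. "introduction", "slide 2"
    message: str
    severity: int  # 1 (minor) to 3 (major); the scale is an assumption


def integrate_feedback(audio_findings, material_findings):
    """Merge both streams and list the most severe findings first."""
    merged = list(audio_findings) + list(material_findings)
    # sort() is stable, so findings of equal severity keep their input order.
    merged.sort(key=lambda f: -f.severity)
    return [f"[{f.source}] {f.location}: {f.message}" for f in merged]
```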

[0758] Step 6:

[0759] The server sends feedback to the user's device. The user uses this feedback to improve their presentation. By reviewing the content displayed on the device and practicing again, it is possible to improve the quality of the presentation.

[0760] (Application Example 1)

[0761] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0762] In modern content distribution platforms, providing high-quality presentations and content is crucial, but it's not easy for individual users to improve their own presentation skills and the quality of their materials. Furthermore, there's a lack of systems that effectively analyze user-recorded audio and slides and provide specific suggestions for improvement. There's also a need for support in enhancing the quality of content, including presentations, through the input of prompts using generative AI models.

[0763] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0764] In this invention, the server includes a speech recognition means for receiving audio data and converting the audio into text, a means for analyzing the flow, tempo, and pauses of speech from the text, and a means for receiving presentation materials and analyzing the layout and content of the materials using image recognition technology. This makes it possible to analyze audio files and slide images recorded by the user and use the feedback to qualitatively improve the materials for the content distribution platform. Furthermore, by generating prompt sentences and inputting them into a generative AI model, it is possible to provide feedback that improves the overall quality of the content.

[0765] "Speech recognition means" refers to a device or software that has the function of analyzing speech data and converting it into text.

[0766] "Image recognition technology" is a technology that analyzes image data and determines its content and structure.

[0767] A "feedback generation method" is a function that provides users with suggestions for improvement based on the results of processed audio and image data.

[0768] A "terminal" is an electronic device that a user operates to input voice and image data and receive feedback.

[0769] A "content distribution platform" is an online service that provides users with various digital content and enables interaction.

[0770] "Means of qualitatively improving materials" refer to methods and techniques for evaluating provided materials and presentations and improving their content.

[0771] A "generative AI model" is an artificial intelligence algorithm that generates new information or results based on given data.

[0772] A "prompt statement" is an instruction given to a generative AI model, and it is a phrase used to obtain a specific output.

[0773] The system for implementing this invention consists of a user terminal, a data processing server, and a network that connects them. The user inputs audio data and presentation materials via the terminal and sends them to the server. At this time, the terminal records the audio and uploads the presentation materials as image files to the server.

[0774] The server converts the received audio data into text using speech recognition technology and analyzes the flow and tempo of the speech. The "speech_recognition" library is used for this process. Furthermore, for presentation materials, image recognition technology is used to analyze the content and layout, and the readability and visual arrangement of the information are evaluated. The "PIL" and "pytesseract" libraries are used at this stage.
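A minimal sketch of how the named libraries might be called is shown below. It assumes the `SpeechRecognition`, `Pillow`, and `pytesseract` packages and the Tesseract binary are installed, and that the audio is a WAV, AIFF, or FLAC file. The function names and the `count_text_lines` helper are hypothetical. The third-party imports are placed inside the functions so that the module loads even where those packages are absent.

```python
def transcribe_audio(wav_path: str, language: str = "en-US") -> str:
    """Transcribe an audio file with the speech_recognition package."""
    import speech_recognition as sr  # third-party: SpeechRecognition

    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)  # read the whole file
    # recognize_google calls Google's Web Speech API, so network access
    # is required.
    return recognizer.recognize_google(audio, language=language)


def extract_slide_text(image_path: str, lang: str = "eng") -> str:
    """Run OCR on one slide image with PIL and pytesseract."""
    from PIL import Image  # third-party: Pillow
    import pytesseract     # third-party; also needs the Tesseract binary

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, lang=lang)


def count_text_lines(ocr_text: str) -> int:
    """Count non-empty lines in OCR output as a crude density measure."""
    return sum(1 for line in ocr_text.splitlines() if line.strip())
```

The line count is only one crude signal of how dense a slide is; an evaluation of visual arrangement would also need the positions of the recognized text.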

[0775] The analysis results are integrated, and a feedback generation mechanism generates improvement suggestions for the user. These suggestions, based on both audio and document data, help the user prepare qualitatively improved material for the content distribution platform. The feedback is input into the AI model as prompts, resulting in an optimized presentation. The user can receive the feedback on their device and use it to improve their presentation.

[0776] For example, when a user prepares an online lecture, they upload audio and slide materials to the system, and the server provides specific feedback such as "adjust the tempo of the audio introduction" or "simplify the slide structure." This allows the user to significantly improve the overall quality of the delivered content. An example of a prompt to the generative AI model would be, "Analyze the audio and materials of the presentation and point out specific areas for improvement. For example, include feedback on the tempo of speaking and the visual elements of the slides."

[0777] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0778] Step 1:

[0779] The user uses a device to record the audio of the presentation and prepare the materials as image files. The input consists of audio and image files, which serve as the basic data for subsequent analysis.

[0780] Step 2:

[0781] The terminal sends the prepared audio and image files to the server. At this point, the input is the audio and image files, and the output is the transmission of data to the server. Once the data reaches the server, the analysis process is ready to begin.

[0782] Step 3:

[0783] The server uses speech recognition to convert audio files into text data. The input is an audio file, and the output is the converted text. Here, the "speech_recognition" library is used to convert audio data into text.

[0784] Step 4:

[0785] The server analyzes text data and evaluates the flow, tempo, and pacing of speech. The input is text data, and the output is feedback data as a result of the analysis. It specifically extracts characteristics of the user's speech and indicates which areas can be improved.

[0786] Step 5:

[0787] The server uses image recognition technology to analyze presentation materials. The input is an image file, and the output is an evaluation of the layout and content. Here, the "PIL" and "pytesseract" libraries are used to extract text information from the slides and evaluate the design.

[0788] Step 6:

[0789] The server integrates the analysis results of audio and image data and uses a generative AI model to generate optimal feedback. The input is the analysis results data, and the output is feedback that includes detailed improvement suggestions. Prompt sentences are also generated during this process and input into the generative AI model.
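The prompt generation in this step could look like the sketch below, which places the analysis results under the instruction sentence given in the example prompt of paragraph [0776]. The function name and the bullet layout are assumptions. How the prompt is submitted to a generative AI model is left out, since the disclosure does not name a specific API.

```python
def _bullets(items):
    """Format items as a bullet list, with a placeholder when empty."""
    return [f"- {item}" for item in items] or ["- (none)"]


def build_feedback_prompt(speech_findings, slide_findings):
    """Assemble a prompt for a generative AI model from analysis results."""
    lines = [
        "Analyze the audio and materials of the presentation and point out "
        "specific areas for improvement.",
        "",
        "Speech analysis results:",
    ]
    lines += _bullets(speech_findings)
    lines += ["", "Slide analysis results:"]
    lines += _bullets(slide_findings)
    return "\n".join(lines)
```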

[0790] Step 7:

[0791] The server sends the generated feedback to the terminal. The input is feedback data, and the output is a feedback presentation to the user. The suggestions are displayed on the terminal, and the user can use them to improve the quality of their presentation.

[0792] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0793] This invention is a presentation editing system that combines speech recognition and image recognition technologies with an emotion engine that recognizes the user's emotions. The specific system configuration and embodiments of the invention are shown below.

[0794] System Configuration

[0795] The system consists of a user-operated terminal, a server that performs data analysis, and a communication network connecting them. The user inputs audio and presentation data through the terminal and sends it to the server. The server performs speech recognition, image recognition, and emotion recognition, generates feedback based on the analysis results, and sends it back to the terminal.

[0796] Method for carrying out the invention

[0797] 1. Data entry and transmission

[0798] When preparing a presentation, users record audio data on their device and upload presentation materials as electronic files. This data is sent to the server via the application.

[0799] 2. Voice Analysis and Emotion Recognition

[0800] The server converts the received audio data into text using a speech recognition engine and simultaneously analyzes it using an emotion engine. The emotion engine recognizes the user's emotions (e.g., nervousness, anxiety, confidence) from factors such as tone of voice and speaking speed. This makes it possible to evaluate the effectiveness of emotional expression in a presentation.

[0801] 3. Material analysis

[0802] The server uses an image recognition engine to analyze the structure, design, and consistency of the presentation materials. It evaluates whether the materials are easy to understand and visually effective, and suggests areas for improvement as needed.

[0803] 4. Feedback generation and display

[0804] The server integrates voice and sentiment analysis results with document analysis results to generate feedback. This feedback includes specific areas for improvement in speaking style during presentations, guidelines for emotional expression, and suggestions for document revisions. The generated feedback is sent to the terminal and displayed to the user in an easy-to-understand format.

[0805] 5. User improvements and verification

[0806] Users can revise each element of their presentation based on the generated feedback. If necessary, they can re-enter data, receive new feedback, and further improve the quality of their presentation.

[0807] Specific example

[0808] Consider a scenario where a user is giving a presentation introducing a new product. The user uploads the audio and materials of their presentation to the system, and receives feedback from the server such as, "You sound nervous in the introduction; try to relax more," or "Slide 2 has too many graphs and is difficult to read; highlight the important data." This provides specific guidance for improving the presentation.

[0809] The following describes the processing flow.

[0810] Step 1:

[0811] The user records the audio of the presentation on their device and prepares the presentation materials as electronic files. Next, they launch a dedicated application and perform the input procedure to upload these audio data and material files to the server.

[0812] Step 2:

[0813] The terminal packages the uploaded audio data and document files into a data package and prepares it for transmission to the server. It sends the data to the server using a secure protocol (e.g., HTTPS).
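The packaging in this step can be sketched with the Python standard library as a single multipart/form-data POST. The field names, file names, and content types are hypothetical, and the request is only built here, not sent.

```python
import uuid
import urllib.request


def build_upload_request(url, audio_bytes, document_bytes):
    """Package audio and document data as one multipart/form-data POST.

    The field names and file names below are hypothetical.
    """
    if not url.startswith("https://"):
        raise ValueError("upload URL must use HTTPS")
    boundary = uuid.uuid4().hex
    parts = [
        ("audio", "practice.wav", "audio/wav", audio_bytes),
        ("document", "slides.pdf", "application/pdf", document_bytes),
    ]
    body = b""
    for field, filename, content_type, payload in parts:
        body += (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{field}"; '
            f'filename="{filename}"\r\n'
            f"Content-Type: {content_type}\r\n\r\n"
        ).encode("ascii")
        body += payload + b"\r\n"
    body += f"--{boundary}--\r\n".encode("ascii")
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return urllib.request.Request(url, data=body, headers=headers,
                                  method="POST")
```

Passing the returned object to `urllib.request.urlopen` would perform the upload over TLS.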

[0814] Step 3:

[0815] The server adds the received audio data to the analysis queue and starts the speech recognition engine. The engine converts the audio data into text data, making the content of the presentation available in text form.

[0816] Step 4:

[0817] The server continuously feeds the audio data into the emotion recognition engine. The emotion recognition engine analyzes emotional patterns from the tone, pitch, and speed of the voice to identify the emotional state during the presentation (tension, joy, calmness).
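The disclosure does not specify how the emotion recognition engine derives tone or pitch. As a hedged illustration of the kind of low-level features such an engine might start from, the sketch below computes a loudness proxy and a very simple pitch estimate from raw samples. Zero-crossing counting is a stand-in chosen for brevity; it is only reasonable for clean, near-sinusoidal signals and is not presented as the method of this invention.

```python
import math


def rms_energy(samples):
    """Root-mean-square amplitude, a rough proxy for loudness."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def zero_crossing_pitch(samples, sample_rate):
    """Estimate the fundamental frequency in Hz from sign changes.

    A pure tone crosses zero twice per cycle, so the frequency is
    approximately crossings / (2 * duration).
    """
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    duration = len(samples) / sample_rate
    return crossings / (2.0 * duration)
```

For recorded speech, a more robust pitch tracker would be needed before these values could feed an emotion classifier.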

[0818] Step 5:

[0819] The server passes the presentation file to an image recognition engine for analysis. It evaluates the layout, design, and information placement within the document, and analyzes the visual consistency and appropriateness of the amount of information.

[0820] Step 6:

[0821] The server integrates the results of speech recognition, emotion recognition, and document analysis to generate feedback. This feedback includes suggestions for improving the presentation's overall quality and, in particular, areas for improvement in emotional expression.

[0822] Step 7:

[0823] The server packages the generated feedback and sends it to the terminal. The terminal receives the feedback and launches an interface that shows the user a detailed report of the problems found and an evaluation of what went well.
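The exchange in this step can be sketched as a JSON payload that the server encodes and the terminal renders. The payload schema (`version`, `items`, `kind`, `target`, `message`) and the `OK`/`FIX` markers are assumptions for illustration; the disclosure does not define a wire format.

```python
import json


def encode_feedback(items):
    """Server side: serialize feedback items to a JSON string."""
    return json.dumps({"version": 1, "items": items}, ensure_ascii=False)


def render_feedback(payload):
    """Terminal side: turn the JSON payload into display lines."""
    data = json.loads(payload)
    lines = []
    for item in data["items"]:
        # "strength" marks something that went well; anything else is
        # treated as a point to fix.
        marker = "OK" if item["kind"] == "strength" else "FIX"
        lines.append(f"[{marker}] {item['target']}: {item['message']}")
    return lines
```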

[0824] Step 8:

[0825] Based on the feedback displayed, users revise the content, delivery, and emotional expression of their presentations. If necessary, they re-enter the data into the system to obtain further feedback for improvement.

[0826] (Example 2)

[0827] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0828] In today's presentation environment, presenters spend a great deal of time preparing their speaking style and materials, but they often struggle to identify specific areas for improvement and receive effective feedback. Furthermore, while emotional expression significantly impacts the success of a presentation, it's difficult for presenters to evaluate this aspect themselves.

[0829] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0830] In this invention, the server includes a speech recognition device that receives audio data and converts the audio into text, an emotion analysis device that detects the speaker's emotional state from the audio data, and a device that receives presentation materials and evaluates the structure and content of the materials using image recognition technology. This allows presenters to receive specific, actionable feedback on each element of their presentation: audio, emotion, and materials.

[0831] "Audio data" refers to information that represents the waveform of sound acquired through an input device such as a microphone in digital format.

[0832] A "speech recognition device" refers to a technology and device that receives speech data as input, analyzes that speech, and converts it into corresponding text information.

[0833] An "emotion analysis device" refers to a technology and device that analyzes a speaker's emotional state from voice data and its associated information, and identifies the type and intensity of that emotion.

[0834] "Emotional state" refers to the speaker's psychological state inferred from the characteristics of their voice, and includes, for example, joy, sadness, and tension.

[0835] "Image recognition technology" refers to technologies that analyze image data and recognize its content and structure, with examples including layout analysis and visual content recognition.

[0836] "Presentation materials" refer to visual content used during presentations or explanations, and may be provided in the form of slides, graphs, and other visual aids.

[0837] "Information equipment" refers to devices used for inputting and displaying data, and includes computers and smartphones.

[0838] A "user interface" refers to the screens and operating methods that operate on information devices and allow users to interact with the system.

[0839] This invention is realized through a system that combines speech and image recognition technology, as well as emotion analysis technology, to improve the quality of presentations. This system consists of a terminal operated by the user, a server that analyzes data, and a network that connects them.

[0840] Users prepare presentation audio and materials using their devices. Audio data is recorded using a microphone and saved as an electronic file. Material data is uploaded in electronic file formats such as PDF and PPT. This data is transmitted from the device to the server via the internet.

[0841] The server consists of hardware and software that perform multiple processes. Audio data is converted to text by speech recognition software (e.g., Google Speech-to-Text API or IBM Watson). Simultaneously, sentiment analysis software analyzes the emotional state from the audio data. This analysis includes elements such as voice tone, pitch, and speaking speed.

[0842] For the document data, image recognition software (e.g., OpenCV or Tesseract) is used to analyze the layout, consistency, and visual effect of the document. This allows us to evaluate whether the document is clearly structured and easy for viewers to understand.

[0843] The server integrates the analysis results of voice, emotion, and materials to generate specific feedback. This feedback includes suggestions for improving speaking style and revisions to materials. The generated feedback is sent back to the terminal via the internet. The feedback is then displayed in the application on the terminal, presented in a format that is easy for the user to understand.

[0844] One concrete example is a scenario where a user gives a presentation introducing a new product. When a user uploads the audio and materials of their presentation, they receive feedback from the server such as, "You sound nervous in the introduction, so try to relax when you speak," or "There are too many graphs on slide 2, making it difficult to read, so you should emphasize the important points." This feedback allows the user to improve their presentation.

[0845] Using a generative AI model, an example of a prompt message could be: "When a user records a presentation introducing a new product and sends the materials to the server, analyze the user's emotions from the audio data and evaluate the effectiveness of the materials. The feedback should include specific guidance regarding the user's proficiency."

[0846] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0847] Step 1:

[0848] The user launches a dedicated application on their device and simultaneously records the presentation audio using a microphone while preparing the presentation materials as electronic files (PDF or PPT). The device saves this audio and material data and sends them to the server via Wi-Fi or a mobile network. The input consists of the audio and material files, and the output is the transmission of data to the server.

[0849] Step 2:

[0850] The server converts the received audio file into text using a speech recognition device. Specifically, it uses software such as the Google Speech-to-Text API or IBM Watson to analyze the audio data and convert it into corresponding text. The input to this process is audio data, and the output is text data.

[0851] Step 3:

[0852] The server uses an emotion analysis device to analyze voice data and detect the user's emotional state based on indicators such as voice tone, pitch, and speaking speed. The input to this process is voice data, and the output is the detected emotion information. The emotion analysis uses an algorithm to determine the speaker's psychological state.

[0853] Step 4:

[0854] The server analyzes the received document files using image recognition technology. Specifically, it uses OpenCV and Tesseract to evaluate the layout, consistency, and visual effect of the documents. The input is the document file, and the output is the analysis result of the document.

[0855] Step 5:

[0856] The server integrates speech-to-text, sentiment information, and document analysis results to generate user-facing feedback. This feedback includes suggestions for improving speaking style, guidance on emotional expression, and slide revisions. The input consists of the results of each analysis, while the output is the integrated feedback.

[0857] Step 6:

[0858] The server sends the generated feedback to the terminal. The terminal displays this feedback within the application, presenting it in a format easily understandable to the user. The input is the generated feedback, and the output is the information displayed to the user.

[0859] Step 7:

[0860] The user modifies each element of the presentation based on the displayed feedback. If necessary, the user can resend the newly modified data from their device to the server to receive additional feedback. The input is the feedback-based modification process, and the output is the improved presentation.

[0861] (Application Example 2)

[0862] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0863] In today's commercial environment, interactive communication directly impacts customer satisfaction, making the improvement of store staff's customer service skills a crucial issue. However, traditional customer service training methods struggle to accurately evaluate individual staff members' emotional expressions and conversational flow. This creates a problem in providing specific and effective feedback tailored to each staff member's abilities.

[0864] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0865] In this invention, the server includes an acoustic recognition means for converting audio data into text, a method for analyzing the layout and content of materials, an algorithm for integrating the analysis results of audio and materials to generate feedback, and a function for evaluating emotions and providing guidance for improving customer service skills. This makes it possible to improve customer service skills by providing specific areas for improvement tailored to each individual staff member.

[0866] "Acoustic recognition means" refers to a function that converts audio data into text, and is a technology that enables the processing of audio information as textual information.

[0867] "Structure" refers to a mechanism for analyzing the flow, tempo, and pacing of speech from text, and is an analytical method for evaluating the characteristics of linguistic expression.

[0868] A "method" is a way of analyzing the layout and content of a document using image recognition technology, and is a process for evaluating the structure and effectiveness of a document based on visual information.

[0869] An "algorithm" is a series of processing steps for integrating the results of audio and data analysis to generate feedback, and is a computational method for combining data to produce useful information.

[0870] A "device" is hardware or software equipped with the functionality to transmit generated feedback to a terminal, and is a physical or virtual interface for enabling information transmission.

[0871] "Function" refers to the ability to evaluate emotions and provide guidance for improving customer service skills; it is a system for analyzing individual emotional expressions and providing appropriate advice.

[0872] "Applied technology" refers to techniques that recognize emotions, integrate the results, and enhance feedback, representing innovative technologies for improving information delivery by utilizing diverse data.

[0873] A "generative AI model" is a machine learning model that uses prompt sentences based on training data to suggest effective ways to improve conversations, and is an artificial intelligence technology for automated knowledge processing.

[0874] A "prompt statement" is an input phrase used to elicit a specific response from a generative AI model, and is an instruction statement that facilitates interaction with the AI.

[0875] The system for realizing this application consists of an application installed on a terminal and a server. The user uses the terminal to record audio data and uploads related materials to the server. The server converts the received audio data into text using acoustic recognition. It also utilizes emotion recognition to analyze the user's emotions from the audio data and generates appropriate feedback based on that information.

[0876] Regarding the materials, image recognition technology is used to analyze the consistency and design of the layout and to make suggestions for enhancing the visual effect. The server integrates these results to generate feedback. This feedback includes specific guidelines for improving customer service skills and is sent to the terminal.

[0877] The specific hardware used will be smartphones and tablets operated by the user. For the software, speech recognition and sentiment analysis will be implemented using Python, while machine learning frameworks such as TensorFlow will be used for image recognition. AWS Lambda will be used as the server backend, enabling serverless data processing.
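A minimal sketch of a serverless entry point is shown below, using the `(event, context)` handler form that AWS Lambda uses for Python functions. The request shape (an API Gateway proxy-style event whose `body` is a JSON string with a `transcript` field) and the word-count response are hypothetical placeholders for the actual analysis.

```python
import json


def _error(status, message):
    return {"statusCode": status, "body": json.dumps({"error": message})}


def lambda_handler(event, context):
    """Handle one analysis request.

    Assumes event["body"] is a JSON string with a "transcript" field;
    that request shape is an illustrative assumption.
    """
    try:
        body = json.loads(event.get("body") or "{}")
    except json.JSONDecodeError:
        return _error(400, "invalid JSON")
    if not isinstance(body, dict) or not body.get("transcript"):
        return _error(400, "transcript required")

    # Placeholder analysis: a real deployment would call the speech,
    # emotion, and image analysis components here.
    word_count = len(body["transcript"].split())
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"word_count": word_count}),
    }
```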

[0878] For example, if a user uploads an audio recording of themselves explaining the features of a new product to a customer in a store, the server will generate feedback such as, "Your explanation is too fast; please speak a little slower." In this process, a generative AI model is used, and an example of a prompt might be, "Please tell me some phrases that will effectively introduce the product to the customer." This allows users to receive specific advice to improve their customer service skills.

[0879] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0880] Step 1:

[0881] The user records audio data using their device and prepares related materials in electronic file format. The audio data is saved on the device as an audio file, and the material data is formatted as an electronic file. The user uploads this data to the server via an application on their device.

[0882] Step 2:

[0883] The server converts the received audio data into text using acoustic recognition. Specifically, it analyzes the audio data using the Google Speech-to-Text API and outputs the results in text format. This process stores the linguistic content of the audio data as text in the database.

[0884] Step 3:

[0885] The server uses emotion recognition capabilities to evaluate emotions from transcribed speech. The input text data is passed through an emotion analysis algorithm, and the user's emotional state (e.g., tension, anxiety, confidence) is analyzed and output. This output is then used to generate feedback.

[0886] Step 4:

[0887] The server applies image recognition technology to analyze the layout and design of the materials. It uses machine learning frameworks such as TensorFlow to analyze the material data, and based on the analysis results, it evaluates the visual effects and identifies areas for improvement. These results provide information to determine how the materials will be received by users.

[0888] Step 5:

[0889] The server integrates the results of voice analysis and document analysis to generate feedback. Using a generative AI model, it generates feedback that suggests specific improvement measures and guidelines, taking prompts into consideration, and outputs this feedback in text format. The feedback may include specific instructions such as, "Please tell me some phrases that will effectively introduce the product to customers."

[0890] Step 6:

[0891] The feedback is sent to the device and displayed to the user, who acts on it to improve their skills. The user can then re-enter data to obtain further feedback and make further improvements.

[0892] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing unit 12. In the data processing unit 12, the specific processing unit 290 acquires the audio data.

[0893] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives prompts containing instructions together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0894] In the above embodiment, an example was given in which the specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.

[0895] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping, namely an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.

[0896] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. Above and below the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.

[0897] These emotions are distributed at the 3 o'clock position on the Emotion Map 400, and usually fluctuate between feelings of security and anxiety. In the right half of the Emotion Map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.

[0898] The inside of the emotion map 400 represents inner thoughts, while the outside represents actions. Therefore, the further toward the outside of the emotion map 400 an emotion lies, the more visible it becomes (that is, the more it is expressed in actions).

[0899] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, the result is discomfort, and when they approach the ideal, the result is pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, the result is discomfort, and when they approach the ideal, the result is pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https://ci.nii.ac.jp/naid/500000375379). The left half of the emotion map contains emotions belonging to a region called "reaction," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.
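For illustration only, the balance-based notion of pleasure and discomfort described above might be computed as follows. The function name `balance_valence`, the linear scoring rule, the normalization by a per-balance tolerance, and all numeric values are assumptions made for this sketch; this disclosure does not specify a formula.

```python
def balance_valence(readings, ideals, tolerances):
    """Score in [-1.0, 1.0]: +1.0 when every balance sits at its ideal
    (pleasure), falling to -1.0 as the balances deviate (discomfort)."""
    scores = []
    for name, ideal in ideals.items():
        # Deviation from the ideal, normalized by an assumed per-balance
        # tolerance and capped at 1.0.
        deviation = min(abs(readings[name] - ideal) / tolerances[name], 1.0)
        scores.append(1.0 - 2.0 * deviation)
    return sum(scores) / len(scores)


# Hypothetical robot balances: tilt in degrees, battery level as a fraction.
ideals = {"posture_tilt_deg": 0.0, "battery_level": 0.8}
tolerances = {"posture_tilt_deg": 30.0, "battery_level": 0.6}

upright_and_charged = {"posture_tilt_deg": 0.0, "battery_level": 0.8}
tilted_and_low = {"posture_tilt_deg": 30.0, "battery_level": 0.2}

print(balance_valence(upright_and_charged, ideals, tolerances))
print(balance_valence(tilted_and_low, ideals, tolerances))
```

A positive score would correspond to the upper ("pleasure") side of the emotion map and a negative score to the lower ("displeasure") side.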

[0900] The emotion map defines two emotions that promote learning. One is a negative emotion located around the midpoint between "repentance" and "reflection" on the situation side. In other words, it arises when the robot experiences negative feelings such as "I never want to feel this way again" or "I don't want to be scolded again." The other is a positive emotion located around "desire" on the reaction side. In other words, it arises when the robot has positive feelings such as "I want more" or "I want to know more."

[0901] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
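One possible way to express the property that emotions located close together have similar values is a proximity-weighted penalty that could be added to a training loss. The following is a minimal sketch under stated assumptions: the Gaussian weighting, the parameter `sigma`, the function names `proximity_penalty` and `dominant_emotion`, and the map coordinates are all hypothetical, and this disclosure does not state how the neural network is actually trained.

```python
import math


def proximity_penalty(values, positions, sigma=0.3):
    """Sum of squared value differences between emotion pairs, weighted so
    that pairs located close together on the map contribute the most."""
    names = sorted(values)
    total = 0.0
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            d = math.dist(positions[a], positions[b])
            weight = math.exp(-(d * d) / (2.0 * sigma * sigma))
            total += weight * (values[a] - values[b]) ** 2
    return total


def dominant_emotion(values):
    """Pick the emotion with the largest emotion value."""
    return max(values, key=values.get)


# Assumed (x, y) map coordinates for three emotions named in Figure 10.
positions = {
    "reassured": (0.50, 0.10),
    "calm": (0.50, 0.00),
    "confident": (0.55, 0.15),
}

similar = {"reassured": 0.8, "calm": 0.8, "confident": 0.8}
uneven = {"reassured": 0.9, "calm": 0.1, "confident": 0.5}

print(proximity_penalty(similar, positions))
print(proximity_penalty(uneven, positions))
print(dominant_emotion(uneven))
```

Under this sketch, outputs in which neighboring emotions carry similar values incur no penalty, while outputs that assign very different values to neighboring emotions are penalized.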

[0902] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.

[0903] In the above embodiment, an example was given in which the specific process is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and the specific process may be distributed across multiple computers, including the computer 22. For example, a data generation model 58 may be provided in a device external to the data processing device 12, and the external device may generate data according to the input data.

[0904] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes the specific processing according to the specific processing program 56.

[0905] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.

[0906] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.

[0907] The following types of processors can be used as hardware resources to perform specific processing. Examples of processors include a CPU, a general-purpose processor that functions as a hardware resource to perform specific processing by executing software, i.e., a program. Other examples of processors include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which have circuit configurations specifically designed to perform specific processing. All of these processors have built-in or connected memory, and all of them perform specific processing by using memory.

[0908] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.

[0909] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.

[0910] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.

[0911] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of the technical aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that unnecessary parts may be deleted, new elements may be added, or elements may be replaced in the descriptions and illustrations presented above, as long as this does not deviate from the essence of the technical aspects of this disclosure. Furthermore, in order to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge and the like that do not require special explanation to enable the implementation of the technical aspects of this disclosure have been omitted from the descriptions and illustrations presented above.

[0912] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted to be incorporated by reference.

[0913] The following is further disclosed regarding the embodiments described above.

[0914] (Claim 1)

[0915] A speech recognition means that receives audio data and converts that audio into text,

[0916] A method for analyzing the flow, tempo, and timing of speech from text,

[0917] A means of receiving presentation materials and analyzing the layout and content of the materials using image recognition technology,

[0918] A means of integrating the results of audio analysis and the results of document analysis to generate feedback,

[0919] A system that includes means for sending generated feedback to a terminal.

[0920] (Claim 2)

[0921] The system according to claim 1, wherein the terminal transmits audio data and presentation materials to a server.

[0922] (Claim 3)

[0923] The system according to claim 1, which displays an interface on a terminal that suggests ways to improve a presentation using feedback.
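As a non-limiting illustration of the analysis of tempo and timing recited above, the following sketch computes a speaking rate and detects long pauses from recognized text. It assumes that the speech recognition means outputs word-level timestamps; the `Word` structure, the function `analyze_tempo`, and the 1.0-second pause threshold are assumptions introduced for this example and are not specified in this disclosure.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the beginning of the recording
    end: float


def analyze_tempo(words, long_pause=1.0):
    """Return words per minute and the pauses at or above `long_pause` seconds."""
    if not words:
        return {"wpm": 0.0, "long_pauses": []}
    duration = words[-1].end - words[0].start
    wpm = len(words) / duration * 60.0 if duration > 0 else 0.0
    long_pauses = []
    for prev, nxt in zip(words, words[1:]):
        gap = nxt.start - prev.end
        if gap >= long_pause:
            # Record where the pause occurred and how long it lasted.
            long_pauses.append((prev.text, nxt.text, round(gap, 3)))
    return {"wpm": round(wpm, 1), "long_pauses": long_pauses}


# Hypothetical recognizer output for a short utterance.
transcript = [
    Word("Hello", 0.0, 0.5),
    Word("everyone", 0.6, 1.2),
    Word("today", 3.0, 3.5),
    Word("we", 3.6, 3.8),
]
print(analyze_tempo(transcript))
```

The resulting figures could then be combined with the results of the material analysis when generating feedback.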

[0924] "Example 1"

[0925] (Claim 1)

[0926] A recognition means that receives acoustic information and converts that sound into textual information,

[0927] A method for analyzing the flow, speed, and intervals of speech from textual information,

[0928] A means for receiving visual materials and analyzing the structure and content of the materials using image processing technology,

[0929] A means for generating information that suggests areas for improvement by integrating the results of acoustic analysis and the results of data analysis,

[0930] A system that includes means for transmitting generated improvement information to an information terminal.

[0931] (Claim 2)

[0932] The system according to claim 1, wherein an information terminal transmits acoustic information and visual materials to a data processing device.

[0933] (Claim 3)

[0934] The system according to claim 1, which displays a human-machine interface for presenting improvement information on an information terminal.

[0935] "Application Example 1"

[0936] (Claim 1)

[0937] A speech recognition means that receives audio data and converts that audio into text,

[0938] A method for analyzing the flow, tempo, and timing of speech from text,

[0939] A means of receiving presentation materials and analyzing the layout and content of the materials using image recognition technology,

[0940] A means of integrating the results of audio analysis and the results of document analysis to generate feedback,

[0941] A means of sending the generated feedback to the terminal,

[0942] A means of qualitatively improving material for content distribution platforms using feedback,

[0943] A means of generating prompt sentences and inputting them into a generative AI model,

[0944] A method for analyzing audio files and slide images recorded on a device,

[0945] A system that includes this.

[0946] (Claim 2)

[0947] The system according to claim 1, wherein the terminal transmits audio data and presentation materials to a server.

[0948] (Claim 3)

[0949] The system according to claim 1, which displays an interface on a terminal that suggests ways to improve a presentation using feedback.
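The generation of a prompt sentence from the integrated analysis results might, for example, look like the following sketch. The function name `build_feedback_prompt`, the prompt wording, and the result keys are hypothetical, and the call to the generative AI model is intentionally omitted because this disclosure does not specify a particular model or interface.

```python
def build_feedback_prompt(speech_analysis, material_analysis):
    """Assemble a single prompt sentence block from two sets of analysis results."""
    lines = [
        "You are a presentation coach. Suggest concrete improvements.",
        "",
        "Speech analysis:",
    ]
    lines += [f"- {key}: {value}" for key, value in speech_analysis.items()]
    lines += ["", "Material analysis:"]
    lines += [f"- {key}: {value}" for key, value in material_analysis.items()]
    return "\n".join(lines)


# Hypothetical analysis results.
speech_analysis = {"wpm": 63.2, "long_pauses": 1}
material_analysis = {"slides": 12, "text_heavy_slides": 4}

prompt = build_feedback_prompt(speech_analysis, material_analysis)
print(prompt)
```

The returned string would be passed as input to whichever generative AI model the system uses, and the model's response would be sent to the terminal as feedback.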

[0950] "Example 2 of combining an emotion engine"

[0951] (Claim 1)

[0952] A speech recognition device that receives audio data and converts that audio into text,

[0953] An emotion analysis device that detects the speaker's emotional state from audio data,

[0954] A device that receives presentation materials and evaluates the structure and content of the materials using image recognition technology,

[0955] A device that generates feedback by integrating the results of voice analysis, emotion analysis, and document analysis,

[0956] A system including a device that transmits the generated feedback to an information device.

[0957] (Claim 2)

[0958] The system according to claim 1, wherein an information device transmits audio data and presentation materials to a processing device.

[0959] (Claim 3)

[0960] The system according to claim 1, which displays a user interface on an information device that suggests ways to improve a presentation using feedback.

[0961] "Application example 2 when combining with an emotional engine"

[0962] (Claim 1)

[0963] A sound recognition means that receives audio data and converts that audio into text,

[0964] A structure that analyzes the flow, tempo, and timing of speech from the text,

[0965] A method that receives documents and analyzes their layout and content using image recognition technology,

[0966] An algorithm that integrates the results of audio analysis and document analysis to generate feedback,

[0967] A device that transmits the generated feedback to the terminal,

[0968] It has a function to evaluate emotions and provide guidance on improving customer service skills,

[0969] Applied technologies that recognize emotions during the communication process, integrate the results, and enhance feedback,

[0970] A system that includes a process that uses a generative AI model to suggest effective ways to improve conversations based on prompt sentences.

[0971] (Claim 2)

[0972] The system according to claim 1, wherein the terminal transmits voice data and materials to a server to support training in customer service skills.

[0973] (Claim 3)

[0974] The system according to claim 1, comprising a display device on the terminal that provides feedback for suggesting areas for improvement in customer service.

[Explanation of Symbols]

[0975]
10, 210, 310, 410 Data processing systems
12 Data processing device
14 Smart device
214 Smart glasses
314 Headset-type terminal
414 Robot

Claims

1. A speech recognition means that receives audio data and converts that audio into text, A method for analyzing the flow, tempo, and timing of speech from text, A means of receiving presentation materials and analyzing the layout and content of the materials using image recognition technology, A means of integrating the results of audio analysis and the results of document analysis to generate feedback, A system that includes means for sending generated feedback to a terminal.

2. The system according to claim 1, wherein the terminal transmits audio data and presentation materials to a server.

3. The system according to claim 1, which displays an interface on a terminal that suggests ways to improve a presentation using feedback.