Response output system and response output device

What is AI technical title?
AI technical title is built by Patsnap AI team. It summarizes the technical point description of the patent document.
The described system improves user interaction through a large language model-based response output system, addressing configuration gaps in existing AI response technologies by integrating local and external models for enhanced user engagement.

JP2026096875APending Publication Date: 2026-06-15MAXELL LTD

View PDF 1 Cites 0 Cited by

Patent Information

Authority / Receiving Office: JP · JP
Patent Type: Applications
Current Assignee / Owner: MAXELL LTD
Filing Date: 2024-12-03
Publication Date: 2026-06-15

Application Information

Patent Timeline

03 Dec 2024

Application

15 Jun 2026

Publication

JP2026096875A

IPC: G06F16/78

AI Tagging

Application Domain

Metadata video data retrieval

Explore More Agents

Novelty Search
Search existing technologies and assess novelty
↗
FTO
Analyze whether a product may infringe others' patents
↗
Design FTO
Check prior-design risk for exterior design
↗
Drafting
Draft patent application text based on a technical solution
↗
Find Solutions with TRIZ
Generate feasible solution to solve your technical challenge
↗

Similar Technology Patents

Search method and search device for video material segments, electronic device
CN122173678AVideo data clustering/classificationMetadata video data retrieval Computer graphics (images)Data profiling
A video intelligent clipping method and system based on multi-modal semantic analysis
CN122248230ATelevision system details Electronic editing digitised analogue information signals
Hospital clean area dynamic monitoring and regulation method based on laser radar
CN122176631AMechanical apparatus Lighting and heating apparatus Automatic control Point cloud
Personalized media guide for offline media devices
US12659547B2Metadata video data retrieval Selective content distribution
A remote acceptance data management system and method based on a visualized model
CN122175530AVideo data indexing Metadata video data retrievalVisualization modelUnstructured data

Get free access to AI patent search and analysis

Check patentability, review prior art and ask IP Agent with full patent context.

AI Technical Summary

Technical Problem

Existing response output technologies using artificial intelligence lack sufficient configuration for optimal user interaction and response generation.

Method used

A response output system comprising a large language model, a control unit, and an output unit, configured to generate and deliver responses based on the large language model's output, with optional integration with local or external large-scale language models and multimodal capabilities.

Benefits of technology

Enhances the suitability and effectiveness of response output technology by providing more tailored and efficient interactions with users.

✦ Generated by Eureka AI based on patent content.

Smart Images

Figure 2026096875000001_ABST

Patent Text Reader

Abstract

To provide a more suitable artificial intelligence response output technology. According to this invention, it will contribute to Sustainable Development Goals (SDGs) "9. Build resilient infrastructure, promote inclusive and sustainable industrialization and foster innovation" and "11. Make cities and human settlements inclusive, safe, resilient and sustainable." [Solution] A response output system comprising a large-scale language model, a control unit that acquires responses from the large-scale language model to instructions given to the large-scale language model, and an output unit that outputs based on the responses of the large-scale language model acquired by the control unit, wherein the control unit has a control state in which it outputs the output generated based on the responses of the large-scale language model via the output unit.

Need to check novelty before this filing date? Find Prior Art

Description

Technical Field

[0001] The present invention relates to a response output system and a response output device.

Background Art

[0002] Regarding response output technology using artificial intelligence such as a language model, for example, it is disclosed in Patent Document 1.

Prior Art Documents

Patent Documents

[0003]

Patent Document 1

Summary of the Invention

Problems to be Solved by the Invention

[0004] However, in the disclosure of Patent Document 1, the consideration regarding the configuration for more suitably providing the response output technology using artificial intelligence to the user was not sufficient.

[0005] An object of the present invention is to provide a more suitable response output technology.

Means for Solving the Problems

[0006] In order to solve the above problems, for example, the configuration described in the claims is adopted. Although this application includes a plurality of means for solving the above problems, if an example is given, a response output system including a large language model, a control unit that obtains a response from the large language model for an instruction sentence to the large language model; and an output unit that performs an output based on the response of the large language model obtained by the control unit, and the control state by the control unit includes a state in which control for outputting a response generated based on the response of the large language model via the output unit is performed, may be configured.

Effects of the Invention

[0007] According to the present invention, a more suitable response output technology can be provided. Other problems, configurations, and effects will be clarified in the following description of embodiments. [Brief explanation of the drawing]

[0008] [Figure 1A] This figure shows an example of an artificial intelligence response output device and system according to one embodiment of the present invention. [Figure 1B] This figure shows an example of an artificial intelligence response output device according to one embodiment of the present invention. [Figure 1C] This figure shows an example of the operation of an artificial intelligence response output device and system according to one embodiment of the present invention. [Figure 2A] This is an explanatory diagram of an example of a character conversation device and character conversation system according to one embodiment of the present invention. [Figure 2B] This is an explanatory diagram illustrating an example of the operation of a character conversation device and character conversation system according to one embodiment of the present invention. [Figure 2C] This is an explanatory diagram illustrating an example of the operation of a character conversation device and character conversation system according to one embodiment of the present invention. [Figure 2D] This is an explanatory diagram illustrating an example of a conversation in a character conversation device and character conversation system according to one embodiment of the present invention. [Figure 2E] This is an explanatory diagram illustrating an example of the operation of a character conversation device and character conversation system according to one embodiment of the present invention. [Figure 2F] This is an explanatory diagram illustrating an example of the operation of a character conversation device and character conversation system according to one embodiment of the present invention. [Figure 2G] This is an explanatory diagram illustrating an example of the operation of a character conversation device and character conversation system according to one embodiment of the present invention. [Figure 2H] This is an explanatory diagram illustrating an example of the operation of a character conversation device and character conversation system according to one embodiment of the present invention. [Figure 2I] This is an explanatory diagram illustrating an example of the operation of a character conversation device and character conversation system according to one embodiment of the present invention. [Figure 2J] It is an explanatory diagram of an example of the operation of a character conversation device and a character conversation system according to an embodiment of the present invention. [Figure 2K] It is an explanatory diagram of an example of the operation of a character conversation device and a character conversation system according to an embodiment of the present invention. [Figure 2L] It is an explanatory diagram of an example of the operation of a character conversation device and a character conversation system according to an embodiment of the present invention. [Figure 3A] It is an explanatory diagram of an example of a character conversation device and a character conversation system according to an embodiment of the present invention. [Figure 3B] It is an explanatory diagram of an example of the operation of a character conversation device and a character conversation system according to an embodiment of the present invention. [Figure 3C] It is an explanatory diagram of an example of the operation of a character conversation device and a character conversation system according to an embodiment of the present invention. [Figure 3D] It is an explanatory diagram of an example of a conversation in a character conversation device and a character conversation system according to an embodiment of the present invention. [Figure 3E] It is an explanatory diagram of an example of the operation of a character conversation device and a character conversation system according to an embodiment of the present invention. [Figure 3F] It is an explanatory diagram of an example of the operation of a character conversation device and a character conversation system according to an embodiment of the present invention. [Figure 3G] It is an explanatory diagram of an example of the operation of a character conversation device and a character conversation system according to an embodiment of the present invention. [Figure 3H] It is an explanatory diagram of an example of the operation of a character conversation device and a character conversation system according to an embodiment of the present invention. [Figure 3I] It is an explanatory diagram of an example of the operation of a character conversation device and a character conversation system according to an embodiment of the present invention. [Figure 4A] It is an explanatory diagram of an example of the operation of a character conversation device and a character conversation system according to an embodiment of the present invention. [Figure 4B] It is an explanatory diagram of an example of the operation of a character conversation device and a character conversation system according to an embodiment of the present invention. [Figure 5A] It is an explanatory diagram of an example of the operation of an artificial intelligence response output device according to an embodiment of the present invention. [Figure 5B] It is an explanatory diagram of an example of a display example of an artificial intelligence response output device according to an embodiment of the present invention. [Figure 5C] It is an explanatory diagram of an example of a display example of an artificial intelligence response output device according to an embodiment of the present invention. [Figure 5D] It is an explanatory diagram of an example of a display example of an artificial intelligence response output device according to an embodiment of the present invention. [Figure 6] It is an explanatory diagram of an example of the response generation process of an artificial intelligence response output device according to an embodiment of the present invention. [Figure 7] It is a block diagram for explaining the configuration of an artificial intelligence response output system according to an embodiment of the present invention. [Figure 8] It is a diagram for explaining an example of video information and search information according to an embodiment of the present invention. [Figure 9] It is a flowchart showing an example of the flow of response generation in an artificial intelligence response output system according to an embodiment of the present invention. [Figure 10] It is a diagram for explaining the search information acquisition process according to an embodiment of the present invention. [Figure 11] It is a diagram for explaining the search information according to an embodiment of the present invention. [Figure 12] It is a diagram showing an example of the display state of a display unit as a user interface according to an embodiment of the present invention. [Figure 13] It is a flowchart showing an example of the flow of response generation in an artificial intelligence response output system according to an embodiment of the present invention. [Figure 14] It is a diagram for explaining an example of the search accuracy improvement process according to an embodiment of the present invention. [Figure 15] It is a diagram showing an example of the display state of a display unit as a user interface according to an embodiment of the present invention. [Figure 16A]This figure shows an example of the display state of a display unit as a user interface according to one embodiment of the present invention. [Figure 16B] This figure shows an example of the display state of a display unit as a user interface according to one embodiment of the present invention. [Figure 17] This is a flowchart showing an example of the response generation flow in an artificial intelligence response output system according to one embodiment of the present invention. [Figure 18A] This figure illustrates an example of a search accuracy improvement process according to one embodiment of the present invention. [Figure 18B] This figure illustrates an example of a search accuracy improvement process according to one embodiment of the present invention. [Figure 19] This figure shows an example of the display state of a display unit as a user interface according to one embodiment of the present invention. [Figure 20A] This figure shows an example of the display state of a display unit as a user interface according to one embodiment of the present invention. [Figure 20B] This figure shows an example of the display state of a display unit as a user interface according to one embodiment of the present invention. [Figure 21A] This figure shows an example of the display state of a display unit as a user interface according to one embodiment of the present invention. [Figure 21B] This figure shows an example of the display state of a display unit as a user interface according to one embodiment of the present invention. [Figure 22A] This figure shows an example of the display state of a display unit as a user interface according to one embodiment of the present invention. [Figure 22B] This figure shows an example of the display state of a display unit as a user interface according to one embodiment of the present invention. [Figure 23] This figure illustrates an example of a search accuracy improvement process according to one embodiment of the present invention. [Figure 24A] This figure shows an example of the display state of a display unit as a user interface according to one embodiment of the present invention. [Figure 24B]This figure shows an example of the display state of a display unit as a user interface according to one embodiment of the present invention. [Figure 25] This figure shows an example of an artificial intelligence response output device according to one embodiment of the present invention. [Figure 26] This figure illustrates an example of video information and searchable information according to one embodiment of the present invention. [Figure 27] This figure illustrates an example of search information related to one embodiment of the present invention. [Figure 28] This figure illustrates an example of input information and search accuracy improvement processing according to one embodiment of the present invention. [Figure 29A] This figure shows an example of the display state of a display unit as a user interface according to one embodiment of the present invention. [Figure 29B] This figure shows an example of the display state of a display unit as a user interface according to one embodiment of the present invention. [Figure 30] This figure illustrates an example of video information, special information, and searchable information according to one embodiment of the present invention. [Figure 31] This figure illustrates an example of search information related to one embodiment of the present invention. [Figure 32] This is a block diagram illustrating the configuration of an artificial intelligence response output system according to one embodiment of the present invention. [Figure 33] This figure illustrates an example of a search information acquisition process according to one embodiment of the present invention. [Figure 34] This figure illustrates an example of a search information acquisition process and a search accuracy improvement process according to one embodiment of the present invention. [Figure 35] This figure shows an example of the display state of a display unit as a user interface according to one embodiment of the present invention. [Modes for carrying out the invention]

[0009] Embodiments of the present invention will be described in detail below with reference to the drawings. However, the present invention is not limited to the examples described herein, and various modifications and alterations are possible by those skilled in the art within the scope of the technical ideas disclosed herein. Furthermore, in all the figures used to illustrate the present invention, components having the same function are given the same reference numerals, and repeated descriptions may be omitted.

[0010] Furthermore, if an artificial intelligence response output device according to each embodiment of the present invention has a display screen, it may be called a display device. If an artificial intelligence response output device has a voice output function, it may be called a voice output device. The artificial intelligence response output device may simply be called an information processing device. A system including an artificial intelligence response output device and a large-scale language model server that holds a large-scale language model may be called an artificial intelligence response output system. Also, if the artificial intelligence response output device provides a response service of a large-scale language model, which is artificial intelligence, to the user and assists the user, the artificial intelligence response output device or the display output of the artificial intelligence response output device can become an artificial intelligence (AI) assistant for the user. Therefore, in this case, the artificial intelligence response output device may be called an AI assistant device or an AI assistant display device. Similarly, in this case, a system including an artificial intelligence response output device and a large-scale language model server that holds a large-scale language model may be called an AI assistant system or an AI assistant display system. Also, in this case, since the artificial intelligence response output device becomes an interface between the user and artificial intelligence, it may be called an artificial intelligence interface device. In this case, a system including an artificial intelligence response output device and a large-scale language model server that holds a large-scale language model may be called an artificial intelligence interface system.

[0011] <Example 1> As Embodiment 1 of the present invention, an artificial intelligence response output device and system that outputs a response from a large-scale language model artificial intelligence will be described.

[0012] An example of the artificial intelligence response output device 10010 of the present invention will be described using Figure 1A. Furthermore, an example of a system in which the artificial intelligence response output device 10010 cooperates with a large-scale language model server 19001 through communication or other means will be described, including the large-scale language model server 19001 and / or a multimodal large-scale language model server 20001.

[0013] In the example shown in Figure 1A, the artificial intelligence response output device 10010 has a display unit 10011. In the example shown in Figure 1A, the display unit 10011 may be a flat panel display, a screen that projects images from the back, or a floating image that projects an optical image into the air. If the display unit 10011 is a flat panel display, it may be a liquid crystal display having a liquid crystal panel and a backlight. The display unit 10011 may also be a plasma display. The display unit 10011 may also be an organic EL display in which pixels emit light themselves. Furthermore, the display unit 10011 may be equipped with a touch operation input sensor and configured as a touch panel.

[0014] In the example shown in Figure 1A, the voice output unit 1140 of the artificial intelligence response output device 10010 is composed of a speaker. The artificial intelligence response output device 10010 is also equipped with a microphone 1139 that can pick up the user's voice. Through voice input from the microphone 1139 and user operation input via the operation input unit described later, the artificial intelligence response output device 10010 can acquire user input that forms the basis of instructions (prompts) for the large-scale language model, which is an artificial intelligence.

[0015] The artificial intelligence response output device 10010 may have a local large-scale language model within itself. In this case, the response of the large-scale language model may be output as the display output of the display unit 10011 and / or as the audio output of the audio output unit 1140.

[0016] Furthermore, the artificial intelligence response output device 10010 may not have a local large-scale language model, but instead communicate with an external large-scale language model server 19001 and output the response received from the large-scale language model server 19001 as the display output of the display unit 10011 and / or as the audio output of the audio output unit 1140.

[0017] Alternatively, the artificial intelligence response output device 10010 may also include a local large-scale language model and be configured to communicate with an external large-scale language model server 19001 having a large-scale language model or an external large-scale language model server 20001 having a multimodal large-scale language model. In this case, the response of the local large-scale language model and the response received from the large-scale language model server 19001 or the multimodal large-scale language model of the multimodal large-scale language model server 20001 may be switched and output as the display output of the display unit 10011 and / or as the audio output of the audio output unit 1140. Alternatively, the response generated based on both the response of the local large-scale language model and the response received from the large-scale language model server 19001 or the multimodal large-scale language model of the multimodal large-scale language model server 20001 may be output as the display output of the display unit 10011 and / or as the audio output of the audio output unit 1140.

[0018] The configuration when the artificial intelligence response output device 10010 communicates and cooperates with an external large-scale language model server 19001 or large-scale language model server 20001 is as follows. The artificial intelligence response output device 10010 can communicate with a communication device 19011 connected to the internet 19000 via a communication unit 1132. In the example in Figure 1A, the communication between the communication unit 1132 and the communication device 19011 is shown as a wireless example, but wired communication is also acceptable. The communication path from the communication unit 1132 to the communication device 19011 may have both wired and wireless sections, or it may pass through routers or repeaters. Similarly, the communication path from the communication unit 1132 to the internet 19000 may also have both wired and wireless sections, or it may pass through routers or repeaters. The artificial intelligence response output device 10010 can communicate with the large-scale language model server 19001 via the communication device 19011 and the internet 19000. Furthermore, the artificial intelligence response output device 10010 can communicate with the large-scale language model server 19001 or the large-scale language model server 20001, and a second server 19002 that is different from these servers, via the communication device 19011 and the internet 19000. The configuration including the artificial intelligence response output device 10010 and the large-scale language model server 19001 or the large-scale language model server 20001 may be considered as a single system.

[0019] In the following explanation, unless otherwise specified, the term "large-scale language model" should be understood as encompassing the local large-scale language model provided by the artificial intelligence response output device 10010, the large-scale language model provided by the large-scale language model server 19001, and the multimodal large-scale language model provided by the large-scale language model server 20001.

[0020] In the example in Figure 1A, the display unit 10011 shows an example where each element is displayed in two display areas: an instruction display area 10051 where the user inputs an instruction (prompt) to a large-scale language model which is artificial intelligence, and an artificial intelligence response display area 10061 which displays the response from the large-scale language model. In the example in Figure 1A, the instruction display area 10051 shows an example where an icon 10052 representing the user, text such as natural language or software code 10053 as a component of the instruction, an image 10054 as a component of the instruction, a video 10055 as a component of the instruction, etc. In the example in Figure 1A, the artificial intelligence response display area 10061 shows an example where an icon 10062 representing artificial intelligence or an artificial intelligence assistant, text such as natural language or software code 10063 as a component of the response from artificial intelligence, an image 10064 as a component of the response from artificial intelligence, a video 10065 as a component of the response from artificial intelligence, etc. Note that the display example of the display unit 10011 of the artificial intelligence response output device 10010 shown in Figure 1A is merely an example. Depending on the implementation example in which the artificial intelligence response output device 10010 is used, a different display from the example shown in Figure 1A may be used.

[0021] Here, we will explain large-scale language models. Large-scale language models are also referred to as LLMs (Large Language Models). Specifically, various models such as GPT-1, GPT-2, GPT-3, InstructGPT, and ChatGPT have been made publicly available. These technologies can be used in this embodiment as well. These large-scale language models are artificial intelligence models that have been generated through extensive pre-training on natural language contained in a large number of documents and texts that exist in the human world. The number of parameters of these artificial intelligence models exceeds hundreds of millions. Furthermore, in addition to this, there are also models that incorporate reinforcement learning based on human feedback. An example of a base model is a model called Transformer. As an example of training these models, for example, Reference 1 is publicly available.

[0022] [Reference 1] Long Ouyang, et. al. “Training language models to follow instructions with human feedback”, https: / / arxiv.org / pdf / 2203.02155.pdf

[0023] These large-scale language models are capable of natural language translation, natural language text proofreading, and natural language text summarization. More advanced models can even perform natural language question answering (also called dialogue or conversation), natural language suggestion generation, and programming code generation. Because these artificial intelligence models have a very large number of parameters, training requires enormous amounts of data and computing resources. Therefore, training this level of artificial intelligence for a specific application is extremely resource-inefficient. To address this, large-scale pre-training is performed to generate foundation models that can be applied to various uses. For example, the large-scale language model server 19001 shown in Figure 1A may be equipped with such a large-scale language model and configured to be usable by various terminals via an API (Application Programming Interface). Alternatively, the artificial intelligence response output device 10010 shown in Figure 1A may be equipped with a local large-scale language model and configured to use it itself. The training of any large-scale language model can be performed separately through large-scale pre-training to generate it, and the generated large-scale language model can then be duplicated and provided to the large-scale language model server 19001, artificial intelligence response output device 10010, etc. In this way, instead of performing pre-training for each application or terminal, duplicating the large-scale language model, which is the foundation model generated through large-scale pre-training, and using it on individual servers and terminals allows for the sharing of resource consumption used for training, resulting in better resource efficiency.

[0024] Furthermore, even if a large-scale language model is generated as a foundational model through extensive pre-training, it may be configured to perform additional training, such as transfer learning, on individual servers or devices, depending on the application and purpose.

[0025] Furthermore, large-scale language models can pre-train on natural language and perform input / output processing targeting natural language. In addition, multimodal large-scale language model artificial intelligence capable of processing not only natural language text information but also other types of information is also applicable to the embodiments of the present invention. In Figure 1A, a server having a multimodal large-scale language model is shown as the large-scale language model server 20001. For example, specific examples of multimodal large-scale language model artificial intelligence include GPT-4 (see Reference 2) and Gato (see Reference 3), which have been made publicly available. These technologies may also be used in this embodiment. These multimodal large-scale language models are artificial intelligence models generated by performing large-scale pre-training on natural language and other types of information (e.g., images, videos, audio, etc.) contained in numerous documents and texts existing in the human world. Furthermore, there are also models that incorporate reinforcement learning based on human feedback. Hereinafter, information other than natural language text information, such as images, videos, and audio, may be referred to as non-natural language information sources.

[0026] [Reference 2] Open AI “GPT-4 Technical Report”, https: / / cdn.openai.com / papers / gpt-4.pdf [Reference 3] Scott Reed, et. al. “A Generalist Agent”, https: / / arxiv.org / pdf / 2205.06175.pdf

[0027] Next, using Figure 1B, we will describe an example configuration of an artificial intelligence response output device 10010 that receives user input to artificial intelligence such as these large-scale language models and outputs a response from the artificial intelligence such as the large-scale language model to the user input.

[0028] The artificial intelligence response output device 10010 includes a display unit 10011, a control unit 1110, a memory 1109, a non-volatile memory 1108, an external power input interface 1111, an operation input unit 1107, a power supply 1106, a secondary battery 1112, a storage unit 1170, a video control unit 1160, a posture sensor 1113, a communication unit 1132, an audio output unit 1140, a microphone 1139, a video signal input unit 1131, an audio signal input unit 1133, an imaging unit 1180, and the like. The artificial intelligence response output device 10010 may have a large screen, such as a so-called monitor or television.

[0029] The display unit 10011 may be a flat panel display, a screen that projects images from the back, or a display that projects an optical image into the air to show a floating image. If the display unit 10011 is a flat panel display, it may be a liquid crystal display having a liquid crystal panel and a backlight. Alternatively, the display unit 10011 may be a plasma display. The display unit 10011 may be an organic EL display in which pixels emit light themselves. If the display unit 10011 is a panel, it may be called a display panel. The display unit 10011 may be equipped with a touch operation input sensor and configured to accept touch operation input from the user 230's finger. In this case, the display unit 10011 may be configured as a touch panel. Through the user's operation input via the touch panel, the artificial intelligence response output device 10010 can acquire user input that forms the basis of instructions (prompts) to the large-scale language model, which is artificial intelligence.

[0030] The communication unit 1132 may be configured with a Wi-Fi communication interface, a Bluetooth® communication interface, or a mobile communication interface such as 4G or 5G. Using these communication methods, the communication unit 1132 of the artificial intelligence response output device 10010 can communicate with the communication device 19011 connected to the internet 19000. The communication path between the communication unit 1132 and the communication device 19011 may include both wired and wireless sections, and may also pass through routers or repeaters. In the case of a wired connection, the communication unit 1132 may have an Ethernet connection interface as hardware and communicate using a LAN communication method. This allows the artificial intelligence response output device 10010 to communicate with various servers connected to the internet 19000.

[0031] The artificial intelligence response output device 10010 is equipped with a control unit 1110 such as a CPU and a memory 1109, and the control unit 1110 controls the display unit 10011 and the communication unit 1132, etc.

[0032] The power supply 1106 converts the AC current input from an external source via the external power input interface 1111 into DC current and supplies the necessary DC current to each part of the artificial intelligence response output device 10010. The secondary battery 1112 stores the power supplied from the power supply 1106. In addition, the secondary battery 1112 supplies power to each part that requires power via the external power input interface 1111 when power is not supplied from an external source.

[0033] The operation input unit 1107 is, for example, an operation button, a signal receiving unit such as a remote controller, or an infrared light receiving unit, and inputs signals for operations other than touch operations by the user to the touch operation input sensor of the display unit 10011. Separately from the user who touches the touch operation input sensor of the display unit 10011, the operation input unit 1107 may be used, for example, by an administrator to operate the artificial intelligence response output device 10010. Through the user's operation input via the operation input unit 1107, the artificial intelligence response output device 10010 can obtain user input that forms the basis of instructions (prompts) to the large-scale language model, which is artificial intelligence. Note that there may also be a modified configuration in which the touch operation input sensor of the display unit 10011 is included as part of the operation input unit 1107.

[0034] The video signal input unit 1131 receives video data by connecting an external video output device. Various digital video input interfaces are possible for the video signal input unit 1131. For example, it can be configured with an HDMI (High-Definition Multimedia Interface) standard video input interface, a DVI (Digital Visual Interface) standard video input interface, or a DisplayPort standard video input interface. Alternatively, an analog video input interface such as analog RGB or composite video may be provided. The video signal input unit 1131 may also use various USB interfaces.

[0035] The audio signal input unit 1133 receives audio data by connecting an external audio output device. The audio signal input unit 1133 may be configured as an HDMI standard audio input interface, an optical digital terminal interface, or a coaxial digital terminal interface. The audio signal input unit 1133 may also be various USB interfaces. In the case of an HDMI standard interface, the video signal input unit 1131 and the audio signal input unit 1133 may be configured as an interface with integrated terminals and cables.

[0036] The audio output unit 1140 is capable of outputting audio based on audio data input to the audio signal input unit 1133. The audio output unit 1140 is also capable of outputting audio based on audio data stored in the storage unit 1170. The audio output unit 1140 may be configured as a speaker. The audio output unit 1140 may also output built-in operation sounds or error warning sounds. Alternatively, the audio output unit 1140 may be configured to output audio signals as digital signals to external devices, such as the Audio Return Channel function specified in the HDMI standard. Alternatively, the audio output unit 1140 may be configured to output audio signals as analog signals to external devices such as headphones.

[0037] Microphone 1039 is a microphone that picks up sounds from the vicinity of the artificial intelligence response output device 10010, converts them into signals, and generates audio signals. The microphone may be configured to record human voices, such as the user's voice, and the control unit 1110, described later, may perform speech recognition processing on the generated audio signal to obtain textual information from the audio signal. Through the audio input from microphone 1139, the artificial intelligence response output device 10010 can obtain user input that forms the basis of instructions (prompts) for the large-scale language model, which is an artificial intelligence.

[0038] The imaging unit 1180 is a camera having an image sensor. The camera may be provided on the front of the display unit 10011 side of the artificial intelligence response output device 10010, or on the back of the display unit 10011 side. Both a front camera and a rear camera may be provided. In this embodiment, the imaging unit 1180 will be described as having both a front camera and a rear camera.

[0039] The storage unit 1170 is a storage device that records various types of information, such as video data, image data, and audio data. The storage unit 1170 may be composed of a magnetic recording medium such as a hard disk drive (HDD) or a semiconductor memory such as a solid-state drive (SSD). For example, the storage unit 1170 may have various types of information, such as video data, image data, and audio data, pre-recorded in it at the time of product shipment. The storage unit 1170 may also record various types of information, such as video data, image data, and audio data, acquired from external devices or external servers via the communication unit 1132. The video data, image data, etc., recorded in the storage unit 1170 are output to the display unit 10011. The video data, image data, etc., recorded in the storage unit 1170 may also be output to external devices or external servers via the communication unit 1132.

[0040] The video control unit 1160 performs various controls related to the video signal input to the display unit 10011. The video control unit 1160 may also be called a video processing circuit and may be composed of hardware such as an ASIC, FPGA, or video processor. The video control unit 1160 may also be called a video processing unit or image processing unit. The video control unit 1160 performs video switching control, such as determining which video signal to input to the display unit 10011 from among the video signals stored in the memory 1109 and the video signals (video data) input to the video signal input unit 1131. The video control unit 1160 may also perform image processing control on the video signals input from the video signal input unit 1131 and the video signals stored in the memory 1109. Examples of image processing include scaling processing such as enlarging, reducing, and transforming images; brightness adjustment processing to change the brightness; contrast adjustment processing to change the contrast curve of an image; and retinex processing which decomposes an image into its light components and changes the weighting of each component.

[0041] The attitude sensor 1113 is a sensor composed of a gravity sensor, an acceleration sensor, or a combination thereof, and can detect the attitude of the artificial intelligence response output device 10010. Based on the attitude detection result of the attitude sensor 1113, the control unit 1110 may control the operation of each connected part.

[0042] The non-volatile memory 1108 stores various data used by the artificial intelligence response output device 10010. The data stored in the non-volatile memory 1108 includes, for example, data for various operations displayed on the display unit 10011 of the artificial intelligence response output device 10010, display icons, data and layout information for objects operated by the user. Memory 1109 stores video data and device control data displayed on the display unit 10011. The control unit 1110 may read various software from the storage unit 1170, expand it into memory 1109, and store it.

[0043] The local LLM processing unit 10028 has memory capable of holding a large-scale language model (LLM) and can perform inference on the LLM based on the control of the control unit 1110. The hardware can be a so-called GPU (Graphics Processing Unit). The local LLM processing unit 10028 may perform training as well as inference. Note that the local LLM processing unit 10028 is not necessarily required if the execution of LLM inference on the LLM in the local environment of the artificial intelligence response output device 10010 is not necessary.

[0044] The control unit 1110 controls the operation of each connected part. The control unit 1110 may also work in cooperation with a program stored in memory 1109 to perform calculation processing based on information acquired from each part within the artificial intelligence response output device 10010. One of the control states by the control unit 1110 is, for example, the output of responses from the large-scale language model of the local LLM processing unit 10028, or responses from the large-scale language model of the large-scale language model server 19001 or the multimodal large-scale language model of the multimodal large-scale language model server 20001, acquired via the communication unit 1132, through the display unit 10011 or the audio output unit 1140, which is a speaker or the like.

[0045] Furthermore, when input is received from the user via the touch panel, microphone 1139, or operation input unit 1107 as described above, the control unit 1110 can perform the control to generate an instruction sentence based on that input and send it to the local large-scale language model of the local LLM processing unit 10028 of the artificial intelligence response output device 10010, the large-scale language model of the large-scale language model server 19001, or the multimodal large-scale language model of the large-scale language model server 20001, and to obtain a response from these large-scale language models.

[0046] Furthermore, the storage unit 1170 may store a standard response database (which may also be referred to as a standard response DB) for outputting standard phrases as responses to instructions from the artificial intelligence response output device 10010. The control unit 1110 can then perform control to generate the response to be output using the data stored in the standard response database. Figure 1C shows an example of a standard response database. In the example in Figure 1C, the standard response to be output by the artificial intelligence response output device 10010 is stored for each condition assigned a condition number. For example, if the user inputs "Good morning" via the touch panel, microphone 1139, or operation input unit 1107, as in condition number 1, the response should be output using "Good morning" or "Today is the XXth of XX." as the standard response. The XX part, such as "XXth of XX," can be generated using information stored in the memory 1109 or other memory of the artificial intelligence response output device 10010.

[0047] Furthermore, in the example of a standard response phrase in the database shown in Figure 1C, if multiple standard response phrases separated by / are stored, the control unit 1110 can be controlled to randomly select one of the standard response phrases using a random number or the like and output a response. This can resolve and improve the situation where responses under the same conditions become monotonous. The explanation for the examples of condition numbers 2, 3, and 4 is the same as for the example of condition number 1. The control unit 1110 should be controlled to output using the standard response phrases shown in Figure 1C for each example of the condition content shown in Figure 1C.

[0048] Next, we will explain an example of condition number 5 shown in Figure 1C. Condition number 5 is an example of control in which, when the control unit 1110 cannot understand the meaning of user input obtained via the touch panel, microphone 1139, or operation input unit 1107 as natural language, or when there is an obvious grammatical error in the user input, the control unit 1110 outputs a response using a standard response phrase such as "I didn't quite catch that" or "I may not know about that." By responding in this way, the system can prompt the user to input again and wait for corrected user input.

[0049] Next, we will explain an example of condition number 6 shown in Figure 1C. Condition number 6 is an example where the control unit 1110 detects an error (abnormal state) in any of the parts that make up the artificial intelligence response output device 10010 shown in Figure 1B, and user input is received via the touch panel, microphone 1139, or operation input unit 1107. In this case, the control unit 1110 controls the system to output a response using the standard response phrase "It seems to be malfunctioning." By responding in this way, the system can explain to the user that the artificial intelligence response output device 10010 is malfunctioning and prompt the user to take action to address the error.

[0050] The artificial intelligence response output device 10010 may output a response using the response template database (response template DB) described with reference to Figure 1C, instead of a response from a large-scale language model such as the local large-scale language model provided by the artificial intelligence response output device 10010, the large-scale language model provided by the large-scale language model server 19001, or the multimodal large-scale language model provided by the large-scale language model server 20001. Alternatively, it may output a response that combines the responses from these large-scale language models with a response using the response template database (response template DB).

[0051] The response template database (response template DB) shown in Figure 1C described above is stored in the storage unit 1170, and the control unit 1110 of the artificial intelligence response output device 10010 can use it. However, the response template database (response template DB) shown in Figure 1C may also be provided on the large-scale language model server 19001 side or the large-scale language model server 20001 side. In this case, the control unit of the large-scale language model server 19001 or the control unit of the large-scale language model server 20001 should generate responses using the response template database (response template DB). The control unit of the large-scale language model server 19001 or the control unit of the large-scale language model server 20001 should send the response generated using the response template database (response template DB) to the artificial intelligence response output device 10010 instead of the response generated by the large-scale language model stored in their respective servers. In this way, even if the artificial intelligence response output device 10010 is not equipped with a response template database (response template DB), it becomes possible to generate responses using a response template database (response template DB).

[0052] In the above description, the artificial intelligence response output device 10010 was described as having a display panel for a display screen using fixed pixels. This concept may also include a projection-type image display device (projector) in which a projection optical system is provided after the display panel for a display screen using fixed pixels, and the optical image of the image on the display panel for the display screen is projected onto a screen or wall.

[0053] In the examples shown in Figures 1A and 1B, an example was described in which the artificial intelligence response output device 10010 includes a display unit 10011. However, the artificial intelligence response output device 10010 according to the embodiment of the present invention does not necessarily have to include a display unit 10011. For example, even without a display unit 10011, the device can be configured to receive user input to the artificial intelligence via an audio signal input unit 1133 or a microphone 1139, and to output a response from the artificial intelligence, such as a large-scale language model, to the user input via an audio output unit 1140.

[0054] According to the artificial intelligence response output device and artificial intelligence response output system of Embodiment 1 of the present invention described above, it is possible to receive input from a user to artificial intelligence such as a large-scale language model and output a response to the user input generated by the inference of the artificial intelligence, such as a large-scale language model on a server device on the network or a local large-scale language model on the artificial intelligence response output device itself.

[0055] <Example 2> Next, as Embodiment 2 of the present invention, we will describe an example in which the artificial intelligence response output device 10010 described in Embodiment 1 is connected to the internet and operates by connecting to a server equipped with a large-scale language model artificial intelligence via the internet. In this embodiment, we will explain the differences from Embodiment 1, and repeating explanations of configurations similar to those in these embodiments will be omitted.

[0056] Using Figure 2A, an example of the connection state between the artificial intelligence response output device 10010 and the large-scale language model server 19001 of Embodiment 2 of the present invention will be explained. The artificial intelligence response output device 10010 according to Embodiment 2 may also be called a character conversation device. Furthermore, the system including the artificial intelligence response output device 10010 and the large-scale language model server 19001 according to Embodiment 2 may also be called a character conversation system. The display unit 10011 displayed by the artificial intelligence response output device 10010 shows an image of character 19051. The image of character 19051 is an image generated by rendering a 3D model of the character in a virtual space.

[0057] Furthermore, the character in this embodiment can provide the user with the services of a large-scale language model, which is an artificial intelligence, and can assist the user. Therefore, the character can become an artificial intelligence (AI) assistant for the user. In this case, the character conversation device or character conversation system in this embodiment may also be called an AI assistant conversation device, an AI assistant display device, an AI assistant response output device, an AI assistant conversation system, an AI assistant display system, or an AI assistant response output system.

[0058] In the example shown in Figure 2A, the voice output unit 1140 of the artificial intelligence response output device 10010 is composed of a speaker. The artificial intelligence response output device 10010 is also equipped with a microphone 1139 that can pick up the user's voice. The artificial intelligence response output device 10010 can communicate with a communication device 19011 connected to the internet 19000 via a communication unit 1132. In the example shown in Figure 2A, the communication between the communication unit 1132 and the communication device 19011 is shown as wireless, but wired communication is also acceptable. The communication path from the communication unit 1132 to the internet 19000 may have both wired and wireless sections. The artificial intelligence response output device 10010 can communicate with a large-scale language model server 19001 via the communication device 19011 and the internet 19000. Furthermore, the artificial intelligence response output device 10010 can communicate with a second server 19002, different from the large-scale language model server 19001, via the communication device 19011 and the internet 19000. The configuration including the artificial intelligence response output device 10010 and the large-scale language model server 19001 may be considered as a single system.

[0059] Next, an example of the operation of the character conversation device (artificial intelligence response output device 10010) of Embodiment 2 of the present invention will be described using Figure 2B. This can also be described as an example of the operation of a character conversation system including the artificial intelligence response output device 10010 and the large-scale language model server 19001. Note that in Figure 2B, the communication paths such as the internet 19000 shown in Figure 2A have been omitted. In Figure 2B, the user 230 of the artificial intelligence response output device 10010 is also shown.

[0060] Here, we will explain the sequence of operations of the artificial intelligence response output device 10010. The artificial intelligence response output device 10010 loads the character operation program stored in the storage unit 1170 or the like into the memory 1109, and the control unit 1110 executes the character operation program, thereby enabling the various processes described below.

[0061] First, the artificial intelligence response output device 10010 is equipped with a microphone 1139. When user 230 speaks to character 19051, the microphone 1139 picks up the user's voice (words from the user) and converts it into an audio signal. The character operation program executed by the control unit 1110 then extracts the text of the words spoken by user 230 from the audio signal. This text is in natural language. The extraction of the text of the words spoken by user 230 may be performed continuously for all words, or it may be started when the user speaks within a predetermined period following a trigger keyword. For example, the trigger keyword could be when the user says "Hello" followed by the character's name. For example, if character 19051's name is "Koto," then "Hello, Koto!" can be used as the trigger keyword.

[0062] The character operation program of the artificial intelligence response output device 10010 creates an instruction (prompt) based on the text of the words spoken by the user 230, and sends the instruction to the large-scale language model server 19001 using an API. Here, the instruction may be metadata containing information written using notation such as markup format of a markup language using tags, notation using predetermined symbols such as Markdown format, or object notation of a predetermined script such as JSON. The instruction contains natural language text information as the main message. The types of instruction sent from the artificial intelligence response output device 10010 to the large-scale language model server 19001 include setting instruction statements that store instructions such as initial settings, and user instruction statements that reflect instructions from the user. Type identification information that identifies whether the instruction statement is a setting instruction statement or a user instruction statement may be stored in a part of the instruction statement other than the main message. When the character motion program of the artificial intelligence response output device 10010 creates an instruction sentence (prompt) based on the text of the words spoken by the user 230, it creates a user instruction sentence and sends it to the large-scale language model server 19001.

[0063] Next, the large-scale language model of the artificial intelligence in the large-scale language model server 19001 performs inference based on the instruction sent from the artificial intelligence response output device 10010, and generates a response containing natural language text information based on the result. The large-scale language model server 19001 sends the response to the artificial intelligence response output device 10010 using an API. The response contains natural language text information as the main message. Here, the response may also be metadata containing information written in the same format as the instruction mentioned above (notation using tags such as the markup format of a markup language, notation using predetermined symbols such as the Markdown format, or object notation of a predetermined script such as JSON). If the same format as the instruction mentioned above is used in the response, type identification information may be stored in a part other than the main message to indicate that it is a different type of information from the initial setting instruction and the user instruction mentioned above. For example, information indicating that it is a response from the large-scale language model may be stored.

[0064] Next, the artificial intelligence response output device 10010 receives a response from the large-scale language model server 19001 and extracts the natural language text information stored as the main message in that response. Based on the natural language text information extracted from the aforementioned response, the character operation program of the artificial intelligence response output device 10010 uses speech synthesis technology to generate natural language speech that serves as a response to the user, and outputs it from the speaker, the voice output unit 1140, so that it sounds as if it were the voice of character 19051. This process may also be described as the character "uttering".

[0065] As described above, the processing by the artificial intelligence response output device 10010 and the large-scale language model server 19001 allows for specific examples of the voice responses of character 19051 to words from user 230, as shown in conversation examples 1-5 in Figure 2C. In this way, user 230 can converse with character 19051 as if it were a real person.

[0066] As described above, with the artificial intelligence response output device 10010 or the system including the artificial intelligence response output device 10010 shown in Figure 2B, it is not necessary to install the large-scale language model itself, which requires a vast amount of data and computing resources for training, into the artificial intelligence response output device 10010 itself. Furthermore, the advanced natural language processing capabilities of the large-scale language model can be utilized via an API, enabling the character to provide more appropriate responses and engage in more suitable conversations when the user speaks to it.

[0067] Next, using Figure 2D, an example of the operation of the character conversation device (artificial intelligence response output device 10010) of Embodiment 2 of the present invention will be described. This can also be described as an example of the operation of a character conversation system including the artificial intelligence response output device 10010 and the large-scale language model server 19001. Specifically, Figure 2D is an example of the natural language text of the main message of the instruction sent from the artificial intelligence response output device 10010 to the large-scale language model server 19001, which forms the basis of the conversation between the character 19051 displayed on the artificial intelligence response output device 10010 and the user 230, and the natural language text of the main message of the server response that is the response.

[0068] Furthermore, Figure 2D shows the exchange of instructions and responses in chronological order, from the display setting instruction to the first round of user instructions and their responses, up to the fourth round of user instructions and their responses.

[0069] As shown in Figure 2D, the large-scale language model server 19001 can be instructed by the configuration instructions to provide initial settings to the large-scale language model of the artificial intelligence, such as the name of the large-scale language model itself, the role it should play, and the characteristics of the conversation. The user's name can also be made to understand the initial settings. As a result, the large-scale language model generates responses from the first round onward while adhering to the assigned role. When a user hears the voice of character 19051 based on these responses from the first round onward, it will feel as if character 19051 is embodying the settings and personality of the person described in the configuration instructions. Furthermore, the large-scale language model server 19001 in this embodiment is equipped with memory that stores the content of the conversation until the series of conversations is completed, and is configured to generate responses after storing a series of user instructions and their responses. This makes it possible to realize a conversation like the one shown in Figure 2D.

[0070] Next, using Figure 2E, an example of the operation of the character conversation device (artificial intelligence response output device 10010) of Embodiment 2 of the present invention will be described. This can also be described as an example of the operation of a character conversation system including the artificial intelligence response output device 10010 and the large-scale language model server 19001. Specifically, Figure 2E is an example of the natural language text of the main message of the instruction sent from the artificial intelligence response output device 10010 to the large-scale language model server 19001, which forms the basis of the conversation between the character 19051 displayed on the artificial intelligence response output device 10010 and the user 230, and the natural language text of the main message of the server response that is the response.

[0071] Figure 2E shows an example of a new conversation that takes place when user 230 speaks to character 19051 again after the series of conversations shown in Figure 2D has ended. In Figure 2E, the exchange of instructions and responses is shown chronologically, from the first round of user instructions and their responses to the third round of user instructions and their responses.

[0072] Here, "termination" of "continuation of a series of conversations" refers to the process by which the large-scale language model server 19001 erases the conversation memory it held while the series of conversations was continuing, when predetermined conditions are met. An example of predetermined conditions is when the artificial intelligence response output device 10010 instructs the large-scale language model server 19001 to "terminate" the "continuation of a series of conversations" using an instruction statement. Another example of predetermined conditions is when a predetermined amount of time has passed since the artificial intelligence response output device 10010 stopped sending instruction statements to the large-scale language model server 19001 regarding the series of conversations (timeout). Furthermore, in the connection between the artificial intelligence response output device 10010 and the large-scale language model server 19001, authentication may be lost due to factors such as communication interruption or the power of the artificial intelligence response output device 10010 being turned off while the instruction statements and responses are being exchanged after the authentication process has been performed.

[0073] Furthermore, once the "continuation of a series of conversations" ends, the large-scale language model server 19001 erases the memories of the conversations it held while the series of conversations was ongoing. Therefore, even though the conversation shown in Figure 2E takes place after the series of conversations shown in Figure 2D, the server response to the user instruction is one in which the server has no memory whatsoever of the character's name, the role to be played, the characteristics of the conversation, the user's name, etc., that were included in the setting instruction shown in Figure 2D. Similarly, the conversation shown in Figure 2E is a response in which there is no memory whatsoever of the series of conversations shown in Figure 2D. In other words, the "end" of the "continuation of a series of conversations" shown in Figure 2D means that the conversation in Figure 2E starts from a state in which the large-scale language model of the artificial intelligence of the large-scale language model server 19001 has been initialized.

[0074] This can make user 230 feel as if character 19051 has lost their memories of them, or as if they are a completely different person. From user 230's perspective, the character's response feels very unnatural, resulting in a lonely and disappointing experience. Such behavior presents a challenge in that it is impossible to ensure the identity of settings and memories such as the name, role, conversational characteristics, and personality of character 19051 displayed on the artificial intelligence response output device 10010.

[0075] Next, using Figure 2F, an example of the operation of the character conversation device (artificial intelligence response output device 10010) of Embodiment 2 of the present invention will be described. This can also be described as an example of the operation of a character conversation system including the artificial intelligence response output device 10010 and the large-scale language model server 19001. Specifically, Figure 2F is an example of the natural language text of the main message of the instruction sent from the artificial intelligence response output device 10010 to the large-scale language model server 19001, which forms the basis of the conversation between the character 19051 displayed on the artificial intelligence response output device 10010 and the user 230, and the natural language text of the main message of the server response that is the response.

[0076] Figure 2F shows an example of a case where, after the continuation of the series of conversations shown in Figure 2D has ended, user 230 speaks to character 19051 again to start a new conversation. Unlike the process in Figure 2E, in the process in Figure 2F, when a new conversation is started, the artificial intelligence response output device 10010 sends a setting instruction as the first instruction to the large-scale language model server 19001. This setting instruction contains the same natural language text as the initial setting instruction in Figure 2D. This may also be referred to as the reconfiguration text. This setting instruction then contains natural language text that explains the history of past conversations. This may also be referred to as the conversation history text. The history of past conversations can be recorded by the artificial intelligence response output device 10010 as natural language text information in the storage unit 1170, linked to the date and time information of the conversation, while the continuation of the series of conversations described in Figure 2D is taking place. If there are conversations on different dates, each conversation can be recorded linked to the date and time information, and the conversation history can be accumulated. When generating the initial instruction for a later conversation, as shown in Figure 2F, the natural language text information of the conversation and the date and time information of the conversation recorded in the storage unit 1170 can be read and used to generate the instruction.

[0077] When using natural language text information from past conversation history to generate the setting instruction statement, the format can be determined somewhat freely, as this data is sent to a large-scale language model. However, as shown in Figure 2F, it is advisable to prepare natural language prefixes and suffixes such as "I said the following on [date]," and "You said the following on [date]," and then combine them with the recorded conversation's natural language text information to generate the text of the setting instruction statement. Additionally, the date and time information of the conversation read from the storage unit 1170 may be combined with the "[date]" portion and used as part of the text of the setting instruction statement.

[0078] Even if user 230 speaks to character 19051 again to initiate a new conversation after a series of conversations has ended, performing the generation and transmission process of the setting instruction text shown in Figure 2F as described above will ensure that subsequent user instruction texts reflect the character's role, name, conversational characteristics, personality, and / or conversational characteristics settings and conversation history from the previous conversation. This is preferable because it allows the user to perceive a greater degree of consistency in the character's role, name, conversational characteristics, or personality settings and memories from the previous conversation.

[0079] Next, using Figure 2G, an example of the operation of the character conversation device (artificial intelligence response output device 10010) of Embodiment 2 of the present invention will be described. This can also be described as an example of the operation of a character conversation system including the artificial intelligence response output device 10010 and the large-scale language model server 19001. Specifically, Figure 2G is an example of the natural language text of the main message of the instruction sent from the artificial intelligence response output device 10010 to the large-scale language model server 19001, which forms the basis of the conversation between the character 19051 displayed on the artificial intelligence response output device 10010 and the user 230, and the natural language text of the main message of the server response that is the response.

[0080] Figure 2G shows an example of a series of conversations in the same conversation as shown in Figure 2F, specifically the first round of user instructions and their responses, followed by the third round of user instructions and their responses. In Figure 2G, the exchange of instructions and responses is shown chronologically. The content of the setting instructions is the same as shown in Figure 2F, so repeated descriptions are omitted.

[0081] As shown in the natural language text of the server response in the table in Figure 2F, by using the setting instructions shown in Figure 2F, the server response by the large-scale language model artificial intelligence of the large-scale language model server 19001 will reflect the settings and conversation history of the character, such as the character's role, name, conversational characteristics, or personality, at the time of the previous conversation. This is preferable because, from the user's perspective, it is perceived that the identity of the character's settings and memories, such as the character's role, name, conversational characteristics, or personality, at the time of the previous conversation is better maintained. This can also be called pseudo-identity of the character from the user's perspective, as the character can be perceived as identical.

[0082] Furthermore, from the user's perspective, they can share memories with the character, resulting in a more enjoyable character conversation experience.

[0083] Next, using Figure 2H, an example of the operation of the character conversation device (artificial intelligence response output device 10010) of Embodiment 2 of the present invention will be described. This can also be described as an example of the operation of a character conversation system including the artificial intelligence response output device 10010 and the large-scale language model server 19001. Specifically, Figure 2H shows an example of operation in which the character to be displayed on the display unit 10011 of the artificial intelligence response output device 10010 is switched from among multiple character candidates. The character operation program executed by the control unit 1110 of the artificial intelligence response output device 10010 can switch the displayed character based, for example, on operation inputs input to the operation input unit 1107 or operations detected by the touch operation input sensor of the display unit 10011.

[0084] In the example in Figure 2H, in addition to character 19051 (named "Koto") used in the explanations of Figures 2A to 2G, characters 19052 (named "Tom") and 19053 (named "Necco") are shown. Characters 19051 (named "Koto") and 19052 (named "Tom") are human characters, while character 19053 (named "Necco") is a cat character. Switching the display of characters on the display unit 10011 can be done by rendering different virtual 3D space characters for each character and switching the displayed image on the display unit 10011. The processing to realize the display of the rendered image of the 3D model of each character can be done by performing one of the first to third processing examples explained in Figure 15A, for example. In addition, for some characters, a dynamic 2D image may be displayed.

[0085] Furthermore, when the character operation program executed by the control unit 1110 switches the display of the characters shown on the display unit 10011, it is preferable that the synthesized voice used for each character's "utterance" is also changed. This can be done by pre-storing synthesized voice data with corresponding voice to each character in the storage unit 1170, and then performing the synthesized voice change process when switching the display of the characters.

[0086] In the example shown in Figure 2H, the system is configured so that user 230 can converse with any of the characters. The artificial intelligence response output device 10010 in Figure 2H assigns different roles, names, conversational characteristics, or personalities to each of these characters. Furthermore, the memories of each character based on their conversation history are managed separately for each character.

[0087] Therefore, the artificial intelligence response output device 10010 constructs the database shown in Figure 2I in the storage unit 1170, and uses this database to manage character settings and the character's conversation history.

[0088] Next, using Figure 2I, an example of the operation of the character conversation device (artificial intelligence response output device 10010) of Embodiment 2 of the present invention will be described. This can also be described as an example of the operation of a character conversation system including the artificial intelligence response output device 10010 and the large-scale language model server 19001. Specifically, Figure 2I is an explanatory diagram of the database 19200 for managing character settings and character conversation history for multiple characters displayed on the display unit 10011 of the artificial intelligence response output device 10010.

[0089] The character operation program executed by the control unit 1110 of the artificial intelligence response output device 10010 constructs the database 19200 in the storage unit 1170, for example. The character ID is an identification number that identifies each of the multiple characters that can be displayed by the artificial intelligence response output device 10010, and may be a natural number or use the alphabet, etc. The name is data of the name of each of the multiple characters that can be displayed by the artificial intelligence response output device 10010.

[0090] The initial setup instruction is natural language text information that describes the settings of each of the multiple characters that can be displayed by the artificial intelligence response output device 10010, such as their role, name, conversational characteristics, or personality. Since this initial setup instruction is the main data of the setting instruction sent from the artificial intelligence response output device 10010 to the large-scale language model server 19001, it is desirable that the content be such that the large-scale language model of the artificial intelligence in the large-scale language model server 19001 can read it directly.

[0091] The conversation history, numbered 1, 2, ..., is a record of the conversations between each character and the user, and is recorded separately for each character. Since this conversation history will be included in the natural language text information, which is the main data of the setting instruction sent from the artificial intelligence response output device 10010 to the large-scale language model server 19001, it is desirable that the content be such that the large-scale language model of the artificial intelligence in the large-scale language model server 19001 can read it directly.

[0092] The character operation program executed by the control unit 1110 of the artificial intelligence response output device 10010, when the character displayed on the display unit 10011 of the artificial intelligence response output device 10010 is switched, uses the database 19200 in Figure 2I to select and switch the initial setting instruction text and conversation history used for the natural language text information, which is the main data of the setting instruction text sent from the artificial intelligence response output device 10010 to the large-scale language model server 19001, so that they correspond to the character displayed on the display unit 10011 of the artificial intelligence response output device 10010. In addition, each time a conversation takes place between the user 230 and the character, the character operation program records the history of that conversation in the conversation history area of the database 19200 in Figure 2I that corresponds to the character displayed on the display unit 10011.

[0093] By using the database 19200, the character operation program executed by the control unit 1110 of the artificial intelligence response output device 10010 uses the same large-scale language model of the same artificial intelligence from the same large-scale language model server 19001 to establish a conversation between the user 230 and the character. However, from the user's perspective, the uniqueness of each character's personality and other settings is preserved, and the memory of different conversations continues for each character. From the user's perspective, this is preferable because it is perceived that the identity of the character's role, name, conversational characteristics, or personality settings and memories from previous conversations is more readily maintained for each character. This can also be expressed as ensuring a pseudo-identity of each character from the user's perspective.

[0094] Therefore, even when the artificial intelligence response output device 10010 is configured to switch between displaying characters from among multiple character candidates on the display unit 10011, the operation using the database 19200 described above will result in a less jarring experience for the user in conversations with each character, and will allow them to share memories with each of the multiple characters, providing a more enjoyable character conversation experience.

[0095] Furthermore, if the user is prevented from editing the initial setting instructions for multiple characters, the settings for each character, such as their role, name, conversational characteristics, or personality, can be maintained in a state close to the intentions of the provider of the artificial intelligence response output device 10010 or the creator of the character's content. Alternatively, the user may be allowed to edit the initial setting instructions for characters in response to input from the operation input unit 1107 or the like. In this case, the user can customize the character's role, name, conversational characteristics, or personality, and converse with a character they have set up themselves. In this case, the character's 3D model, its rendered image, and the type of synthesized voice may also be replaced accordingly.

[0096] Next, using Figure 2J, an example of the operation of the character conversation device (artificial intelligence response output device 10010) of Embodiment 2 of the present invention will be described. This can also be described as an example of the operation of a character conversation system including the artificial intelligence response output device 10010 and the large-scale language model server 19001. Specifically, a method for providing a character conversation service using a character conversation device with the artificial intelligence response output device 10010, or a character conversation system with the artificial intelligence response output device 10010 and the large-scale language model server 19001, at a lower cost will be described.

[0097] As explained in Figure 2B, training a large-scale language model to achieve this level of artificial intelligence for a specific application is extremely resource-inefficient. Therefore, it is more resource-efficient to generate a foundation model that can be applied to various uses by performing large-scale training, and then making it available on various devices via an API (Application Programming Interface). In this case, the provider of the large-scale language model often recovers the cost used to train the model from the user of the device as a usage fee for the device's API. In natural language models, the API usage fee is often charged based on the number of tokens (word units that divide sentences) that are processed.

[0098] Therefore, in the artificial intelligence response output device 10010 of Embodiment 2 of the present invention, by reducing the number of tokens in the natural language text information transmitted between the artificial intelligence response output device 10010 and the large-scale language model server 19001 using an API, it is possible to provide users with a character conversation device using the artificial intelligence response output device 10010, and a character conversation service using a character conversation system with the artificial intelligence response output device 10010 and the large-scale language model server 19001 at a lower cost.

[0099] For example, by using the processing and configurations shown in Examples 1 to 3 in the table in Figure 2J, it is technically possible to reduce the number of tokens in the natural language text information transmitted between the artificial intelligence response output device 10010 and the large-scale language model server 19001 using an API.

[0100] Example 1 is an example of a method to reduce the number of tokens in the conversation history text stored and transmitted in the API configuration instructions, specifically by using document summarization processing to shorten the conversation history text and reduce the number of tokens. For example, the natural language of the conversation history with the character recorded in the storage unit 1170 is summarized and recorded. While document summarization can be performed at the start of the next conversation, it is more time-efficient to perform it at the end of a "series of conversations".

[0101] Alternatively, the text summarization process may be requested from the large-scale language model server 19001 itself. However, in this case, the token saving effect is low. Therefore, for example, if the second server 19002 provides natural language text summarization processing via an API at a lower cost than the large-scale language model server 19001, the text summarization processing can be requested from the second server 19002 via the API, and the text summary of the conversation history can be stored in a configuration instruction message to the large-scale language model server 19001 and transmitted.

[0102] Furthermore, if only text summarization processing is required, it can also be done on the terminal side. The control unit 1110 may execute a document summarization program, which is loaded into the memory 1109 of the artificial intelligence response output device 10010, to perform text summarization. In this case, the token saving effect is high. Also, even if the conversation history becomes long, by specifying an upper limit on the number of characters after summarization in the text summarization processing, the upper limit on the length of the conversation history text is determined, so an upper limit on tokens can be set, and token saving is possible.

[0103] Furthermore, since the amount of text information for initial character settings, such as character roles, names, conversational characteristics, or personalities, does not increase as much as the amount of conversation history, it is efficient and preferable to maintain the text information in the initial character setting instructions while reducing the number of text tokens in the conversation history.

[0104] The process described in Example 1 can be carried out by a character motion program executed by the control unit 1110, which controls each part.

[0105] Example 2 is another example of reducing the number of tokens in the conversation history text stored and transmitted in the API configuration instructions. For example, the number of tokens can be reduced by deleting older conversation histories from the conversation history with the character recorded in the storage unit 1170. Specifying an upper limit on the number of characters in the conversation history determines the upper limit on the length of the conversation history text, thus setting an upper limit on the number of tokens and enabling token saving. Alternatively, a predetermined period for the conversation history can be specified, and conversation histories exceeding that period can be deleted. In this case as well, token saving is possible. In Example 2 as well, since the text information of the character's initial settings, such as the character's role, name, conversation characteristics, or personality, does not increase as much as the conversation history, it is efficient and preferable to maintain the text information in the character's initial settings instructions and reduce the number of tokens in the conversation history text information.

[0106] The process described in Example 2 can be carried out by the character motion program executed by the control unit 1110, which controls each part.

[0107] Example 3 is a method to reduce the number of tokens by reducing the frequency of sending configuration instructions using the API. Specifically, after the device power is turned on or after the display character is switched, and even after the video settings and synthesized voice settings for the displayed character are completed, the configuration instructions are not sent in advance. Instead, the configuration instructions are sent to the large-scale language model server 19001 only when the control unit 1110 determines that the natural language text information contained in the user's speech picked up by the microphone 1139 is text information that should be used with a large-scale language model of artificial intelligence. This reduces the frequency of sending configuration instructions to the large-scale language model server 19001 and thus reduces the number of tokens.

[0108] Specifically, for example, after the device is powered on (ON) or after an operation input to switch the displayed character, the display unit 10011 is displayed on the display unit 10011 as shown in Figure 2H, due to the display processing of the display unit 10011 under the control of a character operation program executed by the control unit 1110. At this time, for example, if a synthesized voice for the character 19051's appearance is stored in the storage unit 1170 or the like, the synthesized voice for the character's appearance, such as "Good morning. I'm Koto," "Good afternoon. I'm Koto," or "Good evening. I'm Koto," may be output from the speaker, which is the audio output unit 1140. At this time, the image of character 19051 is already set as the image of the character displayed on the display unit 10011, and the synthesized voice output from the speaker, which is the audio output unit 1140, is set to the synthesized voice corresponding to character 19051.

[0109] Here, as already explained, the inference processing of the large-scale language model of the artificial intelligence in the large-scale language model server 19001 also takes time if the instruction sentence is long. In particular, if the setting instruction sentence includes text information about past conversation history, the number of tokens in the instruction sentence increases, and the inference processing time becomes especially long. The setting instruction sentence itself and its response are not output to the user 230. From the user's response to the setting instruction sentence, the synthesized voice as the character's "utterance" is output from the speaker, which is the voice output unit 1140. Then, it would seem preferable to send the setting instruction sentence in advance from the artificial intelligence response output device 10010 to the large-scale language model server 19001 and complete the inference processing of the large-scale language model for the setting instruction sentence in advance, because this would speed up the output of the synthesized voice of the character 19051's "utterance" after the user 230 speaks to the character 19051.

[0110] However, if the large-scale language model server 19001 is pre-processed by sending a setting instruction to the setting instruction before user 230 speaks, and the inference processing of the large-scale language model for the setting instruction is completed in advance, for example, if user 230 turns off the power of the artificial intelligence response output device 10010 by operating the operation input unit 1107 or the touch operation input sensor of the display unit 10011, or if user 230 switches the displayed character from character 19051 to another character by operating the operation input unit 1107 or the touch operation input sensor of the display unit 10011, then the number of tokens processed by the large-scale language model server 19001 after the setting instruction was pre-processed will be the number of processing tokens for which usage fees have been wasted. This hinders the provision of character conversation devices using the artificial intelligence response output device 10010, and character conversation services using the character conversation system using the artificial intelligence response output device 10010 and the large-scale language model server 19001, to users at a lower cost.

[0111] Therefore, it is desirable that the artificial intelligence response output device 10010, after the device is powered on (ON) or after an operation input to switch the displayed character, controls the character operation program executed by the control unit 1110 to set the image of character 19051 as the image of the character displayed on the display unit 10011, and sets the synthesized voice output from the speaker, which is the sound output unit 1140, to the synthesized voice corresponding to character 19051, but continues not to send setting instruction text to the large-scale language model server 19001 until the point in time when it recognizes that the user 230 is going to speak to character 19051.

[0112] Here, the point at which it is recognized that user 230 is speaking to character 19051 may be, for example, the point at which the trigger keyword described in Figure 2B is detected, or the point at which the text of the words spoken by user 230 is extracted. In this way, the number of processing tokens that unnecessarily waste usage fees can be reduced, and the character conversation device using the artificial intelligence response output device 10010, or the character conversation system using the artificial intelligence response output device 10010 and the large-scale language model server 19001, can be provided to users at a lower cost.

[0113] Furthermore, even after the point in time when the system recognizes that user 230 is speaking to character 19051, it is desirable to continue not sending setting instructions to the large-scale language model server 19001, for example, if the text information extracted from user 230's voice picked up by microphone 1139 corresponds to preset keywords that do not require inference processing by a large-scale language model. Specifically, examples of preset keywords include those used by user 230 to request character 19051 to react, such as by performing an animation or emitting synthesized speech, such as "Try jumping" or "Try dancing." In this case, the character motion program executed by the control unit 1110 can read the motion data, animation video, and / or synthesized speech data corresponding to the reaction stored in the storage unit, and use this data to generate the video to be displayed on the display unit 10011 and to output synthesized speech from the speaker, which is the audio output unit 1140.

[0114] Such processing does not necessarily require the inference processing of the large-scale language model on the large-scale language model server 19001. If, after such processing, the user 230 turns off the power of the artificial intelligence response output device 10010 by operating via the touch input sensor of the operation input unit 1107 or the display unit 10011, or if, for example, the user 230 switches the displayed character from character 19051 to another character by operating via the touch input sensor of the operation input unit 1107 or the display unit 10011, if the setting instruction statement is sent to the large-scale language model server 19001 first and processed by the inference processing of the large-scale language model, the number of tokens for that processing will be a waste of usage fees.

[0115] Therefore, even after the point in time when the system recognizes that user 230 is speaking to character 19051, it is desirable to continue not sending the configuration instruction to the large-scale language model server 19001 until, for example, it is determined whether the text information extracted from user 230's voice captured by microphone 1139 corresponds to text information of preset keywords that do not require inference processing of the large-scale language model. Only when this determination determines that inference processing of the large-scale language model is necessary should the configuration instruction be sent to the large-scale language model server 19001 and the inference processing of the large-scale language model be advanced.

[0116] Furthermore, the process described in Example 3 can be carried out by the character motion program executed by the control unit 1110, which controls each part.

[0117] As described above, the methods for reducing (saving) the number of processing tokens for large-scale language models, as shown in the examples in Figure 2J, allow us to provide users with character conversation services using the artificial intelligence response output device 10010 and the character conversation system using the artificial intelligence response output device 10010 and the large-scale language model server 19001 at a lower cost.

[0118] Next, an example of the display of the character conversation device (artificial intelligence response output device 10010) of Embodiment 2 of the present invention will be described using Figure 2K. The example in Figure 2K shows an example of displaying the response from the large-scale language model to the user instruction sentences described in Figures 2A to 2J on the display unit 10011 of the character conversation device (artificial intelligence response output device 10010). Specifically, it is an example of displaying the text 10063, which is the response from the large-scale language model, together with the image of character 19051 on the display unit 10011. The text 10063, which is the response from the large-scale language model, may be displayed superimposed in front of the image of character 19051, as shown in Figure 2K. Alternatively, the text 10063, which is the response from the large-scale language model, may be displayed together with the image of character 19051 without being superimposed on it.

[0119] The display in Figure 2K is just one example, but for instance, if user 230 adjusts the volume of the voice output unit 1140 of the character conversation device (artificial intelligence response output device 10010) to the minimum or sets the voice output to OFF by operating via the touch input sensors of the operation input unit 1107 or the display unit 10011, user 230 will not be able to confirm the response from the large-scale language model by voice.

[0120] Therefore, in this case, the control unit 1110 may be controlled to start a display mode in which the text 10063, which is the response from the large-scale language model, is displayed together with the image of the character 19051, as shown in Figure 2K. In this way, even when it is desired to reduce voice output, the user 230 can more conveniently use the character conversation device (artificial intelligence response output device 10010). Furthermore, the user 230 may be configured to manually switch ON / OFF the display mode in which the text 10063, which is the response from the large-scale language model, is displayed together with the image of the character 19051, by operating the operation input unit 1107 or the touch operation input sensor of the display unit 10011.

[0121] Next, using Figure 2L, we will explain an example of a response template database (response template DB) in a character conversation device (artificial intelligence response output device 10010) capable of displaying multiple characters, as described in Figures 2H and 2I. In the example in Figure 2L, the condition numbers and condition contents are the same as in Figure 1C. For these conditions, in the example in Figure 2L, individual response templates are set for each of the multiple characters. For example, for each of the three characters described in Figures 2H and 2I—character 1: Koto, character 2: Tom, and character 3: Necco—a response template for each condition is stored. The output control of the response templates is the same as in Figure 1C, so a repeated explanation will be omitted.

[0122] In the example shown in Figure 2L, the control unit 1110 selects a corresponding predefined response from the predefined response database (predefined response DB) based on the character displayed in the character conversation device (artificial intelligence response output device 10010) and the current conditions, and uses it to control the output as a response issued by the character. For example, in the example of the predefined response database (predefined response DB) in Figure 2L, even under the same conditions, the predefined response is changed to an expression or content that corresponds to the personality of each character. As a result, the character conversation device (artificial intelligence response output device 10010) can provide the user with conversations that correspond to the personality of the displayed character. The user can feel that each character is a being with a more consistent personality. This makes it possible to realize a character conversation device (artificial intelligence response output device 10010) that gives multiple characters a greater sense of reality.

[0123] The response template database (response template DB) shown in Figure 2L, as described above, is stored in the storage unit 1170, and the control unit 1110 of the artificial intelligence response output device 10010 can use it. However, the response template database (response template DB) shown in Figure 2L may also be provided on the large-scale language model server 19001 side. In this case, the control unit of the large-scale language model server 19001 should generate responses using the said response template database (response template DB). The control unit of the large-scale language model server 19001 should send the response generated using the said response template database (response template DB) to the artificial intelligence response output device 10010 instead of the response generated by the large-scale language models stored in each server. In this way, even if the artificial intelligence response output device 10010 is not equipped with a response template database (response template DB), it is possible to generate responses using the response template database (response template DB).

[0124] As described above, the character conversation device and character conversation system according to Embodiment 2 can reduce the sense of incongruity that users feel when conversing with the character displayed on the artificial intelligence response output device 10010. Furthermore, the character conversation device and character conversation system according to Embodiment 2 can provide character conversation services to users at a lower cost.

[0125] In the above description of Example 2, an example was described in which the large-scale language model possessed by the large-scale language model server 19001 is used as the large-scale language model. In contrast, the character conversation device (artificial intelligence response output device 10010) may be configured to include the local LLM processing unit 10028 shown in Figure 1B, and the large-scale language model possessed by the local LLM processing unit 10028 may be used instead of the large-scale language model possessed by the large-scale language model server 19001. In this case, in the above description of Example 2, the large-scale language model possessed by the large-scale language model server 19001 should be read as the large-scale language model possessed by the local LLM processing unit 10028 of the character conversation device (artificial intelligence response output device 10010).

[0126] In this case as well, the sense of incongruity the user feels from conversations with the character displayed on the artificial intelligence response output device 10010 can be reduced. Furthermore, if the large-scale language model of the local LLM processing unit 10028 is used instead of the large-scale language model of the large-scale language model server 19001, the need to consider usage fees based on the number of processing tokens will decrease. However, even with the large-scale language model of the local LLM processing unit 10028, reducing the number of processing tokens can reduce the consumption of resources such as power required for inference. In this case, a character conversation service with lower power consumption can be provided to the user.

[0127] In the above description of Example 2, an example was described in which the conversation history with the character is recorded and stored in the storage unit 1170 of the character conversation device (artificial intelligence response output device 10010). Alternatively, the conversation history with the character may be recorded and stored in a second server 19002 connected to the Internet 19000 or other cloud server. In this case, when a new conversation is started between the user and the character, the character conversation device (artificial intelligence response output device 10010) communicates with the second server 19002 or other cloud server, obtains (downloads) the past conversation history between the character and the user, stores it in the storage unit 1170 or memory 1109 of the character conversation device (artificial intelligence response output device 10010), and uses it to create instruction sentences for the large-scale language model. The specific method of using past conversation history to create instruction sentences for the large-scale language model is as described in the figures of Example 2, so a repeated explanation will be omitted.

[0128] Furthermore, the character conversation device (artificial intelligence response output device 10010) only needs to transmit (upload) the conversation history of the character up to that point to the second server 19002 or other cloud server at predetermined times, such as each time a conversation takes place between the user and the character, or when the conversation between the user and the character ends. In other words, the character conversation device (artificial intelligence response output device 10010) uploads the conversation history with the character to the second server 19002 or other cloud server at predetermined timings, and when the user starts a conversation with the character, the character conversation device (artificial intelligence response output device 10010) downloads the latest conversation history from the second server 19002 or other cloud server and uses it to generate instruction sentences for the large-scale language model. In this way, the character conversation device (artificial intelligence response output device 10010) used by the user the previous day and the character conversation device (artificial intelligence response output device 10010) that the user will use now are different devices capable of displaying the same character. When the user and the same character between these different devices converse multiple times at different times, it is possible to realize a conversation that appears as if the character's memory has been pseudo-carried over from the previous conversation, which is more preferable for the user.

[0129] The process described above, in which the character conversation device (artificial intelligence response output device 10010) uploads and downloads the conversation history with a character to a second server 19002 or other cloud server to simulate the transfer of the character's memory, is also effective when dealing with a database 19200 containing the conversation histories of multiple characters, as described in Figures 2H and 2I. In other words, by configuring the database 19200 described in Figure 2I to be uploaded and downloaded to a second server 19002 or other cloud server, it is possible to achieve a conversation that simulates the transfer of each character's memory from the previous conversation, not only for one character but for multiple characters, when the user converses with each character multiple times at different times between different devices, between different characters, which is more preferable for the user.

[0130] <Example 3> Next, Embodiment 3 of the present invention is an improvement on the character conversation device (artificial intelligence response output device 10010) and character conversation system described in the figures of Embodiment 2. In this embodiment, the differences from Embodiment 2 will be explained, and repeating explanations of configurations similar to those in these embodiments will be omitted.

[0131] Similar to Example 2, the character in Example 3 can provide the user with the services of a large-scale language model, which is an artificial intelligence, and can assist the user. Therefore, the character can become an artificial intelligence (AI) assistant for the user. In this case, the character conversation device or character conversation system in this embodiment may also be called an AI assistant conversation device, an AI assistant display device, an AI assistant response output device, an AI assistant conversation system, an AI assistant display system, or an AI assistant response output system.

[0132] An example of a character conversation device and character conversation system according to Embodiment 3 of the present invention will be described using Figure 3A. In the character conversation system of Embodiment 3, a large-scale language model server 20001 is provided instead of the large-scale language model server 19001 in Figure 2A, and it is connected to the Internet 19000.

[0133] Here, the large-scale language model server 20001 is a server equipped with a large-scale language model artificial intelligence, but it is a multimodal large-scale language model artificial intelligence that can process not only natural language text information, which could be processed by the large-scale language model server 19001, but also other types of information besides natural language text information.

[0134] Furthermore, the artificial intelligence response output device 10010, which is a character conversation device, will be described as having the same configuration as the character conversation device (artificial intelligence response output device 10010) of Embodiment 2, as an example.

[0135] In Embodiment 3, the artificial intelligence response output device 10010, which is a character conversation device, can communicate with the large-scale language model of the large-scale language model server 20001 via the internet 19000 using an API.

[0136] The character conversation system in Example 3 includes a mobile information processing terminal 20010 used by user 230. The mobile information processing terminal 20010 is a so-called smartphone or tablet information processing terminal.

[0137] Here, an example of a mobile information processing terminal 20010 will be described using Figure 3B. The mobile information processing terminal 20010 includes a display panel 20011 which is a touch operation input panel, a control unit 20012, an external power input interface 20013, a power supply 20014, a secondary battery 20015, a storage unit 20016, a video control unit 20017, a posture sensor 20018, a communication unit 20020, an audio output unit 20021, a microphone 20022, a video signal input unit 20023, an audio signal input unit 20024, an imaging unit 20025, and the like.

[0138] The display panel 20011 is equipped with a touch input sensor and can accept touch input from the user 230's finger. The display panel 20011 displays images using a liquid crystal panel or an organic EL panel and can display images. The display panel 20011 may also be called a display unit.

[0139] The communication unit 20020 can be configured with a Wi-Fi communication interface, a Bluetooth communication interface, or a mobile communication interface such as 4G or 5G. Using these communication methods, the communication unit 20020 of the mobile information processing terminal 20010 can communicate with the communication unit 1132 of the character conversation device (artificial intelligence response output device 10010). The mobile information processing terminal 20010 is equipped with a control unit such as a CPU and memory, and the control unit controls the display panel 20011 and the communication unit 20020. Furthermore, the communication unit 20020 can communicate with a communication device 19011 connected to the Internet 19000 using one of the communication methods of the communication unit 20020. As a result, the mobile information processing terminal 20010 can communicate with various servers connected to the Internet 19000.

[0140] Power supply 20014 converts AC current input from an external source via the external power input interface 20013 into DC current and supplies the necessary DC current to each part of the mobile information processing terminal 20010. The secondary battery 20015 stores the power supplied by power supply 20014. In addition, the secondary battery 20015 supplies power to each part that requires power via the external power input interface 20013 when external power is not supplied.

[0141] The video signal input section 20023 receives video data by connecting an external video output device. Various digital video input interfaces are possible for the video signal input section 20023. For example, it can be configured with an HDMI (High-Definition Multimedia Interface) standard video input interface, a DVI (Digital Visual Interface) standard video input interface, or a DisplayPort standard video input interface. Alternatively, analog video input interfaces such as analog RGB or composite video may be provided. The video signal input section 20023 may also use various USB interfaces.

[0142] The audio signal input unit 20024 receives audio data by connecting an external audio output device. The audio signal input unit 20024 may be configured as an HDMI audio input interface, an optical digital terminal interface, or a coaxial digital terminal interface, etc. The audio signal input unit 20024 may also be various USB interfaces, etc. In the case of an HDMI interface, the video signal input unit 20023 and the audio signal input unit 20024 may be configured as an interface with integrated terminals and cables.

[0143] The audio output unit 20021 is capable of outputting audio based on audio data input to the audio signal input unit 20024. The audio output unit 20021 is also capable of outputting audio based on audio data stored in the storage unit 20016. The audio output unit 20021 may be configured as a speaker. In addition, the audio output unit 20021 may output built-in operation sounds or error warning sounds. Alternatively, the audio output unit 20021 may be configured to output as a digital signal to an external device, such as the Audio Return Channel function specified in the HDMI standard.

[0144] Microphone 20022 is a microphone that picks up sounds from the surrounding area of the mobile information processing terminal 20010, converts them into signals, and generates audio signals. The microphone may be configured to record human voices, such as the user's voice, and the control unit 20012, described later, may perform speech recognition processing on the generated audio signal to obtain text information from the audio signal.

[0145] The imaging unit 20025 is a camera having an image sensor. The camera may be provided on the front of the display panel 20011 side of the mobile information processing terminal 20010, or on the back of the display panel 20011 side. Both a front camera and a rear camera may be provided. In this embodiment, the imaging unit 20025 will be described as having both a front camera and a rear camera.

[0146] The storage unit 20016 is a storage device that records various types of information, such as video data, image data, and audio data. The storage unit 20016 may be composed of a magnetic recording medium such as a hard disk drive (HDD) or a semiconductor memory such as a solid-state drive (SSD). For example, the storage unit 20016 may have various types of information, such as video data, image data, and audio data, pre-recorded in it at the time of product shipment. The storage unit 20016 may also record various types of information, such as video data, image data, and audio data, acquired from external devices or external servers via the communication unit 20020. The video data, image data, etc., recorded in the storage unit 20016 are output to the display panel 20011. The video data, image data, etc., recorded in the storage unit 20016 may also be output to external devices or external servers via the communication unit 20020.

[0147] The video control unit 20017 performs various controls related to the video signals input to the display panel 20011. The video control unit 20017 may also be called a video processing circuit and may be composed of hardware such as an ASIC, FPGA, or video processor. The video control unit 20017 may also be called a video processing unit or image processing unit. For example, the video control unit 20017 controls video switching, such as determining which video signal to input to the display panel 20011 from among the video signals to be stored in memory 20026 and the video signals (video data) input to the video signal input unit 20023. The video control unit 20017 may also perform control to perform image processing on the video signals input from the video signal input unit 20023 and the video signals to be stored in memory 20026. Examples of image processing include scaling processing, such as enlarging, reducing, and transforming images; brightness adjustment processing, which changes the brightness; contrast adjustment processing, which changes the contrast curve of an image; and retinex processing, which decomposes an image into its light components and changes the weighting of each component.

[0148] The attitude sensor 20018 is a sensor composed of a gravity sensor, an acceleration sensor, or a combination thereof, and can detect the attitude of the mobile information processing terminal 20010. Based on the attitude detection result of the attitude sensor 20018, the control unit 20012 may control the operation of each connected part.

[0149] The non-volatile memory 20027 stores various data used by the mobile information processing terminal 20010. The data stored in the non-volatile memory 20027 includes, for example, data for various operations displayed on the display panel 20011 of the mobile information processing terminal 20010, display icons, data and layout information for objects used by the user. Memory 20026 stores video data and device control data displayed on the display panel 20011. The control unit 20012 may read various software from the storage unit 20016, expand it into memory 20026, and store it there.

[0150] The control unit 20012 controls the operation of each connected component. The control unit 20012 may also work in cooperation with a program stored in memory 20026 to perform calculations based on information acquired from each component within the mobile information processing terminal 20010.

[0151] Next, an example of the operation of the character conversation device (artificial intelligence response output device 10010) of Embodiment 3 of the present invention will be described using Figure 3C. This can also be described as an example of the operation of a character conversation system including the artificial intelligence response output device 10010 and the large-scale language model server 20001. In Embodiment 3 as well, the character conversation device (artificial intelligence response output device 10010) loads the character operation program stored in the storage unit 1170 or the like into the memory 1109, and the control unit 1110 executes the character operation program, thereby enabling the various processes described below to be realized.

[0152] In Example 2, the actions performed by User 230 to the character conversation device (artificial intelligence response output device 10010) were mainly through User 230's voice. In Example 2, the character conversation device (artificial intelligence response output device 10010) performed a series of operations starting with the process of picking up User 230's voice with a microphone. In contrast, the character conversation device (artificial intelligence response output device 10010) in Example 3 is also capable of performing the series of operations described in Example 2, starting with the process of picking up User 230's voice with a microphone. In addition, in the character conversation device (artificial intelligence response output device 10010) in Example 3, User 230 can perform actions to the character conversation device (artificial intelligence response output device 10010) through user operation via the operation input unit 1107 in Figure 1B. Here, an example of the operation input unit 1107 in Figure 1B is a mouse, keyboard, touch panel, etc.

[0153] Furthermore, in the character conversation device (artificial intelligence response output device 10010) of Embodiment 3, the user 230 can perform an action on the character conversation device (artificial intelligence response output device 10010) by touch operation detected by the user's touch operation input sensor on the display unit 10011 in Figure 1B.

[0154] Furthermore, user 230 can also input user 230's operation input to the character conversation device (artificial intelligence response output device 10010) by operating the mobile information processing terminal 20010 and communicating from the mobile information processing terminal 20010 to the character conversation device (artificial intelligence response output device 10010).

[0155] Alternatively, the display panel 20011 of the mobile information processing terminal 20010 may display an information-storage image, such as a two-dimensional code containing information that the user wants to convey to the character conversation device (artificial intelligence response output device 10010), and the imaging unit 1180 of the character conversation device (artificial intelligence response output device 10010) may capture this display. The control unit 1110 of the character conversation device (artificial intelligence response output device 10010) may extract information from the information-storage image, such as a two-dimensional code, captured by the imaging unit 1180, and obtain the information. Alternatively, the display panel 20011 of the mobile information processing terminal 20010 may display an image that the user wants to convey to the character conversation device (artificial intelligence response output device 10010), and the imaging unit 1180 of the character conversation device (artificial intelligence response output device 10010) may capture this display. The control unit 1110 of the character conversation device (artificial intelligence response output device 10010) may perform image recognition processing on the image captured by the imaging unit 1180 and obtain the result of said image recognition processing.

[0156] Thus, in the character conversation device (artificial intelligence response output device 10010) of Example 3, the types of actions that the user 230 can perform on the character conversation device (artificial intelligence response output device 10010) are greater than those of the character conversation device (artificial intelligence response output device 10010) described in Example 2. As a result, the character conversation device (artificial intelligence response output device 10010) of Example 3 can acquire the results of actions performed by the user 230 other than the user's voice, and generate an instruction sentence (prompt) to send to the large-scale language model server 20001 based on that. This makes it possible to more favorably include types of information other than natural language text information extracted from the user's voice in the instruction sentence sent to the large-scale language model server 20001. Examples of types of information other than natural language text information extracted from the user's voice include images, videos, and audio.

[0157] Next, the character conversation device (artificial intelligence response output device 10010) of this embodiment sends instruction texts to the large-scale language model server 20001 using an API. In this embodiment as well, instruction texts may be metadata containing information written using notation such as markup format of a markup language using tags, notation such as Markdown format using predetermined symbols, or object notation of a predetermined script such as JSON. In this embodiment as well, there are two types of instruction texts: setting instruction texts that store instructions such as initial settings, and user instruction texts that reflect instructions from the user. Type identification information that identifies whether an instruction text is a setting instruction text or a user instruction text may be stored in a part of the instruction text other than the main message. In this case, the instruction text includes natural language text information as the main message. Furthermore, in this embodiment, in addition to natural language text information, the main message of the instruction text may include non-natural language information sources such as images, videos, or audio as a type of information other than natural language text information. A specific method for including non-natural language information sources in instruction texts will be described later.

[0158] The large-scale language model server 20001 in this embodiment has a multimodal large-scale language model that can process non-natural language information sources in conjunction with natural language text information. The large-scale language model server 20001 receives an instruction sentence from a character conversation device (artificial intelligence response output device 10010). Based on the instruction sentence, the multimodal large-scale language model performs inference and generates a response that includes natural language text information as a result of the inference. Here, since the artificial intelligence of the large-scale language model server 20001 is a multimodal large-scale language model, the response can include non-natural language information sources such as images, videos, or audio in addition to natural language text information.

[0159] The character conversation device (artificial intelligence response output device 10010) receives a response from the large-scale language model server 20001 and extracts natural language text information and non-natural language information sources such as images, videos, or audio stored as the main message in the response. The character operation program of the character conversation device (artificial intelligence response output device 10010) may use speech synthesis technology to generate natural language audio as a response to the user based on the natural language text information extracted from the aforementioned response, and output it from the audio output unit 1140, which is a speaker, so that it sounds as if it were the voice of the character 19051 displayed on the display screen.

[0160] Furthermore, the character operation program of the character conversation device (artificial intelligence response output device 10010) may display natural language characters that serve as a response to the user on the display screen of the character conversation device (artificial intelligence response output device 10010), based on the natural language text information extracted from the aforementioned response. In this case, the characters may be displayed together with character 19051, superimposed on the image of character 19051, or displayed in place of the image of character 19051. The video control unit 1160 may perform these specific processes.

[0161] Furthermore, the character operation program of the character conversation device (artificial intelligence response output device 10010) may display an image on the display screen of the character conversation device (artificial intelligence response output device 10010) in order to present it to the user, based on the image information of the non-natural language information source extracted from the aforementioned response. In this case, the image may be displayed together with character 19051, superimposed on the image of character 19051, or displayed in place of the image of character 19051. These specific processes can be executed by the image control unit 1160.

[0162] Furthermore, the character operation program of the character conversation device (artificial intelligence response output device 10010) may display the video information of the non-natural language information source extracted from the aforementioned response on the display screen of the character conversation device (artificial intelligence response output device 10010) in order to present it to the user. In this case, the video may be displayed together with character 19051, superimposed on the video of character 19051, or displayed in place of the video of character 19051. These specific processes can be executed by the video control unit 1160.

[0163] Furthermore, the character operation program of the character conversation device (artificial intelligence response output device 10010) may output speech generated based on the speech information of the non-natural language information source extracted from the aforementioned response from the speech output unit 1140, which is a speaker.

[0164] As described above, with the character conversation device (artificial intelligence response output device 10010) shown in Figure 3C, or the character conversation system including the character conversation device (artificial intelligence response output device 10010) and the large-scale language model server 20001, it is not necessary to install the large-scale language model itself, which requires a massive amount of data and computing resources for training, within the character conversation device (artificial intelligence response output device 10010). Furthermore, the advanced natural language processing and non-natural language information processing capabilities of the multimodal large-scale language model can be utilized via an API. In addition to responses based on natural language text, responses based on non-natural language information sources can be provided in response to user actions towards the character, enabling more appropriate conversations.

[0165] Next, an example of the operation of the character conversation device (artificial intelligence response output device 10010) of Embodiment 3 of the present invention will be described using Figure 3D. This can also be described as an example of the operation of a character conversation system including the artificial intelligence response output device 10010 and the large-scale language model server 20001. Specifically, Figure 3D shows an example of the natural language text and non-natural language information source such as an image of the main message of the instruction sent from the character conversation device (artificial intelligence response output device 10010) to the large-scale language model server 20001, and an example of the natural language text and non-natural language information source such as an image of the main message of the server response that is the response to it. In this embodiment, images, videos, audio, etc. can be used as non-natural language information sources, but Figure 3D shows an example of an image as a non-natural language information source.

[0166] Furthermore, Figure 3D shows the exchange of instructions and responses in chronological order, from the first round of setting instructions and user instructions and their responses to the second round of user instructions and their responses. Here, the instructions and responses shown in Figure 3D include non-natural language information sources 20061 and 20062, which were not shown in Figure 2D of Example 2. In the example of Figure 3D, both non-natural language information sources 20061 and 20062 are images.

[0167] In Figure 3D, for the sake of simplicity, an image of the non-natural language information source 20061 is shown embedded within the instruction text. However, there are multiple methods for transmitting or specifying the data of the non-natural language information source 20061 in the instruction text sent from the character conversation device (artificial intelligence response output device 10010) to the large-scale language model server 20001. The character conversation device (artificial intelligence response output device 10010) can use any one of these methods, or switch between them. An example of each method will be explained below.

[0168] The first method for transmitting or specifying non-natural language information source data in an instruction is used, for example, when the non-natural language information source to be specified is located on a server or other location connected to a network such as the Internet. A specific example of the first method is to specify a non-natural language information source file located on a network such as the Internet using information such as tags and symbols within the instruction, along with the network location information (so-called URL, etc.) and file name.

[0169] For example, a tag used to specify an image in a markup language. <img src=""****”"> By using this tag and writing the location and filename information of the image file in the **** part, you can specify an image that exists on a network such as the internet. Alternatively, you can use a tag that specifies a video in a markup language. <video src=""****”">You can also specify a video that exists on a network such as the internet by using the **** part and writing the location information and file name information of the video file. Alternatively, you can use a tag that specifies audio in a markup language. <audio src=""****”">You can specify audio files located on a network such as the internet by using the format and writing the location and filename information of the audio file in the **** section. Alternatively, if using JSON notation, you can specify images located on a network such as the internet by preparing a key such as img_src and writing the location and filename information of the image file as the value. For video and audio files, you just need to prepare the respective keys and values. The example given is just one example, and you may use other proprietary formats. In any case, the information specifying the location and filename information of the non-natural language information source file should be stored in the instruction statement.

[0170] As in the first method, when information specifying the location and filename of a non-natural language information source file is stored in the instruction statement, the instruction statement itself does not need to store the data of the non-natural language information source file. Therefore, the amount of data in the instruction statement can be reduced. In the first method, the large-scale language model server 20001 that receives an instruction statement specifying non-natural language information source data can use the location and filename information of the non-natural language information source file stored in the instruction statement to obtain the non-natural language information source file located on a server or other location connected to a network such as the Internet.

[0171] Here, we will explain how location information and file name information are input when the character conversation device (artificial intelligence response output device 10010) specifies non-natural language information source data in an instruction sentence using the first method. In Figure 3C, we have explained that in this embodiment, the types of actions that the user 230 can perform on the character conversation device (artificial intelligence response output device 10010) have increased compared to Embodiment 2, in addition to the voice of the user 230. Therefore, for example, the user 230 may input location information such as a URL for specifying non-natural language information source data, file name information, etc., through user operation (e.g., mouse, keyboard, touch panel) via the operation input unit 1107 in Figure 1B.

[0172] Furthermore, in the character conversation device (artificial intelligence response output device 10010), the control unit 1110 may work with the memory 1109 to execute a web browser program and display the GUI of the web browser program on the display screen of the character conversation device (artificial intelligence response output device 10010). User operations on the GUI of the web browser program may be received via the operation input unit 1107 (for example, mouse, keyboard, touch panel) or by user touch operations detectable by the touch operation input sensor of the display unit 10011, and non-natural language information source data such as images, videos, and audio selected on the browser screen of the web browser program may be used as the data to be specified in the instruction statement. In this case, the web browser program should acquire the location information and file name information of the non-natural language information source data and pass it to the character operation program.

[0173] Alternatively, user 230 may operate the mobile information processing terminal 20010 to communicate with the character conversation device (artificial intelligence response output device 10010) and input location information such as a URL for specifying non-natural language information source data into the character conversation device (artificial intelligence response output device 10010). Alternatively, as explained in Figure 3C, location information such as a URL for specifying non-natural language information source data, file name information, etc., may be input by displaying an information-storing image such as a two-dimensional code on the display panel 20011 of the mobile information processing terminal 20010, performing image recognition processing on the image captured by the imaging unit 1180 of the character conversation device (artificial intelligence response output device 10010), and obtaining the result of the image recognition processing.

[0174] Furthermore, the use of the first method for transmitting or specifying non-natural language information source data in an instruction is not limited to cases where the non-natural language information source file already exists on a server or other location connected to a network such as the Internet. For example, if it is desired to include non-natural language information source data such as images, videos, and audio stored in the storage unit 1170 of the character conversation device (artificial intelligence response output device 10010) in an instruction, the character conversation device (artificial intelligence response output device 10010) may upload the non-natural language information source data to a second server 19002 via the Internet 19000 and include the Internet location information (so-called URL, etc.) and file name of the uploaded non-natural language information source data on the second server 19002 in the instruction. In this case, the second server 19002 functions as a so-called intermediate server.

[0175] Similarly, if it is desired to include non-natural language information source data such as images, videos, and audio stored in the storage unit 20016 of the mobile information processing terminal 20010 in the instruction text, the mobile information processing terminal 20010 may upload the non-natural language information source data to the second server 19002 via the internet 19000. The mobile information processing terminal 20010 or the second server 19002 may transmit the internet location information (so-called URL, etc.) and file name of the non-natural language information source data on the second server 19002 to the character conversation device (artificial intelligence response output device 10010), and the character operation program of the character conversation device (artificial intelligence response output device 10010) may include the acquired internet location information (so-called URL, etc.) and file name of the non-natural language information source data uploaded to the second server 19002 in the instruction text.

[0176] Furthermore, the character operation program of the character conversation device (artificial intelligence response output device 10010) may work in cooperation with the memory 1109 and the storage unit 1170 to construct a media server within the character conversation device (artificial intelligence response output device 10010) that can be accessed from other servers via the internet 19000. In this case, when the character conversation device (artificial intelligence response output device 10010) specifies non-natural language information source data in an instruction statement using the first method, it only needs to store in the instruction statement location information on the internet (such as a URL) indicating the media server constructed within the character conversation device (artificial intelligence response output device 10010) itself, and the file name of the corresponding non-natural language information source data.

[0177] Next, a second method for specifying the transmission or designation of non-natural language information source data in an instruction is, for example, simply to store (attach) the non-natural language information source data itself in the instruction (prompt) and send it. Generally, non-natural language information source data such as images, videos, and audio are larger in data size than natural language text information. Therefore, in this case, the data size of the instruction (prompt) itself will be larger than in the first method. The character operation program of the character conversation device (artificial intelligence response output device 10010) can store the non-natural language information source data that it wants to store (attach) in the instruction (prompt) in memory 1109, and when sending the instruction (prompt), it can store (attach) the data in the instruction (prompt) via the communication unit 1132 and output it to the large-scale language model server 20001. The non-natural language information source data that the character operation program of the character conversation device (artificial intelligence response output device 10010) stores in memory 1109 may be acquired by the communication unit 1132 via the internet 19000, acquired by the communication unit 1132 from the mobile information processing terminal 20010, or read from the storage unit 1170 and stored in memory 1109.

[0178] As described above, the character conversation device (artificial intelligence response output device 10010) is capable of transmitting or specifying non-natural language information source data using instruction sentences.

[0179] The large-scale language model server 20001 is a multimodal large-scale language model that can process non-natural language information sources in conjunction with natural language text information. As shown in the example in Figure 3D, through the first round of user instructions, it can acquire images of a swimming pool and poolside, which are non-natural language information sources 20061, and natural language text information. As a result of this inference, it can output natural language text information as shown in the figure, in response to the first round of user instructions.

[0180] Furthermore, since the large-scale language model server 20001 is a multimodal large-scale language model that can process non-natural language information sources together with natural language text information, as shown in the example in Figure 3D, in the response to the second round of user instructions, the large-scale language model server 20001 can include the non-natural language information source 20062 generated by the inference of the multimodal large-scale language model in its response and send it to the character conversation device (artificial intelligence response output device 10010). In Figure 3D, the non-natural language information source 20062 is an example of an image with a circle added to the image of a swimming pool and poolside, which is the non-natural language information source 20061. Note that the non-natural language information source 20062 stored in the response is not limited to the image shown in Figure 3D, but may also be a video or audio.

[0181] When the response from the large-scale language model server 20001 includes non-natural language information sources other than natural language text information, the method can be the first method or a method similar to the second method used by the character conversation device (artificial intelligence response output device 10010) to transmit or specify non-natural language information source data in the instruction statement.

[0182] Specifically, in a method similar to the first method described above, the large-scale language model server 20001 may store information specifying the location and file name of the non-natural language information source file in the instruction statement in its response. The non-natural language information source 20062 itself, such as images, videos, and audio, may be kept by the large-scale language model server 20001, or it may be transferred to and kept by the second server 19002, which functions as an intermediate server. In either case, the large-scale language model server 20001 may store information specifying the location and file name of the non-natural language information source file in the instruction statement in its response. The character conversation device (artificial intelligence response output device 10010) that has received the response may use the location and file name information of the non-natural language information source file described in the instruction statement to access the large-scale language model server 20001 or the second server 19002 to obtain the non-natural language information source 20062.

[0183] Furthermore, specifically, as a method similar to the second method described above, the large-scale language model server 20001 may store (attach) the file data of the non-natural language information source 20062 itself in the response and send it to the character conversation device (artificial intelligence response output device 10010). The character conversation device (artificial intelligence response output device 10010) can acquire the data of the non-natural language information source 20062 stored (attached) in the instruction and use it for various outputs to the user 230.

[0184] As described above using Figure 3D, the operation of the character conversation device (artificial intelligence response output device 10010) and character conversation system of Embodiment 3 involves the transmission and reception of instruction sentences and responses between the character displayed on the character conversation device (artificial intelligence response output device 10010) and the user 230, enabling conversation using non-natural language information such as images, videos, and audio. This makes it possible to achieve more sophisticated and natural conversations, as shown in each message in Figure 3D.

[0185] Next, using Figure 3E, an example of the operation of the character conversation device (artificial intelligence response output device 10010) of Embodiment 3 of the present invention will be described. This can also be described as an example of the operation of a character conversation system including the artificial intelligence response output device 10010 and the large-scale language model server 20001. Specifically, Figure 3E is an example of the main message of the instruction sent from the artificial intelligence response output device 10010 to the large-scale language model server 20001, which forms the basis of the conversation between the character 19051 displayed on the artificial intelligence response output device 10010 and the user 230, and the main message of the server response that is the response.

[0186] Figure 3E shows an example of a new conversation that takes place after the series of conversations shown in Figure 3D has ended, when user 230 speaks to character 19051 again. In the example in Figure 3E, no processing using the conversation history is performed, as explained in Figures 2F, 2G, and 2I of Example 2. Therefore, Figure 3E, like Figure 2E of Example 2, shows a response in which the name of the large-scale language model itself, the role to be played, the characteristics of the conversation, the user's name, and the conversation history, which were included in the setting instruction, are not remembered at all.

[0187] Next, using Figure 3F, an example of the operation of the character conversation device (artificial intelligence response output device 10010) of Embodiment 3 of the present invention will be described. This can also be described as an example of the operation of a character conversation system including the artificial intelligence response output device 10010 and the large-scale language model server 20001. Specifically, Figure 3F is an example of the main message of the instruction sent from the artificial intelligence response output device 10010 to the large-scale language model server 20001, which forms the basis of the conversation between the character 19051 displayed on the artificial intelligence response output device 10010 and the user 230, and the main message of the server response that is the response.

[0188] Figure 3F shows an example of a case where, after the series of conversations shown in Figure 3D has ended, user 230 speaks to character 19051 again to initiate a new conversation. Here, in Figure 3F, the method of storing a message explaining the history of past conversations in the setting instruction statement, as explained in Figure 2F of Embodiment 2, is also applied to the character conversation device (artificial intelligence response output device 10010) of Embodiment 3. Specifically, the message that constitutes the content of the setting instruction statement in Figure 3D is stored as a reset message in Figure 3F, and following the reset message, a message explaining the history of past conversations is stored as a conversation history message.

[0189] The large-scale language model server 20001 in Example 3 is a multimodal large-scale language model that can process non-natural language information sources together with natural language text information. Therefore, in past instructions and responses, non-natural language information source data may have been transmitted or specified. Accordingly, in the example in Figure 3F, the conversation history message reflects not only the natural language text information in past instructions and responses, but also the transmission or specification of non-natural language information source data in past instructions and responses. The specific method of transmitting or specifying non-natural language information source data in the instructions in Figure 3F is the same as the transmission or specification of non-natural language information source data as explained in Figure 3D, so a repeated explanation will be omitted.

[0190] In the example in Figure 3D, the method of transmitting or specifying non-natural language source data can involve either storing (attaching) the non-natural language source data itself to the instruction statement, or not storing (attaching) the non-natural language source data to the instruction statement. The same applies to the instruction statement in Figure 3F.

[0191] Next, using Figure 3G, an example of the operation of the character conversation device (artificial intelligence response output device 10010) of Embodiment 3 of the present invention will be described. This can also be described as an example of the operation of a character conversation system including the artificial intelligence response output device 10010 and the large-scale language model server 20001. Specifically, Figure 3G is an example of the main message of the instruction sent from the artificial intelligence response output device 10010 to the large-scale language model server 20001, which forms the basis of the conversation between the character 19051 displayed on the artificial intelligence response output device 10010 and the user 230, and the main message of the server response that is the response.

[0192] Figure 3G shows an example of a series of conversations in the same conversation as shown in Figure 3F, specifically the first round of user instructions and their responses, followed by the third round of user instructions and their responses. In Figure 3G, the exchange of instructions and responses is shown chronologically. The content of the setting instructions is the same as shown in Figure 3F, so repeated descriptions are omitted.

[0193] As explained above, even when using the large-scale language model server 20001 which has a multimodal large-scale language model capable of processing non-natural language information sources together with natural language text information as in Example 3, even if user 230 speaks to character 19051 again to start a new conversation after a series of conversations has ended, if the setting instruction sentence generation process and transmission process shown in Figure 3F are performed, the subsequent user instruction sentence response will reflect the settings and conversation history of the character at the time of the previous conversation, such as the character's role, name, conversational characteristics, personality, and / or conversational characteristics, as shown in Figure 3G. This is preferable because it allows the user to perceive a greater degree of consistency in the settings and memories of the character's role, name, conversational characteristics, or personality at the time of the previous conversation.

[0194] Next, using Figure 3H, an example of the operation of the character conversation device (artificial intelligence response output device 10010) of Embodiment 3 of the present invention will be described. This can also be described as an example of the operation of a character conversation system including the artificial intelligence response output device 10010 and the large-scale language model server 20001. Specifically, Figure 3H is an explanatory diagram of the database 20200 for managing the character settings and character conversation history for multiple characters displayed on the display unit 10011 of the character conversation device (artificial intelligence response output device 10010). Here, Figure 3H uses the example described in Figure 2H of Embodiment 2 for the settings of multiple characters displayed on the display unit 10011 of the character conversation device (artificial intelligence response output device 10010). Therefore, repeated explanations of the settings of multiple characters will be omitted.

[0195] Furthermore, the database 20200, shown in Figure 3H, which manages character settings and character conversation history, has the same format as the database 19200 shown in Figure 2I of Example 2. In Figure 3H, only the differences from the database 19200 shown in Figure 2I will be explained. In addition, the contents of the character "Koto" in the database will be explained, and the contents of other characters will be omitted.

[0196] As described above, the large-scale language model server 20001 of Example 3 is a multimodal large-scale language model that can process non-natural language information sources in addition to natural language text information. Therefore, both the instruction sentences from the character conversation device (artificial intelligence response output device 10010) and the responses from the large-scale language model server 20001 include not only natural language text information but also the transmission or specification of non-natural language information source data. Accordingly, in the database 20200 shown in Figure 3H, the conversation history data records not only the natural language text information contained in these instruction sentences and responses but also the information on the transmission or specification of non-natural language information source data. The specific method of transmitting or specifying non-natural language information source data in the recording of the conversation history is the same as the specification of transmission or specification of non-natural language information source data described in Figure 3D, so a repeated explanation is omitted.

[0197] In the example in Figure 3D, there are two methods for transmitting or specifying non-natural language source data: one where the non-natural language source data itself is stored (attached) to the instruction, and another where the non-natural language source data is not stored (attached) to the instruction. The same applies to the conversation history in Figure 3H. However, in the conversation history in Figure 3H, if the method for specifying non-natural language source data involves specifying the location information and file name information of a non-natural language source file on a server located on a network such as the Internet (the second server 19002 that functions as an intermediate server or other cloud server), there is a possibility that the non-natural language source file on that server may be deleted if the conversation history period becomes long. In that case, it may become impossible to retrieve the non-natural language source file at a later date using the location information and file name information, potentially resulting in the loss of conversation record information.

[0198] To prevent this, when the character conversation device (artificial intelligence response output device 10010) converts instruction and response messages into a conversation history and records them, it can obtain the non-natural language information source file specified in the instruction and response from a server on the network using its location information and file name information, and store it in the storage unit 1170. Furthermore, the character operation program of the character conversation device (artificial intelligence response output device 10010) can rewrite the location information and file name of the non-natural language information source file to location information on the internet (so-called URL, etc.) indicating the media server of the media server built within the character conversation device (artificial intelligence response output device 10010), and then record it in the conversation record. In this way, unless the character conversation device (artificial intelligence response output device 10010) itself deletes the non-natural language information source file from the storage unit 1170, the non-natural language information source will not be lost from the conversation record information, making it more suitable for preserving the conversation record.

[0199] Using the database shown in Figure 3H described above, even when the character conversation device (artificial intelligence response output device 10010) is configured to switch between displaying multiple character candidates on the display unit 10011, the user experiences less discomfort in conversations with each character, can share memories with each of the multiple characters, and can obtain the effect shown in Figure 2I of Embodiment 2, resulting in a more enjoyable character conversation experience. Furthermore, this effect can also be achieved when the large-scale language model server 20001 is a multimodal large-scale language model that can process non-natural language information sources together with natural language text information.

[0200] In the character conversation device (artificial intelligence response output device 10010) or character conversation system of Example 3, a multimodal large-scale language model artificial intelligence is used in the large-scale language model server 20001, which is capable of processing not only natural language text information but also non-natural language information other than natural language text information.

[0201] Here, communication between the character conversation device (artificial intelligence response output device 10010) and the large-scale language model server 20001 is conducted using an API. In a multimodal large-scale language model, it is possible that API usage fees may be charged based on the amount of data from non-natural language information sources, in addition to the number of natural language text information units called tokens that are used to divide sentences.

[0202] Therefore, in order to provide the character conversation service using the character conversation system according to this embodiment to users at a lower cost, the following modifications may be used.

[0203] In the first variation, the conversation history record in the database shown in Figure 3H also records information about the transmission or specification of non-natural language information source data. However, the character and the user exchange conversations about the natural language information source data in natural language text information, and the content of these conversations is recorded in natural language text information. Therefore, even if the recording of information about the transmission or specification of the natural language information source data is omitted in the conversation history record in the database shown in Figure 3H, the conversation about the natural language information source data itself will still be recorded to some extent as natural language text information. Thus, if a certain degree of information reduction is acceptable, the recording of information about the transmission or specification of the natural language information source data may be omitted in the conversation history record in the database shown in Figure 3H. In this case, the information about the transmission or specification of the natural language information source data is also omitted from the conversation history message of the setting instruction in Figure 3F. This makes it possible to reduce the amount of data of non-natural language information sources communicated using the API.

[0204] Next, as a second variation, in the recording of the conversation history in the database of Figure 3H, instead of recording information on the transmission or specification of non-natural language information source data, natural language text information describing the content of the non-natural language information source data is recorded. The natural language text information describing the content of the non-natural language information source data may be obtained, for example, by initiating a conversation between the large-scale language model of the large-scale language model server 20001 and the character conversation device (artificial intelligence response output device 10010), separate from the conversation as a character, and having the large-scale language model server 20001 describe the content of the non-natural language information source data with a predetermined character limit. Alternatively, the content of the non-natural language information source data may be obtained by having a conversation with another large-scale language model on another server, which can be used at a lower cost than the large-scale language model of the large-scale language model server 20001, and having it describe with a predetermined character limit. Furthermore, if alternative text data is prepared from the time of acquisition of the non-natural language information source data, that alternative text data may be used as the natural language text information describing the content of the non-natural language information source data. A specific example of alternative text data for non-natural language information source data is the tags of a markup language. <img src=""”alt="****""> , <video src=""”alt="****""> 、 <audio src=""”alt="****"">This is text information that is written in the **** section, etc.

[0205] Furthermore, if using JSON format notation, an object can be stored that associates the location information and filename information of the non-natural language source data, which are key-value pairs indicating the location information of the non-natural language source data, with a key corresponding to the alternative text and a value that is the alternative text data itself.

[0206] In this case as well, the recording of information regarding the transmission or specification of the natural language information source data can be omitted in the conversation history recording of the database in Figure 3H, and the information regarding the transmission or specification of the natural language information source data is also omitted from the conversation history message of the setting instruction in Figure 3F. This makes it possible to reduce the amount of data of non-natural language information sources communicated using the API.

[0207] Next, as a third variation, at the point of the first round of user instructions in Figure 3D, information on the transmission or specification of non-natural language information source data is not stored in the user instructions, but is replaced with natural language text information that describes the content of the non-natural language information source data. For example, in the first round of user instructions in Figure 3D, the information on the transmission or specification of non-natural language information source data 20061 can be replaced with natural language text information such as, "This image is of a swimming pool with a seat and parasol by the poolside. There is water in the swimming pool. There are drinks on the table next to the seat." In this case, the description may be obtained by having a conversation with another large-scale language model on another server, which can be used at a lower cost than the large-scale language model of the large-scale language model server 20001, to describe the content of the non-natural language information source data with a predetermined character limit. Alternatively, the description may be obtained from a server of various other services that can obtain an overview or description of the content of non-natural language information source data such as images, videos, and audio. Furthermore, if alternative text data is available at the time of acquisition of non-natural language source data, this alternative text data may be used as natural language text information that explains the content of the non-natural language source data.

[0208] Next, an example of a display example of the character conversation device (artificial intelligence response output device 10010) of Embodiment 3 of the present invention will be described using Figure 3I. The example in Figure 3I shows an example of displaying the response from the large-scale language model to the user instruction sentences described in Figures 3A to 3H on the display unit 10011 of the character conversation device (artificial intelligence response output device 10010). Specifically, this is an example of displaying the text 10063 of the natural language information source data, the image 10064 of the non-natural language information source data, and / or the video 10065 of the non-natural language information source data, which are the response from the large-scale language model, together with the video of the character 19051 on the display unit 10011. The text 10063, image 10064, and / or video 10065, which are the response from the large-scale language model, may be displayed superimposed in front of the video of the character 19051, as shown in Figure 3I.

[0209] Furthermore, the text 10063, image 10064, and / or video 10065, which are responses from the large-scale language model, may be displayed together with the video of character 19051 without being superimposed on it. The display in Figure 3I is just one example, but for example, if user 230 adjusts the volume of the audio output of the audio output unit 1140 of the character conversation device (artificial intelligence response output device 10010) to the minimum or sets the audio output to OFF by operating via the touch operation input sensor of the operation input unit 1107 or the display unit 10011, user 230 will not be able to confirm the response from the large-scale language model by voice. In this case, the control unit 1110 may control the system to start a display mode in which the text 10063, image 10064, and / or video 10065, which are responses from the large-scale language model, are displayed together with the video of character 19051, as shown in Figure 3I.

[0210] In this way, even when the user 230 wishes to minimize voice output, the user 230 can more conveniently use the character conversation device (artificial intelligence response output device 10010). Furthermore, the user 230 may manually switch ON / OFF the display mode, which displays the text 10063, image 10064, and / or video 10065, which are responses from the large-scale language model, along with the image of the character 19051, via the operation input unit 1107 or the touch operation input sensor of the display unit 10011. As shown in the display example in Figure 3I, it becomes possible to more conveniently output responses from the large-scale language model in a multimodal character conversation device (artificial intelligence response output device 10010).

[0211] As described above, the character conversation device and character conversation system according to Example 3 offer users a more advanced conversational experience that includes non-natural language information in addition to natural language information, by using a multimodal large-scale language model, in addition to the effects of the character conversation device and character conversation system according to Example 2. Furthermore, the character conversation device and character conversation system according to Example 3 can provide character conversation services to users at a lower cost.

[0212] In the above description of Example 3, an example was described in which the large-scale language model possessed by the large-scale language model server 20001 is used as the large-scale language model. In contrast, the character conversation device (artificial intelligence response output device 10010) may be equipped with the local LLM processing unit 10028 shown in Figure 1B, and the multimodal large-scale language model possessed by the local LLM processing unit 10028 may be used. In this case, the multimodal large-scale language model possessed by the local LLM processing unit 10028 may be used instead of the multimodal large-scale language model possessed by the large-scale language model server 20001.

[0213] In this case, in the above description of Example 3, the multimodal large-scale language model possessed by the large-scale language model server 20001 can be replaced with the multimodal large-scale language model possessed by the local LLM processing unit 10028 of the character conversation device (artificial intelligence response output device 10010). In this case as well, by using a multimodal large-scale language model, it is possible to provide users with a more advanced conversation experience that includes non-natural language information in addition to natural language information. When using the multimodal large-scale language model possessed by the local LLM processing unit 10028 instead of the multimodal large-scale language model possessed by the large-scale language model server 20001, the need to consider usage fees based on the number of processing tokens and the amount of data of non-natural language information sources is reduced. However, even with the multimodal large-scale language model possessed by the local LLM processing unit 10028, it is possible to reduce the consumption of resources such as power for inference by reducing the number of processing tokens and the amount of data of non-natural language information sources. In this case, it is possible to provide users with a character conversation service that consumes less power.

[0214] Furthermore, the configuration described in Example 2, which involves uploading and downloading the conversation history with a character and the database data containing the conversation history to and from the second server 19002 or another cloud server, can also be used in the example using a multimodal large-scale language model described in Example 3. In this case as well, when a user interacts with one or more characters at different times across different devices, it is possible to achieve a conversation that appears as if the memories of each character have been artificially carried over from the previous conversation, which is more preferable for the user.

[0215] <Example 4> Next, Embodiment 4 of the present invention is an improvement on the artificial intelligence response output device 10010, character conversation device, or system described in the figures of Embodiment 2 or Embodiment 3. In this embodiment, the differences from Embodiment 2 or Embodiment 3 will be explained, and repeating explanations of configurations similar to those in those embodiments will be omitted.

[0216] Similar to the embodiments described above, the artificial intelligence response output device 10010 may also be referred to as an artificial intelligence response output device, an AI assistant device, an AI assistant display device, or an artificial intelligence interface device. The system including the artificial intelligence response output device 10010 and the large-scale language model server may also be referred to as an artificial intelligence response output system, an AI assistant system, an AI assistant display system, or an artificial intelligence interface system.

[0217] Using Figure 4A, an example of operation using a database in the character conversation device (artificial intelligence response output device 10010) of Embodiment 4 of the present invention will be explained. The database of Embodiment 4 shown in Figure 4A is an extension of the database described in Figure 2I or Figure 3I. Specifically, the database shown in Figure 4A assumes a case where multiple different users use the same character conversation device (artificial intelligence response output device 10010) or the same character conversation system, and stores initial setting instructions and conversation history corresponding to each user and character in the database.

[0218] In the example in Figure 4A, for User 1, whose user ID is 1, the initial setup instructions and conversation history for each of the characters—Koto (character ID 1), Tom (character ID 2), and Necco (character ID 3)—are stored. In addition, for User 2, whose user ID is 2, and User 3, whose user ID is 3, the initial setup instructions and conversation history for each of the characters—Koto (character ID 1), Tom (character ID 2), and Necco (character ID 3)—are also stored.

[0219] These initial setup instructions and conversation history data are stored as separate data in different areas for each user-character combination. In Figure 4A, for illustrative purposes, the data stored in each area is denoted as data 11, 12, 13, 21, 22, 23, 31, 32, and 33. The control unit 1110 of the character conversation device (artificial intelligence response output device 10010) uses the initial setup instructions and conversation history stored in different areas for each user-character combination, based on the user currently using (logged into) the character conversation device (artificial intelligence response output device 10010) or its system, thereby enabling it to more effectively maintain the consistency of the character's personality and the continuity of its memory for each different user.

[0220] Specifically, consider a scenario where User 1 has already conversed with character Tom using a character conversation device (artificial intelligence response output device 10010), and User 2 is unaware of that conversation, and then User 2 subsequently converses with character Tom. In this case, if the artificial intelligence response output device 10010 uses an initial setting instruction statement or a conversation history database that does not identify the user, the response output from the artificial intelligence response output device 10010 may be based on a conversation history that User 2 does not remember, potentially leading to inconsistencies in the conversation between User 2 and the character of the artificial intelligence response output device 10010.

[0221] In contrast, even in a similar situation, using the database shown in Figure 4A, the control unit 1110 of the character conversation device (artificial intelligence response output device 10010) identifies the user by ID, stores the initial setting instructions and conversation history in a different area for each user, and uses the initial setting instructions and conversation history stored in the different areas for each user to generate the artificial intelligence response. As a result, the initial setting instructions and conversation history used to generate the artificial intelligence response for each user are based on the user's operations or conversation history and are managed separately from the operations or conversation history of other users. This makes it possible to better maintain consistency in the conversation history between each user and each character of the artificial intelligence response output device 10010.

[0222] Furthermore, the database of initial setting instructions and / or conversation history, as explained in Figure 4A, may be stored in the storage unit 1170 of the artificial intelligence response output device 10010 and used by the control unit 1110. However, it is not limited to this, and the database of initial setting instructions and / or conversation history may also be stored on a server on the network. For example, if the artificial intelligence response output device 10010 uses the large-scale language model of the large-scale language model server 19001 or the multimodal large-scale language model of the large-scale language model server 20001 in generating artificial intelligence responses, the database of initial setting instructions and / or conversation history, as explained in Figure 4A, may be stored on these servers themselves. In this way, the process of re-inserting the initial setting instructions and conversation history into the instructions and sending them from the artificial intelligence response output device 10010 to these servers can be omitted, and the number of transmission tokens for using the large-scale language model can be reduced.

[0223] When storing a database of initial setting instructions and / or conversation history, as explained in Figure 4A, the AI response output device 10010 should send the user ID, character ID, and user instructions for subsequent conversations to these servers. The large-scale language models on these servers use the user ID and character ID obtained from the AI response output device 10010 to retrieve the corresponding initial setting instructions and conversation history from the database of initial setting instructions and / or conversation history shown in Figure 4A. The large-scale language models on these servers should then perform inference using the initial setting instructions and conversation history, along with the user instructions for subsequent conversations sent from the AI response output device 10010, generate an AI response, and send it to the AI response output device 10010. In this way, the effect of more favorably maintaining the consistency of character personality and memory continuity for each different user can be obtained while saving the number of tokens sent for the use of the large-scale language models.

[0224] Next, using Figure 4B, an example of operation using a database in the character conversation device (artificial intelligence response output device 10010) of Embodiment 4 of the present invention will be described. The database in Embodiment 4 shown in Figure 4B is an extension of the database described in Figure 1C or Figure 2L. Specifically, the database shown in Figure 4B assumes a case where multiple different users use the same character conversation device (artificial intelligence response output device 10010) or the same character conversation system, and stores data of standard response phrases corresponding to each user and character in the database.

[0225] In the example in Figure 4B, for user 1 (user ID 1), the system stores the standard response data for each of the following characters: character Koto (character ID 1), character Tom (character ID 2), and character Necco (character ID 3). In addition, for user 2 (user ID 2) and user 3 (user ID 3), the system also stores the standard response data for each of the following characters: character Koto (character ID 1), character Tom (character ID 2), and character Necco (character ID 3).

[0226] These standard response phrases are stored as separate data in different areas for each user-character combination. In Figure 4B, for illustrative purposes, the data stored in each area is denoted as standard response phrase data 101, 102, 103, 201, 202, 203, 301, 302, and 303. For example, standard response phrase data 101 is stored as a database, such as a table, corresponding to the standard response phrases corresponding to condition numbers 1-7 for character 1: Koto, as shown in Figure 2L. Data 201 in Figure 4B is stored as a database, such as a table, corresponding to the standard response phrases corresponding to condition numbers 1-7 for character 2: Tom, as shown in Figure 2L.

[0227] Data 301 in Figure 4B is stored as a database, such as a table, corresponding to the standard response phrases corresponding to condition numbers 1 to 7 for character 3: Necco shown in Figure 2L. Data 102, 202, and 302 in Figure 4B store standard response phrases modified for user 2 in a similar format. Data 103, 203, and 303 in Figure 4B store standard response phrases modified for user 3 in a similar format. The control unit 1110 of the character conversation device (artificial intelligence response output device 10010) uses standard response phrase data stored in different areas for each combination of user and character, based on the user currently using (logged into) the character conversation device (artificial intelligence response output device 10010) or its system.

[0228] In this way, even with the same character, it becomes possible to provide responses using different predefined response templates for each user. In other words, even with the same character, it may be preferable to change the content of the predefined response template depending on the relationship between the character and the user. For example, depending on the relationship between the character's set age and the user's age registered in the artificial intelligence response output device 10010 or the system, the user may be older, the same age, or younger than the character. In this case, changing the content of the character's predefined response template for older users, users of the same age, and users younger will result in a more appropriate or natural conversation between the user and the character. In other words, by performing the operation using the database shown in Figure 4B, it is possible to create a more appropriate or natural conversation by varying the content of the predefined response template according to the relationship between the character and the user.

[0229] As described above, the response template database (response template DB) shown in Figure 4B is stored in the storage unit 1170, and the control unit 1110 of the artificial intelligence response output device 10010 can use it. However, the response template database (response template DB) shown in Figure 4B may also be provided on the large-scale language model server 19001 side or the large-scale language model server 20001 side. In this case, the control unit of the large-scale language model server 19001 or the control unit of the large-scale language model server 20001 should generate responses using the response template database (response template DB). The control unit of the large-scale language model server 19001 or the control unit of the large-scale language model server 20001 should send the response generated using the response template database (response template DB) to the artificial intelligence response output device 10010 instead of the response generated by the large-scale language model stored in their respective servers. In this way, even if the artificial intelligence response output device 10010 is not equipped with a response template database (response template DB), it becomes possible to generate responses using a response template database (response template DB).

[0230] As described above, the character conversation device and character conversation system according to Embodiment 4 make it possible to create more suitable or natural conversations depending on the relationship between the character and the user, as well as the conversation history.

[0231] <Example 5> Next, Embodiment 5 of the present invention is an improvement on the artificial intelligence response output device 10010 or artificial intelligence response output system described in Figures 1, 2, and 3 of Embodiment 1. Specifically, this is an example of switching the response generation process of the artificial intelligence response output device 10010 from response generation processing using a large-scale language model on the network to response generation processing using a local large-scale language model (such as the local LLM processing unit 10028) provided by the artificial intelligence response output device 10010, or response generation processing using a response template database. In this embodiment, the differences from these embodiments will be explained, and repeated explanations of configurations similar to those embodiments will be omitted.

[0232] Similar to the embodiments described above, the artificial intelligence response output device 10010 may also be referred to as an artificial intelligence response output device, a character conversation device, an AI assistant device, an AI assistant display device, or an artificial intelligence interface device. The system including the artificial intelligence response output device 10010 and the large-scale language model server may also be referred to as an artificial intelligence response output system, a character conversation system, an AI assistant system, an AI assistant display system, or an artificial intelligence interface system.

[0233] Using Figure 5A, an example of the switching process for response generation in the artificial intelligence response output device 10010 of Embodiment 5 of the present invention will be explained. The table in Figure 5A shows examples of the switching process for response generation in the artificial intelligence response output device 10010, from Example 1 to Example 9. In the table in Figure 5A, the "Switching Overview" column shows an overview of the switching process for each example. The "State Before Switching of LLM (API-Connected LLM) on the Network" column shows the state before the response generation process by the large-scale language models on the network (large-scale language models connected using APIs), such as the large-scale language model provided by the large-scale language model server 19001 in Figure 1 and the multimodal large-scale language model provided by the large-scale language model server 20001, is switched to another response generation process. The "Switching Occurrence Conditions" column shows the conditions under which the switching process for response generation occurs. The column "Switching Destination from Network LLM (API-connected LLM)" indicates the switching destination for the response generation process of the artificial intelligence response output device 10010, switching from large-scale language models on the network (large-scale language models connected using APIs), such as the large-scale language model provided by the large-scale language model server 19001 and the multimodal large-scale language model provided by the large-scale language model server 20001. The control unit 1110 of the artificial intelligence response output device 10010 should control the system to switch to the large-scale language model, database, or corresponding indicated in "Switching Destination from Network LLM (API-connected LLM)" when the conditions indicated in "Switching Occurrence Conditions" occur in the state shown in "State Before Switching from Network LLM (API-connected LLM)" in Figure 5A.

[0234] The following describes each example shown in the table in Figure 5A. Example 1 is an example of switching depending on the network connectivity status of the artificial intelligence response output device 10010, as shown in the "Switching Overview". In Example 1, the "State before switching of the LLM (API-connected LLM) on the network" indicates that the network connectivity status of the artificial intelligence response output device 10010 is in a connectable state. Here, in Example 1, the "Condition for switching" is indicated as "When network connectivity becomes impossible". That is, this means when network connectivity between the artificial intelligence response output device 10010 and the large-scale language model on the network (large-scale language model connected using an API) becomes impossible. Specifically, this connectivity failure may be due to a communication failure on the connection path from the artificial intelligence response output device 10010 to the Internet 19000. Alternatively, this connectivity failure may be due to a communication failure on the Internet 19000. Alternatively, this connectivity failure may be due to the large-scale language model on the network (large-scale language model connected using an API) itself being unable to connect to the Internet 19000. Furthermore, in Example 1, "local LLM" is indicated as the "switching destination from the network-based LLM (API-connected LLM)." Specifically, this means switching to the response generation process performed by the local LLM processing unit 10028 of the artificial intelligence response output device 10010. In other words, in Example 1, even if for some reason connection to the large-scale language model on the network (a large-scale language model connected using an API) becomes impossible and the response generation process by the large-scale language model on the network (a large-scale language model connected using an API) becomes unavailable, the response generation process is switched to the local LLM processing unit 10028 of the artificial intelligence response output device 10010. This makes it possible to continue the response generation process using the large-scale language model, although there may be performance differences between the large-scale language models.

[0235] Next, let's explain Example 2 in Figure 5A. In Example 2, the "switching destination from the LLM on the network (API-connected LLM)" in Example 1 is changed from "local LLM" to "response template DB (database)". The response generation process using this "response template DB (database)" is the same as the process explained in Figure 1C, Figure 2L, or Figure 4B, so a repeated explanation will be omitted. In other words, in Example 2, if for some reason it becomes impossible to connect to the large-scale language model on the network (large-scale language model connected using an API), and the response generation process using the large-scale language model on the network (large-scale language model connected using an API) is unavailable, it is possible to switch to the response generation process using the response template database, thereby generating a response with a simpler process and outputting that response to the user.

[0236] Next, let's explain Example 3 in Figure 5A. Example 3 is a variation of Example 1 in which the "switching destination from the network LLM (API-connected LLM)" is changed from "local LLM" to "non-response handling". This "non-response handling" means that even if a user input requests a response from the large-scale language model via the touch panel, microphone 1139, or operation input unit 1107, no response is generated for this input, or even if a user input requests a response from the large-scale language model, no response is output for it. In other words, Example 3 makes it possible to simplify the handling of cases where, for some reason, connection to the network large-scale language model (large-scale language model connected using an API) becomes impossible, and the response generation process by the network large-scale language model (large-scale language model connected using an API) is unavailable.

[0237] Next, we will explain Example 4 in Figure 5A. As shown in the "Switching Overview," Example 4 is an example of switching due to a response delay of the LLM on the network. In Example 4, the "State before switching of the LLM on the network (API-connected LLM)" is shown as a state in which a response from the LLM on the network is obtained within a predetermined time. Here, in Example 4, the "Condition for switching" is shown as when a response from the LLM on the network is not obtained within a predetermined time and exceeds the predetermined time. Also, in Example 4, the "Destination for switching from the LLM on the network (API-connected LLM)" is shown as the "Local LLM." The destination "Local LLM" is the same as in Example 1, so we will omit the repeated explanation. In other words, in Example 4, even if for some reason the response from the LLM on the network (Large-Scale Language Model connected using an API) exceeds a predetermined time and the response generation process by the LLM on the network (Large-Scale Language Model connected using an API) cannot be used smoothly, the response generation process by the local LLM processing unit 10028 of the artificial intelligence response output device 10010 is switched to. This means that, despite performance differences among large-scale language models, it is possible to continue using large-scale language models for response generation.

[0238] Next, let's explain Example 5 in Figure 5A. Example 2 is the same as Example 4, but with the "switching destination from the LLM on the network (API-connected LLM)" changed from "local LLM" to "response template DB (database)". The response generation process using this "response template DB (database)" is the same as the process explained in Figure 1C, Figure 2L, or Figure 4B, so a repeated explanation will be omitted. In other words, in Example 5, if for some reason the response from the LLM on the network (large-scale language model connected using an API) exceeds a predetermined time, and the response generation process using the LLM on the network (large-scale language model connected using an API) cannot be used smoothly, it is possible to switch to a response generation process using the response template database, thereby generating a response with a simpler process and outputting that response to the user.

[0239] Next, we will explain Examples 6 to 9 in Figure 5A. As shown in the "Switching Overview," Examples 6 to 9 are examples of switching due to reaching the upper limit of API usage or usage fees. Here, as explained in Example 2, providers of large-scale language models often recover the costs used to train the large-scale language model from the user of the device as API usage fees. In this case, natural language models often charge API usage fees based on the number of tokens, which are units of words that divide sentences. Here, various billing and limiting methods can be considered for API usage fees. One example is to define the upper limit of the amount of large-scale language model usage service that a user can receive under normal circumstances using the number of tokens processed.

[0240] In this case, users can access services utilizing large-scale language models at a predetermined API usage fee until they reach their usage limit (or corresponding usage fee). Once they reach their usage limit (or corresponding usage fee), certain restrictions may be imposed, such as being unable to access services utilizing large-scale language models at their normal performance or frequency.

[0241] Examples 6 to 9 in Figure 5A illustrate the switching control of the response generation process by the control unit 1110 of the artificial intelligence response output device 10010 when such limitations occur in the large-scale language model utilization service. Specifically, in Example 6, the "state before switching of LLM on the network (API-connected LLM)" is a state in which the API usage amount and API usage fees have not reached a predetermined upper limit. This means that the usage amount of LLM on the network (API-connected LLM) has not reached a predetermined upper limit. At this time, the user can use LLM on the network (API-connected LLM) in the normal state.

[0242] In Example 6, the "condition for switching" is shown as when the API usage or API usage fee reaches a predetermined limit. This means when the usage of the LLM on the network (API-connected LLM) reaches a predetermined limit. Also, in Example 6, the "destination for switching from the LLM on the network (API-connected LLM)" is shown as a second LLM on a different network from the LLM that was being used under normal circumstances (which may be called the first LLM). An example of a second LLM on the network is an LLM that is cheaper than the first LLM that was being used under normal circumstances. Since it is a lower-priced service, the performance of the second LLM is likely to be lower than that of the first LLM. Even in this case, there is still a significant advantage if a large-scale language model can be used cheaply even after the usage / usage fee limit of the first LLM has been reached.

[0243] Next, let's explain Example 7 in Figure 5A. In Example 7, the "switching destination from the network-based LLM (API-connected LLM)" in Example 6 is changed from a second network-based LLM (which may be called the first LLM) that was normally used to a "local LLM". In Example 7, even if the API usage or API usage fees reach a predetermined limit, that is, even if the usage of the network-based LLM (API-connected LLM) reaches a predetermined limit, it is possible to continue using the large-scale language model for response generation by switching to a response generation process using a local LLM, which is not subject to restrictions based on the network-based LLM usage, API usage, or API usage fees.

[0244] Next, Example 8 in FIG. 5A will be described. Example 8 is obtained by changing the "switching destination from the LLM on the network (API-connected LLM)" in Example 7 from "local LLM" to "response template sentence DB (database)". Since the response generation process by the said "response template sentence DB (database)" is the same as the process described in FIG. 1C, FIG. 2L, or FIG. 4B, repeated explanation will be omitted. In Example 8, even when the usage amount of the API or the usage fee of the API reaches a predetermined upper limit, that is, even when the usage amount of the LLM (API-connected LLM) on the network reaches a predetermined upper limit, it switches to the response generation process using the response template sentence database without being restricted by the usage amount of the LLM on the network, the usage amount of the API, or the usage fee of the API, etc. Thereby, it becomes possible to generate a response by a simpler process and output the response to the user.

[0245] Next, Example 9 in FIG. 5A will be described. Example 9 is obtained by changing the "switching destination from the LLM on the network (API-connected LLM)" in Example 7 from "local LLM" to "non-response handling". The said "non-response handling" means handling that does not generate a response to the user or does not output a response to the user. In Example 9, when the usage amount of the API or the usage fee of the API reaches a predetermined upper limit, that is, when the usage amount of the LLM (API-connected LLM) on the network reaches a predetermined upper limit, it becomes possible to simplify the handling in the case where the response generation process by the large language model on the network (large language model connected using the API) cannot be used.

[0246] According to the switching control of the response generation process of the artificial intelligence response output device 10010 shown in Examples 1 to 9 of FIG. 5A described above, even in a situation where the response generation process by the LLM on the network (large language model connected using the API) cannot be used as usual, more suitable switching or handling according to each situation can be performed.

[0247] Note that the switching control in Examples 1 to 9 in Figure 5A may be performed by combining multiple examples. For example, the switching control in Examples 1 to 3 may be combined with any of the controls in Examples 4 to 9. Similarly, the control in Example 4 or Example 5 may be combined with any of the controls in Examples 1 to 3 or 6 to 9. Similarly, the controls in Examples 6 to 9 may be combined with any of the controls in Examples 1 to 5.

[0248] Next, using Figures 5B to 5D, we will explain an example of the display of an AI assistant or character when the artificial intelligence response output device 10010 of Example 5 is configured as an AI assistant device or a character conversation device.

[0249] First, Figure 5B shows an example of the display of the AI assistant or character in the artificial intelligence response output device 10010 when performing the switching control shown in Example 3 of Figure 5A. In the example in Figure 5B, the display state of the AI assistant or character is changed depending on whether the network connection status of the artificial intelligence response output device 10010 is network-connected or network-unconnected. The states in which the artificial intelligence response output device 10010 is network-connected and network-unconnected have been explained in Figure 5A, so a repeated explanation will not be given.

[0250] In the example in Figure 5B, the artificial intelligence response output device 10010 (1) displays the AI assistant or character in a normal, awake state if a network connection is available, and (2) displays the AI assistant or character in a "sleeping" state if a network connection is unavailable. In the switching control of Example 3 in Figure 5A, if the artificial intelligence response output device 10010 is unable to connect to the network, it will not generate a response or output a response even if the user inputs an instruction. In this case, the user may feel uncomfortable if the AI assistant or character displayed by the artificial intelligence response output device 10010 is in a normal, awake state, but if the AI assistant or character displayed by the artificial intelligence response output device 10010 is in a sleeping state, the user can understand that "the AI assistant or character is not responding because it is sleeping," and the discomfort the user feels can be further reduced.

[0251] In the case of Figure 5B(2), it is desirable that the user understands that "the AI assistant or character is not responding because it is asleep" before the user makes a user input requesting a response from the large-scale language model via the touch panel, microphone 1139, or operation input unit 1107 of the artificial intelligence response output device 10010. Therefore, it is desirable that the start timing of the state in Figure 5B(2) where the AI assistant or character is displayed in a "sleeping" state when network connectivity is unavailable is immediately after the control unit 1110 of the artificial intelligence response output device 10010 determines that network connectivity is unavailable, prior to the user making a user input requesting a response from the large-scale language model.

[0252] Next, as another display example, the display example in Figure 5C will be explained. The display example in Figure 5C is an example in which the display state of the AI assistant or character is changed according to the state of "Switching destination from LLM on the network (API-connected LLM)" in the table in the switching control in Figure 5A. Specifically, Figure 5C shows: (1) an example of the display of the AI assistant or character when the artificial intelligence response output device 10010 is able to connect to a large-scale language model on the network (a large-scale language model connected using an API) and is in a state where response generation processing by the large-scale language model on the network is available (referred to as the normal state in this figure); (2) an example of the display of the AI assistant or character when the artificial intelligence response output device 10010 has switched to response generation processing using an LLM or response template database that is less powerful than the large-scale language model on the network (a large-scale language model connected using an API); and (3) an example of the display of the AI assistant or character when the artificial intelligence response output device 10010 has switched to the non-response response described in Figure 5A.

[0253] In the example in Figure 5C, for example, if (1) the artificial intelligence response output device 10010 is in the "normal state," the artificial intelligence response output device 10010 will display the AI assistant or character in a state where there are no particular problems. Note that "normal state" in Figure 5C can be considered as any state other than states (2) and (3). Also, for example, if (2) the artificial intelligence response output device 10010 switches to response generation processing using an LLM or response template database which is less powerful than the large-scale language model on the network (a large-scale language model connected using an API), the artificial intelligence response output device 10010 will display the AI assistant or character in a "sleepy" state. Note that "displaying the AI assistant or character in a 'sleepy' state" can also be expressed as "displaying a state in which the AI assistant or character is feeling sleepy."

[0254] The response generation process in (2) is less efficient than the response generation process in (1) using a large-scale language model on the network (a large-scale language model connected using an API). Therefore, by displaying the AI assistant or character in a "sleepy" state, it is possible to implicitly inform the user that the AI assistant or character has low response performance. This makes it possible to further reduce the discomfort the user feels with low-performance responses. The switching conditions for the artificial intelligence response output device 10010 to switch to response generation processing using an LLM or a response template database, which is less efficient than the large-scale language model on the network (a large-scale language model connected using an API), are as explained in Figure 5A, so a repeated explanation will be omitted.

[0255] Furthermore, in the case of Figure 5C(2), it is desirable to implicitly inform the user that the AI assistant or character has low response performance before the user input requests a response from the large-scale language model via the touch panel, microphone 1139, or operation input unit 1107 of the artificial intelligence response output device 10010. Therefore, it is desirable that the start timing of the state in Figure 5C(2) where the AI assistant or character is displayed in a "sleepy" state is immediately after the point in time when the artificial intelligence response output device 10010 switches to response generation processing using an LLM or response template database, which has lower performance than the large-scale language model on the network (the large-scale language model connected using an API), before the user input requests a response from the large-scale language model.

[0256] Furthermore, for example, when the artificial intelligence response output device 10010 switches to the non-response mode described in Figure 5A, the artificial intelligence response output device 10010 displays the AI assistant or character in a "sleeping" state. As explained in Figure 5B, by displaying the AI assistant or character displayed by the artificial intelligence response output device 10010 in a "sleeping" state, the user can understand that "the AI assistant or character is not responding because it is sleeping," thereby further reducing the sense of unease the user may feel. The conditions for the artificial intelligence response output device 10010 to switch to the non-response mode described in Figure 5A are as explained in Example 3 or Example 9 in Figure 5A, so a repeated explanation will be omitted. In the case of Figure 5C(3), it is desirable that the user understands that "the AI assistant or character is not responding because it is sleeping" before the user makes user input requesting a response from the large-scale language model via the touch panel, microphone 1139, or operation input unit 1107 of the artificial intelligence response output device 10010. Therefore, the timing of the start of the state in Figure 5C (3) where the AI assistant or character is displayed in a "sleeping" state is preferably immediately after the point in time when the artificial intelligence response output device 10010 switches to the non-response mode described in Figure 5A, prior to the user input requesting a response from the large-scale language model.

[0257] In the display example shown in Figure 5C, the artificial intelligence response output device 10010 implicitly reflects the change in the state of the AI assistant or character as a change in state, without directly providing the user with a technical explanation of the state of the artificial intelligence response output device 10010 regarding the response generation process. This reduces the sense of unease the user may feel compared to directly providing the user with a technical explanation of the state of the artificial intelligence response output device 10010 regarding the response generation process. Furthermore, it reduces the sense of unease the user may feel compared to keeping the display state of the AI assistant or character the same as the normal state, even though the state of the artificial intelligence response output device 10010 regarding the response generation process has changed.

[0258] However, some users may want a more accurate explanation of the technical status in each state. Therefore, an example of a display to accommodate such users will be explained using Figure 5D. The rows for the device status and the display status explanation in the table shown in Figure 5D are exactly the same as those in Figure 5C, so repeated explanations will be omitted. Also, the display example of the AI assistant or character shown in the row for the AI assistant or character display example is almost identical to that in Figure 5C, but differs in that a question mark (?) is displayed in the display example. This question mark (?) is a mark that the user operates when requesting an explanation from the artificial intelligence response output device 10010, and may be called a help mark.

[0259] In the example in Figure 5D, when a user selects the question mark (?) through user operation via the touch panel of the operation input unit 1107 or display unit 10011 in Figure 1B, the display of the AI assistant or character of the artificial intelligence response output device 10010 changes to the display example shown in the row of the display example after user operation. Specifically, regardless of whether the device state is (1), (2), or (3), a technical explanation of the state for each state is displayed. For example, in the example in Figure 5D, if the device state is (1) normal state, a display such as "Normal state" should be shown, explaining that it is a normal state with no particular technical limitations. Also, if the device state is (2) using a low-performance LLM or response template database, a display such as "Low-performance mode" should be shown, technically explaining that it is a low-performance state. This display can also be considered an explanation of the factors that cause the AI assistant or character to display in a "sleepy" state.

[0260] In this case, a more technically detailed explanation may be provided. Specifically, a message such as "Low-performance LLM usage mode" or "Standard response mode" may be displayed. Also, if the device status is (3) unresponsive, a message such as "Network connection unavailable" may be displayed to technically explain the reason for switching to unresponsive mode. If the reason for switching to unresponsive mode is that the response from the LLM (Large-Scale Language Model connected using an API) on the network has exceeded a predetermined time, a message such as "Response from LLM is delayed" may be displayed. Also, if the reason for switching to unresponsive mode is that the usage of LLM on the network, API usage, or API usage fees have reached their limits, a message such as "LLM usage has reached its limit," "API usage has reached its limit," or "API usage fees have reached a predetermined amount" may be displayed. These messages can be considered as explanations of the reasons why the AI assistant or character is displayed in a "sleeping" state.

[0261] As shown in the example display in Figure 5D described above, even if there are technical limitations in the response generation process of the artificial intelligence response output device 10010, by implicitly indicating the status of the device through changes in the display state of the AI assistant or character, rather than directly explaining it to the user, the sense of discomfort the user may feel can be further reduced. This display is more suitable for users who do not need a technical explanation. Furthermore, by displaying an operation mark to explain the technical status, the status of the response generation process in the artificial intelligence response output device 10010 (normal state or state with technical limitations) is displayed to the user who operates the mark. This makes it possible to provide a more suitable display for users who want to know the technical status accurately.

[0262] In the examples of Figures 5B, 5C, and 5D, the "sleeping" state is shown as an example of the display state of the AI assistant or character when it is "unresponsive," but this is just one example, and the embodiments of this example are not limited to this. Instead of the "sleeping" state, other display states that implicitly indicate an unresponsive situation, such as "on break," may be used. Also, in the examples of Figures 5C and 5D, the "sleepy" state is shown as an example of the display state of the AI assistant or character when a low-performance LLM or response template database is being used, but this is just one example, and the embodiments of this example are not limited to this. Other display states that implicitly indicate the low responsiveness of the AI assistant or character, such as "hungry," may be used.

[0263] As described above, the artificial intelligence response output device and artificial intelligence response output system according to Embodiment 5 make it possible to more effectively switch the response generation process used by the artificial intelligence response output device depending on the connection status between the large-scale language model on the network and the artificial intelligence response output device, the response delay status from the large-scale language model on the network, or the amount of utilization of the large-scale language model on the network. Furthermore, when the artificial intelligence response output device according to Embodiment 5 is configured as an AI assistant device or a character conversation device, it becomes possible to display information that is less unnatural to the user.

[0264] <Example 6> Next, Embodiment 6 of the present invention is an improvement on the artificial intelligence response output device 10010 or artificial intelligence response output system described in Figures 1 to 5 of Embodiments. Specifically, this embodiment is an example in which the response generation process of the artificial intelligence response output device 10010 is more preferably combined with a response generation process using a large-scale language model on the network or a local large-scale language model (such as the local LLM processing unit 10028) provided by the artificial intelligence response output device 10010, and a response generation process using a response template database to generate a response output. In this embodiment, the differences from these embodiments will be explained, and repeated explanations of configurations similar to these embodiments will be omitted.

[0265] Similar to the above embodiments, the artificial intelligence response output device 10010 may also be referred to as an artificial intelligence response output device, a character conversation device, an AI assistant device, an AI assistant display device, or an artificial intelligence interface device. A system including the artificial intelligence response output device 10010 and a large language model server may also be referred to as an artificial intelligence response output system, a character conversation system, an AI assistant system, an AI assistant display system, or an artificial intelligence interface system.

[0266] Using FIG. 6, an example of the response generation process in the artificial intelligence response output device 10010 according to Embodiment 6 of the present invention will be described. FIG. 6 shows an example of a flowchart of the response generation process in the artificial intelligence response output device 10010 according to Embodiment 6 of the present invention according to Embodiment 6. Specifically, a time axis on which time progresses from top to bottom, a processing flow, and an example of response output are shown. The output of the response shown in the response output example may be performed via display by the display unit 10011 of the artificial intelligence response output device 10010 or voice output by the voice output unit 1140.

[0267] In the example of FIG. 6, first, at time t0, there is a user input requesting a response from the large language model from the user via the touch panel, microphone 1139, or operation input unit 1107 of the artificial intelligence response output device 10010, and the control unit 1110 of the artificial intelligence response output device 10010 acquires the user input (step 600). Next, at time t1, the control unit 1110 starts preparations for response output using the response template sentence database stored in the storage unit 1170 and starts response output using the response template sentence database (step 601). In the example of FIG. 6, at time t2, response output using the response template sentence database has been started, and as shown in the figure, the template sentence response is being output and has not been completed. "Good morning" in the figure indicates the output up to the middle of the sentence that continues as "Good morning. …".

[0268] At time t3, before the response output using the response template database is completed, the control unit 1110 generates an instruction sentence based on the user input acquired in step 600, and sends the generated instruction sentence to the large-scale language model on the network or to the local large-scale language model (such as the local LLM processing unit 10028) of the artificial intelligence response output device 10010, initiating a request for a response from the large-scale language model (step 602). Furthermore, at time t4, before the response output using the response template database is completed, the control unit 1110 begins acquiring a response from the large-scale language model (step 603).

[0269] At time t5, an example of a response output completed using the response template database is shown. For example, Figure 6 shows an example where, at time t5, the display of the response "Good morning. Today is [Month] [Day]." has been completed using the template text stored in the response template database and date information stored in memory. Here, at time t4, before the completion of the response output using the response template database, the control unit 1110 has already started acquiring responses from the large-scale language model. Therefore, at time t6, following time t5 when the display of the response output using the response template database is completed, the control unit 1110 starts outputting responses from the large-scale language model following the response output using the response template database (step 604). Subsequently, at time t7, the response from the large-scale language model is output following the response output using the response template database. Once the response output from the large-scale language model is completed, the response output according to the processing flow shown in Figure 6 is completed (step 605).

[0270] Next, the effects of the processing flow shown in Figure 6 of the present invention will be explained. Processing large-scale language models requires many computational resources. Generally, even if inference, which requires fewer computational resources than training, is processed using a GPU (Graphics Processing Unit), it may take several seconds to more than ten seconds from the time the control unit starts requesting a response from the large-scale language model until it can obtain a response from the large-scale language model. This period corresponds to the period from time t3 to time t4 shown in Figure 6. Furthermore, from time t0, when user input is received, until time t4, the control unit 1110 has not yet obtained a response output from the large-scale language model, and therefore cannot output a response from the large-scale language model to the user.

[0271] Therefore, in a processing flow that does not include the start of preparation for response output using the response template database and the start of response output using the response template database as shown in step 601 in Figure 6, the user may have to wait for several seconds to more than 10 seconds from the time t0 when the user input is made until time t4 without receiving a response from the artificial intelligence response output device 10010. For example, if the artificial intelligence response output device 10010 is configured as an AI assistant device or a character conversation device, this waiting time may cause discomfort to the user.

[0272] In contrast, in the processing flow according to Embodiment 6 of the present invention shown in Figure 6, the control unit 1110 starts processing response output using a response template database, which requires fewer computational resources than processing the large-scale language model, before starting to acquire a response from the large-scale language model. As a result, the user does not have to wait from time t0 to time t4 without receiving a response from the artificial intelligence response output device 10010. For the user, whether the response output uses a response template database or a response output from a large-scale language model, it is still a response from the artificial intelligence response output device 10010.

[0273] Therefore, in the processing flow shown in Figure 6, by adding step 601 before step 603, the response of the artificial intelligence response output device 10010 to the user can be made to appear faster. This makes it possible to further reduce the user's discomfort caused by the length of waiting time. Furthermore, by outputting the response from the large-scale language model in step 604, following the response using the response template database, the user can perceive these outputs as a series of more natural outputs.

[0274] As described above, the artificial intelligence response output device and artificial intelligence response output system according to Embodiment 6 can reduce the user's waiting time for a response from the artificial intelligence response output device, thereby further reducing the discomfort the user feels.

[0275] <Example 7> Embodiment 7 of the present invention is an improvement on the artificial intelligence response output device 10010 or artificial intelligence response output system described in Figures 1 to 6 of Embodiments. Specifically, it is an example in which the response generation process of the artificial intelligence response output device 10010 is more preferably combined with a response generation process using a large-scale language model on a network or a local large-scale language model (such as the local LLM processing unit 10028) provided by the artificial intelligence response output device 10010 and a response generation process using a response template database to generate a response output. In this embodiment, the differences from these embodiments will be explained, and repeated explanations of configurations similar to these embodiments will be omitted.

[0276] Similar to the embodiments described above, the artificial intelligence response output device 10010 may also be referred to as an artificial intelligence response output device, a character conversation device, an AI assistant device, an AI assistant display device, or an artificial intelligence interface device. The system including the artificial intelligence response output device 10010 and the large-scale language model server may also be referred to as an artificial intelligence response output system, a character conversation system, an AI assistant system, an AI assistant display system, or an artificial intelligence interface system.

[0277] The artificial intelligence response output system according to Embodiment 7 is configured similarly to Embodiment 1 and the like, including an artificial intelligence response output device 10010 and a large-scale language model server 19001 and / or a multimodal large-scale language model server 20001 (see Figure 1A). The artificial intelligence response output device 10010 also includes a display unit 10011, a control unit 1110, a memory 1109, a non-volatile memory 1108, an external power input interface 1111, an operation input unit 1107, a power supply 1106, a secondary battery 1112, a storage unit 1170, a video control unit 1160, a posture sensor 1113, a communication unit 1132, an audio output unit 1140, a microphone 1139, a video signal input unit 1131, an audio signal input unit 1133, an imaging unit 1180, and the like (see Figure 1B).

[0278] Example 7 is an example in which an artificial intelligence response output device 10010 or artificial intelligence response output system with the above configuration performs scene search processing within a video using a large-scale language model and outputs the search results as a response.

[0279] When performing scene searches within a video using a large-scale language model, if the search information used for the search is not appropriate for the instructions (prompts) given to the large-scale language model based on the input information entered by the user, the search may not reflect the user's intent. These instructions can also be described as questions or requests, which are the user's input information. Therefore, in Example 7, when the large-scale language model performs a scene search in the video based on the instructions (input information), the search accuracy is improved by determining the superiority of the search information it references for the search.

[0280] More specifically, prior to performing a scene search within a video using a large-scale language model, a process is executed to obtain search information based on an instruction statement derived from the user's input requesting a scene search against the large-scale language model, and video information, which is information about the video to be searched (hereinafter referred to as the target video). Furthermore, a search accuracy improvement process may be executed to select or refine the search information based on its context. Finally, a scene search within the video is performed using the large-scale language model based on the search information obtained through these processes.

[0281] Figure 7 is a diagram showing an overview of the functional blocks of the artificial intelligence response output system according to Embodiment 7. As shown in Figure 7, the artificial intelligence response output system according to Embodiment 7 comprises a video information acquisition / transmission unit 1201, a video-related information processing unit 1202, a video-related generation information processing unit 1203, a search information processing unit 1204, and a video scene search processing unit 1205.

[0282] The video information acquisition / transmission unit 1201 acquires video information, which is metadata of the target video, from, for example, the storage unit 1170 of the artificial intelligence response output device 10010, and transmits the acquired video information along with the instruction text to the large-scale language model. The video-related information processing unit 1202 performs various processing related to video-related information, such as acquiring video-related information, which is a type of video information. The video-related generation information processing unit 1203 performs various processing related to video-related generation information, such as acquiring video-related generation information, which is a type of video information. The search information processing unit 1204 performs various processing, such as extracting search information, as will be described in detail later, and stores the extracted search information in the storage unit 1170, etc. The in-video scene search processing unit 1205 performs a scene search within the target video using the large-scale language model.

[0283] As an example, the video information acquisition / transmission unit 1201, the video-related information processing unit 1202, and the video-related generation information processing unit 1203 are provided by the control unit 1110 of the artificial intelligence response output device 10010. The search information processing unit 1204 and the in-video scene search processing unit 1205 are provided by the large-scale language model server 19001 having a large-scale language model, the multimodal large-scale language model server 20001, or the local LLM processing unit 10028. Alternatively, the search information processing unit 1204 and the in-video scene search processing unit 1205 may be provided by the large-scale language model server 19001, the multimodal large-scale language model server 20001, and the local LLM processing unit 10028, respectively. If the local LLM processing unit 10028 includes the search information processing unit 1204 and the in-video scene search processing unit 1205, it may further include the video information acquisition / transmission unit 1201, the video-related information processing unit 1202, and the video-related generation information processing unit 1203.

[0284] Furthermore, video information and search information are stored together with the target video in, for example, the storage unit 1170 of the artificial intelligence response output device 10010. However, the target video, video information, and search information may be stored in the storage unit of an external server, such as the large-scale language model server 19001 or the multimodal large-scale language model server 20001. In this case, the external server, such as the large-scale language model server 19001 or the multimodal large-scale language model server 20001, may be equipped with a video information acquisition / transmission unit 1201.

[0285] Here, as illustrated in Figure 8, the video information, which is the information of the target video, includes "video-related information" associated with the target video. This video-related information includes, as shown in Examples 1 to 4, at least one of the following: in-video image information (metadata including content information, etc.), which is image information within the target video; text information such as subtitles linked to this in-video image information; audio information linked to the in-video image information; and additional image information linked to the in-video image information. Additional image information refers to, for example, image information added through editing, etc.

[0286] Furthermore, the video information includes "video-related generated information" that is generated based on the video-related information described above. Video-related generated information is information that is created based on the target video by, for example, a large-scale language model and is linked to the target video. As shown in Examples 1 to 4, the video-related generated information includes at least one of the following: image information generated based on the video-related information, text information generated based on the video-related information, and audio information generated based on the video-related information. In addition, the search information is video information related to input information (instructions) that requests a scene search of the target video from the large-scale language model, and is extracted from the video information stored in the storage unit 1170, etc., as will be described later. Note that although the example shown in Figure 8 describes an example in which video-related information and video-related generated information are included in the video information, the information included in the video information is not particularly limited, and for example, video-related generated information does not necessarily have to be included.

[0287] Figure 9 is a flowchart showing an example of the response output processing flow in the artificial intelligence response output system according to Embodiment 7. For example, when a user views a video stored in the storage unit 1170 of the artificial intelligence response output device 10010, in step S011, the control unit 1110 of the artificial intelligence response output device 10010 first acquires video information of the target video to be viewed by the user. In this example, the video information acquisition / transmission unit 1201 provided in the control unit 1110 acquires video information of the target video.

[0288] Next, when the AI response output device 10010 receives a request from the user for a scene search of the target video, in step S012, the control unit 1110 generates an instruction statement (input information 1) for the large-scale language model based on the user input and sends the generated instruction statement (input information 1) to an external server having a large-scale language model, for example, a large-scale language model server 2001 (step S013). At that time, the video information acquisition / transmission unit 1201 transmits the video information of the target video to the large-scale language model server 2001. In this example, since the target video and its video information are stored in the storage unit 1170, the video information acquisition / transmission unit 1201 transmits the video information of the target video to the large-scale language model server 2001. However, if the target video and its video information are stored in the storage unit of an external server, then in step S013, only the transmission of the instruction statement (input information 1) by the control unit 1110 may occur. Furthermore, if the scene search of the target video is performed by the local large-scale language model (local LLM) provided in the local LLM control unit 10028 of the artificial intelligence response output device 10010, the instruction sentence (input information 1) is sent from the control unit 1110 to the local LLM control unit 10028.

[0289] Next, on the large-scale language model server 2001 side, for example, after the search information processing unit 1204 has performed the acquisition process of instruction text (input information 1), video information, etc. (step S014), the search information processing unit 1204 then executes the search information acquisition process based on the acquired instruction text (input information 1) and video information (step S015). Specifically, as shown in Figure 10 as an example, the search information processing unit 1204 refers to input information 1 (instruction text) and video information as part of the search information acquisition process and makes a determination of video information (search information) related to input information 1. That is, the search information processing unit 1204 determines whether each piece of information contained in the video information is related to input information 1. Then, it acquires the video information that it has determined to be related to input information 1 as search information. In other words, the search information processing unit 1204 executes the process of extracting information related to input information 1 from the video information of the target video as search information.

[0290] As shown in Figure 11 as an example, the search information obtained through the search information acquisition process includes "video-related information" extracted from the video information. This video-related information includes, as shown in Examples 1 to 4, in-video image information related to the input information (input information 1 in this example), text information linked to the in-video image information related to the input information, audio information linked to the in-video image information related to the input information, and additional image information linked to the in-video image information related to the input information. Furthermore, the search information includes "video-related generated information" extracted from the video information. This video-related generated information includes, as shown in Examples 1 to 3, image information generated based on the video-related information related to the input information, text information generated based on the video-related information related to the input information, and audio information generated based on the video-related information related to the input information.

[0291] Furthermore, the following processes may be executed as part of the search information acquisition process in step S015. As illustrated in Figure 10, various reference information for determining video information (search information) related to the input information may be acquired from the storage unit 1170 or an external storage unit, and the video information may be determined based on the acquired reference information. Alternatively, the input information and video information may be converted into text, audio, or images, and the video information may be determined based on that information.

[0292] Returning to the flow in Figure 9, once the search information processing unit 1204 acquires the search information, in step S016, the video scene search processing unit 1205 performs a scene search of the target video using the search information as a response generation process, and generates the search results. The search result information (response generation processing result information) generated by this response generation process is transmitted from the large-scale language model server 2001 to the artificial intelligence response output device 10010 (step S017). When the artificial intelligence response output device 10010 receives the response generation processing result information, the control unit 1110 performs a response reflection process based on the received response generation processing result information (step S018). As part of the response reflection process, the control unit 1110 outputs the search results of the scene search of the target video via the display unit 10011 and the audio output unit 1140.

[0293] Figure 12 shows an example of the display state of the display unit as a user interface according to Embodiment 7. The example shown in Figure 12 is an example in which the user inputs the request statement "Please display the scene showing lightning" as input information 1 along with the target video, and the result of the scene search is displayed on the display unit 10011 through the response reflection process. In the example shown in Figure 12, as a response from the Large-Scale Language Model (LLM), along with the response statement "We will display the scene showing lightning," multiple scenes from the target video that show "lightning" are played sequentially.

[0294] As an example, below the display area of the target video, a mark (so-called playhead) 1301 indicating the playback position of the video and a progress bar (so-called seek bar) 1302 indicating the progress of video playback are displayed. The scenes in the target video that show "thunder," that is, the range (time frame) in which "thunder" is shown, are marked on the progress bar 1302 as scenes s1 to s5. The example in Figure 12 shows the state in which scene s1 of the target video is being played, and the image of the target video displays the subtitle "Oh, it's thunder," which is text information linked to the image information in the video, and the additional image (additional text) "Thunder rumbled!" which is additional image information linked to the image information in the video.

[0295] As described above, in Example 7, when performing a scene search of a video using a large-scale language model in response to a user's request, search information is extracted from the video information and used. This makes it easier to extract scenes that are more suitable to the user's request as a result of the scene search. In other words, the accuracy of search results in scene searches can be improved. Furthermore, by performing a scene search of a video using search information, the processing load on the system during the scene search can be reduced.

[0296] (Variation 1) Figure 13 is a flowchart showing a modified example of the response output processing flow in the artificial intelligence response output system according to Example 7. In the flowchart of Figure 13, the same reference numerals are used for the same steps as in the flowchart of Figure 9, and redundant explanations may be omitted.

[0297] The example shown in Figure 13 is a modified example of the processing performed by the search information processing unit 1204. Specifically, in this example, after the search information acquisition process is performed in step S015, a search accuracy improvement process is executed in step S021, which selects search information based on the context of the search information. In other words, in step S021, a search accuracy improvement process is executed to remove non-searchable information from the search information acquired in the search information acquisition process. As an example, the search information processing unit 1204 selects non-searchable information from the search information based on the context of the search information and executes a search accuracy improvement process to remove the selected non-searchable information from the search information. Then, in step S016, a scene search of the target video is performed using the search information on which the search accuracy improvement process has been performed.

[0298] In the context-based search accuracy improvement process for search information, as shown in Example 1 of Figure 14, for example, information about the word immediately preceding a negative word in the search information is selected as non-search information. In other words, in the search accuracy improvement process, information about the word immediately preceding a negative word is not selected as search information. Also, as shown in Example 2, information about the word immediately preceding a presumptive or interrogative word in the search information is selected as non-search information. In other words, information about the word immediately preceding a presumptive or interrogative word is not selected as search information. Furthermore, as shown in Example 3, information about a word that is cut off midway is selected as non-search information. In other words, information about a word that is cut off midway is not selected as search information.

[0299] As shown in Figure 15 as an example, suppose a user inputs the following information to the artificial intelligence response output device 10010: "Please display a scene showing lightning." At this time, the search information acquisition process may acquire, based on the word "lightning" contained in the input information, text information or video information containing audio information that includes the word "lightning," such as "Oh, lightning... no, it's just a cloud." "Oh, it might be lightning." "Oh, lightning... just a cloud."

[0300] When search accuracy improvement processing is performed on searchable information, the three video information examples shown in Figure 15 are all selected as non-searchable information and excluded from the searchable information. In the case of text information such as "Oh, thunder... no, it's just a cloud," the information about "thunder," which is the word immediately preceding the negative phrase, is considered non-searchable information. Similarly, in the case of text information such as "Oh, it might be thunder," the information about "thunder," which is the word immediately preceding the speculative or questioning word, is considered non-searchable information. Furthermore, in the case of text information such as "Oh, lightning... just a cloud," the phrase "thunder" is considered to be cut off midway, so the information that would make it "thunder" is considered non-searchable information. In other words, information related to the above three text information examples is considered non-searchable information.

[0301] As a result, video information related to definitive text information or audio information such as "Oh, it's thunder." is adopted as search information (see Figure 12). Therefore, the response generation process can more reliably output scenes showing "thunder." In other words, by performing a scene search of the target video using search information that has undergone search accuracy improvement processing, it is possible to output a response content to the user that is more appropriate to the user's input information (instruction).

[0302] Furthermore, as shown in Figure 14 as Example 4, as an example of search accuracy improvement processing based on the context of the search information, information about actions that were interrupted or stopped midway may be selected as non-search information. For example, as shown in Figures 16A and 16B as an example, suppose a user enters a request (input information) that says, "Please display a scene of someone practicing pitching in baseball." If the search accuracy improvement processing is not performed, as processing for this request, for example, information from a video in which the pitching motion has started but has been stopped before the ball is thrown, that is, information from a video in which the pitching practice was interrupted, may be adopted as search information and used in the response generation process. In this case, for example, as shown in Figure 16A, as a response from the Large-Scale Language Model (LLM), along with the response statement, "We will display a scene of someone practicing pitching," scenes s11 to s16 related to pitching practice, including scene s11 of the video in which the pitching motion was stopped, are sequentially played from the target video.

[0303] On the other hand, when the search accuracy improvement process is performed, information from videos where the pitching motion was interrupted (video-related information) is removed from the search information by the subsequent search accuracy improvement process, even if it was acquired as search information by the search information acquisition process. In other words, the search information after the search accuracy improvement process is performed is, for example, information from videos of pitching motions where the ball is thrown without interruption, that is, information from videos where pitching practice is actually performed. In this case, for example, as shown in Figure 16B, the response from the large-scale language model is, "We will display scenes of pitching practice," and scenes s12 to s16 of the target video, in which pitching practice is actually performed, are sequentially played, with the response statement, "We will display scenes of pitching practice."

[0304] In this way, even if the search accuracy improvement process is designed to separate information about interrupted operations as non-searchable information, it becomes easier to output a response to the user that is more appropriate to the user's input information (instructions).

[0305] In the configuration of Embodiment 7 described above, the large-scale language model (external LLM) provided by the large-scale language model server 2001 performs the search information acquisition process, or the search information acquisition process and the search accuracy improvement process, along with the response generation process. However, the response generation process, the search information acquisition process, and the search accuracy improvement process do not necessarily all have to be performed by the external LLM. For example, as shown in Figure 17, the search information acquisition process and the search accuracy improvement process may be performed by the local LLM. In this example, in step S012, when an instruction statement (input information 1) for the large-scale language model is generated based on user input, in step S013, the instruction statement (input information 1) and video information are sent from the control unit 1110 to the local LLM control unit 10028. Next, in step S031, the search information processing unit 1204 provided by the local LLM control unit 10028 performs the acquisition process of the instruction statement (input information 1), video information, etc., and then performs the search information acquisition process described above based on the acquired instruction statement and video information (step S032). Furthermore, in step S033, the search accuracy improvement process described above is executed on the search information obtained through the search information acquisition process. The contents of the search information acquisition process and the search accuracy improvement process are as described above, so their explanation is omitted here.

[0306] After the local LLM completes the process of acquiring search information and improving search accuracy, in step S034, the video information and the search information based on the search accuracy improvement process (search information after the search accuracy improvement process has been performed) are sent from the local LLM control unit 10028 to the large-scale language model server 2001. Subsequently, the large-scale language model server 2001 performs the acquisition process of video information, search information based on the search accuracy improvement process, etc. (step S035), and in step S016, as a response generation process, a video scene search using the large-scale language model is performed as described above.

[0307] (Modification 2) In the above-described modification 1, a search accuracy improvement process based on the context of the search information was explained as an example of a search accuracy improvement process executed by the search information processing unit 1204. However, the search accuracy improvement process executed in steps S021 and S033 is not limited to this. The search accuracy improvement process may, for example, be a process based on a reference priority determination of the search information. In the search accuracy improvement process according to modification 2, video information to be referenced preferentially (priority information) is set from among multiple video information contained in the search information. Then, video information whose content differs from the priority information is selected as non-search information, and the selected non-search information is excluded from the search information. Priority information is information that is preferentially applied as search information in the search accuracy improvement process, or more precisely, information that is preferentially applied in the response generation process. Which information is to be designated as priority information may be set in advance, but it may also be possible to change it arbitrarily, for example, in response to a user's request.

[0308] In the search information acquisition process described as Modification 1, for example, the relationship between in-video image information, text information linked to in-video image information, audio information linked to in-video image information, and additional image information linked to in-video image information and the input information is determined individually. Therefore, even if the video information extracted as search information (video information related to the input information) is in-video image information, the content of the in-video image information, the text information linked to the in-video image information, the audio information linked to the in-video image information, and the additional image information linked to the in-video image information does not necessarily match. Therefore, in the search accuracy improvement process related to Modification 2, the information set as priority information is considered correct information, and even if the information is acquired as search information by the search information acquisition process, information whose content differs from the priority information is selected as non-search information.

[0309] As an example, in the search accuracy improvement process, let's assume that in-video image information related to the input information is designated as priority information (also called priority search information), as shown in Example 1 of Figure 18A. In this case, in the search accuracy improvement process, even if information is acquired as search information in the search information acquisition process, if the content differs from the priority information (in-video image information related to the input information), it will be designated as non-search information. For example, if the additional image information linked to the in-video image information acquired as search information differs in content from the priority information (in-video image information), this additional image information linked to the in-video image information will be selected as non-search information and excluded from the search information.

[0310] As shown in Figure 19 as an example, suppose a user inputs the request statement, "Please display the scene with lightning," along with the target video, to the artificial intelligence response output device 10010. Furthermore, suppose the target video has in-video image information for an image showing "clouds (without lightning)," and that text information and audio information, "Oh, it's lightning," are associated with this in-video image information. In this example, the text information etc. associated with the in-video image information "Oh, it's lightning," is acquired as search information through the search information acquisition process as described above. However, since the acquired text information etc. associated with the in-video image information differs in content from the priority information (in-video image information related to the input information), it is selected as non-search information in the search accuracy improvement process and excluded from the search information. In other words, if the content of the in-video image information differs from the content of the text information and audio information, the response generation process in step S016 performs a scene search based on the in-video image information. As a result, in the scene search during the response generation process, only scenes in which lightning is actually visible are extracted (see Figure 12), and scenes in the video that do not contain lightning, such as the image shown in Figure 19, are not extracted. Furthermore, when this search accuracy improvement process based on prioritizing the reference of search information is performed, when outputting the search results for the scene search of the target video (step S018), an LLM response statement such as "Scenes in which the content of the image information in the same scene in the video differs from the content of the text and audio information were searched based on the image information" may also be output.

[0311] Furthermore, in the search accuracy improvement process, as shown in Example 2 of Figure 18A, additional image information related to the input information is considered priority information. In this case, first, the content of the priority information (additional image information linked to the in-video image information related to the input information) is compared and judged against other information acquired as search information, such as the content of the in-video image information related to the input information. As a result of this judgment, in-video image information related to the input information that differs in content from the priority information is treated as non-search information. If the in-video image information acquired as search information differs in content from the priority information, that in-video image information is selected as non-search information and excluded from the search information. In other words, if the in-video image information acquired as detection information differs in content from the additional image information, the response generation process in step S016 will perform a search based on the additional image information.

[0312] As shown in Figure 20A as an example, suppose a user inputs the following request to the artificial intelligence response output device 10010 along with the target video: "Please display the scene in which a dog is shown." The target video contains an image of an animal (illustration), and for example, image recognition processing determines that the image in the video is an image of a dog, and the video image information contains information indicating that it is an "image of a dog (illustration)." Furthermore, suppose that the information of an additional image (so-called caption) added to the video image, that is, the additional image information linked to the video image information, contains information such as "This illustration is a fox. It is not a dog." In this case, in the search information acquisition process, the video image information and the additional image information linked to the video information are acquired as search information, respectively. However, since the content of the video image information differs from the content of the additional image information (caption content), which is the priority information, the video image information is selected as non-search information and excluded from the search information. In other words, the scene search in the response generation process is performed based on additional image information linked to the in-video image information, which includes content such as "This is not a dog." Therefore, the scene search in the response generation process does not extract scenes containing the in-video image shown in Figure 20A. Note that in-video images that have in-video image information designated as non-searchable information, and similar in-video images, may subsequently be treated as "not images of a dog."

[0313] On the other hand, as shown in Figure 20B as an example, suppose a user inputs the request statement, "Please display the scene in which a dog is shown," along with the target video, as input information to the artificial intelligence response output device 10010. Furthermore, suppose that the target video contains an image (illustration) of an animal, and that, for example, image recognition processing determines that the image in the video is not an image of a dog, and that the information in the video contains information stating that "this is not an image (illustration) of a dog." Also, suppose that the information of an additional image (so-called caption) added to the image in the video, that is, the additional image information linked to the image information in the video, contains information such as "This illustration is a dog. It is not a fox." In this case as well, it is conceivable that the search information acquisition process will acquire both the image information in the video and the additional image information linked to the image information in the video.

[0314] In this process, if in-video image information related to the input information is prioritized as information for improving search accuracy (Example 1 in Figure 18A), the in-video image information is filtered out as non-searchable information. Therefore, the scene containing the in-video image shown in Figure 20B will not be extracted by the scene search in the response generation process. On the other hand, if additional image information linked to the in-video image information related to the input information is prioritized as information for improving search accuracy (Example 2 in Figure 18A), the in-video image information is filtered out as non-searchable information and excluded from the searchable information. Therefore, the scene containing the in-video image shown in Figure 20B will be extracted by the scene search in the response generation process. Note that in-video images containing in-video image information that has been filtered out as non-searchable information, and similar in-video images, may subsequently be treated as "dog images."

[0315] Furthermore, in the search accuracy improvement process, for example, as shown in Example 3 of Figure 18B, in-video image information of non-silhouette images related to the input information may be treated as priority information. In this case, in the search accuracy improvement process, in-video image information of silhouette images related to the input information is treated as non-searchable information because it is different from the in-video image information of non-silhouette images related to the input information (priority information). In other words, among the in-video image information extracted as searchable information, in-video image information of silhouette images is selected as non-searchable information and excluded from the searchable information. To put it another way, a scene search is performed based on the in-video image information in the response generation process in step S016 only when the in-video image information acquired as detection information is in-video image information of non-silhouette images.

[0316] In this context, a silhouette image is an image in which the outline of an object or person can be recognized, but the details of the object cannot. The determination of whether an image is a silhouette image or a non-silhouette image can be made based on arbitrarily set criteria.

[0317] As shown in Figure 21A as an example, suppose a user inputs the following request to the artificial intelligence response output device 10010 along with the target video: "Please display the scene in which a butterfly is shown." Also, suppose the target video contains an image with a butterfly silhouette. In the search information acquisition process, if the video image information of non-silhouette images is given priority, the video image information of a silhouette image, as shown in Figure 21A, is selected as non-searchable information as described above. Therefore, the scene search in the response generation process does not extract scenes containing a video image that is a silhouette image, as shown in Figure 21A.

[0318] On the other hand, as shown in Figure 21B as an example, suppose a user inputs the request statement, "Please display the scene in which a butterfly is shown," along with the target video, as input information to the artificial intelligence response output device 10010. Also, suppose that the target video contains an image that is not a silhouette image of a butterfly. If the video image information of a non-silhouette image is prioritized in the search information acquisition process, the video image information of a non-silhouette image as shown in Figure 21B becomes prioritized information and is retained as search information. Therefore, the scene search in the response generation process appropriately extracts scenes that contain video images that are not silhouette images, as shown in Figure 21B.

[0319] Furthermore, in the search accuracy improvement process, the in-video image information of the silhouette image related to the input information may be treated as non-priority information (also called non-priority search information) rather than non-search information, because it is different from the priority information. In this case, the in-video image information of the silhouette image related to the input information is not excluded from the search information. Then, in the response generation process of the next step S016, a response using priority information and a response using non-priority information are generated.

[0320] More specifically, the response generation process performs a scene search within the target video by referencing the in-video image information (priority information) of non-silhouette images related to the input information and the in-video image information (non-priority information) of silhouette images related to the input image. Then, responses are generated that display the search results differently for the results obtained by referencing the in-video image information (priority information) of non-silhouette images related to the input information and the results obtained by referencing the in-video image information (non-priority information) of silhouette images related to the input information.

[0321] Subsequently, in the response reflection process of step S018, each search result is output to the user in a different display format. For example, when each extracted scene is marked on the progress bar 1302 as described above (see Figure 12, etc.), the display format of the mark (e.g., size, color, shape, pattern, etc.) will be different for scenes searched using priority information and scenes searched using non-priority information.

[0322] Furthermore, in the search accuracy improvement process, for example, as shown in Example 4 of Figure 18B, in-video image information of a predetermined size or larger related to the input information may be treated as priority information. In other words, in-video image information in which the ratio of the target image area (image area of an object related to the input information) to the total image area is greater than or equal to a predetermined value may be treated as priority information. In this case, as part of the search accuracy improvement process, first, the ratio of the size of the target image area of the in-video image information related to the input information to the total image area is determined. Then, in-video image information related to the input information in which the above ratio is less than the predetermined value is selected as non-searchable information and excluded from the searchable information. That is, in this example, in the response generation process in step S016, the scene search of the target video is performed based only on in-video image information related to the input information in which the above ratio is greater than or equal to the predetermined value.

[0323] For example, if a user requests to "show a scene with a bird," and as shown in Figure 22A, the area of the image of the target object, the "bird" (target image area) A2 is relatively small compared to the entire image area A1, and the ratio of the size of the target image area A2 to the entire image area A1 is less than a predetermined value, then the in-video image information for this image is treated as non-searchable information. On the other hand, as shown in Figure 22B, if the area of the target image A2 is relatively large compared to the entire image area A1, and the ratio of the size of the target image area A2 to the entire image area A1 is greater than or equal to a predetermined value, then the in-video image information for this image is treated as priority information and retained as searchable information.

[0324] In this example, the decision of whether or not to prioritize the in-video image information is based on the ratio of the target image area A2 to the entire image area A1, but the decision criteria are not limited to this. For example, as shown by the solid lines in Figures 22A and 22B, if the size of the comparison image area A3 to be compared is specified in advance according to the user's request, the decision of whether or not to prioritize the in-video image information may be based on the ratio of the target image area A2 to the comparison image area A3.

[0325] Furthermore, similar to the case where in-video image information of silhouette images related to input information is treated as non-priority information, in-video image information related to input information where the above ratio is less than a predetermined value (in-video image information that is different from priority information) may be treated as non-priority information instead of non-searchable information.

[0326] Incidentally, the search accuracy improvement process described above basically compares information from the same time period. As shown in Example 1 of Figure 23, the search accuracy improvement process is performed by referring to the search information from the same time period. For example, as shown in Figure 12, when multiple scenes s1 to s5 are extracted by a scene search, the search accuracy improvement process compares the search information within each scene s1 to s5. This is because the content of each piece of information may change depending on the time period, and there is a risk that information from different time periods cannot be properly compared.

[0327] However, if certain conditions are met, the search accuracy improvement process may compare and refer to information from different time periods. As shown in Figure 23 as Example 2, when referring to search information from different time periods, it is determined whether there is information indicating a relationship between the search information from different time periods (for example, information indicating a specific time in a video, information indicating that the image information within each video is similar, etc.). Note that different time periods (time frames) refer to discontinuous time periods, which are time periods separated by a predetermined amount of time or more. If information indicating a relationship between the search information from different time periods is detected, the search accuracy improvement process may be executed by referring to the search information from different time periods. On the other hand, if no information indicating a relationship is detected, the search accuracy improvement process is executed without referring to different time periods.

[0328] As shown in Figure 24A as an example, suppose a user inputs the following request to the artificial intelligence response output device 10010 along with the target video: "Please display the scene in which a cat is shown." In this example, suppose the opening scene s21 of the target video contains an in-video image showing a cat (illustration). However, suppose this cat image (illustration) may not be recognized as a cat by, for example, recognition image processing. Furthermore, suppose the in-video image information for scene s21 is associated with text information and audio information such as "This cat illustration is from 10:00." Also, as shown in Figure 24B as an example, suppose scene s22, approximately 10 minutes after the start of the target video, contains the same in-video image as the cat image displayed in the opening scene s21. However, the in-video image information for scene s22 is not associated with text information or audio information.

[0329] When a user requests a video to "show scenes with cats," and a large-scale language model is used to search for scenes in the video, the opening scene s21 is reliably extracted, but there is a risk that the scene s21 10 minutes later in the video may not be extracted correctly. The opening scene s21 can be extracted correctly based on text information such as "scenes with cats," even if the in-video image information related to the input information does not allow it to be determined that the image (text) is of a "cat." On the other hand, the scene s22 10 minutes later in the video will be extracted based on the in-video image information related to the input information, and therefore there is a risk that it may not be extracted correctly.

[0330] However, in the opening scene s21 of the target video, the text information "This cat illustration is from 10:00" is displayed, making it clear that the image displayed in scene s21, 10 minutes later in the target video, is a cat illustration. Therefore, if information indicating a specific time in the target video is included in the text or audio information linked to the image information within the video, the search accuracy improvement process may compare and reference the information from different time periods. In other words, if information indicating the relationship between scenes at different time periods is included in the text or audio information linked to the image information within the video, the search accuracy improvement process may compare and reference the information from different time periods. This allows for more appropriate scene searching of the target video in the response generation process. For example, in the examples shown in Figures 24A and 24B, in a scene search using search information, the text information linked to the image information within the opening scene s21 of the target video is referenced, and scene s22 of the target video is appropriately extracted.

[0331] <Example 8> Embodiment 8 of the present invention is an improvement on the artificial intelligence response output device 10010 or artificial intelligence response output system described in Embodiment 7. In this embodiment, the differences from Embodiment 7 will be explained, and the same configuration as in Embodiment 7 will not be repeated in the explanation.

[0332] The artificial intelligence response output device 10010 according to Example 8 may be, for example, a device mounted on a vehicle such as an automobile, and may have functions such as a so-called drive recorder. Alternatively, the artificial intelligence response output device 10010 according to Example 8 may be, for example, a mobile terminal device such as a smartphone or tablet.

[0333] As shown in Figure 25, the artificial intelligence response output device 10010 according to Embodiment 8 includes a display unit 10011, a control unit 1110, a memory 1109, a non-volatile memory 1108, an external power input interface 1111, an operation input unit 1107, a power supply 1106, a secondary battery 1112, a storage unit 1170, a video control unit 1160, a posture sensor 1113, a communication unit 1132, an audio output unit 1140, a microphone 1139, a video signal input unit 1131, an audio signal input unit 1133, and an imaging unit 1180. Furthermore, it includes, for example, a positioning sensor 1191 using GPS (Global Positioning System), a timer 1192, a thermosensor 1193 for detecting ambient temperature, and the like.

[0334] In Embodiment 8, the artificial intelligence response output device 10010 or artificial intelligence response output system with the above configuration primarily performs scene search processing within a video using the local large-scale language model of the local LLM processing unit 10028, and outputs the search results as a response. In Embodiment 8, when the local LLM processing unit 10028, which includes a search information processing unit 1204 and a video scene search processing unit 1205, performs search information acquisition processing to acquire search information based on an instruction sentence based on user input information requesting a scene search against the large-scale language model, and video information which is information about the video to be searched (target video). Furthermore, the local LLM processing unit 10028 performs search accuracy improvement processing to select or refine the search information, and performs scene search within the video using the search information acquired through these processes.

[0335] In the case of the artificial intelligence response output system according to Example 8, similar to Example 7, the video information, which is information about the target video, includes video-related information and video-related generation information. Furthermore, the video-related information includes, as shown in Figure 26 as Examples 1 to 4, video-related information, text information, audio information, additional image information, etc., which are image information within the target video, similar to Example 7. In addition, the video-related information includes, as shown in Examples 5 to 12, location / map information linked to video-related images, date and time information linked to video-related images, speed information linked to video-related images, direction information linked to video-related images, vibration information linked to video-related images, temperature information linked to video-related images, activity history (log) information linked to video-related images, external communication history (log) information linked to video-related images, etc. Activity history information may include, for example, the vehicle's movement history or communication history due to operations such as acceleration and braking. Furthermore, various types of information, such as location / map information, date and time information, speed information, direction information, vibration information, temperature information, activity history information, and external communication history information, are acquired from positioning sensors 1191, timers 1192, thermosensors 1193, etc., equipped in the artificial intelligence response output device 10010. Of course, the artificial intelligence response output device 10010 may also be equipped with other sensors as needed. In addition, if the artificial intelligence response output device 10010 is mounted on a vehicle, it may be configured to acquire the above types of information from sensors equipped in the vehicle.

[0336] Furthermore, the search information in Example 8 includes video-related information extracted from the video information, similar to Example 7. This video-related information includes, as shown in Figure 27 as Examples 1 to 4, video image information related to the input information, text information linked to video image information related to the input information, audio information linked to video image information related to the input information, and additional image information linked to video image information related to the input information, similar to Example 7. In addition, as shown in Examples 5 to 12, this video-related information includes location / map information linked to video images related to the input information, date and time information linked to video images related to the input information, speed information linked to video images related to the input information, direction information linked to video images related to the input information, vibration information linked to video images related to the input information, temperature information linked to video images related to the input information, activity history information linked to video images related to the input information, and external communication history information linked to video images related to the input information.

[0337] In the response output system according to Example 8, the response output processing is performed in the same flow as in Example 7 (see Figure 13). For example, after the search information acquisition process is performed in step S015, the search accuracy improvement process is performed in step S021. In Example 8, in the search accuracy improvement process, information acquired from the artificial intelligence response output device 10010 or sensors provided by the vehicle and linked to the video images related to the input information, such as location / map information linked to the video images related to the input information, date and time information linked to the video images related to the input information, speed information linked to the video images related to the input information, direction information linked to the video images related to the input information, vibration information linked to the video images related to the input information, temperature information linked to the video images related to the input information, behavior history information linked to the video images related to the input information, and external communication history information linked to the video images related to the input information, is set as priority information. Furthermore, the LLM may be responsible for deciding which information to prioritize.

[0338] Specifically, as shown in Figure 28 as Examples 1 to 9, it is preferable to select the information to be set as priority information depending on whether the user input information includes, for example, location, date and time, speed, direction, vibration, temperature, activity history, external communication history, etc. For example, as shown in Figure 28 as Example 1, if the input information includes location, more specifically, if it includes geographical location, position, distance, proximity, or landmark, the priority information will be set to location / map information linked to the in-video image related to the input information, or video-related generated information generated based on the location / map information linked to the in-video image related to the input information. As an example, let's assume that location / map information linked to the in-video information related to the input information is set as priority information.

[0339] In this case, during the search accuracy improvement process in step S021, even if the information was acquired as search information during the search information acquisition process, if its content differs from the priority information (location / map associated with the in-video image information related to the input information), it is treated as non-searchable information. For example, if the in-video image information, text information associated with the in-video image information, and audio information associated with the in-video image information acquired as search information differ in content from the priority information, which is the location / map information, this information is selected as non-searchable information and excluded from the search information. In other words, the location / map associated with the in-video image information related to the input information is set higher in the priority of search information than the in-video image information, text information associated with the in-video image information, etc. This allows for more appropriate scene searching of the target video during the response generation process. Furthermore, because priority information is set according to the input information, the extraction of inappropriate scenes in scene searches is more appropriately suppressed.

[0340] As shown in Figures 29A and 29B, a user uploads a video linked to text information, audio information, and location information to the artificial intelligence response output device 10010, and the request statement "Please display the scene when I am in Okayama Prefecture" is input as input information for the uploaded video (target video). In this case, based on the content of the text information such as "when I am in Okayama Prefecture," the input information is determined to be location-related and highly relevant to location, and location / map information is set as priority information. The example in Figure 29A is an example where the video images (video image information) of a video being taken while moving outside Okayama Prefecture are linked to text information such as "I think I've entered Okayama Prefecture now." On the other hand, the example in Figure 29B is an example where the video images (video image information) of a video being taken while moving within Okayama Prefecture are linked to text information such as "This is Kibitsu Shrine in Okayama Prefecture."

[0341] As described above, when the search accuracy improvement process is executed with location / map information linked to in-video image information related to the input information as the priority information, in response to the user's request statement, "Please show the scenes when I am in Okayama Prefecture," it is first determined, based on the location / map information, whether or not the in-video image is an image of Okayama Prefecture. If it is determined that the image is outside Okayama Prefecture, the information related to that in-video image will be filtered out as non-searchable information. For example, text information and audio information linked to in-video image information, such as "I think I've entered Okayama Prefecture now," will also be filtered out as non-searchable information.

[0342] Incidentally, text information linked to the in-video image information, such as "I think we've entered Okayama Prefecture now," is acquired as search information in the search information acquisition process, as described above. Therefore, if the search accuracy improvement process is not performed, the response generation process in step S016 will perform a scene search based on the search information that includes the text information "We're already in Okayama...". For this reason, in the example shown in Figure 29A, in response to the user's request, "Please display the scene when we are inside Okayama Prefecture," the scene search may extract scenes containing in-video images of driving outside Okayama Prefecture and output them to the user.

[0343] Furthermore, if the user input includes ambiguous expressions, such as those relating to distance or degree, it is preferable to set appropriate thresholds corresponding to these expressions. For example, if the input includes "near point A," it is preferable to set a threshold such as "within a specified distance (e.g., within 1 kilometer)" instead of the expression "near."

[0344] <Example 9> Embodiment 9 of the present invention is an improvement on the artificial intelligence response output device 10010 or artificial intelligence response output system described in Embodiment 8. In this embodiment, the differences from Embodiment 8 will be explained, and the same configuration as in Embodiment 8 will not be repeated in the explanation.

[0345] The artificial intelligence response output device 10010 according to Example 9 is a device capable of acquiring special information such as the following, and specific examples include mobile terminal devices such as smartphones and tablets. The configuration of the artificial intelligence response output device 10010 according to Example 9 is the same as that of the artificial intelligence response output device 10010 according to Example 8, except that it does not have a thermosensor 1193 (see Figure 25).

[0346] In Examples 7 and 8, the search information acquisition process was based on user input information (instructions) and video information, which is related to the target video. In contrast, Example 9 differs from Examples 7 and 8 in that it acquires search information based on special information in addition to user input information and video information.

[0347] Special information, specifically as shown in Figure 30 as Examples 1 to 6, includes, for example, personal or real-time (real-time) image information, personal or real-time location information, personal or real-time direction information, personal or real-time date and time information, personal activity history (log) information, and device-internal information. In other words, special information is personal or real-time information acquired by a mobile terminal device such as a smartphone owned by the user, or device-internal information stored in the storage unit, etc.

[0348] Furthermore, the search information includes not only video information related to the input information, but also special information related to the input information and information that is highly relevant to the special information related to the input information. Examples of special information and information highly relevant to special information included in the search information are shown in Figure 31 as Examples 1 to 6, such as: "Personal or real-time image information related to the input information," "Personal or real-time location information / personal or real-time map information related to the input information," "Personal or real-time direction information / personal or real-time location information or map information related to the input information," "Personal or real-time date and time information / personal or real-time location information related to the input information," "Personal activity history (log) information / personal calendar information related to the input information," and "Device built-in information related to the input information."

[0349] Figure 32 is a diagram showing an overview of the functional blocks of the artificial intelligence response output system according to Embodiment 9. As shown in Figure 32, the artificial intelligence response output system, like Embodiments 7 and 8, comprises a video information acquisition / transmission unit 1201, a video-related information processing unit 1202, a video-related generation information processing unit 1203, a search information processing unit 1204, and a video scene search processing unit 1205, and further comprises a special information processing unit 1206. The special information processing unit 1206 performs various processes related to special information, such as acquiring the special information described above, and stores the acquired special information in a storage unit 1170 or the like.

[0350] Furthermore, in Example 9, similar to Example 8, the artificial intelligence response output device 10010 or artificial intelligence response output system performs scene search processing within the video using the local large-scale language model of the local LLM processing unit 10028, and outputs the search results as a response. For this purpose, the local LLM processing unit 10028 is equipped with a search information processing unit 1204, a video scene search processing unit 1205, and a special information processing unit 1206, and when performing scene search processing within the video, it performs search information acquisition processing based on user input information, video information, and special information. In addition, the local LLM processing unit 10028 performs search accuracy improvement processing to select or refine the search information, and performs scene search within the video using the search information acquired through these processes.

[0351] In the response output system according to Example 9, the response output processing is performed in the same flow as in Example 7 (see Figure 13). For example, in step S015, the search information acquisition process is performed, and then in step S021, the search accuracy improvement process is further performed.

[0352] In this embodiment 9, the process shown in Figure 33 is executed as the search information acquisition process in step S015. First, similar to the embodiment described above, the system refers to the user input information and video information to determine whether each video piece is related to the input information, and acquires the information related to the input information as search information. The system also refers to the input information to determine whether each acquired special piece of information is related to the input information, and acquires the special piece of information related to the input information as search information. Furthermore, the system determines whether there is any information that is highly related to the special piece of information acquired as search information (for example, information within the device), and if there is any highly related information, it acquires it as search information.

[0353] Furthermore, the following processes may be executed as the search information acquisition process in step S015. As illustrated in Figure 33, various reference information for determining video information (search information) related to the input information may be acquired from the storage unit 1170 or an external storage unit, and the video information may be determined based on the acquired reference information. Alternatively, the input information and video information may be converted into text, audio, or images, and the video information may be determined based on that information. Furthermore, various reference information for determining special information (search information) related to the input information may be acquired from the storage unit 1170 or an external storage unit, and the special information may be determined based on the acquired reference information. Alternatively, the input information and special information may be converted into text, audio, or images, and the special information may be determined based on that information.

[0354] Furthermore, in the response output system according to Embodiment 9, as shown as Example 1 in Figure 34, if the input information is personal or real-time content (excluding Examples 2 to 5), the special information processing unit 1206 acquires information including personal or real-time image information as special information. Then, the search information acquisition process in step S015 is executed based on the video information and the special information. In the search information acquisition process, special information including personal or real-time image information related to the input information, which is stored in the artificial intelligence response output device 10010 or captured by the imaging unit (camera) 1180 of the artificial intelligence response output device 10010, is acquired as search information.

[0355] Furthermore, as part of the search enhancement process in step S021, the acquired special information is notified to the user, and the user is requested to confirm the validity of the special information. If the validity is confirmed, the acquired special information is retained as search information. On the other hand, if the validity cannot be confirmed, the first response is to retain the acquired special information as search information, but the response generation process performs a search of scenes in the video based on estimation. In this case, it is desirable to generate a response (response sentence) that indicates that the search is based on estimation. The second response is to remove the acquired special information from the search information as non-search information. In this case, additional special information is acquired, and the acquired special information is notified to the user, and the user is requested to confirm its validity.

[0356] As shown in Figure 35 as an example, suppose a user uploads a video to be searched to the artificial intelligence response output device 10010, and the user inputs the request statement, "Please show me the scene in which my dog appears," as input information for the uploaded video. In this case, for example, special information, including personal or real-time image information related to the input information, stored in the artificial intelligence response output device 10010, which is a portable information terminal, is acquired as search information, and the validity of the special information is verified. For example, an image presumed to be the user's dog is acquired as special information from the images stored in the storage unit 1170. Then, as notification to the user of the acquired special information, an LLM response is output along with the acquired image, asking, "Is this your dog?", and the user is asked to verify the validity of the special information. If the user inputs "yes" to this verification request and the validity of the special information is confirmed, this special information is retained as search information. Subsequently, in the response generation process of step S016, a scene search of the video is performed based on the search information including the special information.

[0357] This allows for more accurate scene searches within the target video. For example, it can effectively extract scenes featuring the user's pet dog. Furthermore, verifying the validity of the acquired special information can further improve the accuracy of scene searches.

[0358] Furthermore, as shown in Figure 34 as Example 2, in the response output system according to Embodiment 9, if the input information includes personal or real-time location information, information including personal or real-time location information may be acquired as special information. In the search information acquisition process, special information including personal or real-time location information related to the inp...

Claims

1. A response output system, Large-scale language models and, A control unit that obtains a response from the large-scale language model to an instruction sentence to the large-scale language model, The control unit comprises an output unit that outputs based on the response of the large-scale language model acquired by the control unit, The control state of the control unit includes a state in which it controls the output of a response generated based on the response of the large-scale language model via the output unit. Response output system.

2. A response output system according to claim 1, The control state by the control unit includes a state in which it controls the output of scene search results for the target video, which are generated based on the response of the large-scale language model and use the search information obtained based on the instruction sentence, via the output unit. Response output system.

3. In the response output system according to claim 2, The system includes a search information processing unit that performs the process of obtaining the search information based on the instruction statement and the video information of the target video. Response output system.

4. In the response output system according to claim 3, The aforementioned video information includes video-related information related to the aforementioned target video. The video-related information includes at least one of the following: in-video image information which is image information within the target video; text information associated with the in-video image information; audio information associated with the in-video image information; and additional image information associated with the in-video image information. Response output system.

5. In the response output system according to claim 4, The aforementioned video information includes video-related generated information generated based on the aforementioned video-related information. The video-related generated information includes at least one of image information generated based on the video-related information, text information generated based on the video-related information, and audio information generated based on the video-related information. Response output system.

6. In the response output system according to claim 3, The aforementioned search information processing unit is: A search accuracy improvement process is executed to select the search information based on the context of the search information. Response output system.

7. In the response output system according to claim 6, The aforementioned search information processing unit is: The aforementioned search accuracy improvement process does not select information about the word immediately preceding a negative word as search information. Response output system.

8. In the response output system according to claim 6, The aforementioned search information processing unit is: The aforementioned search accuracy improvement process does not select information about the word immediately preceding the estimated or questioned word as search information. Response output system.

9. In the response output system according to claim 6, The aforementioned search information processing unit is: The aforementioned search accuracy improvement process does not select information about truncated phrases as search information. Response output system.

10. In the response output system according to claim 6, The aforementioned search information processing unit is: The aforementioned search accuracy improvement process does not select information about operations that were interrupted midway as search information. Response output system.

11. In the response output system according to claim 3, The aforementioned search information processing unit is: The system performs a search accuracy improvement process that sets priority information from multiple video information included in the search information, and does not select video information whose content differs from the priority information as part of the search information. Response output system.

12. In the response output system according to claim 11, The search information includes additional image information linked to the in-video image information, which is image information within the target video. The aforementioned search information processing unit is: The aforementioned search accuracy improvement process sets the additional image information linked to the image information within the video as the priority information. Response output system.

13. In the response output system according to claim 11, The aforementioned search information processing unit is: The search accuracy improvement process sets the in-video image information, which is image information within the target video and is not silhouette image information, as the priority information, and does not select the in-video image information, which is silhouette image information, as the search information. Response output system.

14. In the response output system according to claim 11, The aforementioned search information processing unit is: The aforementioned search accuracy improvement process does not select information for which the ratio of the target image area to the entire image area is less than a predetermined value as search information. Response output system.

15. In the response output system according to claim 11, The aforementioned search information processing unit is: The search information obtained based on the video information of the target video for the same time period is referenced in the search accuracy improvement process. Response output system.

16. In the response output system according to claim 11, If the video information for different time periods of the aforementioned target video contains information indicating a relationship, the search information processing unit shall The search information obtained based on video information from different time periods of the target video is referenced in the search accuracy improvement process. Response output system.

17. In the response output system according to claim 4, The video-related information further includes at least one of the following: location / map information linked to the in-video image information, date and time information linked to the in-video image information, speed information linked to the in-video image information, direction information linked to the in-video image information, vibration information linked to the in-video image information, temperature information linked to the in-video image information, activity history information linked to the in-video image information, and external communication history information linked to the in-video image information. Response output system.

18. In the response output system according to claim 3, The aforementioned search information processing unit is: In addition to the aforementioned instruction text and video information, the search information is obtained based on special information including the user's personal information or real-time information. Response output system.

19. In the response output system according to claim 18, The aforementioned search information processing unit is: When obtaining the search information based on the special information, the special information is notified to the user, and the user is requested to confirm the validity of the special information. Response output system.

20. A control unit that acquires the response of a large-scale language model to an instruction sentence, The control unit provides an output unit that outputs based on the response of the large-scale language model acquired by the control unit, Equipped with, The control state of the control unit includes a state in which, as a response to the large-scale language model, scene search results of the target video, obtained based on the instruction sentence, are output via the output unit. Response output device.