Embedded representation generation system

The embedding expression generation system addresses the challenge of representing relationships between different entities by using a language model and graph neural network to generate and adjust embedding representations, enabling accurate distance calculation and link prediction between users and topics.

JP7874067B2 · Active · Publication Date: 2026-06-15 · NTT DOCOMO INC


Patent Information

Authority / Receiving Office: JP · JP
Patent Type: Patents
Current Assignee / Owner: NTT DOCOMO INC
Filing Date: 2023-01-31
Publication Date: 2026-06-15


IPC: G06F16/383

AI Tagging

Application Domain

Digital data information retrieval; Special data processing applications





Abstract

PROBLEM TO BE SOLVED: To obtain an embedded expression that properly expresses a relation between entities.
SOLUTION: An embedded expression generating system 1 includes: a language understanding unit that acquires a decoded text by inputting, to a decoding portion, a synthesized embedded expression obtained by synthesizing a user embedded expression with a user speech embedded expression obtained by inputting a first user speech text into an embedding portion, and that carries out machine learning of a language model having the embedding portion and the decoding portion, and of the user embedded expression, based on an error between a second user speech text and the decoded text; an expression acquiring unit that acquires a topic embedded expression by inputting a topic word to the embedding portion; and a relation learning unit that, for a relation graph having users and topics as nodes and edges set based on user speech records, acquires an embedded expression of each node by training a graph neural network that uses the user embedded expression and the topic embedded expression as the feature amounts of the respective nodes.
SELECTED DRAWING: Figure 1


Description

Technical Field

[0001] The present invention relates to an embedded expression generation system.

Background Art

[0002] A system for summarizing abstract text is known. For example, in the system described in Patent Document 1, an encoder-decoder model realized by a recurrent neural network is used. The encoder processes the input token embedding of a document, and the output token of the decoder is processed to obtain the output of the summary token.

Prior Art Documents

Patent Documents

[0003]

Patent Document 1

Summary of the Invention

Problems to be Solved by the Invention

[0004] By using an appropriately configured encoder-decoder model, for example, an embedded expression of a word representing a topic, a location, etc. can be obtained. Similarly, it is also possible to obtain an embedded expression of a person, which is an entity of a different type from a word. Since the embedded expression is represented by a real vector, it was possible to calculate the distance between entities of the same type, such as between words and between people. However, in the generation of the embedded expressions of words and people respectively, when the relationship between a person and a word has not been learned, the relationship between them is not reflected in the embedded expressions of the person and the word respectively, so the distance between entities of different types, such as between a person and a word, could not be calculated.
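The distance calculation between same-type entities mentioned above can be sketched as follows. The text does not name a distance metric, so cosine distance is an assumption here, and the three word vectors are invented purely for illustration.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length real vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical word embeddings (values invented for illustration).
curry = [0.9, 0.1, 0.3]
ramen = [0.8, 0.2, 0.4]
tennis = [0.1, 0.9, 0.2]

# Two food words should be closer to each other than a food word and a sport.
closer = cosine_distance(curry, ramen) < cosine_distance(curry, tennis)
```

The same function would return a number for a person vector and a word vector of equal dimension, but, as the paragraph explains, that number carries no meaning unless the two embeddings were learned with the person-word relationship taken into account.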

[0005] Therefore, the present invention has been made in view of the above problems, and an object thereof is to obtain an embedded expression of an entity in which the relationship between different entities is appropriately represented.

Means for Solving the Problem

[0006] To solve the above problems, an embedding representation generation system according to one aspect of the present disclosure generates embedding representations of at least a user and a topic, and includes the following units:
a language understanding unit that trains a language model composed of an encoder-decoder model including an embedding unit and a decoding unit, where the embedding unit outputs an embedding representation that represents the features of the input text, and the decoding unit decodes an embedding representation that includes at least the output from the embedding unit. The language understanding unit obtains a user utterance embedding representation output from the embedding unit by inputting into the embedding unit a first user utterance text, which represents the content of an utterance of one user from among the utterance texts representing the content of users' utterances; obtains a decoded text output from the decoding unit by inputting into the decoding unit a composite embedding representation, which is a combination of the user utterance embedding representation and the user embedding representation of that one user; and performs machine learning to adjust the language model and the user embedding representation so that the error between the decoded text and the second user utterance text, which follows the first user utterance text in the utterance text, becomes small, where the user embedding representation is either an initial user embedding representation before learning or a user embedding representation during the learning process;
a topic extraction unit that extracts topic words, which are words or phrases representing the topic in the user's utterance, from the utterance text;
an embedding representation acquisition unit that inputs the topic words into the learned embedding unit and acquires the topic embedding representations output from the embedding unit;
a relationship extraction unit that generates a relationship graph based on the user's utterance history and behavior history, where at least the users and the topics are nodes, records of dialogue between users are edges connecting those users, and records of a user uttering a topic word are edges connecting that user and that topic;
a relationship learning unit that obtains a learned embedding representation for each node by training a graph neural network that uses the learned user embedding representations and the topic embedding representations as features of the user nodes and topic nodes, respectively, in the relationship graph; and
an embedding representation output unit that outputs the embedding representation of each node.

[0007] According to the above aspect, a language model composed of an encoder-decoder model is trained using pairs of first and second user utterance texts as training data. The first user utterance text is input to the embedding unit, and the resulting user utterance embedding representation is combined with the user embedding representation. This composite embedding representation is then input to the decoding unit. The language model and the user embedding representation are machine-learned so as to minimize the error between the decoded text output from the decoding unit and the second user utterance text. This process yields an embedding unit (encoder) that outputs a suitable topic embedding representation in response to a topic word input, as well as a user embedding representation that appropriately reflects the user's characteristics. A relationship graph is then generated with users and topics as nodes, and edges are drawn between the nodes based on the users' utterance and behavior histories. The topic embedding representation obtained by inputting a topic word into the embedding unit and the trained user embedding representation are used as the features of the topic node and the user node, respectively, and a graph neural network is trained on the relationship graph. This training yields topic embedding representations and user embedding representations that appropriately reflect the characteristics of the topic words and users. Because the relationships between these entities are reflected in the resulting topic and user embedding representations, the distance between a user and a topic can be calculated.

Effects of the Invention

[0008] This makes it possible to obtain an embedded representation of entities that appropriately shows the relationships between different entities.

Brief Description of the Drawings

[0009]
[Figure 1] A block diagram showing the functional configuration of the embedded representation generation device of this embodiment.
[Figure 2] A hardware block diagram of the embedded representation generation device.
[Figure 3] A diagram schematically explaining the process for acquiring speech text.
[Figure 4] A diagram showing an example of the structure of a language model and the machine learning processing of the language model.
[Figure 5] A diagram showing an example of the process of obtaining an embedded representation using the embedding unit of a trained language model.
[Figure 6] A diagram showing an example of edge acquisition for generating a relationship graph.
[Figure 7] A diagram showing an example of a relationship graph and an example of extracting positive and negative examples from the relationship graph.
[Figure 8] A diagram showing an example of an embedding representation of each entity obtained by training a graph neural network on the relationship graph.
[Figure 9] A flowchart showing the processing details of the embedded representation generation method in the embedded representation generation device.
[Figure 10] A flowchart showing the processing steps involved in machine learning of the language model.
[Figure 11] A diagram showing the configuration of the embedded representation generation program.

Modes for Carrying Out the Invention

[0010] Embodiments of the embedded representation generation system according to the present invention will be described with reference to the drawings. Where possible, the same parts will be denoted by the same reference numerals, and redundant descriptions will be omitted.

[0011] Figure 1 shows the functional configuration of the embedded expression generation system according to this embodiment. The embedded expression generation system 1 of this embodiment is a system that generates embedded expressions of at least the user and the topic, and is composed of an embedded expression generation device 10 as an example.

[0012] As shown in Figure 1, the embedded expression generation device 10 functionally comprises a speech log acquisition unit 11, a speech recognition unit 12, a text acquisition unit 13, an emotion acquisition unit 14, a language understanding unit 15, a topic extraction unit 16, an embedded expression acquisition unit 17, a relationship extraction unit 18, a relationship learning unit 19, an embedded expression output unit 20, and a link prediction unit 21. Each of these functional units 11 to 21 may be configured in a single device as illustrated in Figure 1, or they may be distributed across multiple devices.

[0013] Note that the block diagram shown in FIG. 1 shows blocks in terms of functions. These functional blocks (components) are realized by any combination of at least one of hardware and software. Also, the method of realizing each functional block is not particularly limited. That is, each functional block may be realized using one physically or logically combined device, or two or more physically or logically separated devices may be directly or indirectly connected (for example, using wired, wireless, etc.), and realized using these multiple devices. The functional block may be realized by combining software with the above one device or the above multiple devices.

[0014] Functions include, but are not limited to, judging, deciding, determining, calculating, computing, processing, deriving, investigating, searching, confirming, receiving, transmitting, outputting, accessing, resolving, selecting, choosing, establishing, comparing, assuming, expecting, regarding, broadcasting, notifying, communicating, forwarding, configuring, reconfiguring, allocating, mapping, and assigning. For example, a functional block (component) that performs transmission is called a transmitting unit or a transmitter. In any case, as described above, the realization method is not particularly limited.

[0015] For example, the embedded expression generation device 10 in an embodiment of the present invention may function as a computer. FIG. 2 is a diagram showing an example of the hardware configuration of the embedded expression generation device 10 according to the present embodiment. Physically, the embedded expression generation device 10 may be configured as a computer device including a processor 1001, a memory 1002, a storage 1003, a communication device 1004, an input device 1005, an output device 1006, a bus 1007, and the like.

[0016] In the following description, the term "device" can be read as a circuit, device, unit, etc. The hardware configuration of the embedded representation generation device 10 may be configured to include one or more of each device shown in the figure, or may be configured without including some devices.

[0017] Each function in the embedded representation generation device 10 is realized by causing a predetermined software (program) to be loaded onto hardware such as the processor 1001 and the memory 1002, so that the processor 1001 performs calculations and controls communication by the communication device 1004 and reading and / or writing of data in the memory 1002 and the storage 1003.

[0018] The processor 1001 controls the entire computer by operating, for example, an operating system. The processor 1001 may be composed of a central processing unit (CPU: Central Processing Unit) including an interface with peripheral devices, a control device, an arithmetic device, registers, etc. For example, each functional unit 11 to 21 shown in FIG. 1 may be realized by the processor 1001.

[0019] Also, the processor 1001 reads a program (program code), software module, and data from the storage 1003 and / or the communication device 1004 into the memory 1002, and executes various processes according to these. As the program, a program that causes a computer to execute at least a part of the operations described in the above embodiments is used. For example, each functional unit 11 to 21 of the embedded representation generation device 10 may be stored in the memory 1002 and realized by a control program operating on the processor 1001. Although it has been described that the above various processes are executed by one processor 1001, they may be executed simultaneously or sequentially by two or more processors 1001. The processor 1001 may be implemented on one or more chips. Note that the program may be transmitted from a network via a telecommunication line.

[0020] Memory 1002 is a computer-readable recording medium and may consist of at least one of the following: ROM (Read Only Memory), EPROM (Erasable Programmable ROM), EEPROM (Electrically Erasable Programmable ROM), RAM (Random Access Memory), etc. Memory 1002 may also be called a register, cache, main memory, etc. Memory 1002 can store executable programs (program code), software modules, etc., for carrying out the embedded representation generation method according to one embodiment of the present invention.

[0021] The storage 1003 is a computer-readable recording medium and may consist of at least one of the following: an optical disc such as a CD-ROM (Compact Disc ROM), a hard disk drive, a flexible disk, a magneto-optical disk (e.g., a compact disc, a digital versatile disc, a Blu-ray® disc), a smart card, flash memory (e.g., a card, a stick, a key drive), a floppy® disk, a magnetic stripe, etc. The storage 1003 may also be called an auxiliary storage device. The above-mentioned storage medium may be, for example, a database, server, or other suitable medium including memory 1002 and / or storage 1003.

[0022] The communication device 1004 is hardware (a transmitting / receiving device) for communicating between computers via a wired and / or wireless network, and is also referred to as a network device, network controller, network card, communication module, etc.

[0023] The input device 1005 is an input device that accepts input from an external source (e.g., a keyboard, mouse, microphone, switch, button, sensor, etc.). The output device 1006 is an output device that outputs to an external source (e.g., a display, speaker, LED lamp, etc.). The input device 1005 and the output device 1006 may be configured as an integrated unit (e.g., a touch panel).

[0024] Furthermore, each device, such as the processor 1001 and the memory 1002, is connected by a bus 1007 for communicating information. The bus 1007 may consist of a single bus or different buses may be used for communication between devices.

[0025] Furthermore, the embedded representation generation device 10 may include hardware such as a microprocessor, a digital signal processor (DSP), an ASIC (Application Specific Integrated Circuit), a PLD (Programmable Logic Device), or an FPGA (Field Programmable Gate Array), and some or all of each functional block may be realized by such hardware. For example, the processor 1001 may be implemented using at least one of these pieces of hardware.

[0026] Next, the various functions of the embedded expression generation device 10 will be described. The speech log acquisition unit 11 acquires a speech log that represents the content of the user's speech. The speech recognition unit 12 converts the speech log into text if it is in the form of speech. The text acquisition unit 13 acquires speech text, which is text that represents the content of the user's speech, based on the speech log. The emotion acquisition unit 14 acquires emotion information that represents the user's emotions at the time the user's speech was uttered, based on the speech audio or the user's facial expression, and associates the acquired emotion information with the speech text that represents the content of the speech.

[0027] Referring to Figure 3, the processing details of the speech log acquisition unit 11, speech recognition unit 12, text acquisition unit 13, and emotion acquisition unit 14 will be explained in detail. Figure 3 is a schematic diagram illustrating the process of acquiring speech text.

[0028] The speech log acquisition unit 11 may acquire a speech log representing the content of the user's speech in text form based on input via an input device 41, such as a keyboard or touch panel. Alternatively, the speech log acquisition unit 11 may acquire a speech log representing the content of the user's speech in audio data form based on voice input via a microphone 42, for example.

[0029] The speech log acquired by the speech log acquisition unit 11 may be audio or text (chat) representing the content of the user's speech in a predetermined virtual space. The predetermined virtual space may, for example, be a virtual space known as a metaverse. The user's speech may be speech made by an avatar in a virtual space such as a metaverse, and the speech log acquisition unit 11 may acquire the speech log representing the avatar's speech in the form of audio or text.

[0030] The speech recognition unit 12 converts speech into text when the speech log acquisition unit 11 acquires a speech log in audio form. The speech recognition unit 12 may convert the audio speech log into text by any method, for example, by using well-known speech recognition technology.

[0031] The text acquisition unit 13 acquires speech text, which is text representing the content of the user's utterance, based on the speech log. If the speech log is acquired in text form by the speech log acquisition unit 11, the text acquisition unit 13 acquires the text representing the speech log as speech text. If the speech log is acquired in voice form by the speech log acquisition unit 11, the text acquisition unit 13 acquires the speech log converted into text by the speech recognition unit 12 as speech text. The text acquisition unit 13 then sends the acquired speech text t1 to the language understanding unit 15.
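The branching between text logs and audio logs described above can be sketched as follows. This is a minimal illustration only: `SpeechLog`, `acquire_utterance_text`, and `fake_recognizer` are hypothetical names, and the recognizer is a stand-in for whatever speech recognition technology the speech recognition unit 12 would actually use.

```python
from dataclasses import dataclass
from typing import Callable, Union

@dataclass
class SpeechLog:
    user_id: str
    payload: Union[str, bytes]  # str for text (chat) logs, bytes for audio logs

def acquire_utterance_text(log: SpeechLog, recognize: Callable[[bytes], str]) -> str:
    """Return the utterance text for a log, transcribing audio when needed."""
    if isinstance(log.payload, str):
        return log.payload          # text log: use it as the speech text directly
    return recognize(log.payload)   # audio log: delegate to the recognizer

def fake_recognizer(audio: bytes) -> str:
    # Stand-in for a real speech recognizer; it only decodes UTF-8 bytes.
    return audio.decode("utf-8")

text_from_chat = acquire_utterance_text(SpeechLog("A", "Tonight's dinner is curry"), fake_recognizer)
text_from_audio = acquire_utterance_text(SpeechLog("A", b"curry"), fake_recognizer)
```

Passing the recognizer in as a callable keeps the text acquisition step independent of any particular recognition method, which matches the "by any method" wording of paragraph [0030].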

[0032] The emotion acquisition unit 14 acquires emotion information representing the user's emotions at the time the user utters a speech, based, for example, on the user's speech voice acquired via the microphone 42, or on an image representing the user's facial expression acquired via the camera 43.

[0033] The emotion acquisition unit 14 may acquire the user's emotional information from the spoken voice by any method, for example, by using well-known emotion recognition technology. Furthermore, the emotion acquisition unit 14 may acquire the user's emotional information from an image representing the user's facial expression by any method, for example, by using well-known facial expression recognition technology.

[0034] Furthermore, the source of emotion information is not limited to the user's facial expressions and spoken voice; the emotion acquisition unit 14 may also acquire it from the state of the user's avatar when the user speaks in the virtual space.

[0035] Emotional information includes categories such as "joy," "anger," "sadness," and "surprise," and certain specific emotional categories such as "happy" and "calm" can be classified as positive emotions.

[0036] The emotion acquisition unit 14 associates emotion information acquired from the user's facial expressions and voice during speech with the speech text t1 that represents the content of the speech. Therefore, the language understanding unit 15 can acquire the speech text t1 to which the emotion information is associated.

[0037] The language understanding unit 15 performs machine learning on a language model composed of an encoder-decoder model. Figure 4 shows an example of the configuration of the language model and the machine learning processing of the language model. The language model md is an encoder-decoder model composed of a neural network, and includes an embedding unit en (encoder) and a decoding unit de (decoder).

[0038] The configuration of the language model md is not limited, but it may be an encoder-decoder model consisting of pairs of recurrent neural networks such as seq2seq, or it may be composed of transformers such as T5 (Text-to-Text Transfer Transformer).

[0039] The embedding unit en encodes the input text and outputs an embedding representation that represents the characteristics of the text. The decoding unit de decodes the embedding representation, which includes at least the output from the embedding unit en, and outputs the decoded text dt. In the description of the input and output of the language model, "text" refers to vector data obtained by converting text using a predetermined method, or to vector data that represents text and is output.

[0040] The language understanding unit 15 inputs a first user utterance text, which represents the content of one user's utterance, from among the utterance texts representing the content of the user's utterance, into the embedding unit en, thereby obtaining the user utterance embedding representation output from the embedding unit en.

[0041] In the example shown in Figure 4, the utterance text ut ("Tonight's dinner is" -> "Curry") is training data for learning the language model md. The language understanding unit 15 inputs the first user utterance text ut1 ("Tonight's dinner is") from this utterance text into the embedding unit en, and obtains the user utterance embedding representation ebs that has been encoded and output by the embedding unit en.

[0042] Here, the language understanding unit 15 acquires user embeddings, which are the user's embedded representations. For example, the embedding representation generation system 1 may further include a user embedding representation management unit 22. The user embedding representation management unit 22 may generate and manage initial user embeddings before learning. The user embedding representation management unit 22 may also manage user embeddings during the learning process. The user embedding representation management unit 22 may be configured as a functional unit of the embedding representation generation device 10 shown in Figure 1, or it may be configured as a separate device.

[0043] The user embedding representation is represented by a real-valued vector. The initial user embedding representation may be a random real-valued vector, or it may be a real-valued vector consisting of feature quantities that reflect some characteristics of the user. In the embedding representation generation system 1 of this embodiment, the method for obtaining the initial user embedding representation is not limited and may be any well-known method.
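A minimal sketch of how initial user embeddings might be generated and managed, in the spirit of the user embedding representation management unit 22. The class name, the dimension, and the uniform range are all assumptions; the text only says that the initial vector may be random or built from user features.

```python
import random

class UserEmbeddingStore:
    """Holds one real-valued vector per user (hypothetical helper)."""

    def __init__(self, dim, seed=0):
        self.dim = dim
        self._rng = random.Random(seed)
        self._vectors = {}

    def get(self, user_id):
        # Create a small random initial embedding on first access.
        if user_id not in self._vectors:
            self._vectors[user_id] = [
                self._rng.uniform(-0.1, 0.1) for _ in range(self.dim)
            ]
        return self._vectors[user_id]

    def update(self, user_id, vector):
        # Store the embedding as adjusted during the learning process.
        if len(vector) != self.dim:
            raise ValueError("dimension mismatch")
        self._vectors[user_id] = list(vector)

store = UserEmbeddingStore(dim=4)
initial = store.get("user_A")
store.update("user_A", [1.0, 2.0, 3.0, 4.0])
```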

[0044] The language understanding unit 15 generates a composite embedding expression by combining a user utterance embedding expression and a user embedding expression which is the embedding expression of that particular user. The language understanding unit 15 may also generate a composite embedding expression by concatenating the user utterance embedding expression and the user embedding expression. In the example shown in Figure 4, the language understanding unit 15 obtains the user embedding expression ebu of user A from the user embedding expression management unit 22, and concatenates the user utterance embedding expression ebs, which is the embedding expression of the first user utterance text ut1, with the user embedding expression ebu of user A to generate a composite embedding expression ebl. Then, the language understanding unit 15 inputs the composite embedding expression ebl to the decoding unit de, and obtains the decoded text dt which is decoded by the decoding unit de.

[0045] The language understanding unit 15 performs machine learning to adjust the language model and the user embedding representation so that the error between the decoded text and the second user utterance text, which follows the first user utterance text in the utterance text, is reduced. In the example shown in Figure 4, the language understanding unit 15 adjusts the language model md and the user embedding representation ebu so that the error between the decoded text dt and the second user utterance text ut2 ("curry"), which follows the first user utterance text ut1 in the utterance text ut ("Tonight's dinner is" -> "curry"), is reduced.
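The joint adjustment of the decoder and the user embedding can be illustrated with a deliberately tiny stand-in. In this sketch the decoding unit is reduced to a single linear map `W`, the texts are replaced by small invented vectors, the error is squared error, and the embedding unit's output `ebs` is held fixed; none of these choices come from the source, which uses a neural encoder-decoder and does not specify the loss. What the sketch does keep is the structure: `ebs` and `ebu` are concatenated into `ebl`, decoded, and both the decoder parameters and `ebu` receive gradient updates.

```python
def train_step(W, ebs, ebu, target, lr=0.1):
    """One gradient-descent step on squared error; returns the loss before the update."""
    ebl = ebs + ebu  # composite embedding: list concatenation of ebs and ebu
    pred = [sum(w * x for w, x in zip(row, ebl)) for row in W]
    err = [p - t for p, t in zip(pred, target)]
    loss = sum(e * e for e in err)
    n_s = len(ebs)
    # Gradient w.r.t. the user embedding, computed before W is modified.
    grad_u = [
        2 * sum(err[i] * W[i][n_s + j] for i in range(len(W)))
        for j in range(len(ebu))
    ]
    # Gradient w.r.t. the decoder weights: d(loss)/dW[i][k] = 2 * err[i] * ebl[k].
    for i, row in enumerate(W):
        for k in range(len(row)):
            row[k] -= lr * 2 * err[i] * ebl[k]
    for j in range(len(ebu)):
        ebu[j] -= lr * grad_u[j]
    return loss

ebs = [0.5, -0.2]      # hypothetical encoding of "Tonight's dinner is"
ebu = [0.0, 0.0]       # initial user embedding for user A
target = [1.0, 0.0]    # hypothetical vector standing in for "curry"
W = [[0.1, 0.0, 0.2, 0.0], [0.0, 0.1, 0.0, 0.2]]
losses = [train_step(W, ebs, ebu, target) for _ in range(50)]
```

After the loop the user embedding has moved away from its initial value, which is the point of the scheme: the per-user vector absorbs whatever helps predict that user's next utterance.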

[0046] Furthermore, the language understanding unit 15 may perform machine learning to adjust the language model md and user embedding representations using speech text associated with emotion information representing predetermined positive emotions. As mentioned above, speech text ut can be accompanied by emotion information representing the user's emotions at the time the speech text was uttered. In such cases, the language understanding unit 15 may perform machine learning to adjust the language model md and user embedding representations using speech text ut associated with emotion information representing positive emotions such as "happy" or "calm" as training data.

[0047] In this way, by using speech texts associated with emotional information representing positive emotions in machine learning, it is possible to use combinations of first and second user speech texts that are likely to be expressed when the user is experiencing positive emotions as training data. By performing machine learning using such training data, it is possible to obtain an embedding unit and a user embedding expression that can generate topic embedding expressions that reflect a favorable relationship between the user and topic words, etc.
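Selecting training pairs by emotion label, as described above, might look like the following sketch. The record layout (`first`, `second`, `emotion` keys) is hypothetical; the set of positive labels uses only the two examples named in the text.

```python
POSITIVE_EMOTIONS = {"happy", "calm"}  # the positive categories named in the text

def positive_training_pairs(utterances):
    """Return (first_text, second_text) pairs whose emotion label is positive."""
    return [
        (u["first"], u["second"])
        for u in utterances
        if u.get("emotion") in POSITIVE_EMOTIONS
    ]

# Invented sample records for illustration.
records = [
    {"first": "Tonight's dinner is", "second": "curry", "emotion": "happy"},
    {"first": "The train was", "second": "late again", "emotion": "anger"},
    {"first": "This weekend I will", "second": "play tennis", "emotion": "calm"},
    {"first": "No label here", "second": "at all"},
]
pairs = positive_training_pairs(records)
```

Records without an emotion label are dropped here; whether unlabeled utterances should instead be kept is a design choice the source leaves open.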

[0048] The language model md, which includes a pre-trained neural network, can be understood as a program that is loaded or referenced by a computer, causing the computer to perform predetermined processes and realize predetermined functions.

[0049] In other words, the trained language model md of this embodiment is used in a computer equipped with a CPU and memory. Specifically, the computer's CPU operates in accordance with instructions from the trained language model md stored in memory, performing calculations on the input data input to the input layer of the neural network, for example, based on trained weight coefficients (parameters) and response functions corresponding to each layer, and outputting the result (probability) from the output layer.

[0050] Referring again to Figure 1, the topic extraction unit 16 extracts topic words from the utterance text, which are words or phrases that represent the topic in the user's utterance. The method applied to the extraction of topic words is not limited, and the topic extraction unit 16 can extract topic words by using, for example, well-known methods such as morphological analysis and text mining.
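As a rough stand-in for topic word extraction, the sketch below counts non-stopword tokens and returns the most frequent ones. This is a simple frequency heuristic, not morphological analysis, and the stopword list and function name are assumptions made for illustration.

```python
import re
from collections import Counter

# Small illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "is", "are", "i", "to", "and", "of", "for", "it"}

def extract_topic_words(text, top_k=2):
    """Return up to top_k most frequent non-stopword tokens in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return [word for word, _ in counts.most_common(top_k)]

topics = extract_topic_words("I love curry. Curry is the best dinner.", top_k=1)
```

For Japanese utterances, tokenization by regular expression would not work, which is one reason the text points to morphological analysis instead.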

[0051] The embedding expression acquisition unit 17 inputs the topic word into a trained embedding unit and acquires the topic embedding expression output from the embedding unit. Figure 5 shows an example of the embedding expression acquisition process using the embedding unit of a trained language model. As shown in Figure 5, the embedding expression acquisition unit 17 acquires the topic embedding expression ebt by inputting the topic word tp extracted by the topic extraction unit 16 into the trained embedding unit en. The trained embedding unit en can output a suitable topic embedding expression that appropriately reflects the characteristics of the topic in response to the input topic word.

[0052] Furthermore, the embedding representation acquisition unit 17 may further acquire location embedding representations output from the embedding unit en by inputting location text representing a location into the learned embedding unit en. The location text may be, for example, the name of the location and a descriptive text explaining the location. This allows for obtaining location embedding representations that suitably reflect the characteristics of the location.

[0053] The relationship extraction unit 18 generates a relationship graph with at least the user and topic as nodes, based on the user's speech history (speech log) and action history. The relationship extraction unit 18 may also generate a relationship graph that further includes location as a node.

[0054] The relationship extraction unit 18 extracts relationships between nodes based on the user's utterances and actions, and draws edges based on the extracted relationships. In this embodiment, the relationship extraction unit 18 generates a relationship graph based on the user's utterance history and action history in a predetermined virtual space.

[0055] Figure 6 shows an example of edge acquisition for generating a relation graph. As shown in Figure 6, the relation extraction unit 18 acquires the user's utterance history hs (utterance log and utterance text, etc.) in a virtual space such as the metaverse. The relation extraction unit 18 extracts the actual user interaction r1 from the user's utterance history hs and assigns it as the edge ed1 between the user's nodes in the relation graph.

[0056] Furthermore, the relationship extraction unit 18 extracts the user's utterance record r2 of a topic word from the user's utterance history hs and assigns it as an edge ed2 that connects the user's node and the topic word's node.

[0057] Furthermore, the relationship extraction unit 18 acquires the user's behavior history ha in the virtual space. The relationship extraction unit 18 then extracts the user's visit record r3 to a location from the user's behavior history ha and assigns it as an edge ed3 that connects the user's node and the node of that location.
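The three edge types described above (ed1 between users, ed2 between a user and a topic word, ed3 between a user and a location) can be sketched as an adjacency map. This is a minimal illustration under assumptions: the function name `build_relation_graph`, the input tuple layouts, and the `(kind, name)` node keys are hypothetical and do not come from the embodiment.

```python
from collections import defaultdict


def build_relation_graph(dialogues, topic_mentions, visits):
    """Build an undirected relation graph as {node: set of neighbor nodes}.

    dialogues:      iterable of (user_a, user_b) pairs   -> edges ed1
    topic_mentions: iterable of (user, topic_word) pairs -> edges ed2
    visits:         iterable of (user, location) pairs   -> edges ed3
    Nodes are (kind, name) tuples so that a user and a topic sharing a
    name do not collide.
    """
    graph = defaultdict(set)

    def add_edge(u, v):
        graph[u].add(v)
        graph[v].add(u)

    for a, b in dialogues:
        add_edge(("user", a), ("user", b))
    for user, topic in topic_mentions:
        add_edge(("user", user), ("topic", topic))
    for user, place in visits:
        add_edge(("user", user), ("location", place))
    return dict(graph)


if __name__ == "__main__":
    g = build_relation_graph(
        dialogues=[("alice", "bob")],
        topic_mentions=[("alice", "soccer")],
        visits=[("bob", "plaza")],
    )
    for node, neighbors in sorted(g.items()):
        print(node, sorted(neighbors))
```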

[0058] The relation learning unit 19 obtains the learned embedding representation for each node by training a graph neural network that uses the learned user embedding representation and topic embedding representation, respectively, as features of the user and topic nodes in the relation graph.

[0059] Alternatively, the relation learning unit 19 may obtain learned embedding representations for each node in a relation graph that further includes location nodes, by training a graph neural network of the relation graph using location embedding representations as features of the location nodes.

[0060] Specifically, the relation learning unit 19 associates the trained user embedding representation ebu, obtained by machine learning by the language understanding unit 15, and the topic embedding representation ebt, obtained by the embedding representation acquisition unit 17, with each user and topic node in the relation graph as features. In addition, the relation learning unit 19 associates the location embedding representation obtained by the embedding representation acquisition unit 17 with the location node in the relation graph as a feature.

[0061] Then, the relational learning unit 19 trains a graph neural network on the relation graph, using the embedding representations as the features of each node, thereby adjusting the features and weights of each node and obtaining the learned embedding representation for each node.

[0062] The relational learning unit 19 can learn relational graphs using well-known graph neural network learning methods. The learning of relational graphs will be briefly explained with reference to Figure 7. Figure 7 shows an example of a relational graph and an example of extracting positive and negative examples from the relational graph.

[0063] The relation graph gn illustrated in Figure 7 includes nodes n1 to n5, each corresponding to either a user, topic, or location. The relation learning unit 19 randomly samples a node of interest. In the example shown in Figure 7, node n2 is sampled as the node of interest.

[0064] The relation learning unit 19 extracts a positive example graph g1 and a negative example graph g2 from the relation graph gn. The positive example graph g1 includes node n2, which is the node of interest, and nodes n1 and n5, which are connected to node n2 by edges. The negative example graph g2 includes node n2, which is the node of interest, and nodes n3 and n4, which are not connected to node n2 by edges. Note that the negative example graph g2 does not need to include all nodes that are not connected to the node of interest by edges.
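The extraction of the positive and negative example node sets around a node of interest can be sketched as follows. The function name `extract_examples` and its parameters are hypothetical, and the edge between n3 and n4 in the usage example is an assumption, since the text only states which nodes are and are not connected to n2.

```python
import random


def extract_examples(graph, focus=None, num_negatives=None, rng=None):
    """Return (focus, positives, negatives) for one node of interest.

    graph maps each node to the set of nodes it shares an edge with.
    positives are the nodes connected to the focus node; negatives are
    sampled from the nodes not connected to it and need not cover all
    of them.
    """
    rng = rng or random.Random()
    nodes = sorted(graph)
    if focus is None:
        focus = rng.choice(nodes)  # random sampling of the node of interest
    positives = set(graph[focus])
    unconnected = [n for n in nodes if n != focus and n not in positives]
    if num_negatives is None or num_negatives >= len(unconnected):
        negatives = set(unconnected)
    else:
        negatives = set(rng.sample(unconnected, num_negatives))
    return focus, positives, negatives


if __name__ == "__main__":
    # n2 is connected to n1 and n5; the n3-n4 edge is assumed.
    gn = {"n1": {"n2"}, "n2": {"n1", "n5"}, "n3": {"n4"},
          "n4": {"n3"}, "n5": {"n2"}}
    print(extract_examples(gn, focus="n2"))
```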

[0065] The following describes an example of learning a relational graph (gn). However, since the learning process for graph neural networks is a well-known technique, the explanation will be brief.

[0066] First, the learning process using the positive example graph g1 is described. Based on the positive example graph g1, the relational learning unit 19 extracts an adjacency matrix A whose rows and columns correspond to the nodes included in the graph and whose elements represent the edge connections to the node of interest, node n2.

[0067] Furthermore, the relational learning unit 19 extracts a diagonal matrix I (the identity matrix), whose rows and columns correspond to the nodes included in the graph and whose elements represent the self-loops of the nodes. Then, if the real vector representing the features of a node is denoted as the node feature quantity X, the features of each node are expressed by the following expression as the sum (convolution) of the features of related nodes, represented by the adjacency matrix A, and the features of the node itself, represented by the diagonal matrix I:
(A+I)·X

[0068] The relational learning unit 19 multiplies the convolved feature quantities of each node by a weight W and inputs the result into the activation function f to obtain the output H, as shown by the following equation:
H(positive example) = f((A+I)·X·W)
Then, the relational learning unit 19 learns weights and features such that the output H(positive example) obtained based on the positive example graph g1 becomes 1.

[0069] The relation learning unit 19 similarly obtains the output H(negative examples) based on the negative example graph g2. Then, the relation learning unit 19 learns the weights and features so that the output H(negative examples) obtained based on the negative example graph g2 becomes 0.
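The forward computation H = f((A+I)·X·W) can be sketched with plain Python lists. The text does not name the activation function f, so the sigmoid used here is an assumption, chosen because its output lies between 0 and 1 and so fits targets of 1 for positive examples and 0 for negative examples. The helper names and the numeric values are illustrative only.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gcn_layer(adjacency, features, weights):
    """Compute H = f((A + I) . X . W) with f assumed to be the sigmoid."""
    n = len(adjacency)
    # A + I: add a self-loop to every node.
    a_hat = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    aggregated = matmul(a_hat, features)     # (A + I) . X
    projected = matmul(aggregated, weights)  # (A + I) . X . W
    return [[sigmoid(v) for v in row] for row in projected]


if __name__ == "__main__":
    # Positive example graph g1: row 0 is n2, connected to n1 and n5.
    A = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
    X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy 2-dimensional features
    W = [[0.5], [-0.5]]                       # toy weights
    print(gcn_layer(A, X, W))
```

Training would then adjust W and X so that the output approaches 1 for g1 and 0 for g2; that update step is omitted here.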

[0070] Referring again to Figure 1, the embedding representation output unit 20 outputs the embedding representation of each node that has been learned by the relation learning unit 19. Figure 8 is a diagram showing an example of the embedding representation of each entity obtained by learning the graph neural network that constitutes the relation graph. As shown in Figure 8, the embedding representation output unit 20 outputs the embedding representation EB of entities 1, 2, 3, 4, 5, ... corresponding to each node of the relation graph gn, based on the learning gm of the graph neural network targeting the relation graph gn by the relation learning unit 19.

[0071] The resulting embedding representation of each node is a real vector that appropriately reflects the characteristics of the entity corresponding to that node, as well as the relationships between entities, making it possible to calculate the distance between entities. Because the nodes in the relational graph correspond to entities of different types, such as users, topics, and locations, distances can be calculated even between entities of different types.

[0072] The manner in which the embedded expression is output by the embedded expression output unit 20 is not limited and may include storage by a predetermined storage means, transmission to a predetermined device, display on a predetermined display device, etc.

[0073] Referring again to Figure 1, the link prediction unit 21 calculates the distance between nodes based on the learned embedding representation of each node, and calculates link prediction information indicating the likelihood of edges being formed between each node based on the calculated distance between nodes.

[0074] Specifically, the link prediction unit 21 determines, for example, whether the distance between nodes, calculated as the distance between real vectors, is less than or equal to a given threshold. If the link prediction unit 21 determines that the distance between nodes is less than or equal to the threshold, it outputs link prediction information indicating that it predicts the existence of an edge between those nodes.

[0075] Thus, by training the graph neural network gm on the relational graph gn, an embedding representation expressed as a real vector, which can calculate the distance between entities of different types, is obtained. Therefore, link prediction information is calculated that allows for the evaluation of the likelihood of edges being formed between each node in the graph. Consequently, it becomes possible to predict whether there is a relationship of a certain degree or higher between the entities corresponding to each node.

[0076] Furthermore, based on a given threshold for the distance between nodes, the link prediction unit 21 outputs, as link prediction information, information indicating each node whose distance from a given node is less than or equal to that threshold.

[0077] Specifically, the link prediction unit 21 determines, for example, whether the distance between nodes calculated as the distance between real number vectors is less than or equal to a given threshold, and outputs information indicating the entity corresponding to the node whose distance is determined to be less than or equal to the threshold as link prediction information. If at least one of the entities corresponding to the node whose distance is determined to be less than or equal to the threshold is a user, the user may be provided with information indicating the other entity as recommendation information.
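The threshold test described above can be sketched as follows. The text only speaks of the distance between real vectors, so the Euclidean distance is an assumption here; the function names `predict_link` and `entities_within`, the entity labels, and the vectors are hypothetical.

```python
import math


def predict_link(vec_a, vec_b, threshold):
    """Predict an edge when the (assumed Euclidean) distance is within the threshold."""
    return math.dist(vec_a, vec_b) <= threshold


def entities_within(embeddings, source, threshold):
    """List (entity, distance) pairs within the threshold of source, nearest first."""
    src = embeddings[source]
    hits = []
    for name, vec in embeddings.items():
        if name == source:
            continue
        d = math.dist(src, vec)
        if d <= threshold:
            hits.append((name, d))
    return sorted(hits, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Toy learned embeddings for entities of different types.
    eb = {
        "user:alice": [0.0, 0.0],
        "topic:soccer": [0.3, 0.4],
        "location:plaza": [3.0, 4.0],
    }
    print(predict_link(eb["user:alice"], eb["location:plaza"], 1.0))
    print(entities_within(eb, "user:alice", 1.0))
```

When the source entity is a user, the returned entities are the candidates that could be presented as recommendation information.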

[0078] Figure 9 is a flowchart showing the processing details of the embedded representation generation method in the embedded representation generation device 10.

[0079] In step S1, the text acquisition unit 13 acquires speech text, which is text representing the content of the user's utterance, based on the speech log.

[0080] In step S2, the language understanding unit 15 performs machine learning on a language model composed of an encoder-decoder model. The processing details of step S2 will be explained with reference to Figure 10.

[0081] Figure 10 is a flowchart showing the processing steps of the language model's machine learning. In step S21, the language understanding unit 15 inputs a first user utterance text, representing the content of one user's utterance from among the spoken texts, into the embedding unit en.

[0082] In step S22, the language understanding unit 15 obtains the user utterance embedding representation ebs encoded and output by the embedding unit en.

[0083] In step S23, the language understanding unit 15 generates a composite embedding expression "ebl" by combining the user utterance embedding expression and the user embedding expression, which is the embedding expression of that user. The language understanding unit 15 then inputs the composite embedding expression ebl to the decoding unit de.

[0084] In step S24, the language understanding unit 15 obtains the decoded text dt, which has been decoded by the decoding unit de.

[0085] In step S25, the language understanding unit 15 performs machine learning to adjust the language model and user embedding representation so that the error between the decoded text and the second user utterance text, which follows the first user utterance text in the utterance text, is reduced.

[0086] In step S26, the language understanding unit 15 determines whether or not to terminate the machine learning of the language model. If it is determined that the machine learning of the language model should be terminated, the process proceeds to step S27. On the other hand, if it is not determined that the machine learning of the language model should be terminated, the processes in steps S21 to S25 are repeated using the utterance texts (first and second user utterance texts) as training data.

[0087] In step S27, the language understanding unit 15 outputs the trained language model and user embedding representations. The language understanding unit 15 may, for example, store the trained language model in a predetermined storage means. The language understanding unit 15 may also store the trained user embedding representations in a predetermined storage means, or have them managed by the user embedding representation management unit 22.
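The control flow of steps S21 to S27 can be sketched as a loop in which the model components are passed in as callables. This is only a skeleton under assumptions: the function `train_language_model`, its parameters, the average-error stopping rule, and the epoch limit are all hypothetical, and the actual encoder, decoder, error measure, and parameter update are not specified here.

```python
def train_language_model(pairs, encode, combine, decode, loss_fn, update,
                         user_embeddings, max_epochs=10, tolerance=1e-3):
    """Run the S21-S27 loop over (user_id, first_text, second_text) pairs.

    encode, combine, decode, loss_fn and update are caller-supplied
    callables standing in for the embedding unit, the combination step,
    the decoding unit, the error measure and the parameter adjustment.
    """
    for _ in range(max_epochs):
        total_error = 0.0
        for user_id, first_text, second_text in pairs:
            ebs = encode(first_text)                      # S21, S22
            ebl = combine(ebs, user_embeddings[user_id])  # S23
            dt = decode(ebl)                              # S24
            error = loss_fn(dt, second_text)              # S25: compare with next utterance
            update(error, user_id)                        # S25: adjust model and user embedding
            total_error += error
        # S26: assumed stopping rule based on the average error.
        if total_error / len(pairs) < tolerance:
            break
    return user_embeddings                                # S27


if __name__ == "__main__":
    state = {"err": 1.0}
    emb = {"u1": [0.0]}

    def fake_update(error, user_id):
        state["err"] *= 0.5  # pretend each update halves the error

    train_language_model(
        [("u1", "hi", "hello")],
        encode=lambda text: [float(len(text))],
        combine=lambda ebs, ebu: ebs + ebu,  # list concatenation, an assumption
        decode=lambda ebl: "decoded",
        loss_fn=lambda dt, target: state["err"],
        update=fake_update,
        user_embeddings=emb,
        tolerance=0.2,
    )
    print(state["err"])
```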

[0088] Referring again to Figure 9, in step S3, the topic extraction unit 16 extracts topic words from the utterance text, which are words or phrases that represent the topic in the user's utterance.

[0089] In step S4, the embedding expression acquisition unit 17 inputs the topic word into the learned embedding unit en and acquires the topic embedding expression output from the embedding unit en. Here, the embedding expression acquisition unit 17 may further acquire a location embedding expression output from the embedding unit en by inputting location text representing a place into the learned embedding unit en.

[0090] In step S5, the relationship extraction unit 18 generates a relationship graph with at least the user and topic as nodes, based on the user's speech history (speech log) and action history. The relationship extraction unit 18 may also generate a relationship graph that further includes location as a node.

[0091] In step S6, the relation learning unit 19 performs training of a graph neural network, using the trained user embedding representation and topic embedding representation as features of the user and topic nodes in the relation graph. The relation graph used for training may further include locations as nodes, and location embedding representations may be used as features of the location nodes.

[0092] In step S7, the relation learning unit 19 trains a graph neural network on the relation graph, in which the embedding representations are the features of each node, thereby adjusting the features and weights of each node and obtaining the learned embedding representation for each node.

[0093] In step S8, the embedding representation output unit 20 outputs the embedding representation of each node that has been learned by the relation learning unit 19.

[0094] Next, with reference to Figure 11, an embedded expression generation program for causing a computer to function as the embedded expression generation device 10 of this embodiment will be described. Figure 11 is a diagram showing the configuration of the embedded expression generation program. The embedded expression generation program P1 is composed of a main module m10 that comprehensively controls the embedded expression generation process in the embedded expression generation device 10, a speech log acquisition module m11, a speech recognition module m12, a text acquisition module m13, an emotion acquisition module m14, a language understanding module m15, a topic extraction module m16, an embedded expression acquisition module m17, a relationship extraction module m18, a relationship learning module m19, an embedded expression output module m20, and a link prediction module m21. Each of the modules m11 to m21 realizes the respective functions for each of the functional units 11 to 21.

[0095] The embedded expression generation program P1 may be transmitted via a transmission medium such as a communication line, or it may be stored in a recording medium M1, as shown in Figure 11.

[0096] According to the embedded expression generation device 10, the embedded expression generation method, and the embedded expression generation program P1 of this embodiment described above, a language model composed of an encoder-decoder model is trained using pairs of first and second user utterance texts as training data. The first user utterance text is input into the embedding unit, the resulting user utterance embedding expression is combined with the user embedding expression, and the resulting composite embedding expression is input into the decoding unit. The language model and user embedding expression are machine-learned so that the error between the decoded text output from the decoding unit and the second user utterance text is reduced, thereby obtaining an embedding unit (encoder) that outputs a suitable topic embedding expression in response to the input of a topic word, and a user embedding expression that appropriately reflects the characteristics of the user. A relationship graph is generated with the user and topic as nodes, and edges are drawn between the nodes based on the history of the user's utterances and actions. By training a graph neural network that uses the topic embedding expression obtained by inputting a topic word into the embedding unit and the learned user embedding expression as features of the topic word and the user, respectively, learned topic embedding expressions and user embedding expressions that appropriately reflect the characteristics of the topic word and the user are obtained. Since the obtained topic embedding expressions and user embedding expressions reflect the relationships between those entities, it is possible to calculate the distance between the user and the topic.

[0097] The invention described herein can be understood, for example, as follows:

[0098] An embedding representation generation system relating to a first aspect of this disclosure is an embedding representation generation system that generates embedding representations of at least a user and a topic, comprising: a language understanding unit that trains a language model composed of an encoder-decoder model including an embedding unit and a decoding unit, wherein the embedding unit outputs an embedding representation that represents the features of input text and the decoding unit decodes an embedding representation that includes at least the output from the embedding unit, and wherein the language understanding unit obtains a user utterance embedding representation output from the embedding unit by inputting a first user utterance text, which represents the content of an utterance of one user from among utterance texts representing the content of users' utterances, obtains a decoded text output from the decoding unit by inputting a composite embedding representation, which is a combination of the user utterance embedding representation and the user embedding representation of the one user, and performs machine learning to adjust the language model and the user embedding representation so that the error between the decoded text and a second user utterance text that follows the first user utterance text is reduced, the user embedding representation being either an initial user embedding representation before learning or a user embedding representation during the learning process; a topic extraction unit that extracts, from the utterance text, topic words, which are words or phrases representing the topic of the user's utterance; an embedding representation acquisition unit that inputs the topic words into the learned embedding unit and acquires topic embedding representations output from the embedding unit; a relationship extraction unit that generates, based on the user's utterance history and action history, a relationship graph in which at least the user and the topic are nodes, the history of dialogue between users forms edges connecting the users, and the history of the user's utterance of topic words forms edges connecting the user and the topic; a relationship learning unit that obtains learned embedding representations for each node by training a graph neural network that uses the learned user embedding representations and topic embedding representations, respectively, as features of the user and topic nodes in the relationship graph; and an embedding representation output unit that outputs the embedding representations for each node.

[0099] As described above, a language model composed of an encoder-decoder model uses a pair of first and second user utterance texts as training data. The first user utterance text is input to the embedding unit, and the resulting user utterance embedding representation is synthesized with the user embedding representation. This composite embedding representation is then input to the decoding unit. The language model and user embedding representation are machine-learned to minimize the error between the decoded text output from the decoding unit and the second user utterance text. This process results in an embedding unit (encoder) that outputs a suitable topic embedding representation in response to topic word input, as well as a user embedding representation that appropriately reflects the user's characteristics. A relationship graph is generated with the user and topic as nodes, and edges are drawn between the nodes based on the user's utterance and behavior history. By inputting the topic word into the embedding unit, the topic embedding representation obtained and the trained user embedding representation are used as feature quantities for the topic word and user, respectively. This training of a graph neural network yields trained topic embedding representations and user embedding representations that appropriately reflect the characteristics of the topic word and user. Since the relationships between these entities are reflected in the obtained topic embedding representations and user embedding representations, it is possible to calculate the distance between the user and the topic.

[0100] The embedding representation generation system relating to the second aspect is the embedding representation generation system relating to the first aspect, and may further include an emotion acquisition unit that acquires emotion information representing the user's emotion at the time of an utterance, based on the voice of the utterance or the user's facial expression, and associates the acquired emotion information with the utterance text representing the content of that utterance. The language understanding unit may perform machine learning to adjust the language model and user embedding representation using utterance text associated with emotion information representing a predetermined positive emotion.

[0101] Based on the above aspects, speech texts representing utterances made by users when they are likely to be experiencing positive emotions are used in machine learning. Therefore, the combination of first and second user speech texts that constitute the training data is a combination that is likely to be expressed when the user is experiencing positive emotions. By performing machine learning using such training data, an embedding unit and a user embedding expression capable of generating topic embedding expressions that reflect a favorable relationship between the topic word and the user can be obtained.

[0102] In the embedding representation generation system relating to the third aspect, which is the embedding representation generation system relating to the first or second aspect, the embedding representation acquisition unit may further acquire location embedding representations output from the embedding unit by inputting location text representing a place into the learned embedding unit. The relationship extraction unit may generate, based on the user's utterance history and action history, a relationship graph in which at least the user, topic, and place are nodes, the history of dialogue between users forms edges connecting users, the history of the user uttering topic words forms edges connecting the user and the topic, and the history of the user visiting a place forms edges connecting the user and the place. The relationship learning unit may obtain learned embedding representations for each node by training a graph neural network that uses the learned user embedding representations, topic embedding representations, and location embedding representations as features of the user, topic, and place nodes in the relationship graph, respectively.

[0103] Based on the above aspects, by inputting location text into the embedding section of a trained language model, a location embedding representation that appropriately reflects the characteristics of the location can be obtained. Then, a relationship graph is generated with the user, topic, and location as nodes, and edges are drawn between the nodes based on the user's utterance and behavior history. By training a graph neural network that uses the topic embedding representation, location embedding representation, and trained user embedding representation as features of the topic word, location, and user, respectively, trained topic embedding representations, location embedding representations, and user embedding representations that appropriately reflect the characteristics of the topic word, location, and user can be obtained. Since the relationships between these entities are reflected in the obtained topic embedding representations, location embedding representations, and user embedding representations, it is possible to calculate the distance between the user and the topic and location.

[0104] The embedding representation generation system relating to the fourth aspect is the embedding representation generation system relating to any one of the first to third aspects, and may further include a link prediction unit that calculates the distance between nodes based on the learned embedding representation of each node, and calculates link prediction information indicating the possibility of edges being drawn between each node based on the calculated distance between nodes.

[0105] Based on the above aspects, training a graph neural network on a relational graph yields an embedding representation expressed as real vectors, which allows for the calculation of distances between entities of different types. This enables the calculation of link prediction information that allows for the evaluation of the likelihood of edges being formed between each node in the graph. Consequently, it becomes possible to predict whether there is a relationship of a certain degree or greater between the entities corresponding to each node.

[0106] In the embedded representation generation system relating to the fifth aspect, which is the embedded representation generation system relating to the fourth aspect, the link prediction unit may output, as link prediction information, information indicating each node whose distance from a given node is less than or equal to a given threshold for the distance between nodes.

[0107] Based on the aspects described above, it becomes possible to obtain information about entities that have a relationship of a certain degree or higher, based on information indicating nodes whose distance from other nodes is below a given threshold.

[0108] In the embedded representation generation system relating to the sixth aspect, which is the embedded representation generation system relating to any one of the first to fifth aspects, the utterance text may be obtained based on an utterance log of audio or text representing the content of the user's utterance in a predetermined virtual space.

[0109] Based on the above aspects, in a virtual space, it is easy to obtain audio or text representing user utterances, thus facilitating the acquisition of spoken text.

[0110] In the embedded representation generation system relating to the seventh aspect, which is the embedded representation generation system relating to any one of the first to sixth aspects, the relationship extraction unit may generate a relationship graph based on the user's speech history and action history in a predetermined virtual space.

[0111] Based on the above aspects, in a virtual space, it is easy to obtain the user's speech history and action history, making it easy to generate relational graphs.

[0112] Although the present disclosure has been described in detail above, it will be clear to those skilled in the art that the present disclosure is not limited to the embodiments described herein. The present disclosure can be implemented in modified and altered forms without departing from the intent and scope of the present disclosure as defined by the claims. Therefore, the descriptions in the present disclosure are illustrative and not intended to be restrictive in any way.

[0113] Information notification is not limited to the embodiments described herein and may be carried out by other means. For example, information notification may be carried out by physical layer signaling (e.g., DCI (Downlink Control Information), UCI (Uplink Control Information)), upper layer signaling (e.g., RRC (Radio Resource Control) signaling, MAC (Medium Access Control) signaling, broadcast information (MIB (Master Information Block), SIB (System Information Block))), other signals, or combinations thereof. RRC signaling may also be called RRC messages, and may be, for example, RRC Connection Setup messages, RRC Connection Reconfiguration messages, etc.

[0114] Each aspect / embodiment described herein may be applied to systems utilizing LTE (Long Term Evolution), LTE-A (LTE-Advanced), SUPER 3G, IMT-Advanced, 4G, 5G, FRA (Future Radio Access), W-CDMA®, GSM®, CDMA2000, UMB (Ultra Mobile Broadband), IEEE 802.11 (Wi-Fi), IEEE 802.16 (WiMAX), IEEE 802.20, UWB (Ultra-WideBand), Bluetooth®, and other appropriate systems, and / or next-generation systems extended based thereon. Furthermore, multiple systems may be applied in combination (for example, a combination of at least one of LTE and LTE-A with 5G).

[0115] The processing procedures, sequences, flowcharts, etc., of each aspect / embodiment described herein may be reordered, provided they are consistent with each other. For example, the methods described herein present the elements of various steps in an exemplary order and are not limited to that specific order.

[0116] The specific operations described in this disclosure as being performed by a base station may, in some cases, be performed by its upper node. In a network consisting of one or more network nodes having a base station, it is clear that various operations performed for communication with a terminal can be performed by the base station and at least one other network node (for example, an MME or S-GW, but not limited to these). Although the above example illustrates a case where there is one other network node besides the base station, it may also be a combination of multiple other network nodes (for example, an MME and an S-GW).

[0117] Information, etc. (see the "Information, Signals" section) can be output from a higher layer (or lower layer) to a lower layer (or higher layer). Input and output may also occur via multiple network nodes.

[0118] Input and output information may be stored in a specific location (e.g., memory) or managed in a management table. Input and output information may be overwritten, updated, or appended to. Output information may be deleted. Input information may be sent to other devices.

[0119] The determination may be made by a value represented by 1 bit (0 or 1), by a boolean value (true or false), or by a numerical comparison (for example, a comparison with a predetermined value).

[0120] Each aspect / embodiment described herein may be used individually, in combination, or switched between as needed during implementation. Furthermore, notification of specific information (e.g., notification of "being X") is not limited to explicit notification, but may also be implicit (e.g., by not providing notification of that specific information).

[0121] Software should be broadly interpreted to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, executable files, execution threads, procedures, functions, and so on, whether they are called software, firmware, middleware, microcode, hardware description languages, or by any other name.

[0122] Furthermore, software, instructions, etc., may be transmitted and received via a transmission medium. For example, if software is transmitted from a website, server, or other remote source using wired technologies such as coaxial cable, fiber optic cable, twisted pair, and digital subscriber lines (DSL) and / or wireless technologies such as infrared, radio, and microwave, these wired and / or wireless technologies are included in the definition of a transmission medium.

[0123] The information, signals, etc. described in this disclosure may be represented using any of the various different techniques. For example, the data, instructions, commands, information, signals, bits, symbols, chips, etc. that may be referred to throughout the above description may be represented by voltage, current, electromagnetic waves, magnetic fields or magnetic particles, optical fields or photons, or any combination thereof.

[0124] In addition, terms described in this disclosure and / or terms necessary for understanding this specification may be replaced with terms having the same or similar meaning.

[0125] The terms “system” and “network” as used herein are interchangeable.

[0126] Furthermore, the information, parameters, etc., described herein may be expressed as absolute values, as values relative to a given value, or by other corresponding information. For example, wireless resources may be indicated by an index.

[0127] The names used for the parameters described above are not restrictive in any way. Furthermore, the formulas and other expressions using these parameters may differ from those expressly disclosed in this disclosure. Various channels (e.g., PUCCH, PDCCH, etc.) and information elements can be identified by any suitable name, and therefore, the various names assigned to these various channels and information elements are not restrictive in any way.

[0128] As used in this disclosure, the terms "judging" and "determining" may encompass a wide variety of actions. "Judging" and "determining" may include, for example, calculating, computing, processing, deriving, investigating, looking up, searching, inquiring (e.g., searching in a table, database, or other data structure), and ascertaining. "Judging" and "determining" may also include, for example, receiving (e.g., receiving information), transmitting (e.g., sending information), inputting, outputting, and accessing (e.g., accessing data in memory). Furthermore, "judging" and "determining" can include considering something as having been "judged" or "determined" after resolving, selecting, choosing, establishing, comparing, etc. In other words, "judging" and "determining" can include considering something as having been "judged" or "determined" after some action. Also, "judging (determining)" can be reinterpreted as "assuming," "expecting," or "considering."

[0129] As used in this disclosure, the phrase "based on" does not mean "based solely on" unless otherwise specified. In other words, the phrase "based on" means both "based solely on" and "based at least on."

[0130] Where designations such as “first” and “second” are used herein, any reference to those elements does not generally limit their quantity or order. These designations may be used herein as a convenient way to distinguish between two or more elements. Thus, references to first and second elements do not imply that only two elements may be employed, or that the first element must precede the second element in any way.

[0131] To the extent that “include,” “including,” and their variations are used herein or in the claims, these terms are intended to be inclusive, as is the term “comprising.” Furthermore, the term “or” as used herein or in the claims is not intended to be an exclusive OR.

[0132] In this disclosure, where articles such as a, an, and the in English are added through translation, this disclosure may include cases where the noun following such an article is plural.

[0133] In this disclosure, the term "A and B are different" may mean "A and B are different from each other." The term may also mean "A and B are each different from C." Terms such as "separate" and "combine" may be interpreted similarly to "different."

[Explanation of symbols]

[0134] 1...Embedded expression generation system, 10...Embedded expression generation device, 11...Speech log acquisition unit, 12...Speech recognition unit, 13...Text acquisition unit, 14...Emotion acquisition unit, 15...Language understanding unit, 16...Topic extraction unit, 17...Embedded expression acquisition unit, 18...Relationship extraction unit, 19...Relationship learning unit, 20...Embedded expression output unit, 21...Link prediction unit, 22...Expression management unit, de...Decoding unit, en...Embedding unit, gn...Relationship graph, M1...Recording medium, m10...Main module, m11...Speech log acquisition module, m12...Speech recognition module, m13...Text acquisition module, m14...Emotion acquisition module, m15...Language understanding module, m16...Topic extraction module, m17...Embedded expression acquisition module, m18...Relationship extraction module, m19...Relationship learning module, m20...Embedded expression output module, m21...Link prediction module, md...Language model, P1...Embedded expression generation program.

Claims

1. An embedding representation generation system that generates embedding representations of at least users and topics, the system comprising:
a language understanding unit that trains a language model composed of an encoder-decoder model including an embedding unit and a decoding unit, wherein the embedding unit outputs an embedding representation indicating features of input text, the decoding unit decodes an embedding representation that includes at least the output from the embedding unit, and the language understanding unit obtains a user utterance embedding representation output from the embedding unit by inputting into the embedding unit a first user utterance text, which represents the content of an utterance of one user among utterance texts representing the content of user utterances, obtains a decoded text output from the decoding unit by inputting into the decoding unit a composite embedding representation obtained by combining the user utterance embedding representation with a user embedding representation, which is the embedding representation of that one user, and performs machine learning that adjusts the language model and the user embedding representation so as to reduce the error between the decoded text and a second user utterance text that follows the first user utterance text in the utterance texts, the user embedding representation being an initial user embedding representation before learning or a user embedding representation in the course of learning;
a topic extraction unit that extracts, from the utterance texts, topic words, which are words or phrases representing the topics of the user utterances;
an embedding representation acquisition unit that inputs the topic words into the trained embedding unit and acquires topic embedding representations output from the embedding unit;
a relationship extraction unit that generates, based on the users' utterance histories and action histories, a relationship graph in which at least users and topics are nodes, a history of dialogue between users is an edge connecting those users, and a history of a user uttering a topic word is an edge connecting that user and the topic;
a relationship learning unit that obtains a learned embedding representation for each node by training a graph neural network that uses the learned user embedding representations and the topic embedding representations as the features of the user nodes and the topic nodes, respectively, in the relationship graph; and
an embedding representation output unit that outputs the learned embedding representation for each node.
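The relationship graph recited in claim 1 can be illustrated with a minimal sketch. The function name, the input formats (pairs of user names, and pairs of user name and topic word), and the undirected adjacency-set representation are assumptions made for illustration only; the claim does not prescribe a data structure.

```python
from collections import defaultdict


def build_relationship_graph(dialogues, topic_utterances):
    """Build an undirected graph whose nodes are users and topics.

    dialogues: iterable of (user_a, user_b) pairs, one per recorded dialogue
        (hypothetical input format).
    topic_utterances: iterable of (user, topic_word) pairs, one per utterance
        in which the user said the topic word (hypothetical input format).
    """
    adjacency = defaultdict(set)

    # A dialogue history between two users becomes a user-user edge.
    for user_a, user_b in dialogues:
        a, b = ("user", user_a), ("user", user_b)
        adjacency[a].add(b)
        adjacency[b].add(a)

    # A history of a user uttering a topic word becomes a user-topic edge.
    for user, topic in topic_utterances:
        u, t = ("user", user), ("topic", topic)
        adjacency[u].add(t)
        adjacency[t].add(u)

    return dict(adjacency)


graph = build_relationship_graph(
    dialogues=[("alice", "bob")],
    topic_utterances=[("alice", "soccer"), ("bob", "soccer"), ("bob", "jazz")],
)
```

In this sketch each node is tagged with its type so that user and topic features could later be attached to the matching nodes before graph neural network training.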

2. The embedding representation generation system according to claim 1, further comprising an emotion acquisition unit that acquires, based on the audio of a user's utterance or the user's facial expression, emotion information representing the user's emotion at the time of the utterance, and associates the acquired emotion information with the utterance text representing the content of that utterance, wherein the language understanding unit performs the machine learning that adjusts the language model and the user embedding representation using utterance text associated with emotion information representing a predetermined positive emotion.

3. The embedding representation generation system according to claim 1 or 2, wherein the embedding representation acquisition unit further acquires a location embedding representation output from the embedding unit by inputting location text representing a location into the trained embedding unit; the relationship extraction unit generates, based on the users' utterance histories and action histories, the relationship graph in which at least users, topics, and locations are nodes, a history of dialogue between users is an edge connecting those users, a history of a user uttering a topic word is an edge connecting that user and the topic, and a history of a user visiting a location is an edge connecting that user and the location; and the relationship learning unit obtains the learned embedding representation for each node by training a graph neural network that uses the learned user embedding representations, the topic embedding representations, and the location embedding representations as the features of the user nodes, the topic nodes, and the location nodes, respectively, in the relationship graph.

4. The embedding representation generation system according to claim 1, further comprising a link prediction unit that calculates the distance between nodes based on the learned embedding representation of each node, and calculates, based on the calculated distance between nodes, link prediction information indicating the likelihood that an edge will be formed between those nodes.
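As an illustration of the link prediction in claim 4, the sketch below computes a distance between two learned embeddings and maps it to a likelihood score. The claim does not specify a distance metric or a mapping; the Euclidean distance and the score 1 / (1 + d) are assumptions chosen only so that smaller distances give higher scores, and the embedding values are made up.

```python
import math


def node_distance(u, v):
    """Euclidean distance between two embedding vectors (assumed metric)."""
    return math.dist(u, v)


def link_likelihood(distance):
    """Map a distance to a score in (0, 1]; closer nodes score higher.

    The form 1 / (1 + d) is an illustrative assumption, not taken from the claims.
    """
    return 1.0 / (1.0 + distance)


# Hypothetical learned embeddings for one user node and two topic nodes.
embeddings = {"alice": [0.0, 0.0], "soccer": [3.0, 4.0], "jazz": [0.0, 1.0]}

d_soccer = node_distance(embeddings["alice"], embeddings["soccer"])
d_jazz = node_distance(embeddings["alice"], embeddings["jazz"])
score_soccer = link_likelihood(d_soccer)
score_jazz = link_likelihood(d_jazz)
```

Under these assumptions the nearer topic receives the higher score, which is the ordering a link prediction unit would need in order to rank candidate edges.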

5. The embedding representation generation system according to claim 4, wherein the link prediction unit outputs, as the link prediction information and based on a given threshold for the distance between nodes, information indicating, for each node, the other nodes whose distance from that node is less than or equal to the threshold.
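The threshold-based output of claim 5 can be sketched as follows. The function name, the Euclidean metric, the dictionary layout, and the sample values are all illustrative assumptions; only the rule "distance less than or equal to the threshold" comes from the claim.

```python
import math


def nodes_within_threshold(embeddings, threshold):
    """For each node, list the other nodes whose distance is <= threshold.

    embeddings: dict mapping node name -> embedding vector (hypothetical layout).
    Uses Euclidean distance as an assumed metric.
    """
    result = {}
    for name, vec in embeddings.items():
        result[name] = sorted(
            other
            for other, other_vec in embeddings.items()
            if other != name and math.dist(vec, other_vec) <= threshold
        )
    return result


# Hypothetical learned embeddings.
embeddings = {"alice": [0.0, 0.0], "soccer": [3.0, 4.0], "jazz": [0.0, 1.0]}
near = nodes_within_threshold(embeddings, threshold=1.5)
```

This brute-force version compares every pair of nodes, which is adequate for a small example; a large graph would likely call for an approximate nearest-neighbor index instead.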

6. The embedding representation generation system according to claim 1, wherein the utterance text is obtained based on an utterance log of audio or text representing the content of users' utterances in a predetermined virtual space.

7. The embedding representation generation system according to claim 1, wherein the relationship extraction unit generates the relationship graph based on the users' utterance histories and action histories in a predetermined virtual space.