Composite video generation system, composite video generation method, and composite video generation program

The composite video generation system simplifies the creation of personalized avatars by selecting and processing videos, addressing complexity and legal issues, and ensuring compliance through a distributed ledger system.

WO2026140383A1 · PCT designated stage · Publication Date: 2026-07-02 · POCKETRD CO LTD

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: POCKETRD CO LTD
Filing Date: 2025-09-16
Publication Date: 2026-07-02

Application Information

Patent Timeline

16 Sep 2025: Application
02 Jul 2026: Publication (WO2026140383A1)

IPC: G06T13/40; G06T19/00; H04L67/131; H04N23/60

AI Tagging

Technology Topics

Computer graphics (images); Video output

Abstract

A composite video generation system comprising: a background video database 1 storing background videos that are video material to be used as a background video; a content video database 2 storing content videos that are similarly used as video material; a person video input unit 3 for inputting a person video; a background video selection unit 4 for selecting a background video to be used for generating a composite video from among the videos stored in the background video database 1; a content video selection unit 5 for selecting a content video to be used for generating a composite video from among the videos stored in the content video database 2; an insertion target video generation unit 7 for generating an insertion target video composed of the background video and the content video; an avatar generation unit 8 for generating an avatar relating to a person on the basis of the person video; an avatar insertion unit 9 for inserting the generated avatar into the insertion target video to generate a composite video; and a composite video output unit 10 for outputting the composite video.

Description

Composite video generation system, composite video generation method, and composite video generation program

[0001] The present invention relates to a technique for generating a composite video by inserting an avatar generated based on a person video into an insertion target video composed of one or more video materials.

[0002] In recent years, as the processing power of computers has improved, computer graphics of human figures that reflect the characteristics of real people, so-called avatars, have come into wide use. For example, an avatar is used as one's own icon on an SNS (Social Networking Service), or one's own avatar is used as the protagonist character in an online game or the like. In addition, services have been proposed in which some of the characters appearing in content such as still images and moving images are replaced with one's own avatar for viewing and the like.

[0003] By using an avatar that reflects one's own characteristics in this way, for example, when a protagonist consisting of an avatar expressing the characteristics of a user in a game battles with an enemy character, an effect of improving the user's immersion in the game world occurs. By using an avatar that abstractly represents the user himself / herself as an icon indicating the user in SNS, it is expected that an effect such as promoting communication between users in a virtual space in the same sense as the real world will occur.

[0004] Patent Documents 1 and 2 both disclose techniques for using avatars that mimic the actual appearance of the player himself / herself or co-players in a computer game that uses a head-mounted display to represent a virtual space.

[0005] Patent Document 1: Japanese Unexamined Patent Application Publication No. 2019-012509; Patent Document 2: Japanese Unexamined Patent Application Publication No. 2019-139673

[0006] However, since the appearances of real people vary greatly, using an avatar that reflects the characteristics of a real person requires creating an avatar for each individual, and the complexity of avatar generation is therefore a problem. Furthermore, if existing still images or videos are unsuitable as the target video into which one's avatar is inserted, original still images or videos must be prepared. For ordinary users without specialized skills in video generation, however, it is not easy to generate still images or videos into which an avatar can be inserted. Moreover, even a user who has the skills to generate original still images or videos from existing videos may face legal issues, depending on the terms of use of those existing videos. Neither Patent Document 1 nor Patent Document 2 discloses any technology that solves these problems.

[0007] The present invention has been made in view of the above problems, and aims to provide a technology for easily generating a composite video, which is generated by inserting an avatar based on a person's image into a target video generated using multiple video materials.

[0008] To achieve the above objective, the composite video generation system according to claim 1 is a composite video generation system that generates a composite video by inserting an avatar generated based on a predetermined person video into a part of a target video, and is characterized by comprising: background video selection means for selecting a background video which is a video material relating to the background of the target video; content video selection means for selecting a content video which is a video material that is arranged separately from the background video in the target video; content video processing means for performing necessary processing on the content video selected by the content video selection means according to the arrangement in the target video; target video generation means for generating the target video based on the background video selected by the background video selection means and the content video selected by the content video selection means; avatar generation means for generating the avatar based on the person video; avatar insertion means for inserting the avatar generated by the avatar generation means into all or part of the content video in the target video and / or part of the background video in the target video; and composite video output means for outputting a composite video generated by inserting the avatar into the target video by the avatar insertion means.

[0009] Furthermore, in order to achieve the above objective, the composite video generation system according to claim 2 is characterized in that, in the above invention, the background video selection means comprises a first usage mode information generation means that generates first usage mode information, which is information relating to the usage mode of the generated composite video and / or the usage mode of the background video in the composite video, and a first usability determination means that determines whether the usage mode of the background video indicated by the first usage mode information satisfies predetermined usage conditions for the background video. The content video selection means comprises a second usage mode information generation means that generates second usage mode information, which is information relating to the usage mode of the content video in the insertion target video, and a second usability determination means that determines whether the usage mode of the content video indicated by the second usage mode information satisfies predetermined usage conditions for the content video. The insertion target video generation means generates the insertion target video using the background video that the first usability determination means has determined satisfies the usage conditions and the content video that the second usability determination means has determined satisfies the usage conditions.

[0010] Furthermore, in order to achieve the above objective, the composite video generation system according to claim 3 is characterized in that, in the above invention, the background video selection means comprises: a first usage mode information generation means that generates first usage mode information, which is information relating to the usage mode of the composite video including the background video and / or the usage mode of the background video in the composite video; a first candidate video extraction means that extracts one or more candidate background videos from among one or more background videos for which usage conditions that permit use in the usage mode indicated by the first usage mode information are set; and a first video determination means that determines the background video to be used in the composite video from among the one or more candidate background videos extracted by the first candidate video extraction means. The content video selection means comprises: a second usage mode information generation means that generates second usage mode information, which is information relating to the usage mode of the composite video including the content video and / or the usage mode of the content video in the composite video; a second candidate video extraction means that extracts one or more candidate content videos from among one or more content videos for which usage conditions that permit use in the usage mode indicated by the second usage mode information are set; and a second video determination means that determines the content video to be used in the composite video from among the one or more candidate content videos extracted by the second candidate video extraction means. The insertion target video generation means generates the insertion target video using the background video determined by the first video determination means and the content video determined by the second video determination means.
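The two-stage selection in claim 3 (extract the candidates whose usage conditions permit the intended usage mode, then determine one of them) can be illustrated with a minimal Python sketch. The field names (`allowed_purposes`, `commercial_ok`), the identifiers such as `bg-001`, and the "preferred ID, else first candidate" rule are assumptions made for illustration; the patent text does not specify a data format or a determination rule.

```python
def permits(conditions, usage_mode):
    """True if the (hypothetical) usage conditions allow the intended usage mode."""
    return (
        usage_mode["purpose"] in conditions["allowed_purposes"]
        and (conditions["commercial_ok"] or not usage_mode["commercial"])
    )


def extract_candidates(videos, usage_mode):
    """Candidate video extraction: keep videos whose conditions permit the usage mode."""
    return [v for v in videos if permits(v["conditions"], usage_mode)]


def determine_video(candidates, preferred_id=None):
    """Video determination: pick one candidate (assumed rule: preferred ID, else the first)."""
    if not candidates:
        return None
    for video in candidates:
        if video["id"] == preferred_id:
            return video
    return candidates[0]


# Hypothetical background videos with usage conditions already attached.
backgrounds = [
    {"id": "bg-001", "conditions": {"allowed_purposes": {"video_distribution"}, "commercial_ok": False}},
    {"id": "bg-002", "conditions": {"allowed_purposes": {"video_distribution", "print"}, "commercial_ok": True}},
    {"id": "bg-003", "conditions": {"allowed_purposes": {"print"}, "commercial_ok": True}},
]
usage_mode = {"purpose": "video_distribution", "commercial": True}

candidates = extract_candidates(backgrounds, usage_mode)
chosen = determine_video(candidates)
print([v["id"] for v in candidates], chosen["id"])
```

The same two functions would apply unchanged to content videos, which is consistent with the claim describing the first and second means in parallel terms.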

[0011] Furthermore, in order to achieve the above objective, the composite video generation method according to claim 4 is a composite video generation method that generates a composite video by inserting an avatar generated based on a predetermined person video into a part of the video to be inserted, and is characterized by including: a video material selection step of selecting one or more video materials to form the video to be inserted; a video material processing step of performing necessary processing on the video materials selected in the video material selection step according to their arrangement in the video to be inserted; an insertion target video generation step of generating the video to be inserted based on the video materials selected in the video material selection step; an avatar generation step of generating the avatar based on the person video; an avatar insertion step of inserting the avatar generated in the avatar generation step into all or part of the video materials in the video to be inserted; and a composite video output step of outputting the composite video generated by inserting the avatar into the video to be inserted in the avatar insertion step.

[0012] Furthermore, in order to achieve the above objective, the composite video generation method according to claim 5 is characterized in that, in the above invention, the video material selection step further includes a usage method information generation step that generates usage method information, which is information relating to the manner in which the composite video to be generated is used and / or the manner in which the video material is used in the composite video, and a usability determination step that determines whether the manner in which the video material is used, as indicated by the usage method information, satisfies predetermined usage conditions for the video material, and the insertion target video generation step generates the insertion target video using the video material that the usability determination step has determined satisfies the usage conditions.

[0013] Furthermore, in order to achieve the above objective, the composite video generation program according to claim 6 is a composite video generation program that causes a computer to generate a composite video by inserting an avatar generated based on a predetermined person image into a part of the video to be inserted, and is characterized in that it causes the computer to execute: a video material selection function that, when determining one or more video materials to form the video to be inserted, selects video materials that have usage conditions that allow the manner in which the composite video including the video materials is used and / or the manner in which the video materials are used in the composite video; a video material processing function that performs necessary processing on the one or more video materials selected by the video material selection function according to the arrangement in the video to be inserted; a video to be inserted generation function that generates the video to be inserted based on the video materials selected by the video material selection function; an avatar generation function that generates the avatar based on the person image; an avatar insertion function that inserts the avatar generated by the avatar generation function into all or part of the video material in the video to be inserted; and a composite video output function that outputs the composite video generated by inserting the avatar into the video to be inserted by the avatar insertion function.

[0014] According to the present invention, the generation of a composite video, which is created by inserting an avatar based on a person's image into a target video generated using multiple video materials, is made easier.

[0015] Figure 1 is a schematic diagram showing the configuration of the composite video generation system according to Embodiment 1. A further schematic diagram shows the configuration of the composite video generation system according to Embodiment 2.

[0016] The embodiments of the present invention will now be described in detail based on the drawings. The following embodiments describe the most appropriate examples of the present invention, and naturally, the content of the present invention should not be limited to the specific examples shown in these embodiments. It goes without saying that any configuration other than the specific configurations shown in the embodiments that produces similar functions and effects is also included in the technical scope of the present invention. Furthermore, while embodiments 1 and 2 below describe an avatar insertion system as an example, the present invention is not limited to physical devices such as systems, and may be configured by a method that includes the content described below, or by a program that causes a computer to execute the content described below, or by a storage medium in which such a program is stored and readable by the computer.

[0017] (Embodiment 1) First, the composite video generation system according to Embodiment 1 will be described. As shown in Figure 1, the composite video generation system according to Embodiment 1 includes: a background video database 1 that stores background videos, which are video materials used as the background in the insertion target video that constitutes the composite video; a content video database 2 that stores content videos, which are video materials arranged together with the background video in the insertion target video; a person video input unit 3 that inputs the person video on which the avatar to be inserted into the composite video is based; a background video selection unit 4 that selects the background video to be used to generate the composite video from the videos stored in the background video database 1; a content video selection unit 5 that selects the content video to be used to generate the composite video from the videos stored in the content video database 2; a content video processing unit 6 that performs necessary processing, such as scaling, on the selected content video according to its arrangement in the insertion target video; an insertion target video generation unit 7 that generates an insertion target video consisting of the selected and processed background video and content video; an avatar generation unit 8 that generates an avatar relating to the person on the basis of the person video input via the person video input unit 3; an avatar insertion unit 9 that inserts the generated avatar into the insertion target video to generate a composite video; and a composite video output unit 10 that outputs the generated composite video. The system further comprises: a background token generation unit 11 that generates background tokens, which are non-fungible tokens that have a one-to-one correspondence with individual background videos; a content token generation unit 12 that generates content tokens, which are non-fungible tokens that have a one-to-one correspondence with individual content videos; an avatar token generation unit 13 that generates avatar tokens, which are non-fungible tokens that have a one-to-one correspondence with avatars; a video token generation unit 14 that generates video tokens, which are non-fungible tokens that have a one-to-one correspondence with the generated composite videos; a background transaction generation unit 15 that generates background transactions, which are information stored in blocks in a distributed ledger (described later) and have a one-to-one correspondence with individual background tokens; a content transaction generation unit 16 that generates content transactions, which are information stored in blocks in a distributed ledger and have a one-to-one correspondence with individual content tokens; an avatar transaction generation unit 17 that generates avatar transactions, which are information stored in blocks in a distributed ledger and have a one-to-one correspondence with individual avatar tokens; a video transaction generation unit 18 that generates video transactions, which are information stored in blocks in a distributed ledger and have a one-to-one correspondence with individual composite videos; an electronic signature generation unit 19 that generates electronic signatures, which are data proving that the generation and output of the information contained in each transaction conform to the intentions of the avatar token holder; and an output unit 20 that outputs each transaction, together with the electronic signature corresponding to it, to the corresponding distributed ledger.
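As a rough illustration of how units 4 through 10 fit together, the following Python sketch chains the video-side steps in order. It is a toy model under stated assumptions: videos are represented by identifiers rather than pixel data, "processing" is reduced to a scale factor, avatar generation is a placeholder string, and all class and function names are hypothetical rather than taken from the patent.

```python
from dataclasses import dataclass


@dataclass
class Material:
    video_id: str       # identification information (hypothetical format)
    kind: str           # "background" or "content"
    scale: float = 1.0  # set by the content video processing step


@dataclass
class CompositeVideo:
    layers: list        # materials in drawing order, background first
    avatar: str         # placeholder for the generated avatar
    avatar_slot: str    # video_id of the material the avatar was inserted into


def select_materials(catalog, wanted_ids):
    """Selection units 4 and 5: look materials up by identification information."""
    return [catalog[video_id] for video_id in wanted_ids]


def process_materials(materials, layout):
    """Processing unit 6: scale each material according to its arrangement."""
    return [Material(m.video_id, m.kind, layout.get(m.video_id, 1.0)) for m in materials]


def generate_insertion_target(materials):
    """Generation unit 7: order layers so the background is drawn first."""
    return sorted(materials, key=lambda m: m.kind != "background")


def generate_avatar(person_video):
    """Avatar generation unit 8 (placeholder; a real system would run a model here)."""
    return f"avatar({person_video})"


def insert_avatar(layers, avatar, slot_id):
    """Avatar insertion unit 9: attach the avatar to one material in the target."""
    if slot_id not in {m.video_id for m in layers}:
        raise ValueError(f"unknown insertion slot {slot_id!r}")
    return CompositeVideo(layers, avatar, slot_id)


def generate_composite(catalog, wanted_ids, layout, person_video, slot_id):
    materials = process_materials(select_materials(catalog, wanted_ids), layout)
    layers = generate_insertion_target(materials)
    return insert_avatar(layers, generate_avatar(person_video), slot_id)


catalog = {
    "bg-001": Material("bg-001", "background"),
    "ct-010": Material("ct-010", "content"),
}
composite = generate_composite(
    catalog, ["ct-010", "bg-001"], {"ct-010": 0.5}, "face.png", "ct-010"
)
print([m.video_id for m in composite.layers], composite.avatar)  # output unit 10
```

The token, transaction, and signature units (11 through 20) are deliberately left out of this sketch; they operate on the results of these steps rather than inside them.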

[0018] Background video database 1 stores background videos, which are one type of video material used to generate the composite video and serve as the background in the insertion target video that constitutes the composite video. Specifically, background video database 1 has the function of storing data on multiple background videos in such a way that each background video is associated with identification information, which is information for identifying that background video. In Embodiment 1, the correspondence among each background video, its background token, and its background transaction (and, furthermore, the distributed ledger in which the background transaction is stored) is defined using the identification information of the background video. In addition, information regarding the usage conditions and usage history of each background video is generated by the background transaction generation unit 15, which will be described later, and is recorded in the distributed ledger corresponding to that background video.

[0019] The content video database 2 stores content videos, which are one type of video material used to generate the composite video. Content videos are the individual components placed together with the background video in the insertion target video that constitutes the composite video, such as artificial objects (for example, the individual buildings that make up a landscape), natural objects such as rocks and plants, and living things such as animals and people. Specifically, the content video database 2 has the function of storing data on multiple content videos in such a way that each content video is associated with identification information, which is information for identifying that content video. In Embodiment 1, the correspondence among each content video, its content token, and its content transaction (and, furthermore, the distributed ledger in which the content transaction is stored) is defined using the identification information of the content video. In addition, information regarding the usage conditions and usage history of each content video is generated by the content transaction generation unit 16, which will be described later, and is recorded in the distributed ledger corresponding to that content video.
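The linkage described for both databases, in which one piece of identification information ties together a video, its token, and its own ledger, might be modeled as below. This is a minimal in-memory Python sketch: the `Ledger` class is a plain append-only list standing in for a real distributed ledger, and the token string format and record fields are assumptions for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Ledger:
    """Stand-in for a per-video distributed ledger: an append-only list of records."""
    blocks: list = field(default_factory=list)


class MaterialRegistry:
    """Keeps video data, token, and ledger aligned through one identifier."""

    def __init__(self):
        self.videos = {}   # identification information -> video data
        self.tokens = {}   # identification information -> token (one per video)
        self.ledgers = {}  # identification information -> Ledger (one per video)

    def register(self, video_id, data, conditions):
        if video_id in self.videos:
            raise KeyError(f"{video_id!r} is already registered")
        self.videos[video_id] = data
        self.tokens[video_id] = f"token:{video_id}"  # hypothetical token format
        ledger = Ledger()
        ledger.blocks.append({"type": "conditions", "video_id": video_id, "conditions": conditions})
        self.ledgers[video_id] = ledger

    def record_use(self, video_id, user):
        """Append a usage-history record to the ledger that belongs to this video."""
        self.ledgers[video_id].blocks.append({"type": "use", "video_id": video_id, "user": user})


registry = MaterialRegistry()
registry.register("ct-010", b"<video bytes>", {"commercial_ok": True})
registry.record_use("ct-010", "user-42")
print(registry.tokens["ct-010"], len(registry.ledgers["ct-010"].blocks))
```

Because every lookup goes through the same key, a combined database for background and content videos would need no structural change, only a shared identifier space.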

[0020] Both background video and content video function as video materials that constitute the composite video. The former is displayed as a background over the entire composite video, whereas the latter is displayed superimposed on a portion of it, but as video materials that constitute the composite video they are essentially the same. Therefore, in Embodiment 1, the insertion target video before avatar insertion (as described later, the composite video is generated by inserting an avatar into the insertion target video) may be composed of background video only or of content video only. Alternatively, background video and content video may be treated collectively as video materials without distinction, using a single video database that combines background video database 1 and content video database 2 (the same applies to other components, such as background token generation unit 11 and content token generation unit 12). The background video and content video may each be either 2D or 3D video, and either still images or moving images. Furthermore, the content video may be configured in the form of an avatar, as in the case of the person video described later.

[0021] The person video input unit 3 is for inputting person video, which is material used to generate avatars that make up the composite video. Specifically, the person video input unit 3 has the function of inputting person video, which is video of the whole or a part of the person to be inserted into the video to be inserted. It may be configured to input video from an external source, or it may be configured to have an imaging mechanism for acquiring video. The specific configuration of the person video may be a full-body video of the target person, or a video of a part of the person, such as a facial image. It may also be a still image or a video, and may be two-dimensional or three-dimensional. In this embodiment 1, the explanation will be given using a two-dimensional still image of the face of the target person as the person video, but it goes without saying that person video with configurations other than this can also be used in the composite video generation system according to this embodiment 1.

[0022] The background video selection unit 4 is for selecting background videos to be used in generating a composite video. Specifically, the background video selection unit 4 includes: a video identification unit 21 for identifying the background video to be used; a usage mode information generation unit 22 for generating usage mode information, which is information regarding the usage mode of the identified background video in the composite video; a usage condition information acquisition unit 23 for acquiring usage condition information, which is information regarding the content of the usage conditions for the background video identified by the video identification unit 21; a usability determination unit 24 for determining whether the usage mode of the background video included in the usage mode information generated by the usage mode information generation unit 22 satisfies the usage conditions included in the usage condition information acquired by the usage condition information acquisition unit 23; and a video extraction unit 25 for extracting the target background video from the background video database 1 if the usability determination unit 24 determines that the usage conditions are met.

[0023] The video identification unit 21 is for identifying the background video to be selected by the background video selection unit 4. Specifically, the video identification unit 21 has the function of searching the background video database 1 for the background video to be selected and obtaining the identification information of the retrieved background video. The usage condition information acquisition unit 23 uses the identification information obtained by the video identification unit 21 to perform its information acquisition processing and the like.

[0024] The usage mode information generation unit 22 generates, for the background video to be selected, usage mode information (corresponding to the first usage mode information in the claims), which is information regarding how the background video is used in relation to the composite video to be generated; it corresponds to the first usage mode information generation means in the claims. The usage mode information includes, but is not limited to, one or more of the following: the usage mode of the composite video generated using the background video as video material (whether the composite video is made public and the scope of disclosure, the purpose of use (video distribution, output of still images to paper media such as glossy paper or postcards, etc.), whether the use is commercial, the size of the composite video to be used, etc.); the usage mode of the background video in the composite video (the content of any processing performed when it is used in the composite video, its relationship with any content video also used in the composite video, etc.); attribute information of the person who generates and / or uses the composite video (gender, age, nationality, beliefs, identification information, eligibility to use this system (e.g., paid user or unpaid user), etc.); and the costs that can be borne for use of the background video. The usage mode information generation unit 22 may generate the usage mode information from text information entered for each item, or may generate it directly by performing image analysis processing on the specific data of the composite video to be generated. The generated usage mode information is output to the usability determination unit 24, which uses it as material for determining whether the background video can be used.
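One way to picture the text-input route for generating usage mode information is a small record type filled from form fields. The Python sketch below is illustrative only: the field names, the `"1920x1080"` size notation, and the accepted yes/no spellings are assumptions, and the image-analysis route mentioned in the text is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageModeInfo:
    public_scope: str   # e.g. "private", "friends", "public"
    purpose: str        # e.g. "video_distribution", "print"
    commercial: bool    # whether the use is commercial
    output_size: tuple  # (width, height) of the composite video
    processing: tuple   # processing applied to the background video
    user_tier: str      # e.g. "paid" or "unpaid"
    max_fee: int        # cost the user is able to bear


def from_text_input(form):
    """Build usage mode information from text fields (hypothetical field names)."""
    width, height = (int(n) for n in form["output_size"].lower().split("x"))
    return UsageModeInfo(
        public_scope=form["public_scope"].strip(),
        purpose=form["purpose"].strip(),
        commercial=form["commercial"].strip().lower() in {"yes", "true", "1"},
        output_size=(width, height),
        processing=tuple(p.strip() for p in form["processing"].split(",") if p.strip()),
        user_tier=form["user_tier"].strip(),
        max_fee=int(form["max_fee"]),
    )


info = from_text_input({
    "public_scope": "public",
    "purpose": "video_distribution",
    "commercial": "Yes",
    "output_size": "1920x1080",
    "processing": "crop, blur",
    "user_tier": "paid",
    "max_fee": "500",
})
print(info)
```

Making the record immutable (`frozen=True`) is a design choice for the sketch: the usability determination should judge exactly the usage mode that was declared.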

[0025] The usage condition information acquisition unit 23 is for acquiring usage condition information, which is information regarding the usage conditions of the background video identified by the video identification unit 21. Specifically, the usage condition information acquisition unit 23 has the function of accessing the distributed ledger that corresponds one-to-one with the background video and acquiring the usage condition information from the background transactions stored in the blocks of that distributed ledger. The usage condition information includes, but is not limited to, one or more of the following: usage conditions for composite videos that use the background video as video material (the permitted scope of public release of the composite video, the permitted uses of the composite video (video distribution, output of still images to paper media such as glossy paper or postcards, etc.), the extent to which commercial use is permitted, etc.); usage conditions for the background video within the composite video (the extent to which processing is permitted, which content videos may be used together with it in the same composite video, etc.); the range of attributes permitted for a person who generates and / or uses a composite video using the background video (gender, age, nationality, beliefs, identification information, eligibility to use this system (e.g., paid user or unpaid user), etc.); and the usage fee for the background video. The usage condition information acquired by the usage condition information acquisition unit 23 is output to the usability determination unit 24, which uses it as material for determining whether the background video can be used.
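Reading usage conditions out of a per-video ledger could look like the following Python sketch. The ledger here is a plain list of blocks held in memory, the transaction layout is hypothetical, and the rule that the most recently recorded conditions take precedence is an assumption; the text states only that the conditions are obtained from the background transactions stored in the ledger's blocks.

```python
def acquire_usage_conditions(ledgers, video_id):
    """Return the usage conditions recorded for a video, or None if none exist.

    Assumption: when several transactions carry conditions, the newest one applies.
    """
    if video_id not in ledgers:
        raise KeyError(f"no ledger for {video_id!r}")
    for block in reversed(ledgers[video_id]):
        for tx in reversed(block["transactions"]):
            if "usage_conditions" in tx:
                return tx["usage_conditions"]
    return None


# Hypothetical ledgers keyed by the video's identification information.
ledgers = {
    "bg-001": [
        {"height": 0, "transactions": [{"usage_conditions": {"commercial_ok": False, "fee": 100}}]},
        {"height": 1, "transactions": [{"usage_history": {"user": "user-42"}}]},
        {"height": 2, "transactions": [{"usage_conditions": {"commercial_ok": True, "fee": 300}}]},
    ],
    "bg-002": [],
}

conditions = acquire_usage_conditions(ledgers, "bg-001")
print(conditions)
```

Usage-history records are skipped rather than treated as errors, since the text places both conditions and history in the same ledger.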

[0026] The usability determination unit 24 determines whether a background video can be used by comparing the usage mode information and the usage condition information for the background video identified by the video identification unit 21; it corresponds to the first usability determination means in the claims. Specifically, for each item of the usage conditions included in the usage condition information (for example, the usage conditions of the composite video to be generated), the usability determination unit 24 extracts the corresponding item from the usage mode information (for example, the usage mode of the composite video to be generated) and determines whether the content of that usage mode item satisfies the condition specified in the usage condition item. If the content of the usage mode information is within the range permitted by the usage condition information for all items, the usability determination unit 24 determines that the background video can be used; otherwise, it determines that the background video cannot be used.
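The item-by-item comparison can be expressed compactly: walk the items of the usage conditions, look up the matching usage mode item, and accept only if every item passes. In the Python sketch below, the item names and the per-item rules in `CHECKS` are assumptions chosen for illustration, as is the decision to treat a missing or unrecognized item as a failure.

```python
# Per-item rules (hypothetical): each takes (usage mode value, condition value).
CHECKS = {
    "purpose": lambda mode, cond: mode in cond,         # cond: set of permitted purposes
    "public_scope": lambda mode, cond: mode in cond,    # cond: set of permitted scopes
    "commercial": lambda mode, cond: cond or not mode,  # cond: is commercial use permitted?
    "fee": lambda mode, cond: cond <= mode,             # mode: cost the user can bear
}


def determine_usability(usage_mode, usage_conditions):
    """Return (usable, failed_items); usable only if every condition item is satisfied."""
    failed = []
    for item, condition in usage_conditions.items():
        check = CHECKS.get(item)
        if check is None or item not in usage_mode or not check(usage_mode[item], condition):
            failed.append(item)
    return (not failed), failed


usage_conditions = {
    "purpose": {"video_distribution", "print"},
    "public_scope": {"private", "friends"},
    "commercial": False,
    "fee": 300,
}

ok_mode = {"purpose": "print", "public_scope": "friends", "commercial": False, "fee": 500}
bad_mode = {"purpose": "print", "public_scope": "public", "commercial": True, "fee": 500}

print(determine_usability(ok_mode, usage_conditions))
print(determine_usability(bad_mode, usage_conditions))
```

Returning the list of failed items alongside the verdict is an addition for the sketch; it makes it easy to tell a user which condition blocked the selection.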

[0027] The video extraction unit 25 is for extracting the video data of a background video that the usability determination unit 24 has determined to be usable. Specifically, when the usability determination unit 24 determines that a background video is usable, the video extraction unit 25 extracts the background video corresponding to the identification information acquired by the video identification unit 21 from the background video database 1 and outputs it to the insertion target video generation unit 7.

[0028] The content video selection unit 5 is for selecting content videos to be used to generate a composite video. Specifically, the content video selection unit 5 includes: a video identification unit 26 for identifying the content video to be used; a usage mode information generation unit 27 for generating usage mode information, which is information regarding the usage mode of the identified content video in the composite video; a usage condition information acquisition unit 28 for acquiring usage condition information, which is information regarding the content of the usage conditions for the content video identified by the video identification unit 26; a usability determination unit 29 for determining whether the usage mode of the content video included in the usage mode information generated by the usage mode information generation unit 27 satisfies the usage conditions included in the usage condition information acquired by the usage condition information acquisition unit 28; and a video extraction unit 30 for extracting the target content video from the content video database 2 if the usability determination unit 29 determines that the usage conditions are met.

[0029] The video identification unit 26 is for identifying the content video to be selected by the content video selection unit 5. Specifically, the video identification unit 26 has the function of searching the content video database 2 for the content video to be selected and obtaining identification information of the retrieved content video. Using the identification information of the content video obtained by the video identification unit 26, the usage condition information acquisition unit 28 performs information acquisition processing, etc. Note that the function of the video identification unit 26 is basically the same as the function of the video identification unit 21, and both may be formed by a single component to perform processing related to both background video and content video.

[0030] The usage mode information generation unit 27 generates usage mode information (corresponding to the second usage mode information in the claims) for the selected content video, which is information regarding the usage mode of the content video in relation to the composite video to be generated, and corresponds to the second usage mode information generation means in the claims. The usage mode information includes, but is not limited to, one or more of the following: the usage mode of the composite video generated using the content video (whether or not the composite video is made public and the scope of its publication, its use (video distribution, output of still images to paper media such as glossy paper or postcards, etc.), whether or not it is used for commercial purposes, the size of the composite video used, etc.); the usage mode of the content video in the composite video (position and size of the content video in the composite video, its relationship with other content videos and background videos, processing details if the content video is processed, etc.); attribute information of the person who generates and/or uses the composite video (gender, age, nationality, beliefs, identification information, eligibility to use this system (e.g., paid user or unpaid user), etc.); and the costs that can be borne in relation to the use of the content video. The usage mode information generation unit 27 may generate the usage mode information from character information input for each item, or it may acquire the information directly by performing image analysis processing on the specific data of the composite video to be generated. The usage mode information generated by the usage mode information generation unit 27 is output to the usability determination unit 29, which uses it as material for determining whether the content video can be used. The function of the usage mode information generation unit 27 is basically the same as that of the usage mode information generation unit 22, and both may be formed by a single component that generates usage mode information for both background videos and content videos.

[0031] The usage condition information acquisition unit 28 acquires usage condition information, which is information regarding the usage conditions of the content video identified by the video identification unit 26. Specifically, the usage condition information acquisition unit 28 accesses a distributed ledger that corresponds one-to-one with the target content video, and acquires the usage condition information from the content transactions stored in the blocks of the distributed ledger. The usage condition information includes, but is not limited to, one or more of the following: usage conditions for composite videos that use the content video as video material (the permitted scope of public release of the composite video, the permitted uses of the composite video (video distribution, output of still images to paper media such as glossy paper or postcards, etc.), the extent to which commercial use is permitted, etc.); usage conditions for the content video within composite videos (the extent of processing permitted when the content video is processed, which background videos and other content videos may be placed in the same composite video as the content video, etc.); the range of attributes permitted for a person who generates and/or uses a composite video using the content video (gender, age, nationality, beliefs, identification information, eligibility to use this system (e.g., paid user or unpaid user), etc.); and usage fees for the content video. The usage condition information acquired by the usage condition information acquisition unit 28 is output to the usability determination unit 29, which uses it as material for determining whether the content video can be used. The function of the usage condition information acquisition unit 28 is basically the same as that of the usage condition information acquisition unit 23, and both may be formed by a single component that acquires the information by accessing the distributed ledgers corresponding to the background video and the content video, respectively.
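As a rough sketch of the acquisition described in paragraph [0031], the following assumes that each ledger is available as a list of blocks, that each block holds a list of content transactions, and that the most recent transaction carrying usage condition information is the one in force. The dictionary layout and the key name `usage_conditions` are hypothetical; actual access to a distributed ledger would go through that ledger's own client interface.

```python
from typing import Dict, List, Optional


def latest_usage_conditions(ledger: List[dict]) -> Optional[dict]:
    """Scan blocks from newest to oldest and return the newest usage condition information."""
    for block in reversed(ledger):
        for transaction in reversed(block["transactions"]):
            if "usage_conditions" in transaction:
                return transaction["usage_conditions"]
    return None  # no usage conditions have been recorded for this video


# Hypothetical ledgers keyed by content video identification information (one ledger per video).
ledgers: Dict[str, List[dict]] = {
    "content-001": [
        {"transactions": [{"owner": "alice", "usage_conditions": {"commercial_use": True}}]},
        {"transactions": [{"owner": "bob"}]},  # change of owner only
        {"transactions": [{"owner": "bob", "usage_conditions": {"commercial_use": False}}]},
    ],
}

current = latest_usage_conditions(ledgers["content-001"])
print(current)
```

Scanning from the newest block backwards reflects the assumption that an updated usage condition is recorded as a new transaction rather than by overwriting an older one.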

[0032] The usability determination unit 29 determines whether or not a content video can be used by comparing the usage mode information and the usage condition information related to the content video identified by the video identification unit 26, and corresponds to the second usability determination means in the claims. Specifically, for each item included in the usage condition information (for example, the usage conditions of the composite video to be generated), the usability determination unit 29 extracts the corresponding item from the usage mode information (for example, the usage mode of the composite video to be generated) and determines whether or not the content of that usage mode item satisfies the condition specified in the usage condition item. If the usage mode information is within the acceptable range of the usage condition information for all items, the usability determination unit 29 determines that the content video can be used; otherwise, it determines that the content video cannot be used. The function of the usability determination unit 29 is basically the same as that of the usability determination unit 24, and both may be formed by a single component that determines the usability of the background videos and content videos intended for use in a composite video.

[0033] The video extraction unit 30 extracts the video data of a content video that has been determined to be usable by the usability determination unit 29. Specifically, when the usability determination unit 29 determines that a content video is usable, the video extraction unit 30 extracts the content video corresponding to the identification information acquired by the video identification unit 26 from the content video database 2 and outputs it to the insertion target video generation unit 7.

[0034] The content video processing unit 6 is for processing content videos used as video material for generating composite videos, and is necessary when generating the video to be inserted (composite video before avatar insertion). While it is possible to incorporate the content video directly into the video to be inserted (and even the composite video) as stored in the content video database 2, it is usually necessary to process the content video by scaling it up or down, changing its orientation, fine-tuning its shape, or changing its movement, depending on its position within the composite video, its relationship with other content videos, and the arrangement of the content video within the video to be inserted (position, posture, direction, specific shape, etc.), which is determined based on the concept of the composite video. The content video processing unit 6 is for processing content videos in such cases. For example, if the content video is of a specific person, it can scale it up or down to balance it with other people, or change a static, upright image to one with arms outstretched to match the concept of the composite video. Furthermore, the processing performed by the content video processing unit 6 can be carried out using previously known techniques. For example, if the content video is formed by an avatar, processing will be performed to change the posture, etc., by changing the positional relationship of the bones based on the skeletal information. In this embodiment 1, the content video processing unit 6 is based on the premise that it performs processing on the content video, but it may also be provided with a function to perform necessary processing on the background video, or a separate background video processing unit may be newly provided for processing the background video, separate from the content video processing unit 6. 
In addition, in this embodiment 1, the content video processing unit 6 performs processing on the content video after the selection process by the content video selection unit 5 has been completed, that is, on the content video after it has been determined to be usable by the usability determination unit 29, but it may also be configured to perform processing before the determination. After performing processing on the content video before the determination, the actual processing content may be incorporated as part of the usage mode information generated by the usage mode information generation unit 27, and the determination process in the usability determination unit 29 may be performed including the processing content.

[0035] The insertion target video generation unit 7 generates an insertion target video by combining the background video selected by the background video selection unit 4 and the content video selected by the content video selection unit 5. The insertion target video is the video that serves as the basis for the composite video; more specifically, the composite video is the insertion target video to which avatar insertion processing has been applied. The insertion target video generation unit 7 generates the insertion target video by arranging the background video and the content video, which has undergone the necessary processing by the content video processing unit 6, in predetermined positions. The insertion target video generation unit 7 may be configured with image processing functions based on conventional technology, and there are no particular restrictions on the form of the generated insertion target video (moving video or still image, color or black and white, 2D or 3D, etc.), which may be chosen according to the form of the finished video.
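The arrangement of a content video at a predetermined position on a background video can be illustrated for a single frame as follows. This is a simplified sketch under stated assumptions: a frame is a nested list of pixel values, `None` marks a transparent content pixel, and the function name `compose_frame` is hypothetical. A practical configuration would rely on an existing image processing library.

```python
from typing import Any, List

Frame = List[List[Any]]


def compose_frame(background: Frame, content: Frame, top: int, left: int) -> Frame:
    """Paste the content frame onto a copy of the background frame at (top, left)."""
    frame = [row[:] for row in background]  # leave the stored background untouched
    for dy, row in enumerate(content):
        for dx, pixel in enumerate(row):
            y, x = top + dy, left + dx
            inside = 0 <= y < len(frame) and 0 <= x < len(frame[0])
            if pixel is not None and inside:  # skip transparent or out-of-range pixels
                frame[y][x] = pixel
    return frame


background = [["B"] * 4 for _ in range(3)]
content = [["C", None], ["C", "C"]]
insertion_target = compose_frame(background, content, top=1, left=2)
for row in insertion_target:
    print("".join(row))
```

For a moving video, the same operation would be applied frame by frame, with the position taken from the arrangement decided for the composite video.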

[0036] The avatar generation unit 8 generates an avatar to replace all or part of the content video in the insertion target video, based on a person video input via the person video input unit 3. Specifically, the avatar generated by the avatar generation unit 8 may consist of skeletal information (bones), surface information (skin), and weight information defining the relationship between the two, but it is also preferable for it to consist only of surface information, including three-dimensional shape and surface color tone information, and in an even simpler configuration it may consist of the image data itself. Furthermore, the avatar generated by the avatar generation unit 8 may correspond to the full body of a person, but it may also consist of only a part of a person, for example, only a part of the face. In this embodiment 1, an avatar consisting only of the head from the neck up is generated. The avatar generation unit 8 extracts feature points from the person video input via the person video input unit 3, according to the position and shape of the surface features (eyes, eyebrows, nose, mouth, ears, hairstyle, etc.) and internal features (joints, etc.) of the person in the video. By reflecting the positional relationships between the extracted feature points in the avatar, it generates a realistic avatar that reflects the physical features of the model. However, if a simpler configuration is adopted, the avatar may be generated using the person video as is, or with only the minimum necessary modifications, such as adjustments to the size and orientation of the person video.

[0037] The avatar insertion unit 9 generates a composite video by inserting the generated avatar into the insertion target video. Specifically, the avatar insertion unit 9 inserts the avatar generated by the avatar generation unit 8 into at least one of a content video area and a part of the background video area in the insertion target video, thereby completing the composite video. If the insertion target video is a still image, the avatar insertion unit 9 acquires information regarding the size and direction of the replacement target in the insertion target video, changes the size and direction of the avatar accordingly, and then inserts the avatar into the insertion target video so that it replaces the replacement target. If the insertion target video is a moving video, the avatar insertion unit 9 acquires information regarding the motion of the replacement target in the insertion target video, for example by detecting changes in the position of feature points, changes the size, direction, and motion of the avatar accordingly, and then inserts the avatar into the insertion target video so that it replaces the replacement target. While it is generally preferable for the avatar to replace a content video of a person within the insertion target video, the replacement target is not limited to this. For example, if a content video of a locomotive is present, an avatar consisting of a portion of a face image could be inserted on the front of the locomotive. Furthermore, the replacement target is not limited to content videos. For example, a composite video could be generated by replacing a portion of the background video with an avatar.
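The size adjustment in paragraph [0037] can be sketched as fitting the avatar to the region occupied by the replacement target. The representation of that region as an `(x, y, width, height)` box, the aspect-preserving scale, and the centering rule are assumptions made for illustration; the source does not prescribe a particular fitting rule, and direction and motion are not handled here.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # x, y, width, height of the replacement target


def fit_avatar(avatar_size: Tuple[float, float], target: Box) -> Tuple[float, Tuple[float, float]]:
    """Return the scale factor and top-left position that fit the avatar inside the target box."""
    avatar_w, avatar_h = avatar_size
    x, y, w, h = target
    scale = min(w / avatar_w, h / avatar_h)  # keep the aspect ratio (assumption)
    new_w, new_h = avatar_w * scale, avatar_h * scale
    return scale, (x + (w - new_w) / 2, y + (h - new_h) / 2)  # center inside the box


def fit_avatar_per_frame(avatar_size: Tuple[float, float], boxes: List[Box]):
    """For a moving video, repeat the fit for the target box detected in each frame."""
    return [fit_avatar(avatar_size, box) for box in boxes]


scale, position = fit_avatar((100, 200), (10, 20, 50, 50))
print(scale, position)
```

In the moving video case, the per-frame boxes would come from tracking the feature points of the replacement target.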

[0038] The composite video output unit 10 outputs the composite video generated by the avatar insertion unit 9 inserting the avatar into the insertion target video. The composite video output unit 10 may output the generated composite video as video data as is, may output it after converting the data format, or may output it as a photograph printed on glossy paper or as a postcard print.

[0039] The background token generation unit 11 generates background tokens, which are non-fungible tokens that have a one-to-one correspondence with each individual background video. A "non-fungible token" is a so-called NFT (Non-Fungible Token), that is, a token that is not interchangeable with other tokens because it possesses unique data, and is issued based on, for example, the Ethereum® standard ERC721. The non-fungible token in this embodiment 1 is issued based on ERC721 or another predetermined standard, and is configured so that the transaction history of the object, information about the owner, and information such as the establishment, transfer, and modification of rights such as the right to use the object are recorded in a distributed ledger on the blockchain that corresponds to the non-fungible token. The information to be recorded regarding the background token is generated by the background transaction generation unit 15, which will be described later, and is output to the distributed ledger by the output unit 20 along with the electronic signature generated by the electronic signature generation unit 19.

[0040] Blockchain is a technology that synchronizes data between multiple computers constituting a decentralized network using cryptographic techniques. Specifically, each block is composed of a collection of token information, such as agreed-upon transaction records, and information for connecting to other blocks (information from the previous block). A blockchain is formed by linking multiple such blocks. Even if data is tampered with on some of the computers, the correct data is selected by majority vote among the other computers, making it extremely difficult to destroy or tamper with the data. Blockchains can be classified into public blockchains, which do not restrict who can participate in majority voting, meaning that an unspecified number of people can participate in the recording process on the distributed ledger; and consortium blockchains and private blockchains, where only a select number of specific individuals can participate in majority voting, meaning that only predetermined individuals can participate in the recording process on the distributed ledger. The former is a concept included in the open distributed ledger concept in this invention, while the latter is a concept included in the closed distributed ledger concept in this invention.
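The linkage between blocks described in paragraph [0040] can be illustrated with a small hash chain. This sketch covers only the "information from the previous block" part; the majority vote among computers is not modelled. The block layout and the use of SHA-256 over a JSON serialization are assumptions chosen for illustration.

```python
import hashlib
import json
from typing import List


def block_hash(block: dict) -> str:
    """Hash a block deterministically by serializing it with sorted keys."""
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()


def append_block(chain: List[dict], transactions: List[dict]) -> None:
    """Append a block that stores the hash of the previous block."""
    previous = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"prev_hash": previous, "transactions": transactions})


def is_consistent(chain: List[dict]) -> bool:
    """Check that every block still points at the actual hash of its predecessor."""
    return all(chain[i]["prev_hash"] == block_hash(chain[i - 1]) for i in range(1, len(chain)))


chain: List[dict] = []
append_block(chain, [{"owner": "alice"}])
append_block(chain, [{"owner": "bob"}])
append_block(chain, [{"usage_conditions": {"commercial_use": False}}])
print(is_consistent(chain))

chain[0]["transactions"][0]["owner"] = "mallory"  # tamper with an old record
print(is_consistent(chain))
```

Changing an earlier block alters its hash, so the link stored in the following block no longer matches, which is what makes silent modification detectable.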

[0041] Specific methods for linking background tokens and background videos include associating the identifier of the background token with the identification information of the background video. More simply, the identifier of the background token may be made identical to the identification information of the background video. Furthermore, the non-fungible tokens may be generated by the background token generation unit 11 itself, or by an external system directly or indirectly connected to the background token generation unit 11 in response to a predetermined command issued to that external system. In addition, the specific format of the non-fungible token is not limited to one conforming to the Ethereum® standard ERC721, and may be any format as long as it has non-fungible properties and information such as transaction history can be stored in an open distributed ledger or a closed distributed ledger.

[0042] The content token generation unit 12 is for generating content tokens, which are non-fungible tokens that have a one-to-one correspondence with each individual content video. The content tokens generated by the content token generation unit 12 are similar to the background tokens generated by the background token generation unit 11 in that the tokens are generated by so-called NFTs, except that the object on which the correspondence is established is the content video rather than the background video. For this reason, the content token generation unit 12 may be formed integrally with the background token generation unit 11, and both background tokens and content tokens may be generated with a single component. The correspondence between content tokens and content videos is set in a format in which the identifier of the content token and the identification information of the content video are mutually associated.

[0043] The avatar token generation unit 13 is for generating avatar tokens, which are non-fungible tokens that have a one-to-one correspondence with avatars respectively. The avatar tokens generated by the avatar token generation unit 13 are the same as the background tokens generated by the background token generation unit 11 and the content tokens generated by the content token generation unit 12 in terms of points such as the tokens being generated by so-called NFTs, except that the object for which the correspondence is constructed is an avatar instead of the background video or content video. Therefore, the avatar token generation unit 13 may be integrally formed with the background token generation unit 11 and the content token generation unit 12 to generate background tokens, content tokens, and avatar tokens with a single component. Regarding the setting of the correspondence between the avatar token and the avatar, it is preferable to set the identification information for the avatar and then associate the identification information with the identifier of the avatar token in a mutually related form.

[0044] The video token generation unit 14 is for generating video tokens, which are non-fungible tokens that have a one-to-one correspondence with the generated composite video. The video tokens generated by the video token generation unit 14 are the same as other tokens in terms of points such as the tokens being generated by so-called NFTs, except that the object for which the correspondence is constructed is the composite video. Therefore, the video token generation unit 14 may be integrally formed with the background token generation unit 11, etc., to generate each token including the video token with a single component. Regarding the setting of the correspondence between the video token and the composite video, it is preferable to set the identification information for the composite video and then associate the identification information with the identifier of the video token in a mutually related form.

[0045] The background transaction generation unit 15 generates background transactions, which are information stored in the distributed ledger that corresponds to a background token. The background transaction generation unit 15 generates a background transaction that includes information about the holder of the background video, the usage condition information for the corresponding background video, and the usage mode information when the background video is used in a composite video (i.e., when its use is permitted by the usability determination unit 24), and outputs it to the electronic signature generation unit 19 and the output unit 20. The usage condition information is preferably generated based on regulations predetermined by laws and regulations (for example, prohibiting the use of stimulating videos in environments where minors can view them), as well as on usage conditions set by the creator or holder of the background video. Furthermore, when the holder changes, when the content of the usage condition information is updated due to a change in usage conditions, or when usage mode information is generated because a composite video using the background video is newly created, the background transaction generation unit 15 generates a new background transaction that includes this information and outputs it to the electronic signature generation unit 19 and the output unit 20.

[0046] The content transaction generation unit 16 generates content transactions, which are information stored in the distributed ledger that corresponds to a content token. The content transaction generation unit 16 generates a content transaction that includes information about the owner of the content video, the usage condition information for the corresponding content video, and the usage mode information when the content video is used in a composite video (i.e., when its use is permitted by the usability determination unit 29), and outputs it to the electronic signature generation unit 19 and the output unit 20. The usage condition information is preferably generated based on regulations predetermined by laws and regulations (for example, prohibiting the use of stimulating videos in environments where minors can view them), as well as on usage conditions set by the creator or owner of the content video. Furthermore, when the owner changes, when the content of the usage condition information is updated due to a change in usage conditions, or when usage mode information is generated because a new composite video using the content video is created, the content transaction generation unit 16 generates a new content transaction that includes this information and outputs it to the electronic signature generation unit 19 and the output unit 20.

[0047] The avatar transaction generation unit 17 generates avatar transactions, which are information stored in the distributed ledger associated with an avatar token. The avatar transaction generation unit 17 generates an avatar transaction consisting of, in addition to information about the owner of the avatar, the usage condition information of the associated avatar and information about the content video and/or background video that becomes the insertion target when the avatar is used in a composite video, and outputs it to the electronic signature generation unit 19 and the output unit 20. The usage condition information is preferably generated based on the usage conditions set by the avatar owner, in addition to regulations predetermined by laws and regulations. Also, when the owner of the avatar changes, when the content of the usage condition information is updated due to a change in usage conditions, or when a new composite video using the avatar is generated and a new insertion target is set, the avatar transaction generation unit 17 generates a new avatar transaction including this information and outputs it to the electronic signature generation unit 19 and the output unit 20.

[0048] The video transaction generation unit 18 is for generating a video transaction which is information stored in a distributed ledger associated with a video token. The video transaction generation unit 18 generates a video transaction consisting of, in addition to information about the owner of the associated composite video, identification information of the background video, content video, and avatar constituting the composite video and usage mode information about the background video and content video, and has a function of outputting it to the electronic signature generation unit 19 and the output unit 20.

[0049] The electronic signature generation unit 19 generates electronic signatures, which are data proving that the information contained in each of the background transaction, content transaction, avatar transaction, and video transaction is genuine. Specifically, the electronic signature generation unit 19 generates a hash value for each of the background transaction, content transaction, avatar transaction, and video transaction, and then encrypts the hash value with a private key to generate an electronic signature. The "hash value" is a fixed-length value obtained by applying a certain calculation procedure to the original data. Since this calculation procedure is irreversible, recovering the original data from the hash value is considered practically impossible. The "private key" is a sequence of numbers used in the encryption process, and data encrypted with it can be decrypted using the corresponding "public key". Normally, the private key is managed by the holder of each of the background token, content token, avatar token, and video token, so a separate electronic signature generation unit 19 may be provided for each token.

[0050] The output unit 20 outputs the background transactions, content transactions, avatar transactions, and video transactions generated by the background transaction generation unit 15, the content transaction generation unit 16, the avatar transaction generation unit 17, and the video transaction generation unit 18, along with their corresponding electronic signatures, to the distributed ledger associated with each of the background tokens, content tokens, avatar tokens, and video tokens. The output unit 20 functions as one embodiment of the output means in the claims. In this embodiment 1, it is configured to directly output each transaction and its corresponding electronic signature, but it may also be configured merely to instruct other components (including those provided outside the system) to perform the output. In this embodiment 1, the output unit 20 is directly or indirectly connected to the network where the distributed ledger is installed, and is configured to output predetermined data to the distributed ledger. For the data output from the output unit 20, a hash value is generated for each transaction. This hash value is then compared with the data obtained by decrypting the electronic signature corresponding to the transaction using the public key. If the two values differ, the transaction is determined not to be legitimate and its registration in the distributed ledger is rejected. If the two values match, the transaction is determined to be legitimate, and the information constituting the transaction is stored in a block of the distributed ledger. The correspondence between each transaction and the distributed ledger can be established, for example, by linking the identification information of the distributed ledger with the identification information of each transaction. More simply, the same information as the identification information of each transaction may be used as the identification information of the distributed ledger. The identification information of the distributed ledger is used when the output unit 20 outputs information, as well as when the usage condition information acquisition units 23 and 28 acquire information.
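The signing and verification flow of paragraphs [0049] and [0050] can be sketched as follows. To remain self-contained, the sketch uses textbook RSA with two small, publicly known primes and no padding. This is an assumption made purely for illustration and is not secure; a real system would use an established signature scheme from a vetted cryptographic library. The function names and the transaction layout are hypothetical.

```python
import hashlib
import json
from typing import List

# Toy RSA key material (assumption, NOT secure): two Mersenne primes.
P, Q = 2**31 - 1, 2**61 - 1
N = P * Q                           # public modulus
E = 65537                           # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))   # private exponent (requires Python 3.8+)


def transaction_hash(transaction: dict) -> int:
    """Fixed-length hash of the transaction, reduced modulo N for the toy key."""
    digest = hashlib.sha256(json.dumps(transaction, sort_keys=True).encode()).digest()
    return int.from_bytes(digest, "big") % N


def sign(transaction: dict) -> int:
    """Encrypt the hash value with the private key to obtain the electronic signature."""
    return pow(transaction_hash(transaction), D, N)


def register(ledger: List[dict], transaction: dict, signature: int) -> bool:
    """Decrypt the signature with the public key and store the transaction only on a match."""
    if pow(signature, E, N) != transaction_hash(transaction):
        return False  # not legitimate: registration is rejected
    ledger.append(transaction)
    return True


ledger: List[dict] = []
tx = {"owner": "alice", "usage_conditions": {"commercial_use": False}}
signature = sign(tx)
accepted = register(ledger, tx, signature)
rejected = register(ledger, dict(tx, owner="mallory"), signature)
print(accepted, rejected)
```

Only the holder of the private exponent can produce a signature that decrypts to the transaction's hash, so a transaction altered after signing fails the comparison and is not stored.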

[0051] Next, the advantages of the composite video generation system according to this embodiment 1 will be described. First, in generating the insertion target video, which is the basis for the composite video, the composite video generation system according to this embodiment 1 does not create the background video and content video to be used each time, but rather selects them from those stored in a database in advance, making it possible to generate the insertion target video quickly and easily. Furthermore, by adopting a configuration that refers to usage condition information, which is information about the usage conditions set in advance for the video material stored in the database, the composite video generation system according to this embodiment 1 has the advantage that it is possible to determine easily and quickly, before the composite video is generated, whether there are any problems regarding the specific usage mode of the composite video in which the video material is used and the specific usage mode of the video material within the composite video. Moreover, the composite video generation system according to this embodiment 1 has a configuration in which the usage mode information and usage condition information regarding the video material are stored in the corresponding distributed ledgers. This configuration has the advantage that accurate and genuine information, which is not subject to tampering by third parties, can be used for both the usage mode information and the usage condition information.

[0052] (Embodiment 2) Next, a composite video generation system according to Embodiment 2 will be described. In Embodiment 2, components that have the same names and the same reference numerals as those in Embodiment 1 perform the same functions as in Embodiment 1 unless otherwise specified.

[0053] Figure 2 is a schematic diagram showing the configuration of the composite video generation system according to Embodiment 2. As shown in Figure 2, the composite video generation system according to Embodiment 2 newly includes: a video candidate extraction unit 31 that extracts one or more background video candidates for which usage conditions that allow the usage mode are set, based on the content of usage mode information regarding the background video generated by the usage mode information generation unit 22; a video determination unit 32 that determines the background video to be actually used from among the background video candidates extracted by the video candidate extraction unit 31; a video candidate extraction unit 33 that extracts one or more content video candidates for which usage conditions that allow the usage mode are set, based on the content of usage mode information regarding the content video generated by the usage mode information generation unit 27; and a video determination unit 34 that determines the content video to be actually used from among the content video candidates extracted by the video candidate extraction unit 33.

[0054] The video candidate extraction unit 31 (corresponding to the first video candidate extraction means in the claims) is for extracting candidate background videos to be used in the generation of a composite video based on the content of the usage mode information of the background videos during the generation of a composite video. Specifically, the video candidate extraction unit 31 has the function of extracting one or more background videos for which usage conditions are set that permit use based on the usage mode indicated in the presented usage mode information. The video candidate extraction unit 31 has the function of accessing the corresponding distributed ledger for each background video stored in the background video database 1 and confirming the content of the usage condition information stored in the distributed ledger, thereby extracting background videos for which usage conditions are set that permit the usage mode indicated in the usage mode information newly generated by the usage mode information generation unit 22.
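The extraction described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the ledger is modeled as a plain dictionary keyed by video ID, and the fields of `UsageMode` and `UsageConditions` (purpose and region) are assumptions chosen only to make the check concrete.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class UsageMode:
    """Hypothetical usage mode information (fields are assumptions)."""
    purpose: str  # e.g. "commercial" or "personal"
    region: str   # e.g. "JP"


@dataclass
class UsageConditions:
    """Hypothetical usage condition information recorded for one video."""
    allowed_purposes: set[str] = field(default_factory=set)
    allowed_regions: set[str] = field(default_factory=set)

    def permits(self, mode: UsageMode) -> bool:
        return (mode.purpose in self.allowed_purposes
                and mode.region in self.allowed_regions)


def extract_candidates(video_ids, ledger, mode):
    """Return the IDs of videos whose recorded conditions permit `mode`.

    `ledger` is a dict standing in for the per-video distributed ledger
    lookup; videos with no recorded conditions are excluded.
    """
    candidates = []
    for video_id in video_ids:
        conditions = ledger.get(video_id)
        if conditions is not None and conditions.permits(mode):
            candidates.append(video_id)
    return candidates


ledger = {
    "bg-001": UsageConditions({"commercial", "personal"}, {"JP", "US"}),
    "bg-002": UsageConditions({"personal"}, {"JP"}),
    "bg-003": UsageConditions({"commercial"}, {"US"}),
}
mode = UsageMode(purpose="commercial", region="JP")
print(extract_candidates(["bg-001", "bg-002", "bg-003", "bg-004"], ledger, mode))
# ['bg-001']
```

Excluding videos that have no recorded conditions is a conservative design choice made for this sketch; the source does not state how such videos are handled.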

[0055] The video determination unit 32 (corresponding to the first video determination means in the claims) determines which of the one or more background video candidates extracted by the video candidate extraction unit 31 will actually be used as the background video in the composite video. Specifically, the video determination unit 32 may make this determination according to a predetermined algorithm based on the preferences of the person who instructed the generation of the composite video, the person who intends to possess the composite video, or the person who intends to use it, or based on the relationship with other components such as the avatar and the content videos (for example, commonality in shape or color tone). Alternatively, it may be configured to visually display the candidate background videos and allow the person who instructed the generation of the composite video to make a selection.
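One way to picture the two determination paths is the sketch below. The source does not specify the predetermined algorithm, so the scoring here is an assumption: each candidate is reduced to a mean RGB color and the candidate closest in color tone to a reference (for example, the avatar) is chosen. The `choose` callback is a hypothetical stand-in for displaying the candidates and letting the person select one.

```python
import math


def color_distance(a, b):
    """Euclidean distance between two mean RGB colors (0-255 per channel)."""
    return math.dist(a, b)


def determine_video(candidates, reference_color, choose=None):
    """Pick one candidate from `candidates` (dict: video ID -> mean RGB).

    If `choose` is given, it models the manual path: the sorted candidate
    IDs are handed to the callback and its choice is returned. Otherwise
    the candidate whose mean color is closest to `reference_color` wins.
    """
    if not candidates:
        raise ValueError("no candidates to choose from")
    if choose is not None:
        selected = choose(sorted(candidates))
        if selected not in candidates:
            raise ValueError(f"unknown candidate: {selected}")
        return selected
    return min(
        candidates,
        key=lambda vid: color_distance(candidates[vid], reference_color),
    )


candidates = {
    "bg-001": (200, 180, 160),
    "bg-002": (30, 40, 90),
    "bg-003": (120, 130, 140),
}
avatar_color = (35, 50, 100)

print(determine_video(candidates, avatar_color))                             # bg-002
print(determine_video(candidates, avatar_color, choose=lambda ids: ids[0]))  # bg-001
```

The same function could serve the content video side by passing content video candidates and a reference color taken from the background or the avatar.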

[0056] The video candidate extraction unit 33 (corresponding to the second video candidate extraction means in the claims) is for extracting candidate content videos to be used in the generation of a composite video based on the content of the usage mode information of the content videos during the generation of a composite video. Specifically, the video candidate extraction unit 33 has the function of extracting one or more content videos for which usage conditions are set that permit use based on the usage mode indicated in the presented usage mode information. The video candidate extraction unit 33 has the function of accessing the corresponding distributed ledger for each content video stored in the content video database 2 and confirming the content of the usage condition information stored in the distributed ledger, thereby extracting content videos for which usage conditions are set that permit the usage mode indicated in the usage mode information newly generated by the usage mode information generation unit 27.

[0057] The video determination unit 34 (corresponding to the second video determination means in the claims) determines which of the one or more content video candidates extracted by the video candidate extraction unit 33 will actually be used as the content video in the composite video. Specifically, the video determination unit 34 may make this determination according to a predetermined algorithm based on the preferences of the person who instructed the generation of the composite video, the person who intends to possess the composite video, or the person who intends to use it, or based on the relationship with other components such as the avatar, the background video, and other content videos (for example, commonality in shape or color tone). Alternatively, it may be configured to visually display the candidate content videos and allow the person who instructed the generation of the composite video to make a selection.

[0058] Next, the advantages of the composite video generation system according to Embodiment 2 will be described. In addition to the advantages shown in Embodiment 1, the composite video generation system according to Embodiment 2 has the following advantage: when generating the insertion target video that forms the basis of the composite video, even if no specific candidate video materials have been set for the background video and the content video, the system automatically acquires video candidates that allow the usage mode indicated in the usage mode information of the background video and the content video. This eliminates the need to consider the content of the video materials from scratch and makes it easy to obtain options for the video materials to be used. Furthermore, in Embodiment 2, the video candidate extraction units 31 and 33 extract, as video candidates, video materials for which usage conditions that allow the usage mode have been set. This avoids having to redo the selection of video materials because of usage condition problems after the design has been finalized, so the insertion target video can be generated simply and quickly. Note that the video candidate extraction units 31 and 33 may instead extract video materials (background videos and content videos) that have been used in the past in the usage mode in question, rather than video materials for which usage conditions permitting the specific usage mode used in generating the composite video have been set. In this case, a further advantage arises: the specific configuration of a new composite video can be considered while referring to past usage examples.
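The variation noted at the end of the paragraph above, in which the video candidate extraction units 31 and 33 extract video materials that were used in the past in the usage mode in question, can be sketched as a lookup over a usage history. The history format (a list of video ID and usage mode pairs) and the function name are assumptions for illustration; the source does not say how past usage is recorded or queried.

```python
def extract_previously_used(usage_history, mode):
    """Return IDs of video materials already used in `mode`, oldest first.

    `usage_history` is a list of (video_id, usage_mode) pairs standing in
    for the usage mode information recorded for earlier composite videos.
    """
    matches = [video_id for video_id, past_mode in usage_history
               if past_mode == mode]
    # dict.fromkeys drops repeats while keeping first-seen order.
    return list(dict.fromkeys(matches))


usage_history = [
    ("bg-001", "commercial"),
    ("ct-007", "personal"),
    ("bg-003", "commercial"),
    ("bg-001", "commercial"),
]
print(extract_previously_used(usage_history, "commercial"))
# ['bg-001', 'bg-003']
```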

[0059] This invention can be used as a technique for generating a composite video by inserting an avatar, which is generated based on a person video, into an insertion target video composed of one or more video materials.

[0060]
1 Background video database
2 Content video database
3 Person video input unit
4, 35 Background video selection unit
5, 36 Content video selection unit
6 Content video processing unit
7 Insertion target video generation unit
8 Avatar generation unit
9 Avatar insertion unit
10 Composite video output unit
11 Background token generation unit
12 Content token generation unit
13 Avatar token generation unit
14 Video token generation unit
15 Background transaction generation unit
16 Content transaction generation unit
17 Avatar transaction generation unit
18 Video transaction generation unit
19 Electronic signature generation unit
20 Output unit
21, 26 Video identification unit
22, 27 Usage mode information generation unit
23, 28 Usage condition information acquisition unit
24, 29 Usability determination unit
25, 30 Video extraction unit
31, 33 Video candidate extraction unit
32, 34 Video determination unit

Claims

1. A composite video generation system that generates a composite video by inserting an avatar generated based on a predetermined person video into a portion of a target video, comprising: background video selection means for selecting a background video which is video material relating to the background of the target video; content video selection means for selecting a content video which is video material to be placed together with the background video in the target video; content video processing means for performing necessary processing on the content video selected by the content video selection means according to its arrangement in the target video; target video generation means for generating the target video based on the background video selected by the background video selection means and the content video selected by the content video selection means; avatar generation means for generating the avatar based on the person video; avatar insertion means for inserting the avatar generated by the avatar generation means into all or part of the content video in the target video and / or part of the background video in the target video; and composite video output means for outputting a composite video generated by the insertion of the avatar into the target video by the avatar insertion means.

2. The composite video generation system according to claim 1, wherein the background video selection means comprises a first usage mode information generation means for generating first usage mode information which is information relating to the usage mode of the composite video to be generated and / or the usage mode of the background video in the composite video, and a first usability determination means for determining whether the usage mode of the background video indicated by the first usage mode information satisfies predetermined usage conditions for the background video, and the content video selection means comprises a second usage mode information generation means for generating second usage mode information which is information relating to the usage mode of the content video in the target video, and a second usability determination means for determining whether the usage mode of the content video indicated by the second usage mode information satisfies predetermined usage conditions for the content video, and the target video generation means generates the target video using the background video determined by the first usability determination means to satisfy the usage conditions and the content video determined by the second usability determination means to satisfy the usage conditions.

3. The composite video generation system according to claim 1, wherein the background video selection means comprises: a first usage mode information generation means for generating first usage mode information which is information relating to the usage mode of the composite video including the background video and / or the usage mode of the background video in the composite video; a first video candidate extraction means for extracting one or more candidate background videos from among one or more background videos for which usage conditions that allow use in the usage mode indicated by the first usage mode information are set; and a first video determination means for determining the background video to be used in the composite video from among the one or more candidate background videos extracted by the first video candidate extraction means; the content video selection means comprises: a second usage mode information generation means for generating second usage mode information which is information relating to the usage mode of the composite video including the content video and / or the usage mode of the content video in the composite video; a second video candidate extraction means for extracting one or more candidate content videos from among one or more content videos for which usage conditions that allow use in the usage mode indicated by the second usage mode information are set; and a second video determination means for determining the content video to be used in the composite video from among the one or more candidate content videos extracted by the second video candidate extraction means; and the target video generation means generates the target video using the background video determined by the first video determination means and the content video determined by the second video determination means.

4. A composite video generation method for generating a composite video by inserting an avatar generated based on a predetermined person video into a portion of a target video, comprising: a video material selection step of selecting one or more video materials to form the target video; a video material processing step of performing necessary processing on the video materials selected in the video material selection step according to their arrangement in the target video; a target video generation step of generating the target video based on the video materials selected in the video material selection step; an avatar generation step of generating the avatar based on the person video; an avatar insertion step of inserting the avatar generated in the avatar generation step into all or part of the video materials in the target video; and a composite video output step of outputting the composite video generated by inserting the avatar into the target video in the avatar insertion step.

5. The composite video generation method according to claim 4, wherein the video material selection step further includes: a usage mode information generation step of generating usage mode information which is information relating to the manner in which the composite video to be generated is used and / or the manner in which the video material is used in the composite video; and a usability determination step of determining whether the manner in which the video material is used as indicated in the usage mode information satisfies predetermined usage conditions for the video material, and the target video generation step generates the target video using the video material that the usability determination step has determined satisfies the usage conditions.

6. A composite video generation program that causes a computer to generate a composite video by inserting an avatar generated based on a predetermined person video into a portion of a target video, the program being characterized by causing the computer to execute: a video material selection function that, when determining one or more video materials to form the target video, selects video materials for which usage conditions are set that allow the manner in which the composite video including the video materials is used and / or the manner in which the video materials are used in the composite video; a video material processing function that performs necessary processing on the one or more video materials selected by the video material selection function according to their arrangement in the target video; a target video generation function that generates the target video based on the video materials selected by the video material selection function; an avatar generation function that generates the avatar based on the person video; an avatar insertion function that inserts the avatar generated by the avatar generation function into all or part of the video materials in the target video; and a composite video output function that outputs the composite video generated by inserting the avatar into the target video by the avatar insertion function.