Communication method and apparatus

A media function network element matches an audio media stream against multimedia action materials in real time to generate a digital human body-motion video synchronized with the voice. This addresses the insufficient synchronization between the digital human's body motion and its voice, and makes calls more engaging.

WO2026123796A1 · PCT designated stage · Publication Date: 2026-06-18 · HUAWEI TECH CO LTD

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: HUAWEI TECH CO LTD
Filing Date: 2025-08-29
Publication Date: 2026-06-18

AI Technical Summary

Technical Problem

In existing voice-driven digital human services, the synchronization between the digital human's body movements and voice is insufficient, resulting in a poor call experience.

Method used

A media function network element matches the audio media stream against multimedia action materials in real time to generate a digital human body-motion video synchronized with the voice. The multimedia action materials are generated offline and arranged in real time, which reduces processing complexity and performance loss.

Benefits of technology

It improves the interactive experience between digital humans and users, enhances the fun and playability of calls, and reduces the processing complexity and bandwidth requirements of media function network elements.


Figure CN2025117701_18062026_PF_FP_ABST

Abstract

A communication method and apparatus, relating to the field of communications, that can solve the problem of a poor call experience caused by voice currently driving only the mouth movements of an avatar while the limbs perform only simple repetitive actions. In the method, a media function network element may, by means of a first request, request from a first apparatus related information indicating the actions of the full set of multimedia action materials of an avatar. By means of simple feature matching, an audio media stream received in real time may be matched to obtain the indexes of the multimedia action materials strongly related to the audio features, and an action video of the avatar, driven by orchestration of the multimedia action materials, is obtained. The experience of interacting with the avatar during a call is thereby improved, with low processing complexity and low performance loss.

Description

Communication methods and devices

[0001] This application claims priority to Chinese Patent Application No. 202411814303.7, filed on December 10, 2024, entitled "Communication Method and Apparatus", the entire contents of which are incorporated herein by reference.

Technical Field

[0002] This application relates to the field of communications, and more particularly to communication methods and apparatus.

Background Technology

[0003] The new call-to-screen service enhances the user experience by introducing a voice-driven 2D digital human feature. During audio calls, the call audio is used in real-time to drive the user's 2D digital avatar, generating a clear and smooth digital human call video with high lip-sync accuracy, thus increasing the fun and playability of calls and improving the overall call experience.

[0004] Driven by the demand for engaging calls, the new calling services, including personal assistants, calls from virtual characters (such as virtual celebrities), and real-time calls, all involve voice-driven digital humans. Currently, voice drives only the digital human's mouth movements, and the body performs only simple repetitive actions, resulting in a poor call experience.

Summary of the Invention

[0005] This application provides a communication method and apparatus that can match multimedia motion materials of a digital human with voice to drive the digital human's body movements in real time, thereby improving the call experience.

[0006] To achieve the above objectives, this application adopts the following technical solution:

[0007] In a first aspect, a communication method is provided. This method can be executed by a media function network element, or by a component of the media function network element, such as a processor, chip, or chip system of the media function network element, or by a logic module or software capable of implementing all or part of the media function network element. Taking the method applied to a media function network element in a communication network as an example, the method includes: sending a first request to a first device, the first request being used to request information related to the actions of a digital human, the information related to the actions including an index of multimedia action material and action characteristics of the multimedia action material; receiving the information related to the actions from the first device; receiving an audio media stream; and determining, based on the audio media stream and the information related to the actions, an index of multimedia action material used to drive the digital human and an order of multimedia action material used to drive the digital human, wherein the determined order and the determined multimedia action material are used to arrange the action video of the digital human.

[0008] In this method, the media function network element in the communication network can request relevant information about the digital human's actions from the first device. This relevant information includes the index of multimedia action materials and the action features of the multimedia action materials. The media function network element can then match the audio media stream received in real time against the action features in the relevant information (e.g., by semantic or tempo matching), and, based on the action features associated with the audio media stream, determine the index of the multimedia action materials used to drive the digital human and the order of those materials. The digital human's action video can then be arranged based on the index and order of the multimedia action materials. Therefore, in a voice-driven digital human scenario, the media function network element can request from the first device the action-related information that describes the full range of the digital human's multimedia action materials. Through simple feature matching, it can match the audio media stream received in real time to the multimedia action materials strongly correlated with the audio features. This not only improves the experience of interacting with the digital human during a call but also has low processing complexity and minimal performance loss.

[0009] In this embodiment of the application, the index of multimedia action material and the action features of multimedia action material in the relevant information of the action can be referred to as the action description file of multimedia action material. That is, the relevant information of the action includes the action description file, which is used to match multimedia action material.
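The action description file described above can be pictured as a small table of records. The following is a minimal sketch only: the application does not specify a file format, so the class name `ActionDescription` and the fields `category`, `keywords`, and `duration_ms` are hypothetical, chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionDescription:
    """One entry of a hypothetical action description file."""
    index: int            # index of the multimedia action material
    category: str         # assumed label: "semantic", "rhythm", or "idle"
    keywords: tuple = ()  # assumed action features used for matching
    duration_ms: int = 0  # assumed playback length of the material


def build_lookup(entries):
    """Map each index to its description so a matched index resolves quickly."""
    return {entry.index: entry for entry in entries}


# Illustrative contents only; real features would come from the first device.
description_file = [
    ActionDescription(1, "semantic", ("hello", "hi"), 1200),
    ActionDescription(2, "rhythm", ("fast",), 800),
    ActionDescription(3, "idle", (), 500),
]
lookup = build_lookup(description_file)
```

Keeping only the index and features in this file, rather than the materials themselves, is what lets the matching step run without downloading any video.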

[0010] In one possible design, determining the index for multimedia motion material used to drive the digital human based on audio media stream and motion-related information may include: extracting audio features from the audio media stream; and matching the audio features with motion features from motion-related information to determine the index for the multimedia motion material used to drive the digital human.

[0011] In this design, the media function network element can extract the audio features of the audio media stream, determine the action features that match the audio features from the relevant information of the action, and thus obtain the index of the multimedia action material associated with the action features that match the audio features. That is, the index of the multimedia action material used to drive the digital human. This makes it easier to obtain the multimedia action material used to drive the digital human based on the index. Based on the audio matching, the multimedia action material that reflects the audio content can enhance the fun and playability of the digital human service and improve the digital human experience during the call.
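The matching step in this design can be sketched as a set-overlap test. This is an illustration under stated assumptions, not the application's algorithm: audio features are assumed to have been extracted upstream into a set of labels, and the function name `match_indices` and the example features are hypothetical.

```python
def match_indices(audio_features, action_descriptions):
    """Return indices of materials whose action features overlap the audio features.

    audio_features: set of feature labels, assumed to be extracted upstream.
    action_descriptions: list of (index, set_of_action_features) pairs.
    """
    matched = []
    for index, action_features in action_descriptions:
        if audio_features & action_features:  # any shared feature counts as a match
            matched.append(index)
    return matched


# Illustrative description entries (hypothetical features).
descriptions = [
    (10, {"greeting", "wave"}),
    (11, {"farewell", "wave"}),
    (12, {"nod"}),
]
greeting_matches = match_indices({"greeting"}, descriptions)
```

A real matcher would likely score candidates rather than accept any overlap, but the output has the same shape: a list of material indices.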

[0012] In one possible design, multimedia motion assets are generated offline. These assets consist of videos or a set of images that represent real or virtual actions using multimedia. For example, multimedia motion assets are motion videos generated by video generation algorithms, with each video depicting one action. Offline generation allows for the creation of a full range of multimedia motion assets that embody the actions of a digital human, reducing the complexity of driving the digital human. This enables real-time motion choreography during calls, achieving the goal of lightweight voice-driven body movements and associating body movements with the voice content.

[0013] In one possible design scheme, the real or virtual actions represented by multimedia motion materials can include at least one of the following: commonly used body movements and facial expressions in a certain semantic context, body movements and facial expressions matching the rhythm of speech, and unconscious slight body movements and facial expressions. Therefore, offline generation of different types of motion materials can enhance the fun and playability of digital human choreography, as well as improve the digital human experience during calls.

[0014] In one possible design, the communication method may further include: sending a second request to a first device, the second request requesting multimedia motion materials, the second request including an index of multimedia motion materials for driving the digital human and an order of multimedia motion materials for driving the digital human; receiving, from the first device, the multimedia motion materials associated with the index of multimedia motion materials for driving the digital human, these multimedia motion materials being sent in the order of multimedia motion materials for driving the digital human; and driving the digital human based on the received multimedia motion materials.

[0015] In this design, the media function network element determines the index and order of the multimedia action materials used to drive the digital human by matching the audio media stream against the motion features in the motion-related information. It can then request the first device to send the multimedia action materials associated with that index in the determined order. Using the multimedia action materials that the first device sends in sequence, the media function network element can drive the digital human frame by frame in the order received, so that the audio drives both the digital human's action sequence and its lip movements. Downloading only the multimedia action materials identified by the index obtained from matching reduces the bandwidth the media function network element needs for downloading materials, allowing it to support digital human services for more users. Furthermore, because the first device sends the multimedia action materials in the arranged order, the arrangement work of the media function network element is simplified.
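The ordered-delivery exchange can be sketched as follows. The message layout is an assumption: the application does not define field names, so `type`, `indices`, and `order`, as well as both function names, are hypothetical.

```python
def build_second_request(ordered_indices):
    """Build a hypothetical second request carrying indices and their order."""
    return {
        "type": "second_request",
        "indices": sorted(set(ordered_indices)),  # which materials are needed
        "order": list(ordered_indices),           # playback order, repeats allowed
    }


def send_in_order(request, material_store):
    """First-device side: yield (index, material) following the requested order."""
    for index in request["order"]:
        yield index, material_store[index]


# Illustrative store on the first device; the byte strings stand in for video.
store = {3: b"idle-clip", 7: b"wave-clip"}
request = build_second_request([7, 3, 7])
received = list(send_in_order(request, store))
```

Because the first device already follows the requested order, the receiver can consume the materials as they arrive instead of buffering and sorting them.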

[0016] In one possible design, before driving the digital human based on the received multimedia action material, the communication method may further include: determining whether the multimedia action material associated with the index of the received multimedia action material used to drive the digital human is the next multimedia action material to be sent. Thus, before driving the digital human based on the received multimedia action material, the media function network element can also determine whether the multimedia action material sent sequentially by the first device is correct, based on the determined order of the multimedia action material used to drive the digital human, thereby ensuring the correctness of the arrangement.
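The correctness check described here amounts to comparing each arriving index with the next one expected. A minimal sketch, with the hypothetical helper name `verify_sequence`:

```python
def verify_sequence(expected_order, received_indices):
    """Check each received index against the next index expected to be sent.

    Returns the position of the first mismatch, or None if everything
    received so far follows the determined order.
    """
    for position, received in enumerate(received_indices):
        if position >= len(expected_order) or expected_order[position] != received:
            return position
    return None
```

For example, with a determined order of `[7, 3, 7]`, receiving `[7, 3]` passes the check, while receiving `[7, 7]` is flagged at position 1. How a mismatch is handled (re-request, skip, or reorder) is not stated in the text and is left open here.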

[0017] In one possible design, the communication method may further include: sending a second request to a first device, the second request requesting multimedia motion materials, the second request including an index for the multimedia motion materials used to drive the digital human; receiving multimedia motion materials from the first device associated with the index for driving the digital human; and driving the digital human according to a determined order of the multimedia motion materials used to drive the digital human.

[0018] In this design, the media function network element may not require the first device to send the multimedia motion materials used to drive the digital human in sequence. Instead, it can send the index of the determined multimedia motion materials to the first device, and the first device can provide the multimedia motion materials associated with the corresponding index. The media function network element can then arrange the multimedia motion materials in sequence according to the determined order of the multimedia motion materials used to drive the digital human, and generate a motion video of the digital human. The motion video represents the sequence of actions that drive the digital human. This arrangement of the received multimedia motion materials by the media function network element locally according to the determined arrangement order can also ensure the correctness of the arrangement.
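When the first device returns materials in no particular order, the arrangement happens locally. The sketch below assumes each material is a list of frames keyed by its index; the function name `arrange_locally` and the frame labels are hypothetical.

```python
def arrange_locally(received_materials, determined_order):
    """Concatenate frames of received materials following the determined order.

    received_materials: dict mapping index -> list of frames (any arrival order).
    determined_order: list of indices in playback order.
    """
    missing = [i for i in determined_order if i not in received_materials]
    if missing:
        raise KeyError(f"materials not received for indices: {missing}")
    video = []
    for index in determined_order:
        video.extend(received_materials[index])
    return video


# Illustrative materials; strings stand in for decoded frames.
materials = {3: ["idle0"], 7: ["wave0", "wave1"]}
action_video = arrange_locally(materials, [7, 3, 7])
```

Each material is downloaded once even if the order repeats its index, which is one way this variant can save bandwidth compared with ordered delivery.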

[0019] In one possible design, the information related to the action may also include multimedia action materials. Therefore, in some digital human business scenarios, such as a virtual character (e.g., a virtual celebrity) making a call, the media function network element can obtain multimedia action materials from the first device based on a first request.

[0020] In one possible design, the communication method may further include: determining multimedia motion materials associated with an index used to drive the digital human based on the multimedia motion materials. Thus, when the motion-related information also includes multimedia motion materials, the media function network element can, after determining the order and index of the multimedia motion materials used to drive the digital human, obtain the multimedia motion materials for driving the digital human based on the index and the multimedia motion materials, arrange the multimedia motion materials in sequence, and generate a motion video of the digital human.

[0021] In one possible design, the real or virtual actions represented by multimedia action materials include: commonly used body movements and facial expressions in a specific semantic environment, as well as body movements and facial expressions matched with speech rhythm. Audio features include speech content features and rhythm features. Matching audio features with action features in action-related information to determine the index of multimedia action materials used to drive the digital human can include: performing semantically related action feature matching and speech rhythm-related action feature matching based on audio features and action features in action-related information; determining the index of multimedia action materials associated with action features matched with speech content features and the index of multimedia action materials associated with action features matched with rhythm for driving the digital human. Therefore, the audio features extracted by the media function network element can include speech content features and rhythm features. Thus, when performing action feature matching, semantically related action feature matching and audio rhythm-related action feature matching can be performed to obtain the index of semantically related multimedia action materials and the index of rhythm-related multimedia action materials used to drive the digital human.
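The two kinds of matching can be sketched together over timed audio segments. Several things here are assumptions rather than part of the application: segments are assumed to carry a start time plus either a keyword (speech content feature) or a tempo label (rhythm feature), semantic matches are given priority over rhythm matches, and all names are hypothetical.

```python
def match_segments(segments, semantic_table, rhythm_table):
    """Match timed audio segments to material indices, ordered by start time.

    segments: list of dicts with "start_ms" plus either "keyword" (speech
        content feature) or "tempo" (rhythm feature), assumed to come from
        an upstream extractor.
    semantic_table / rhythm_table: dicts mapping a feature to a material index.
    """
    matched = []
    for seg in sorted(segments, key=lambda s: s["start_ms"]):
        if seg.get("keyword") in semantic_table:
            # Assumption: a semantic match takes priority over a rhythm match.
            matched.append((seg["start_ms"], semantic_table[seg["keyword"]]))
        elif seg.get("tempo") in rhythm_table:
            matched.append((seg["start_ms"], rhythm_table[seg["tempo"]]))
    return matched


# Illustrative input; the middle segment has no matching feature.
example_segments = [
    {"start_ms": 900, "tempo": "fast"},
    {"start_ms": 0, "keyword": "hello"},
    {"start_ms": 500, "keyword": "unknown"},
]
ordered_matches = match_segments(example_segments, {"hello": 1}, {"fast": 20})
```

The result pairs each matched start time with a material index, so it yields both the index and the order in one pass. Segments with no match are left out and remain as gaps.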

[0022] In one possible design, the real or virtual actions represented by the multimedia motion material may further include: unconscious slight body movements and facial expressions. Determining the index of the multimedia motion material used to drive the digital human based on the audio media stream and relevant motion information may further include: if the sum of the length of the multimedia motion material associated with motion features matching the speech content features and the length of the multimedia motion material associated with motion features matching the beat features is less than the length of the audio media stream, using the length of the audio portion of the audio media stream from which no speech content features or beat features have been extracted to match unconscious slight body movements and facial expressions from the relevant motion information, thereby determining the index of the multimedia motion material used to drive the digital human to perform unconscious slight body movements and facial expressions.

[0023] In this design scheme, in addition to matching the action features representing semantics and the action features representing audio beats, the media function network element can, for audio parts in the audio media stream where no semantic or beat features have been extracted, match the corresponding unconscious slight limb movements and facial expressions from the relevant action information based on the length of the audio part. This determines the index of multimedia action materials representing unconscious slight limb movements and facial expressions, thus supplementing or filling the audio parts of the audio media stream where no matching multimedia action materials expressing semantics and multimedia action materials expressing beats have been found. This ensures that the arrangement of action videos is coherent and can fully reflect the digital human performing an action sequence of the same length as the audio media stream.
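The gap-filling idea can be sketched as a greedy fill. This is one possible reading, not the application's procedure: matched materials are assumed to be non-overlapping timed intervals, idle materials are chosen longest-first by length, and the function name `fill_gaps` and the indices are hypothetical.

```python
def fill_gaps(audio_len_ms, matched, idle_materials):
    """Fill unmatched audio spans with idle materials (greedy, longest first).

    matched: list of (start_ms, duration_ms, index), assumed non-overlapping.
    idle_materials: list of (index, duration_ms) for slight idle movements.
    Returns the full timeline as a list of (start_ms, index).
    """
    if not idle_materials or min(d for _, d in idle_materials) <= 0:
        raise ValueError("idle materials need positive durations")
    shortest = min(d for _, d in idle_materials)
    timeline = []
    cursor = 0
    # A sentinel at the end of the audio makes the trailing gap get filled too.
    for start, duration, index in sorted(matched) + [(audio_len_ms, 0, None)]:
        gap = start - cursor
        while gap >= shortest:
            # Pick the longest idle clip that still fits in the remaining gap.
            idle_index, idle_len = max(
                (m for m in idle_materials if m[1] <= gap), key=lambda m: m[1]
            )
            timeline.append((start - gap, idle_index))
            gap -= idle_len
        if index is not None:
            timeline.append((start, index))
        cursor = start + duration
    return timeline


# Illustrative input: one matched clip in a 3-second stream, two idle clips.
timeline = fill_gaps(3000, [(1000, 500, 1)], [(90, 400), (91, 1000)])
```

In this sketch a remainder shorter than the shortest idle clip stays unfilled; a real system would need a policy for it, such as trimming or stretching a clip.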

[0024] In one possible design, receiving audio media streams can include: receiving audio media streams from an AI assistant on a first terminal device, or receiving audio media streams from a second terminal device, which communicates with the first terminal device via a communication network. The communication method can also include: sending motion video of the digital human to the first terminal device. Thus, based on different digital human service scenarios, the media function network element can receive audio media streams from different devices.

[0025] In a second aspect, a communication method is provided. This method can be executed by a first device, or by a component of the first device, such as a processor, chip, or chip system of the first device, or by a logic module or software capable of implementing all or part of the first device. Taking the method applied to a first device in a communication network as an example, the method includes: receiving a first request from a media function network element, the first request being for requesting information related to the actions of a digital human, the information related to the actions including an index of multimedia action material and action characteristics of the multimedia action material; and sending the information related to the actions to the media function network element.
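The first-device side of this exchange can be sketched as a request handler. The message and library layouts are assumptions: the keys `type`, `descriptions`, `features`, and `materials`, and the flag `include_materials` (modelling the design in which the action-related information also carries the materials), are hypothetical.

```python
def handle_first_request(request, library, include_materials=False):
    """First-device side: answer a first request with action-related information.

    library: dict mapping index -> {"features": [...], "material": bytes}.
    include_materials: assumed flag for the design in which the materials
        themselves are returned together with the descriptions.
    """
    if request.get("type") != "first_request":
        raise ValueError("unexpected request type")
    info = {
        "descriptions": [
            {"index": index, "features": list(entry["features"])}
            for index, entry in sorted(library.items(), key=lambda kv: kv[0])
        ]
    }
    if include_materials:
        info["materials"] = {i: e["material"] for i, e in library.items()}
    return info


# Illustrative library; the byte strings stand in for stored video.
library = {
    2: {"features": ["wave"], "material": b"wave-clip"},
    1: {"features": ["nod"], "material": b"nod-clip"},
}
info = handle_first_request({"type": "first_request"}, library)
```

Returning descriptions without materials by default reflects the bandwidth argument made for the first aspect: the media function network element only downloads what the matching step selects.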

[0026] In one possible design, multimedia motion assets are generated offline. Multimedia motion assets are videos or a set of images that use multimedia to represent real or virtual actions.

[0027] In one possible design scheme, the real or virtual actions represented by multimedia action materials include at least one of the following: body movements and expressions commonly used in a certain semantic environment, body movements and expressions matching the rhythm of speech, and unconscious slight body movements and expressions.

[0028] In one possible design, the communication method may further include: receiving a second request from a media function network element, the second request requesting multimedia motion materials, the second request including an index of multimedia motion materials for driving the digital human and an order of multimedia motion materials for driving the digital human; and sending, to the media function network element, the multimedia motion materials associated with the index of multimedia motion materials for driving the digital human, these multimedia motion materials being sent in the order of multimedia motion materials for driving the digital human.

[0029] In one possible design, the communication method may further include: receiving a second request from a media function network element, the second request requesting multimedia motion material, the second request including an index for the multimedia motion material used to drive the digital human; and sending the multimedia motion material associated with the index for driving the digital human to the media function network element.

[0030] In one possible design scheme, the relevant information about the action may also include multimedia action materials.

[0031] The technical effects of the method described in the second aspect can be found in the relevant description of the technical effects of the method described in the first aspect above, and will not be repeated here.

[0032] In a third aspect, a communication device is provided for implementing the various methods described above. This communication device can be a media function network element as described in the first aspect, or a device containing the aforementioned media function network element, or a device included in the aforementioned media function network element, such as a chip. The communication device includes corresponding modules, units, or means for implementing the methods described in the first aspect. These modules, units, or means can be implemented in hardware, software, or by hardware executing corresponding software. The hardware or software includes one or more modules or units corresponding to the aforementioned functions.

[0033] In some possible designs, the communication device includes a processing module and a transceiver module. The transceiver module is used to send a first request to a first device, the first request being for obtaining information related to the digital human's movements, including an index of multimedia motion materials and the motion characteristics of the multimedia motion materials. The transceiver module is also used to receive movement-related information from the first device. The transceiver module is also used to receive an audio media stream. The processing module is used to determine, based on the audio media stream and the movement-related information, the index of the multimedia motion materials used to drive the digital human and the order of the multimedia motion materials used to drive the digital human, wherein the determined order and the determined multimedia motion materials are used to orchestrate the digital human's movement video.

[0034] In one possible design, a processing module is used to determine the index of multimedia motion material for driving the digital human based on the audio media stream and motion-related information. Specifically, the processing module is used to extract audio features from the audio media stream and match the audio features with motion features in the motion-related information to determine the index of multimedia motion material for driving the digital human.

[0035] In one possible design, multimedia motion assets are generated offline. Multimedia motion assets are videos or a set of images that use multimedia to represent real or virtual actions.

[0036] In one possible design scheme, the real or virtual actions represented by multimedia action materials may include at least one of the following: body movements and expressions commonly used in a certain semantic environment, body movements and expressions matching the rhythm of speech, and unconscious slight body movements and expressions.

[0037] In one possible design, the transceiver module is further configured to send a second request to the first device. The second request requests multimedia motion materials, including an index for the multimedia motion materials used to drive the digital human and a sequence of these materials. The transceiver module is also configured to receive multimedia motion materials associated with the index for driving the digital human from the first device, wherein the multimedia motion materials associated with the index are sent in the sequence required to drive the digital human. The processing module is further configured to drive the digital human based on the received multimedia motion materials.

[0038] In one possible design, before driving the digital human based on the received multimedia motion material, the processing module is also used to determine whether the multimedia motion material associated with the index of the received multimedia motion material used to drive the digital human is the next multimedia motion material to be sent.

[0039] In one possible design, the transceiver module is further configured to send a second request to the first device, the second request requesting multimedia motion materials, the second request including an index for the multimedia motion materials used to drive the digital human. The transceiver module is also configured to receive multimedia motion materials associated with the index for driving the digital human from the first device. The processing module is further configured to drive the digital human according to a determined order of the multimedia motion materials used to drive the digital human.

[0040] In one possible design scheme, the relevant information about the action may also include multimedia action materials.

[0041] In one possible design, the processing module is further configured to determine, based on the multimedia motion material, the multimedia motion material associated with an index used to drive the digital human.

[0042] In one possible design scheme, the real or virtual actions represented by the multimedia action materials include: commonly used body movements and facial expressions in a certain semantic environment, as well as body movements and facial expressions matched with speech rhythm. Audio features include speech content features and rhythm features. To match the audio features with the action features in the relevant information of the action and determine the index of the multimedia action materials used to drive the digital human, the processing module is specifically used to perform semantically related action feature matching and speech rhythm-related action feature matching based on the audio features and the action features in the relevant information of the action, so as to determine the index of the multimedia action materials associated with the action features matched with the speech content features and the index of the multimedia action materials associated with the action features matched with the rhythm features, both used to drive the digital human.

[0043] In one possible design, the real or virtual actions represented by the multimedia motion material may further include: unconscious slight body movements and facial expressions. In determining the index of the multimedia motion material used to drive the digital human based on the audio media stream and relevant motion information, the processing module is further used to: if the sum of the length of the multimedia motion material associated with the motion features matching the speech content features and the length of the multimedia motion material associated with the motion features matching the beat features is less than the length of the audio media stream, use the length of the audio portion of the audio media stream from which no speech content features or beat features have been extracted to match unconscious slight body movements and facial expressions from the relevant motion information, thereby determining the index of the multimedia motion material used to drive the digital human to perform unconscious slight body movements and facial expressions.

[0044] In one possible design, to receive audio media streams, the transceiver module is specifically configured to receive audio media streams from an AI assistant on a first terminal device, or to receive audio media streams from a second terminal device, which communicates with the first terminal device via a communication network. The transceiver module is also configured to send motion video of the digital human to the first terminal device.

[0045] In one possible design, the transceiver module may include a receiving module and a sending module. The sending module implements the sending function of the communication device described in the third aspect, and the receiving module implements the receiving function of the communication device described in the third aspect.

[0046] In one possible design, the communication device described in the third aspect may further include a storage module storing programs or instructions. When the processing module executes the program or instructions, the communication device described in the third aspect can perform the method described in the first aspect.

[0047] In a fourth aspect, a communication device is provided for implementing the various methods described above. This communication device may be the first device described in the second aspect, or a device comprising the first device, or a device included in the first device, such as a chip. The communication device includes corresponding modules, units, or means for implementing the methods described in the second aspect. These modules, units, or means may be implemented in hardware, software, or by hardware executing corresponding software. The hardware or software includes one or more modules or units corresponding to the functions described above.

[0048] In some possible designs, the communication device includes a processing module and a transceiver module. The transceiver module receives a first request from the media function network element, requesting information related to the digital human's actions, including an index of multimedia motion material and motion characteristics of the multimedia motion material. The transceiver module also sends the action-related information to the media function network element. The processing module implements the processing functions of the communication device described in the fourth aspect, such as determining the action-related information.

[0049] In one possible design, multimedia motion assets are generated offline. Multimedia motion assets are videos or a set of images that use multimedia to represent real or virtual actions.

[0050] In one possible design scheme, the real or virtual actions represented by multimedia action materials include at least one of the following: body movements and expressions commonly used in a certain semantic environment, body movements and expressions matching the rhythm of speech, and unconscious slight body movements and expressions.

[0051] In one possible design, the transceiver module is further configured to receive a second request from the media function network element. The second request requests multimedia motion materials, including an index for the multimedia motion materials used to drive the digital human and a sequence of these materials. The transceiver module is also configured to send the multimedia motion materials associated with the index for driving the digital human to the media function network element, wherein the multimedia motion materials associated with the index are sent in the sequence required to drive the digital human.

[0052] In one possible design, the transceiver module is further configured to receive a second request from the media function network element, the second request requesting multimedia motion material, the second request including an index for the multimedia motion material used to drive the digital human. The transceiver module is also configured to send the multimedia motion material associated with the index for driving the digital human to the media function network element.
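The two second-request designs above can be illustrated with a minimal sketch. The message shape, the field names `indexes` and `sequence`, and the function `handle_second_request` are hypothetical and do not come from this application; the sketch only assumes that materials are stored by index and that the sequence, when present, gives the playback order as positions into the index list.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SecondRequest:
    # Hypothetical message shape; field names are illustrative assumptions.
    indexes: List[str]                    # indexes of the materials used to drive the digital human
    sequence: Optional[List[int]] = None  # optional playback order, as positions into `indexes`


def handle_second_request(store: Dict[str, bytes], req: SecondRequest) -> List[bytes]:
    """Return the requested materials, in the requested sequence when one is given."""
    if req.sequence is not None:
        order = req.sequence                  # design with index and sequence
    else:
        order = range(len(req.indexes))       # design with index only: keep request order
    return [store[req.indexes[i]] for i in order]
```

With a sequence, the first device returns the materials already ordered for playback; without one, the ordering is left to the media function network element.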

[0053] In one possible design scheme, the relevant information about the action may also include multimedia action materials.

[0054] In one possible design, the transceiver module may include a receiving module and a sending module. The sending module implements the sending function of the communication device described in the fourth aspect, and the receiving module implements the receiving function of the communication device described in the fourth aspect.

[0055] In one possible design, the communication device described in the fourth aspect may further include a storage module storing programs or instructions. When the processing module executes the program or instructions, the communication device described in the fourth aspect can perform the method described in the second aspect.

[0056] Fifthly, a communication device is provided (e.g., the communication device may be a chip or a chip system). The communication device includes a processor for implementing the functions involved in any of the preceding aspects.

[0057] In one possible design, the communication device may further include a memory for storing necessary program instructions and data. A processor is coupled to the memory and is used to execute the computer program or instructions stored in the memory, causing the communication device to perform the method described in either the first or second aspect.

[0058] In one possible design, the communication device described in the fifth aspect may further include a transceiver. This transceiver may be a transceiver circuit or an interface circuit. The transceiver can be used for communication between the communication device described in the fifth aspect and other communication devices.

[0059] In one possible design, the processor can be integrated with the memory.

[0060] In some possible designs, when the device is a chip system, it can be composed of chips or contain chips and other discrete components.

[0061] A sixth aspect provides a communication device including a processor and an interface circuit. The interface circuit is configured to receive signals from other communication devices outside the communication device and transmit them to the processor, or to send signals from the processor to other communication devices outside the communication device. The processor is configured to implement the method as described in any possible implementation of the first or second aspect through logic circuits or execution code instructions.

[0062] It is understood that when the communication device provided by either the fifth or sixth aspect is a chip, the aforementioned sending action / function can be understood as an output, and the aforementioned receiving action / function can be understood as an input.

[0063] In a seventh aspect, a computer-readable storage medium is provided, which stores a computer program or instructions that, when executed on a communication device, enable the communication device to perform the method described in any one of the first to third aspects.

[0064] Eighthly, a computer program product including instructions is provided, comprising computer program code, which, when executed on a communication device, enables the communication device to perform the method described in any one of the first to third aspects.

[0065] A ninth aspect provides a communication system, comprising: a media function network element for implementing the method described in the first aspect and a first apparatus for implementing the method described in the second aspect.

[0066] In a tenth aspect, a communication chip is provided, wherein instructions are stored that, when the chip is operated on a communication device, cause the method described in either the first or second aspect above to be implemented.

Brief Description of the Drawings

[0067] Figure 1 is a schematic diagram of the architecture of an IMS network applicable to an embodiment of this application;

[0068] Figure 2 is a schematic diagram of the architecture of a communication system provided in an embodiment of this application;

[0069] Figure 3 is a flowchart illustrating a communication method provided in an embodiment of this application;

[0070] Figure 4 is a flowchart illustrating an exemplary communication method provided in an embodiment of this application;

[0071] Figure 5 is a schematic diagram of the implementation process of MF using hierarchical matching to determine the frame order of motion video for driving digital humans according to an embodiment of this application;

[0072] Figure 6 is an example diagram of action matching provided in an embodiment of this application;

[0073] Figure 7 is a flowchart illustrating another exemplary communication method provided in an embodiment of this application;

[0074] Figure 8 is a schematic diagram of another implementation process of MF using hierarchical matching to determine the frame order of motion video for driving digital humans provided in an embodiment of this application;

[0075] Figure 9 is an example diagram of another action matching provided in an embodiment of this application;

[0076] Figure 10 is a schematic diagram of the structure of a communication device provided in an embodiment of this application;

[0077] Figure 11 is a schematic diagram of another communication device provided in an embodiment of this application.

Detailed Description

[0078] To better understand the embodiments of this application, the following points are explained before introducing the embodiments of this application.

[0079] First, in the embodiments of this application, the terms "first," "second," and various numerical designations are merely for descriptive convenience and are not intended to limit the scope of the embodiments of this application. For example, "first device" and "second device" are only used to distinguish different devices and do not limit their order. Those skilled in the art will understand that the terms "first," "second," etc., do not limit the quantity or execution order, and that "first," "second," etc., are not necessarily different.

[0080] Second, in the embodiments of this application, descriptions such as "when," "in the case of," and "if" all refer to the device performing corresponding processing under certain objective circumstances. They do not limit timing, do not require the device to perform a judgment action when implementing them, and do not imply any other limitations.

[0081] Third, in the embodiments of this application, the words "exemplary" or "for example" are used to indicate that they are examples, illustrations, or descriptions. Any embodiment or design that is described as "exemplary" or "for example" in the embodiments of this application should not be construed as being more preferred or advantageous than other embodiments or design options. Specifically, the use of words such as "exemplary" or "for example" is intended to present the relevant concepts in a specific manner to facilitate understanding.

[0082] Fourth, in this application, "sending information" can be understood as one device sending information to another device, or it can also be understood as one logic module within a device sending information to another logic module. For example, "the first device sending information" can be understood as the first device sending information to another device (such as a media function network element), or it can be understood as logic module 1 in the first device sending information to logic module 2 in the first device.

[0083] In this application, "receiving information" can be understood as one device receiving information from another device, or it can also be understood as a logic module within a device receiving information from another logic module. For example, "the first device receiving information" can be understood as the first device receiving information from another device (such as a media function network element), or it can be understood as logic module 1 in the first device receiving information from logic module 2 in the first device.

[0084] Fifth, the phrase "sending information to... (e.g., the first device)" in this application, or the relevant illustrations in the accompanying drawings, can be understood as the destination of the information being the first device. This can include sending information directly or indirectly to the first device. Similarly, "receiving information from... (e.g., the first device)" or "receiving information sent (e.g., by the first device)," or the relevant illustrations in the accompanying drawings, can be understood as the source of the information being the first device. This can include receiving information directly or indirectly from the first device. Information may undergo necessary processing between the source and destination, such as format changes, but the destination can understand the valid information from the source. Similar expressions in this application can be interpreted similarly, and will not be elaborated further here.

[0085] This application will present various aspects, embodiments, or features relating to a system that may include multiple devices, components, modules, etc. It should be understood and appreciated that individual systems may include additional devices, components, modules, etc., and / or may not include all the devices, components, modules, etc. discussed in conjunction with the accompanying drawings. Furthermore, combinations of these approaches may also be used.

[0086] The technical solutions of the embodiments of this application can be applied to various communication systems, such as Internet Protocol (IP) Multimedia Subsystem (IMS) networks, Wireless Fidelity (Wi-Fi) systems, Vehicle-to-Everything (V2X) communication systems, Device-to-Device (D2D) communication systems, Worldwide Interoperability for Microwave Access (WiMAX) communication systems, 4th generation (4G) mobile communication systems such as Long Term Evolution (LTE) systems, 5th generation (5G) mobile communication systems such as New Radio (NR) systems, and future communication systems.

[0087] For ease of understanding, the relevant technologies and technical terms involved in the embodiments of this application will be introduced below.

[0088] 1. Digital Human

[0089] A digital human is a virtual character with a digital appearance, displayed through devices such as mobile phones, televisions, augmented reality (AR) glasses, or virtual reality (VR) glasses. Digital humans possess a human-like or near-realistic appearance, with intuitive anthropomorphic features such as facial features, gender, and personality. Driven by digital technology, digital humans can exhibit human-like behaviors, including language, facial expressions, and gestures. Furthermore, artificial intelligence can enable digital humans to possess simple thoughts, recognize their environment, and interact with people. Digital humans are widely used in film and television production, virtual broadcasters, virtual education, gaming, virtual customer service, virtual tour guides, and real-time communication.

[0090] For example, the International Telecommunication Union-Telecommunication Standardization Sector (ITU-T) has two digital human standards: ITU-T F.748.15 "Framework and metrics for digital human application system" and ITU-T F.748.14 "Requirements and evaluation methods of non-interactive 2D real-person digital human application system". In these standards, digital humans and 2D digital humans are defined as follows:

[0091] (1) Digital Human: A computer application that integrates computer graphics, computer vision, intelligent speech, and natural language processing technologies. It can be used for digital content generation and human-computer interaction to help improve content production efficiency and user experience.

[0092] (2) 2D digital human: A type of digital human whose graphic is a two-dimensional image whose graphic content contains only information about horizontal and vertical dimensions.

[0093] 2. 5G New Calling

[0094] 5G New Calling, also known as Voice over NR with enhanced services (VoNR+), is an enhanced voice call service based on 5G networks. It transmits services through an additional data channel (DC) between the terminal device and the IMS. On top of high-definition audio and video calls, arbitrary multimedia information can be transmitted synchronously through the DC, thereby upgrading real-time calls to real-time interactive / immersive calls.

[0095] It adds a data channel (DC) to the IMS network to carry text, images, doodles, menus, and other information during calls. It can provide more services on top of traditional voice services, such as screen sharing, intelligent translation, content sharing, and fun calls. These features can bring users a more interesting and diverse calling experience, and also help operators enhance the commercial value of their basic services.

[0096] With the increasing intelligence and larger screens of mobile terminals, users' demands for real-time communication are no longer limited to the exchange of voice and image between the two parties in a call. Interactive operations such as touching, stroking, dragging, and pulling, as well as collaborative efforts on the same task, are emerging. These more complex interactive needs are giving voice services new life and vitality. As 5G technology and emerging technologies such as AR, VR, and artificial intelligence (AI) evolve towards interactive and immersive calls, new 5G calling technologies are emerging.

[0097] 3. IMS Network Architecture

[0098] IMS is one of the core technologies of network communication. IMS can meet the needs of terminal devices for newer and more diversified multimedia services. It is an important way to solve the convergence of mobile and fixed networks and introduce differentiated services such as the triple convergence of voice, data and video.

[0099] The IMS architecture described here is an IMS architecture that supports DC applications and can synchronously transmit any multimedia and data information, such as audio, video, images, text, Hypertext Markup Language 5 (H5), location, emoticons, actions, AR, VR, etc.

[0100] As shown in Figure 1, the IMS network architecture includes a data channel signaling function (DCSF) and a media function (MF), and uses a service-based architecture (SBA) to interoperate with the IMS network to support the implementation of DC applications.

[0101] Specifically, the user equipment (UE) communicates with the proxy-call session control function (P-CSCF) via the Gm interface; the P-CSCF communicates with the serving-call session control function (S-CSCF) via the Mw interface; the P-CSCF communicates with the IMS access gateway (IMS-AGW) via the Iq interface; the IMS-AGW communicates with the remote IMS, MF, and UE via the Mb interface; the S-CSCF communicates with the home subscriber server (HSS) via the N70 / Cx interface; the S-CSCF communicates with the IMS application server (AS) via the ISC interface; the IMS AS communicates with the HSS via the N71 / Sh interface; the IMS AS communicates with the DCSF via the DC1 interface; the IMS AS communicates with the MF via the DC2 interface; the DCSF communicates with the network exposure function (NEF) via the DC3 interface; the DCSF communicates with the data channel application server (DCAS) via the DC4 or MDC3 interface; the DCSF communicates with the data channel application repository (DCAR) via the DC5 interface; the DCSF communicates with the MF via the MDC1 interface; the MF communicates with the DCAS via the MDC2 interface; and the HSS communicates with the DCSF via the N72 / Sc interface. Furthermore, the IMS AS, DCSF, MF, and NEF shown in Figure 1 interact using service-based interfaces. For example, the service-based interface provided by the IMS AS is Nimsas; the service-based interface provided by the DCSF is Ndcsf; the service-based interface provided by the NEF is Nnef; and the service-based interface provided by the MF is Nmf.
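For quick reference, the interface relationships listed above can be captured in a small lookup table. This is only an illustrative sketch: the table name `REFERENCE_POINTS` and the helper `reference_point` are hypothetical, the entries are transcribed from the paragraph above (reading "S-CSCS" as S-CSCF and "the AS" as the IMS AS), and the multi-party Mb interface is expanded into one entry per peer named in the text.

```python
from typing import Dict, FrozenSet, Optional

# Reference points between network elements, transcribed from the paragraph above.
# Keys are unordered pairs, so lookups work in either direction.
REFERENCE_POINTS: Dict[FrozenSet[str], str] = {
    frozenset({"UE", "P-CSCF"}): "Gm",
    frozenset({"P-CSCF", "S-CSCF"}): "Mw",
    frozenset({"P-CSCF", "IMS-AGW"}): "Iq",
    frozenset({"IMS-AGW", "remote IMS"}): "Mb",
    frozenset({"IMS-AGW", "MF"}): "Mb",
    frozenset({"IMS-AGW", "UE"}): "Mb",
    frozenset({"S-CSCF", "HSS"}): "N70/Cx",
    frozenset({"S-CSCF", "IMS AS"}): "ISC",
    frozenset({"IMS AS", "HSS"}): "N71/Sh",
    frozenset({"IMS AS", "DCSF"}): "DC1",
    frozenset({"IMS AS", "MF"}): "DC2",
    frozenset({"DCSF", "NEF"}): "DC3",
    frozenset({"DCSF", "DCAS"}): "DC4 or MDC3",
    frozenset({"DCSF", "DCAR"}): "DC5",
    frozenset({"DCSF", "MF"}): "MDC1",
    frozenset({"MF", "DCAS"}): "MDC2",
    frozenset({"HSS", "DCSF"}): "N72/Sc",
}


def reference_point(a: str, b: str) -> Optional[str]:
    """Return the interface named in the text between elements a and b, or None."""
    return REFERENCE_POINTS.get(frozenset({a, b}))
```

For example, `reference_point("MF", "DCSF")` and `reference_point("DCSF", "MF")` both return `"MDC1"`.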

[0102] The functions of each network element in this IMS network architecture are as follows:

[0103] P-CSCF: It is the entry node for users to access the IMS network via the Session Initiation Protocol (SIP), and is mainly responsible for forwarding SIP signaling between SIP users and the home network.

[0104] S-CSCF: It is the central node of the IMS network, responsible for user registration, authentication, sessions, routing, and service triggering.

[0105] IMS-AGW: This is the IMS access gateway, primarily responsible for media plane communication between the user and network interfaces.

[0106] HSS: The main database for IMS user subscriptions, responsible for managing user subscription data and mobile user location information. It stores the following main user-related subscription information: user identity (ID), such as IMS private identity (IMPI) and IMS public identity (IMPU); user authentication-related information; S-CSCF information registered by the user; and transparent data stored by the AS in the HSS, such as the UE's call forwarding number.

[0107] IMS AS: Generally refers to the server network element in an IMS network that processes upper-layer voice services, including basic audio and video services and supplementary services. Specifically, an AS may include the multimedia telephony (MMTel) AS (which processes basic audio and video services and supplementary services) and / or the Service Centralization and Continuity (SCC) AS (responsible for signaling control of enhanced single radio voice call continuity (eSRVCC) and for called party access domain selection). These two ASs can be deployed independently or jointly.

[0108] DCSF: Provides the signaling control function for data channel control logic. The DCSF supports the following functions: receiving event reports from the IMS AS and determining whether to allow data channel service during an IMS session; managing bootstrap data channel resources and, if applicable, application data channel resources at the MF or media resource function (MRF) via the IMS AS; supporting Hypertext Transfer Protocol (HTTP) web server functions to download data channel applications (bootstrapping) to the UE via the MF and / or MRF based on the UE subscription; and downloading data channel applications from the data channel application repository.

[0109] MF: Provides media resource management and forwarding of data channel media traffic. The MF supports the following functions: managing data channel media resources (bootstrap data channel resources and, if applicable, application data channel resources) under the control of the IMS AS; terminating bootstrap data channels from the UE and forwarding HTTP traffic between the UE and the DCSF via MDC1; anchoring application data channels in peer-to-peer (P2P) scenarios if needed and relaying application data traffic between UEs; and relaying traffic on application-to-person (A2P) / person-to-application (P2A) application data channels between the UE and the DC application server via MDC2.

[0110] NEF: Primarily used to support the exposure of capabilities and events.

[0111] DCAS: Primarily used to provide services related to DC applications.

[0112] In the embodiments of this application, the above-mentioned functions can be referred to as entities or network elements. For example, P-CSCF can also be referred to as P-CSCF entity or P-CSCF network element, and S-CSCF can also be referred to as S-CSCF entity or S-CSCF network element. There is no limitation on this.

[0113] Currently, to improve the user experience, the new calling service has introduced a voice-driven 2D digital human feature. During audio calls, the call audio is used in real time to drive the user's 2D digital avatar, generating a clear and smooth digital human call video with high lip-sync accuracy, thereby increasing the fun and playability of calls and improving the overall call experience.

[0114] Based on the demand for engaging calls, the new calling services, including personal assistants, virtual avatar calls, and real-time calls, all involve voice-driven digital avatars. Currently, voice only drives the mouth movements of the digital avatars, and their bodies only perform simple repetitive actions, resulting in a poor call experience.

[0115] Meanwhile, over-the-top (OTT) applications that provide various application services to users via the Internet also have many solutions for voice-driven body movements, but they generally suffer from high performance overhead during calls.

[0116] Therefore, embodiments of this application provide a communication method and apparatus that can match multimedia motion materials of a digital human with voice to drive the digital human's body movements in real time, thereby improving the call experience and reducing performance overhead.

[0117] Please refer to Figure 2, which is a schematic diagram of the architecture of a communication system applied in an embodiment of this application. As an example, as shown in Figure 2, this communication system can be applied to the IMS network architecture shown in Figure 1 above. The communication system includes: a media function network element and a first device. Optionally, it may also include terminal devices. The devices can communicate directly or indirectly with each other, without limitation.

[0118] The media function network element, located on the media plane, is responsible for media resource management and data channel media traffic forwarding. It can drive digital humans based on voice. The media function network element can also be called a media plane network element, media plane function, or unified media function (UMF), etc., without limitation. In this embodiment, the media function network element can obtain relevant information about the digital human's actions from the first device based on the digital human service triggered by the terminal device, and receive audio media streams from the terminal device or the AI assistant subscribed to by the terminal device. Based on the audio media stream and the relevant information about the actions, it can drive the digital human to perform actions or expressions related to the audio content and send the digital human's action video to the terminal device. For example, the media function network element can be the MF in the above-mentioned IMS network architecture, and the specific functions can be found in the functional description of the MF above. Alternatively, the media function network element can be a network element or device with media functions in other communication networks.

[0119] The first device is used to store action-related information for the digital human to which a user has subscribed. This information includes multimedia action materials and action description files for the multimedia action materials. An action description file includes an index and action features of the multimedia action materials, which are used for matching against the real-time audio media stream: multimedia action materials that match the audio media stream are retrieved and arranged into the digital human's action video, driving the digital human to perform actions, expressions, etc., related to the content of the audio media stream. Essentially, the first device is a material repository or platform for storing the multimedia action materials of the digital human. For example, the first device can be a device, network element, or server corresponding to an operator's operation and maintenance platform, such as an operation, administration and maintenance (OAM) system, or it can be a device, network element, or server specifically used to store information related to the digital human's actions; there is no limitation on this.
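As a rough illustration of what the first device stores, the sketch below models an action description entry (index plus action features) alongside the material itself. The class names, field names, and the representation of action features as text labels are assumptions made for illustration; this application does not specify a file format or feature encoding.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class ActionDescription:
    index: str                 # index of one multimedia action material
    features: Tuple[str, ...]  # action features (assumed here to be text labels)


@dataclass
class MaterialRepository:
    """Hypothetical in-memory stand-in for the first device's material store."""
    descriptions: Dict[str, ActionDescription] = field(default_factory=dict)
    materials: Dict[str, bytes] = field(default_factory=dict)

    def add(self, index: str, features: Iterable[str], media: bytes) -> None:
        self.descriptions[index] = ActionDescription(index, tuple(features))
        self.materials[index] = media

    def action_related_info(self, include_materials: bool = False) -> dict:
        """Build the reply to a first request: descriptions, optionally with materials."""
        info = {
            "descriptions": [
                {"index": d.index, "features": list(d.features)}
                for d in self.descriptions.values()
            ]
        }
        if include_materials:
            info["materials"] = dict(self.materials)
        return info
```

Returning only the descriptions keeps the reply small, while `include_materials=True` corresponds to the design in which the action-related information also carries the materials themselves.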

[0120] Terminal devices may include first terminal devices, second terminal devices, etc., and are terminals that access the aforementioned communication system and have wireless transceiver functions, or chips or chip systems that can be installed in the terminal. These terminal devices may also be referred to as user equipment (UE), access terminal, user unit, user station, mobile station, remote station, remote terminal, mobile device, user terminal, terminal, wireless communication equipment, or user agent. In the embodiments of this application, the terminal device may be a mobile phone, tablet computer, computer with wireless transceiver functions, VR terminal device, AR terminal device, wireless terminal in industrial control, wireless terminal in self-driving, wireless terminal in remote medical care, wireless terminal in smart grid, wireless terminal in transportation safety, wireless terminal in smart city, wireless terminal in smart home, vehicle-mounted terminal, roadside unit (RSU) with terminal functions, etc. The terminal device of this application may also be an on-board module, on-board component, on-board chip, or on-board unit that is built into a vehicle as one or more components or units. The vehicle can implement the method provided in this application through the built-in on-board module, on-board component, on-board chip, or on-board unit.

[0121] The embodiments of this application do not limit the device form of the terminal device. The device used to implement the function of the terminal device can be the terminal device itself; it can also be a device that supports the terminal device in implementing the function, such as a chip system. The device can be installed in the terminal device or used in conjunction with the terminal device. In the embodiments of this application, the chip system can be composed of chips or can include chips and other discrete components.

[0122] In this embodiment, when the media function network element learns that the terminal device has initiated the digital human service, it can request relevant information about the digital human's actions from the first device. Based on the received audio media stream, it matches corresponding multimedia action materials from the action-related information, arranges them sequentially to obtain a digital human action video, and sends the digital human action video to the terminal device, driving the digital human's facial expressions and body movements. Thus, the media function network element can obtain information related to the digital human's actions, match multimedia action materials related to the audio content from the action-related information based on real-time audio, and determine the order of the multimedia action materials to drive the digital human to perform facial expressions and body movements corresponding to the audio content. This enhances the fun and playability of the service, thereby improving the call experience. For a detailed implementation, please refer to the following method embodiments, which will not be elaborated here.
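The match-then-arrange flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the matching procedure of this application: audio features are assumed to have already been extracted per audio segment as sets of labels, matching is approximated by counting overlapping labels, and a fallback index (for example a slight idle movement) is used when nothing matches. The function names and the fallback mechanism are hypothetical.

```python
from typing import Dict, List, Sequence, Set


def match_segment(audio_features: Set[str],
                  descriptions: Dict[str, Set[str]],
                  fallback: str) -> str:
    """Return the index whose action features overlap most with the audio features."""
    best_index, best_score = fallback, 0
    for index, features in descriptions.items():
        score = len(audio_features & features)
        if score > best_score:  # strict comparison: the first best match wins ties
            best_index, best_score = index, score
    return best_index


def arrange(segments: Sequence[Set[str]],
            descriptions: Dict[str, Set[str]],
            fallback: str) -> List[str]:
    """Match each audio segment in arrival order, yielding the material playback order."""
    return [match_segment(seg, descriptions, fallback) for seg in segments]


if __name__ == "__main__":
    descriptions = {"wave": {"hello", "greeting"}, "nod": {"yes", "agree"}}
    print(arrange([{"hello"}, {"um"}, {"yes", "agree"}], descriptions, "idle"))
```

The resulting list of indexes is what the media function network element would use to order the materials into the digital human's action video.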

[0123] For example, digital human services can exist in the following three business scenarios:

[0124] Scenario 1: Personal Assistant Category, i.e., Customer to Manufacturer (C2M) business scenario. In this business scenario, the user communicates with the operator's AI assistant (or AI voice assistant). During the call, the user's terminal device displays the digital human image of the AI assistant, and the AI assistant's audio media stream drives the digital human's facial expressions and body movements.

[0125] Scenario 2: Real-time call scenario, i.e., customer-to-customer (C2C) business scenario. In this business scenario, user A and user B are having a call. During the call, user A's terminal device displays a digital avatar of user B. User B's audio media stream drives the facial expressions and body movements of the digital avatar of user B. User B's terminal device displays a digital avatar of user A. User A's audio media stream drives the facial expressions and body movements of the digital avatar of user A.

[0126] Scenario 3: Virtual character call service scenario, similar to a personal assistant business model, where users communicate with the operator's AI assistant. The difference is that the AI assistant's digital human image is a virtual character selected by the user, such as a virtual celebrity.

[0127] In one possible design, where the communication system is compatible with the aforementioned IMS network architecture, the AI assistant can be deployed on a multi-modal communication function (MCF). The MCF can carry AI algorithms such as AI agents, integrate self-developed and third-party intelligent components, support modal conversions such as voice-to-text, voice-to-image, and gesture-to-animation, and bring a variety of atomic capabilities such as real-time translation, gesture recognition, and real-time voice-driven digital humans. The MCF can be connected to the MF.

[0128] It is understood that Figures 1 and 2 above are simplified schematic diagrams for ease of understanding, and may also include other devices, modules or chips, etc., which are not shown in Figures 1 and 2.

[0129] It should be noted that the solutions in the embodiments of this application can also be applied to other communication systems, and the corresponding names can be replaced by the names of the corresponding functions in other communication systems.

[0130] The communication method provided in the embodiments of this application will be described in detail below with reference to Figures 3-9.

[0131] For example, Figure 3 is a schematic flowchart of a communication method provided in an embodiment of this application. This communication method is illustrated using the communication between the media function network element and the first device shown in Figure 2 as an example. Of course, the subject executing the action of the media function network element in this method can also be a device / module in the media function network element, such as a chip, processor, or processing unit in the media function network element, and there is no limitation thereto; the subject executing the action of the first device in this method can also be a device / module in the first device, such as a chip, processor, or processing unit in the first device, and there is no limitation thereto.

[0132] As shown in Figure 3, the communication method includes:

[0133] S301, the media function network element sends a first request to the first device. Correspondingly, the first device receives the first request from the media function network element.

[0134] The first request may be a business or service request message related to driving the digital human. In this embodiment, the first request is used to request the acquisition of relevant information about the digital human's actions, which can be understood as the media function network element requesting the first device to download relevant information about driving the digital human's actions.

[0135] When the media function network element learns that the terminal device has triggered a digital human service due to a call, it sends a first request to the first device to download relevant information about the actions of the digital human driving the terminal device. In the case where the digital human service belongs to either the C2M service shown in scenario 1 or the virtual character caller service shown in scenario 3, the digital human can be a digital human subscribed to by the first terminal device.

[0136] Where the digital human service is the C2C service shown in scenario 2 above, taking the first terminal device as the called party and the second terminal device as the caller as an example, the first terminal device and the second terminal device communicate through a communication network. The digital human can then include the digital human of the first terminal device and the digital human of the second terminal device, each of which corresponds to its own action-related information. The first request can simultaneously request the action-related information of the digital human of the first terminal device and that of the digital human of the second terminal device. Alternatively, the media function network element can send another request to request the action-related information of the digital human of the first terminal device. This is not limited. In this embodiment of the application, taking the sending of the action video to the first terminal device as an example, the digital human whose information is requested from the first device can be the digital human subscribed to by the second terminal device.

[0137] For example, the media function network element (MF) is applied in the aforementioned IMS network architecture. When a terminal device initiates a call through the IMS network, the control plane network elements (such as P-CSCF / I-CSCF / S-CSCF / HSS) in the IMS network can trigger the MF to start the digital human service of the terminal device. The MF can then send the first request to the first device based on the trigger message from the control plane network element. The first request may include the identifier of the terminal device, the identifier of the digital human service, etc.

[0138] The action-related information indicates the set of multimedia action materials that are matched against the audio media stream of the real-time call and used to drive the digital human to perform body movements and facial expressions related to the audio content. Multimedia action materials are videos or sets of pictures that represent real or virtual actions in multimedia form, such as videos or sets of pictures that represent the actions of real or virtual people, animals, objects, etc.

[0139] A multimedia motion asset can be used to represent a single, indivisible, complete action that a digital human can perform, or the smallest unit of action implementation. The action represented by a multimedia motion asset can be called an atomic action, a minimum-granularity action, etc., without limitation. Examples include raising a hand (the process from starting to raise the hand to fully lifting it), clapping (hands together and then apart), and shaking a fist (hands raised, then clasped and swayed back and forth). Exemplarily, a multimedia motion asset can be a video segment representing an action, a set of image frames constituting an action, or an npy (numerical python) file indicating the content of the action images, etc., without limitation. This application embodiment uses a video segment as an example to illustrate a multimedia motion asset.

[0140] It should be understood that multimedia motion materials are presented in the form of digital human figures. In addition to including a physical action performed by the digital human, they can also include facial expressions and movements of the digital human during the action, such as lip movements and eye movements.

[0141] Each terminal device subscribed to the digital human service can correspond to a multimedia action material set. This multimedia action material set can be generated offline, for example, by generating multiple action videos using video generation algorithms. The content of each action video is the process of a digital human performing one action. The multimedia action material set is pre-loaded locally on the first device.

[0142] The multimedia motion material collection can include the full set of motion materials of digital humans. The real or virtual actions represented by the multimedia motion materials can include at least one of the following: commonly used body movements and facial expressions in a certain semantic environment, body movements and facial expressions that match the rhythm of speech, and unconscious slight body movements and facial expressions.

[0143] Commonly used body movements and facial expressions in a specific semantic context can refer to the subconscious actions and expressions of real or virtual characters, animals, or objects upon hearing a certain dialogue audio content. For example, a video expressing the semantic meaning of "congratulations" might show a digital human clasping their hands in front of their chest and shaking them, with the digital human's face displaying a happy or joyful expression; a video expressing the semantic meaning of "thank you" might show a digital human bowing, with the digital human's face displaying a sincere expression; and an action expressing the semantic meaning of "OK" might show a digital human making an "OK" gesture, and so on.

[0144] Body movements and facial expressions that match the rhythm of speech are those related to the rhythm and beat of the dialogue audio but not to its semantics; that is, they are rhythmic but carry no clear semantic meaning. Examples include motion videos that depict rhythmic body movements of a digital human (such as rhythmic body swaying) and motion videos that depict unconscious but rhythmic tapping of a digital human's hands (such as rhythmic tapping of fingers). Such materials can be used to express the emotional state of real or virtual people, animals, or objects, etc.

[0145] Unconscious subtle body movements and facial expressions are actions and expressions unrelated to both the dialogue content and the speech rhythm. They are used to supplement or fill the audio segments of the audio media stream that match neither the commonly used body movements and facial expressions of a specific semantic context nor the body movements and facial expressions that match the speech rhythm. This not only ensures the continuity of the digital human's movements but also guarantees that the length of the edited video driving the digital human's actions is the same as the length of the audio media stream. Unconscious subtle body movements and facial expressions can be used to represent the digital human's resting, pausing, or idle states; for example, videos depicting slight body swaying or unconscious blinking, etc.

[0146] It should be understood that in the embodiments of this application, the length of the motion video of the digital human obtained by driving and arranging based on the audio media stream is the same as the length of the audio media stream, and the number of video frames of the motion video can be converted into the length of the audio media stream.
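The conversion between the number of video frames and the length of the audio media stream can be illustrated with a minimal sketch. The frame rate of 25 frames per second and the function names are assumptions made for illustration only; the application does not specify them.

```python
from fractions import Fraction

def frames_to_seconds(frame_count: int, fps: int) -> Fraction:
    """Duration covered by frame_count video frames at fps frames per second."""
    return Fraction(frame_count, fps)

def seconds_to_frames(duration_s, fps: int) -> int:
    """Number of video frames needed to cover duration_s seconds of audio."""
    return round(Fraction(duration_s) * fps)

FPS = 25           # assumed frame rate, not taken from the application
audio_seconds = 3  # assumed length of the buffered audio media stream
video_frames = seconds_to_frames(audio_seconds, FPS)

# The arranged motion video must cover exactly the audio length.
assert frames_to_seconds(video_frames, FPS) == audio_seconds
```

Exact fractions are used so that converting in both directions does not accumulate floating-point rounding error.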

[0147] In this embodiment, the action-related information includes the index of the multimedia action material and the action features of the multimedia action material. That is, each multimedia action material in the multimedia action material set has a corresponding index and feature information describing the action performed by the multimedia action material, used to match the audio content features. The index and action features of the multimedia action material can be used to describe the multimedia action material; therefore, the index and action features of the multimedia action material can be called or belong to the action description file of the multimedia action material. In other words, the action-related information includes the action description file, which can include the index and action features of each multimedia action material in the multimedia action material set. The index of the multimedia action material is associated with the action features, or there is a corresponding relationship between the index and action features. For example, the action-related information and the action description file can be stored and sent in tabular form. In some implementations, the action-related information refers to the action description file, and this is not limited.

[0148] For example, a multimedia action material is an action video. The index of this multimedia action material can be the frame order of its video frame within the multimedia action material collection. That is, the video frames of the multimedia action materials in the collection are arranged sequentially. Since a multimedia action material has multiple video frames, the index of a multimedia action material can be a frame order range or a frame sequence. Alternatively, a multimedia action material can correspond to a single index, which indicates its position within the multimedia action material collection. For example, the index can be the number or sequence number of the multimedia action material. The number or sequence number of the multimedia action materials in the collection can start from 0 or 1, and all multimedia action materials are numbered sequentially without limitation.
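As a sketch of the frame-order-range form of the index, the following assumes that the video frames of all materials are laid out back to back in the collection and numbered from 0; the helper name `frame_ranges` and the example frame counts are hypothetical.

```python
from itertools import accumulate

def frame_ranges(frame_counts):
    """Map each material (by position) to its inclusive (first, last) frame order."""
    ends = list(accumulate(frame_counts))   # running total of frames
    starts = [0] + ends[:-1]                # each material starts where the previous ended
    return [(start, end - 1) for start, end in zip(starts, ends)]

# Three materials with 50, 30 and 20 video frames respectively (assumed values).
ranges = frame_ranges([50, 30, 20])
print(ranges)
```

With a single serial number per material instead, the index would simply be the position of the material in this list.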

[0149] For example, action features can be feature vectors of actions.
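One way to picture an action description file that associates each index with an action feature vector is the table sketched below. The class name, the three-dimensional vectors, and the uniqueness check are illustrative assumptions, not a format defined by the application.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Sequence, Tuple

@dataclass(frozen=True)
class ActionDescription:
    index: int                   # serial number of the material in the set
    feature: Tuple[float, ...]   # action feature vector (dimension is assumed)

def build_description_file(
    rows: Iterable[Tuple[int, Sequence[float]]]
) -> Dict[int, ActionDescription]:
    """Build an index -> description table, rejecting duplicate indexes."""
    table: Dict[int, ActionDescription] = {}
    for index, feature in rows:
        if index in table:
            raise ValueError(f"duplicate index {index}")
        table[index] = ActionDescription(index, tuple(feature))
    return table

description_file = build_description_file([
    (0, [1.0, 0.0, 0.0]),
    (1, [0.0, 1.0, 0.0]),
    (2, [0.0, 0.0, 1.0]),
])
```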

[0150] It should be understood that the embodiments of this application do not specifically limit the names of information and messages. For example, an action description file can also be called action description information, an index can also be called a sequence number, number, identifier, etc., and action features can also be called action feature information, action feature vector, etc., without limitation.

[0151] Optionally, the action-related information may also include each multimedia action material in the multimedia action material set. It should be understood that each multimedia action material in the action-related information is associated with a corresponding index and action feature, or that the multimedia action material, index, and action feature have a corresponding relationship. For digital human services with a small amount of digital human material, or in some other types of digital human services, for example, when the digital human service belongs to the virtual character call service shown in scenario 3 above, when the digital human service is started, the media function network element can obtain all the action material files driving the digital human through a first request. In this case, the action-related information may include all multimedia action materials and action description files in the multimedia action material set, or the first device may feed back all the digital human action material files according to the type of digital human service; this is not limited. That is, the action-related information may include multimedia action materials, the index of the multimedia action materials, and the action features of the multimedia action materials.

[0152] S302, the first device sends action-related information to the media function network element. Correspondingly, the media function network element receives action-related information from the first device.

[0153] After receiving the first request, the first device can send the action-related information of the digital human of the first terminal device to the media function network element according to the first request. For example, the action-related information can be carried in a response message (such as a first response) to the first request.
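The exchange in S301 and S302 can be sketched as a simple request and response pair. All message and field names below (`FirstRequest`, `FirstResponse`, `include_materials`, the `store` layout) are hypothetical; the application only states that the first request may carry the terminal device identifier and the digital human service identifier, and that the response carries the action-related information, optionally together with the materials themselves.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

@dataclass
class FirstRequest:
    terminal_id: str
    service_id: str
    include_materials: bool = False   # hypothetical flag, not in the application

@dataclass
class FirstResponse:
    description_file: Dict[int, Tuple[float, ...]]   # index -> action feature
    materials: Optional[Dict[int, bytes]] = None     # index -> material, if requested

def handle_first_request(store, request: FirstRequest) -> FirstResponse:
    """First device: look up the material set pre-loaded for this terminal."""
    entry = store[request.terminal_id]
    materials = dict(entry["materials"]) if request.include_materials else None
    return FirstResponse(dict(entry["features"]), materials)

store = {"ue-1": {"features": {0: (1.0, 0.0)}, "materials": {0: b"clip0"}}}
light = handle_first_request(store, FirstRequest("ue-1", "digital-human"))
full = handle_first_request(store, FirstRequest("ue-1", "digital-human", True))
```

Returning only the description file keeps the first response small; the materials can then be fetched later by index.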

[0154] S303, the media function network element receives an audio media stream.

[0155] The media function network element receives and caches audio media streams in real time during a call on the first terminal device.

[0156] If the digital human service belongs to the C2M service shown in Scenario 1 above or the virtual character call service shown in Scenario 3 above, then the media function network element can receive the audio media stream from the AI assistant of the first terminal device. This audio media stream can be the audio media stream in which the AI assistant answers questions based on the audio media stream initiated by the first terminal device.

[0157] Where the digital human service is the C2C service shown in scenario 2 above, taking the first terminal device as the called party and the second terminal device as the caller as an example, the second terminal device communicates with the first terminal device through the communication network. The media function network element can then receive the audio media stream from the second terminal device. This audio media stream carries the reply of the user of the second terminal device to the first terminal device.

[0158] S304, the media function network element determines the index and sequence of the multimedia motion materials used to drive the digital human based on the audio media stream and motion-related information.

[0159] The multimedia motion materials associated with the determined indexes, arranged in the determined order, are used to choreograph the motion video of the digital human.

[0160] After receiving the audio media stream, the media function network element begins to buffer audio data packets. Based on the received audio media stream and the action-related information, it determines the index of each multimedia action material whose action feature matches the audio media stream. The multimedia action materials matched in this way are the multimedia action materials used to drive the digital human. The media function network element also determines the playback or execution order of these multimedia action materials. The order can be indicated by the arrangement of the determined indexes: the order of the indexes is the execution order of the multimedia action materials used to drive the digital human.

[0161] In one possible implementation, the media function network element can extract audio features from the audio media stream, match the audio features with motion features in motion-related information, and determine the index of multimedia motion material used to drive the digital human. For example, the audio features can be audio feature vectors.
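A minimal sketch of matching an audio feature vector against action feature vectors is given below. The application does not specify the matching metric; cosine similarity and the threshold of 0.8 are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def match_index(audio_feature, action_features, threshold=0.8):
    """Return the index of the best-matching action feature, or None if too dissimilar."""
    if not action_features:
        return None
    index, score = max(
        ((i, cosine(audio_feature, f)) for i, f in action_features.items()),
        key=lambda pair: pair[1],
    )
    return index if score >= threshold else None

# Hypothetical action description table: index -> action feature vector.
action_features = {1: (1.0, 0.0), 5: (0.0, 1.0)}
matched = match_index((0.9, 0.1), action_features)
unmatched = match_index((1.0, 1.0), action_features)
```

An audio segment for which `match_index` returns `None` is a candidate for being filled with idle material.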

[0162] In this implementation, if the audio features include speech content features and tempo features of the audio media stream, then the media function network element matches the audio features with the action features in the action-related information to determine the index of the multimedia action material used to drive the digital human. This can include: the media function network element can perform semantic-related action feature matching and speech tempo-related action feature matching based on the audio features and the action features in the action-related information to determine the index of the multimedia action material associated with the action features matched with the speech content features and the index of the multimedia action material associated with the action features matched with the tempo. At this time, the real or virtual actions represented by the multimedia action material can include commonly used body movements and facial expressions in a certain semantic environment and body movements and facial expressions matched with the speech tempo.

[0163] For example, if the content of an audio media stream is "That's great news, congratulations, congratulations, congratulations!", the audio features extracted from the audio media stream can include the speech content feature of the semantic "congratulations" and the rhythm features of "great news" and "congratulations!". Then, based on the speech content feature, action features that express the meaning of "congratulations" can be matched, such as clasping hands and shaking them, and, based on the rhythm features of "great news" and "congratulations!", action features that match the rhythm can be matched, such as swaying the body from side to side in rhythm, thereby obtaining the indexes of the multimedia action materials associated with the matched action features.

[0164] In some implementations, such as the C2C business scenario mentioned above, since the conversation between users is in segments of audio output, the media function network element can extract the voice content features only from the audio media stream, without considering the beat features, and determine the index of the multimedia action material associated with the action features that match the semantic content features from the relevant information of the action.

[0165] Furthermore, the length of the multimedia action materials obtained by matching the voice content features, or the total length of the multimedia action materials obtained by matching the voice content features and the beat features, may not reach the length of the audio media stream. In that case, for the audio portion of the audio media stream that matched no multimedia action material expressing semantics or beat, multimedia action materials expressing unconscious slight limb movements and facial expressions can be matched from the action-related information based on the length of that audio portion, to fill or supplement the digital human's actions and expressions corresponding to that portion. This ensures that the length of the edited video driving the digital human's actions is the same as the length of the received audio media stream, and that the digital human's actions are continuous.

[0166] At this point, the real or virtual actions represented by the multimedia motion material can also include unconscious slight body movements and facial expressions. That is, the determination by the media function network element of the index of the multimedia motion material used to drive the digital human, based on the audio media stream and the motion-related information, can also include the following: if the sum of the length of the multimedia motion materials associated with motion features matching the speech content features and the length of the multimedia motion materials associated with motion features matching the beat is less than the length of the audio media stream, the media function network element matches, based on the length of the audio portion of the audio media stream for which no speech content features or beat features were extracted, motion features of unconscious slight body movements and facial expressions in the motion-related information, to determine the index of the multimedia motion material used to drive the digital human to perform unconscious slight body movements and facial expressions.

[0167] Therefore, for audio segments in the audio media stream from which audio features cannot be extracted, the media function network element can match, from the action-related information and based on the length of the audio segment, multimedia action material of the corresponding length that is associated with unconscious subtle body movements and facial expressions. The material obtained in this way supplements or fills the audio segments that matched no multimedia action material expressing semantics or tempo. For the media function network element, after matching the multimedia action materials expressing semantics and tempo based on the speech content and tempo features of the audio media stream, it needs to determine whether the total length of the matched multimedia action materials is equal to the length of the audio media stream. If not, the media function network element also needs to match multimedia action materials associated with unconscious subtle body movements and facial expressions to fill or supplement the unmatched audio segments.
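The length check and filling step can be sketched as follows. The application does not say how idle materials are chosen to cover the unmatched length; the greedy longest-first choice, the frame counts, and the name `fill_gap` are assumptions. The sketch reports any remainder it could not cover rather than assuming an exact fit.

```python
def fill_gap(gap_frames, idle_lengths):
    """Greedy fill: repeatedly pick the longest idle material that still fits.

    idle_lengths maps an idle material index to its length in frames.
    Returns (chosen indexes in order, frames left uncovered).
    """
    chosen = []
    remaining = gap_frames
    by_length = sorted(idle_lengths.items(), key=lambda kv: kv[1], reverse=True)
    while remaining > 0:
        for index, frames in by_length:
            if 0 < frames <= remaining:
                chosen.append(index)
                remaining -= frames
                break
        else:
            break   # no idle material fits the remaining gap
    return chosen, remaining

audio_frames = 250       # assumed length of the buffered audio, in frames
matched_frames = 75 + 50 # assumed lengths of the semantic and rhythm matches
idle_lengths = {20: 50, 21: 25}   # hypothetical idle materials
filler, leftover = fill_gap(audio_frames - matched_frames, idle_lengths)
```

Here the 125-frame gap is covered by material 20 twice and material 21 once, so the arranged video reaches the audio length.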

[0168] In one possible design, after receiving an audio media stream, the media function network element can perform hierarchical matching on the audio media stream, for example, sequentially performing semantic-related action feature matching, speech rhythm-related action feature matching, and unconscious slight body movements and facial expressions-related action feature matching.
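The hierarchical matching order can be sketched as a fallback chain. To keep the example short, labels stand in for feature vectors, and the library layout, label names, and index values are hypothetical.

```python
def hierarchical_match(segment, library):
    """Return (tier, index) for one audio segment, trying tiers in priority order."""
    for tier in ("semantic", "rhythm"):
        label = segment.get(tier)
        index = library[tier].get(label)
        if index is not None:
            return tier, index
    # Fall back to an idle material so that every segment is covered.
    return "idle", library["idle"]["default"]

library = {
    "semantic": {"congratulations": 5, "thank you": 7},
    "rhythm": {"steady": 9},
    "idle": {"default": 1},
}
segments = [
    {"semantic": "congratulations"},   # semantic feature extracted
    {"rhythm": "steady"},              # only a rhythm feature extracted
    {},                                # no feature extracted
]
plan = [hierarchical_match(segment, library) for segment in segments]
```

The list of indexes in `plan`, read in order, is the kind of index sequence that S304 produces.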

[0169] In this embodiment, the way in which the media function network element performs feature matching based on the audio media stream and the action-related information is not limited.

[0170] After the media function network element obtains the index and sequence of the multimedia motion materials used to drive the digital human, there are two possible implementations:

[0171] In one possible implementation (1), if the action-related information also includes multimedia action materials, then the media function network element can determine the multimedia action materials driving the digital human from the multimedia action materials in the action-related information based on the determined index of the multimedia action materials driving the digital human. Then, it arranges the obtained multimedia action materials for driving the digital human in sequence according to the determined order, thus obtaining the digital human's action video. In other words, the media function network element determines the multimedia action materials associated with the index of the multimedia action materials driving the digital human based on the multimedia action materials.
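Implementation (1) amounts to a local lookup followed by concatenation, which can be sketched as below. Frames are represented by placeholder strings, and the material lengths and the audio length of 7 frames are assumed values.

```python
def arrange(materials, ordered_indexes):
    """Concatenate the frames of the selected materials in the determined order."""
    video = []
    for index in ordered_indexes:
        video.extend(materials[index])
    return video

# Materials received with the action-related information: index -> list of frames.
materials = {1: ["idle"] * 2, 5: ["clasp"] * 3, 9: ["sway"] * 2}
ordered_indexes = [5, 9, 1]   # determined order
audio_frames = 7              # assumed audio length expressed in frames

video = arrange(materials, ordered_indexes)
# The arranged action video should be as long as the audio media stream.
assert len(video) == audio_frames
```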

[0172] In one possible implementation (2), if the information related to the action does not include multimedia action materials, the media function network element requests multimedia action materials associated with the index used to drive the digital human from the first device, and sequentially arranges the acquired multimedia action materials used to drive the digital human to obtain the digital human's action video. In implementation 2, the media function network element can send a second request to the first device, and correspondingly, the first device receives the second request from the media function network element. The second request is used to request multimedia action materials. Thus, the first device sends multimedia action materials associated with the index used to drive the digital human to the media function network element, and correspondingly, the media function network element receives multimedia action materials associated with the index used to drive the digital human from the first device.

[0173] There are three possible design options for the second request:

[0174] In design scheme 1, the second request includes an index of multimedia motion materials for driving the digital human, and the order of the index of multimedia motion materials for driving the digital human in the second request is not limited. Then, for the first device, the first device matches the multimedia motion materials for driving the digital human with the associated index from the set of multimedia motion materials according to the index, and sends all the matched multimedia motion materials for driving the digital human to the media function network element. The sending order of all the matched multimedia motion materials for driving the digital human is not arranged.

[0175] In this design scheme 1, after the media function network element obtains the multimedia motion material for driving the digital human, it drives the digital human according to the determined order of the multimedia motion material for driving the digital human. Driving the digital human means arranging the obtained multimedia motion material for driving the digital human in the determined order to obtain the motion video of the digital human.

[0176] For example, the media function network element determines the indices of the multimedia action materials used to drive the digital human, including 1, 5, 7, and 9, but the determined order of the multimedia action materials used to drive the digital human is 5, 7, 9, 1. Then the second request may include the indices of the multimedia action materials used to drive the digital human, which are arranged arbitrarily, not in the order of the multimedia action materials used to drive the digital human, such as 1, 5, 7, and 9. Then the first device can send a set of multimedia action materials used to drive the digital human to the media function network element. This set includes the multimedia action materials associated with index 1, the multimedia action materials associated with index 5, the multimedia action materials associated with index 7, and the multimedia action materials associated with index 9. The four multimedia action materials in this set can be arranged arbitrarily.
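Design scheme 1 with the indexes from this example can be sketched as follows. The function names are hypothetical, and returning the materials sorted by index is only one of the arbitrary orders the first device might use.

```python
def first_device_respond_unordered(material_set, requested_indexes):
    """First device: return the requested materials in no particular order."""
    return [(index, material_set[index]) for index in sorted(set(requested_indexes))]

def mf_arrange(received, determined_order):
    """Media function network element: put the received materials in the determined order."""
    by_index = dict(received)
    return [by_index[index] for index in determined_order]

material_set = {index: f"clip-{index}" for index in range(10)}
received = first_device_respond_unordered(material_set, [1, 5, 7, 9])
sequence = mf_arrange(received, [5, 7, 9, 1])
```

Because the response carries each material together with its index, the order of the response does not matter; the media function network element alone holds the determined order.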

[0177] In design scheme 2, the second request includes an index of multimedia action materials for driving the digital human, arranged in a determined order. Then, for the first device, it can send the multimedia action materials associated with the index of the multimedia action materials for driving the digital human to the media function network element in sequence according to the received index order, that is, the order of the multimedia action materials for driving the digital human, or send the multimedia action materials associated with the index of the multimedia action materials for driving the digital human together in sequence.

[0178] For example, the indexes of the multimedia action materials used to drive the digital human determined by the media function network element include 1, 5, 7, and 9, and the determined order of these multimedia action materials is 5, 7, 9, 1. Then, the indexes in the second request form an index sequence whose order is 5, 7, 9, 1. The first device can send the associated multimedia action materials to the media function network element one after another in the order of the index sequence in the second request, or send them together, arranged in that order. That is, the sending order or arrangement order is the multimedia action material associated with index 5, then the one associated with index 7, then the one associated with index 9, and finally the one associated with index 1.

[0179] In design scheme 3, the second request includes the indexes of the multimedia motion materials for driving the digital human and, separately, the order of those multimedia motion materials. The first device can then search for and send the multimedia motion materials associated with the indexes according to that order. Similar to design scheme 2, the first device can send the associated multimedia motion materials to the media function network element one after another in that order, or send them together, arranged in that order.

[0180] For design schemes 2 and 3, the multimedia motion materials associated with the index used to drive the digital human are sent in the order they are used to drive the digital human. Thus, the media function network element can drive the digital human based on the received multimedia motion materials associated with the index used to drive the digital human.
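The difference between design schemes 2 and 3 lies in how the order reaches the first device, which the sketch below illustrates. Representing the separate order of scheme 3 as a list of positions is an assumption, as are the function names.

```python
def respond_scheme2(material_set, index_sequence):
    """Scheme 2: the index sequence itself encodes the order; send one by one."""
    for index in index_sequence:
        yield index, material_set[index]

def respond_scheme3(material_set, indexes, positions):
    """Scheme 3: indexes and order arrive separately; positions[k] is the
    playback position of indexes[k] (assumed representation)."""
    ordered = [index for _, index in sorted(zip(positions, indexes))]
    yield from respond_scheme2(material_set, ordered)

material_set = {index: f"clip-{index}" for index in range(10)}
sent2 = list(respond_scheme2(material_set, [5, 7, 9, 1]))
sent3 = list(respond_scheme3(material_set, [1, 5, 7, 9], [3, 0, 1, 2]))
```

Both schemes produce the same sending order here, so the media function network element can consume the materials as they arrive.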

[0181] It is understandable that the first device sends multimedia motion materials associated with the index of the multimedia motion materials used to drive the digital human in the order of the multimedia motion materials used to drive the digital human. Then, the media function network element usually arranges the received multimedia motion materials associated with the index of the multimedia motion materials used to drive the digital human in the same order to drive the digital human.

[0182] In design schemes 2 and 3, before the media function network element drives the digital human based on the received multimedia action materials, in one possible design, if the multimedia action materials used to drive the digital human are sent one after another, the media function network element can also determine whether each received multimedia action material is the next multimedia action material to be sent. That is, the media function network element reconfirms whether the first device sent the multimedia action materials in the determined order. If so, the media function network element drives the digital human in the order in which the multimedia action materials were received. If not, it can interrupt the driving process and re-request the multimedia action materials from the first device, or, after receiving all the multimedia action materials, arrange them in the determined order and then drive the digital human. There is no limitation on this.

[0183] Alternatively, if the multimedia motion materials used to drive the digital human are sent together in sequence, the media function network element can also check the order of the received multimedia motion materials used to drive the digital human again according to the determined order of the multimedia motion materials used to drive the digital human, so as to ensure the accuracy of the motion video arrangement of the digital human.

[0184] In another possible design, if the multimedia motion materials used to drive the digital human are sent one after another and the media function network element determines that a received multimedia motion material is the next multimedia motion material to be sent, the media function network element can send an acknowledgment message (such as an ACK) to the first device to inform it that the received multimedia motion material is correct. If the received multimedia motion material is not the next multimedia motion material to be sent, the media function network element can interrupt the driving process and re-request the multimedia motion materials from the first device. Alternatively, after receiving all the multimedia motion materials, the media function network element can arrange them in the determined order and then drive the digital human. There are no limitations on this.

[0185] It should be understood that determining whether a multimedia action material is the next multimedia action material to be sent means determining whether the multimedia action material currently being received is the one that follows the previous multimedia action material that has been correctly received.
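The order check described in the preceding paragraphs can be sketched as follows. This is a minimal illustration only: the class name `MaterialReceiver`, the function `arrange`, and the return values `"ACK"` and `"OUT_OF_ORDER"` are assumptions made for this sketch and do not appear in the application.

```python
from dataclasses import dataclass


@dataclass
class MaterialReceiver:
    """Hypothetical MF-side check that materials arrive in the expected order."""
    expected_order: list   # indexes, in the order they drive the digital human
    position: int = 0      # number of materials correctly received so far

    def on_receive(self, index):
        """Return "ACK" if `index` follows the last correctly received material."""
        if (self.position < len(self.expected_order)
                and index == self.expected_order[self.position]):
            self.position += 1
            return "ACK"
        # The MF may interrupt driving and re-request, or reorder later.
        return "OUT_OF_ORDER"


def arrange(received, expected_order):
    """Reorder (index, material) pairs after all materials have been received."""
    by_index = dict(received)
    return [by_index[i] for i in expected_order]


receiver = MaterialReceiver(expected_order=["a", "b", "c"])
print(receiver.on_receive("a"))  # next expected material
print(receiver.on_receive("c"))  # arrives before "b"
print(arrange([("b", "video-b"), ("a", "video-a")], ["a", "b", "a"]))
```

The sketch covers both fallbacks mentioned above: `on_receive` flags a material that is not the next one to be sent, and `arrange` restores the driving order once everything has been received.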

[0186] It should also be understood that multimedia motion materials arranged in a defined order to drive digital humans can constitute a multimedia motion material sequence, and the actions performed by the corresponding multimedia motion material sequence can constitute an action sequence.

[0187] Therefore, the media function network element can arrange the multimedia action materials used to drive the digital human, sent by the first device in the order they are used to drive the digital human, to obtain a sequence of actions performed by the digital human in the video, thereby generating a video of the digital human's actions. Alternatively, the media function network element can arrange the multimedia action materials sent by the first device in the determined order to drive the digital human, thereby obtaining a sequence of actions performed by the digital human in the video, thereby generating a video of the digital human's actions.

[0188] In some implementations, the media function network element can also request multimedia action materials from the first device one by one in a predetermined order to drive the digital human, without limitation.

[0189] In this embodiment, the digital human's motion video, in addition to exhibiting body movements and facial expressions related to the speech content and rhythmic features of the audio media stream, may also include mouth movements and sounds that drive the digital human to perform the speech content of the audio media stream. Of course, the media function network element can also set a background in the motion video; this is not limited. For example, in addition to driving the digital human's movements frame-by-frame according to the multimedia motion material, the media function network element can also drive the digital human's mouth movements frame-by-frame to correspond with the audio content. That is, in the motion video of the digital human, the digital human performs a sequence of movements and mouth movements that match the audio media stream.

[0190] Furthermore, after the media function network element orchestrates and obtains the motion video driving the digital human, it can send the motion video of the digital human to the first terminal device of the call in the form of a video media stream for display. That is, the media function network element sends the motion video of the digital human to the first terminal device, and correspondingly, the first terminal device receives the motion video of the digital human from the media function network element, and the motion video of the digital human is displayed on the call interface of the first terminal device.

[0191] In the case where the digital human service belongs to the C2C service shown in Scenario 2 above, the media function network element also obtains, based on the above process, the multimedia action material and the arrangement order of the multimedia action material that match the audio media stream of the first terminal device to drive the digital human of the first terminal device, thereby arranging the action sequence of the digital human to drive the first terminal device, and sending the action video of the digital human driving the first terminal device to the second terminal device. This will not be elaborated further.

[0192] In the communication method shown in Figure 3, the media function network element in the communication network can request relevant information about the digital human's actions from the first device. This relevant information includes the index of multimedia action materials and the action features of the multimedia action materials. The media function network element can then match the real-time received audio media stream with the action features in the relevant information to determine the index of the multimedia action materials used to drive the digital human and the order of the multimedia action materials that drive the digital human, based on the action features associated with the real-time audio media stream (e.g., semantic or tempo matching). This allows for the arrangement of the digital human's action video based on the index and order of the multimedia action materials used to drive the digital human. Therefore, in a voice-driven digital human scenario, the media function network element can request relevant information about the actions of the full range of multimedia action materials indicating the digital human from the first device. Through simple feature matching, it can match the real-time received audio media stream with multimedia action materials strongly correlated with the audio features. This not only improves the experience of interacting with the digital human during a call but also has low processing complexity and minimal performance loss.

[0193] The following example illustrates the communication method shown in Figure 3, applied to the IMS network shown in Figure 1 above, in conjunction with a specific digital human service scenario.

[0194] Take as an example the case where the digital human service is the C2M service shown in Scenario 1 above, the first terminal device is the UE, the digital human is the AI assistant subscribed to by the UE, and the media function network element is the MF. The UE makes a phone call, accesses the IMS network, and triggers the digital human service of the subscribed AI assistant. The user asks the AI assistant a question via voice, and the AI assistant appears on the UE's screen in the form of a digital human, answering the question while performing actions.

[0195] As shown in Figure 4, the communication method includes:

[0196] S400-1: The UE makes a phone call through the IMS control plane to initiate the digital human service.

[0197] S400-2: The IMS control plane notifies the MF that the UE has started the digital human service.

[0198] For S400-1 and S400-2, the UE initiates a call to the AI assistant through the IMS network. IMS control plane network elements such as the CSCF (P-CSCF / I-CSCF / S-CSCF) or the HSS can determine, based on the call initiated by the UE, whether the UE has started the digital human service, and notify the MF so that the MF obtains the UE's digital human motion material and loads the digital human image.

[0199] For example, the UE sends a Session Initiation Protocol (SIP) invite message to the P-CSCF. This invite message may include the identifier of the calling UE and the identifier of the called AI assistant. For example, the identifier of the UE and the identifier of the AI assistant are Uniform Resource Identifiers (URIs). The P-CSCF sends the invite message to the I-CSCF. The I-CSCF can determine that the UE has subscribed to the corresponding digital human service based on the invite message and the UE's subscription information. The I-CSCF sends the invite message and the digital human service launch notification to the MF, so that the MF can determine that the UE has launched the digital human service and execute the following S401.

[0200] S401, MF sends a service request to the first device. Correspondingly, the first device receives the service request from MF.

[0201] The service request corresponds to the first request in S301 above. The service request is used to request the UE's digital human action description file, and the service request includes the UE ID. Optionally, the service request may also include information indicating the digital human, information indicating the acquisition of the action description file, etc., without limitation.

[0202] S402, the first device sends a service response to the MF. Correspondingly, the MF receives the service response from the first device.

[0203] The service response is the message responding to the service request, and it includes an action description file. Optionally, the service response may also include the UE ID.

[0204] The motion description file includes the frame order of each motion video in the motion video set and the motion features of each motion video. In a specific embodiment, the aforementioned multimedia motion material can refer to motion videos, the aforementioned multimedia motion material set can refer to a collection of motion videos, the aforementioned multimedia motion material index can refer to the frame order of the motion videos, and the aforementioned multimedia motion material motion features can refer to the motion features of the motion videos.

[0205] Each motion video represents a single action unit of the digital human, which can be of the following types: semantic action, beat (or rhythm) action, or rest action. It should be understood that the motion video set can include all motion videos used to represent the digital human's actions, such as rest action 1, rest action 2, beat action 1, beat action 2, semantic action 1, semantic action 2, etc.

[0206] Semantic actions are actions that express semantic content. For example, when saying "OK," the digital human makes an OK gesture, which corresponds to the body movements and expressions commonly used in a certain semantic context. Beat actions are actions that express the rhythm of speech. The content of the action is strongly related to the rhythm, but it is unrelated to the semantic content. For example, the digital human's hands may make some rhythmic, meaningless movements, which correspond to the body movements and expressions that match the speech rhythm. Idle actions are actions that express rest, intermittent, pause, or idle state, which are unrelated to both semantic content and speech rhythm. For example, slight body swaying or blinking corresponds to the unconscious slight body movements and expressions mentioned above.

[0207] Since an action video consists of multiple ordered video frames, its frame order is a sequence or range of frames. Action videos in a set can be stitched together, and their frame order can be sorted starting from 0 or 1, without restriction. The corresponding action video can be found by its frame order. Therefore, in a specific implementation, the frame order can be used as an index to identify the action video.
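As a minimal sketch of using the frame order as an index, the lookup below finds the action video that contains a given frame of the stitched set. The frame ranges are the ones used in the example of S407 in this description. The table layout and the function name `action_at` are assumptions made for illustration, and frame numbering is assumed to start from 0.

```python
import bisect

# (first frame, last frame, action video), sorted by first frame.
FRAME_RANGES = [
    (0, 2, "idle action 1"),
    (4, 8, "idle action 2"),
    (10, 16, "semantic action 2"),
    (17, 23, "beat action 10"),
    (30, 35, "beat action 15"),
]
STARTS = [first for first, _, _ in FRAME_RANGES]


def action_at(frame):
    """Return the action video whose frame range contains `frame`, or None."""
    i = bisect.bisect_right(STARTS, frame) - 1
    if i >= 0 and FRAME_RANGES[i][0] <= frame <= FRAME_RANGES[i][1]:
        return FRAME_RANGES[i][2]
    return None  # the frame belongs to no action video listed in this table


print(action_at(12))  # semantic action 2
print(action_at(3))   # None
```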

[0208] For example, the format of the file used to describe the frame order of the motion description file can be:

[0209] The file format used in motion description files to describe the motion characteristics of motion videos can be:

[0210] It should be understood that in the embodiments of this application, the action description file may be decomposed into multiple parts and in different formats (protobuf, json+binary, xml or other forms).
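Purely as an illustration of what a JSON form of such a file might look like, the sketch below parses a small description and indexes it by action name. Every field name (`actions`, `name`, `type`, `frames`, `feature`) is an assumption made for this sketch and is not the format defined by this application. The frame ranges and the "congratulations" feature are taken from the examples given for S406 and S407.

```python
import json

# Hypothetical layout: all field names here are assumptions for illustration.
SAMPLE_DESCRIPTION = """
{
  "actions": [
    {"name": "idle action 1", "type": "idle", "frames": [0, 2], "feature": null},
    {"name": "idle action 2", "type": "idle", "frames": [4, 8], "feature": null},
    {"name": "semantic action 2", "type": "semantic", "frames": [10, 16],
     "feature": "congratulations"},
    {"name": "beat action 10", "type": "beat", "frames": [17, 23], "feature": null},
    {"name": "beat action 15", "type": "beat", "frames": [30, 35], "feature": null}
  ]
}
"""


def load_description(text):
    """Parse the description and index the entries by action name."""
    return {entry["name"]: entry for entry in json.loads(text)["actions"]}


description = load_description(SAMPLE_DESCRIPTION)
print(description["semantic action 2"]["frames"])  # [10, 16]
```

The beat entries carry no feature value here only because the description gives no concrete beat feature to reproduce.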

[0211] After the MF obtains the action description file of the UE's digital human, it waits for the UE's audio input.

[0212] S403, the UE sends audio media stream #1 to the MF via the IMS control plane. Correspondingly, the MF receives audio media stream #1 from the UE via the IMS control plane.

[0213] For example, the UE's audio media stream #1 is a question posed to the AI assistant.

[0214] S404, the MF invokes the AI assistant based on audio media stream #1.

[0215] S405, the AI assistant sends audio media stream #2 to the MF. Correspondingly, the MF receives audio media stream #2 from the AI assistant.

[0216] For example, audio media stream #2 is the AI assistant's answer to the question posed by the user (UE).

[0217] S406, MF determines the frame order and sequence of the motion video used to drive the digital human based on the audio media stream #2 and the motion description file.

[0218] The MF extracts audio features from audio media stream #2, such as speech content, beat features, and idle features, and matches them with the action features in the action description file to obtain the frame order of the action videos associated with the matching audio features. This frame order identifies the action videos used to drive the digital human. At the same time, the order of the action videos used to drive the digital human, i.e., the arrangement order of the action videos or the execution order of the actions, can also be obtained. The frame order of the action videos used to drive the digital human is one implementation of the index of the multimedia action materials used to drive the digital human in S304 above, and can be used as an index to match and obtain the action videos used to drive the digital human. The order of the action videos used to drive the digital human corresponds to the order of the multimedia action materials used to drive the digital human in S304 above, and is used to arrange the obtained action videos in sequence. Specific implementation details can be found in the relevant description in S304 above, and are not limited thereto.

[0219] For example, as shown in Figure 5, the implementation process of MF using hierarchical matching to determine the frame order of the motion video used to drive the digital human includes:

[0220] 5.1 MF receives audio media stream #2.

[0221] For example, audio media stream #2 is asynchronously input to MF in 20ms increments of audio data packets.

[0222] 5.2 MF buffers audio data packets for audio media stream #2.

[0223] 5.3. MF determines whether the length of the buffered audio data packets has reached the set threshold length. If yes, proceed to step 5.4; otherwise, return to step 5.2.

[0224] 5.4. MF performs semantic action feature matching on audio data packets to obtain the frame sequence of the action video used to drive the semantic actions of the digital human.

[0225] 5.5. MF performs beat motion feature matching on audio data packets to obtain the frame sequence of the motion video used to drive the beat motion of the digital human.

[0226] 5.6. MF performs idle motion feature matching on audio data packets to obtain the frame sequence of motion video used to drive the idle motion of the digital human.

[0227] The threshold length is set to change dynamically. For example, at the beginning of a response, the MF first caches 1 second of audio data packets and performs action feature matching as described in sections 5.4 to 5.6 above on these 1 second audio data packets to obtain the frame sequence of the action video used to drive the digital human. It then requests the action video used to drive the digital human from the first device and arranges the 1 second action video corresponding to these 1 second audio data packets to send to the UE for playback. During the playback of this 1 second action video, the MF synchronously caches N (N is much greater than 1, in scenarios like AI assistants) seconds of audio data packets and performs action feature matching as described in sections 5.4 to 5.6 on these cached N seconds of audio data packets again to obtain N seconds of action video. It then plays the N seconds of action video to the UE and synchronously caches subsequent audio until the end of this response.
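The dynamic threshold can be sketched as a generator that releases a first short chunk and then longer chunks. This is a simplified illustration under stated assumptions: the function name `chunk_packets` is hypothetical, N is set to 5 seconds only as an example value, and playback timing is not modeled.

```python
PACKET_MS = 20  # each audio data packet carries 20 ms of audio


def chunk_packets(packets, first_threshold_ms=1000, later_threshold_ms=5000):
    """Yield buffered packets each time the current threshold length is reached.

    The first chunk uses the short threshold (1 second). Later chunks use the
    longer threshold (N seconds, with N = 5 assumed here).
    """
    buffered = []
    threshold_ms = first_threshold_ms
    for packet in packets:
        buffered.append(packet)                        # step 5.2: buffer
        if len(buffered) * PACKET_MS >= threshold_ms:  # step 5.3: check length
            yield buffered                             # hand over to 5.4-5.6
            buffered = []
            threshold_ms = later_threshold_ms
    if buffered:
        yield buffered  # remaining audio at the end of the response


# 400 packets of 20 ms are 8 seconds of audio.
print([len(chunk) for chunk in chunk_packets(range(400))])  # [50, 250, 100]
```

Starting with a short first chunk keeps the delay before the first action video low, while the longer later chunks give the matching steps more context.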

[0228] For 5.4 above, MF matches the speech content features and action features of the extracted audio data packets. For 5.5 above, MF matches the beat features and action features of the extracted audio data packets. For 5.6 above, MF matches the audio portion of the audio data packet for which speech content features and beat features (i.e., audio features) have not been extracted, based on the length of the audio portion and the action features. At this time, the action features can indicate the length and type of the idle action as idle features, thereby matching the frame sequence of the action video that meets the length requirement to drive the idle action of the digital human.

[0229] As can be seen from the above implementation, hierarchical matching refers to matching action features in layers. In the hierarchical matching process, action features are matched one level at a time in the order of semantic action, beat action, and idle action. In other words, the matching priority of semantic action is higher than that of beat action, and the matching priority of beat action is higher than that of idle action. This is not limited.

[0230] As a further example, the logic of hierarchical action matching is shown in Figure 6. The text content corresponding to the cached audio data packets is "That's great news, congratulations, congratulations, congratulations!" The MF performs semantic action feature matching based on the cached audio data packets and the action description file. For example, it matches the action feature that expresses "congratulations" in the speech, and this action feature is associated with the frame order of semantic action 2. The MF determines that the number of frames of semantic action 2 is much smaller than the number of frames required for the duration of the audio, so it continues to match the action features of beat actions, for example matching the action features of beat action 10 and beat action 15 and determining their frame orders. The MF then checks whether the number of frames of semantic action 2 + beat action 10 + beat action 15 reaches the number of frames required for the duration of the audio. If not, the remaining unmatched frames can be filled by matching idle actions of the corresponding length. For example, if idle action 1 and idle action 2 are matched, the number of frames of the matched motion videos used to drive the digital human equals the number of frames required by the audio data. At the same time, the execution order of the matched actions can be determined according to the matching order, such as idle action 2, beat action 10, idle action 1, semantic action 2, idle action 1, beat action 15.
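The priority order of semantic, beat, and idle actions can be sketched as a frame budget that is consumed tier by tier. This is a simplified illustration under stated assumptions. The function names are hypothetical. The frame counts are derived from the frame ranges in the example of S407 (for instance, range 10-16 is 7 frames). The exact-fill search for idle actions is one possible strategy and is not stated in the application. The sketch selects which actions to use; it does not model the execution order, which depends on where in the audio each feature was matched.

```python
def fill_idle(remaining, idle_pool):
    """Return idle actions whose frame counts sum exactly to `remaining`, or None."""
    if remaining == 0:
        return []
    for name, frames in idle_pool:
        if frames <= remaining:
            rest = fill_idle(remaining - frames, idle_pool)
            if rest is not None:
                return [(name, frames)] + rest
    return None


def hierarchical_select(required_frames, semantic_hits, beat_hits, idle_pool):
    """Pick actions by priority: semantic first, then beat, then idle filler."""
    chosen = []
    remaining = required_frames
    for tier in (semantic_hits, beat_hits):
        for name, frames in tier:
            if frames <= remaining:
                chosen.append((name, frames))
                remaining -= frames
    filler = fill_idle(remaining, idle_pool)
    return chosen + (filler or [])


selection = hierarchical_select(
    required_frames=31,
    semantic_hits=[("semantic action 2", 7)],
    beat_hits=[("beat action 10", 7), ("beat action 15", 6)],
    idle_pool=[("idle action 1", 3), ("idle action 2", 5)],
)
print([name for name, _ in selection])
print(sum(frames for _, frames in selection))  # 31
```

With these inputs the selection uses semantic action 2, both beat actions, idle action 1 twice, and idle action 2 once, which is the same set of actions as in the example above.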

[0231] Therefore, MF obtains the frame sequence and the order of the motion video used to drive the digital human, and requests the first device to obtain the motion video used to drive the digital human.

[0232] S407, MF sends an action request to the first device. Correspondingly, the first device receives the action request from MF.

[0233] The motion request is used to request motion videos. The motion request includes the frame order of the motion videos used to drive the digital human, arranged in the order they are used to drive the digital human, or the MF requests the motion videos one by one from the first device in the order they are used to drive the digital human. The motion request corresponds to the second request in S304 above.

[0234] For example, the frame order of the motion video used to drive the digital human includes frame order range 10-16 for semantic action 2, frame order range 17-23 for beat action 10, frame order range 30-35 for beat action 15, frame order range 0-2 for idle action 1, and frame order range 4-8 for idle action 2. The order of the motion video driving the digital human is idle action 2, beat action 10, idle action 1, semantic action 2, idle action 1, beat action 15. Then, the frame order of the motion video used to drive the digital human in the motion request is 4-8, 17-23, 0-2, 10-16, 0-2, 30-35.
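The example above can be reproduced with a short sketch that maps the execution order to frame order ranges. The dictionary layout and the function name `build_action_request` are assumptions made for illustration. The values are those given in the example.

```python
FRAME_RANGES = {
    "semantic action 2": (10, 16),
    "beat action 10": (17, 23),
    "beat action 15": (30, 35),
    "idle action 1": (0, 2),
    "idle action 2": (4, 8),
}

EXECUTION_ORDER = [
    "idle action 2", "beat action 10", "idle action 1",
    "semantic action 2", "idle action 1", "beat action 15",
]


def build_action_request(order, frame_ranges):
    """List the frame order ranges in the order the actions drive the digital human."""
    return ", ".join(
        f"{first}-{last}" for first, last in (frame_ranges[name] for name in order)
    )


print(build_action_request(EXECUTION_ORDER, FRAME_RANGES))
# 4-8, 17-23, 0-2, 10-16, 0-2, 30-35
```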

[0235] Therefore, the first device can feed back the corresponding motion video to the MF according to the frame order and frame sequence in the motion request.

[0236] S408, The first device sends an action response to the MF. Correspondingly, the MF receives the action response from the first device.

[0237] The action response is a response message to the action request. The action response may include the action videos used to drive the digital human, arranged in the order in which they drive the digital human. Alternatively, the action videos for idle action 2, beat action 10, idle action 1, semantic action 2, idle action 1, and beat action 15 are sent to the MF in sequence.

[0238] S409, the MF obtains the motion video of the digital human based on the motion videos used to drive the digital human.

[0239] After receiving the motion videos used to drive the digital human, the MF can determine whether the received motion videos are arranged or sent in sequence. If so, it can directly arrange the motion videos in the order of receipt. The motion video can show the actions, expressions and sounds performed by the digital human in response to audio media stream #2, without limitation.

[0240] S410, the MF transmits the motion video of the digital human to the UE via the IMS control plane. Correspondingly, the UE receives the motion video of the digital human from the MF via the IMS control plane.

[0241] For example, the motion video is displayed on the UE's call interface. It should be understood that the motion video of the digital human sent to the UE by the MF in S410 is a motion video obtained by arranging multiple motion videos representing a single action in sequence, and the resulting motion video represents a sequence of actions.

[0242] Therefore, in the C2M service scenario, the MF can learn from the IMS control plane that the UE has started the digital human service corresponding to the AI assistant, and then request the first device to obtain the action description file of the digital human. Based on the audio media stream sent by the AI assistant, the MF performs action feature matching, obtains the frame order and arrangement order of the action video associated with the action features of the matching audio media stream, and requests the corresponding action video from the first device according to the frame order. The voice-driven action video of the digital human is obtained by arranging the video in sequence. In this way, the call experience can be improved by driving body movements in real time based on voice content in a low-cost and high-experience way.

[0243] It should be understood that the communication method shown in Figure 4 can also be applied to the virtual character call service scenario shown in Scenario 3 above. In some implementations, because the motion materials of virtual characters such as virtual celebrity images may differ, or because there are other types of digital human service scenarios, the first device can feed back the motion video set together with the motion description file in response to the service request. For example, the first device sends motion-related files to the MF, and the motion-related files include the motion description file and the motion video set. In this case, the MF does not need to execute S407 and S408 above. Instead, the MF completes motion feature matching based on the audio media stream, obtains the frame order of the motion videos used to drive the digital human, and determines locally, based on the frame order and the obtained motion video set, the motion videos associated with the frame order. The motion videos used to drive the digital human are then arranged according to the determined order of the motion videos used to drive the digital human.
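When the motion video set is already held locally, the arrangement can be sketched as slicing the stitched frames by frame order range. This is a minimal illustration: the function name `assemble_locally` is hypothetical, frames are represented by their frame numbers, and frame numbering is assumed to start from 0.

```python
def assemble_locally(video_set_frames, ordered_ranges):
    """Concatenate frames from the local video set in the determined order.

    `ordered_ranges` holds inclusive (first, last) frame order ranges.
    """
    arranged = []
    for first, last in ordered_ranges:
        arranged.extend(video_set_frames[first:last + 1])
    return arranged


video_set = list(range(36))  # stand-in for the frames of the stitched set
print(assemble_locally(video_set, [(4, 8), (0, 2)]))
# [4, 5, 6, 7, 8, 0, 1, 2]
```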

[0244] For example, taking the digital human service as a C2C service as shown in Scenario 2 above, and taking a call between UE1 and UE2 as an example, where UE1 is the called party and UE2 is the caller, UE1 and UE2 respectively display the other party's preset digital human image during the call, and the actions of the displayed digital human match the other party's speech. As shown in Figure 7, this communication method includes:

[0245] S700-1, UE2 calls UE1 through the IMS control plane to initiate the digital human service.

[0246] S700-2, the IMS control plane notifies the MF that UE1 and UE2 have started the digital human service.

[0247] For S700-1 and S700-2, UE2 calls UE1 through the IMS network. IMS control plane network elements such as CSCF (P-CSCF / I-CSCF / S-CSCF) or HSS can determine whether to trigger the digital human service between UE2 and UE1 based on UE2's call, and notify MF to trigger MF to obtain the motion materials of the digital human of UE1 and UE2 for digital human image loading. The specific implementation process is similar to that of S400-1 and S400-2 above, and will not be described in detail here.

[0248] S701, MF sends a service request to the first device. Correspondingly, the first device receives the service request from MF.

[0249] The difference from S401 above is that the service request can simultaneously request motion description files for the digital humans corresponding to UE1 and UE2, respectively, or the MF can send two service requests to the first device, one for requesting the motion description file for the digital human of UE1 and the other for requesting the motion description file for the digital human of UE2. In this example, since the process of driving the digital humans of UE1 and UE2 is similar, taking the sending of the motion video of driving the digital human of UE2 to UE1 as an example, the service request includes the UE2 ID.

[0250] S702, the first device sends a service response to the MF. Correspondingly, the MF receives the service response from the first device.

[0251] The service response is the message responding to the service request, and it includes the motion description file of the UE2's digital human. Optionally, the service response may also include the UE2 ID.

[0252] For a detailed description of the action description file, please refer to the relevant description in S402 above, which will not be repeated here. For a detailed implementation of S701 and S702, please refer to the relevant descriptions in S401 and S402 above, which will not be repeated here.

[0253] S703, UE2 sends an audio media stream to the MF via the IMS control plane. Correspondingly, the MF receives the audio media stream from UE2 via the IMS control plane.

[0254] For example, the audio media stream is the call audio sent by UE2 to UE1.

[0255] S704, the MF determines the frame order and sequence of the motion videos used to drive the digital human based on the audio media stream and the motion description file.

[0256] The implementation process of S704 is similar to that of S406 above, and will not be repeated here. It should be understood that the driven digital human is the digital human of UE2.

[0257] For real-time call C2C service scenarios, as exemplified in Figure 8, the implementation process in which the MF uses hierarchical action matching to determine the frame order of the action video used to drive the digital human differs from that of the C2M service scenario above. Because a C2C service does not output a long audio segment at once as a C2M service does, but instead outputs audio segment by segment, beat action matching can be omitted to keep the digital human in step with the audio in real time. This implementation process includes:

[0258] 8.1 MF receives audio media streams.

[0259] For example, audio media streams are asynchronously input to MF in 20ms increments of audio data packets.

[0260] 8.2 MF buffers audio data packets for audio media streams.

[0261] 8.3. MF determines whether the current action video has finished playing. If so, it executes step 8.4. Otherwise, it waits for the video to finish playing before executing step 8.4.

[0262] 8.4. MF performs semantic action feature matching on audio data packets and determines whether the semantic action features are matched. If they are matched, the frame sequence of the action video used to drive the semantic actions of the digital human is obtained. Otherwise, proceed to 8.5.

[0263] 8.5. MF performs idle motion feature matching on audio data packets to obtain the frame sequence of motion video used to drive the idle motion of the digital human.

[0264] As a further example, the logic of hierarchical action matching is shown in Figure 9. The text content corresponding to an audio data stream sent by the terminal device is "What great news! Congratulations! Congratulations!". For "What great news!", the MF has already completed feature matching and action orchestration on its previously cached audio, and the resulting action video driving the digital human has been sent to UE1. The text content corresponding to the audio data packets currently cached by the MF is "Congratulations!". Before performing action feature matching on the cached audio data packets, the MF needs to determine whether the action video corresponding to the audio data "What great news!" has finished playing. If it has, semantic action feature matching is performed. If it has not, the MF waits for the video corresponding to "What great news!" to finish playing and performs semantic action feature matching after receiving the audio segment "Congratulations! Congratulations!". If an action feature expressing "Congratulations" is matched, this action feature is associated with the frame order of semantic action 2. Since the duration of semantic action 2 exceeds the length of the currently cached audio, audio portions exceeding the cached audio length are not matched. If the audio data cached after semantic action 2 cannot match the action feature of any semantic action, the action feature of an idle action is matched; that is, an idle action, such as idle action 0, is selected to fill the part of the audio data stream that cannot match a semantic action.
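Steps 8.3 to 8.5 can be sketched as a single decision function. This is a simplified illustration under stated assumptions: the function name `next_action` is hypothetical, and matching keywords against recognized text stands in for the matching of audio features against action features, whose actual form is not specified here.

```python
def next_action(cached_text, current_video_finished, semantic_keywords,
                idle_action="idle action 0"):
    """Choose the next action for the cached audio in the C2C case.

    Returns None while the current action video is still playing (step 8.3),
    a semantic action if a semantic feature matches (step 8.4), and otherwise
    an idle action (step 8.5). Beat matching is omitted, as in the C2C flow.
    """
    if not current_video_finished:
        return None  # wait for playback to finish before matching
    lowered = cached_text.lower()
    for keyword, action in semantic_keywords.items():
        if keyword in lowered:
            return action
    return idle_action


keywords = {"congratulations": "semantic action 2"}
print(next_action("Congratulations!", True, keywords))   # semantic action 2
print(next_action("Well, let me see.", True, keywords))  # idle action 0
print(next_action("Congratulations!", False, keywords))  # None
```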

[0265] Therefore, MF obtains the frame sequence and the order of the motion video used to drive the digital human, and requests the first device to obtain the motion video used to drive the digital human.

[0266] S705, MF sends an action request to the first device. Correspondingly, the first device receives the action request from MF.

[0267] S706, The first device sends an action response to the MF. Correspondingly, the MF receives the action response from the first device.

[0268] S707, the MF obtains the motion video of the digital human based on the motion videos used to drive the digital human.

[0269] For the specific implementation process of S705 to S707, refer to the descriptions of S407 to S409 above, which will not be repeated here.

[0270] S708, the MF transmits the motion video of the digital human to UE1 via the IMS control plane. Correspondingly, UE1 receives the motion video of the digital human from the MF via the IMS control plane.

[0271] The motion video of the digital human is used to drive the digital human of UE2 to perform actions related to the audio media stream sent by UE2, and the motion video is displayed on the call interface of UE1.

[0272] Therefore, in a C2C business scenario, MF can request the motion description file of the digital human from the first device, and perform motion feature matching based on the audio media stream sent by the terminal device to obtain the frame order and arrangement order of the motion video associated with the motion features of the matching audio media stream. Based on the frame order, MF requests the corresponding motion video from the first device and arranges it in order to obtain the voice-driven digital human motion video. Thus, in a low-cost and high-experience way, it can drive body movements in real time based on voice content, thereby improving the call experience.

[0273] Figures 4 and 7 illustrate the application in an IMS network architecture. With a fixed bandwidth between the MF and the first device, the index and feature files of the interactive motion video are used to perform feature matching with the real-time audio media stream. The relevant motion video is downloaded from the first device using the index of the matched motion video. This eliminates the need to download the entire motion video from the first device at once, reducing startup bandwidth and startup latency. The MF can then support digital human services for more users.

[0274] It is understood that, in the above embodiments, the methods and / or steps implemented by the media function network element can also be implemented by components (e.g., processors, chips, chip systems, circuits, logic modules, or software) that can be used in the media function network element; the methods and / or steps implemented by the first device can also be implemented by components (e.g., processors, chips, chip systems, circuits, logic modules, or software) that can be used in the first device.

[0275] The foregoing mainly describes the solutions provided in this application. Accordingly, this application also provides a communication device for implementing various methods in the above method embodiments. This communication device can be a media function network element in the above method embodiments, or a device containing a media function network element, or a component that can be used in a media function network element, such as a chip or chip system. Alternatively, the communication device can be the first device in the above method embodiments, or a device containing a first device, or a component that can be used in a first device, such as a chip or chip system.

[0276] It is understood that, in order to achieve the aforementioned functions, the communication device includes hardware structures and / or software modules corresponding to the execution of each function. Those skilled in the art should readily recognize that, based on the units and algorithm steps of the examples described in conjunction with the embodiments disclosed herein, this application can be implemented in hardware or a combination of hardware and computer software. Whether a function is executed in hardware or by computer software driving hardware depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

[0277] In the embodiments of this application, the communication device can be divided into functional modules according to the above method embodiments. For example, each function can be assigned to a separate functional module, or two or more functions can be integrated into one processing module. The integrated module can be implemented in hardware or as a software functional module. It should be noted that the module division in the embodiments of this application is illustrative and represents only one logical functional division. In actual implementation, there may be other division methods.

[0278] Taking the case where the communication device is the media function network element or the first device in the above method embodiments as an example, Figure 10 is a schematic diagram of the structure of a communication device provided in an embodiment of this application. As shown in Figure 10, the communication device 1000 includes a processing module 1001 and a transceiver module 1002. The processing module 1001 is used to execute the processing functions of the media function network element or the first device in the above method embodiments. The transceiver module 1002 is used to execute the transceiver functions of the media function network element or the first device in the above method embodiments. For all relevant content of each step involved in the above method embodiments, reference may be made to the functional description of the corresponding functional module; it is not repeated here.

[0279] In one possible design, according to this embodiment, the transceiver module 1002 may include a receiving module and a transmitting module (not shown in FIG. 10). The transmitting module and the receiving module are respectively used to implement the transmitting and receiving functions of the communication device 1000.

[0280] In one possible design, the communication device 1000 may further include a storage module (not shown in FIG. 10) that stores programs or instructions. When the processing module 1001 executes the program or instructions, the communication device 1000 can perform the functions of the media function network element or the first device in the method shown in FIG. 3, FIG. 4, or FIG. 7.
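
The module division described in paragraphs [0278] to [0280] can be pictured with the following sketch. It is only one possible logical division, written under assumptions: the class names, the in-memory queues that stand in for the network, and the `handlers` table that plays the role of the stored program are all hypothetical.

```python
class TransceiverModule:
    """Receiving and transmitting functions; lists stand in for the network."""

    def __init__(self):
        self.inbox = []   # messages waiting to be received
        self.outbox = []  # messages that have been sent

    def receive(self):
        return self.inbox.pop(0) if self.inbox else None

    def send(self, message):
        self.outbox.append(message)


class ProcessingModule:
    """Processing functions; `handlers` models the stored program or instructions."""

    def __init__(self, handlers):
        self.handlers = handlers  # message type -> function producing a reply

    def process(self, message):
        return self.handlers[message["type"]](message)


class CommunicationDevice:
    """A device composed of a processing module and a transceiver module."""

    def __init__(self, handlers):
        self.transceiver = TransceiverModule()
        self.processing = ProcessingModule(handlers)

    def step(self):
        """Handle one pending message; return False when nothing is pending."""
        message = self.transceiver.receive()
        if message is None:
            return False
        self.transceiver.send(self.processing.process(message))
        return True


# Example: a device acting as the first device answers a first request.
device = CommunicationDevice({
    "first_request": lambda m: {"type": "motion_info", "indices": [0, 1, 2]},
})
device.transceiver.inbox.append({"type": "first_request"})
handled = device.step()
```

The same skeleton could play the role of the media function network element by loading a different handler table, which mirrors the statement that one structure serves either device.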

[0281] In some embodiments, the processing module 1001 involved in the communication device 1000 may be implemented by a processor or processor-related circuit components, and may be a processor or processing unit; the transceiver module 1002 may be implemented by a transceiver or transceiver-related circuit components, and may be a transceiver or transceiver unit.

[0282] For example, FIG. 11 is a schematic diagram of another communication device provided in an embodiment of this application. This communication device may be a media function network element or a first device in the above method embodiments, or it may be a chip (system) or other component or assembly that can be disposed in the media function network element or the first device. As shown in FIG. 11, the communication device 1100 may include a processor 1101, a bus 1102, a communication interface 1103, and a memory 1104. The processor 1101, the memory 1104, and the communication interface 1103 communicate via the bus 1102. It should be understood that this application does not limit the number of processors and memories in the communication device 1100.

[0283] Bus 1102 can be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. Buses can be categorized as address buses, data buses, control buses, etc. For ease of illustration, only one line is used in Figure 11, but this does not imply that there is only one bus or one type of bus. Bus 1102 can include pathways for transmitting information between various components of communication device 1100 (e.g., memory 1104, processor 1101, communication interface 1103).

[0284] Processor 1101 may include any one or more of the following processors: central processing unit (CPU), graphics processing unit (GPU), microprocessor (MP), or digital signal processor (DSP).

[0285] The memory 1104 may include volatile memory, such as random access memory (RAM). The memory 1104 may also include non-volatile memory, such as read-only memory (ROM), flash memory, hard disk drive (HDD), or solid state drive (SSD).

[0286] The communication interface 1103 uses transceiver modules such as, but not limited to, network interface cards and transceivers to enable communication between the communication device 1100 and other devices or communication networks.

[0287] The memory 1104 stores executable program code, which the processor 1101 executes to implement the functions of the media function network element or the first device in the aforementioned method embodiments. That is, the memory 1104 stores instructions for executing the aforementioned communication methods.

[0288] In another aspect, embodiments of this application also provide a computer program product containing instructions, including computer program code, which, when run on a communication device, enables the communication device to execute the methods described in the above embodiments.

[0289] Furthermore, embodiments of this application also provide a computer-readable storage medium. This computer-readable storage medium stores a computer program or instructions that, when executed on a communication device, enable the communication device to perform the methods described in the above embodiments.

[0290] Furthermore, embodiments of this application also provide a communication system, which includes the aforementioned media function network element and the first device. Optionally, the communication system further includes the aforementioned terminal device.

[0291] The above embodiments can be implemented, in whole or in part, through software, hardware, firmware, or any combination thereof. When implemented using a software program, they can be implemented, in whole or in part, in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, all or part of the flows or functions according to the embodiments of this application are produced. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions can be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium accessible to a computer, or a data storage device, such as a server or data center, that integrates one or more available media. The available media can be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., digital video discs (DVDs)), or semiconductor media (e.g., SSDs).

[0292] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

[0293] Those skilled in the art will understand that, for convenience and brevity, the specific working processes of the systems, devices, and units described above are not repeated here; reference may be made to the corresponding processes in the foregoing method embodiments.

[0294] In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. The apparatus embodiments described above are merely illustrative. For instance, the division of units is only a logical functional division, and there may be other division methods in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.

[0295] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs. Furthermore, the functional units in the various embodiments of this application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.

[0296] If the aforementioned functions are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or a portion of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, ROM, random access memory (RAM), magnetic disks, or optical disks.

[0297] Although this application has been described herein in conjunction with various embodiments, those skilled in the art, by reviewing the accompanying drawings, the disclosure, and the appended claims, will understand and implement other variations of the disclosed embodiments in carrying out the claimed application. In the claims, the word "comprising" does not exclude other components or steps, and "a" or "an" does not exclude multiple instances. A single processor or other unit can implement several functions listed in the claims. While different dependent claims may recite certain measures, this does not mean that these measures cannot be combined to produce good results.

[0298] Although this application has been described in conjunction with specific features and embodiments, it is obvious that various modifications and combinations can be made thereto without departing from the spirit and scope of this application. Accordingly, this specification and drawings are merely exemplary illustrations of this application as defined by the appended claims, and are considered to cover any and all modifications, variations, combinations, or equivalents within the scope of this application. Clearly, those skilled in the art can make various alterations and variations to this application without departing from the spirit and scope of this application. Thus, if such modifications and variations of this application fall within the scope of the claims of this application and their equivalents, this application is also intended to include such modifications and variations.

Claims

1. A communication method, characterized in that the method is applied to a media function network element in a communication network and includes: sending a first request to a first device, the first request being used to request information related to actions of a digital human, the information related to the actions including an index of multimedia motion material and action features of the multimedia motion material; receiving an audio media stream, and receiving the information related to the actions from the first device; and determining, based on the audio media stream and the information related to the actions, an index of the multimedia motion material used to drive the digital human and an order of the multimedia motion material used to drive the digital human, wherein the determined order and the multimedia motion material associated with the determined index are used to arrange the motion video of the digital human.

2. The method according to claim 1, characterized in that determining, based on the audio media stream and the information related to the actions, the index of the multimedia motion material used to drive the digital human includes: extracting audio features of the audio media stream; and matching the audio features with the action features in the information related to the actions to determine the index of the multimedia motion material used to drive the digital human.

3. The method according to claim 1 or 2, characterized in that the multimedia motion material is generated offline, and the multimedia motion material is a video or a set of pictures that represents real or virtual actions in multimedia form.

4. The method according to claim 3, characterized in that the real or virtual actions represented by the multimedia motion material include at least one of the following: body movements and expressions commonly used in a certain semantic environment, body movements and expressions that match the rhythm of speech, and unconscious slight body movements and expressions.

5. The method according to any one of claims 1-4, characterized in that the method further includes: sending a second request to the first device, the second request being used to request multimedia motion material and including the index of the multimedia motion material used to drive the digital human and the order of the multimedia motion material used to drive the digital human; receiving, from the first device, the multimedia motion material associated with the index of the multimedia motion material used to drive the digital human, wherein the multimedia motion material associated with the index is sent in the order of the multimedia motion material used to drive the digital human; and driving the digital human based on the received multimedia motion material.

6. The method according to claim 5, characterized in that, before driving the digital human based on the received multimedia motion material, the method further includes: determining that the received multimedia motion material associated with the index of the multimedia motion material used to drive the digital human is the next multimedia motion material to be sent.

7. The method according to any one of claims 1-4, characterized in that the method further includes: sending a second request to the first device, the second request being used to request multimedia motion material and including the index of the multimedia motion material used to drive the digital human; receiving, from the first device, the multimedia motion material associated with the index of the multimedia motion material used to drive the digital human; and driving the digital human according to the determined order of the multimedia motion material used to drive the digital human.

8. The method according to any one of claims 1-4, characterized in that the information related to the actions further includes the multimedia motion material.

9. The method according to claim 8, characterized in that the method further includes: determining, from the multimedia motion material, the multimedia motion material associated with the index of the multimedia motion material used to drive the digital human.

10. The method according to claim 2, characterized in that the real or virtual actions represented by the multimedia motion material include body movements and facial expressions commonly used in a certain semantic environment and body movements and facial expressions that match the rhythm of speech; the audio features include speech content features and rhythm features; and matching the audio features with the action features in the information related to the actions to determine the index of the multimedia motion material used to drive the digital human includes: performing, based on the audio features and the information related to the actions, semantics-related action feature matching and speech-rhythm-related action feature matching, to determine the index of the multimedia motion material that is used to drive the digital human and is associated with the action features matched with the speech content features, and the index of the multimedia motion material that is used to drive the digital human and is associated with the action features matched with the rhythm features.

11. The method according to claim 10, characterized in that the real or virtual actions represented by the multimedia motion material further include unconscious slight body movements and facial expressions; and determining, based on the audio media stream and the information related to the actions, the index of the multimedia motion material used to drive the digital human further includes: if the sum of the length of the multimedia motion material associated with the action features matched with the speech content features and the length of the multimedia motion material associated with the action features matched with the rhythm features is less than the length of the audio media stream, matching unconscious slight body movements and facial expressions against the information related to the actions according to the length of the audio portion of the audio media stream from which neither speech content features nor rhythm features have been extracted, and determining the index of the multimedia motion material used to drive the digital human to perform unconscious slight body movements and facial expressions.

12. The method according to any one of claims 1-11, characterized in that receiving the audio media stream includes: receiving the audio media stream from an AI assistant of a first terminal device, or receiving the audio media stream from a second terminal device, wherein the second terminal device communicates with the first terminal device through the communication network; and the method further includes: sending the motion video of the digital human to the first terminal device.

13. A communication method, characterized in that the method is applied to a first device and includes: receiving a first request from a media function network element, the first request being used to request information related to actions of a digital human, the information related to the actions including an index of multimedia motion material and action features of the multimedia motion material; and sending the information related to the actions to the media function network element.

14. The method according to claim 13, characterized in that the multimedia motion material is generated offline, and the multimedia motion material is a video or a set of pictures that represents real or virtual actions in multimedia form.

15. The method according to claim 14, characterized in that the real or virtual actions represented by the multimedia motion material include at least one of the following: body movements and expressions commonly used in a certain semantic environment, body movements and expressions that match the rhythm of speech, and unconscious slight body movements and expressions.

16. The method according to any one of claims 13-15, characterized in that the method further includes: receiving a second request from the media function network element, the second request being used to request multimedia motion material and including an index of the multimedia motion material used to drive the digital human and an order of the multimedia motion material used to drive the digital human; and sending, to the media function network element, the multimedia motion material associated with the index of the multimedia motion material used to drive the digital human, wherein the multimedia motion material associated with the index is sent in the order of the multimedia motion material used to drive the digital human.

17. The method according to any one of claims 13-15, characterized in that the method further includes: receiving a second request from the media function network element, the second request being used to request multimedia motion material and including an index of the multimedia motion material used to drive the digital human; and sending, to the media function network element, the multimedia motion material associated with the index of the multimedia motion material used to drive the digital human.

18. The method according to any one of claims 13-15, characterized in that the information related to the actions further includes the multimedia motion material.

19. A communication device, characterized by including modules for performing the method as described in any one of claims 1-18.

20. A communication device, characterized by including a processor, wherein the processor is configured to run a computer program or instructions to implement the method as described in any one of claims 1-18.

21. A communication chip, characterized in that the chip stores instructions that, when the chip runs on a communication device, cause the method as described in any one of claims 1-18 to be implemented.

22. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program or instructions that, when executed by a communication device, implement the method as described in any one of claims 1-18.

23. A computer program product, characterized in that the computer program product includes computer program code that, when run on a communication device, implements the method as described in any one of claims 1-18.

24. A communication system, characterized by including: a media function network element for performing the method as described in any one of claims 1-12, and a first device for performing the method as described in any one of claims 13-18.