Media resource playing method and related apparatus

By sending audio information requests and playing media resources through the calling terminal device, the problem of users being unable to interact in real time is solved, enabling flexible control and real-time feedback of media resources, thus improving the user experience.

WO2025140028A9 · PCT designated stage · Publication Date: 2026-06-11 · HUAWEI TECH CO LTD

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
HUAWEI TECH CO LTD
Filing Date
2024-12-20
Publication Date
2026-06-11

AI Technical Summary

Technical Problem

In existing technologies, the media resources played by the calling terminal device are configured by the media platform server, and users cannot interact with them in real time, resulting in reduced user enjoyment and operability.

Method used

The calling terminal device requests the playback of media resources by sending audio information, and receives and plays the media resources determined by the audio information. Users can flexibly control the playback and interaction of media resources through voice commands, including real-time feedback of audio, images, animations and text information.

Benefits of technology

It enhances the user experience, enables real-time interaction and flexible control of media resources, and increases the fun and user-friendliness of operation.

Generated by Eureka AI based on patent content.


Abstract

Provided in the present application are a media resource playing method and a related apparatus. The method comprises: sending a call request to a called terminal device; sending first audio information, wherein the first audio information is used for requesting the play of a media resource; and receiving a first media resource and playing same, wherein the first media resource is determined by means of the first audio information. After sending a call request to a called terminal device, a calling terminal device requests the play of a media resource by means of sending first audio information, and the calling terminal device then receives a first media resource and plays same. Therefore, by means of a voice instruction, a user can flexibly control the media resource played by the terminal device, thereby bringing more friendly and more interesting user experience to the user.

Description

A method and related apparatus for playing media resources

[0001] This application claims priority to Chinese Patent Application No. CN202311864928.X, filed on December 29, 2023, entitled "A Method for Playing Media Resources and Related Devices", the entire contents of which are incorporated herein by reference.

Technical Field

[0002] This application relates to the field of communication technology, and in particular to a method and apparatus for playing media resources.

Background Technology

[0003] With the continuous development of communication technology, Voice over Long-Term Evolution (VoLTE) technology has gradually entered people's lives, allowing them to enjoy various types of media resources. For example, users can enjoy video experiences such as video ringback tones and video customer service while making voice calls, making the pre-call waiting period more interesting and greatly improving the user's calling experience.

[0004] Currently, media resource playback methods are typically based on the communication technology (CT) domain, also known as the telecommunications domain. The corresponding process is as follows: the calling terminal device initiates a call to the called terminal device. When the called terminal device rings, the media server in the CT domain retrieves the media resources corresponding to the user's subscription information from the media platform server. Then, after receiving the confirmation message from the calling terminal device, the media platform server is instructed to start playing the media resources for the calling terminal device.

[0005] Since the media resources played by the calling terminal device are configured by the media platform server, the user of the calling terminal device cannot interact in real time, thus reducing the user's enjoyment and operability in using the media resources.

Summary of the Invention

[0006] In a first aspect, embodiments of this application propose a media resource playback method, which is applied to a calling terminal device. The method includes: sending a call request to a called terminal device; sending first audio information, wherein the first audio information is used to request playback of media resources; and receiving and playing a first media resource, wherein the first media resource is determined by the first audio information.

[0007] In this embodiment, after sending a call request to the called terminal device, the calling terminal device requests the playback of media resources by sending first audio information. The calling terminal device then receives and plays the first media resources. This allows users to flexibly control the media resources played by the terminal device via voice commands, providing a more user-friendly and engaging experience.
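
As a non-limiting illustration, the three steps of the first aspect (send a call request, send first audio information, receive and play the first media resource) can be sketched as follows. The class `CallingTerminal`, the `send` callback, and the `fake_network` stand-in are hypothetical names introduced only for this sketch; the application does not define a concrete programming interface.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class CallingTerminal:
    """Hypothetical calling-side client; `send` stands in for the network."""

    send: Callable[[str, bytes], bytes]
    played: List[bytes] = field(default_factory=list)

    def call(self, callee: str) -> None:
        # Step 1: send a call request to the called terminal device.
        self.send("call_request", callee.encode("utf-8"))

    def request_media(self, first_audio: bytes) -> bytes:
        # Step 2: send the first audio information to request playback.
        media = self.send("audio", first_audio)
        # Step 3: receive the first media resource and "play" it
        # (playback is modelled here as appending to a list).
        self.played.append(media)
        return media


def fake_network(kind: str, payload: bytes) -> bytes:
    # Made-up stand-in for the media platform server.
    if kind == "audio":
        return b"media-for:" + payload
    return b"ok"


terminal = CallingTerminal(send=fake_network)
terminal.call("callee-001")
first_media = terminal.request_media(b"change one")
```

In this sketch the returned media resource depends only on the audio payload, mirroring the statement that the first media resource is determined by the first audio information.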

[0008] In conjunction with the first aspect, in one possible implementation of the first aspect, the method further includes: playing first response information, wherein the first response information is response information generated based on the first audio information, and the first response information includes audio information, image information, animation information and / or text information.

[0009] In this embodiment, the calling terminal device can overlay the response information on its screen or play it through its speaker. The response information includes, but is not limited to, audio information, image information, animation information, and / or text information. This enables real-time feedback and interaction, enhancing the user experience.

[0010] In conjunction with the first aspect, in one possible implementation of the first aspect, the first response information includes: first text information, which is text information generated by speech recognition processing based on the first audio information, and the content of the first text information corresponds to the first audio information. By feeding back the first text information corresponding to the first audio information to the calling terminal device, user operation is facilitated and user experience is improved.

[0011] In conjunction with the first aspect, in one possible implementation of the first aspect, the method further includes: sending second audio information; and

[0012] receiving a response to the second audio information, the response including: a second media resource, the second media resource being determined by the second audio information; and / or, second response information, the second response information being response information generated based on the second audio information, the second response information including audio information, image information, animation information, and / or text information.

[0013] In this embodiment, users can also continue to operate media resources through audio information, further enhancing the user experience.

[0014] In conjunction with the first aspect, in one possible implementation of the first aspect, the method further includes: stopping the playback of the first media resource; playing the second media resource; and / or playing the second response information.

[0015] In this embodiment, after the calling terminal device receives the second media resource, it can stop playing the first media resource and then play the second media resource. When the calling terminal device receives the second response information, it can also overlay (or play) the second response information on the screen (or speaker), improving the user experience.

[0016] In conjunction with the first aspect, in one possible implementation of the first aspect, the method further includes: receiving an off-hook message sent by the called terminal device; in response to the off-hook message, stopping the playback of the first media resource; stopping the reception of audio information; and / or stopping the speech recognition processing of the audio information.

[0017] In this embodiment, the calling terminal device triggers a request to play media resources via audio information during the call phase. This media resource can be a video ringback tone, enhancing the user experience. Furthermore, the calling terminal device can also trigger other interactive operations related to the media resource via audio information during the call phase. These other interactive operations include, but are not limited to, ordering or copying the media resource, rating or liking the media resource, etc., further enhancing the user experience.
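
As a non-limiting illustration of the off-hook handling described above, the following sketch models the ringing phase as a small state object. The class `RingingSession` and its field names are hypothetical. The sketch stops playback, audio reception, and speech recognition together, whereas the text allows any combination of these.

```python
from dataclasses import dataclass


@dataclass
class RingingSession:
    """Hypothetical state held while the called terminal device is ringing."""

    playing_media: bool = True
    accepting_audio: bool = True
    recognizing_speech: bool = True

    def on_audio(self, chunk: bytes) -> bool:
        # Audio is only passed on for recognition while the session is
        # still accepting audio and recognition is still active.
        return self.accepting_audio and self.recognizing_speech and len(chunk) > 0

    def on_off_hook(self) -> None:
        # The called party answered: stop playback, stop receiving audio
        # information, and stop speech recognition processing.
        self.playing_media = False
        self.accepting_audio = False
        self.recognizing_speech = False


session = RingingSession()
handled_before = session.on_audio(b"voice command")
session.on_off_hook()
handled_after = session.on_audio(b"voice command")
```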

[0018] In conjunction with the first aspect, in one possible implementation of the first aspect, sending the first audio information includes: sending the first audio information, the first audio information including a wake-up keyword, the wake-up keyword being used to trigger the media platform server to perform speech recognition processing on the first audio information; or, first sending audio information that includes the wake-up keyword, and then sending the first audio information.

[0019] In this embodiment, the user can also input wake-up keywords via voice to avoid accidental operation.

[0020] In a second aspect, this application provides a media resource playback method, which is applied to a media platform server. The method includes: receiving a call request sent by the calling terminal device; receiving first audio information sent by the calling terminal device; determining a first media resource based on the first audio information; and sending the first media resource to the calling terminal device.

[0021] In this embodiment, after receiving a call request from the calling terminal device, the media platform server determines the corresponding first media resource based on the first audio information sent by the calling terminal device, and then sends the first media resource to the calling terminal device. This allows users to flexibly control the media resources played by the terminal device through voice commands, bringing a more user-friendly and interesting experience.

[0022] In conjunction with the second aspect, in one possible implementation of the second aspect, determining the first media resource based on the first audio information includes: performing speech recognition processing based on the first audio information to generate first text information, wherein the content of the first text information corresponds to the first audio information; performing semantic understanding processing based on the first text information to generate first user intent information; and determining the first media resource based on the first user intent information.

[0023] In this embodiment, the first user intent information corresponding to the first audio information is determined by speech recognition and semantic understanding, thereby determining the first media resource and improving the recognition accuracy.
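
As a non-limiting illustration, the chain of speech recognition, semantic understanding, and media resource determination can be sketched as three functions. Everything here is a simplifying assumption: `recognize_speech` treats the audio as UTF-8 text instead of running a real recognizer, `understand` uses fixed phrases taken from the examples in this description, and `LIBRARY` is a made-up two-item media resource library.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Intent:
    action: str           # e.g. "switch", "copy_and_subscribe", "unknown"
    keywords: List[str]   # content feature keywords extracted from the text


def recognize_speech(audio: bytes) -> str:
    # Stand-in for speech recognition: the "audio" is assumed to be UTF-8 text.
    return audio.decode("utf-8")


def understand(text: str) -> Intent:
    # Stand-in for semantic understanding, using fixed example phrases.
    lowered = text.lower()
    if "copy" in lowered:
        return Intent("copy_and_subscribe", [])
    if "i want to see" in lowered:
        topic = lowered.split("i want to see", 1)[1].strip(" .")
        return Intent("switch", [topic])
    if "change one" in lowered:
        return Intent("switch", [])
    return Intent("unknown", [])


# Hypothetical media resource library: keyword -> resource identifier.
LIBRARY: Dict[str, str] = {
    "animals": "clip_animals_01",
    "scenery": "clip_scenery_01",
}


def determine_media(intent: Intent, current: Optional[str]) -> Optional[str]:
    if intent.action != "switch":
        return None
    for keyword in intent.keywords:
        if keyword in LIBRARY:
            return LIBRARY[keyword]
    # No usable keyword: pick the first resource that is not playing now.
    # (The description mentions a random switch; this sketch is deterministic.)
    for resource_id in LIBRARY.values():
        if resource_id != current:
            return resource_id
    return None


audio = b"Xiao Cai Xiao Cai, I want to see animals"
chosen = determine_media(understand(recognize_speech(audio)), "clip_scenery_01")
```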

[0024] In conjunction with the second aspect, in one possible implementation of the second aspect, the method further includes: generating first response information based on the first user intent information, the first response information including audio information, image information, animation information and / or text information; and sending the first response information to the calling terminal device.

[0025] In this embodiment, corresponding response information can also be generated based on the first user intent information and sent to the calling terminal device. The response information, including but not limited to audio information, image information, animation information, and / or text information, can be overlaid on the screen of the calling terminal device or played through its speaker. This response information achieves real-time feedback and interaction, enhancing the user experience.

[0026] In conjunction with the second aspect, in one possible implementation of the second aspect, the first response information includes the first text information.

[0027] In conjunction with the second aspect, in one possible implementation of the second aspect, the first user intent information includes any one or more of the following: starting playback of a media resource, pausing playback of a media resource, switching to another media resource, switching back to a previous media resource, copying and subscribing to a media resource, content feature keywords of the first audio information, or the weights of the content feature keywords.

[0028] In conjunction with the second aspect, in one possible implementation of the second aspect, determining the first media resource based on the first user intent information includes: determining the first media resource based on a decision recommendation model and the content feature keywords and / or the weights of the content feature keywords included in the first user intent information, wherein the decision recommendation model uses a parameter set to determine the media resource, the parameter set including any one or more of the following: the content feature keywords of the media resource, the weights of the content feature keywords of the media resource, media resource tags of the media resource library, the weights of the media resource tags of the media resource library, the popularity weights of the media resources of the media resource library, the release time of the media resources of the media resource library, or the playback rate of the media resources of the media resource library, wherein the media resource library includes one or more media resources.
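
As a non-limiting illustration, a decision recommendation model of this kind can be sketched as a weighted score over the media resource library. The scoring formula below (keyword weight multiplied by matching tag weight, plus a small popularity term) is an assumption made for this sketch and is not specified by the application. Release time and playback rate are omitted for brevity, and `MediaResource`, `score`, `recommend`, and `popularity_factor` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MediaResource:
    resource_id: str
    tag_weights: Dict[str, float]  # media resource tag -> tag weight
    popularity: float              # popularity weight, assumed to lie in [0, 1]


def score(resource: MediaResource,
          keyword_weights: Dict[str, float],
          popularity_factor: float = 0.1) -> float:
    # Sum keyword weight x tag weight over keywords that match a tag.
    match = sum(weight * resource.tag_weights.get(keyword, 0.0)
                for keyword, weight in keyword_weights.items())
    # Popularity acts as a tie-breaker with a small, assumed factor.
    return match + popularity_factor * resource.popularity


def recommend(library: List[MediaResource],
              keyword_weights: Dict[str, float]) -> MediaResource:
    # Return the highest-scoring media resource in the library.
    return max(library, key=lambda resource: score(resource, keyword_weights))


library = [
    MediaResource("cats", {"animals": 0.9, "cute": 0.6}, popularity=0.5),
    MediaResource("city", {"scenery": 0.8}, popularity=0.9),
]

best_for_animals = recommend(library, {"animals": 1.0})
best_without_keywords = recommend(library, {})
```

With a keyword that matches a tag, the tag match dominates the score. With no keywords at all, the sketch falls back to the most popular resource.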

[0029] In conjunction with the second aspect, in one possible implementation of the second aspect, the method further includes: detecting whether the first audio information includes a wake-up keyword; if the first audio information includes the wake-up keyword, triggering speech recognition processing based on the first audio information; or, triggering speech recognition processing based on the first audio information based on detecting that the received audio information includes the wake-up keyword.

[0030] In this embodiment, the user can also input a wake-up keyword by voice. The media platform server detects the wake-up keyword and initiates speech recognition processing of the first audio information after the user inputs the wake-up keyword, thus avoiding misoperation.
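
As a non-limiting illustration, both wake-up variants (keyword carried inside the first audio information, or keyword sent beforehand in separate audio information) can be sketched as a small gate. For simplicity the gate works on transcripts rather than raw audio, which is an assumption of this sketch. The keyword "Xiao Cai Xiao Cai" is taken from the examples in this description, and `WakeGate` and `normalize` are hypothetical names.

```python
import re
from typing import Optional

WAKE_KEYWORD = "xiao cai xiao cai"


def normalize(transcript: str) -> str:
    # Lower-case and drop punctuation so "Xiao Cai, Xiao Cai" also matches.
    return " ".join(re.findall(r"[a-z0-9']+", transcript.lower()))


class WakeGate:
    """Lets a command through only after the wake-up keyword is detected."""

    def __init__(self, keyword: str = WAKE_KEYWORD) -> None:
        self.keyword = keyword
        self.awake = False

    def accept(self, transcript: str) -> Optional[str]:
        text = normalize(transcript)
        if self.keyword in text:
            command = text.split(self.keyword, 1)[1].strip()
            if command:
                # Variant 1: keyword and command arrive in the same audio.
                return command
            # Variant 2: keyword alone; the next audio carries the command.
            self.awake = True
            return None
        if self.awake:
            self.awake = False
            return text
        # No wake-up keyword seen: ignore the audio to avoid misoperation.
        return None


gate = WakeGate()
same_audio = gate.accept("Xiao Cai Xiao Cai, change one")
ignored = gate.accept("change one")
keyword_only = gate.accept("Xiao Cai, Xiao Cai")
follow_up = gate.accept("I want to see animals")
```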

[0031] In conjunction with the second aspect, in one possible implementation of the second aspect, the method further includes: receiving second audio information sent by the calling terminal device; generating a response to the second audio information based on the second audio information, the response to the second audio information including: a second media resource, the second media resource being determined by the second audio information; and / or, a second response information, the second response information being response information generated based on the second audio information, the second response information including audio information, image information, animation information, and / or text information; and sending the response to the second audio information to the calling terminal device.

[0032] In conjunction with the second aspect, in one possible implementation of the second aspect, generating a response based on the second audio information includes: performing speech recognition processing based on the second audio information to generate second text information, the content of which corresponds to the second audio information; performing semantic understanding processing based on the second text information to generate second user intent information; and generating a response based on the second user intent information.

[0033] In this embodiment, users can also continue to operate media resources through audio information, further enhancing the user experience.

[0034] A third aspect of this application provides an electronic device, including a transceiver unit and a processing unit, which enables the electronic device to implement the method in the first aspect or any possible implementation of the first aspect, or enables the electronic device to implement the method in the second aspect or any possible implementation of the second aspect.

[0035] A fourth aspect of this application provides an electronic device, including: a processor coupled to a memory for storing programs or instructions, wherein when the program or instructions are executed by the processor, the electronic device implements the method of the first aspect or any possible implementation thereof, or implements the method of the second aspect or any possible implementation thereof.

[0036] A fifth aspect of this application provides a computer-readable medium having a computer program or instructions stored thereon, which, when run on a computer, cause the computer to perform the methods of the first aspect or any possible implementation thereof, or cause the computer to perform the methods of the second aspect or any possible implementation thereof.

[0037] A sixth aspect of this application provides a computer program product that, when executed on a computer, causes the computer to perform the methods in the aforementioned first aspect or any possible implementation thereof, or causes the computer to perform the methods in the aforementioned second aspect or any possible implementation thereof.

Attached Figure Description

[0038] Figure 1 is a schematic diagram of a communication scenario proposed in an embodiment of this application;

[0039] Figure 2 is a schematic flowchart of an embodiment of a media resource playback method according to this application.

[0040] Figure 3 is a schematic diagram of another communication scenario in an embodiment of this application;

[0041] Figure 4 is a schematic flowchart of an embodiment of a media resource playback method according to this application;

[0042] Figure 5 is a schematic diagram of another communication scenario in the embodiments of this application;

[0043] Figure 6 is a schematic flowchart of an embodiment of a media resource playback method according to this application;

[0044] Figure 7 is a schematic diagram of another communication scenario in an embodiment of this application;

[0045] Figure 8 is a schematic flowchart of an embodiment of a media resource playback method according to this application;

[0046] Figure 9 is a schematic diagram of another communication scenario in an embodiment of this application;

[0047] Figure 10 is a schematic flowchart of an embodiment of a media resource playback method according to this application.

[0048] Figures 11 to 18 are schematic diagrams of a scenario in an embodiment of this application;

[0049] Figure 19 is a schematic diagram of an application scenario in an embodiment of this application;

[0050] Figure 20 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application;

[0051] Figure 21 is a schematic diagram of another structure of the electronic device in the embodiments of this application;

[0052] Figure 22 is a schematic diagram of another structure of the electronic device in the embodiments of this application.

Detailed Implementation

[0053] The terms "first," "second," etc., used in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such terms are interchangeable where appropriate; this is merely a way of distinguishing objects with the same attributes in the description of embodiments of this application. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion, so that a process, method, system, product, or apparatus that comprises a series of units is not necessarily limited to those units, but may include other units not explicitly listed or inherent to those processes, methods, products, or apparatuses.

[0054] The technical solutions in the embodiments of this application will be clearly described below with reference to the accompanying drawings. In the description of this application, unless otherwise stated, " / " means "or," for example, A / B can mean A or B; "and / or" in this application is merely a description of the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A existing alone, A and B existing simultaneously, and B existing alone. Furthermore, in the description of this application, "at least one" refers to one or more items, and "multiple" refers to two or more items. "At least one of the following" or similar expressions refer to any combination of these items, including any combination of a single item or multiple items. For example, at least one of a, b, or c can represent: a, b, c, ab, ac, bc, or abc, where a, b, and c can be single or multiple.
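
As a non-limiting illustration, the meaning of "at least one of a, b, or c" given above corresponds to every non-empty combination of the items. The helper name `at_least_one` is hypothetical.

```python
from itertools import combinations
from typing import List


def at_least_one(items: List[str]) -> List[str]:
    # Every non-empty combination, from single items up to all items.
    return ["".join(combo)
            for size in range(1, len(items) + 1)
            for combo in combinations(items, size)]


options = at_least_one(["a", "b", "c"])
# options == ['a', 'b', 'c', 'ab', 'ac', 'bc', 'abc']
```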

[0055] First, some terms used in the embodiments of this application will be explained to facilitate understanding by those skilled in the art.

[0056] (1) Terminal device: can be a wireless terminal device that can receive network device scheduling and instruction information. The wireless terminal device can be a device that provides voice and / or data connectivity to the user, or a handheld device with wireless connection function, or other processing device connected to a wireless modem.

[0057] Terminal devices can communicate with one or more core networks or the Internet via a radio access network (RAN). Terminal devices can be mobile terminal devices, such as mobile phones (or "cellular" phones), computers, and data cards. For example, they can be portable, pocket-sized, handheld, computer-embedded, or vehicle-mounted mobile devices that exchange voice and / or data with the RAN. Examples include personal communication service (PCS) phones, cordless phones, session initiation protocol (SIP) phones, wireless local loop (WLL) stations, personal digital assistants (PDAs), tablets, and computers with wireless transceiver capabilities. A wireless terminal device can also be referred to as a system, subscriber unit, subscriber station (SS), mobile station (MS), remote station, access point (AP), remote terminal, access terminal, user terminal, user agent, customer premises equipment (CPE), terminal, user equipment (UE), mobile terminal (MT), drone, etc. Terminal devices can also be wearable devices or terminal devices in next-generation communication systems, such as terminal devices in 5G communication systems or in future public land mobile networks (PLMNs).

[0058] (2) Internet Protocol Multimedia Subsystem (IMS) domain.

[0059] The CT domain communicates through the evolved packet core (EPC) and the Internet Protocol Multimedia Subsystem (IMS) domain core network. The IMS domain core network includes several application servers (AS), such as media platform servers. These media platform servers provide playback of media resources for terminals; for example, when providing video ringback tone services, they are also called video ringback tone platforms. The media platform server can include a media application server and a media resource server (MRS), which can be co-located or physically separated. The media resource server can also be called a ringback tone platform, video ringback tone platform, or echo ringback tone platform. It provides media resources such as video ringback tones, video ringback sounds, video advertisements, and video customer service; for example, it creates and manages these media resources. The media application server processes session initiation protocol (SIP) signaling messages, while the media resource server provides audio and / or video streams to the calling terminal and / or the called terminal.

[0060] In addition, the IMS domain core network also includes: Serving-Call Session Control Function (S-CSCF) devices, Interrogating-Call Session Control Function (I-CSCF) devices, Proxy-Call Session Control Function (P-CSCF) devices, Home Subscriber Server (HSS) devices, Session Border Controller (SBC) devices, and several application servers, such as Telephony Application Server (TAS), Multimedia Telephony Application Server (MMTelAS), and Service Centralization and Continuity Application Server (SCCAS). I-CSCF devices can be co-located with S-CSCF devices and are referred to as "I / S-CSCF" devices. SBC devices and P-CSCF devices can be co-located and are referred to as "SBC / P-CSCF" devices. The EPC may include Packet Data Network Gateway (PGW) equipment, Serving Gateway (SGW) equipment, and Mobility Management Entity (MME) equipment.

[0061] S / P-GW devices provide the functions of serving gateways (SGWs) and packet data network gateways (PGWs). SGWs are local mobility anchors, primarily oriented towards the radio access network for service plane data transmission. PGWs are Evolved Packet System (EPS) anchors, primarily oriented towards other data networks, enabling access and interaction with multiple public data networks. SGW devices can be used to connect the IMS core network to the wireless network, while PGW devices can be used to connect the IMS core network to Internet Protocol (IP) networks. MME devices are core devices in the EPC network, providing the functions of MME logical entities.

[0062] The aforementioned network devices are all corresponding network devices in existing wireless communication networks, and will not be described in detail here, but only briefly. For example: HSS devices can be used to store user subscription information and location information. SBC devices can provide secure access and media processing. MMTelAS devices provide basic and supplementary multimedia telephony services. MME devices are core devices of EPC networks. SGW devices can be used to connect the IMS core network and the wireless network, and PGW devices can be used to connect the IMS core network and the IP network. S-CSCF devices can be used for user registration, authentication control, session routing, and service triggering control, and to maintain session state information. I-CSCF devices can be used for assigning and querying S-CSCF devices registered by users. P-CSCF devices can be used for signaling and message brokering. In this application, for the sake of brevity, CSCF devices are used to represent any combination of one or more of S-CSCF devices, I-CSCF devices, and P-CSCF devices.

[0063] (3) Media resources.

[0064] The media resources in this application embodiment include, but are not limited to: audio ringtones, video ringtones, video advertisements, or video animations, etc.

[0065] Since the media resources played by the calling terminal device are configured by the media platform server, the user of the calling terminal device cannot interact in real time, thus reducing the user's enjoyment and operability in using media resources.

[0066] Based on this, this application proposes a media resource playback method, which includes: sending a call request to a called terminal device; sending first audio information, the first audio information being used to request the playback of media resources; and receiving and playing a first media resource, the first media resource being determined by the first audio information. After sending a call request to the called terminal device, the calling terminal device requests the playback of media resources by sending the first audio information, and then the calling terminal device receives and plays the first media resource. This allows users to flexibly control the media resources played by the terminal device through voice commands, bringing a more user-friendly and interesting user experience.

[0067] The embodiments of this application are described below with reference to the accompanying drawings. Please refer to Figure 1, which is a schematic diagram of a communication scenario proposed in this application embodiment. The communication scenario proposed in this application embodiment includes: a media platform server, a calling terminal device, and a called terminal device. The calling terminal device sends a call request to the called terminal device. Then, the media platform server sends a default media resource to the calling terminal device, and the calling terminal device plays the default media resource. This default media resource can be pre-ordered by the calling terminal device or pre-allocated by the media platform server to the calling terminal device. While the calling terminal device is playing the default media resource, the calling terminal device can collect the user's audio information and then send the audio information to the media platform server. The media platform server interacts based on the audio information reported by the calling terminal device, for example, by switching the media resource played by the calling terminal device, pausing the playback of the media resource, or finalizing the media resource currently played by the calling terminal device.

[0068] Based on the communication scenario illustrated in Figure 1, please refer to Figure 2, which is a schematic flowchart of an embodiment of a media resource playback method according to this application. The media resource playback method proposed in this application includes:

[0069] S1. The calling terminal device sends a call request to the called terminal device.

[0070] The calling terminal device sends a call request to the called terminal device, and the calling and called terminal devices negotiate the call. After the called terminal device rings, the media platform server and the calling terminal device complete the negotiation of media resources. During the negotiation process, the transmission direction of the media resources is set to allow both uplink and downlink transmission.
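
As a non-limiting illustration, if the media negotiation is carried in Session Description Protocol (SDP) bodies, a bidirectional transmission direction corresponds to the standard `a=sendrecv` attribute. The use of SDP is an assumption of this sketch, since the description does not name the negotiation format, and the helper names and the sample offer below are hypothetical.

```python
DIRECTIONS = ("sendrecv", "sendonly", "recvonly", "inactive")


def media_direction(sdp: str) -> str:
    """Return the first direction attribute found in an SDP body."""
    for raw_line in sdp.splitlines():
        line = raw_line.strip()
        if line.startswith("a=") and line[2:] in DIRECTIONS:
            return line[2:]
    # SDP treats a missing direction attribute as sendrecv.
    return "sendrecv"


def allows_uplink_and_downlink(sdp: str) -> bool:
    # Uplink carries the user's audio information; downlink carries the
    # media resource. Both are possible only with "sendrecv".
    return media_direction(sdp) == "sendrecv"


# Made-up, heavily shortened SDP body for illustration only.
offer = "\r\n".join([
    "v=0",
    "m=audio 49170 RTP/AVP 0",
    "a=sendrecv",
])

bidirectional = allows_uplink_and_downlink(offer)
```

A downlink-only negotiation (for example `a=sendonly` from the server side) would leave no uplink path for the calling terminal device to send the audio information described in the following steps.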

[0071] S2. The calling terminal device sends the first audio information to the media platform server.

[0072] After negotiation, the media platform server sends a default media resource to the calling terminal device. This default media resource can be media resource subscribed to by the calling terminal device from the media platform server, media resource subscribed to by the called terminal device from the media platform server, or media resource actively allocated by the media platform server to either the calling or called terminal device. The calling terminal device then plays this default media resource.

[0073] During the above process, the media platform server initiates a voice interaction service. This voice interaction service allows users to send voice commands (e.g., via audio information carrying the voice message) to the media platform server through their terminal devices, and then complete the interaction according to the voice commands. In one possible implementation, the calling terminal device sends first audio information to the media platform server, which is used to request the playback of media resources.

[0074] In one example, the calling terminal device requests a switch to a different media resource via first audio information. This first audio information can be a request to randomly switch to the next media resource, or it can be a specific instruction to switch to a particular type of media resource. For example, please refer to Figure 11, which is a schematic diagram of one scenario in this embodiment. The calling terminal device displays on its screen the first text information corresponding to the first audio information. This first text information is obtained by the media platform server after performing speech recognition based on the first audio information. The first text information corresponding to the first audio information includes: "Xiao Cai Xiao Cai, change one," which is used to request a change to the currently playing default media resource. In yet another example, please refer to Figure 12, which is a schematic diagram of yet another scenario in this embodiment. The first text information corresponding to the first audio information includes: "Xiao Cai Xiao Cai, I want to see animals," which is used to request a change to the currently playing default media resource and to play animal-related media resources.

[0075] In another example, please refer to Figure 13, which is a schematic diagram of another scenario in the embodiment of this application. The first text information corresponding to the first audio information includes "Xiao Cai Xiao Cai". The first audio information is used to trigger the media platform server to perform speech recognition processing on the first audio information. "Xiao Cai Xiao Cai" is used as the wake-up keyword to wake up the voice interaction service.

[0076] In another possible implementation, the calling terminal device sends first audio information to the media platform server. This first audio information is used for interactive processing of the media resource currently being played by the calling terminal device. This interactive processing includes, but is not limited to: stopping the playback of the currently playing media resource, copying and subscribing to the currently playing media resource, or sharing the currently playing media resource with other users. In yet another example, please refer to Figure 14, which is a schematic diagram of another scenario in this application embodiment. The first text information corresponding to the first audio information includes: "Xiao Cai, Xiao Cai, copy this ringtone for me." This first audio information is used to trigger the media platform server to copy and subscribe to the media resource currently being played by the calling terminal device.

[0077] In another possible implementation, when the media platform server cannot recognize and process the first audio information, it can send a prompt message to the calling terminal device indicating this. For example, please refer to Figure 15, which is a schematic diagram of another scenario in this application embodiment. The first audio information includes: "Xiao Cai, Xiao Cai, it's raining today." Because the media platform server cannot recognize and process this first audio information, it sends a prompt message to the calling terminal device. The screen of the calling terminal device displays the prompt message, which includes: "Sorry, I didn't understand. What do you want Xiao Cai to do?" Subsequently, the calling terminal device can continue to collect the user's audio information, such as second audio information, and send it to the media platform server, which performs voice interaction processing based on the second audio information.

[0078] S3. The media platform server detects whether the first audio information includes a wake-up keyword.

[0079] The media platform server detects whether the user's voice (audio information) from the calling terminal device includes a wake-up keyword, in order to prevent accidental triggering of voice interaction processing. Once the media platform server detects a wake-up keyword in that audio information, it performs subsequent processing on the audio information, such as speech recognition and semantic understanding.

[0080] In one possible implementation, after the media platform server receives the first audio information, it detects whether the first audio information includes a wake-up keyword. This wake-up keyword is used to trigger the media platform server to perform speech recognition processing on the first audio information. If the first audio information includes a wake-up keyword, then proceed to step S4. For example, the media platform server uses keyword spotting (KWS) and other technologies to identify whether the first audio information includes a wake-up keyword, which can also be called a wake-up word.

[0081] In another possible implementation, the media platform server collects the audio information of the calling terminal device after step S1 and identifies whether the audio information includes a wake-up keyword. If it includes a wake-up keyword, it continues to collect the first audio information and then executes step S4.

[0082] In another possible implementation, the media platform server can also send voice interaction prompts to the calling terminal device, which then displays the prompts on its screen as a real-time overlay. For example, please refer to Figure 16, which illustrates another scenario in this embodiment. The calling terminal device displays the voice interaction prompts overlaid on its screen. These prompts include: "Hi, hello. I am the intelligent voice assistant 'Xiao Cai.' You can say to me, 'Xiao Cai, Xiao Cai, change one.'" The calling terminal device continues to receive audio input from the user.

[0083] If the media platform server does not detect a wake-up keyword in the first audio information, or in the audio information of the calling terminal device, it determines that the audio currently input at the calling terminal device is not intended for voice interaction related to media resources, and it does not perform the subsequent voice interaction processing.
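The gating behaviour of step S3 can be illustrated with a minimal sketch. It assumes the audio has already been transcribed to text, which stands in for keyword spotting on the audio itself; the constant `WAKE_WORD` and the helper names `normalize` and `strip_wake_word` are hypothetical and are not part of this application.

```python
import re
from typing import Optional

# Hypothetical wake-up keyword, taken from the examples in this description.
WAKE_WORD = "xiao cai xiao cai"


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(cleaned.split())


def strip_wake_word(utterance: str) -> Optional[str]:
    """Return the command that follows the wake-up keyword.

    Returns None when the keyword is absent (no further processing),
    and an empty string when the utterance is the keyword alone.
    """
    normalized = normalize(utterance)
    if not normalized.startswith(WAKE_WORD):
        return None
    return normalized[len(WAKE_WORD):].strip()
```

Under these assumptions, an utterance without the keyword yields None and is dropped, while the keyword alone yields an empty command, which corresponds to the scenario of Figure 13 where the service is woken and waits for further input.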

[0084] S4. The media platform server performs speech recognition processing based on the first audio information to generate the first text information.

[0085] After receiving the first audio information, the media platform server performs speech recognition processing on the first audio information to generate the first text information. For example, the media platform server recognizes the first audio information (i.e., the user's voice from the calling terminal device) based on Automatic Speech Recognition (ASR) technology and converts the first audio information into the first text information.

[0086] Optionally, the media platform server can send the first text information to the calling terminal device, and the calling terminal device's screen will overlay the first text information. The first text information is, for example, shown in Figures 11 to 15 above.

[0087] S5. The media platform server performs semantic understanding processing based on the first text information to generate the first user intent information.

[0088] After receiving the first text information, the media platform server performs semantic understanding processing on the first text information to generate first user intent information. The first user intent information includes any one or more of the following: start playing media resources, pause playing media resources, switch playing media resources, switch back to playing media resources, copy and subscribe to media resources, content feature keywords of the first audio information, or the weight of the content feature keywords. For example, an artificial intelligence (AI) model can be used for semantic understanding.
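As an illustration only, the mapping from first text information to first user intent information can be sketched with simple rules. This application suggests an AI model for semantic understanding; the rule-based matching below is a simplified stand-in, and the names `UserIntent`, `parse_intent` and the action labels are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical action labels mirroring some of the intent types listed above.
SWITCH, PAUSE, COPY, UNKNOWN = "switch", "pause", "copy_and_subscribe", "unknown"


@dataclass
class UserIntent:
    action: str
    # Content feature keywords mapped to their weights.
    keywords: Dict[str, float] = field(default_factory=dict)


def parse_intent(command: str) -> UserIntent:
    """Map a recognized command (wake-up keyword removed) to an intent."""
    text = command.lower().strip()
    if "copy" in text:
        return UserIntent(COPY)
    if "pause" in text or "stop" in text:
        return UserIntent(PAUSE)
    if text.startswith("change"):
        return UserIntent(SWITCH)
    prefix = "i want to see "
    if text.startswith(prefix):
        topic = text[len(prefix):].strip()
        if topic:
            # A single keyword gets full weight in this toy example.
            return UserIntent(SWITCH, {topic: 1.0})
    return UserIntent(UNKNOWN)
```

With these rules, "change one" produces a switch intent without keywords, and "I want to see animals" produces a switch intent carrying the keyword "animals" with weight 1.0.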

[0089] After step S5, proceed to steps S6 and S9.

[0090] S6. The media platform server determines the first media resource based on the first user intent information.

[0091] The media platform server determines the corresponding first media resource based on the first user intent information. In one possible implementation, if the first user intent information requests a switch to play media resources, the media platform server determines the media resource to be played as the first media resource from one or more candidate media resources. The first media resource may include one or more media resources.

[0092] Specifically, the media platform server determines the media resources that match the content feature keywords and / or the weights of those content feature keywords as the first media resource, based on the content feature keywords included in the first user intent information. Referring to the example in Figure 12, if the content feature keyword included in the first user intent information is "animal," the media platform server determines the media resource containing animal images from one or more candidate media resources as the first media resource based on this content feature keyword.

[0093] For example, the media platform server determines the first media resource based on a decision recommendation model (or decision recommendation algorithm, or decision algorithm) and the content feature keywords and / or the weights of the content feature keywords included in the first user intent information. The decision recommendation model uses a set of parameters to determine the media resource, and the parameter set includes any one or more of the following: the content feature keywords of the media resource, the weights of the content feature keywords of the media resource, media resource tags in the media resource library, the weights of the media resource tags in the media resource library, the popularity weights of the media resources in the media resource library, the publication time of the media resources in the media resource library, or the playback rate of the media resources in the media resource library. The media resource library includes one or more media resources.

[0094] Optionally, if the media platform server determines that the first user intent information requests a switch to play media resources and the first user intent information does not include content feature keywords, the media platform server can randomly select a media resource as the first media resource for the calling terminal device based on the first user intent information.
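The selection in step S6 can be sketched as a small scoring function. This is a hedged illustration, not the decision recommendation model of this application: it uses only three of the listed parameters (content feature keyword weights, media resource tag weights, and a popularity weight), and the scoring formula, the class `MediaResource` and the function names are assumptions.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MediaResource:
    resource_id: str
    tags: Dict[str, float]   # media resource tag -> tag weight
    popularity: float = 0.0  # popularity weight (assumed scale)


def score(resource: MediaResource, keywords: Dict[str, float]) -> float:
    """Sum keyword weight times tag weight over tags that match a keyword."""
    return sum(w * resource.tags.get(k, 0.0) for k, w in keywords.items())


def choose_resource(
    candidates: List[MediaResource],
    keywords: Dict[str, float],
    rng: Optional[random.Random] = None,
) -> MediaResource:
    """Pick the first media resource from a non-empty candidate list."""
    rng = rng or random.Random()
    if not keywords:
        # No content feature keywords: fall back to a random switch.
        return rng.choice(candidates)
    # Rank by keyword match; popularity only breaks ties.
    return max(candidates, key=lambda r: (score(r, keywords), r.popularity))
```

In this sketch, when no candidate matches any keyword every score is zero, so the most popular candidate is returned; a real deployment might prefer to report that nothing matched.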

[0095] S7. The media platform server sends the first media resource to the calling terminal device.

[0096] S8. The calling terminal device plays the first media resource.

[0097] After receiving the first media resource, the calling terminal device stops playing the default media resource and then plays the first media resource.

[0098] S9. The media platform server generates the first response information based on the first user intent information.

[0099] The media platform server generates first response information based on the first user intent information. The first response information includes audio information, image information, animation information, and / or text information. The first response information may include the first text information generated by speech recognition of the first audio information. The first response information may also include temporary response information generated by preliminary processing of the first text information. For example, the first response information in Figure 13 includes: "Xiao Cai Xiao Cai" and "I'm here, please speak, I'm listening...".

[0100] The first response information may also include an interactive response based on the first user intent information. For example, the first response information may include a playback start prompt: "Master, the video has been switched for you"; a processing wait prompt: "Processing, please wait a moment, Master"; a voice wait prompt: "I'm here, please speak, I'm listening"; an interaction result prompt: "Master, the download has been successfully copied for you"; or, the first response information may include an intent not understood prompt: "Sorry, I didn't understand."

[0101] For ease of understanding, examples are provided with reference to the accompanying drawings. For instance, the first response information in Figure 11 includes: "Video switched for you." Another example is the first response information in Figure 12, which includes: "Video switched for you." Yet another example is the first response information in Figure 13, which includes: "Yes, please speak, I'm listening..." Another example is the first response information in Figure 14, which includes: "Master, download successfully copied for you." And yet another example is the first response information in Figure 15, which includes: "Sorry, I didn't understand. What do you want me to instruct Xiao Cai to do?"
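The generation of text-only first response information in step S9 can be sketched as a lookup from intent to prompt. The prompt strings are quoted from the examples above; the intent labels used as keys and the function `build_response` are hypothetical, and audio, image and animation responses are left out of this sketch.

```python
from typing import Dict

# Prompt texts quoted from the examples above; the intent labels used as
# keys are hypothetical.
PROMPTS: Dict[str, str] = {
    "wake_only": "I'm here, please speak, I'm listening",
    "switch": "Master, the video has been switched for you",
    "copy_and_subscribe": "Master, the download has been successfully copied for you",
    "processing": "Processing, please wait a moment, Master",
}
NOT_UNDERSTOOD = "Sorry, I didn't understand."


def build_response(intent_action: str, recognized_text: str) -> Dict[str, str]:
    """Assemble text-only response information for overlay on the screen."""
    return {
        "recognized_text": recognized_text,
        "prompt": PROMPTS.get(intent_action, NOT_UNDERSTOOD),
    }
```

Any intent label that is not in the table falls through to the "not understood" prompt, which mirrors the scenario of Figure 15.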

[0102] S10. The media platform server sends the first response information to the calling terminal device.

[0103] S11. The calling terminal device plays the first response information.

[0104] After receiving the first response information, the calling terminal device overlays the first response information onto the interface for playing the first media resource or the default media resource.

[0105] It should be noted that the execution order of steps S8 and S11 is not limited in this application embodiment. Step S8 can be executed first and then step S11, or step S11 can be executed first and then step S8, or steps S8 and S11 can be executed simultaneously. For example, the screen of the calling terminal device can display the first response information while playing the first media resource.

[0106] S12. The calling terminal device sends the second audio information to the media platform server.

[0107] Steps S12-S15 are optional.

[0108] After the calling terminal device has sent the first audio information to the media platform server, the calling terminal device can also send the second audio information to the media platform server.

[0109] S13. The media platform server determines the second media resource and / or generates the second response information based on the second audio information.

[0110] In step S13, the media platform server processes the second audio information in a manner similar to how it processes the first audio information in steps S2-S10; the details are not repeated here.

[0111] For example, when the second audio information is used to indicate switching the media resource to be played, the media platform server determines the second media resource to switch to based on the second user intent information corresponding to the second audio information.

[0112] It can be understood that when the user of the calling terminal device performs multiple voice interactions, the media platform server handles each subsequent piece of audio information in a manner similar to its handling of the first audio information.

[0113] S14. The media platform server sends a response to the second audio information to the calling terminal device, the response including the second media resource and / or the second response information.

[0114] The second media resource is similar to the first media resource, and the second response information is similar to the first response information. The second response information is generated based on the second audio information and includes audio information, image information, animation information, and / or text information.

[0115] S15. The calling terminal device plays the second media resource and / or the second response information.

[0116] In step S15, the calling terminal device stops playing the first media resource and / or the first response information. Then, the calling terminal device plays the second media resource and / or the second response information.

[0117] In this embodiment, after sending a call request to the called terminal device, the calling terminal device requests the playback of media resources by sending first audio information. The calling terminal device then receives and plays the first media resources. This allows users to flexibly control the media resources played by the terminal device via voice commands, providing a more user-friendly and engaging experience.

[0118] Based on the foregoing embodiments, another communication scenario of this application embodiment will be described below. Please refer to Figure 3, which is a schematic diagram of another communication scenario of this application embodiment. In this application embodiment, the calling terminal device connects to the IMS core network through the access network, and then connects to the media platform server through the IMS core network; the called terminal device connects to the IMS core network through the access network, and then connects to the media platform server through the IMS core network. It is understood that the calling terminal device and / or the called terminal device can connect to the media platform server through other means, and this application embodiment does not impose any restrictions.

[0119] A media platform server may include one or more logical network elements. In practice, the media platform server can be deployed on a unified physical network element to implement multiple logical network elements, or multiple logical network elements can be implemented on multiple physical network elements according to their internal characteristics. Taking video ringback tones (or simply ringback tones) as an example, one possible implementation is that the media platform server includes any one or more of the following logical network elements: video ringback tones service operation and management service, artificial intelligence model, video ringback tones intelligent semantic understanding service, video ringback tones interactive response processing service, video ringback tones playback decision service, video ringback tones intelligent interactive service, video ringback tones application service, or video ringback tones media service. A detailed explanation follows:

[0120] 1. The video ringback tone application service, also known as the signaling access and processing unit for video ringback tones, implements logical functions such as media negotiation and media playback control during the ringing phase of a video ringback tone user's call. The video ringback tone application service supports calling the playback decision service with the user intent information to obtain the media resource to switch to, notifying the calling terminal device to switch to the new media resource, and overlaying the interactive feedback response information.

[0121] 2. The video ringback tone media service is used to play video ringback tones during the ringing phase of a call. The video ringback tone content is sent to the calling terminal device via the network as an audio and video media stream. The service supports uplink processing and intelligent recognition of the calling terminal device's audio information (user voice), and real-time identification of wake-up keywords in the audio information. Upon detecting a wake-up keyword, it generates corresponding text information based on the calling terminal device's audio information and supports overlaying this text information as subtitles onto the video ringback tone media displayed on the calling terminal device's screen, enabling real-time interactive feedback.

[0122] 3. The video ringback tone playback decision service is used to decide which video ringback tone to play for the user (i.e., the user of the calling terminal device) during each call, based on the video ringback tone media subscribed to by the calling terminal device. It selects the corresponding video ringback tone content in real time based on the content feature keywords of the user's intent information. When the semantics of the calling terminal device's audio information indicate that the user wants a video ringback tone with certain characteristics, the service uses a decision algorithm to select, in real time, video ringback tone content that matches the user's intent information.

[0123] 4. The video ringback tone intelligent interactive service is used to accurately understand the semantics of the audio information of the calling terminal device, obtain the user's interactive intent of the calling terminal device, and make corresponding interactive responses.

[0124] 5. The artificial intelligence model is pre-trained for the video ringback tone intelligent interactive service, so that, in video ringback tone voice interaction scenarios, the text information generated from the calling terminal device's audio information can be understood accurately and answered with interactive feedback.

[0125] 6. Video ringback tone service operation and management service is used to provide users with video ringback tone service management services, including activating video material function and ordering video ringback tones.

[0126] Based on the media platform server illustrated in Figure 3, please refer to Figure 4, which is a schematic flowchart of an embodiment of a media resource playback method according to this application. When the media platform server includes a video ringback tone application service, a video ringback tone media service, a video ringback tone playback decision service, and a video ringback tone intelligent interactive service, the media resource playback method proposed in this application includes:

[0127] D1. The calling terminal device sends a call request to the called terminal device.

[0128] D2. The calling terminal device and the called terminal device negotiate the call and media resources (ringback tone).

[0129] D3. The video ringback tone application service notifies the video ringback tone media service to play media resources; the notification carries a voice recognition identifier.

[0130] In step D3, in one possible implementation, the video ringback tone application service sends voice recognition identification information to the video ringback tone media service. This voice recognition identification information instructs the video ringback tone media service to start receiving the uplink audio media stream and to perform voice recognition of the wake-up keyword.

[0131] In another possible implementation, the video ringback tone application service sends a voice interaction prompt to the video ringback tone media service. This voice interaction prompt notifies the calling terminal device to display the prompt, which includes a prompt for the user to input a wake-up keyword. The calling terminal device displays the voice interaction prompt in real-time overlay on the interface playing the default media resource.

[0132] It should be noted that the video ringback tone application service can send both the voice recognition identification information and the voice interaction prompt to the video ringback tone media service.

[0133] D4. The video ringback tone media service sends default media resources and voice interaction prompts to the calling terminal device.

[0134] In step D4, the default media resources are played on the screen of the calling terminal device, and voice interaction prompts are overlaid in real time.

[0135] D5. The calling terminal device sends the first audio information to the video ringback tone media service.

[0136] D6. The video ringback tone media service detects whether the first audio information includes a wake-up keyword.

[0137] The video ringback tone media service parses the first audio information of the calling terminal device and uses keyword recognition and other technologies to determine whether the first audio information includes a wake-up keyword. If it includes a wake-up keyword, proceed to step D7; otherwise, do not process it or send a message to the calling terminal device indicating that the intent was not understood, such as: "Sorry, I didn't understand."

[0138] D7. The video ringback tone media service performs speech recognition processing on the first audio information to generate the first text information.

[0139] For example, the video ringback tone media service identifies the first audio information based on automatic speech recognition technology and converts the first audio information into first text information.

[0140] D8. The video ringback tone media service sends the first text information and the temporary response information to the calling terminal device.

[0141] The video ringback tone media service overlays the first text information and the corresponding temporary response information onto the video ringback tone media stream (default media resource) and sends it to the calling terminal device. The first text information and the temporary response information are overlaid and played on the screen of the calling terminal device in real time.

[0142] D9. The video ringback tone media service reports the first text information to the video ringback tone intelligent interactive service.

[0143] D10. The video ringback tone intelligent interactive service performs semantic understanding processing on the first text information to generate the first user intent information, and determines the first response information based on the first user intent information.

[0144] D11. The video ringback tone intelligent interactive service sends an interactive processing request to the video ringback tone application service. The interactive processing request is used to request the playback of new media resources. The interactive processing request includes first user intent information, first response information and / or first text information.

[0145] The video ringback tone intelligent interactive service sends an interaction processing request to the video ringback tone application service (or video ringback tone voice management service) based on the first user intent information. This interaction processing request is used to request the video ringback tone application service to perform the business processing corresponding to the first user intent information. The interaction processing request includes the first user intent information, the first response information, and / or the first text information.

[0146] For example, the first user intent information includes, but is not limited to: interactive response instructions, such as switching media resources, switching back to media resources, or copying and subscribing to media resources, or content feature keywords (which may also be called video content tags) and the weight of the content feature keywords.

[0147] D12. The video ringback tone application service sends a media resource query request to the video ringback tone playback decision service. This request carries the first user intent information.

[0148] After determining that the first user intent is to switch media resources, the video ringback tone application service sends a media resource query request to the video ringback tone playback decision service. This request carries the first user intent information. For example, the request includes content feature keywords and their weights.

[0149] D13. The video ringback tone playback decision service determines the first media resource based on the media resource query request.

[0150] The video ringback tone playback decision service processes media resource query requests based on a specified decision recommendation model, selects the media resource (i.e., video ringback tone) that best matches the corresponding content feature keywords, and returns it to the video ringback tone application service. The decision recommendation model uses a set of parameters to determine the media resource. The parameter set includes any one or more of the following: content feature keywords of the media resource, weight of content feature keywords of the media resource, media resource tags of the media resource library, weight of media resource tags of the media resource library, popularity weight of media resources in the media resource library, release time of media resources in the media resource library, or playback rate of media resources in the media resource library. The media resource library includes one or more media resources.

[0151] D14. The video ringback tone playback decision service sends the identification information of the first media resource to the video ringback tone application service.

[0152] D15. The video ringback tone application service notifies the video ringback tone media service to switch to the first media resource and play the first response information.

[0153] D16. The video ringback tone media service sends the first media resource and the first response information to the calling terminal device.

[0154] The calling terminal device displays the first media resource and the first response information on the screen.

[0155] D17. The called terminal device goes off-hook.

[0156] D18. The video ringback tone application service instructs the video ringback tone media service to stop playing media resources and stop receiving audio.

[0157] D19. The calling terminal device and the called terminal device renegotiate and connect the call.
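The message flow among the four logical network elements in steps D5 to D16 can be sketched as follows. This is a simplified illustration under stated assumptions: speech recognition is stubbed by passing text directly, semantic understanding is a single hard-coded rule, and all class names, method names and the `build_platform` helper are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

WAKE_WORD = "xiao cai xiao cai"  # hypothetical, from the examples


@dataclass
class Intent:
    action: str
    keywords: Dict[str, float] = field(default_factory=dict)


class DecisionService:
    """Playback decision service (D13-D14): first tag match wins."""

    def __init__(self, library: Dict[str, List[str]]) -> None:
        self.library = library  # resource id -> content tags

    def query(self, intent: Intent) -> Optional[str]:
        for resource_id, tags in self.library.items():
            if any(keyword in tags for keyword in intent.keywords):
                return resource_id
        return None


class MediaService:
    """Media service (D5-D9, D16). `sent` records what goes to the caller."""

    def __init__(self) -> None:
        self.interaction: Optional["InteractionService"] = None
        self.sent: List[str] = []

    def on_audio(self, transcript: str) -> None:
        text = " ".join(transcript.lower().replace(",", " ").split())
        if not text.startswith(WAKE_WORD) or self.interaction is None:
            return  # D6: no wake-up keyword, nothing is processed
        self.sent.append("text:" + transcript)  # D8
        self.interaction.on_text(text)          # D9

    def switch(self, resource_id: str, response: str) -> None:
        self.sent.append("media:" + resource_id)   # D16
        self.sent.append("response:" + response)


class ApplicationService:
    """Application service (D12, D15)."""

    def __init__(self, decision: DecisionService, media: MediaService) -> None:
        self.decision = decision
        self.media = media

    def process(self, intent: Intent, response: str) -> None:
        if intent.action != "switch":
            return
        resource_id = self.decision.query(intent)
        if resource_id is not None:
            self.media.switch(resource_id, response)


class InteractionService:
    """Intelligent interactive service (D10-D11) with a toy rule."""

    def __init__(self, application: ApplicationService) -> None:
        self.application = application

    def on_text(self, text: str) -> None:
        if "animals" in text:
            intent = Intent("switch", {"animal": 1.0})
            self.application.process(intent, "Video switched for you.")


def build_platform(library: Dict[str, List[str]]) -> MediaService:
    """Wire the four services together and return the media service."""
    media = MediaService()
    application = ApplicationService(DecisionService(library), media)
    media.interaction = InteractionService(application)
    return media
```

Keeping the four elements as separate classes reflects the statement above that they may be deployed on one physical network element or on several; in this sketch they simply call each other in-process.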

[0158] In this embodiment, interactive functionality is achieved based on intelligent voice during video ringback tone playback. Real-time interaction based on user voice is enabled during playback, providing a more user-friendly and engaging experience. Since this solution does not rely on terminals or networks and can be implemented solely through the video ringback tone platform, it is easy to promote and expands the application scope of the service.

[0159] Based on the foregoing embodiments, another communication scenario of this application embodiment will be described next. Please refer to Figure 5, which is a schematic diagram of another communication scenario of this application embodiment. In another possible implementation, the media platform server in this application embodiment includes any one or more of the following logical network elements: video ringback tone playback decision service, video ringback tone intelligent interaction service, video ringback tone application service, or video ringback tone media service.

[0160] Based on the media platform server illustrated in Figure 5, please refer to Figure 6, which is a schematic flowchart of an embodiment of a media resource playback method according to this application. When the media platform server includes: a video ringback tone application service, a video ringback tone media service, a video ringback tone playback decision service, or a video ringback tone intelligent interactive service, the media resource playback method proposed in this application includes:

[0161] F1. The calling terminal device sends a call request to the called terminal device.

[0162] F2. The calling terminal device and the called terminal device negotiate the call and media resources (ringback tone).

[0163] F3. The video ringback tone application service notifies the video ringback tone media service to play media resources; the notification carries a voice recognition identifier.

[0164] F4. The video ringback tone media service sends default media resources and voice interaction prompts to the calling terminal device.

[0165] Correspondingly, the calling terminal device plays the default media resource (i.e., the default video ringback tone), and simultaneously displays a voice interaction prompt message overlaid on the default video ringback tone. This voice interaction prompt message is shown in Figure 15, for example.

[0166] F5. The calling terminal device sends the first audio information to the video ringback tone media service.

[0167] F6. The video ringback tone media service detects whether the first audio information includes a wake-up keyword.

[0168] Optionally, if a wake-up keyword is included, the current call is marked as being in a voice wake-up state, and speech recognition is performed on the complete audio information of the calling terminal device to convert it into text information.
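Marking a call as being in a voice wake-up state can be sketched as a small per-call state object. This is an assumption-laden illustration: it treats utterances as already transcribed text, it assumes that once the call is awake later utterances are forwarded without repeating the keyword, and it assumes that audio is no longer accepted after the called party answers. The class `CallSession` and its fields are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

WAKE_WORD = "xiao cai xiao cai"  # hypothetical, from the examples


@dataclass
class CallSession:
    awake: bool = False     # voice wake-up state of the current call
    answered: bool = False  # set when the called party goes off-hook
    transcripts: List[str] = field(default_factory=list)

    def on_utterance(self, text: str) -> bool:
        """Return True when the utterance is forwarded for recognition."""
        if self.answered:
            return False  # assumed: audio is not received after answer
        normalized = " ".join(text.lower().replace(",", " ").split())
        if not self.awake:
            if not normalized.startswith(WAKE_WORD):
                return False  # ignored until the keyword is heard
            self.awake = True
        self.transcripts.append(normalized)
        return True

    def on_answer(self) -> None:
        """End the ringing phase and leave the wake-up state."""
        self.answered = True
        self.awake = False
```

Holding the flag per call keeps one caller's wake-up state from affecting another call handled by the same media service.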

[0169] F7. The video ringback tone media service performs speech recognition processing on the first audio information to generate the first text information.

[0170] F8. The video ringback tone media service sends the first text information and the temporary response information to the calling terminal device.

[0171] Correspondingly, the calling terminal device plays the default media resource (i.e., the default video ringback tone), and simultaneously displays a temporary response message overlaid on the default video ringback tone. This temporary response message is illustrated in Figure 17, which is a schematic diagram of another scenario in an embodiment of this application. The temporary response message includes "Processing, please wait...".

[0172] F9. The video ringback tone media service reports the first text information to the video ringback tone intelligent interactive service.

[0173] F10, the video ringback tone intelligent interactive service performs semantic understanding processing on the first text information to generate the first user intent information, and determines the first response information based on the first user intent information.

[0174] F11, The video ringback tone intelligent interactive service sends an interactive processing request to the video ringback tone application service. The interactive processing request is used to request the playback of new media resources. The interactive processing request includes first user intent information, first response information and / or first text information.

[0175] F12, the video ringback tone application service sends a media resource query request to the video ringback tone playback decision service, and the request carries the first user intent information.

[0176] F13. The video ringback tone playback decision service determines the first media resource based on the media resource query request.

[0177] F14, the video ringback tone playback decision service sends the identification information of the first media resource to the video ringback tone application service.

[0178] In another possible implementation, the video ringback tone application service selects the next video ringback tone (which serves as the first media resource) based on the content of the existing playback rule list, and notifies the video ringback tone media service to switch to playing the first media resource.

[0179] F15. The video ringback tone application service notifies the video ringback tone media service to switch to playing the first media resource and to play the first response information.

[0180] F16, the video ringback tone media service sends the first media resource and the first response information to the calling terminal device.

[0181] Correspondingly, the calling terminal device plays the first media resource on its screen and overlays the first response information. This first response information is illustrated in Figure 18, which is a schematic diagram of another scenario in an embodiment of this application. The first response information includes "Master, the video playback has been switched for you."

[0182] F17. The calling terminal device sends the second audio information to the video ringback tone media service.

[0183] F18. The video ringback tone media service detects whether the second audio information includes a wake-up keyword.

[0184] Optionally, step F18 can be skipped and step F19 can be executed directly, because the current call has already been marked as voice wake-up state in step F6.

[0185] F19, the video ringback tone media service performs speech recognition processing on the second audio information to generate the second text information.

[0186] F20. The video ringback tone media service sends the second text information and a temporary response message to the calling terminal device.

[0187] F21, The video ringback tone media service reports the second text information to the video ringback tone intelligent interactive service.

[0188] F22. The video ringback tone intelligent interactive service performs semantic understanding processing on the second text information to generate second user intent information, and determines the second response information based on the second user intent information.

[0189] F23. The video ringback tone intelligent interactive service sends an interactive processing request to the video ringback tone application service. The interactive processing request is used to request the playback of new media resources. The interactive processing request includes second user intent information, second response information and / or second text information.

[0190] F24. The video ringback tone application service sends a media resource query request to the video ringback tone playback decision service. This request carries the second user intent information.

[0191] F25. The video ringback tone playback decision service determines the second media resource based on the media resource query request.

[0192] The video ringback tone application service (AS) selects the next video ringback tone based on the existing playback rule list, notifies the video ringback tone media service to switch playback, and has the response message for the voice interaction result overlaid on the video shown to the end user.

[0193] F26, The video ringback tone playback decision service sends the identification information of the second media resource to the video ringback tone application service.

[0194] F27. The video ringback tone application service notifies the video ringback tone media service to switch to playing the second media resource and play the second response information.

[0195] F28. The video ringback tone media service sends the second media resource and the second response information to the calling terminal device.

[0196] Accordingly, after receiving the second media resource, the calling terminal device stops playing the first media resource, then plays the second media resource and overlays the second response information.

[0197] F29, The called terminal device goes off-hook.

[0198] F30. The video ringback tone application service instructs the video ringback tone media service to stop playing media resources and stop receiving audio.

[0199] F31. The calling terminal and the called terminal renegotiate and reconnect the call.
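The wake-up handling in steps F6, F18 and F30 can be illustrated with a minimal sketch. It is not the implementation of this application: audio information is modelled as an already transcribed string, and the names `CallSession`, `handle_audio`, `on_called_party_answer` and the keyword "hello ringback" are hypothetical, since the source does not specify them.

```python
from dataclasses import dataclass, field

# Hypothetical wake-up keyword; the source does not name one.
WAKE_KEYWORD = "hello ringback"


@dataclass
class CallSession:
    """Per-call state kept by the (hypothetical) media service."""
    awake: bool = False                       # the "voice wake-up state" mark set in F6
    recognized: list = field(default_factory=list)


def handle_audio(session: CallSession, utterance: str):
    """Process one piece of audio information, modelled here as its transcript.

    Returns the text forwarded for semantic understanding, or None when the
    audio is ignored because the call has not been woken up yet.
    """
    text = utterance.strip().lower()
    if not session.awake:
        # F6: keyword detection only runs while the call is not yet awake.
        if WAKE_KEYWORD not in text:
            return None
        session.awake = True
        # Forward only the command part that accompanies the keyword.
        text = text.replace(WAKE_KEYWORD, "", 1).strip(" ,.")
    # F7 / F19: the recognition result is recorded and passed on.
    # Because the call is already marked awake, F18 is effectively skipped.
    session.recognized.append(text)
    return text


def on_called_party_answer(session: CallSession) -> None:
    """F29 to F30: once the called party goes off-hook, stop accepting audio."""
    session.awake = False
```

In this sketch the second utterance needs no keyword, which mirrors the option of skipping step F18 once the call has been marked in step F6.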

[0200] Based on the foregoing embodiments, another communication scenario of this application embodiment will be described below. Please refer to Figure 7, which is a schematic diagram of another communication scenario of this application embodiment. In another possible implementation, the media platform server in this application embodiment includes any one or more of the following logical network elements: video ringback tone playback decision service, video ringback tone application service, video ringback tone intelligent interaction service, or video ringback tone media service. The video ringback tone intelligent interaction service includes: video ringback tone interactive response processing service and video ringback tone intelligent semantic understanding service. The video ringback tone media service includes: video ringback tone interactive speech recognition service and video ringback tone media playback service.

[0201] For the media platform server illustrated in Figure 7, please refer to Figure 8, which is a schematic flowchart of an embodiment of a media resource playback method according to this application. When the media platform server includes: a video ringback tone playback decision service, a video ringback tone application service, a video ringback tone intelligent interactive service, or a video ringback tone media service, the media resource playback method proposed in this application includes:

[0202] G1. The calling terminal device sends a call request to the called terminal device.

[0203] G2. The calling terminal equipment and the called terminal equipment negotiate the call and media resources (ringback tone).

[0204] G3, the video ringback tone application service identifies whether the calling terminal device is allowed to perform intelligent voice interaction.

[0205] G4. The video ringback tone application service notifies the video ringback tone media playback service to play media resources; the notification carries a voice recognition identifier.

[0206] G5, the video ringback tone media playback service sends default media resources and voice interaction prompts to the calling terminal device.

[0207] G6. The calling terminal device sends the fourth audio information (without carrying the wake-up keyword) to the video ringback tone media playback service.

[0208] G7, The video ringback tone media playback service identifies that the fourth audio information does not carry a wake-up keyword.

[0209] G8. The calling terminal device sends third audio information (carrying wake-up keywords) to the video ringback tone media playback service.

[0210] G9. The video ringback tone media playback service identifies that the third audio information carries a wake-up keyword.

[0211] G10. The video ringback tone media playback service requests the video ringback tone interactive voice recognition service to wake up.

[0212] The video ringback tone media playback service parses the caller's voice media stream (the third audio information) and identifies any wake-up keyword. It then notifies the video ringback tone interactive voice recognition service to wake up and uploads the third audio information to it. Finally, it marks the call as being in the voice wake-up state.

[0213] G11. The video ringback tone media playback service sends the third audio information to the video ringback tone interactive voice recognition service.

[0214] G12. The video ringback tone interactive voice recognition service performs voice recognition processing on the third audio information to generate the third text information.

[0215] G13. The video ringback tone interactive voice recognition service sends the third text information and temporary response information to the calling terminal device.

[0216] G14. The video ringback tone interactive voice recognition service reports the third text information to the video ringback tone intelligent semantic understanding service.

[0217] G15. The video ringback tone intelligent semantic understanding service performs semantic understanding processing on the third text information to generate the third user intent information.

[0218] G16. The video ringback tone intelligent semantic understanding service sends the third user intent information to the video ringback tone interactive response processing service.

[0219] G17. The video ringback tone interactive response processing service determines the third response information based on the third user intent information.

[0220] G18. The video ringback tone interactive response processing service sends a first interactive processing request to the video ringback tone application service. The first interactive processing request includes third user intent information, third response information and / or third text information. The third response information includes a waiting voice prompt.

[0221] G19. The video ringback tone application service sends the third response information to the video ringback tone media playback service. The waiting voice prompt carried in the third response information is used to notify the user that the service is waiting for voice interaction input.

[0222] G20. The video ringback tone media playback service sends the third response information to the calling terminal device.

[0223] Since the third audio information includes only the wake-up keyword, the video ringback tone media playback service sends the third response information to the calling terminal device. The waiting voice prompt carried in the third response information is used to notify the user that the service is waiting for voice interaction input. This waiting voice prompt is shown, for example, in Figure 13.

[0224] G21. The calling terminal device sends the first audio information to the video ringback tone media playback service.

[0225] G22, the video ringback tone media playback service sends the first audio information to the video ringback tone interactive voice recognition service.

[0226] G23, the video ringback tone interactive voice recognition service performs voice recognition processing on the first audio information to generate the first text information.

[0227] G24. The video ringback tone interactive voice recognition service sends the first text information and a temporary response message to the calling terminal device.

[0228] G25, the video ringback tone interactive voice recognition service reports the first text information to the video ringback tone intelligent semantic understanding service.

[0229] G26, The intelligent semantic understanding service for video ringback tones performs semantic understanding processing on the first text information to generate the first user intent information.

[0230] G27, The intelligent semantic understanding service for video ringback tones sends the first user intent information to the interactive response processing service for video ringback tones.

[0231] G28, The video ringback tone interactive response processing service determines the first response information based on the first user intent information.

[0232] G29. The video ringback tone interactive response processing service sends a second interactive processing request to the video ringback tone application service. The second interactive processing request includes first user intent information, first response information and / or first text information. The first user intent information includes switching playback media resources.

[0233] G30. The video ringback tone application service notifies the video ringback tone media playback service to switch to playing the first media resource.

[0234] G31, the video ringback tone media playback service sends the first media resource and the first response information to the calling terminal device.

[0235] G32, The called terminal device goes off-hook.

[0236] G33. The video ringback tone application service instructs the video ringback tone media playback service to stop playing media resources and stop receiving audio.

[0237] G34. The calling terminal and the called terminal renegotiate and reconnect the call.
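The chain of services in steps G6 to G31 can be sketched as a small pipeline. This is only an illustration under stated assumptions: the functions `recognize`, `understand` and `respond` are hypothetical stand-ins for the interactive voice recognition, intelligent semantic understanding and interactive response processing services, audio is modelled as text, and the wake-up keyword, the waiting prompt wording, the keyword-matching rules and the fallback reply are invented for the example. The switch response text is the one quoted for the first response information.

```python
WAKE_KEYWORD = "hello ringback"     # hypothetical; not named in the source
WAITING_PROMPT = "I'm listening"    # hypothetical wording of the waiting voice prompt


def recognize(audio: str) -> str:
    """Stand-in for speech recognition (G12 / G23): audio is already text here."""
    return audio.strip().lower()


def understand(text: str) -> str:
    """Stand-in for semantic understanding (G15 / G26); keyword rules are assumed."""
    command = text.replace(WAKE_KEYWORD, "").strip(" ,.")
    if not command:
        return "wake_only"          # the audio carried only the wake-up keyword
    if "switch" in command or "next" in command:
        return "switch_media"
    return "unknown"


def respond(intent: str) -> dict:
    """Stand-in for interactive response processing (G17 / G28)."""
    if intent == "wake_only":
        return {"response": WAITING_PROMPT, "switch": False}
    if intent == "switch_media":
        return {"response": "Master, the video playback has been switched for you.",
                "switch": True}
    return {"response": "Sorry, I did not understand.", "switch": False}


class MediaPlaybackService:
    """Hypothetical media playback service holding the per-call wake-up mark."""

    def __init__(self, playlist):
        self.playlist = list(playlist)
        self.index = 0
        self.awake = False

    def on_audio(self, audio: str):
        if not self.awake:
            if WAKE_KEYWORD not in audio.lower():
                return None         # G7: no wake-up keyword, audio is ignored
            self.awake = True       # G9 to G10: mark the call as woken up
        result = respond(understand(recognize(audio)))
        if result["switch"]:
            # G30: switch to the next media resource in the playlist.
            self.index = (self.index + 1) % len(self.playlist)
        return {"media": self.playlist[self.index], "response": result["response"]}
```

Fed with audio that has no keyword, then keyword-only audio, then a switch command, the sketch returns nothing, then the waiting prompt over the default resource, then the next resource together with the switch response, which follows the order of the fourth, third and first audio information in the flow above.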

[0238] Based on the foregoing embodiments, another communication scenario of this application embodiment will be introduced next. Please refer to Figure 9, which is a schematic diagram of another communication scenario of this application embodiment. In another possible implementation, the media platform server in this application embodiment includes any one or more of the following logical network elements: video ringback tone operation management service, or video ringback tone call playback platform, wherein the video ringback tone call playback platform includes: video ringback tone intelligent interactive service, video ringback tone playback decision service, and video ringback tone application service.

[0239] For the media platform server illustrated in Figure 9, please refer to Figure 10, which is a schematic flowchart of an embodiment of a media resource playback method according to this application. When the media platform server includes: a video ringback tone operation management service and a video ringback tone call playback platform, the media resource playback method proposed in this application includes:

[0240] H1. The calling terminal device sends a call request to the called terminal device.

[0241] H2. The calling terminal equipment and the called terminal equipment negotiate the call and media resources (ringback tone).

[0242] H3. The video ringback tone application service identifies whether the calling terminal device is allowed to perform intelligent voice interaction.

[0243] H4. The video ringback tone application service notifies the video ringback tone media service to play media resources; the notification carries a voice recognition identifier.

[0244] H5. The video ringback tone media service sends the default media resource and a voice interaction prompt to the calling terminal device.

[0245] H6. The calling terminal device sends the first audio information (carrying the wake-up keyword and the instruction to copy and subscribe to the ringback tone) to the video ringback tone media service.

[0246] H7. The video ringback tone media service detects whether the first audio information includes a wake-up keyword.

[0247] H8. The video ringback tone media service performs speech recognition processing on the first audio information to generate the first text information.

[0248] H9. The video ringback tone media service sends the first text information and a temporary response message to the calling terminal device.

[0249] H10, the video ringback tone media service reports the first text information to the video ringback tone intelligent interactive service.

[0250] H11, the video ringback tone intelligent interactive service performs semantic understanding processing on the first text information to generate the first user intent information, and determines the first response information based on the first user intent information.

[0251] H12, the intelligent interactive service for video ringback tones sends the first user intent information to the video ringback tones operation and management service.

[0252] H13. Based on the first user intent information, the video ringback tone operation management service copies the currently playing media resource to the calling terminal device.

[0253] H14. The video ringback tone intelligent interactive service sends an interactive processing request to the video ringback tone application service. The interactive processing request carries first response information, which includes media resource subscription results.

[0254] H15. The video ringback tone application service notifies the video ringback tone media service to play the first response information.

[0255] H16. The video ringback tone media service sends the first response information to the calling terminal device.

[0256] Correspondingly, the screen of the calling terminal device displays the first response information, as shown in Figure 14.

[0257] H17. The calling terminal device sends the second audio information to the video ringback tone media service.

[0258] H18. The video ringback tone media service detects whether the second audio information includes a wake-up keyword.

[0259] H19. The video ringback tone media service performs speech recognition processing on the second audio information to generate the second text information.

[0260] H20. The video ringback tone media service sends the second text information and a temporary response message to the calling terminal device.

[0261] H21. The video ringback tone media service reports the second text information to the video ringback tone intelligent interactive service.

[0262] H22, the video ringback tone intelligent interactive service performs semantic understanding processing on the second text information to generate second user intent information, and determines the second response information based on the second user intent information. The second response information indicates that the second audio information is not understood.

[0263] H23. The video ringback tone intelligent interactive service sends an interactive processing request to the video ringback tone application service. The interactive processing request includes second user intent information, second response information and / or second text information. The second response information indicates that the second audio information is not understood.

[0264] H24. The video ringback tone application service notifies the video ringback tone media service to play the second response information.

[0265] H25. The video ringback tone media service sends the second response information to the calling terminal device.

[0266] Correspondingly, the screen of the calling terminal device displays the second response information; the second response information is shown, for example, in Figure 16.

[0267] H26. The called terminal device goes off-hook.

[0268] H27. The video ringback tone application service instructs the video ringback tone media service to stop playing media resources and stop receiving audio.

[0269] H28. The calling terminal and the called terminal renegotiate and reconnect the call.
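The two outcomes of the flow above (a copy-and-subscribe instruction in H6 to H16, and an utterance that is not understood in H17 to H25) can be sketched as a simple intent dispatch. This is a hedged illustration rather than the method of this application: the class `OperationManagementService`, the functions `parse_intent` and `handle_text`, the keyword rules and the response strings are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class OperationManagementService:
    """Hypothetical store of subscriptions: caller id -> list of media ids."""
    subscriptions: dict = field(default_factory=dict)

    def copy_subscribe(self, caller_id: str, media_id: str) -> bool:
        """H13: copy the currently playing media resource to the caller.

        Returns False when the caller already holds this resource.
        """
        owned = self.subscriptions.setdefault(caller_id, [])
        if media_id in owned:
            return False
        owned.append(media_id)
        return True


def parse_intent(text: str) -> str:
    """Stand-in for semantic understanding (H11 / H22); keyword rules are assumed."""
    lowered = text.lower()
    if "copy" in lowered or "subscribe" in lowered:
        return "copy_subscribe"
    return "not_understood"


def handle_text(text: str, caller_id: str, current_media: str,
                oms: OperationManagementService):
    """Return (intent, response information) for one recognized utterance."""
    intent = parse_intent(text)
    if intent == "copy_subscribe":
        added = oms.copy_subscribe(caller_id, current_media)
        # H14: the response information carries the subscription result.
        return intent, "Subscription succeeded" if added else "Already subscribed"
    # H22 to H23: the response indicates that the audio was not understood.
    return intent, "Sorry, I did not understand that"
```

Keeping the subscription store separate from intent parsing mirrors the split between the operation management service and the intelligent interactive service in this scenario.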

[0270] Based on the foregoing embodiments, the application scenarios of the method provided in this application will be described below. The application scenario of the method provided in this application can be illustrated in Figure 19. Figure 19 is a schematic diagram of an application scenario in this application embodiment, which includes: user 001 and electronic device 002.

[0271] User 001: the user can interact with the electronic device 002 through gestures, voice, and the like to open an application on the electronic device (or a functional module in the application, a mini program, a quick app, etc.).

[0272] Electronic device 002: equipped with an operating system and built-in system-level applications; the user can also install or uninstall applications as needed. Electronic device 002 has a screen on which user 001 can operate applications. Electronic device 002 typically has a large screen, as in a tablet or foldable phone.

[0273] In addition to the tablet computers or foldable phones mentioned above, the electronic devices in this application embodiment can also be non-foldable phones, smartwatches, smart glasses, smart bracelets, portable game consoles, personal digital assistants (PDAs), laptops, ultra-mobile personal computers (UMPCs), handheld computers, netbooks, in-vehicle media playback devices, wearable electronic devices (e.g., watches, bracelets, glasses), virtual reality (VR) terminal devices, augmented reality (AR) terminal devices, and other digital display products. The electronic device 002 can be the calling terminal device and / or the called terminal device in this application embodiment.

[0274] Please refer to Figure 20, which is a schematic diagram of the structure of an electronic device provided in an embodiment of this application. As shown in Figure 20, the electronic device may include a processor 210, an external memory interface 220, an internal memory 221, a universal serial bus (USB) interface 230, a charging management module 240, a power management module 241, a battery 242, an antenna 1, an antenna 2, a mobile communication module 250, a wireless communication module 260, an audio module 270, a speaker 270A, a receiver 270B, a microphone 270C, a headphone jack 270D, a sensor module 280, buttons 290, a motor 291, an indicator 292, a camera 293, a display screen 294, and a subscriber identification module (SIM) card interface 295, etc. The sensor module 280 may include a pressure sensor 280A, a gyroscope sensor 280B, a barometric pressure sensor 280C, a magnetic sensor 280D, an accelerometer sensor 280E, a distance sensor 280F, a proximity light sensor 280G, a fingerprint sensor 280H, a temperature sensor 280J, a touch sensor 280K, an ambient light sensor 280L, a bone conduction sensor 280M, etc.

[0275] It is understood that the structure illustrated in this embodiment does not constitute a specific limitation on the electronic device. In other embodiments, the electronic device may include more or fewer components than illustrated, or combine some components, or split some components, or have different component arrangements. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.

[0276] Processor 210 may include one or more processing units, such as: application processor (AP), modem processor, graphics processing unit (GPU), image signal processor (ISP), controller, memory, video codec, digital signal processor (DSP), baseband processor, and / or neural network processing unit (NPU), etc. Different processing units may be independent devices or integrated into one or more processors.

[0277] A controller can be the nerve center and command center of an electronic device. Based on the instruction opcode and timing signals, the controller generates operation control signals to control the fetching and execution of instructions.

[0278] The processor 210 may also include a memory for storing instructions and data. In some embodiments, the memory in the processor 210 is a cache memory. This memory can store instructions or data that the processor 210 has just used or that are used repeatedly. If the processor 210 needs to use the instruction or data again, it can directly retrieve it from the memory. This avoids repeated accesses, reduces the waiting time of the processor 210, and thus improves the efficiency of the system.

[0279] In some embodiments, the processor 210 may include one or more interfaces. Interfaces may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver / transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input / output (GPIO) interface, a subscriber identity module (SIM) interface, and / or a universal serial bus (USB) interface, etc.

[0280] It is understood that the interface connection relationships between the modules illustrated in this embodiment are merely illustrative and do not constitute a structural limitation on the electronic device. In other embodiments, the electronic device may also employ different interface connection methods or combinations of multiple interface connection methods as described in the above embodiments.

[0281] The charging management module 240 receives charging input from a charger. The charger can be a wireless charger or a wired charger. In some wired charging embodiments, the charging management module 240 receives charging input from the wired charger via a USB interface 230. In some wireless charging embodiments, the charging management module 240 receives wireless charging input via the wireless charging coil of the electronic device. While charging the battery 242, the charging management module 240 can also supply power to the electronic device via the power management module 241.

[0282] The power management module 241 connects the battery 242, the charging management module 240, and the processor 210. The power management module 241 receives input from the battery 242 and / or the charging management module 240, providing power to the processor 210, internal memory 221, external memory, display screen 294, camera 293, and wireless communication module 260. The power management module 241 can also monitor parameters such as battery capacity, battery cycle count, and battery health status (leakage current, impedance). In some other embodiments, the power management module 241 may also be located within the processor 210. In other embodiments, the power management module 241 and the charging management module 240 may be housed in the same device.

[0283] The wireless communication function of electronic devices can be realized through antenna 1, antenna 2, mobile communication module 250, wireless communication module 260, modem processor and baseband processor, etc.

[0284] Antenna 1 and antenna 2 are used to transmit and receive electromagnetic wave signals. Each antenna in the electronic device can be used to cover one or more communication frequency bands. Different antennas can also be reused to improve antenna utilization. For example, antenna 1 can be reused as a diversity antenna for a wireless local area network. In some other embodiments, the antennas can be used in conjunction with a tuning switch.

[0285] The mobile communication module 250 can provide solutions for wireless communication applications including 2G / 3G / 4G / 5G in electronic devices. The mobile communication module 250 may include at least one filter, switch, power amplifier, low noise amplifier (LNA), etc. The mobile communication module 250 can receive electromagnetic waves via antenna 1, and perform filtering, amplification, and other processing on the received electromagnetic waves before transmitting them to a modem processor for demodulation. The mobile communication module 250 can also amplify the signal modulated by the modem processor and convert it into electromagnetic waves for radiation via antenna 1. In some embodiments, at least some functional modules of the mobile communication module 250 may be housed in processor 210. In some embodiments, at least some functional modules of the mobile communication module 250 and at least some modules of the processor 210 may be housed in the same device.

[0286] The modem processor may include a modulator and a demodulator. The modulator modulates the low-frequency baseband signal to be transmitted into a mid-to-high frequency signal. The demodulator demodulates the received electromagnetic wave signal into a low-frequency baseband signal. The demodulator then transmits the demodulated low-frequency baseband signal to the baseband processor for processing. After processing by the baseband processor, the low-frequency baseband signal is transmitted to the application processor. The application processor outputs sound signals through an audio device (not limited to speaker 270A, receiver 270B, etc.) or displays images or videos through the display screen 294. In some embodiments, the modem processor may be a separate device. In other embodiments, the modem processor may be independent of the processor 210 and may be housed in the same device as the mobile communication module 250 or other functional modules.

[0287] The wireless communication module 260 can provide solutions for wireless communication applications in electronic devices, including wireless local area networks (WLANs) (such as wireless fidelity (Wi-Fi) networks), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), and infrared (IR) technologies. The wireless communication module 260 can be one or more devices integrating at least one communication processing unit. The wireless communication module 260 receives electromagnetic waves via antenna 2, performs frequency modulation and filtering of the electromagnetic wave signals, and sends the processed signal to processor 210. The wireless communication module 260 can also receive signals to be transmitted from processor 210, perform frequency modulation and amplification, and convert them into electromagnetic waves for radiation via antenna 2.

[0288] In some embodiments, antenna 1 of the electronic device is coupled to mobile communication module 250, and antenna 2 is coupled to wireless communication module 260, enabling the electronic device to communicate with networks and other devices via wireless communication technology. The wireless communication technology may include Global System for Mobile Communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Time-Division Synchronous Code Division Multiple Access (TD-SCDMA), Long Term Evolution (LTE), BT, GNSS, WLAN, NFC, FM, and / or IR technologies. The GNSS may include Global Positioning System (GPS), Global Navigation Satellite System (GLONASS), BeiDou Navigation Satellite System (BDS), Quasi-Zenith Satellite System (QZSS), and / or Satellite Based Augmentation Systems (SBAS).

[0289] Electronic devices implement display functions through a GPU, a display screen 294, and an application processor. The GPU is a microprocessor for image processing, connecting the display screen 294 and the application processor. The GPU is used to perform mathematical and geometric calculations and for graphics rendering. The processor 210 may include one or more GPUs, which execute program instructions to generate or modify display information.

[0290] Display screen 294 is used to display images, videos, etc. Display screen 294 includes a display panel. The display panel can be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a MiniLED, a MicroLED, a Micro-OLED, a quantum dot light-emitting diode (QLED), etc.

[0291] Optionally, the display screen 294 displays the running interface of the application on the electronic device in a visual window manner. In this embodiment, the display screen displays the running interface of a specific type of application (SMS software, instant messaging software) in a dual-page display manner, displaying two interfaces on one display screen, and the two interfaces can be parent and child pages of each other.

[0292] It is understandable that if the electronic device is a foldable screen phone, then the display screen 294 can also be called a foldable screen. It can be folded into two screens, left and right, along the vertical folding edge, or into two screens, top and bottom, along the horizontal folding edge, etc.

[0293] It should be noted that the electronic devices (including inward-folding electronic devices and outward-folding electronic devices) in the embodiments of this application form at least two screens after being folded. These can be multiple independent screens, or a single complete screen with an integrated structure that is merely folded into at least two parts.

[0294] For example, a foldable screen can be a flexible foldable screen. A flexible foldable screen includes folding edges made of a flexible material. Part or all of the flexible foldable screen is made of a flexible material. When folded, the flexible foldable screen forms at least two integrated screens, which are essentially a single, complete screen structure, simply folded into at least two parts.

[0295] For example, the foldable screen of this electronic device can be a multi-screen foldable screen. This multi-screen foldable screen can include multiple (two or more) screens. These multiple screens are multiple individual displays. These multiple screens can be connected sequentially via folding axes. Each screen can rotate about the folding axis to which it is connected, thus achieving the folding of the multi-screen foldable screen.

[0296] The electronic device can implement shooting functions through the ISP, camera 293, video codec, GPU, display screen 294, and application processor.

[0297] The ISP (Image Signal Processor) is used to process data fed back from the camera 293. For example, when taking a picture, the shutter is opened, and light is transmitted through the lens to the camera's photosensitive element. The light signal is converted into an electrical signal, and the camera's photosensitive element transmits the electrical signal to the ISP for processing, transforming it into an image visible to the naked eye. The ISP can also perform algorithmic optimization on image noise, brightness, and skin tone. The ISP can also optimize parameters such as exposure and color temperature of the shooting scene. In some embodiments, the ISP can be set in the camera 293.

[0298] Camera 293 is used to capture still images or videos. An object generates an optical image through the lens, and the image is projected onto a photosensitive element. The photosensitive element can be a charge-coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The photosensitive element converts the light signal into an electrical signal, which is then passed to an ISP for conversion into a digital image signal. The ISP outputs the digital image signal to a DSP for processing. The DSP converts the digital image signal into image signals in standard RGB, YUV, or other formats. In some embodiments, the electronic device may include one or N cameras 293, where N is a positive integer greater than 1.

[0299] The camera 293 can also be used to provide personalized and contextualized business experiences to users based on the perceived external environment and user actions. Specifically, the camera 293 can acquire rich and accurate information, enabling the electronic device to perceive the external environment and user actions. In this embodiment, the camera 293 can be used to identify whether the user of the electronic device is a first user or a second user.

[0300] Digital signal processors (DSPs) are used to process digital signals. Besides digital image signals, they can also process other digital signals. For example, when an electronic device is selecting a frequency, a DSP can perform a Fourier transform on the frequency energy.

[0301] Video codecs are used to compress or decompress digital video. Electronic devices can support one or more video codecs. This allows the electronic device to play or record video in various encoded formats, such as Moving Picture Experts Group (MPEG) 1, MPEG2, MPEG3, MPEG4, etc.

[0302] An NPU (Neural Processing Unit) is a computational processor for neural networks (NNs). By borrowing the structure of biological neural networks, such as the transmission patterns between neurons in the human brain, it can rapidly process input information and continuously learn on its own. NPUs enable intelligent cognitive applications in electronic devices, such as image recognition, facial recognition, speech recognition, and text understanding.

[0303] The external storage interface 220 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the electronic device. The external memory card communicates with the processor 210 through the external storage interface 220 to perform data storage functions. For example, music, video, and other files can be saved on the external memory card.

[0304] Internal memory 221 can be used to store computer executable program code, which includes instructions. Processor 210 executes various functional applications and data processing of the electronic device by running the instructions stored in internal memory 221. For example, in this embodiment, processor 210 can display corresponding content on display screen 294 in response to user operation by executing instructions stored in internal memory 221. Internal memory 221 may include a program storage area and a data storage area. The program storage area may store the operating system, at least one application program required for a function (such as sound playback function, image playback function, etc.), etc. The data storage area may store data created during the use of the electronic device (such as audio data, phone book, etc.). In addition, internal memory 221 may include high-speed random access memory, and may also include non-volatile memory, such as at least one disk storage device, flash memory device, universal flash storage (UFS), etc.

[0305] The electronic device can implement audio functions, such as music playback and recording, through the audio module 270, speaker 270A, receiver 270B, microphone 270C, headphone jack 270D, and the application processor.

[0306] Audio module 270 is used to convert digital audio information into an analog audio signal output, and also to convert analog audio input into a digital audio signal. Audio module 270 can also be used for encoding and decoding audio signals. In some embodiments, audio module 270 may be located in processor 210, or some functional modules of audio module 270 may be located in processor 210. Speaker 270A, also called a "loudspeaker," is used to convert audio electrical signals into sound signals. The user can listen to music or hands-free calls on the electronic device through speaker 270A. Receiver 270B, also called an "earpiece," is used to convert audio electrical signals into sound signals. When the electronic device answers a phone call or plays a voice message, the receiver 270B can be brought close to the user's ear to hear the voice. Microphone 270C, also called a "mic" or "voice transducer," is used to convert sound signals into electrical signals. When making a phone call, sending a voice message, or triggering certain functions of the electronic device through a voice assistant, the user can bring their mouth close to microphone 270C to speak, inputting sound signals into microphone 270C. The electronic device may have at least one microphone 270C. In some embodiments, the electronic device may be equipped with two microphones 270C, which, in addition to collecting sound signals, can also perform noise reduction. In other embodiments, the electronic device may be equipped with three, four or more microphones 270C, which can collect sound signals, reduce noise, identify the sound source, and perform directional recording, etc.

[0307] The headphone jack 270D is used to connect wired headphones. The headphone jack 270D can be the USB interface 230, a 3.5 mm Open Mobile Terminal Platform (OMTP) standard interface, or a Cellular Telecommunications Industry Association of the USA (CTIA) standard interface.

[0308] Pressure sensor 280A is used to sense pressure signals and convert them into electrical signals. In some embodiments, pressure sensor 280A can be disposed on display screen 294. There are many types of pressure sensors 280A, such as resistive pressure sensors, inductive pressure sensors, and capacitive pressure sensors. A capacitive pressure sensor may include at least two parallel plates with conductive material. When force is applied to pressure sensor 280A, the capacitance between the electrodes changes. The electronic device determines the pressure intensity based on the change in capacitance. When a touch operation is applied to display screen 294, the electronic device detects the intensity of the touch operation based on pressure sensor 280A. The electronic device can also calculate the touch position based on the detection signal from pressure sensor 280A. In some embodiments, touch operations applied to the same touch position but with different touch operation intensities can correspond to different operation commands. For example, when a touch operation with an intensity less than the pressure threshold is applied to the SMS application icon, a command to view an SMS is executed. When a touch operation with an intensity greater than or equal to the pressure threshold is applied to the SMS application icon, a command to create a new SMS is executed.
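The threshold behavior described for the SMS application icon can be sketched as follows. This is a minimal illustration only: the normalized intensity scale, the value of `PRESSURE_THRESHOLD`, and the command names are assumptions, since the text does not specify them.

```python
# Hypothetical normalized pressure threshold; the text gives no concrete value.
PRESSURE_THRESHOLD = 0.5


def sms_icon_command(intensity: float, threshold: float = PRESSURE_THRESHOLD) -> str:
    """Map the intensity of a touch on the SMS icon to an operation command.

    Below the threshold the SMS is viewed; at or above it a new SMS is created.
    """
    if intensity < threshold:
        return "view_sms"
    return "create_sms"
```

A light tap such as `sms_icon_command(0.2)` selects the view command, while a press at or above the threshold selects the create command.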

[0309] The gyroscope sensor 280B can be used to determine the motion attitude of an electronic device. In some embodiments, the gyroscope sensor 280B can determine the angular velocity of the electronic device around three axes (i.e., the x, y, and z axes). The gyroscope sensor 280B can be used for image stabilization. For example, when the shutter is pressed, the gyroscope sensor 280B detects the angle of the electronic device's shake, calculates the distance the lens module needs to compensate based on the angle, and allows the lens to counteract the shake of the electronic device by moving in the opposite direction, thus achieving image stabilization. The gyroscope sensor 280B can also be used in navigation and motion-sensing game scenarios. In addition, the gyroscope sensor 280B can also be used to measure the rotation amplitude or movement distance of an electronic device.

[0310] The barometric pressure sensor 280C is used to measure air pressure. In some embodiments, the electronic device calculates altitude using the air pressure value measured by the barometric pressure sensor 280C to assist in positioning and navigation.
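The text does not say how altitude is derived from the measured air pressure. One common approach, shown here purely as an assumption, is the international barometric formula with a standard sea-level reference pressure of 1013.25 hPa; the function name is hypothetical.

```python
# Standard sea-level pressure in hPa; an assumed reference, not taken from the text.
SEA_LEVEL_HPA = 1013.25


def altitude_from_pressure(pressure_hpa: float, sea_level_hpa: float = SEA_LEVEL_HPA) -> float:
    """Estimate altitude in meters using the international barometric formula.

    h = 44330 * (1 - (p / p0) ** (1 / 5.255))
    """
    return 44330.0 * (1.0 - (pressure_hpa / sea_level_hpa) ** (1.0 / 5.255))
```

In practice the reference pressure varies with the weather, so an estimate of this kind is usually only an aid to positioning, which matches the assisting role described above.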

[0311] The magnetic sensor 280D includes a Hall effect sensor. The electronic device can use the magnetic sensor 280D to detect the opening and closing of a flip case. In some embodiments, when the electronic device is a flip phone, the electronic device can detect the opening and closing of the flip cover using the magnetic sensor 280D. Furthermore, based on the detected opening and closing state of the case or the flip cover, features such as automatic unlocking on flip-open can be configured.

[0312] The acceleration sensor 280E can detect the magnitude of acceleration of the electronic device in various directions (typically three axes). When the electronic device is stationary, it can detect the magnitude and direction of gravity. It can also be used to identify the posture of the electronic device, for applications such as screen orientation switching and pedometers. Additionally, the acceleration sensor 280E can be used to measure the orientation (i.e., the direction vector) of the electronic device.

[0313] The distance sensor 280F is used to measure distance. Electronic devices can measure distance using infrared or laser. In some embodiments, during a shooting scene, the electronic device can utilize the distance sensor 280F to measure distance for rapid focusing.

[0314] The proximity sensor 280G may include, for example, a light-emitting diode (LED) and a light detector, such as a photodiode. The LED may be an infrared LED. The electronic device emits infrared light outward through the LED. The electronic device uses the photodiode to detect infrared reflected light from nearby objects. When sufficient reflected light is detected, it can be determined that an object is near the electronic device. When insufficient reflected light is detected, the electronic device can determine that no object is near the electronic device. The electronic device can use the proximity sensor 280G to detect when a user holds the electronic device close to their ear for a call, so as to automatically turn off the screen to save power. The proximity sensor 280G can also be used in holster mode and pocket mode for automatic unlocking and locking of the screen.

[0315] The ambient light sensor 280L is used to detect ambient light levels. Electronic devices can adaptively adjust the brightness of their displays 294 based on the detected ambient light. The ambient light sensor 280L can also be used to automatically adjust white balance when taking photos. The ambient light sensor 280L can also be used in conjunction with the proximity sensor 280G to detect whether electronic devices are in a pocket, preventing accidental touches.

[0316] The fingerprint sensor 280H is used to collect fingerprints. Electronic devices can utilize the characteristics of the collected fingerprints to achieve fingerprint unlocking, app access locks, fingerprint photography, fingerprint answering of calls, etc.

[0317] Temperature sensor 280J is used to detect temperature. In some embodiments, the electronic device uses the temperature detected by temperature sensor 280J to execute a temperature handling strategy. For example, when the temperature reported by temperature sensor 280J exceeds a threshold, the electronic device reduces the performance of a processor located near temperature sensor 280J to reduce power consumption and implement thermal protection. In other embodiments, when the temperature is below another threshold, the electronic device heats battery 242 to prevent abnormal shutdown of the electronic device due to low temperature. In still other embodiments, when the temperature is below yet another threshold, the electronic device boosts the output voltage of battery 242 to prevent abnormal shutdown due to low temperature.
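The three temperature handling strategies can be sketched together as one decision function. The threshold values and action names below are hypothetical, since the text only refers to "a threshold", "another threshold", and "yet another threshold"; the embodiments are also described separately, so combining them is an illustrative choice.

```python
from typing import List


def thermal_actions(
    temp_c: float,
    high_threshold: float = 45.0,   # assumed value: throttle above this
    heat_threshold: float = 0.0,    # assumed value: heat battery below this
    boost_threshold: float = -10.0, # assumed value: boost voltage below this
) -> List[str]:
    """Return the temperature handling actions for a reported temperature."""
    actions: List[str] = []
    if temp_c > high_threshold:
        # Reduce processor performance to cut power consumption (thermal protection).
        actions.append("throttle_processor")
    if temp_c < heat_threshold:
        # Heat the battery to avoid abnormal shutdown at low temperature.
        actions.append("heat_battery")
    if temp_c < boost_threshold:
        # Boost the battery output voltage to avoid abnormal shutdown.
        actions.append("boost_battery_voltage")
    return actions
```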

[0318] Touch sensor 280K, also known as a "touch panel," can be located on display screen 294. The touch sensor 280K and display screen 294 together form a touchscreen. Touch sensor 280K detects touch operations applied to or near it. The touch sensor can transmit the detected touch operation to the application processor to determine the type of touch event. Visual output related to the touch operation can be provided through display screen 294. In other embodiments, touch sensor 280K may also be located on the surface of the electronic device, in a different position than display screen 294.

[0319] The bone conduction sensor 280M can acquire vibration signals. In some embodiments, the bone conduction sensor 280M can acquire vibration signals from the vibrating bone segments of the human vocal cords. The bone conduction sensor 280M can also contact the human pulse to receive blood pressure signals. In some embodiments, the bone conduction sensor 280M can also be incorporated into headphones to form bone conduction headphones. The audio module 270 can parse the voice signals from the vibrating bone segments of the vocal cords acquired by the bone conduction sensor 280M to realize voice functionality. The application processor can parse heart rate information from the blood pressure signals acquired by the bone conduction sensor 280M to realize heart rate detection functionality.

[0320] Buttons 290 include a power button, volume buttons, etc. Buttons 290 can be mechanical buttons or touch-sensitive buttons. The electronic device can receive button input and generate key signal inputs related to user settings and function control of the electronic device.

[0321] The electronic device utilizes various sensors in the sensor module 280, buttons 290, and / or cameras 293, etc.

[0322] Motor 291 can generate vibration alerts. Motor 291 can be used for incoming call vibration alerts or for touch vibration feedback. For example, touch operations applied to different applications (such as taking photos, playing audio, etc.) can correspond to different vibration feedback effects. Touch operations applied to different areas of the display screen 294 can also cause motor 291 to produce different vibration feedback effects. Different application scenarios (such as time reminders, receiving messages, alarm clocks, games, etc.) can also correspond to different vibration feedback effects. The touch vibration feedback effect can also be customized.

[0323] Indicator 292 can be an indicator light, which can be used to indicate charging status, power changes, messages, missed calls, notifications, etc.

[0324] The SIM card interface 295 is used to connect a SIM card. The SIM card can be inserted into or removed from the SIM card interface 295 to establish contact with the electronic device. The electronic device can support one or N SIM card interfaces, where N is a positive integer greater than 1. The SIM card interface 295 can support Nano SIM cards, Micro SIM cards, and other SIM cards. Multiple cards can be inserted into the same SIM card interface 295 simultaneously. The types of cards can be the same or different. The SIM card interface 295 is also compatible with different types of SIM cards. The SIM card interface 295 is also compatible with external memory cards. The electronic device interacts with the network through the SIM card to achieve functions such as calls and data communication. In some embodiments, the electronic device uses an eSIM, i.e., an embedded SIM card. The eSIM card can be embedded in the electronic device and cannot be separated from it.

[0325] The methods described in the foregoing embodiments can all be implemented in the electronic device 002 having the aforementioned hardware structure. For ease of understanding, the following description uses a mobile phone with a foldable screen in a fully unfolded state as an example. Whether the mobile phone includes a foldable screen, the foldable screen's shape, and the number of screens after folding are not limited here.

[0326] The media resource playback method in the embodiments of this application has been described above. The electronic device in the embodiments of this application is described below. Please refer to Figure 21. Figure 21 shows another embodiment of the electronic device in the embodiments of this application, including:

[0327] In one example, the electronic device is applied to a calling terminal device, and the electronic device includes:

[0328] Transceiver unit 2101 is used to send a call request to the called terminal device;

[0329] The transceiver unit 2101 is also used to send first audio information, which is used to request the playback of media resources;

[0330] The transceiver unit 2101 is also used to receive and play a first media resource, which is determined by the first audio information.
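The three operations of the transceiver unit on the calling side can be sketched as below. This is an illustrative sketch only: `FakeMediaPlatform`, `CallingTerminal`, and their methods are hypothetical names, and the platform is a stand-in that simply echoes the request rather than a real network endpoint.

```python
from typing import List, Optional, Tuple


class FakeMediaPlatform:
    """Stand-in for the media platform server; purely illustrative."""

    def __init__(self) -> None:
        self.log: List[Tuple[str, object]] = []

    def call(self, callee: str) -> None:
        self.log.append(("call", callee))

    def send_audio(self, audio: bytes) -> str:
        # A real server would determine the resource from the audio content.
        self.log.append(("audio", audio))
        return "media-for-" + audio.decode("utf-8")


class CallingTerminal:
    def __init__(self, platform: FakeMediaPlatform) -> None:
        self.platform = platform
        self.now_playing: Optional[str] = None

    def place_call_and_request(self, callee: str, first_audio: bytes) -> str:
        self.platform.call(callee)                      # send the call request
        media = self.platform.send_audio(first_audio)   # send first audio information
        self.now_playing = media                        # receive and play the first media resource
        return media
```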

[0331] In one possible implementation,

[0332] The transceiver unit 2101 is also used to play first response information, which is response information generated based on the first audio information, and includes audio information, image information, animation information and / or text information.

[0333] In one possible implementation, the first response information includes:

[0334] First text information, which is text information generated by speech recognition processing based on the first audio information, wherein the content of the first text information corresponds to the first audio information.

[0335] In one possible implementation,

[0336] The transceiver unit 2101 is also used to send second audio information;

[0337] The transceiver unit 2101 is further configured to receive a response to the second audio information, the response to the second audio information including:

[0338] A second media resource, which is determined by the second audio information;

[0339] And / or, second response information, which is response information generated based on the second audio information, and the second response information includes audio information, image information, animation information and / or text information.

[0340] In one possible implementation,

[0341] The transceiver unit 2101 is also used to stop playing the first media resource;

[0342] The transceiver unit 2101 is also used to play the second media resource;

[0343] And / or, play the second response information.
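How the calling terminal might apply a response to the second audio information can be sketched as follows. The `Player` class, the dictionary keys `media` and `info`, and the function name are assumptions made for illustration; the text only states that the response may carry a second media resource and / or second response information.

```python
from typing import Dict, List, Optional


class Player:
    """Minimal playback state for the calling terminal (hypothetical)."""

    def __init__(self) -> None:
        self.current: Optional[str] = None
        self.shown: List[str] = []

    def play(self, media: str) -> None:
        self.current = media

    def stop(self) -> None:
        self.current = None


def apply_response(player: Player, response: Dict[str, str]) -> None:
    """Apply a response that may contain a media resource and / or response information."""
    media = response.get("media")
    info = response.get("info")
    if media is not None:
        player.stop()        # stop playing the first media resource
        player.play(media)   # play the second media resource
    if info is not None:
        player.shown.append(info)  # play (present) the second response information
```

In this sketch a response carrying only response information leaves the current media resource playing.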

[0344] In one possible implementation,

[0345] The transceiver unit 2101 is also used to receive off-hook messages sent by the called terminal device;

[0346] Processing unit 2102 is configured to stop playing the first media resource in response to the off-hook message;

[0347] The processing unit 2102 is also used to stop the recording of audio information and / or stop the speech recognition processing of the audio information.
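The off-hook handling can be sketched as a small state change. The `CallSession` class and its fields are hypothetical; this sketch stops both recording and speech recognition, although the text allows either one or both.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CallSession:
    """Illustrative state of the calling terminal before the callee answers."""

    now_playing: Optional[str] = "first_media_resource"
    recording: bool = True
    recognizing: bool = True

    def on_off_hook(self) -> None:
        """Handle the off-hook message sent by the called terminal device."""
        self.now_playing = None   # stop playing the first media resource
        self.recording = False    # stop recording audio information
        self.recognizing = False  # stop speech recognition processing
```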

[0348] In one possible implementation,

[0349] The transceiver unit 2101 is also used to send the first audio information, the first audio information including a wake-up keyword, the wake-up keyword being used to trigger the media platform server to perform speech recognition processing on the first audio information;

[0350] or,

[0351] The transceiver unit 2101 is also used to send audio information including the wake-up keyword;

[0352] The transceiver unit 2101 is also used to send the first audio information.

[0353] In another possible implementation, the electronic device is applied to a media platform server, including:

[0354] The transceiver unit 2101 is also used to receive a call request sent by the calling terminal device;

[0355] The transceiver unit 2101 is also used to receive the first audio information sent by the calling terminal device;

[0356] The processing unit 2102 is further configured to determine the first media resource based on the first audio information;

[0357] The transceiver unit 2101 is also used to send the first media resource to the calling terminal device.

[0358] In one possible implementation,

[0359] The processing unit 2102 is further configured to perform speech recognition processing based on the first audio information to generate first text information, wherein the content of the first text information corresponds to the first audio information.

[0360] The processing unit 2102 is further configured to perform semantic understanding processing based on the first text information to generate first user intent information;

[0361] The processing unit 2102 is further configured to determine the first media resource based on the first user intent information.
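The three server-side steps (speech recognition, semantic understanding, media resource determination) can be sketched as a pipeline. Everything below is a stand-in under stated assumptions: `recognize_speech` pretends the audio bytes are already UTF-8 text, `understand` uses naive keyword splitting, and `LIBRARY` is a made-up media resource library. A real system would use actual speech recognition and language understanding components.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical media resource library keyed by content feature keyword.
LIBRARY: Dict[str, str] = {"jazz": "jazz_clip.mp4", "rock": "rock_clip.mp4"}
STOP_WORDS = {"play", "pause", "some", "the", "a"}


def recognize_speech(audio: bytes) -> str:
    """Stub speech recognition: treats the audio payload as UTF-8 text."""
    return audio.decode("utf-8")


def understand(text: str) -> Dict[str, object]:
    """Stub semantic understanding: derive an action and content feature keywords."""
    words: List[str] = text.lower().split()
    action = "pause" if "pause" in words else "play"
    keywords = [w for w in words if w not in STOP_WORDS]
    return {"action": action, "keywords": keywords}


def determine_media(intent: Dict[str, object]) -> Optional[str]:
    """Pick the first library entry matching a content feature keyword."""
    for keyword in intent["keywords"]:
        if keyword in LIBRARY:
            return LIBRARY[keyword]
    return None


def handle_first_audio(audio: bytes) -> Tuple[str, Dict[str, object], Optional[str]]:
    text = recognize_speech(audio)     # first text information
    intent = understand(text)          # first user intent information
    return text, intent, determine_media(intent)  # first media resource
```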

[0362] In one possible implementation,

[0363] The processing unit 2102 is further configured to generate first response information based on the first user intent information, wherein the first response information includes audio information, image information, animation information and / or text information;

[0364] The transceiver unit 2101 is also used to send the first response information to the calling terminal device.

[0365] In one possible implementation, the first response information includes the first text information.

[0366] In one possible implementation, the first user intent information includes any one or more of the following:

[0367] Start playing media resources, pause playing media resources, switch playing media resources, switch back to playing media resources, copy and subscribe to media resources, content feature keywords of the first audio information, or the weight of the content feature keywords.

[0368] In one possible implementation,

[0369] Processing unit 2102 is further configured to determine the first media resource based on the decision recommendation model and the content feature keywords and / or the weights of the content feature keywords included in the first user intent information, wherein,

[0370] The decision recommendation model uses a set of parameters to determine media resources. The set of parameters includes any one or more of the following: content feature keywords of the media resource, weights of the content feature keywords of the media resource, media resource tags of the media resource library, weights of the media resource tags of the media resource library, popularity weights of the media resources of the media resource library, release time of the media resources of the media resource library, or playback rate of the media resources of the media resource library, wherein the media resource library includes one or more media resources.
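The text does not disclose how the decision recommendation model combines its parameters. As one possible reading, the sketch below scores each media resource by a weighted match between the content feature keywords of the user intent and the tags of the resource, plus a small popularity term. The scoring formula, the `popularity_coeff` value, and all names are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MediaResource:
    name: str
    tags: Dict[str, float]      # media resource tag -> tag weight
    popularity: float = 0.0     # popularity weight


def score(resource: MediaResource, intent_keywords: Dict[str, float],
          popularity_coeff: float = 0.1) -> float:
    """Weighted keyword/tag match plus a popularity bonus (assumed formula)."""
    match = sum(weight * resource.tags.get(keyword, 0.0)
                for keyword, weight in intent_keywords.items())
    return match + popularity_coeff * resource.popularity


def recommend(library: List[MediaResource],
              intent_keywords: Dict[str, float]) -> MediaResource:
    """Return the highest-scoring media resource in the library."""
    return max(library, key=lambda r: score(r, intent_keywords))
```

Other listed parameters, such as release time or playback rate, could be added as further terms in `score` in the same way.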

[0371] In one possible implementation,

[0372] The processing unit 2102 is further configured to detect whether the first audio information includes a wake-up keyword;

[0373] The processing unit 2102 is further configured to trigger speech recognition processing based on the first audio information if the first audio information includes the wake-up keyword;

[0374] Alternatively, the processing unit 2102 is further configured to trigger speech recognition processing of the first audio information upon detecting that received audio information includes the wake-up keyword.
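The two wake-up variants (keyword carried inside the first audio information, or keyword carried in separate audio information) can be sketched as a small gate. For simplicity the sketch operates on text transcripts rather than audio, and the wake word, class name, and method name are hypothetical.

```python
WAKE_WORD = "hello ringtone"  # hypothetical wake-up keyword


class WakeGate:
    """Decides whether speech recognition should be triggered for an utterance."""

    def __init__(self, wake_word: str = WAKE_WORD) -> None:
        self.wake_word = wake_word
        self.armed = False  # True after a keyword-only utterance was received

    def should_recognize(self, utterance: str) -> bool:
        text = utterance.lower()
        if self.wake_word in text:
            remainder = text.replace(self.wake_word, "").strip()
            if remainder:
                # Variant 1: keyword and request arrive in the same audio information.
                return True
            # Variant 2: keyword alone; arm the gate for the next audio information.
            self.armed = True
            return False
        if self.armed:
            self.armed = False
            return True
        return False
```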

[0375] In one possible implementation,

[0376] The transceiver unit 2101 is also used to receive second audio information sent by the calling terminal device;

[0377] Processing unit 2102 is further configured to generate a response to the second audio information based on the second audio information.

[0378] The response to the second audio information includes:

[0379] A second media resource, which is determined by the second audio information;

[0380] And / or, second response information, the second response information being response information generated based on the second audio information, the second response information including audio information, image information, animation information and / or text information;

[0381] The transceiver unit 2101 is also used to send a response of the second audio information to the calling terminal device.

[0382] In one possible implementation,

[0383] The processing unit 2102 is further configured to perform speech recognition processing based on the second audio information to generate second text information, wherein the content of the second text information corresponds to the second audio information;

[0384] The processing unit 2102 is further configured to perform semantic understanding processing based on the second text information to generate second user intent information;

[0385] The processing unit 2102 is further configured to generate a response of the second audio information based on the second user intent information.

[0386] Referring to Figure 22, a schematic diagram of another electronic device provided in this application is shown. This electronic device may include a processor 2201, a memory 2202, and a communication port 2203. The processor 2201, memory 2202, and communication port 2203 are interconnected via lines. The memory 2202 stores program instructions and data.

[0387] The memory 2202 stores the program instructions and data corresponding to the steps executed by the calling terminal device and / or media platform server in the embodiments shown in Figures 2 to 18 above.

[0388] Processor 2201 is configured to perform the steps shown by the calling terminal device and / or media platform server in any of the embodiments shown in Figures 2 to 18.

[0389] Communication port 2203 can be used to receive and send data, and to perform the steps related to acquisition, transmission and reception in any of the embodiments shown in Figures 2 to 18 above.

[0390] In one implementation, the electronic device may include more or fewer components than shown in Figure 22. This application is merely illustrative and not intended to limit the scope of the invention.

[0391] In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces, or indirect coupling or communication connection between apparatuses or units, and may be electrical, mechanical, or other forms.

[0392] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0393] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated units described above can be implemented wholly or partially through software, hardware, firmware, or any combination thereof.

[0394] When the integrated unit is implemented using software, it can be implemented wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of the present invention are generated. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions can be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions can be transmitted from one website, computer, server, or data center to another via wired (e.g., coaxial cable, fiber optic, digital subscriber line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.) means. The computer-readable storage medium can be any available medium that a computer can access or a data storage device such as a server or data center that integrates one or more available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid-state disk (SSD)).

Claims

1. A method for playing media resources, characterized in that the method is applied to a calling terminal device, and the method includes: sending a call request to a called terminal device; sending first audio information, which is used to request the playback of media resources; and receiving and playing a first media resource, which is determined by the first audio information.

2. The method according to claim 1, characterized in that the method further includes: playing first response information, which is response information generated based on the first audio information, wherein the first response information includes audio information, image information, animation information and / or text information.

3. The method according to claim 1 or 2, characterized in that the first response information includes: first text information, which is text information generated by speech recognition processing based on the first audio information, wherein the content of the first text information corresponds to the first audio information.

4. The method according to any one of claims 1-3, characterized in that the method further includes: sending second audio information; and receiving a response to the second audio information, the response to the second audio information including: a second media resource, which is determined by the second audio information; and / or second response information, which is response information generated based on the second audio information, wherein the second response information includes audio information, image information, animation information and / or text information.

5. The method according to claim 4, characterized in that the method further includes: stopping playback of the first media resource; playing the second media resource; and/or playing the second response information.

6. The method according to any one of claims 1-5, characterized in that the method further includes: receiving an off-hook message sent by the called terminal device; in response to the off-hook message, stopping playback of the first media resource; and stopping recording of audio information and/or stopping speech recognition processing of the audio information.

7. The method according to any one of claims 1-6, characterized in that sending the first audio information includes: sending the first audio information, wherein the first audio information includes a wake-up keyword, and the wake-up keyword is used to trigger the media platform server to perform speech recognition processing on the first audio information; or sending audio information that includes the wake-up keyword, and sending the first audio information.

8. A method for playing media resources, characterized in that the method is applied to a media platform server, and the method includes: receiving a call request sent by a calling terminal device; receiving first audio information sent by the calling terminal device; determining a first media resource based on the first audio information; and sending the first media resource to the calling terminal device.

9. The method according to claim 8, characterized in that determining the first media resource based on the first audio information includes: performing speech recognition processing based on the first audio information to generate first text information, wherein the content of the first text information corresponds to the first audio information; performing semantic understanding processing based on the first text information to generate first user intent information; and determining the first media resource based on the first user intent information.

10. The method according to claim 9, characterized in that the method further includes: generating first response information based on the first user intent information, wherein the first response information includes audio information, image information, animation information and/or text information; and sending the first response information to the calling terminal device.

11. The method according to claim 9 or 10, characterized in that the first response information includes the first text information.

12. The method according to any one of claims 9-11, characterized in that the first user intent information includes any one or more of the following: starting playback of media resources, pausing playback of media resources, switching the media resource being played, switching back to a media resource, copying and subscribing to media resources, content feature keywords of the first audio information, or weights of the content feature keywords.

13. The method according to claim 12, characterized in that determining the first media resource based on the first user intent information includes: determining the first media resource based on a decision recommendation model and on the content feature keywords and/or the weights of the content feature keywords included in the first user intent information, wherein the decision recommendation model uses a set of parameters to determine media resources, and the set of parameters includes any one or more of the following: content feature keywords of the media resource, weights of the content feature keywords of the media resource, media resource tags of the media resource library, weights of the media resource tags of the media resource library, popularity weights of the media resources of the media resource library, release time of the media resources of the media resource library, or playback rate of the media resources of the media resource library, wherein the media resource library includes one or more media resources.

14. The method according to any one of claims 9-13, characterized in that the method further includes: detecting whether the first audio information includes a wake-up keyword, and if the first audio information includes the wake-up keyword, triggering the speech recognition processing based on the first audio information; or, if received audio information is detected to include the wake-up keyword, triggering the speech recognition processing based on the first audio information.

15. The method according to any one of claims 8-14, characterized in that the method further includes: receiving second audio information sent by the calling terminal device; generating a response to the second audio information based on the second audio information, wherein the response to the second audio information includes: a second media resource, wherein the second media resource is determined based on the second audio information; and/or second response information, wherein the second response information is response information generated based on the second audio information, and the second response information includes audio information, image information, animation information and/or text information; and sending the response to the second audio information to the calling terminal device.

16. The method according to any one of claims 8-15, characterized in that generating the response to the second audio information based on the second audio information includes: performing speech recognition processing based on the second audio information to generate second text information, wherein the content of the second text information corresponds to the second audio information; performing semantic understanding processing based on the second text information to generate second user intent information; and generating the response to the second audio information based on the second user intent information.

17. An electronic device, characterized by including a transceiver unit and a processing unit, wherein the transceiver unit and the processing unit enable the electronic device to perform the method according to any one of claims 1-7 or 8-16.

18. An electronic device, characterized by including a processor coupled to a memory, wherein the memory is used to store programs or instructions that, when executed by the processor, cause the electronic device to perform the method according to any one of claims 1-7 or 8-16.

19. A computer storage medium, characterized by including computer instructions that, when executed on a terminal device, cause the terminal device to perform the method according to any one of claims 1-7 or 8-16.

20. A computer program product, characterized in that, when the computer program product is run on a computer, it causes the computer to perform the method according to any one of claims 1-7 or 8-16.
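
Illustrative sketch (not part of the claims). The server-side flow recited in claims 9, 13 and 14 (wake-up keyword detection, semantic understanding to obtain user intent information, then selection of a media resource by weighted parameters) could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: speech recognition is assumed to have already produced the text, and the wake-up keyword `hello ringtone`, the names `MediaResource`, `UserIntent`, `understand`, `recommend` and `handle_transcript`, and the additive scoring rule are all hypothetical.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field

WAKE_KEYWORD = "hello ringtone"  # hypothetical wake-up keyword (assumption)
ACTIONS = {"play": "start", "pause": "pause", "switch": "switch"}  # assumed vocabulary


@dataclass
class MediaResource:
    name: str
    tags: dict[str, float]    # media resource tag -> tag weight
    popularity: float = 0.0   # popularity weight


@dataclass
class UserIntent:
    action: str                                               # e.g. "start", "pause", "switch"
    keywords: dict[str, float] = field(default_factory=dict)  # content keyword -> weight


def contains_wake_keyword(text: str) -> bool:
    """Claim 14 (sketch): only texts containing the wake-up keyword are processed."""
    return WAKE_KEYWORD in text.lower()


def understand(text: str, vocabulary: set[str]) -> UserIntent:
    """Toy stand-in for semantic understanding: find an action and weighted keywords."""
    words = re.findall(r"[a-z]+", text.lower().replace(WAKE_KEYWORD, " "))
    action = next((ACTIONS[w] for w in words if w in ACTIONS), "start")
    keywords: dict[str, float] = {}
    for w in words:
        if w in vocabulary:
            # Assumed weighting: one unit per mention of a known keyword.
            keywords[w] = keywords.get(w, 0.0) + 1.0
    return UserIntent(action, keywords)


def recommend(intent: UserIntent, library: list[MediaResource]) -> MediaResource | None:
    """Claim 13 (sketch): score each resource by keyword/tag weights plus popularity."""
    def score(res: MediaResource) -> float:
        match = sum(w * res.tags.get(kw, 0.0) for kw, w in intent.keywords.items())
        return match + res.popularity

    return max(library, key=score) if library else None


def handle_transcript(text: str, library: list[MediaResource]) -> MediaResource | None:
    """Return the selected media resource, or None if the wake-up keyword is absent."""
    if not contains_wake_keyword(text):
        return None
    vocabulary = {tag for res in library for tag in res.tags}
    return recommend(understand(text, vocabulary), library)


library = [
    MediaResource("Blue Note", {"jazz": 1.0}, popularity=0.1),
    MediaResource("Top Hit", {"pop": 1.0}, popularity=0.5),
]
selected = handle_transcript("Hello ringtone, switch to jazz.", library)
print(selected.name if selected else None)
```

The scoring here uses only two of the parameters listed in claim 13 (tag weights and popularity weights); release time and playback rate could be added as further terms in `score` under the same pattern.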