Image display method, electronic device, and storage medium

By inputting user interaction information into the image generation model and adjusting the prediction frame generation process, the problem of insufficient prediction frame quality in the existing technology is solved, higher quality prediction frame generation is achieved, and the image display effect of electronic devices is improved.

CN122297992A · Pending · Publication Date: 2026-06-30 · HONOR DEVICE CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: HONOR DEVICE CO LTD
Filing Date: 2024-12-31
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Existing technologies fail to effectively account for changes in video footage caused by user actions when generating prediction frames, resulting in poor prediction frame quality.

Method used

By acquiring user interaction information and inputting it into the image generation model, the generation process of the predicted frames is adjusted to better reflect the changes in video images caused by user actions.

Benefits of technology

The generated predicted frames are of higher quality, better match the user's operating intentions and changes in the video, and improve the image display effect of electronic devices.



Abstract

This application provides an image display method, an electronic device, and a storage medium, applicable to the field of image processing technology. In this method, after a game application is launched, the electronic device displays a second game interface after displaying a first game interface. The electronic device acquires a first real frame and user interaction information; the first real frame corresponds to the second game interface, but does not include user interface elements of the second game interface, and the user interaction information represents user actions performed before the second game interface is displayed. The electronic device inputs the user interaction information and the first real frame into an image generation model. Then, the electronic device acquires a predicted frame output by the image generation model. After the electronic device displays the second game interface, it displays a third game interface based on the predicted frame. The electronic device can consider changes in the video image caused by user actions when generating the predicted frame. Therefore, the electronic device can generate high-quality predicted frames.


Description

Technical Field

[0001] This application relates to the field of image processing technology, and in particular to an image display method, electronic device, and storage medium.

Background Technology

[0002] With the development of terminal technology, users have increasingly higher requirements for the image display of electronic devices. In gaming scenarios, high frame rate rendering is one of the factors affecting the image display effect of electronic devices. Currently, mobile phones and other electronic devices can increase the video frame rate by adding generated predicted frames between multiple real frames.

[0003] At present, how to improve the quality of predicted frames is a problem that needs to be solved.

Summary of the Invention

[0004] Based on this, embodiments of this application provide an image display method, an electronic device, and a storage medium, which can generate prediction frames based on user interaction information, thereby improving the quality of the prediction frames.

[0005] In a first aspect, embodiments of this application provide an image display method applied to electronic devices such as mobile phones and tablets. The electronic device has a game application installed. In this method, after the game application is launched, the electronic device displays a first game interface of the game application. After displaying the first game interface, the electronic device displays a second game interface. The electronic device acquires a first real frame and user interaction information; wherein the first real frame corresponds to the second game interface, the first real frame does not include user interface elements of the second game interface, and the user interaction information represents user operations before displaying the second game interface. Then, the electronic device inputs the user interaction information and the first real frame into an image generation model; so that the image generation model adjusts the reconstruction process of the first real frame according to the user interaction information, making the predicted frame output by the image generation model move closer to the direction indicated by the user interaction information. Next, the electronic device acquires the predicted frame output by the image generation model. After displaying the second game interface, the electronic device displays a third game interface based on the predicted frame.

[0006] In this embodiment, the electronic device can generate prediction frames based on user interaction information. Therefore, the electronic device can consider changes in the video frame caused by user actions when generating prediction frames. Consequently, the electronic device can generate high-quality prediction frames.

[0007] In one possible implementation of the first aspect, the second game interface includes: user interface elements and game objects; the electronic device displays the second game interface by: the electronic device rendering a first real frame including the game objects onto a first layer; the electronic device rendering the user interface elements of the second game interface onto a second layer; the electronic device merging the first layer and the second layer and then displaying the second game interface; the electronic device acquires the first real frame and user interaction information by: the electronic device acquiring the first layer and user interaction information, wherein the first layer represents the first real frame.

[0008] In this design approach, the electronic device can separate the layers of the first real frame and the user interface elements by rendering the first real frame, which includes game objects, onto the first layer and rendering the user interface elements of the second game interface onto the second layer. This ensures that the electronic device is not affected by the user interface elements when generating the prediction frame using the first real frame, thereby improving the quality of the prediction frame generated by the electronic device.

[0009] In one possible implementation of the first aspect, the user operation before displaying the second game interface includes a first user operation; the method further includes: the electronic device receiving the first user operation; and the electronic device generating a touch event corresponding to the first user operation; the electronic device acquiring the first real frame and user interaction information includes: the electronic device acquiring the touch event corresponding to the first user operation, wherein the touch event represents the user interaction information.

[0010] In this design, the electronic device uses the touch events corresponding to user actions performed before the second game interface is displayed as user interaction information, and generates predicted frames based on this information. Unlike related technologies that generate predicted frames from historical frames alone, the electronic device takes into account the changes in the video image caused by those user actions. Therefore, the generated predicted frames better match the user intent behind the user actions performed before the second game interface is displayed, and are more accurate.

[0011] In another possible implementation of the first aspect, when the first user operation is a swipe operation, the touch event includes: a swipe start event, a swipe process event, and a swipe end event; the electronic device obtains a swipe image based on the touch event; wherein the first color channel of the swipe image corresponds to the swipe start event; the second color channel of the swipe image corresponds to the swipe process event; the third color channel of the swipe image corresponds to the swipe end event; the grayscale value of the first color channel corresponds to the timestamp of the swipe start event; the grayscale value of the second color channel corresponds to the timestamp of the swipe process event; the grayscale value of the third color channel corresponds to the timestamp of the swipe end event; the electronic device inputting the user interaction information and the first real frame into the image generation model includes: the electronic device inputting the swipe image and the first real frame into the image generation model.

[0012] In this design, when the first user operation is a swipe, the electronic device generates a swipe image based on the touch event; that is, the electronic device represents the touch event using a swipe image. This swipe image uses different color channels to represent the swipe start event, swipe process event, and swipe end event, respectively, and uses the grayscale value of the same color channel to represent the timestamp of the swipe start event, swipe process event, or swipe end event corresponding to that color channel. This swipe image accurately preserves the user's interaction trajectory and operation sequence, thus making the predicted frame generated by the electronic device based on this swipe image more accurate.
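The swipe image described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the `TouchEvent` type, the `encode_swipe_image` function, the assignment of start/process/end events to channels 0/1/2, and the scaling of timestamps into the range 1 to 255 are all assumptions, since the text only states that each event type has its own color channel and that the grayscale value corresponds to the timestamp.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical mapping of event kinds to color channels (R, G, B).
CHANNEL = {"start": 0, "move": 1, "end": 2}


@dataclass
class TouchEvent:
    kind: str          # "start", "move" (swipe process), or "end"
    x: int             # column in pixels
    y: int             # row in pixels
    timestamp_ms: int  # time at which the event was generated


def encode_swipe_image(events: List[TouchEvent], width: int, height: int):
    """Return a height x width x 3 nested list of grayscale values (0-255)."""
    image = [[[0, 0, 0] for _ in range(width)] for _ in range(height)]
    t0 = min(e.timestamp_ms for e in events)
    span = max(e.timestamp_ms for e in events) - t0 or 1
    for e in events:
        # Assumed mapping: scale timestamps into 1..255 so that 0 means "no event".
        gray = 1 + round(254 * (e.timestamp_ms - t0) / span)
        image[e.y][e.x][CHANNEL[e.kind]] = gray
    return image


events = [
    TouchEvent("start", 1, 1, 1000),
    TouchEvent("move", 2, 1, 1008),
    TouchEvent("end", 3, 1, 1016),
]
img = encode_swipe_image(events, width=4, height=2)
```

Because position is carried by the pixel coordinates and order is carried by the grayscale values, a single image of this kind keeps both the trajectory and the sequence of the swipe.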

[0013] In another possible implementation of the first aspect, when the first user operation is a click operation, the touch event includes: a click event; the electronic device obtains a click image based on the touch event; wherein the first color channel of the click image corresponds to the click event; the grayscale value of the first color channel corresponds to the timestamp of the click event; the electronic device inputting the user interaction information and the first real frame into the image generation model includes: the electronic device inputting the click image and the first real frame into the image generation model.

[0014] In this design, when the first user action is a click, the electronic device obtains a click image based on the touch event; that is, the electronic device represents the touch event with a click image. The click image uses a color channel to represent the click event under the click action, and the grayscale value of that color channel represents the timestamp of the click event. This click image accurately preserves the user's interaction position and operation timestamp, thus making the predicted frame generated by the electronic device based on this click image more accurate.

[0015] In another possible implementation of the first aspect, the second game interface includes a first game object; the electronic device displays the second game interface by: the electronic device drawing a first real frame based on the velocity vector of the first game object; the electronic device merging the first real frame and user interface elements and then displaying the second game interface; the method further includes: the electronic device acquiring the velocity vector; the electronic device inputting user interaction information and the first real frame into an image generation model by: the electronic device inputting the velocity vector, user interaction information, and the first real frame into an image generation model.

[0016] In this design approach, based on user interaction information and the first real frame, the electronic device also uses the velocity vector of the game object in the first real frame as input to the image generation model. This allows the image generation model to adjust the reconstruction process of the first real frame according to the user interaction information and the velocity vector, making the predicted frame output by the image generation model move closer to the direction that better matches the user interaction information and the velocity vector, thereby further improving the quality of the predicted frame output by the image generation model.

[0017] In another possible implementation of the first aspect, the first game interface includes a first switch in a closed state; the electronic device acquires a first real frame and user interaction information, including: after the first switch in the closed state is triggered, the electronic device acquires the first real frame and user interaction information.

[0018] In this design, the electronic device acquires the first real frame and user interaction information only after the first switch, which is in the off state, is triggered. It then generates a predicted frame based on these elements. In other words, users can flexibly control whether the electronic device generates a predicted frame based on the first switch, thereby improving the user experience.

[0019] In another possible implementation of the first aspect, the image generation model includes a latent variable autoencoder, a scheduler, a denoising network, and a control network. The electronic device inputs user interaction information and a first real frame into the image generation model, including: the electronic device inputting the first real frame into the latent variable autoencoder; the electronic device acquiring the output of the latent variable autoencoder; the electronic device inputting the output of the latent variable autoencoder into the scheduler; the electronic device acquiring the noise data of the first real frame output by the scheduler; the electronic device inputting the user interaction information into the control network; the electronic device acquiring the output of the control network; and the electronic device inputting the noise data and the output of the control network into the denoising network.

[0020] In this design approach, user interaction information is used as a conditional input to the denoising network through a control network. This allows the denoising network to adjust the reconstruction process of the first real frame based on the user interaction information, making the output of the denoising network more consistent with the user interaction information, thereby improving the quality of the output of the denoising network.
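The data flow between the four components can be sketched as below. Every function here is a hypothetical stand-in with placeholder arithmetic (none of them is a real autoencoder, scheduler, or network); the sketch only shows the order in which the first real frame and the user interaction information pass through the components named in the text.

```python
import random
from typing import List

Vector = List[float]


def latent_encoder(frame: Vector) -> Vector:
    # Stand-in for the latent variable autoencoder: average neighbouring pairs.
    return [(frame[i] + frame[i + 1]) / 2 for i in range(0, len(frame) - 1, 2)]


def scheduler_add_noise(latent: Vector, noise_scale: float, rng: random.Random) -> Vector:
    # Stand-in for the scheduler: produce the noise data of the first real frame.
    return [z + noise_scale * rng.gauss(0.0, 1.0) for z in latent]


def control_network(interaction: Vector) -> Vector:
    # Stand-in for the control network: turn user interaction into a condition.
    return [0.1 * a for a in interaction]


def denoising_network(noisy: Vector, condition: Vector) -> Vector:
    # Stand-in for the denoising network: the condition shifts the result.
    return [n + c for n, c in zip(noisy, condition)]


def generate_predicted_latent(frame, interaction, noise_scale=0.0, seed=0):
    rng = random.Random(seed)
    latent = latent_encoder(frame)                          # frame -> autoencoder
    noisy = scheduler_add_noise(latent, noise_scale, rng)   # autoencoder output -> scheduler
    condition = control_network(interaction)                # interaction -> control network
    return denoising_network(noisy, condition)              # noise data + condition -> denoiser


out = generate_predicted_latent([0.0, 2.0, 4.0, 6.0], [1.0, -1.0])
```

With `noise_scale=0.0` the result depends only on the frame and the interaction vector, which makes it easy to see that changing the interaction input changes the output of the denoising step.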

[0021] In a second aspect, embodiments of this application provide a model training method applied to a training device, the method comprising:

[0022] The training device acquires training data, which includes features and labels. The features include a first real frame sample, a second real frame sample, and user interaction information samples; the labels include a third real frame sample. The first real frame sample is an image frame drawn by the game application and does not include user interface elements. The second real frame sample was drawn by the game application before the first real frame sample and does not include user interface elements. The user interaction information samples represent user actions between the timestamps corresponding to the second and first real frame samples. The training device inputs the features of the training data into an initial image generation model to obtain initial training results. The training device calculates a first loss function based on the initial training results and the labels of the training data. The training device iterates the parameters in the initial image generation model based on the first loss function until the first loss function converges, thus obtaining the image generation model.

[0023] In this embodiment, the training device generates initial training results based on user interaction information samples during training. Then, the training device calculates a first loss function based on these initial training results and the labels of the training data. This ensures that the image generation model obtained when the first loss function converges can consider user interaction information when generating prediction frames. In other words, the image generation model can consider changes in the video frame caused by user actions when generating prediction frames, thus improving the quality of the prediction frames.

[0024] In another possible implementation of the second aspect, the initial image generation model includes a latent variable autoencoder, a denoising network, and a control network; the denoising network includes a pre-trained first encoder, and the control network includes a second encoder; before the training device inputs the features of the training data into the initial image generation model to obtain the initial training result, the method further includes: the training device setting the parameters of the second encoder to be the same as the parameters of the first encoder; the training device inputting the features of the training data into the initial image generation model to obtain the initial training result includes: the training device inputting a first real frame sample into the latent variable autoencoder; the training device acquiring the output of the latent variable autoencoder; the training device inputting a second real frame sample and user interaction information samples into the control network; the training device acquiring the output of the control network; the training device inputting the output of the latent variable autoencoder and the output of the control network into the denoising network; and the training device acquiring the output of the denoising network, wherein the initial training result includes the output of the denoising network; the training device calculating a first loss function based on the initial training result and the labels of the training data includes: the training device calculating the first loss function based on the difference between the output of the denoising network and the third real frame sample; the training device iterating the parameters in the initial image generation model based on the first loss function until the first loss function converges to obtain the image generation model includes: the training device iterating the parameters of the second encoder based on the first loss function until the first loss function converges, to obtain the image generation model.
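The two training-specific steps in this implementation (copying the first encoder's parameters into the second encoder, then iterating only the second encoder) can be illustrated with a deliberately tiny toy. Each "encoder" is reduced to one scalar weight, the function name `train_control_encoder` is hypothetical, and a squared-error loss is swapped in for the loss described in the text so that plain gradient descent converges smoothly; none of this is taken from the application itself.

```python
def train_control_encoder(samples, w_first, lr=0.05, steps=200):
    """Toy training loop: w_first is frozen, only w_second is updated."""
    # Initialise the second (control) encoder from the pre-trained first encoder.
    w_second = w_first
    for _ in range(steps):
        for latent, condition, target in samples:
            # Toy "denoising network" output: frozen path plus control path.
            out = w_first * latent + w_second * condition
            error = out - target
            # Gradient of 0.5 * error**2 with respect to w_second only.
            w_second -= lr * error * condition
    return w_second


# Hypothetical samples generated with w_first = 1.0 and a "true" w_second of 2.0:
# (latent, condition, target)
samples = [(1.0, 1.0, 3.0), (2.0, 0.5, 3.0)]
learned = train_control_encoder(samples, w_first=1.0)
```

Starting the second encoder from the pre-trained weights gives the control path a sensible starting point, while leaving `w_first` untouched preserves what the pre-trained encoder already learned.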

[0025] In this design, the latent variable autoencoder (LVAE) and the control network work collaboratively in the same training loop, making the latent representation generated by the LVAE more suitable for the processing needs of the control network. The first real frame sample, the second real frame sample, and the user interaction information sample are uniformly input throughout the training process, ensuring that the initial image generation model can fully utilize this information when generating initial training results. Because the LVAE and the control network are optimized under the same objective, the overall model performance is better than training them separately, thus generating higher quality and clearer initial training results. Secondly, this method ensures that the latent representation output by the LVAE is highly matched with the needs of the control network, reducing the accumulation of errors in the latent representation. Furthermore, this joint training method of the LVAE and the control network helps the initial image generation model better learn complex conditional relationships, improving its generalization ability in diverse scenarios.

[0026] In another possible implementation of the second aspect, the first loss function is the reconstruction loss, which is the difference between the initial training result and the third real frame sample in the pixel space; or, the first loss function is the weighted sum of the reconstruction loss and the VGG feature matching loss; wherein, the VGG feature matching loss is the difference between the first feature map and the second feature map, the first feature map is the feature map extracted by the pre-trained VGG network at the target layer of the initial training result, and the second feature map is the feature map extracted by the pre-trained VGG network at the target layer of the third real frame sample.

[0027] In this design, the training device can calculate the first loss function by calculating the difference between the initial training results and the third real frame samples in the pixel space. This calculation method is simple and saves computational overhead on the training device. Alternatively, the training device can calculate the first loss function by calculating the reconstruction loss and the VGG feature matching loss. The calculation of the VGG feature matching loss considers the semantic and texture information of the initial training results at different levels, as well as the semantic and texture information of the third real frame samples at different levels. Therefore, it can make the predicted frames generated by the image generation model trained with the VGG feature matching loss more realistic and clearer in terms of detail and texture.
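The two loss options can be compared in a short sketch. It assumes that the "difference" is measured as a sum of absolute differences; `toy_features` is a hypothetical stand-in for the target layer of a pre-trained VGG network, and the weights `w_rec` and `w_feat` are illustrative, since the text does not give values for the weighted sum.

```python
from typing import List

Vector = List[float]


def l1_distance(a: Vector, b: Vector) -> float:
    # Sum of absolute element-wise differences.
    return sum(abs(x - y) for x, y in zip(a, b))


def toy_features(image: Vector) -> Vector:
    # Hypothetical stand-in for a pre-trained VGG target layer:
    # differences between neighbouring pixels (a crude "texture" feature).
    return [image[i + 1] - image[i] for i in range(len(image) - 1)]


def first_loss(pred, target, feature_fn=None, w_rec=1.0, w_feat=0.0):
    # Reconstruction loss in pixel space.
    loss = w_rec * l1_distance(pred, target)
    # Optional feature matching loss on the extracted feature maps.
    if feature_fn is not None:
        loss += w_feat * l1_distance(feature_fn(pred), feature_fn(target))
    return loss


pred = [0.0, 1.0, 3.0]     # initial training result (flattened)
target = [0.0, 2.0, 2.0]   # third real frame sample (flattened)
rec_only = first_loss(pred, target)
combined = first_loss(pred, target, toy_features, w_rec=1.0, w_feat=0.5)
```

The reconstruction-only variant needs no second network pass, which is the computational saving the text mentions; the combined variant additionally penalizes differences in the extracted features.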

[0028] In another possible implementation of the second aspect, the denoising network is a lightweight U-Net denoising network, and when the first loss function is the reconstruction loss, the first loss function satisfies:

[0029]

[0030] Where L1 represents the first loss function, f_θ′ [...]; E represents the encoder of the lightweight U-Net denoising network; D represents the decoder of the lightweight U-Net denoising network; ε represents noise; [...] represents cosine noise; o_t represents the vector corresponding to the first real frame sample; o_{<t} represents the vector corresponding to the second real frame sample, where t represents the time step; a_{<t} represents the vector corresponding to the user interaction information sample; v_t represents the vector corresponding to the velocity vector; o_{t+1} represents the vector corresponding to the third real frame sample; and "‖·‖" denotes the sum of the absolute values of the elements of a vector.

[0031] In this design approach, the first loss function considers information such as the first real frame sample, the second real frame sample, user interaction information samples, and velocity vectors, enabling the initial image generation model to fully utilize this information when generating the initial training results.

[0032] In a third aspect, embodiments of this application provide an electronic device, including a memory and one or more processors. The memory and the processors are coupled. The memory stores computer program code, which includes computer instructions. When the processors execute the computer instructions, the electronic device performs the methods of the first aspect and its possible designs, and the methods of the second aspect and its possible designs.

[0033] In a fourth aspect, embodiments of this application provide a computer-readable storage medium including computer instructions that, when executed on an electronic device, cause the electronic device to perform the methods of the first aspect and its possible designs, and the methods of the second aspect and its possible designs.

[0034] In a fifth aspect, embodiments of this application provide a chip system applied to an electronic device. The chip system includes one or more processors, which are configured to invoke computer instructions to cause the electronic device to perform the methods of the first aspect and its possible designs, and the methods of the second aspect and its possible designs.

[0035] In a sixth aspect, embodiments of this application provide a computer program product that, when run on a computer, causes the computer to perform the methods of the first aspect and its possible designs, and the methods of the second aspect and its possible designs.

[0036] It is understood that, for the beneficial effects achieved by the electronic device described in the third aspect, the computer storage medium described in the fourth aspect, the chip system described in the fifth aspect, and the computer program product described in the sixth aspect, reference may be made to the beneficial effects of the first aspect and any of its possible designs, which will not be repeated here.

Attached Figure Description

[0037] Figure 1 is a schematic diagram illustrating the principle of an image frame interpolation technique provided in an embodiment of this application;

[0038] Figure 2 is a schematic diagram of the hardware structure of an electronic device provided in an embodiment of this application;

[0039] Figure 3 is a schematic diagram of the software structure of an electronic device provided in an embodiment of this application;

[0040] Figure 4 is a schematic diagram of a diffusion model architecture provided in an embodiment of this application;

[0041] Figure 5 is a schematic diagram of an interface provided in an embodiment of this application;

[0042] Figure 6 is a timing diagram provided in an embodiment of this application;

[0043] Figure 7 is a first schematic diagram of an interactive trajectory multi-channel image representation provided in an embodiment of this application;

[0044] Figure 8 is a first flowchart of the training process of an image generation model provided in an embodiment of this application;

[0045] Figure 9 is a schematic diagram of the training architecture of an image generation model provided in an embodiment of this application;

[0046] Figure 10 is a second flowchart of the training process of an image generation model provided in an embodiment of this application;

[0047] Figure 11 is a schematic diagram of a training process for a latent variable autoencoder provided in an embodiment of this application;

[0048] Figure 12 is a schematic diagram of an architecture for training a latent variable autoencoder provided in an embodiment of this application;

[0049] Figure 13 is a flowchart of an image generation method provided in an embodiment of this application;

[0050] Figure 14 is a schematic diagram of interface interaction provided in an embodiment of this application;

[0051] Figure 15 is a second schematic diagram of an interactive trajectory multi-channel image representation provided in an embodiment of this application;

[0052] Figure 16 is a schematic diagram of sending data for display provided in an embodiment of this application;

[0053] Figure 17 is a schematic diagram of the hardware structure of another electronic device provided in an embodiment of this application;

[0054] Figure 18 is a schematic diagram of a chip system provided in an embodiment of this application.

Detailed Implementation

[0055] Hereinafter, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of this embodiment, unless otherwise stated, "a plurality of" means two or more.

[0056] In the embodiments of this application, the terms "exemplary" or "for example" are used to indicate that something is an example, illustration, or description. Any embodiment or design that is described as "exemplary" or "for example" in the embodiments of this application should not be construed as being more preferred or advantageous than other embodiments or design. Specifically, the use of terms such as "exemplary" or "for example" is intended to present the relevant concepts in a specific manner to facilitate understanding.

[0057] With the development of terminal technology, users have increasingly higher requirements for the image display of electronic devices. In gaming scenarios, high frame rate rendering is one of the factors affecting the image display effect of electronic devices. Currently, the frame rate of video is often improved by adding predicted frames (also known as interpolated frames, intermediate frames, or transition frames) between multiple real frames, for example, by inserting a predicted frame between adjacent real frames. The technique of inserting predicted frames between adjacent real frames can be called image frame interpolation technology.

[0058] In some solutions, electronic devices can generate interpolated frames using motion compensation techniques. For example, motion compensation techniques can be used to estimate the motion of adjacent real frames drawn in chronological order by a game application before interpolation, obtain displacement vectors (also called motion vectors) between adjacent real frames, and generate interpolated frames between adjacent real frames based on the displacement vectors.

[0059] For example, Figure 1 is a schematic diagram illustrating the principle of an image frame interpolation technique provided in an embodiment of this application. As shown in Figure 1, real frame N-1, real frame N, and real frame N+1 are image frames drawn by the game application in chronological order. Without frame interpolation, the electronic device displays real frame N-1, real frame N, and real frame N+1 sequentially within time period A (e.g., a time period of 0.2 seconds). That is, without frame interpolation, the electronic device displays only 3 frames within time period A; the number of displayed frames is low and, consequently, the game runs less smoothly.

[0060] Therefore, in order to improve the display frame rate of the game scene, the electronic device can generate an interpolated frame M between the real frame N-1 and the real frame N based on the real frame N-1 and the real frame N, and insert the interpolated frame M between the real frame N-1 and the real frame N.

[0061] For example, the electronic device performs motion estimation on real frame N-1 and real frame N to obtain the displacement vector of each game object in real frame N-1 between real frame N-1 and real frame N. Next, the electronic device adjusts the position of the game object according to the displacement vector corresponding to the game object, thereby obtaining the interpolated frame M between real frame N-1 and real frame N.

[0062] For example, referring again to Figure 1, the electronic device performs motion estimation on the game object 101 in real frame N-1 and the game object 101 in real frame N, obtaining the displacement vector of the game object 101 between real frame N-1 and real frame N. Then, the electronic device adjusts the position of the game object 101 from a first position to a second position based on this displacement vector. The first position is the position of the game object 101 in real frame N-1, and the second position is the position of the game object 101 in the predicted interpolated frame M.
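The position adjustment for game object 101 can be sketched as below. The coordinates are made-up values, and placing the object halfway along the displacement vector (`alpha = 0.5`) is an assumption; the text only says that the position is adjusted according to the displacement vector.

```python
from typing import Tuple

Point = Tuple[float, float]


def displacement(prev_pos: Point, next_pos: Point) -> Point:
    # Displacement (motion) vector of an object between two real frames.
    return (next_pos[0] - prev_pos[0], next_pos[1] - prev_pos[1])


def interpolate_position(prev_pos: Point, vector: Point, alpha: float = 0.5) -> Point:
    # Move the object along the displacement vector by the fraction alpha.
    return (prev_pos[0] + alpha * vector[0], prev_pos[1] + alpha * vector[1])


first_position = (10.0, 20.0)  # game object 101 in real frame N-1 (hypothetical)
position_in_n = (30.0, 24.0)   # game object 101 in real frame N (hypothetical)
vec = displacement(first_position, position_in_n)
second_position = interpolate_position(first_position, vec)  # in interpolated frame M
```

Note that this estimate uses only the two real frames, so any change in direction caused by a user action between them is not reflected in `second_position`.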

[0063] In the same way that it generates interpolated frame M, the electronic device can generate an interpolated frame M+1 between real frame N and real frame N+1, and then insert the interpolated frame M+1 between real frame N and real frame N+1. Thus, within time period A, the electronic device can sequentially display five frames: real frame N-1, interpolated frame M, real frame N, interpolated frame M+1, and real frame N+1.

[0064] As can be seen, compared with the case where image frame interpolation technology is not used, when image frame interpolation technology is used, the electronic device can display more frames of images in the same display time period (e.g., the electronic device can display 5 frames of images in time period A), thereby improving the frame rate of images displayed by the electronic device when displaying the interface of a game application.
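The frame-count comparison can be worked through with the example numbers from the text (3 real frames, 2 interpolated frames, time period A of 0.2 seconds). The helper names `interleave` and `frames_per_second` are illustrative only.

```python
def interleave(real_frames, interpolated_frames):
    # Place one interpolated frame between each pair of adjacent real frames.
    sequence = []
    for i, frame in enumerate(real_frames):
        sequence.append(frame)
        if i < len(interpolated_frames):
            sequence.append(interpolated_frames[i])
    return sequence


def frames_per_second(frame_count, period_ms):
    # Number of image updates per second over the given period.
    return frame_count * 1000 / period_ms


real = ["N-1", "N", "N+1"]
shown = interleave(real, ["M", "M+1"])
fps_without = frames_per_second(len(real), 200)   # 3 frames in 0.2 s
fps_with = frames_per_second(len(shown), 200)     # 5 frames in 0.2 s
```

For this example the displayed frame rate rises from 15 to 25 frames per second over the same 0.2-second period.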

[0065] It is understood that the frame rate of the image displayed by the electronic device in this embodiment is not the refresh rate of the electronic device's display screen, but refers to the number of times the image information is updated per second.
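
As a rough illustration of the frame-count arithmetic above, the following sketch interleaves real and interpolated frames and computes the resulting displayed frame rate. The frame labels and the duration used for time period A are illustrative assumptions only.

```python
from typing import List


def interleave(real_frames: List[str], interpolated: List[str]) -> List[str]:
    """Insert one interpolated frame between each pair of adjacent real frames."""
    if len(interpolated) != len(real_frames) - 1:
        raise ValueError("need exactly one interpolated frame per adjacent pair")
    sequence: List[str] = []
    for index, frame in enumerate(real_frames):
        sequence.append(frame)
        if index < len(interpolated):
            sequence.append(interpolated[index])
    return sequence


def displayed_frame_rate(frame_count: int, period_seconds: float) -> float:
    """Number of image updates per second within the display period."""
    return frame_count / period_seconds


frames_in_period_a = interleave(["N-1", "N", "N+1"], ["M", "M+1"])
# Assumed duration of time period A: 0.1 s (not specified in the text).
rate_without = displayed_frame_rate(3, 0.1)
rate_with = displayed_frame_rate(len(frames_in_period_a), 0.1)
```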

[0066] In some situations, user actions can cause changes in the game screen, but the motion compensation techniques mentioned above do not collect information related to user actions and therefore cannot account for such changes. Consequently, the quality of the predicted frames obtained through motion compensation techniques is poor.

[0067] In view of this, embodiments of this application provide an image display method in which, after a first application is launched, the first application generates a first real frame. Next, an electronic device can acquire user interaction information and the aforementioned first real frame, and generate a predicted frame based on the user interaction information and the aforementioned first real frame. Then, the electronic device can display the predicted frame after the first real frame.

[0068] The aforementioned user interaction information can be information input by the user to the electronic device through some user input devices (such as gamepad, keyboard, mouse, touch screen, gyroscope sensor, etc.), and the user interaction information can affect the interface display of the electronic device.

[0069] In this scheme, the electronic device can generate predicted frames based on user interaction information. This demonstrates that the electronic device can consider changes in the video frame caused by user actions when generating predicted frames. Therefore, the electronic device can generate high-quality predicted frames.

[0070] For example, the electronic device in this application embodiment may be a device with a display function, such as a mobile phone, tablet computer, desktop computer, laptop computer, handheld computer, ultra-mobile personal computer (UMPC), or augmented reality (AR) or virtual reality (VR) device. This application embodiment does not impose any special restrictions on the specific form of the electronic device.

[0071] It should be understood that in some cases, the aforementioned user input device may be a user input device independent of the electronic device. For example, if the electronic device is a laptop, the aforementioned user input device may be a keyboard, mouse, gamepad, etc. In other cases, the aforementioned user input device may be deployed on the electronic device. For example, if the electronic device is a mobile phone, the aforementioned user input device may be the mobile phone's touchscreen, gyroscope sensor, etc. This application does not impose any limitations on the specific form of the input device.

[0072] The following will take a mobile phone as an example, with the user input device being the mobile phone's touchscreen, to further illustrate the technical solution provided in this application. It should be understood that this application does not impose any limitations on the product form of the electronic device.

[0073] Figure 2 is a schematic diagram of the hardware structure of an electronic device provided in an embodiment of this application. As shown in Figure 2, taking a mobile phone as an example, the electronic device 200 may include a processor 210, an external memory interface 220, an internal memory 221, a universal serial bus (USB) interface 230, a sensor module 240, a display screen 250, and the like.

[0074] It is understood that the structure illustrated in this embodiment does not constitute a specific limitation on the electronic device. In other embodiments, the electronic device may include more or fewer components than illustrated, or combine some components, or split some components, or have different component arrangements. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.

[0075] Processor 210 may include one or more processing units, such as a central processing unit (CPU), an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, memory, a video codec, a digital signal processor (DSP), a baseband processor, and / or a neural network processing unit (NPU). These different processing units may be independent devices or integrated into one or more processors.

[0076] The controller can be the nerve center and command center of a mobile phone. Based on the instruction opcode and timing signals, the controller generates operation control signals to control the fetching and execution of instructions.

[0077] The processor 210 may also include a memory for storing instructions and data. In some embodiments, the memory in the processor 210 is a cache memory. This memory can store instructions or data that the processor 210 has just used or that are used repeatedly. If the processor 210 needs to use the instruction or data again, it can directly retrieve it from the memory. This avoids repeated accesses, reduces the waiting time of the processor 210, and thus improves the efficiency of the system.

[0078] The mobile phone implements its display function through a GPU, a display screen 250, and an application processor. The GPU is a microprocessor for image processing, connected to the display screen 250 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. The processor 210 may include one or more GPUs, which execute program instructions to generate or modify display information.

[0079] Display screen 250 is used to display images, videos, etc. Display screen 250 includes a display panel. The display panel can be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), etc.

[0080] The sensor module 240 may include a pressure sensor 240A, a fingerprint sensor 240H, a touch sensor 240K, etc.

[0081] Touch sensor 240K can be disposed on display screen 250; together, touch sensor 240K and display screen 250 constitute a touch screen, also called a "touchscreen". The touch screen can contain a grid of capacitance sensing nodes. When the electronic device determines that the capacitance value received by the capacitance sensor in at least one grid cell exceeds a capacitance threshold, a touch operation can be determined. Furthermore, the electronic device can determine the touch area corresponding to the touch operation based on the area occupied by the grid cells whose capacitance values exceed the capacitance threshold.
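
The threshold test described above can be sketched as follows. The grid values, the threshold, and the function name are hypothetical; a real touch controller performs this in firmware and typically adds filtering and calibration that are not modeled here.

```python
from typing import List, Tuple


def detect_touch(grid: List[List[float]], threshold: float) -> Tuple[bool, List[Tuple[int, int]]]:
    """Return whether a touch occurred and the (row, column) cells above the threshold.

    The list of cells approximates the touch area of the touch operation.
    """
    cells = [
        (row_index, col_index)
        for row_index, row in enumerate(grid)
        for col_index, value in enumerate(row)
        if value > threshold
    ]
    return (len(cells) > 0, cells)


# Hypothetical capacitance readings from a 3 x 3 grid of sensing nodes.
readings = [
    [0.1, 0.2, 0.1],
    [0.2, 0.9, 0.8],
    [0.1, 0.2, 0.1],
]
touched, touch_area = detect_touch(readings, threshold=0.5)
```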

[0082] The external memory interface 220 can be used to connect an external memory card, such as a Micro SD card, to expand the phone's storage capacity. The external memory card communicates with the processor 210 through the external memory interface 220 to perform data storage functions. For example, music, video, and other media files can be saved on the external memory card.

[0083] Internal memory 221 can be used to store computer-executable program code, which includes instructions. Processor 210 executes the various functional applications and data processing of the mobile phone by running the instructions stored in internal memory 221. Internal memory 221 may include a program storage area and a data storage area. The program storage area may store the operating system and the application programs required for at least one function (such as sound playback, image playback, etc.). The data storage area may store data created during mobile phone use (such as audio data, phonebook, etc.). Furthermore, internal memory 221 may include high-speed random access memory and may also include non-volatile memory, such as at least one disk storage device, flash memory device, universal flash storage (UFS), etc.

[0084] It should be noted that Figure 2 and its description are merely an example of the hardware structure of an electronic device to which the solutions provided in the embodiments of this application apply. The components shown in Figure 2 do not constitute a limitation on the solutions described in the embodiments of this application. In other embodiments, the electronic device may have more or fewer components than shown in Figure 2.

[0085] The technical solutions provided in this application can be applied to an electronic device having the hardware structure shown in Figure 2.

[0086] Next, we will further introduce the software architecture of electronic devices.

[0087] Figure 3 is a schematic diagram of the software architecture of an electronic device provided in an embodiment of this application. The software system of the electronic device can adopt a layered architecture, event-driven architecture, microkernel architecture, microservice architecture, or cloud architecture. This embodiment of the application takes the Android™ system with a layered architecture as an example to illustrate the software structure of the mobile phone.

[0088] As shown in Figure 3, a layered architecture divides software into several layers, each with a clear role and function. Layers communicate with each other through software interfaces. In some embodiments, the Android™ system is divided into at least three layers, from top to bottom: the application layer, the application framework layer, and the kernel layer.

[0089] The application layer may include a series of application packages. For example, application packages may include applications such as camera, calendar, map, video, music, SMS, gallery, calling, and navigation. In this embodiment, the application layer also includes game applications and frame interpolation services. The frame interpolation service can be used to acquire user interaction information. The application framework layer provides application programming interfaces (APIs) and programming frameworks for the applications in the application layer, supporting the operation of applications within the application layer.

[0090] The application framework layer may include a window manager, content provider, view system, resource manager, notification manager, activity manager, input manager, etc.

[0091] The window manager provides a window management service (WMS), which can be used for window management, window animation management, surface management, and as a relay station for the input system.

[0092] Content providers store and retrieve data, making that data accessible to applications. This data can include videos, images, audio, phone calls made and received, browsing history and bookmarks, phone books, etc.

[0093] A view system includes visual controls, such as controls for displaying text and controls for displaying images. View systems can be used to build applications. A display interface can consist of one or more views. For example, a display interface including a text notification icon could include views for displaying text and views for displaying images.

[0094] The resource manager provides applications with various resources, such as localized strings, icons, images, layout files, video files, and more.

[0095] The notification manager allows applications to display notifications in the status bar. These notifications can be used to deliver informational messages and can disappear automatically after a short pause, requiring no user interaction. For example, the notification manager can be used to notify users of completed downloads or message alerts. The notification manager can also display notifications as icons or scrolling text in the top status bar, such as notifications from background applications, or as dialog boxes on the screen. Examples include displaying text messages in the status bar, emitting sounds, vibrating electronic devices, and flashing indicator lights.

[0096] The Activity Manager Service (AMS) can be used to start, switch, and schedule system components (such as activities, services, content providers, and broadcast receivers), as well as manage and schedule application processes.

[0097] The input manager provides an input management service (IMS), which manages system inputs such as touchscreen input, keypad input, and sensor input. The IMS retrieves events from input device nodes and, through interaction with the window management service (WMS), distributes these events to the appropriate windows. For example, the input manager can capture touch events and send them to game applications.

[0098] The kernel layer is the layer between hardware and software. The kernel layer includes at least display drivers, video drivers, and sensor drivers.

[0099] The image display method provided in this application embodiment can be implemented in a mobile phone having the above-described hardware and software structures.

[0100] The following example, using frame interpolation, illustrates the workflow of mobile phone software and hardware.

[0101] After the game application is launched, when the touch sensor 240K detects a user touch operation, it generates a hardware interrupt and sends it to the kernel-level sensor driver (e.g., the touch sensor driver). The sensor driver then encapsulates the touch operation into a touch event (e.g., encapsulating the touch coordinates and timestamp of the touch operation into the touch event); the touch event represents user interaction information. The input manager acquires the touch event and sends it to the game application. The game application then draws a real image frame based on the touch event. The game application then displays this real image frame via a system call. Next, the frame interpolation service acquires both the touch event and the real image frame. The frame interpolation service then inputs the real image frame and the touch event into an image generation model, which outputs a predicted frame. Finally, the electronic device displays the predicted frame after displaying the real image frame.

[0102] The frame interpolation service can acquire the aforementioned touch events through listening, and it can also acquire the aforementioned real image frames through interception. For details on how the frame interpolation service acquires touch events and real image frames, please refer to the following text; these details will not be elaborated upon here.
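
The workflow above can be summarized in a minimal sketch. All class and function names are hypothetical, the model is a stub that only labels its output, and the listening and interception mechanisms are reduced to two plain method calls; this is an illustration of the ordering of steps, not the implementation described in this application.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TouchEvent:
    """User interaction information: touch coordinates and timestamp."""
    x: float
    y: float
    timestamp_ms: int


@dataclass
class Frame:
    label: str
    predicted: bool = False


# Hypothetical signature for the image generation model.
Model = Callable[[Frame, List[TouchEvent]], Frame]


def stub_model(real_frame: Frame, events: List[TouchEvent]) -> Frame:
    """Stand-in for the image generation model; it performs no real inference."""
    return Frame(label=f"predicted-after-{real_frame.label}", predicted=True)


@dataclass
class FrameInterpolationService:
    model: Model
    pending_events: List[TouchEvent] = field(default_factory=list)

    def on_touch_event(self, event: TouchEvent) -> None:
        """Called when the service observes a touch event (listening)."""
        self.pending_events.append(event)

    def on_real_frame(self, real_frame: Frame) -> List[Frame]:
        """Called when the service obtains a real frame (interception).

        Returns the display order: the real frame, then the predicted frame.
        """
        predicted = self.model(real_frame, list(self.pending_events))
        self.pending_events.clear()
        return [real_frame, predicted]


service = FrameInterpolationService(model=stub_model)
service.on_touch_event(TouchEvent(x=120.0, y=640.0, timestamp_ms=1000))
display_order = service.on_real_frame(Frame(label="real-1"))
```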

[0103] In other possible implementations, once the input manager receives a touch event, the system notifies the registered accessibility service of these touch events through the accessibility manager. The accessibility service converts the touch event into a corresponding accessibility event, which is then delivered to the game application through the system's event distribution mechanism. The game application then responds based on the received accessibility events and renders a real image frame.

[0104] In this embodiment, the electronic device inputs the acquired user interaction information into the image generation model, and the image generation model outputs a predicted frame. This means that the electronic device can consider changes in the video frame caused by user actions when generating the predicted frame. Therefore, the electronic device can generate high-quality predicted frames.

[0105] Before introducing the image display method provided in the embodiments of this application, we will first introduce the relevant content of the image generation model provided in the embodiments of this application.

[0106] Understandably, the training device can train the initial image generation model. After the training termination conditions are met (e.g., the loss function converges, or if the loss function does not converge, the number of training iterations reaches a preset number), the training process of the initial image generation model is completed, and the image generation model is obtained.
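
The two termination conditions above (the loss function converges, or the preset number of iterations is reached) can be sketched as a simple loop. The convergence test used here, a loss change below a tolerance for several consecutive iterations, is an assumption; this application does not define how convergence is measured.

```python
from typing import Callable, Optional


def train_until_done(
    step: Callable[[int], float],
    max_iterations: int,
    tolerance: float = 1e-4,
    patience: int = 3,
) -> int:
    """Run training steps and return the number of iterations performed.

    `step(iteration)` performs one training iteration and returns its loss.
    Training stops when the loss changes by less than `tolerance` for
    `patience` consecutive iterations (assumed convergence criterion),
    or when `max_iterations` is reached.
    """
    previous: Optional[float] = None
    stable = 0
    for iteration in range(1, max_iterations + 1):
        loss = step(iteration)
        if previous is not None and abs(previous - loss) < tolerance:
            stable += 1
            if stable >= patience:
                return iteration
        else:
            stable = 0
        previous = loss
    return max_iterations


# A constant loss is treated as converged after `patience` stable iterations.
stopped_early = train_until_done(lambda i: 0.5, max_iterations=100)
# A loss that keeps changing runs until the preset iteration count.
stopped_at_limit = train_until_done(lambda i: 1.0 / i, max_iterations=10)
```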

[0107] The training device can be a terminal, or other computing devices such as servers or cloud devices. For example, the training device can be a graphics processing unit (GPU), a neural network processing unit (NPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits used to control the execution of the program in this application.

[0108] As one possible implementation, the initial image generation model described above can adopt a diffusion model architecture.

[0109] For example, Figure 4 is a schematic diagram of a diffusion model architecture provided in an embodiment of this application. As shown in Figure 4, the architecture of the diffusion model can include a latent variable autoencoder, a scheduler, a denoising network (e.g., a lightweight U-Net denoising network), and a control network.

[0110] Among them, latent variable autoencoders can be used to convert input images into latent representations (also known as latent variables) in the latent space.

[0111] The scheduler can be used to add random noise to the latent variables output by the latent variable autoencoder. This random noise can be Gaussian noise. In other possible implementations, the noise can also be any of the following: uniform noise, Laplace noise, Bernoulli noise, exponential noise, salt-and-pepper noise, periodic noise, autoregressive noise, etc.
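
A minimal sketch of how a scheduler might add Gaussian noise to a latent variable is shown below. The variance-preserving mixing rule (scale the signal by sqrt(alpha_bar) and the noise by sqrt(1 - alpha_bar)) is the one commonly used in denoising diffusion models and is an assumption here; this application does not specify the noising formula. The parameter name `alpha_bar` is likewise hypothetical.

```python
import math
import random
from typing import List


def add_gaussian_noise(latent: List[float], alpha_bar: float, rng: random.Random) -> List[float]:
    """Mix a latent variable with standard Gaussian noise.

    alpha_bar = 1.0 keeps the latent unchanged; alpha_bar = 0.0 yields pure noise.
    """
    if not 0.0 <= alpha_bar <= 1.0:
        raise ValueError("alpha_bar must lie in [0, 1]")
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [signal_scale * z + noise_scale * rng.gauss(0.0, 1.0) for z in latent]


# Hypothetical latent variable produced by the latent variable autoencoder.
latent = [0.3, -1.2, 0.8]
noisy_latent = add_gaussian_noise(latent, alpha_bar=0.5, rng=random.Random(0))
```

Swapping `rng.gauss` for another sampler would correspond to the other noise types listed above, such as uniform noise.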

[0112] In this embodiment, the lightweight U-Net denoising network can be used to denoise in the latent space and generate prediction frames by combining user interaction information and scene motion information.

[0113] As shown in Figure 4, the lightweight U-Net denoising network can include an input layer, a temporal coding unit, an encoder, a decoder, and an output layer. The temporal coding unit is used to encode time steps. The encoder includes at least one encoding block, which is used to extract high-dimensional features from the input of the lightweight U-Net denoising network. The decoder can be used to restore the feature map to an image of the same size as the input image through at least one decoding block, while recovering the spatial information of the image.

[0114] In the lightweight U-Net denoising network, each encoding block in the encoder can be connected to its corresponding decoding block via a skip connection mechanism. For example, in Figure 4, encoding block A is connected to decoding block A via a skip connection mechanism. This application does not limit the number of encoding blocks and decoding blocks in the lightweight U-Net denoising network.

[0115] In this embodiment, the control network encodes and performs zero-convolution on user interaction information, and then sends the processing results to the lightweight U-Net denoising network.

[0116] As shown in Figure 4, the control network may include an attention mechanism module, an encoder, and at least one zero-convolution layer. The attention mechanism module can be used to assign weights to its inputs. The encoder of the control network may include the same encoding blocks as those in the lightweight U-Net denoising network. The zero-convolution layer is used to resize its input data by padding with zero values and then perform standard convolution operations on the resized data.
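
The zero-convolution layer as described in this paragraph (pad the input with zero values, then apply a standard convolution) can be sketched in plain Python. The kernel values and padding width below are illustrative assumptions, and the convolution is written without kernel flipping, as is conventional in neural network layers.

```python
from typing import List

Matrix = List[List[float]]


def zero_pad(data: Matrix, pad: int) -> Matrix:
    """Resize the input by surrounding it with `pad` rows and columns of zeros."""
    width = len(data[0]) + 2 * pad
    padded: Matrix = [[0.0] * width for _ in range(pad)]
    for row in data:
        padded.append([0.0] * pad + list(row) + [0.0] * pad)
    padded.extend([0.0] * width for _ in range(pad))
    return padded


def convolve2d(data: Matrix, kernel: Matrix) -> Matrix:
    """Standard sliding-window convolution with stride 1 and no extra padding."""
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(data) - kernel_h + 1
    out_w = len(data[0]) - kernel_w + 1
    return [
        [
            sum(
                data[r + i][c + j] * kernel[i][j]
                for i in range(kernel_h)
                for j in range(kernel_w)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def zero_convolution(data: Matrix, kernel: Matrix, pad: int) -> Matrix:
    """Zero-pad the input, then convolve the padded data."""
    return convolve2d(zero_pad(data, pad), kernel)


feature_map = [[1.0, 2.0], [3.0, 4.0]]
ones_kernel = [[1.0] * 3 for _ in range(3)]
result = zero_convolution(feature_map, ones_kernel, pad=1)
```

With a 3 x 3 kernel and a padding of 1, the output keeps the 2 x 2 size of the input, which is one reason padding is applied before the convolution.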

[0117] It should be understood that the lightweight U-Net denoising network can be pre-trained and fixed. That is, the parameters in the lightweight U-Net denoising network do not change during training. The encoding blocks in the control network can be obtained by copying the encoding blocks of the lightweight U-Net denoising network; that is, the initial parameters of the encoding blocks in the control network are the same as the parameters of the encoding blocks in the lightweight U-Net denoising network.

[0118] After introducing the architecture of the initial image generation model, the training dataset and the methods for obtaining the training dataset will be introduced below.

[0119] As one possible implementation, the training device can train an initial image generation model using a training dataset, and after the training termination condition is met, the training device obtains the image generation model.

[0120] In this embodiment, the data acquisition device can collect and preprocess data during the user's or developer's use of the game application to obtain a training dataset. Preprocessing may include converting the data format, such as converting user interaction information into vectors. The form of the data acquisition device can be similar to the training device described above, and will not be elaborated further.

[0121] In this embodiment of the application, the training dataset may include at least one sample (also referred to as training data). Each sample may include a first vector corresponding to the user interaction information sample, a second vector corresponding to the scene motion information sample, the current rendering frame (also referred to as the first real frame sample), a third vector corresponding to the historical rendering frame (also referred to as the second real frame sample), and a fourth vector corresponding to the comparison frame (also referred to as the third real frame sample).

[0122] Understandably, the first vector, the second vector, the current rendering frame, and the third vector can be used as features of the training data, while the fourth vector can be used as the label of the training data. The features of the training data are used as input to the initial image generation model during training. The labels of the training data are used to calculate the loss function during training.

[0123] In a possible implementation, user interaction information samples, scene motion information samples, current rendering frames, and historical rendering frames can be used as features of the training data, and comparison frames can be used as labels for the training data.

[0124] As one possible implementation, user interaction information can be information related to user operations input by the user into the electronic device via a user input device. Scene motion information can reflect the movement of game objects included in the game interface. The data acquisition device can use the effective game interface at a certain moment during the user's or developer's use of the game application as the current rendered frame. Similarly, the data acquisition device can use a previous real frame as a historical rendered frame, or it can use multiple consecutive real frames before the current rendered frame as historical rendered frames. The data acquisition device can use the next real frame after the current rendered frame as a comparison frame. Here, the effective game interface can be understood as the part of the game interface displayed on the electronic device (e.g., a mobile phone) that does not include user interface (UI) elements. Game objects can respond to user operations triggered by the user on the game interface by changing one or more of their position, shape, and color. The first vector, second vector, third vector, and fourth vector can all be embedding vectors.

[0125] For example, Figure 5 is a schematic diagram of an interface provided in an embodiment of this application.

[0126] As shown in Figure 5, the game interface 500 includes game objects 501, 502, and 503, a direction button 504, and an acceleration button 505. Both the direction button 504 and the acceleration button 505 are UI elements. The direction button 504 also includes a first movement button 5041, a second movement button 5042, a third movement button 5043, and a fourth movement button 5044. Users can use the direction button 504 to move game object 501, thus changing the game interface 500.

[0127] For example, after the phone displays the game interface 500, the user can trigger the movement of the game object 501 using the fourth movement button 5044, so that the phone displays the game interface 520 shown in Figure 5. As shown in Figure 5, compared with the game interface 500, the game object 521 has moved in the game interface 520.

[0128] Correspondingly, the effective game interface corresponding to the game interface 500 can be the effective game interface 510 shown in Figure 5. The effective game interface 510 includes only game objects 511, 512, and 513, and does not include UI elements. Moreover, the position of each game object in the effective game interface 510 is the same as the position of each game object in the game interface 500.

[0129] Correspondingly, the effective game interface corresponding to the game interface 520 can be the effective game interface 530 shown in Figure 5. The effective game interface 530 includes only game objects 531, 532, and 533, and does not include UI elements. Moreover, the position of each game object in the effective game interface 530 is the same as the position of each game object in the game interface 520.

[0130] As one possible implementation, the data acquisition device can use information related to user actions triggered by the user on the game interface as user interaction information. Specifically, the user actions triggered on the game interface occur within the time period between the moment of the previous real frame and the moment of the current rendered frame. The data acquisition device can use the velocity vectors of each game object in the current rendered frame as scene motion information.

[0131] For example, Figure 6 is a timing diagram provided in an embodiment of this application.

[0132] As shown in Figure 6, image frame 601 at time t1, image frame 602 at time t2, image frame 603 at time t3, and image frame 604 at time t4 are all real frames. When the electronic device needs to predict the next image frame corresponding to image frame 603 at time t3 (i.e., the predicted frame at time t4), the current rendered frame is image frame 603 at time t3. The historical rendered frame is image frame 602 at time t2, or the historical rendered frames can be image frame 601 at time t1 and image frame 602 at time t2. User interaction information refers to user actions triggered on the game interface during the time period from time t2 to time t3. Scene motion information can be the velocity vectors of each game object in the current rendered frame, i.e., image frame 603 at time t3. The comparison frame is image frame 604 at time t4.
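
The way one training sample is assembled from the timeline in Figure 6 can be sketched as follows. Frames and events are represented by plain labels, the `Sample` type and `build_sample` function are hypothetical, and treating the interaction window as open at the previous frame's time and closed at the current frame's time is an assumption; the text only says "from time t2 to time t3".

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sample:
    current_frame: str          # current rendered frame (feature)
    history_frames: List[str]   # historical rendered frame(s) (feature)
    interactions: List[str]     # user interaction information (feature)
    comparison_frame: str       # next real frame (label)


def build_sample(
    frames: List[Tuple[float, str]],
    events: List[Tuple[float, str]],
    current_index: int,
    history_length: int = 1,
) -> Sample:
    """Assemble one sample around the real frame at `current_index`."""
    if not 1 <= current_index < len(frames) - 1:
        raise ValueError("current frame needs a previous and a next real frame")
    if not 1 <= history_length <= current_index:
        raise ValueError("not enough earlier frames for the requested history")
    current_time, current = frames[current_index]
    previous_time, _ = frames[current_index - 1]
    history = [label for _, label in frames[current_index - history_length:current_index]]
    # Assumed window: after the previous real frame, up to and including the current one.
    interactions = [name for time, name in events if previous_time < time <= current_time]
    _, comparison = frames[current_index + 1]
    return Sample(current, history, interactions, comparison)


timeline = [(1.0, "601"), (2.0, "602"), (3.0, "603"), (4.0, "604")]
user_events = [(1.5, "click"), (2.5, "swipe"), (3.5, "click")]
sample = build_sample(timeline, user_events, current_index=2, history_length=2)
```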

[0133] As one possible implementation, user operations can include operations such as swiping and clicking. It is understood that a clicking operation can be any of the following: clicking a single location, double-clicking the same location, clicking at least two locations simultaneously, or double-clicking at least two locations simultaneously. A swiping operation can be any of the following: a single swipe or multiple swipes performed simultaneously. Here, "location" can refer to any position on the game interface displayed on the screen of the electronic device.

[0134] In possible implementations, different types of user actions can correspond to different events. For example, a swipe action can correspond to a swipe start event, a swipe process event, and a swipe end event, while a click action can correspond to a click event. For a swipe action, the data acquisition device can record the position of the starting point of the swipe action in the game interface and the time of the swipe start event, the position of each swipe process point in the game interface and the time of the corresponding swipe process event, and the position of the end point of the swipe action in the game interface and the time of the swipe end event. For a click action, the data acquisition device can record the position of the click point in the game interface and the time of the click event.

[0135] In some embodiments provided in this application, after the data acquisition device collects user interaction information, the user interaction information can be stored in the form of images, coordinates, matrices, etc.

[0136] In some implementations, the data acquisition device collects user interaction information, which can be stored in the form of images.

[0137] For example, in the form of image-based user interaction information, the data acquisition device can record different events using different color channels; and in the form of image-based user interaction information, the data acquisition device can record the temporal sequence of events using the grayscale values of the color channels. For instance, the data acquisition device can record a click event using the red (R) channel; a swipe start event using the R channel; a swipe process event using the green (G) channel; and a swipe end event using the blue (B) channel. A larger grayscale value indicates a later event occurrence.

[0138] For example, see Figure 7, which is a schematic diagram of an interactive-trajectory multi-channel image representation provided in an embodiment of this application and illustrates how user interaction information is recorded in the form of images. In Figure 7, triangles represent the red channel, circles represent the green channel, and rectangles represent the blue channel. The size of the triangle, circle, or rectangle represents the grayscale value; the smaller the triangle, circle, or rectangle, the larger the grayscale value, indicating that the event occurred later.

[0139] It is understood that, in the example corresponding to Figure 7 above, the R channel represents a click event, the R channel represents a swipe start event, the G channel represents a swipe process event, and the B channel represents a swipe end event. This application does not limit the color channels used to represent events, as long as the electronic device can distinguish between different events. For example, the R channel can be used to represent a swipe process event, and the B channel can represent a swipe start event; or the B channel can be used to represent a swipe process event, and the G channel can represent a swipe end event.
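
The image-based recording of user interaction information can be sketched as follows. The channel assignment follows the example in the text (click and swipe start in R, swipe process in G, swipe end in B). The mapping of event time to a grayscale value in the range 1 to 255, so that even the earliest event differs from an empty pixel, is an assumption, as are the function and type names.

```python
from typing import Dict, List, Tuple

# Channel indices: 0 = R, 1 = G, 2 = B (assignment taken from the example in the text).
CHANNEL: Dict[str, int] = {
    "click": 0,
    "swipe_start": 0,
    "swipe_process": 1,
    "swipe_end": 2,
}

# (event kind, x, y, time of occurrence)
Event = Tuple[str, int, int, float]


def encode_events(events: List[Event], width: int, height: int) -> List[List[List[int]]]:
    """Encode events as a height x width x 3 image; later events get larger grayscale values."""
    image = [[[0, 0, 0] for _ in range(width)] for _ in range(height)]
    if not events:
        return image
    times = [event[3] for event in events]
    earliest, latest = min(times), max(times)
    span = latest - earliest
    for kind, x, y, time in events:
        # Assumed mapping: earliest event -> 1, latest event -> 255.
        value = 255 if span == 0 else 1 + round(254 * (time - earliest) / span)
        # Events that share a pixel and a channel overwrite each other in this sketch.
        image[y][x][CHANNEL[kind]] = value
    return image


swipe = [
    ("swipe_start", 0, 0, 0.0),
    ("swipe_process", 1, 0, 0.5),
    ("swipe_end", 2, 0, 1.0),
]
interaction_image = encode_events(swipe, width=3, height=1)
```

Any other channel assignment works equally well in this sketch as long as the `CHANNEL` table keeps the events distinguishable, which matches the flexibility noted above.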

[0140] In this embodiment of the application, after the data acquisition device acquires scene motion information, the scene motion information can also be stored in the form of images, coordinates, matrices, etc.

[0141] After introducing the training dataset, the process of training the initial image generation model using the training dataset will be described below.

[0142] In one possible implementation, the training device can train the initial image generation model by simultaneously training the latent variable autoencoder and the control network as a whole, or by jointly training the latent variable autoencoder and the control network.

[0143] For example, Figure 8 is a flowchart of the training process of an image generation model provided in an embodiment of this application. As shown in Figure 8, the training process of the image generation model provided in this application embodiment may include steps S801-S804.

[0144] S801. The training device acquires the training dataset, which includes the features and labels of the training data.

[0145] The features of the training data may include a first vector, a second vector, the current rendering frame, and a third vector. The labels of the training data may include a fourth vector.

[0146] It is understandable that the training device can train the initial image generation model using all the training data included in the training dataset, or it can train the initial image generation model using only a portion of the training data included in the training dataset.

[0147] S802. The training device inputs the features of the training data into the initial image generation model to obtain the initial training results.

[0148] While the training device trains the initial image generation model using the training data, data flows through the initial image generation model as shown in Figure 9. Figure 9 is a schematic diagram of the training architecture of an image generation model provided in an embodiment of this application.

[0149] As shown in Figure 9, the current rendered frame, after passing through the latent variable autoencoder and the scheduler, becomes noise data. This noise data is input to the input layer of the lightweight U-Net denoising network. The input layer sends the noise data to the encoder, where it is processed by each coding block and then sent to the corresponding decoding blocks in the decoder. At the same time, the time step, after being processed by the time coding unit, is sent to both the coding blocks and the decoding blocks of the lightweight U-Net denoising network. The input layer of the lightweight U-Net denoising network also sends the noise data to the control network.

[0150] Referring again to Figure 9, the first, second, and third vectors are processed sequentially by the attention mechanism module and a zero convolution to obtain a processing result. The control network fuses the processing result with the noise data sent by the input layer of the lightweight U-Net denoising network. After the fusion result is processed by each coding block in the control network, it is sent to each decoding block in the decoder of the lightweight U-Net denoising network through the respective zero convolutions in the control network.

[0151] Subsequently, the decoder of the lightweight U-Net denoising network obtains the initial training result based on the received data, and the initial training result is output from the output layer of the lightweight U-Net denoising network. The received data includes the outputs of each coding block in the lightweight U-Net denoising network, the output of the time coding unit, and the outputs of each zero convolution in the control network.

[0152] As one possible implementation, Figure 10 is a second flowchart of the training process of an image generation model provided in an embodiment of this application. As shown in Figure 10, step S802 may include steps S1001-S1009.

[0153] S1001. The latent variable autoencoder in the initial image generation model receives the current rendered frame from the features of the training data and outputs the latent representation of the current rendered frame (e.g., an embedding vector). Then, the latent variable autoencoder sends the latent representation of the current rendered frame to the scheduler in the initial image generation model.

[0154] S1002. The scheduler in the initial image generation model receives the latent representation output by the latent variable autoencoder and adds random noise to the latent representation output by the latent variable autoencoder to obtain the noise data of the current rendering frame. Then, the noise data of the current rendering frame is sent to the lightweight U-Net denoising network in the initial image generation model.

[0155] S1003. The lightweight U-Net denoising network in the initial image generation model receives the noise data of the current rendered frame and inputs the noise data of the current rendered frame into the encoder of the lightweight U-Net denoising network. In addition, the lightweight U-Net denoising network also sends the noise data of the current rendered frame to the control network in the initial image generation model.

[0156] In addition, the encoder of the lightweight U-Net denoising network also receives the time step vector output by the time coding unit in the lightweight U-Net denoising network. The time step vector is obtained by the time coding unit encoding the input time step.

[0157] The time step is a discrete variable in the diffusion model. When the diffusion model generates noisy data, the time step is used to add noise; when the diffusion model denoises the noisy data, the time step is used to remove noise. In other words, the time step guides the diffusion model on how to add noise when generating noisy data, and also guides the diffusion model on how to remove noise from noisy data.

[0158] For example, in the process of the scheduler adding noise to the latent representation output by the latent variable autoencoder, the time step can start from an initial value (e.g., 0) and gradually increase to a certain terminal value (e.g., 1000), with each value corresponding to a noise level. The noise level determines the degree to which the data is noisy at that time step. In the process of removing noise from noisy data using the lightweight U-Net denoising network, the time step can also start from an initial value (e.g., 1000) and gradually decrease to a certain terminal value (e.g., 0).
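For illustration only, the relationship between the time step and the noise level described above can be sketched as follows. This is a minimal sketch, not the scheduler of this application: the linear beta schedule, the function names `make_alpha_bar` and `add_noise`, and the latent shape are all assumptions chosen for the example.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Signal-retention factor per time step (assumed linear beta schedule)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(latent, t, alpha_bar, rng):
    """Noise a latent at time step t (1-based); t = 0 leaves the latent unchanged."""
    if t == 0:
        return latent.copy(), np.zeros_like(latent)
    eps = rng.standard_normal(latent.shape)  # eps ~ N(0, I)
    a = alpha_bar[t - 1]
    noisy = np.sqrt(a) * latent + np.sqrt(1.0 - a) * eps
    return noisy, eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
# Stand-in for the latent representation output by the latent variable autoencoder.
latent = rng.standard_normal((4, 8, 8))
for t in (0, 250, 1000):
    noisy, _ = add_noise(latent, t, alpha_bar, rng)
    # A larger time step retains less of the original latent.
    print(t, float(np.mean((noisy - latent) ** 2)))
```

In this sketch each time step maps to one noise level through `alpha_bar`, so stepping from 0 toward 1000 adds progressively more noise, and a denoising network would traverse the same values in the opposite direction.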

[0159] For example, during the initial training process of the lightweight U-Net denoising network, the network can predict the noise to be removed from the noisy data at the current time step based on the noise level and other inputs (such as the output of the control network). It then removes the predicted noise, allowing the network to gradually generate image data. As the time steps decrease, the noise is gradually removed.

[0160] S1004. The lightweight U-Net denoising network sends the output of its encoder to the decoder of the lightweight U-Net denoising network.

[0161] S1005. The control network in the initial image generation model receives the noise data of the current rendering frame.

[0162] S1006. The control network in the initial image generation model also receives the first vector, the second vector, and the third vector through the attention mechanism module.

[0163] The attention mechanism module can be used to assign initial weights to the received first, second, and third vectors. These initial weights can be learned during the training process of the initial image generation model until the initial image generation model reaches the training objective (e.g., the loss function converges), at which point the final updated weights are obtained.

[0164] It should be understood that the execution order of the above steps S1004, S1005, and S1006 can be adjusted according to actual usage requirements, and they can also be executed in parallel. This application embodiment does not limit this.

[0165] S1007. The control network performs feature fusion on the first vector, the second vector, the third vector and the noise data of the current rendering frame, and uses the fusion result as the input of the encoder in the control network to obtain the output of the encoder in the control network.

[0166] S1008. The output of each coding block in the encoder of the control network is subjected to zero convolution to obtain its own convolution result, and each convolution result is sent to the decoder of the lightweight U-Net denoising network.

[0167] S1009. The decoder of the lightweight U-Net denoising network obtains the initial training results based on the output of the encoder of the lightweight U-Net denoising network, the convolution result of the control network, and the time step vector output by the time coding unit in the lightweight U-Net denoising network.
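For illustration only, the role of the zero convolution in steps S1008 and S1009 can be sketched as follows. The sketch assumes that a zero convolution is a 1x1 convolution whose weights and bias are initialized to zero, and that the decoder combines the two branches by addition; neither detail is specified in this application, and the helper `conv1x1` is hypothetical.

```python
import numpy as np

def conv1x1(x, weight, bias):
    """1x1 convolution on a (C_in, H, W) feature map; weight has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]

rng = np.random.default_rng(0)
channels, height, width = 4, 8, 8
skip_features = rng.standard_normal((channels, height, width))     # from a U-Net coding block
control_features = rng.standard_normal((channels, height, width))  # from a control-network coding block

# Zero convolution at the start of training (assumption: all-zero weights and bias).
zero_w = np.zeros((channels, channels))
zero_b = np.zeros(channels)

# Assumed fusion by addition: the control branch contributes nothing at first.
decoder_input = skip_features + conv1x1(control_features, zero_w, zero_b)
print(np.allclose(decoder_input, skip_features))

# Once training updates the weights, the control branch starts to influence the decoder.
trained_w = 0.1 * np.eye(channels)
decoder_input_later = skip_features + conv1x1(control_features, trained_w, zero_b)
print(np.allclose(decoder_input_later, skip_features))
```

Under these assumptions, the pre-trained denoising network behaves exactly as before when training of the control network begins, and the conditioning signal is introduced gradually as the zero convolution weights move away from zero.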

[0168] It is understood that the lightweight U-Net denoising network can be pre-trained, meaning its parameters do not change. During the training process described above, the parameters in the latent variable autoencoder, the weights in the attention mechanism module of the control network, and the parameters in the encoder of the control network can change as the training progresses until the initial image generation model converges. The resulting parameters in the latent variable autoencoder, the weights in the attention mechanism module, and the parameters in the encoder of the control network can then be used as the parameters of the image generation model in this embodiment.

[0169] S803. The training device calculates the value of the loss function (also known as the first loss function) based on the initial training results and the fourth vector.

[0170] In one possible implementation, the value of the loss function can be the reconstruction loss. The reconstruction loss can be the difference between the model-generated frame and the future real frame in pixel space; this difference can be, for example, the L1 distance (also known as the Manhattan distance). Here, the model-generated frame can be understood as the initial training result described above. For example, the model-generated frame can be the predicted frame corresponding to time t4 in Figure 6. The future real frame can be understood as the comparison frame mentioned above, or it can be the label of the training data. For example, the future real frame can be image frame 604 in Figure 6.

[0171] For example, the loss function can be expressed as:

[0172]

[0173] Where L1 represents the value of the loss function; f_θ′ is the encoder of the lightweight U-Net denoising network; E is the encoder of the latent variable autoencoder; D is the decoder of the lightweight U-Net denoising network; ε represents noise, for example ε ~ N(0, I), indicating that ε is random noise following a normal distribution with a mean of 0 and an identity covariance matrix (ε can also be cosine noise); o_t is the current rendering frame; o_{<t} is the third vector corresponding to the historical rendering frames; t is the time step; a_{<t} is the first vector corresponding to the user interaction information; v_t is the second vector corresponding to the scene motion information; and o_{t+1} is the fourth vector corresponding to the future real frame (i.e., the comparison frame). "‖·‖" denotes the sum of the absolute values of the elements of a vector, and can also be called the L1 norm or Manhattan norm.

[0174] In another possible implementation, the loss function can be a weighted sum of the reconstruction loss and the VGG feature matching loss. The VGG feature matching loss can be the difference between the feature map extracted from the target layer in the model-generated frame and the feature map extracted from the target layer in the future ground truth frame. Both the feature maps extracted from the target layer in the model-generated frame and the feature maps extracted from the target layer in the future ground truth frame can be extracted using a pre-trained VGG network (such as VGG16 or VGG19). The target layer can include shallow and deep layers, allowing the VGG network to capture semantic and textural information at both shallow and deep levels of the image frame.

[0175] For example, the training device can input model-generated frames into a pre-trained VGG network, which can extract feature maps from the target layer of the model-generated frames; and the training device can input future real frames into a pre-trained VGG network, which can extract feature maps from the target layer of the future real frames.

[0176] Next, the training device can calculate the difference between the feature map extracted from the target layer in the model-generated frame and the feature map extracted from the target layer in the future real frame to obtain the VGG feature matching loss.

[0177] For example, taking the target layer as the l-th layer, the VGG feature matching loss of the l-th layer can be the difference between the feature map extracted by the pre-trained VGG network at layer l from the model-generated frame and the feature map extracted by the pre-trained VGG network at layer l from the future real frame. For example, the VGG feature matching loss can be expressed as:

[0178]

[0179] Where Φ_l applied to the model-generated frame is the feature map extracted by the pre-trained VGG network at layer l of the model-generated frame, and Φ_l(o_{t+1}) is the feature map extracted by the pre-trained VGG network at layer l of the future real frame.

[0180] The value of the loss function can be a weighted sum of the reconstruction loss and the VGG feature matching loss. For example, the loss function can be expressed as:

[0181]

[0182] Where L represents the value of the loss function, and L1 is the Manhattan distance between the model-generated frame and the future real frame in pixel space.
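For illustration only, one set of expressions consistent with the verbal descriptions of the reconstruction loss, the VGG feature matching loss, and their weighted sum is given below. These are not the formulas of this application. The symbol \hat{o}_{t+1} for the model-generated frame, the use of the L1 norm for the feature difference, and the weights \lambda and w_l are assumptions introduced for this sketch.

```latex
\begin{aligned}
L_1 &= \lVert \hat{o}_{t+1} - o_{t+1} \rVert_1 \\
L_{\mathrm{VGG}}^{(l)} &= \lVert \Phi_l(\hat{o}_{t+1}) - \Phi_l(o_{t+1}) \rVert_1 \\
L &= L_1 + \lambda \sum_{l} w_l \, L_{\mathrm{VGG}}^{(l)}
\end{aligned}
```

When the target layer is a single layer, the sum over l reduces to one term.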

[0183] In one possible implementation, when the target layer includes multiple layers, the training device can calculate the difference between the feature map of the model-generated frame and the feature map of the future real frame in each layer, and calculate a weighted sum of the differences of each layer. The result of this weighted sum can be understood as the VGG feature matching loss.

[0184] In this embodiment, the calculation of VGG feature matching loss takes into account the semantic and texture information of image frames at different levels, thus making the details and textures of the predicted frames generated by the image generation model trained with VGG feature matching loss more realistic and clear.
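For illustration only, the weighted combination of the reconstruction loss and a multi-layer feature matching loss can be sketched as follows. To keep the sketch self-contained, `toy_features` is a hypothetical stand-in for the pre-trained VGG network (each "layer" is simply a 2x2 average pooling of the previous one), and the layer weights and the weight `vgg_weight` are assumed values.

```python
import numpy as np

def l1_distance(a, b):
    """Sum of absolute element-wise differences (Manhattan distance)."""
    return float(np.abs(a - b).sum())

def toy_features(frame, num_layers=3):
    """Hypothetical stand-in for VGG feature maps: repeated 2x2 average pooling."""
    feats, x = [], frame
    for _ in range(num_layers):
        h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
        x = x[:h, :w].reshape(h // 2, 2, w // 2, 2, -1).mean(axis=(1, 3))
        feats.append(x)
    return feats

def total_loss(generated, target, layer_weights=(1.0, 1.0, 1.0), vgg_weight=0.1):
    """Reconstruction loss plus a weighted sum of per-layer feature differences."""
    recon = l1_distance(generated, target)
    feature_loss = sum(
        w * l1_distance(fg, ft)
        for w, fg, ft in zip(layer_weights, toy_features(generated), toy_features(target))
    )
    return recon + vgg_weight * feature_loss

rng = np.random.default_rng(0)
target = rng.random((16, 16, 3))  # stand-in for the future real frame
generated = target + 0.05 * rng.standard_normal(target.shape)  # stand-in for the model-generated frame
print(total_loss(generated, target), total_loss(target, target))
```

In practice the feature maps would come from the target layers of a pre-trained VGG16 or VGG19 network rather than from pooling.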

[0185] S804. The training device iterates the parameters of the latent variable autoencoder and the control network in the initial image generation model based on the value of the loss function until the loss function converges, thus obtaining the image generation model.

[0186] The loss function can be considered to have converged when its value is less than or equal to a first preset threshold. The first preset threshold can be set according to the actual scenario, and this application embodiment does not specifically limit the first preset threshold.

[0187] In other possible implementations, step S804 can be replaced by: the training device iterating over the parameters of the latent variable autoencoder and the control network in the initial image generation model based on the value of the loss function, and obtaining the image generation model when the training time reaches a preset time if the loss function does not converge.

[0188] The preset duration can be set according to the actual scenario, but this application embodiment does not specifically limit the preset duration.

[0189] In another possible implementation, step S804 can be replaced by: the training device iterating over the parameters of the latent variable autoencoder and the control network in the initial image generation model based on the value of the loss function, and obtaining the image generation model when the number of iterations reaches a preset number if the loss function does not converge.

[0190] The preset number of times can be set according to the actual scenario. This application embodiment does not specifically limit the preset number of times.
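For illustration only, the three ways of ending training described for step S804 (loss threshold, preset duration, preset number of iterations) can be sketched as one loop. The function `train`, its parameter names, and the toy `make_step` helper are hypothetical; a real training step would update model parameters and return the current loss value.

```python
import time

def train(step_fn, loss_threshold=None, max_seconds=None, max_iterations=None):
    """Call step_fn() until a stopping condition holds.

    Returns (reason, iterations, last_loss).
    """
    if loss_threshold is None and max_seconds is None and max_iterations is None:
        raise ValueError("at least one stopping condition is required")
    start = time.monotonic()
    iterations = 0
    while True:
        loss = step_fn()
        iterations += 1
        if loss_threshold is not None and loss <= loss_threshold:
            return "converged", iterations, loss
        if max_iterations is not None and iterations >= max_iterations:
            return "max_iterations", iterations, loss
        if max_seconds is not None and time.monotonic() - start >= max_seconds:
            return "max_seconds", iterations, loss

def make_step(initial=1.0, decay=0.5):
    """Toy training step whose loss halves on every call."""
    state = {"loss": initial}
    def step():
        state["loss"] *= decay
        return state["loss"]
    return step

print(train(make_step(), loss_threshold=0.1))
print(train(make_step(), loss_threshold=1e-9, max_iterations=5))
```

The first call stops because the loss falls to or below the threshold; the second stops at the preset number of iterations without the loss having converged.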

[0191] In this embodiment, the latent variable autoencoder and the control network are trained together in the same training loop, so that the latent representation generated by the latent variable autoencoder better suits the processing needs of the control network. The current rendered frame, the third vector corresponding to the historical rendered frames, the time step, the first vector corresponding to user interaction information, and the second vector corresponding to scene motion information are input together throughout the training process, ensuring that the model can fully utilize this information when generating future frames. At the same time, Gaussian noise of varying intensities is applied during training, and the noise level is used as an additional conditional input to improve the model's robustness and its consistency between training and inference. Because the latent variable autoencoder and the control network are optimized under the same objective, overall model performance is generally better than with separate training, yielding higher-quality and clearer future frames. In addition, this method keeps the latent representation output by the latent variable autoencoder closely matched to the needs of the control network, reducing the accumulation of errors in the latent representation. Furthermore, joint training helps the model learn complex conditional relationships, improving its generalization ability in diverse scenarios.

[0192] In another possible implementation, the training device can train the initial image generation model by training the latent variable autoencoder and the control network separately.

[0193] For example, referring to Figure 11, Figure 11 is a schematic flowchart of training a latent variable autoencoder provided in an embodiment of this application. As shown in Figure 11, the process of training a latent variable autoencoder may include steps S1101-S1103.

[0194] S1101. The training device inputs the image frames included in the training subset into the latent variable autoencoder to obtain the initial training results of the latent variable autoencoder.

[0195] The training subset can be a set of at least one currently rendered frame from the aforementioned training dataset, or it can be a set of game frames re-captured by a data acquisition device. The implementation method for re-capturing game frames by a data acquisition device can be found in the section on data acquisition of training datasets, and will not be elaborated upon here.

[0196] In a possible implementation, the training device inputs image frames from the training subset into a latent variable autoencoder. The latent variable autoencoder can encode the image frames into a latent representation in the latent space, and then generate a reconstructed frame of the image frame through the latent representation. This reconstructed frame is the initial training result of the latent variable autoencoder.

[0197] For example, Figure 12 is a schematic diagram of an architecture for training a latent variable autoencoder provided in an embodiment of this application.

[0198] As shown in Figure 12, a latent variable autoencoder can include an encoder and a decoder. The training device inputs image frame O_t into the latent variable autoencoder. The encoder in the latent variable autoencoder encodes image frame O_t into a latent representation in the latent space, and the decoder in the latent variable autoencoder generates a reconstructed frame of image frame O_t from this latent representation. The reconstructed frame is the initial training result of the latent variable autoencoder.

[0199] S1102. The training device calculates the value of the loss function based on the initial training results of the latent variable autoencoder and the image frames.

[0200] The loss function is used to measure the difference between the input image frame and the corresponding initial training result. This application does not limit the method for calculating the difference between the input image frame and the corresponding initial training result. For example, the difference between the input image frame and the corresponding reconstructed frame can be represented by mean squared error or VGG feature matching loss, etc.

[0201] S1103. The training device iterates the parameters of the latent variable autoencoder based on the value of the loss function until the loss function converges, thus obtaining the trained latent variable autoencoder.

[0202] The convergence of the loss function can be achieved when the value of the loss function is less than or equal to a second preset threshold. The second preset threshold can be set according to the actual scenario, and this application embodiment does not specifically limit the second preset threshold.
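For illustration only, steps S1101-S1103 can be sketched with a toy linear autoencoder trained by gradient descent on a mean squared error loss. The linear encoder and decoder, the synthetic "image frames", the learning rate, and the threshold are all assumptions made so that the sketch stays small; they are not the latent variable autoencoder of this application.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, frame_dim, latent_dim = 64, 16, 4

# Toy "image frames": flattened vectors lying in a 4-dimensional subspace.
basis = rng.standard_normal((latent_dim, frame_dim))
frames = rng.standard_normal((n_frames, latent_dim)) @ basis / np.sqrt(frame_dim)

enc = 0.1 * rng.standard_normal((frame_dim, latent_dim))  # encoder weights
dec = 0.1 * rng.standard_normal((latent_dim, frame_dim))  # decoder weights

def mse(frames, enc, dec):
    """Mean squared error between the input frames and their reconstructions."""
    recon = frames @ enc @ dec
    return float(np.mean((recon - frames) ** 2))

threshold, max_iterations, lr = 1e-3, 2000, 0.2  # assumed second preset threshold etc.
initial_loss = mse(frames, enc, dec)
for iteration in range(max_iterations):
    latent = frames @ enc                          # S1101: encode to the latent space
    recon = latent @ dec                           # S1101: reconstructed frames
    err = (recon - frames) * (2.0 / frames.size)   # S1102: gradient of the MSE w.r.t. recon
    grad_dec = latent.T @ err
    grad_enc = frames.T @ (err @ dec.T)
    enc -= lr * grad_enc                           # S1103: iterate the parameters
    dec -= lr * grad_dec
    if mse(frames, enc, dec) <= threshold:         # stop once the loss reaches the threshold
        break
final_loss = mse(frames, enc, dec)
print(initial_loss, final_loss)
```

The loop ends either when the loss reaches the threshold or when the preset number of iterations is exhausted, mirroring the alternatives described for ending the training.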

[0203] In other possible implementations, training of the latent variable autoencoder can be considered complete when the training time reaches a preset duration without the loss function converging, or when the number of iterations reaches a preset number without the loss function converging. The preset duration and the preset number of iterations can be set according to the actual scenario, and this application does not specifically limit them.

[0204] It should be understood that after the latent variable autoencoder is trained, the parameters of the trained latent variable autoencoder do not change during the training process of the control network described below.

[0205] After introducing the process of training a latent variable autoencoder, the process of training a control network will be introduced below.

[0206] The process by which the training device trains the control network is similar to steps S801-S804 of the joint training described above; see the preceding text for details. The difference from joint training is that, when training the control network, the training device uses the encoder of the separately trained latent variable autoencoder, and the parameters of this separately trained latent variable autoencoder do not change during the training of the control network. In other words, when the training device trains the control network, step S804 can be replaced by: the training device iterates the parameters of the control network in the initial image generation model based on the value of the loss function until the loss function converges, thus obtaining the image generation model.

[0207] In this embodiment, the latent variable autoencoder and control network are trained separately. This modular design allows the latent variable autoencoder and control network to be optimized independently, facilitating debugging and maintenance. Secondly, different modules can be flexibly replaced or upgraded as needed without affecting other parts of the overall model. Furthermore, separate training is more efficient in resource management, especially when training resources are limited, allowing for better allocation of computational resources.

[0208] It is understood that the image generation model trained by the training device in this embodiment can be a dedicated model for the game application X. That is, only the game application X can use this image generation model to generate predicted frames, or only the game application X can generate predicted frames of relatively high quality using this image generation model. The training dataset used by the training device in training the image generation model can be a training dataset obtained by a data acquisition device collecting data during the user's use of the game application X and then preprocessing it.

[0209] In this embodiment, the image generation model trained by the training device can also be a general model for multiple game applications. That is, all of these game applications can use the image generation model to generate predicted frames. The training dataset used by the training device in training the image generation model can be a training dataset obtained by a data acquisition device collecting data during the use of these multiple game applications by users or developers, and then preprocessing the collected data.

[0210] After the training device trains the initial image generation model to obtain the image generation model, the image generation model can be deployed on a mobile phone. During user interaction, the phone can input the vector corresponding to user interaction information, the vector corresponding to scene motion information, the first real frame, the time step, and the vector corresponding to the second real frame into the image generation model to obtain the predicted frame output by the model. The phone then displays this predicted frame after the first real frame. The image generation model can be deployed within the phone's frame interpolation service. In other words, the phone's frame interpolation service can obtain the predicted frame from the image generation model. The image generation method provided by embodiments of this application is described below in conjunction with the accompanying drawings.

[0211] For example, referring to Figure 13, Figure 13 is a schematic flowchart of an image generation method provided in an embodiment of this application. As shown in Figure 13, the method may include steps S1301 to S1304.

[0212] In this embodiment of the application, after the game application is launched, the mobile phone can perform step S1301, in which the frame interpolation service obtains user interaction information.

[0213] In one possible implementation, the frame interpolation service can be started in response to a user's triggering action on the frame interpolation button.

[0214] For example, Figure 14 is a schematic diagram of interface interaction provided in an embodiment of this application.

[0215] As shown in Figure 14, after startup the phone can display the home screen (desktop) 1400 shown in Figure 14. The home screen 1400 may include a game application icon 1401. The home screen 1400 may also include icons for a smart living application, a settings application, a recorder application, a browser application, a camera application, a contacts application, a phone application, a messaging application, a time application, and a weather application.

[0216] On the home screen 1400, in response to the user's trigger operation on the game application icon 1401, the phone can display the game main interface 1410 shown in Figure 14. As shown in Figure 14, the game main interface 1410 may include a first control 1411, an "Enter Game" control, and so on.

[0217] On the game main interface 1410, in response to the user's trigger operation on the first control 1411, the phone can display the game main interface 1420 shown in Figure 14. As shown in Figure 14, the game main interface 1420 may include a second control 1421, a frame interpolation switch control 1422, and an "Enter Game" control. The frame interpolation switch control 1422 is in the off state, meaning the frame interpolation service is not started.

[0218] On the game main interface 1420, in response to the user's trigger operation on the frame interpolation switch control 1422, the phone can display the game main interface 1430 shown in Figure 14. As shown in Figure 14, the game main interface 1430 may include a second control 1431, a frame interpolation switch control 1432, and an "Enter Game" control. The frame interpolation switch control 1432 is in the on state, meaning the frame interpolation service is running.

[0219] In another possible implementation, the frame interpolation service can start in response to the launch of the game application. In other words, the launch of the game application can automatically trigger the start of the frame interpolation service.

[0220] In one possible implementation, after the game application is launched and the phone displays the game interface, the frame interpolation service obtains user interaction information.

[0221] For the game interface, see the description corresponding to Figure 5 above.

[0222] As one possible implementation, the frame interpolation service can obtain user interaction information from the input manager by listening.

[0223] For example, after the game application is launched, when the phone receives a user action triggered by the user in the game interface, the input manager of the phone application framework layer can obtain the user interaction information corresponding to the user action and send the user interaction information to the game application. During the process of the input manager obtaining the user interaction information and sending it to the game application, the frame interpolation service can obtain the user interaction information from the input manager by listening.

[0224] As another possible implementation, the frame interpolation service can obtain user interaction information from the accessibility service by listening.

[0225] For example, after obtaining user interaction information, the input manager can send the touch event corresponding to that user interaction information to the accessibility service. The accessibility service converts the touch event into an event corresponding to its own accessibility service and sends this event to the game application. This event can also represent user interaction information. The frame interpolation service can obtain user interaction information from the accessibility service through listening.

[0226] User interaction information can be stored in the form of images; for details, see the description corresponding to Figure 7 above.

[0227] In the case of a user triggering multiple user operations simultaneously on the game interface (hereinafter referred to as a multi-touch scenario), one possible implementation is that the data acquisition device can store the user interaction information corresponding to the multiple user operations in the form of an image.

[0228] For example, in a multi-touch scenario, user interaction information in image form can be recorded by a data acquisition device using different color channels, allowing for the simultaneous recording of different user actions. In other words, different color channels correspond to different user actions, and these different user actions are triggered simultaneously. For instance, when there are two user actions, the data acquisition device can record the user interaction information corresponding to the first user action using the G channel; and the data acquisition device can record the user interaction information corresponding to the second user action using the R channel.

[0229] Furthermore, in multi-touch scenarios, user interaction information in image form can be recorded by the data acquisition device using the grayscale values of each color channel. For any given color channel, the data acquisition device records click events and swipe start events at the point of minimum grayscale value, swipe end events at the point of maximum grayscale value, and swipe process events at the point where the grayscale value is greater than the minimum but less than the maximum.

[0230] Furthermore, in multi-touch scenarios, user interaction information in image form can be recorded by the data acquisition device using the grayscale value of any color channel to determine the chronological order of events. A higher grayscale value indicates a later event.

[0231] For example, referring to Figure 15, Figure 15 is a second schematic diagram of a multi-channel image representation of interaction trajectories provided in an embodiment of this application, illustrating how user interaction information is recorded in image form in a multi-touch scenario. In Figure 15, circles represent the green channel and triangles represent the red channel. The size of a triangle or circle represents the grayscale value; the smaller the triangle or circle, the larger the grayscale value, indicating a later event. The minimum grayscale value represents a click event or a swipe start event, the maximum grayscale value represents a swipe end event, and a grayscale value greater than the minimum but less than the maximum represents a swipe process event.

[0232] It should be understood that, in the example corresponding to Figure 15 above, the user interaction information corresponding to the first user operation is represented by the G channel, and the user interaction information corresponding to the second user operation is represented by the R channel. This application does not limit which color channel represents the user interaction information corresponding to a given user operation, as long as the electronic device can distinguish the user interaction information corresponding to different user operations. For example, the user interaction information corresponding to the first user operation can also be represented by the B channel, and the user interaction information corresponding to the second user operation can be represented by the G channel.
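For illustration only, the multi-channel image representation described above can be sketched as follows. The function `encode_touches`, the grayscale bounds `MIN_GRAY` and `MAX_GRAY`, the image size, and the convention that 0 means "no event" are assumptions made for this sketch.

```python
import numpy as np

CHANNEL = {"R": 0, "G": 1, "B": 2}
MIN_GRAY, MAX_GRAY = 64, 255  # hypothetical bounds; 0 is reserved here for "no event"

def encode_touches(touches, height, width):
    """Encode simultaneous user actions into one RGB image.

    touches maps a channel name to the ordered (x, y) points of one user action.
    A single point is a click; several points form a swipe (start, process, end).
    """
    image = np.zeros((height, width, 3), dtype=np.uint8)
    for channel, points in touches.items():
        n = len(points)
        for i, (x, y) in enumerate(points):
            if n == 1:
                gray = MIN_GRAY  # click event: minimum grayscale value
            else:
                # Later events get larger grayscale values, up to MAX_GRAY at the swipe end.
                gray = MIN_GRAY + round(i * (MAX_GRAY - MIN_GRAY) / (n - 1))
            image[y, x, CHANNEL[channel]] = gray
    return image

image = encode_touches(
    {
        "G": [(2, 2), (3, 2), (4, 2), (5, 2)],  # first user action: a swipe
        "R": [(10, 7)],                         # second user action: a click
    },
    height=16,
    width=16,
)
print(image[2, 2:6, 1], image[7, 10, 0])
```

Because each user action is written to its own channel, the two trajectories remain distinguishable even if they cross the same pixel.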

[0233] For scenarios where a user simultaneously triggers multiple user actions on the game interface, another possible implementation is that the data acquisition device can store the user interaction information corresponding to these multiple user actions in vector and matrix formats. As one possible implementation, the data acquisition device can store the user interaction information corresponding to the multiple user actions as vectors within the same preset matrix. Different rows in the preset matrix store user interaction information corresponding to different user actions.

[0234] The number of rows and columns of the preset matrix can be set according to the actual scenario. In this embodiment, the number of rows and columns of the preset matrix is not limited.

[0235] For example, take a preset matrix with 3 rows. In a certain multi-touch scenario, a user simultaneously triggers two user actions on the game interface at a given moment: a first user action and a second user action. The data acquisition device can store the user interaction information corresponding to the first user action in the first row of the preset matrix, store the user interaction information corresponding to the second user action in the second row of the preset matrix, and fill the third row of the preset matrix with zeros. The preset matrix as a whole represents the user interaction information corresponding to this multi-touch scenario.
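The row-per-action storage described above can be sketched as follows. The per-action vector layout used in the example (x, y, event type, timestamp) and the function name `pack_actions` are assumptions for illustration only; this application does not specify the vector contents or the matrix dimensions.

```python
import numpy as np


def pack_actions(action_vectors, rows, cols):
    """Store one user-action vector per row of a preset rows x cols matrix.

    Rows without a corresponding user action stay filled with zeros.
    """
    if len(action_vectors) > rows:
        raise ValueError("more user actions than rows in the preset matrix")
    matrix = np.zeros((rows, cols), dtype=np.float32)
    for r, vec in enumerate(action_vectors):
        if len(vec) > cols:
            raise ValueError("action vector longer than the preset row")
        matrix[r, :len(vec)] = vec
    return matrix


# Two simultaneous user actions in a 3-row preset matrix.
# Hypothetical layout per row: x, y, event type, timestamp.
preset = pack_actions([[100, 200, 0, 16.0], [300, 400, 1, 16.0]], rows=3, cols=4)
```

A fixed-size matrix keeps the input shape constant regardless of how many user actions occur at a given moment, which is one plausible reason for zero-filling unused rows.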

[0236] S1302, The frame interpolation service obtains the first real frame.

[0237] The first real frame is an image frame drawn by the game application.

[0238] In one possible implementation, the game application can draw the real image frame corresponding to the first real frame based on the touch events sent by the input manager.

[0239] The real image frame can include a first layer and a second layer. The game application renders the first real frame onto the first layer; and the game application renders UI elements onto the second layer. The frame interpolation service can intercept the game application's process of sending the real image frame to the display layer to obtain the first real frame from the first layer of the real image frame.

[0240] In another possible implementation, the game application can automatically draw the real image frame corresponding to the first real frame as the game progresses.

[0241] In one possible implementation, the frame interpolation service may also acquire a second real frame. This second real frame may be a real frame preceding the first real frame, or it may be a series of consecutive real frames preceding the first real frame.

[0242] S1303, The frame interpolation service obtains scene motion information.

[0243] In a possible implementation, during the rendering of the first real frame, the frame interpolation service can obtain scene motion information from a dedicated buffer.

[0244] For example, during the rendering of the first real frame, the game application first calculates the scene motion information of the game objects in the first real frame at the application layer. This scene motion information can be calculated using methods such as position transformation or view projection matrices. Next, this scene motion information is passed to the rendering pipeline as vertex attribute data. Then, the vertex shader processes this vertex data containing scene motion information and can output the processed data (such as velocity vectors) to a dedicated buffer via either the TransformFeedback mechanism (provided by OpenGL ES) or the Stream Output mechanism (provided by Vulkan). This buffer can be a scene motion vector buffer. In this way, the frame interpolation service can obtain the scene motion information by reading these dedicated buffers.
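As an illustration of computing scene motion information with view projection matrices, the following Python sketch projects a game object's world position in two consecutive frames and takes the difference as a screen-space velocity vector. It runs on the CPU with NumPy purely for clarity, whereas the paragraph above describes the vertex shader writing such data to a dedicated buffer. The function names and the use of normalized device coordinates are assumptions, not details from this application.

```python
import numpy as np


def project(world_pos, view_proj):
    """Project a 3D world position to normalized device coordinates (x, y)."""
    clip = view_proj @ np.append(np.asarray(world_pos, dtype=np.float64), 1.0)
    return clip[:2] / clip[3]  # perspective divide


def screen_velocity(prev_pos, curr_pos, prev_view_proj, curr_view_proj):
    """Screen-space displacement of one object between two frames.

    Using each frame's own view-projection matrix captures both object
    motion and camera motion.
    """
    return project(curr_pos, curr_view_proj) - project(prev_pos, prev_view_proj)


# With identity matrices, the velocity equals the change in x and y.
identity = np.eye(4)
velocity = screen_velocity([0.0, 0.0, 0.0], [0.1, 0.0, 0.0], identity, identity)
```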

[0245] It should be understood that the execution order of the above steps S1301, S1302, and S1303 can be adjusted according to actual usage requirements, and they can also be executed in parallel. This application embodiment does not limit this.

[0246] S1304, The frame interpolation service inputs user interaction information, scene motion information, the first real frame, and the second real frame into the image generation model to obtain the predicted frame output by the image generation model.

[0247] Optionally, S1304 can also be described as follows: The frame interpolation service uses an image generation model to predict based on user interaction information, scene motion information, a first real frame, and a second real frame to obtain a predicted frame.

[0248] Optionally, before the frame interpolation service inputs user interaction information, scene motion information, the first real frame, and the second real frame into the image generation model, the electronic device may preprocess the user interaction information, scene motion information, the first real frame, and the second real frame.

[0249] For example, preprocessing of the first real frame by the electronic device may include image enhancement to improve image quality and thus enhance image visibility. Image enhancement methods may include at least one of histogram equalization, wavelet transform, edge enhancement, etc.
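Of the enhancement methods listed above, histogram equalization is the simplest to show in code. The following is a minimal NumPy sketch for a single-channel 8-bit image. The function name `equalize_histogram` is hypothetical, and applying it to one channel (for example, a luminance channel) rather than to a full color frame is an assumption for illustration.

```python
import numpy as np


def equalize_histogram(gray):
    """Histogram-equalize a 2-D uint8 image and return a new uint8 image."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]  # CDF value at the darkest occupied level
    total = gray.size
    if total == cdf_min:
        return gray.copy()  # constant image: nothing to stretch
    # Map each gray level through the normalized cumulative distribution.
    lut = np.round((cdf - cdf_min) / (total - cdf_min) * 255)
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    return lut[gray]


# A low-contrast 2 x 2 image is stretched to the full 0..255 range.
enhanced = equalize_histogram(np.array([[50, 50], [100, 150]], dtype=np.uint8))
```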

[0250] For example, the preprocessing performed by the electronic device on user interaction information, scene motion information, and the second real frame may include the electronic device converting the user interaction information into an embedding vector, the electronic device converting the scene motion information into an embedding vector, and the electronic device converting the second real frame into an embedding vector. This application does not specifically limit whether the dimensions of each embedding vector are the same.

[0251] As one possible implementation, the frame interpolation service inputs the first real frame into the latent variable autoencoder of the image generation model; and the frame interpolation service inputs the embedding vectors corresponding to user interaction information, scene motion information, and the second real frame into the control network of the image generation model. The image generation model outputs a predicted frame.

[0252] After the phone executes step S1304, it can display the predicted frame output by the image generation model after the first real frame. Alternatively, after executing step S1304, the phone can display the final display frame after the real image frame corresponding to the first real frame. The final display frame can be the image frame obtained by combining the second layer with the predicted frame output by the image generation model.
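One way to combine the second layer with the predicted frame is ordinary alpha blending, sketched below in Python. This application does not specify the blending rule, so the source-over formula, the RGBA layout of the UI layer, and the function name `composite` are all assumptions for illustration.

```python
import numpy as np


def composite(predicted_rgb, ui_rgba):
    """Blend an RGBA UI layer over an RGB predicted frame (source-over).

    predicted_rgb: H x W x 3 uint8 array; ui_rgba: H x W x 4 uint8 array.
    """
    alpha = ui_rgba[..., 3:4].astype(np.float32) / 255.0
    ui_rgb = ui_rgba[..., :3].astype(np.float32)
    base = predicted_rgb.astype(np.float32)
    out = ui_rgb * alpha + base * (1.0 - alpha)
    return np.round(out).astype(np.uint8)


# A gray predicted frame with one opaque red UI pixel in the corner.
predicted = np.full((2, 2, 3), 100, dtype=np.uint8)
ui_layer = np.zeros((2, 2, 4), dtype=np.uint8)
ui_layer[0, 0] = (255, 0, 0, 255)
final_frame = composite(predicted, ui_layer)
```

Because the predicted frame contains no UI elements, reusing the unchanged UI layer in this way keeps buttons and other controls sharp in the final display frame.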

[0253] In one possible implementation, the game application places real image frames into an image display queue. The frame interpolation service places predicted frames generated by an image generation model into the image display queue. The display module in the mobile phone retrieves image frames (e.g., real image frames or predicted frames) from the image display queue and displays them on the phone's screen. The image display queue can be a producer-consumer queue; the image frames in the producer-consumer queue follow a first-in, first-out (FIFO) rule; the game application and the frame interpolation service act as producers in the producer-consumer queue, and the display module acts as a consumer in the producer-consumer queue.
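The producer-consumer arrangement above can be sketched with Python's standard `queue` and `threading` modules. The function names, the extra hand-off queue between the game application and the frame interpolation service, and the `None` sentinel used to stop the threads are hypothetical details added so that the sketch is runnable. The predicted frame is a placeholder tuple rather than the output of an image generation model.

```python
import queue
import threading

SENTINEL = None  # hypothetical end-of-stream marker


def game_app(display_queue, handoff, num_frames):
    """Producer 1: places real image frames into the image display queue."""
    for n in range(num_frames):
        frame = ("real", n)
        display_queue.put(frame)
        handoff.put(frame)  # the interpolation service sees the same frame
    handoff.put(SENTINEL)


def interpolation_service(display_queue, handoff):
    """Producer 2: places a predicted frame after each real frame it sees."""
    while True:
        frame = handoff.get()
        if frame is SENTINEL:
            display_queue.put(SENTINEL)
            return
        display_queue.put(("predicted", frame[1]))


def display_module(display_queue, shown):
    """Consumer: retrieves frames in first-in, first-out order."""
    while True:
        frame = display_queue.get()
        if frame is SENTINEL:
            return
        shown.append(frame)


def run_pipeline(num_frames):
    display_queue, handoff, shown = queue.Queue(), queue.Queue(), []
    threads = [
        threading.Thread(target=game_app, args=(display_queue, handoff, num_frames)),
        threading.Thread(target=interpolation_service, args=(display_queue, handoff)),
        threading.Thread(target=display_module, args=(display_queue, shown)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shown
```

In this sketch each predicted frame enters the queue after the real frame it was derived from, but a later real frame may still arrive before an earlier predicted frame. A real implementation would need frame pacing to control that interleaving.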

[0254] In this embodiment, the display module can acquire image frames placed in the image display queue in chronological order. For example, the display module can be SurfaceFlinger. Figure 16 is a schematic diagram of image display provided in an embodiment of this application.

[0255] As shown in Figure 16, the game application places the rendered real image frames into the image display queue. The frame interpolation service can obtain the first real frame from the first layer of the real image frame. Based on the first real frame and the image generation model, the frame interpolation service obtains the predicted frame. Then, the frame interpolation service places the predicted frame into the image display queue, and the time when the predicted frame is placed into the image display queue is later than the time when the real image frame is placed into the image display queue. Afterwards, the surface manager can obtain image frames from the image display queue and send the obtained image frames to the display driver. The display driver outputs the image frames to the mobile phone display screen to display them.

[0256] The technical solutions provided in the embodiments of this application will be further described below in conjunction with the user's experience of using the game application.

[0257] During user interaction with the game application, the phone launches the game application in response to a trigger action on the game application icon on the home screen. The phone then displays the first game interface. Following the first game interface, the phone displays a second game interface, which includes game objects and user interface elements. For example, the phone displays the interface 520 shown in Figure 5 above, which includes game objects (e.g., game object 521, game object 522, game object 523) and user interface elements (e.g., directional buttons 524, acceleration buttons 525). It should be understood that after the phone displays the second game interface, the phone can perform the above steps S1301-S1304 to generate a predicted frame and display the predicted frame after the second game interface.

[0258] Next, the mobile phone acquires the first real frame and user interaction information. The first real frame corresponds to the second game interface, but it does not include the user interface elements of the second game interface. For example, the first real frame may be the interface 530 shown in Figure 5. The user interaction information represents user actions performed before the second game interface was displayed. For example, the user interaction information may represent the trigger operation on the fourth button 5044 in the interface 500 shown in Figure 5.

[0259] Next, the mobile phone inputs the user interaction information and the first real frame into the image generation model. Then, the mobile phone obtains the predicted frame output by the image generation model. For further details on this process, refer to step S1304 above; the details are not repeated here. Following this, the mobile phone can display a third game interface based on the predicted frame after displaying the second game interface.

[0260] This application also provides an electronic device. Figure 17 is a schematic diagram of the hardware structure of another electronic device provided in an embodiment of this application. As shown in Figure 17, the electronic device may include one or more processors 1701, a memory 1702, and a communication interface 1703.

[0261] The memory 1702, communication interface 1703, and processor 1701 are coupled together. For example, the memory 1702, communication interface 1703, and processor 1701 can be coupled together via bus 1704.

[0262] The communication interface 1703 is used for data transmission with other devices. The memory 1702 stores computer program code. The computer program code includes computer instructions, which, when executed by the processor 1701, cause the electronic device to perform the relevant method steps in the above-described method embodiments of this application.

[0263] The processor 1701 can be a processor or controller, such as a central processing unit (CPU), a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, transistor logic devices, hardware components, or any combination thereof. It can implement or execute the various exemplary logic blocks, modules, and circuits described in conjunction with this disclosure. The processor can also be a combination that implements computational functions, such as a combination of one or more microprocessors, a combination of a DSP and a microprocessor, etc.

[0264] The bus 1704 can be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. The bus 1704 can be divided into an address bus, a data bus, a control bus, etc. For ease of representation, the bus is shown as only one line in Figure 17, but this does not mean that there is only one bus or one type of bus.

[0265] This application also provides a chip system. Figure 18 is a schematic diagram of a chip system provided in an embodiment of this application. As shown in Figure 18, the chip system 1800 includes at least one processor 1801 and at least one interface circuit 1802. The processor 1801 and the interface circuit 1802 are interconnected via lines. For example, the interface circuit 1802 can be used to receive signals from other devices (e.g., the memory of an electronic device). As another example, the interface circuit 1802 can be used to send signals to other devices (e.g., the processor 1801). For example, the interface circuit 1802 can read instructions stored in memory and send those instructions to the processor 1801. When the instructions are executed by the processor 1801, the electronic device can perform the steps in the above embodiments. The chip system may also include other discrete devices, which are not specifically limited in this application embodiment.

[0266] This application also provides a computer-readable storage medium storing computer program code. When the processor executes the computer program code, the electronic device executes the relevant method steps in the above method embodiments.

[0267] This application also provides a computer program product that, when run on a computer, causes the computer to execute the relevant method steps described in the above method embodiments.

[0268] The electronic devices, computer-readable storage media, and computer program products provided in this application are all used to execute the corresponding methods provided above. Therefore, for the beneficial effects they can achieve, refer to the beneficial effects of the corresponding methods provided above; details are not repeated here.

[0269] Through the above description of the embodiments, those skilled in the art can clearly understand that, for the sake of convenience and brevity, only the division of the above functional modules is used as an example. In actual applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above.

[0270] In the several embodiments provided in this application, it should be understood that the disclosed apparatus and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of modules or units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another device, or some features may be ignored or not executed. Furthermore, the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between devices or units may be electrical, mechanical, or other forms.

[0271] The units described as separate components may or may not be physically separate. A component shown as a unit can be one or more physical units; that is, it can be located in one place or distributed in multiple different locations. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0272] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.

[0273] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a readable storage medium. Based on this understanding, the technical solution of the embodiments of this application, in essence, or the part that contributes, or all or part of the technical solution, can be embodied in the form of a software product. This software product is stored in a storage medium and includes several instructions to cause a device (which may be a microcontroller, chip, etc.) or processor to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0274] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions within the technical scope disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.

Claims

1. An image display method, characterized in that, The method is applied to an electronic device, the electronic device having a game application installed, the method comprising: After the game application is launched, the first game interface of the game application is displayed; After displaying the first game interface, the second game interface is displayed; Acquire a first real frame and user interaction information; the first real frame corresponds to the second game interface, the first real frame does not include user interface elements of the second game interface, and the user interaction information represents user operations before the second game interface is displayed; Input the user interaction information and the first real frame into an image generation model; Obtain the predicted frame output by the image generation model; After the second game interface is displayed, the third game interface is displayed based on the predicted frame.

2. The method according to claim 1, characterized in that, The second game interface includes: user interface elements and game objects; The second game interface is displayed as follows: Render the first real frame, including the game object, onto the first layer; Render the user interface elements of the second game interface onto the second layer; After merging the first layer and the second layer, the second game interface is displayed; The acquisition of the first real frame and user interaction information includes: Obtain the first layer and the user interaction information, wherein the first layer represents the first real frame.

3. The method according to claim 2, characterized in that, The user operations prior to displaying the second game interface include first user operations; the method further includes: The first user operation has been received; Generate the touch event corresponding to the first user operation; The process of obtaining the first real frame and user interaction information also includes: Obtain the touch event corresponding to the first user operation, whereby the touch event represents the user interaction information.

4. The method according to claim 3, characterized in that, When the first user operation is a swipe operation, the touch event includes: a swipe start event, a swipe process event, and a swipe end event; A swipe image is obtained based on the touch event; Wherein, the first color channel of the swipe image corresponds to the swipe start event; the second color channel of the swipe image corresponds to the swipe process event; and the third color channel of the swipe image corresponds to the swipe end event. The grayscale value of the first color channel corresponds to the timestamp of the swipe start event; the grayscale value of the second color channel corresponds to the timestamp of the swipe process event; and the grayscale value of the third color channel corresponds to the timestamp of the swipe end event. The step of inputting the user interaction information and the first real frame into the image generation model includes: The swipe image and the first real frame are input into the image generation model.

5. The method according to claim 3 or 4, characterized in that, When the first user operation is a click operation, the touch event includes: a click event; A click image is obtained based on the touch event; The first color channel of the click image corresponds to the click event; The grayscale value of the first color channel corresponds to the timestamp of the click event; The step of inputting the user interaction information and the first real frame into the image generation model includes: The click image and the first real frame are input into the image generation model.

6. The method according to any one of claims 1-5, characterized in that, The second game interface includes the first game object; The second game interface is displayed as follows: The first real frame is drawn based on the velocity vector of the first game object; After merging the first real frame and the user interface elements, the second game interface is displayed; The method further includes: obtaining the velocity vector; The step of inputting the user interaction information and the first real frame into the image generation model includes: The velocity vector, the user interaction information, and the first real frame are input into the image generation model.

7. The method according to any one of claims 1-6, characterized in that, The first game interface includes a first switch that is in an off state; The acquisition of the first real frame and user interaction information includes: When the first switch in the off state is triggered, the first real frame and the user interaction information are acquired.

8. The method according to any one of claims 1-7, characterized in that, The image generation model includes a latent variable autoencoder, a scheduler, a denoising network, and a control network. The step of inputting the user interaction information and the first real frame into the image generation model includes: The first real frame is input into the latent variable autoencoder; Obtain the output of the latent variable autoencoder; The output of the latent variable autoencoder is input into the scheduler; Obtain the noise data of the first real frame output by the scheduler; The user interaction information is input into the control network; Obtain the output results of the control network; The noise data and the output of the control network are input into the denoising network.

9. A model training method, characterized in that, Applied to a training device, the method includes: Acquire training data, which includes features and labels. The features of the training data include a third real frame sample. The labels of the training data include a first real frame sample, a second real frame sample, and user interaction information samples. The first real frame sample is an image frame drawn by the game application, and does not include user interface elements. The second real frame sample was drawn by the game application before the first real frame sample, and does not include user interface elements. The user interaction information samples represent user actions between the timestamp corresponding to the second real frame sample and the timestamp corresponding to the first real frame sample. The features of the training data are input into the initial image generation model to obtain the initial training results; Calculate the first loss function based on the initial training results and the labels of the training data; The parameters in the initial image generation model are iterated based on the first loss function until the first loss function converges, thus obtaining the image generation model.

10. The method according to claim 9, characterized in that, The initial image generation model includes a latent variable autoencoder, a denoising network, and a control network; the denoising network includes a pre-trained first encoder, and the control network includes a second encoder. Before inputting the features of the training data into the initial image generation model to obtain the initial training result, the method further includes: Set the parameters of the second encoder to be the same as those of the first encoder; The step of inputting the features of the training data into the initial image generation model to obtain the initial training result includes: The first real frame sample is input into the latent variable autoencoder; Obtain the output of the latent variable autoencoder; The second real frame sample and the user interaction information sample are input into the control network; Obtain the output results of the control network; The output of the latent variable autoencoder and the output of the control network are input into the denoising network; Obtain the output of the denoising network; The initial training results include the output of the denoising network; The step of calculating the first loss function based on the initial training results and the labels of the training data includes: The first loss function is calculated based on the difference between the output of the denoising network and the third real frame sample; The step of iterating the parameters in the initial image generation model based on the first loss function until the first loss function converges to obtain the image generation model includes: The parameters of the second encoder are iterated based on the first loss function until the first loss function converges, thus obtaining the image generation model.

11. The method according to claim 9 or 10, characterized in that, The first loss function is the reconstruction loss, which is the difference between the initial training result and the third real frame sample in the pixel space; Alternatively, the first loss function is a weighted sum of reconstruction loss and VGG feature matching loss; wherein the VGG feature matching loss is the difference between the first feature map and the second feature map, the first feature map is the feature map extracted by the pre-trained VGG network at the target layer of the initial training result, and the second feature map is the feature map extracted by the pre-trained VGG network at the target layer of the third real frame sample.

12. The method according to claim 11, characterized in that, The denoising network is a lightweight U-Net denoising network. When the first loss function is the reconstruction loss, the first loss function satisfies a formula (not reproduced in this text) in which: $L_1$ represents the first loss function; $f_{\theta'}$ and $E$ relate to the lightweight U-Net denoising network, with $E$ representing its encoder; $D$ represents the latent variable autoencoder; $\epsilon$ represents noise; a further symbol (missing from this text) represents cosine noise; $o_t$ represents the vector corresponding to the first real frame sample; $o_{<t}$ represents the vector corresponding to the second real frame sample; $t$ represents the time step; $a_{<t}$ represents the vector corresponding to the user interaction information sample; $v_t$ represents the vector corresponding to the velocity vector; $o_{t+1}$ represents the vector corresponding to the third real frame sample; and "‖‖" indicates the sum of the absolute values of the elements of a vector.

13. An electronic device, characterized in that, The electronic device includes a processor and a memory; the processor is coupled to the memory; the memory is used to store computer program code; the computer program code includes computer instructions, which, when executed by the processor, cause the electronic device to perform the method as described in any one of claims 1-12.

14. A computer-readable storage medium, characterized in that, The computer-readable storage medium includes computer instructions that, when executed on an electronic device, cause the electronic device to perform the method as described in any one of claims 1-12.

15. A chip system, characterized in that, The chip system is applied to an electronic device, the chip system including one or more processors, the processors being configured to invoke computer instructions to cause the electronic device to perform the method as described in any one of claims 1-12.