Subtitle image generation method and apparatus, electronic device and storage medium

By dividing the subtitle image into layers and transforming the image, the problem of the monotonous effects of traditional subtitle animation is solved, and rich effects and efficient rendering of subtitle animation are achieved.

WO2026061166A9 · PCT designated stage · Publication Date: 2026-06-11 · TENCENT TECHNOLOGY (SHENZHEN) CO LTD


Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: TENCENT TECHNOLOGY (SHENZHEN) CO LTD
Filing Date: 2025-08-08
Publication Date: 2026-06-11

Application Information

Patent Timeline

08 Aug 2025: Application
11 Jun 2026: Publication

WO2026061166A9

IPC: G06T11/60; G06T3/02; H04N5/278

CPC: G06T11/60; G06T11/40; G06T3/02; G06T11/10

AI Tagging

Application Domain

Television system details; Geometric image transformation


AI Technical Summary

Technical Problem

In traditional subtitle animations, all displayed content exhibits the same animation effect, resulting in monotonous animation.

Method used

The subtitle image is divided into multiple image layers, and the initial attribute values of each layer are obtained. Text layout and drawing based on these values yield an initial subtitle layer for each image layer. Image transformation based on the target attribute values then produces the target subtitle layers, which are overlaid to generate the target subtitle image, achieving fine-grained control over different types of display content.

Benefits of technology

It enriches the animation effects of subtitle animations, improves the rendering efficiency and real-time performance of subtitle animations, and enhances the viewing experience.

Generated by Eureka AI based on patent content.


Abstract

Disclosed in embodiments of the present disclosure are a subtitle image generation method and apparatus, an electronic device, and a storage medium. The method comprises: acquiring initial attribute values of image levels in a subtitle image corresponding to a subtitle to be displayed, wherein the image levels are obtained by division on the basis of different types of display content in the subtitle image; acquiring a subtitle text of said subtitle, and for the image levels, performing text layout and rendering on the basis of the subtitle text and the initial attribute values to obtain initial subtitle layers of the image levels at an initial time point; acquiring target attribute values of the image levels, and for each image level, performing image transformation on the initial subtitle layer of the image level on the basis of the corresponding target attribute value to obtain a target subtitle layer of the image level at a target time point; and overlaying the plurality of target subtitle layers to generate a target subtitle image of said subtitle at the target time point. The embodiments of the present disclosure can enrich the animation effect of subtitles to be displayed.


Description

Methods, apparatus, electronic devices and storage media for generating subtitle images

[0001] This application claims priority to Chinese Patent Application No. 2024113052722, filed on September 19, 2024, entitled “Method, Apparatus, Electronic Device and Storage Medium for Generating Subtitle Images”, the entire contents of which are incorporated herein by reference.

Technical Field

[0002] This disclosure relates to the field of computer technology, and in particular to a method, apparatus, electronic device, and storage medium for generating subtitle images.

Background Technology

[0003] In audio and video playback, subtitles provide textual information corresponding to the audio and video content, helping viewers better understand the plot and dialogue. Animated subtitles can further highlight key information and enhance the atmosphere of the audio and video content, playing a vital role in improving the viewing experience. Currently, traditional subtitle animations present the same animation effect for all displayed content, resulting in relatively monotonous animation effects.

Summary of the Invention

[0004] The following is an overview of the subject matter described in detail in this disclosure. This overview is not intended to limit the scope of the claims.

[0005] This disclosure provides a method, apparatus, electronic device, and storage medium for generating subtitle images, which can enrich the animation effects of subtitles to be displayed.

[0006] In one aspect, embodiments of this disclosure provide a method for generating subtitle images, executed by an electronic device, including:

[0007] Obtain the initial attribute values of each image layer in the subtitle image corresponding to the subtitle to be displayed, wherein the image layer is divided based on different types of display content in the subtitle image, and the initial attribute value is the attribute value of the layer attribute of the image layer at the initial time point;

[0008] Obtain the subtitle text to be displayed. For each image layer, perform text layout and drawing based on the subtitle text and the initial attribute values of each image layer to obtain the initial subtitle layer of each image layer at the initial time point.

[0009] Obtain the target attribute value of each of the image layers. For each image layer, perform image transformation on the initial subtitle layer of the image layer based on the target attribute value of the image layer to obtain the target subtitle layer of the image layer at the target time point. The target attribute value is the attribute value of the layer attribute of the image layer at the target time point, and the target time point is a time point after the initial time point.

[0010] The target subtitle layers of each of the image layers are superimposed to generate the target subtitle image of the subtitle to be displayed at the target time point.

[0011] In another aspect, embodiments of this disclosure also provide a subtitle image generation apparatus, including:

[0012] The acquisition module is used to acquire the initial attribute values of each image layer in the subtitle image corresponding to the subtitle to be displayed. The image layers are divided based on different types of display content in the subtitle image, and the initial attribute values are the attribute values of the layer attributes of the image layer at the initial time point.

[0013] The layout and drawing module is used to obtain the subtitle text of the subtitle to be displayed, and for each image layer, perform text layout and drawing based on the subtitle text and the initial attribute values of each image layer to obtain the initial subtitle layer of each image layer at the initial time point.

[0014] The image transformation module is used to obtain the target attribute values of each of the image layers. For each image layer, based on the target attribute values of the image layer, the initial subtitle layer of the image layer is transformed to obtain the target subtitle layer of the image layer at the target time point. The target attribute values are the attribute values of the layer attributes of the image layer at the target time point, and the target time point is a time point after the initial time point.

[0015] The image generation module is used to overlay the target subtitle layers of each of the image layers to generate the target subtitle image of the subtitle to be displayed at the target time point.

[0016] In another aspect, this disclosure also provides an electronic device, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the above-described subtitle image generation method.

[0017] In another aspect, embodiments of this disclosure also provide a computer-readable storage medium storing a computer program that is executed by a processor to implement the above-described subtitle image generation method.

[0018] In another aspect, this disclosure also provides a computer program product comprising a computer program stored in a computer-readable storage medium. A processor of a computer device reads the computer program from the computer-readable storage medium and executes the computer program, causing the computer device to perform the above-described subtitle image generation method.

[0019] The embodiments disclosed herein provide at least the following beneficial effects. The initial attribute values of the layer attributes of each image layer at the initial time point are obtained, together with the subtitle text of the subtitle to be displayed. Because the image layers are divided based on the different types of display content in the subtitle image corresponding to the subtitle to be displayed, text layout and drawing can be performed based on the subtitle text and the initial attribute values to obtain the initial subtitle layer of each image layer at the initial time point, so that a separate subtitle layer is produced for each type of display content. Then, based on the target attribute values of the layer attributes at the target time point, each initial subtitle layer is transformed to obtain the target subtitle layer at the target time point, and the multiple target subtitle layers are superimposed to generate the target subtitle image of the subtitle to be displayed at the target time point. Because each target attribute value controls the image transformation of its corresponding initial subtitle layer, each image layer, and therefore each type of display content in the subtitle image, can be controlled at a fine granularity, which enriches the animation effects of the subtitle. Furthermore, across the multiple frame images of the subtitle to be displayed, the initial subtitle layer can be regarded as a layer of the initial frame image displayed at the initial time point, and the target subtitle layer as a layer of the target frame image displayed at the target time point. During subtitle animation rendering, text layout and drawing are performed only when generating the initial subtitle layer; the target subtitle layer is instead obtained by image transformation of the initial subtitle layer. This avoids frequent, computationally expensive text layout and drawing operations, thereby improving the rendering efficiency of the subtitle animation and enhancing its real-time performance and smoothness.

[0020] Other features and advantages of this disclosure will be set forth in the following description and will be apparent in part from the description or may be learned by practicing this disclosure.

Attached Figure Description

[0021] The accompanying drawings are provided to further understand the technical solutions of this disclosure and constitute a part of the specification. They are used together with the embodiments of this disclosure to explain the technical solutions of this disclosure and do not constitute a limitation on the technical solutions of this disclosure.

[0022] Figure 1 is a schematic diagram of an optional implementation environment provided by an embodiment of this disclosure;

[0023] Figure 2 is a schematic flowchart of an optional subtitle image generation method provided in an embodiment of this disclosure;

[0024] Figure 3 is a schematic diagram of a first optional process for determining the target subtitle layer according to an embodiment of this disclosure;

[0025] Figure 4 is a schematic diagram of a second optional process for determining the target subtitle layer according to an embodiment of this disclosure;

[0026] Figure 5 is a schematic diagram of a third optional process for determining the target subtitle layer provided in an embodiment of this disclosure;

[0027] Figure 6 is a schematic diagram of a fourth optional process for determining the target subtitle layer provided in an embodiment of this disclosure;

[0028] Figure 7 is a schematic diagram of a fifth optional process for determining the target subtitle layer provided in an embodiment of this disclosure;

[0029] Figure 8 is a schematic diagram of an optional mask transformation provided in an embodiment of this disclosure;

[0030] Figure 9 is a schematic diagram of an optional grouping of the dynamic attribute set provided in an embodiment of this disclosure;

[0031] Figure 10 is a schematic diagram of an optional process for determining the target attribute value according to an embodiment of this disclosure;

[0032] Figure 11 is a schematic diagram of an optional architecture for the subtitle image generation method provided in this embodiment of the present disclosure;

[0033] Figure 12 is a schematic diagram of an optional structure of the subtitle image generation device provided in an embodiment of this disclosure;

[0034] Figure 13 is a partial structural block diagram of the terminal provided in an embodiment of this disclosure;

[0035] Figure 14 is a partial structural block diagram of the server provided in an embodiment of this disclosure.

Detailed Implementation

[0036] To make the objectives, technical solutions, and advantages of this disclosure clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and are not intended to limit the scope of this disclosure.

[0037] It should be noted that in the various specific embodiments of this disclosure, when processing is required based on data related to the characteristics of the target object, such as target object attribute information or a set of attribute information, the permission or consent of the target object will be obtained first. Furthermore, the collection, use, and processing of this data will comply with relevant laws, regulations, and standards. The target object can be a user. In addition, when embodiments of this disclosure need to obtain target object attribute information, separate permission or consent from the target object will be obtained through pop-ups or redirection to a confirmation page. Only after obtaining the target object's separate permission or consent will the necessary target object-related data for the normal operation of the embodiments of this disclosure be obtained.

[0038] In this disclosure, the terms "module" or "unit" refer to a computer program or part of a computer program that has a predetermined function and works with other related parts to achieve a predetermined goal, and can be implemented wholly or partially using software, hardware (such as processing circuitry or memory), or a combination thereof. Similarly, a processor (or multiple processors or memory) can be used to implement one or more modules or units. Furthermore, each module or unit can be part of an overall module or unit that includes the functionality of that module or unit.

[0039] To facilitate understanding of the technical solutions provided in the embodiments of this disclosure, some key terms used in the embodiments of this disclosure will be explained below:

[0040] Subtitles: Used to display non-visual content such as dialogue in television, film, and stage productions in text form; also refers broadly to the text added during post-production of film and television works. Dialogue subtitles in film and television works generally appear at the bottom of the screen, while subtitles in theatrical works may appear on the sides or top of the stage. Their function is to display the audio content of the program as subtitles, helping viewers with hearing impairments understand the program content. Additionally, subtitles can be used to translate foreign language programs, allowing viewers who do not understand the foreign language to hear the original audio while simultaneously understanding the program content. Therefore, in audio and video playback, subtitles provide textual information corresponding to the audio and video content, helping viewers better understand the plot and dialogue.

[0041] Currently, in traditional subtitle animations, all displayed content exhibits the same animation effect, resulting in relatively monotonous animation effects.

[0042] Based on this, the present disclosure provides a method, apparatus, electronic device and storage medium for generating subtitle images, which can enrich the animation effects of subtitles to be displayed.

[0043] Referring to Figure 1, which is a schematic diagram of an optional implementation environment provided by an embodiment of the present disclosure, the implementation environment includes a terminal 101 and a server 102, wherein the terminal 101 and the server 102 are connected through a communication network.

[0044] For example, server 102 can obtain subtitle information input by terminal 101, wherein the subtitle information is information related to the subtitle to be displayed, such as information related to subtitle attributes, subtitle text, subtitle display time, etc. The server 102 obtains the initial attribute values of each image layer in the subtitle image corresponding to the subtitle to be displayed from the subtitle information. The image layer is divided based on different types of display content in the subtitle image, and the initial attribute value is the attribute value of the layer attribute of the image layer at the initial time point. The server 102 obtains the subtitle text of the subtitle to be displayed from the subtitle information. For each image layer, the server 102 performs text layout and drawing based on the subtitle text and the initial attribute values of each image layer to obtain the initial subtitle layer of each image layer at the initial time point. The server 102 obtains the target attribute value of each image layer. For each image layer, the server 102 performs image transformation on the initial subtitle layer of the image layer based on the target attribute value of the image layer to obtain the target subtitle layer of the image layer at the target time point. The target attribute value is the attribute value of the layer attribute of the image layer at the target time point, and the target time point is a time point after the initial time point. The server 102 overlays the target subtitle layers of each image layer to generate the target subtitle image of the subtitle to be displayed at the target time point. The server 102 sends the target subtitle image to the terminal 101.

[0045] Server 102 obtains the initial attribute values of the layer attributes of each image layer at the initial time point, and obtains the subtitle text of the subtitle to be displayed. Because the image layers are divided based on the different types of display content in the subtitle image corresponding to the subtitle to be displayed, server 102 can perform text layout and drawing based on the subtitle text and the initial attribute values to obtain the initial subtitle layer of each image layer at the initial time point, so that a separate subtitle layer is produced for each type of display content. Then, based on the target attribute values of each image layer at the target time point, each initial subtitle layer is transformed to obtain the target subtitle layer at the target time point, and the multiple target subtitle layers are superimposed to generate the target subtitle image at the target time point. Because each target attribute value controls the image transformation of its corresponding initial subtitle layer, each image layer, and therefore each type of display content in the subtitle image, can be controlled at a fine granularity, which enriches the animation effects of the subtitle. Furthermore, across the multiple frame images of the subtitle to be displayed, the initial subtitle layer can be regarded as a layer of the initial frame image displayed at the initial time point, and the target subtitle layer as a layer of the target frame image displayed at the target time point. During subtitle animation rendering, text layout and drawing are performed only when generating the initial subtitle layer; the target subtitle layer is instead obtained by image transformation of the initial subtitle layer. This avoids frequent, computationally intensive text layout and drawing, thereby improving the rendering efficiency of the subtitle animation and enhancing its real-time performance and smoothness.
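
The division of labor described above (lay out and draw once at the initial time point, then only transform and overlay for later time points) can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: `SubtitleRenderer`, `layout_and_draw`, `transform`, and `render_frame` are hypothetical names, a layer is reduced to a small dictionary, and the only animated attribute is a hypothetical `scale_y`.

```python
from dataclasses import dataclass, field

# Bottom-to-top order, following the example division into four layers (an assumption).
LAYER_ORDER = ["background", "shadow", "border", "text"]


@dataclass
class SubtitleRenderer:
    """Toy renderer: costly layout happens once, later frames are derived by transformation."""
    text: str
    initial_attrs: dict            # layer name -> attribute dict at the initial time point
    layout_calls: int = 0          # counts the expensive layout-and-draw operations
    _initial_layers: dict = field(default_factory=dict)

    def layout_and_draw(self, layer: str) -> dict:
        # Stand-in for text layout and drawing (the costly step).
        self.layout_calls += 1
        return {"layer": layer, "text": self.text, **self.initial_attrs[layer]}

    def initial_layers(self) -> dict:
        # Build the initial subtitle layers on first use, then reuse them.
        if not self._initial_layers:
            self._initial_layers = {name: self.layout_and_draw(name) for name in LAYER_ORDER}
        return self._initial_layers

    @staticmethod
    def transform(initial_layer: dict, target_attrs: dict) -> dict:
        # Stand-in for image transformation: derive the target layer from the initial one.
        return {**initial_layer, **target_attrs}

    def render_frame(self, target_attrs: dict) -> list:
        # Transform every layer, then stack the results bottom to top.
        layers = self.initial_layers()
        return [self.transform(layers[name], target_attrs.get(name, {})) for name in LAYER_ORDER]


renderer = SubtitleRenderer(
    text="Hello",
    initial_attrs={name: {"scale_y": 1.0} for name in LAYER_ORDER},
)
# Five frames that animate only the text layer; the other layers keep their initial values.
frames = [renderer.render_frame({"text": {"scale_y": 1.0 + 0.1 * k}}) for k in range(5)]
```

In this sketch `renderer.layout_calls` stays at one call per layer no matter how many frames are rendered, which is the efficiency argument made in the paragraph above.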

[0046] Server 102 can be a standalone physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms. Additionally, server 102 can also be a node server in a blockchain network.

[0047] Terminal 101 may be a mobile phone, computer, smart voice interaction device, smart home appliance, smart wearable device, vehicle terminal, etc., but is not limited to these. Terminal 101 and server 102 can be directly or indirectly connected through wired or wireless communication, and this embodiment of the disclosure does not impose any limitations.

[0048] Referring to Figure 2, which is an optional flowchart of a subtitle image generation method provided in an embodiment of the present disclosure, the subtitle image generation method can be executed by an electronic device, specifically by a server, or by a terminal, or by a server in conjunction with a terminal. The subtitle image generation method includes, but is not limited to, the following steps 201 to 204.

[0049] Step 201: Obtain the initial attribute values of each image layer in the subtitle image corresponding to the subtitle to be displayed.

[0050] The subtitle to be displayed refers to the subtitle that needs to be displayed on the video. The video can be obtained from sources such as movies, animations, and photographic works. This embodiment of the disclosure does not limit the source. Since subtitles are usually displayed for a period of time, the subtitle animation of the subtitle to be displayed can be displayed in multiple video frames. Displaying the subtitle to be displayed in the video is equivalent to rendering the corresponding subtitle image in multiple video frames. Each subtitle image is a frame image of the subtitle animation.

[0051] The image layers are divided based on the different types of display content in the subtitle image corresponding to the subtitle to be displayed. The subtitle image may include various types of display content, such as text, borders, shadows, and backgrounds. The text of the subtitle image refers to the text content displayed on the video screen, which may include dialogue, narration, or other information. The border of the subtitle image refers to the lines surrounding the text, which are used to increase the readability of the text. The shadow of the subtitle image refers to the shadow effect added below or around the text, which is used to increase the three-dimensionality of the subtitle. The background of the subtitle image refers to the color block or pattern behind the text, which is used to enhance the readability of the subtitle. Different types of display content in the subtitle image corresponding to the subtitle to be displayed are divided into different image layers. For example, the background of the subtitle image is divided into the first image layer, the shadow of the subtitle image is divided into the second image layer, the border of the subtitle image is divided into the third image layer, and the text of the subtitle image is divided into the fourth image layer.
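
The example division above can be captured in a small data structure. The sketch below is illustrative only: the enum name `LayerType` and the helper `stacking_order` are hypothetical, and the numeric values simply mirror the first-to-fourth ordering given in the example.

```python
from enum import IntEnum


class LayerType(IntEnum):
    """Image layers keyed by display-content type; the value is the position in the example."""
    BACKGROUND = 1  # color block or pattern behind the text
    SHADOW = 2      # shadow effect below or around the text
    BORDER = 3      # lines surrounding the text
    TEXT = 4        # the text content itself


def stacking_order() -> list:
    # Assumption: lower-numbered layers are drawn first, so they end up further back.
    return sorted(LayerType)
```

Keying the layers by content type is what later allows each type to receive its own attribute values and its own transformation.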

[0052] It should be noted that subtitle images can have multiple hierarchical attributes. Hierarchical attributes refer to attributes used to describe the characteristics of subtitle images. Dynamic effects can be added to subtitles to obtain corresponding subtitle animations. Since subtitle animation is a process of certain hierarchical attributes changing over time, that is, the hierarchical attributes of subtitles have at least one dynamic attribute, hierarchical attributes can be divided into static attributes and dynamic attributes according to whether they can achieve animation effects. Static attributes are attributes that cannot achieve animation effects, while dynamic attributes are attributes used to achieve animation effects.

[0053] For example, static properties may include font name properties, font size properties, bold properties, italic properties, underline properties, and strikethrough properties, etc.; dynamic properties may include text color properties, character spacing properties, border width properties, border color properties, shadow distance properties, shadow color properties, background color properties, edge blur properties, rotation properties, scaling properties, shear properties, position properties, color change ratio properties, and mask coordinate properties, etc.

[0054] Among them, the attribute value type of the font name attribute is string; the attribute value type of the font size attribute, character spacing attribute, border width attribute, shadow distance attribute, edge blur attribute, rotation attribute, scaling attribute, shear attribute, position attribute, and color ratio attribute is floating point number; the attribute value type of the text color attribute, border color attribute, shadow color attribute, and background color attribute is color value; the attribute value type of the bold attribute, italic attribute, underline attribute, and strikethrough attribute is boolean value; and the attribute value type of the mask coordinate attribute is coordinate.
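
The classification above (static versus dynamic, plus a value type per attribute) lends itself to a simple schema. The following is a sketch under assumptions: `ValueType`, `AttributeSpec`, and the snake_case attribute names are hypothetical, and only a handful of the listed attributes are included.

```python
from dataclasses import dataclass
from enum import Enum


class ValueType(Enum):
    STRING = "string"
    FLOAT = "float"
    COLOR = "color"
    BOOL = "bool"
    COORDINATE = "coordinate"


@dataclass(frozen=True)
class AttributeSpec:
    name: str
    value_type: ValueType
    dynamic: bool  # True if the attribute can be used to achieve an animation effect


# A few entries taken from the lists above; the names themselves are illustrative.
ATTRIBUTE_SPECS = [
    AttributeSpec("font_name", ValueType.STRING, dynamic=False),
    AttributeSpec("font_size", ValueType.FLOAT, dynamic=False),
    AttributeSpec("bold", ValueType.BOOL, dynamic=False),
    AttributeSpec("text_color", ValueType.COLOR, dynamic=True),
    AttributeSpec("border_width", ValueType.FLOAT, dynamic=True),
    AttributeSpec("rotation", ValueType.FLOAT, dynamic=True),
    AttributeSpec("mask_coordinates", ValueType.COORDINATE, dynamic=True),
]


def dynamic_attributes(specs):
    """Return the names of attributes that can take part in an animation."""
    return [spec.name for spec in specs if spec.dynamic]
```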

[0055] Specifically, hierarchical attributes can be represented using tuples. Taking dynamic attributes as an example, the formula for representing hierarchical attributes in subtitle images is as follows: $p_i = \langle v_{i,0}, v_{i,1}, t_{i,0}, t_{i,1}, c_i(t) \mid i \in N \rangle$

[0056] Here, $p_i$ denotes the $i$-th hierarchical attribute; $v_{i,0}$ is the initial value of the attribute when the $i$-th hierarchical attribute begins to change; $v_{i,1}$ is the final value of the attribute when the $i$-th hierarchical attribute finishes changing; $t_{i,0}$ is the time point at which the change begins; $t_{i,1}$ is the time point at which the change ends; $c_i(t)$ is the transition curve expression for the change of the $i$-th hierarchical attribute, which is equivalent to the attribute change function; and $N$ is the set of identifiers of the hierarchical attributes. $v_{i,0}$, $v_{i,1}$, $t_{i,0}$, $t_{i,1}$, and $c_i(t)$ can all be obtained from the subtitle information entered by relevant personnel.

[0057] It is worth noting that, for dynamic attributes, one basic animation effect corresponds to one dynamic attribute. For example, if the y-axis scaling factor of the subtitle changes linearly from 1 to 1.5 between 4 and 6 seconds, it can be represented as $p_{fscy} = \langle 1, 1.5, 4, 6, 0.25t \rangle$. From the display start time point $t_s$ of the subtitle until the change start time point $t_{i,0}$, the attribute value of the hierarchical attribute remains at its initial value; from the change end time point $t_{i,1}$ until the display end time point $t_e$ of the subtitle, the attribute value remains at its final value. The display start time point $t_s$ and the display end time point $t_e$ are the time points at which the display of the subtitle begins and ends, and both can be obtained from the subtitle information entered by relevant personnel.
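
Read together with the tuple representation, the behaviour described above amounts to a piecewise value function. The formulation below is a restatement under the assumption that the transition curve returns the attribute value directly (as in the 0.25t example, which gives 1 at t = 4 and 1.5 at t = 6); the symbol $a_i(t)$ is introduced here for illustration and does not appear in the original text.

```latex
a_i(t) =
\begin{cases}
v_{i,0}, & t_s \le t < t_{i,0} \\
c_i(t),  & t_{i,0} \le t \le t_{i,1} \\
v_{i,1}, & t_{i,1} < t \le t_e
\end{cases}
```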

[0058] Furthermore, static attributes can be considered a special case of dynamic attributes, so they can use the same representation formula. For a static attribute, the formula typically satisfies $c_i(t) = v_{i,0} = v_{i,1}$, $t_{i,0} = t_s$, and $t_{i,1} = t_e$, where $t_s$ is the display start time point of the subtitle and $t_e$ is the display end time point. That is, the display start time point is the change start time point, the display end time point is the change end time point, and the initial and final values are the same, which means that the value of a static attribute remains unchanged while the subtitle is displayed.
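
Because static and dynamic attributes share one representation, a single evaluation routine can serve both. The sketch below is a hedged illustration: `LayerAttribute`, `value_at`, and `static_attribute` are hypothetical names, attribute values are limited to floats, and the font size and display window are arbitrary illustrative numbers.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class LayerAttribute:
    """Hypothetical container mirroring the tuple <v0, v1, t0, t1, c(t)>."""
    v0: float                        # value when the change begins
    v1: float                        # value when the change ends
    t0: float                        # change start time point (seconds)
    t1: float                        # change end time point (seconds)
    curve: Callable[[float], float]  # transition curve c(t), assumed to return the value itself

    def value_at(self, t: float) -> float:
        # Hold the initial value before the change and the final value after it.
        if t < self.t0:
            return self.v0
        if t > self.t1:
            return self.v1
        return self.curve(t)


def static_attribute(value: float, t_s: float, t_e: float) -> LayerAttribute:
    # Static case: c(t) = v0 = v1, t0 = t_s, t1 = t_e.
    return LayerAttribute(value, value, t_s, t_e, lambda t: value)


# The y-axis scaling example: 1 -> 1.5 linearly between 4 s and 6 s, with c(t) = 0.25t.
fscy = LayerAttribute(1.0, 1.5, 4.0, 6.0, lambda t: 0.25 * t)
# A static attribute displayed from 2 s to 8 s (illustrative numbers).
font_size = static_attribute(48.0, t_s=2.0, t_e=8.0)
```

With these definitions, `fscy.value_at(5.0)` evaluates the curve midway through the change, while `font_size.value_at(t)` returns the same value for every `t`.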

[0059] The initial attribute values are the attribute values of the image layer's hierarchical attributes at the initial time point. For any hierarchical attribute, if the representation formula of the hierarchical attribute is not an empty set, it means that the subtitle image possesses that hierarchical attribute. The subtitle image can possess one or more types of hierarchical attributes. The types of hierarchical attributes of each image layer can be the same as the types of hierarchical attributes possessed by the subtitle image. Since different image layers are used to represent different types of subtitle content, there may be differences between the attribute values of some hierarchical attributes of different image layers. Specifically, the number of initial attribute values is the same as the number of types of hierarchical attributes of the image layer. Therefore, it is necessary to obtain the attribute value of each type of hierarchical attribute of the image layer at the initial time point. Assuming there are 10 types of hierarchical attributes of the image layer, it is necessary to obtain the attribute values of these 10 types of hierarchical attributes at the initial time point.

[0060] Specifically, when the first frame image among the multiple frame images of the subtitle to be displayed is selected as the initial frame image, the initial time point is the playback time point corresponding to the first frame image, that is, the display start time point of the subtitle. When another frame image of the subtitle to be displayed is selected as the initial frame image, the initial time point is the playback time point corresponding to the selected frame image. The initial time point is therefore the playback time point of the initial frame image. When the initial frame image is the first frame image, the $v_{i,0}$ of the layer attribute is determined as the initial attribute value.

[0061] Based on this, since different image levels correspond to different types of display content in the subtitle image to be displayed, obtaining the initial attribute values of each image level is equivalent to obtaining the initial attribute values of different types of display content in the subtitle image. This enables separate processing of different types of display content in the subtitle image, and subsequently allows for accurate and personalized adjustment of the subtitle effects of different types of display content in the subtitle image, thereby improving the quality and richness of the subtitle animation effects corresponding to the subtitle to be displayed.

[0062] Step 202: Obtain the subtitle text to be displayed. For each image layer, perform text layout and drawing based on the subtitle text and the initial attribute values of each image layer to obtain the initial subtitle layer of each image layer at the initial time point.

[0063] The subtitle text refers to the text content displayed on the video screen. The subtitle text can be obtained from subtitle information input by relevant personnel. For the image layer containing the subtitle text, since the image layer corresponds to the text type, the pixels at the location of the text content in the initial subtitle layer corresponding to that image layer are non-transparent pixels, while pixels at other locations are transparent pixels. This ensures that the initial subtitle layer can display the text content. Transparent pixels are pixels with a transparent color value and are invisible. For other image layers, since the image layer does not correspond to the text type, the pixels at the location of the text content in the initial subtitle layer corresponding to that image layer are transparent pixels. This ensures that the initial subtitle layer corresponding to that image layer will not display text content, but only the display content corresponding to that image layer type.

[0064] It should be noted that different subtitle layers usually contain different display content. By overlaying multiple subtitle layers, the display content of each subtitle layer can be merged and displayed on the same screen. Therefore, by overlaying the initial subtitle layers of all image layers, the initial frame image of the subtitle to be displayed at the initial time point can be obtained.

[0065] Based on this, the initial subtitle layer is used to display the corresponding type of content. The initial subtitle layer is obtained by text layout drawing based on the subtitle text and initial attribute values. Through appropriate text layout drawing, the clarity and readability of the content displayed by the subtitle to be displayed at the initial time point can be effectively improved. However, the text layout drawing process usually involves a large amount of computation, especially when the subtitle animation involves many dynamically changing hierarchical attributes. The text layout drawing process is often time-consuming. By only performing text layout drawing on the subtitle image of the subtitle to be displayed at the initial time point, the frequent text layout drawing with a large amount of computation can be avoided, thereby improving the rendering efficiency of the subtitle animation and enhancing the real-time performance and smoothness of the subtitle animation.

[0066] Specifically, to perform text layout and drawing based on the subtitle text and initial attribute values, the subtitle text and initial attribute values can be input into a text layout and drawing function to generate an initial subtitle layer. The text layout and drawing function can be a function provided in related technologies or a function adaptively optimized based on the current video scene; this embodiment does not limit the specific function. The formula for determining the initial subtitle layer is as follows:

$$\mathrm{image}_{k,0} = \mathrm{typo}(\mathrm{text}, V_{k,0})$$

[0067] Here, $\mathrm{image}_{k,0}$ is the initial subtitle layer corresponding to the k-th image layer, typo() is the text layout and drawing function, text is the subtitle text, and $V_{k,0}$ is the set of initial attribute values for all layer attributes corresponding to the k-th image layer.

[0068] For example, in text layout drawing, in addition to the initial attribute values of dynamic attributes, the attribute values of static attributes are also needed. When a certain type of hierarchical attribute is an empty set, the attribute value of that hierarchical attribute is usually set to the preset attribute value in the preset style. Specifically, the attribute values of font size, text alignment, line spacing, character spacing, and the interface size of the display interface, as well as the subtitle text, can be input into the text wrapping algorithm. The text is then laid out using the text wrapping algorithm to achieve a reasonable layout of the subtitle text. Then, the image is drawn based on the text layout result to obtain the initial subtitle layer. Font size is a hierarchical attribute. Text alignment, line spacing, character spacing, and the interface size of the display interface can be obtained through pre-setting or real-time input. The text wrapping algorithm can adopt greedy algorithms, dynamic programming algorithms, etc., and this embodiment of the disclosure is not limited here.
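As a rough illustration of the greedy option mentioned above, the following sketch wraps subtitle text into lines. It is only a sketch under simplifying assumptions that are not part of the disclosure: every character is assumed to have the same advance width (`char_width`, which in practice would be derived from font size and character spacing), and the names `greedy_wrap`, `max_width`, and `char_width` are hypothetical.

```python
def greedy_wrap(text, max_width, char_width):
    """Greedy line wrapping: keep adding words to the current line while it
    fits within max_width; otherwise start a new line.

    Assumption: each character (including spaces) advances by char_width.
    """
    lines = []
    current = ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        # An overlong first word is placed on its own line rather than dropped.
        if len(candidate) * char_width <= max_width or not current:
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines


# Interface 100 units wide, 10 units per character: at most 10 characters per line.
wrapped = greedy_wrap("the quick brown fox", 100, 10)
```

A dynamic programming variant would instead minimize a penalty over all possible break points; the greedy pass is shown only because it is the shortest to state.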

[0069] Step 203: Obtain the target attribute values of each image layer. For each image layer, based on the target attribute values of that image layer, perform image transformation on the initial subtitle layer of that image layer to obtain the target subtitle layer of that image layer at the target time point.

[0070] The target attribute value is the attribute value of the image layer's hierarchical attribute at the target time point. The target time point is a time point after the initial time point. Typically, when the initial time point is the start time point of the subtitle display, the target time point is another display time point of the subtitle. There can be multiple target time points, so there can also be multiple target subtitle layers. For example, all display time points after the start time point of the subtitle display can be determined as target time points. The initial time point can also be another display time point of the subtitle. This embodiment of the present disclosure does not limit this.

[0071] Specifically, the initial subtitle layer is a two-dimensional digital matrix composed of pixels. The image is also a two-dimensional digital matrix composed of pixels. Therefore, the initial subtitle layer can be regarded as a special kind of image. Thus, the initial subtitle layer can be transformed to obtain a new subtitle layer, so as to add dynamic effects to the subtitles. For example, color transformation or affine transformation can be performed on the initial subtitle layer.

[0072] Based on this, by obtaining the target attribute value of the layer attribute at the target time point, the corresponding initial subtitle layer can be transformed based on the target attribute value to obtain the target subtitle layer at the target time point. Therefore, the target subtitle layer is not obtained by text layout drawing, but is obtained by image transformation based on the initial subtitle layer. Obtaining the target subtitle layer through image transformation with less computation can effectively improve the generation efficiency of the target subtitle layer.

[0073] Step 204: Overlay the target subtitle layers of each image level to generate the target subtitle image of the subtitle to be displayed at the target time point.

[0074] Each target subtitle layer is used to display the content of the subtitle image corresponding to the subtitle to be displayed at the target time point. Different target subtitle layers display different types of content. Overlaying multiple target subtitle layers is equivalent to overlaying different types of display content, which can generate a target subtitle image that merges different types of display content, which is equivalent to obtaining the target frame image of the subtitle to be displayed at the target time point.

[0075] Based on this, the initial attribute values of the layer attributes of each image layer at the initial time point are obtained, together with the subtitle text of the subtitle to be displayed. Because the image layers are divided according to the different types of display content in the subtitle image, performing text layout and drawing based on the subtitle text and the initial attribute values yields an initial subtitle layer for each image layer at the initial time point, which separates the subtitle layers by display content. Then, based on the target attribute values of each image layer at the target time point, the initial subtitle layer is transformed to obtain the target subtitle layer at the target time point, and the target subtitle layers of all image layers are superimposed to generate the target subtitle image of the subtitle to be displayed at the target time point. Because each target attribute value controls the image transformation of its own initial subtitle layer, each image layer, and therefore each type of display content in the subtitle image, can be controlled at a fine granularity, which enriches the animation effects of the subtitle. Furthermore, across the multiple frame images of the subtitle to be displayed, the initial subtitle layer can be regarded as a layer in the initial frame image displayed at the initial time point, and the target subtitle layer as a layer in the target frame image displayed at the target time point. During subtitle animation rendering, text layout and drawing are performed only when generating the initial subtitle layer; the target subtitle layer is instead obtained through image transformation of the initial subtitle layer. This avoids frequent, computationally expensive text layout and drawing operations, thereby improving the rendering efficiency of the subtitle animation and enhancing its real-time performance and smoothness.

[0076] Specifically, in the subtitle animation rendering process, text layout and drawing is performed only once for the subtitles, to obtain the initial subtitle layer corresponding to each image layer. Then, the initial subtitle layers are transformed to obtain the target subtitle layers. Finally, the target subtitle image is generated by superimposing the target subtitle layers. In this way, subtitle animation is reduced to a process of dynamically transforming subtitle images, which can improve rendering efficiency.
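The layout-once, transform-per-frame flow can be sketched as follows. This is a toy model, not the disclosed implementation: layers are small lists of integers (0 meaning transparent), `typo` is a stand-in for the real text layout and drawing function, the only transformation shown is a recolor, and all function and variable names are hypothetical.

```python
layout_calls = 0  # counts how often the expensive layout step runs


def typo(text, layer_name):
    """Stand-in for text layout and drawing: returns a one-row layer."""
    global layout_calls
    layout_calls += 1
    if layer_name == "text":
        return [[1 if ch != " " else 0 for ch in text]]
    return [[0 for _ in text]]  # e.g. a decoration layer with nothing drawn


def recolor(layer, color):
    """Cheap image transformation: recolor the non-transparent pixels."""
    return [[color if p != 0 else 0 for p in row] for row in layer]


def overlay(layers):
    """Overlay layers in order; later layers cover earlier ones where opaque."""
    height, width = len(layers[0]), len(layers[0][0])
    out = [[0] * width for _ in range(height)]
    for layer in layers:
        for y in range(height):
            for x in range(width):
                if layer[y][x] != 0:
                    out[y][x] = layer[y][x]
    return out


def render(text, layer_names, colors_per_frame):
    # Layout runs once per image layer, at the initial time point only.
    initial = {name: typo(text, name) for name in layer_names}
    frames = []
    for colors in colors_per_frame:
        # Each target frame is derived purely by image transformation.
        targets = [recolor(initial[name], colors[name]) for name in layer_names]
        frames.append(overlay(targets))
    return frames


frames = render("a b", ["shadow", "text"],
                [{"shadow": 5, "text": 7}, {"shadow": 5, "text": 9}])
```

After rendering two target frames for two image layers, `layout_calls` is 2 rather than 4, which is the saving the paragraph describes.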

[0077] In one possible implementation, based on the target attribute value of the image layer, an image transformation is performed on the initial subtitle layer of the image layer to obtain the target subtitle layer of the image layer at the target time point. Specifically, this can be done by: determining the attribute change between the target attribute value of the image layer and the initial attribute value of the image layer; and based on the attribute change, performing an image transformation on the initial subtitle layer of the image layer to obtain the target subtitle layer of the image layer at the target time point.

[0078] Based on this, the attribute change is first determined according to the target attribute value and the initial attribute value. Then, the corresponding initial subtitle layer is transformed based on the attribute change to obtain the target subtitle layer at the target time point. Since the initial attribute value is the attribute value of the layer attribute at the initial time point, and the target attribute value is the attribute value of the layer attribute at the target time point, the attribute change can determine whether the attribute value of the layer attribute changes and the specific amount of change when the target time point is reached. This is equivalent to indicating the difference between the layer attribute values corresponding to the initial time point and the target time point through the attribute change. Using the attribute change as a reference factor for image transformation of the initial subtitle layer can improve the reliability of image transformation and further improve the accuracy of image transformation results, thereby obtaining an accurate target subtitle layer.
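A minimal sketch of determining the attribute change, assuming each layer attribute has a numeric value and that attribute values are stored in dictionaries keyed by attribute name (the names `attribute_change`, `changed_attributes`, and the example attributes are hypothetical):

```python
def attribute_change(initial, target):
    """Per-attribute difference between target and initial attribute values."""
    return {name: target[name] - initial[name] for name in initial}


def changed_attributes(delta):
    """Names of the attributes whose value differs at the target time point."""
    return [name for name, d in delta.items() if d != 0]


initial = {"rotation": 0.0, "scale": 1.0, "x": 10.0}
target = {"rotation": 0.0, "scale": 1.5, "x": 30.0}

delta = attribute_change(initial, target)
to_transform = changed_attributes(delta)
```

Only the attributes listed in `to_transform` would need to drive an image transformation of the initial subtitle layer; an attribute with a zero change leaves the layer as it is.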

[0079] In one possible implementation, refer to Figure 3, which is a schematic diagram of a first optional process for determining the target subtitle layer provided by an embodiment of the present disclosure. The layer attribute includes a color attribute, the target attribute value includes the target color value of the color attribute, and the attribute change amount includes the color change amount of the color attribute. Performing an image transformation on the initial subtitle layer of the image layer based on the attribute change amount, to obtain the target subtitle layer of the image layer at the target time point, can specifically be: when the color change amount indicates that the image layer has undergone a color change, transforming the color values of the non-transparent pixels in the initial subtitle layer into the target color value to obtain the target subtitle layer of the image layer at the target time point.

[0080] Here, the color attribute indicates the color state of the display content of the corresponding type of image layer. The color attribute can be regarded as an inline attribute that applies only to some of the image layers of the subtitle to be displayed. The inline attribute can be denoted as $S_{\text{line}}$. The color change amount is the difference between the target color value and the initial color value. The target color value is the color value of the color attribute at the target time point, and the initial color value is the color value of the color attribute at the initial time point.

[0081] In this context, the non-transparent pixels in the initial subtitle layer are the parts that need to be displayed, while the transparent pixels in the initial subtitle layer are the parts that do not need to be displayed. The corresponding type of display content in the subtitle to be displayed can be determined by the non-transparent pixels in the initial subtitle layer. For example, for the image layer where the text of the subtitle is located, the text content of the subtitle can be determined by the non-transparent pixels in the initial subtitle layer.

[0082] Specifically, the value of the color attribute can be represented numerically. When the color change indicates that a color change has occurred in the image layer, the color change is not 0, while when the color change indicates that no color change has occurred in the image layer, the color change is 0. Therefore, by determining whether the color change corresponding to each image layer is 0, it is possible to accurately determine whether the color change indicates that a color change has occurred in the image layer.

[0083] Based on this, image transformation includes color transformation. For any image layer, when the color change indicates that the image layer has undergone a color change, the initial subtitle layer representing that image layer needs to be color transformed. Specifically, the color values of the non-transparent pixels in the corresponding initial subtitle layer are transformed to the target color value, which is equivalent to transforming the initial color value to the target color value, thus obtaining the target subtitle layer of the image layer at the target time point. After determining the subtitle animation of the subtitle to be displayed through the initial subtitle layer and the target subtitle layer, the dynamic effect generated by the color transformation during the display of the subtitle can increase the visual dynamism, improve the efficiency of information transmission, and enhance the viewing experience of the audience.

[0084] Taking an image transformation that only includes color transformation as an example, the formula for determining the target subtitle layer is:

$$\mathrm{image}_{k,t} = \mathrm{color}(\mathrm{image}_{k,0}, v_{k,t})$$

[0085] Here, $\mathrm{image}_{k,t}$ is the target subtitle layer corresponding to the k-th image layer at the target time point t, $\mathrm{image}_{k,0}$ is the initial subtitle layer corresponding to the k-th image layer at the initial time point, $v_{k,t}$ is the target color value of the color attribute corresponding to the k-th image layer at the target time point t, and color() is a color transformation function used to transform the color values of the non-transparent pixels in the initial subtitle layer to the target color value.

[0086] Furthermore, the color values of the pixels in the target subtitle layer can be represented as:

$$f_{k,t}(x,y) = \begin{cases} v_{k,t}, & f_{k,0}(x,y) \neq 0 \\ 0, & f_{k,0}(x,y) = 0 \end{cases}$$

[0087] Here, $f_{k,t}(x,y)$ is the color value of pixel (x, y) in the target subtitle layer corresponding to the k-th image layer at the target time point t, $f_{k,0}(x,y)$ is the color value of pixel (x, y) in the initial subtitle layer corresponding to the k-th image layer at the initial time point, $f_{k,0}(x,y) \neq 0$ identifies non-transparent pixels (the transparent color value is 0), and $v_{k,t}$ is the target color value of the color attribute corresponding to the k-th image layer at the target time point t.
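The color transformation described above can be sketched as follows, assuming a layer is a list of rows of integer color values with 0 as the transparent value (for example packed RGB). The function name `color_transform` and the sample values are hypothetical.

```python
TRANSPARENT = 0  # transparent color value, as stated in the text


def color_transform(layer, initial_color, target_color):
    """Recolor non-transparent pixels when the color change amount is non-zero.

    Transparent pixels are left untouched, so the shape of the displayed
    content is preserved and only its color changes.
    """
    if target_color - initial_color == 0:
        return [row[:] for row in layer]  # no color change: copy unchanged
    return [[target_color if p != TRANSPARENT else TRANSPARENT for p in row]
            for row in layer]


layer = [[0, 3, 3],
         [0, 0, 3]]
recolored = color_transform(layer, 3, 9)
unchanged = color_transform(layer, 3, 3)
```

No layout information is consulted here; the transformation is a single pass over the pixels, which is why it is cheaper than redrawing the text.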

[0088] In one possible implementation, referring to Figure 4, which is a schematic diagram of a second optional process for determining the target subtitle layer according to an embodiment of this disclosure, the layer attributes include geometric attributes, and the attribute change amount includes the geometric change amount of the geometric attributes. Based on the attribute change amount, an image transformation is performed on the initial subtitle layer of the image layer to obtain the target subtitle layer of the image layer at the target time point. Specifically, this can be: determining affine transformation parameters based on the geometric change amount, constructing an affine transformation matrix based on the affine transformation parameters, and performing an affine transformation on the initial subtitle layer based on the affine transformation matrix to obtain the target subtitle layer of the image layer at the target time point.

[0089] Here, geometric attributes indicate the geometric state of the display content of the corresponding type of image layer. For example, the geometric state may include position, size, and shape. Geometric attributes can be regarded as paragraph attributes that act on the entire subtitle to be displayed. The paragraph attribute can be denoted as $S_{\text{para}}$. The geometric change amount is the difference between the target geometric value and the initial geometric value. The target geometric value is the geometric value of the geometric attribute at the target time point, and the initial geometric value is the geometric value of the geometric attribute at the initial time point.

[0090] Specifically, affine transformations typically involve various types of geometric changes. For example, geometric properties can include rotation, scaling, shearing, and position properties, while geometric changes can include rotational changes of rotational properties, scaling changes of scaling properties, shearing changes of shearing properties, and translation changes of position properties. The affine transformation parameters are first determined based on the geometric changes, and the corresponding affine transformation parameters can be determined for each type of geometric change.

[0091] Based on this, image transformation includes affine transformation, which refers to transforming the position, size, or shape of the subtitle layer. When the layer attributes include geometric attributes, i.e., when the geometric attributes are not an empty set, all subtitle layers corresponding to the image layers need to undergo affine transformation. An affine transformation matrix is constructed based on the affine transformation parameters. Then, based on the affine transformation matrix, the corresponding initial subtitle layer is subjected to affine transformation to obtain the target subtitle layer of the image layer at the target time point. Specifically, the target subtitle layer is obtained by multiplying the affine transformation matrix with the matrix corresponding to the initial subtitle layer. After determining the subtitle animation of the subtitle to be displayed through the initial subtitle layer and the target subtitle layer, the dynamic effect generated by the affine transformation can increase the visual appeal during the display of the subtitle.

[0092] Taking an image transformation that only includes affine transformation as an example, the formula for determining the target subtitle layer is:

$$\mathrm{image}_{k,t} = \mathrm{affine}(\mathrm{image}_{k,0}, V_{k,\Delta})$$

[0093] Here, $V_{k,\Delta} = \{\, v_{i,t} - v_{i,0} \mid v_{i,t} \in V_{k,t},\ v_{i,0} \in V_{k,0} \,\}$, $\mathrm{image}_{k,t}$ is the target subtitle layer corresponding to the k-th image layer at the target time point t, $\mathrm{image}_{k,0}$ is the initial subtitle layer corresponding to the k-th image layer at the initial time point, $V_{k,\Delta}$ is the set of geometric change amounts of all geometric attributes corresponding to the k-th image layer, $V_{k,t}$ is the set of target geometric values for all geometric attributes corresponding to the k-th image layer, $V_{k,0}$ is the set of initial geometric values for all geometric attributes corresponding to the k-th image layer, $v_{i,t}$ is the target geometric value of the i-th geometric attribute, $v_{i,0}$ is the initial geometric value of the i-th geometric attribute, and affine() is the affine transformation function. The affine transformation function moves the pixels in the initial subtitle layer to new coordinates to obtain the target subtitle layer. Specifically, it constructs an affine transformation matrix from the affine transformation parameters and applies that matrix to the corresponding initial subtitle layer.

[0094] Furthermore, the pixels in the target subtitle layer can be represented as:

$$f_{k,t}(x', y') = f_{k,0}(x, y)$$

[0095] Here, $f_{k,t}(x', y')$ is the color value of pixel (x′, y′) in the target subtitle layer corresponding to the k-th image layer at the target time point t, and $f_{k,0}(x, y)$ is the color value of pixel (x, y) in the initial subtitle layer corresponding to the k-th image layer at the initial time point. In other words, pixel (x, y) in the initial subtitle layer is moved to the new coordinates (x′, y′).

[0096] Specifically, for any type of affine transformation, the coordinate transformation formula is as follows:

$$\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} a & b & c \\ d & e & f \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}$$

[0097] Here, (x, y) are the coordinates of a pixel in the initial subtitle layer, and (x′, y′) are the coordinates of the corresponding pixel in the target subtitle layer. The 3×3 matrix is the reference transformation matrix, and a, b, c, d, e, and f are all affine transformation parameters: a is the scaling factor in the horizontal direction, b is the shear factor in the horizontal direction, c is the translation in the horizontal direction, d is the shear factor in the vertical direction, e is the scaling factor in the vertical direction, and f is the translation in the vertical direction.

[0098] It should be noted that when the affine transformation is a scaling transformation, a and e are determined by the scaling amount; when the affine transformation is a shearing transformation, b and d are determined by the shearing amount; when the affine transformation is a translation transformation, c and f are determined by the translation amount; and when the affine transformation is a rotation transformation, a, b, d, and e are determined by the rotation amount.

[0099] Therefore, when multiple types of affine transformations are required, it is necessary to first determine the reference transformation matrix corresponding to each type of affine transformation, then multiply the reference transformation matrices in sequence to obtain the affine transformation matrix, and finally multiply the affine transformation matrix with the matrix corresponding to the initial subtitle layer to obtain the target subtitle layer.
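The following sketch builds one reference transformation matrix per transformation type, multiplies them in sequence, and applies the result to a small layer. It rests on several assumptions not fixed by the text: homogeneous 3×3 matrices, the first matrix in the list being applied first, and a simple forward mapping with rounding. All function names are hypothetical.

```python
import math


def scaling(sx, sy):
    return [[sx, 0, 0], [0, sy, 0], [0, 0, 1]]


def shearing(shx, shy):
    return [[1, shx, 0], [shy, 1, 0], [0, 0, 1]]


def translation(tx, ty):
    return [[1, 0, tx], [0, 1, ty], [0, 0, 1]]


def rotation(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0], [s, c, 0], [0, 0, 1]]


def matmul3(m, n):
    """Product of two 3x3 matrices."""
    return [[sum(m[i][k] * n[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def compose(matrices):
    """Combine reference matrices; the first one in the list is applied first."""
    result = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    for m in matrices:
        result = matmul3(m, result)
    return result


def apply_to_point(m, x, y):
    """Map (x, y) to (x', y') using the first two rows of the matrix."""
    return (m[0][0] * x + m[0][1] * y + m[0][2],
            m[1][0] * x + m[1][1] * y + m[1][2])


def affine(layer, m):
    """Forward-map every non-transparent pixel to its rounded new coordinates."""
    height, width = len(layer), len(layer[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if layer[y][x] != 0:
                nx, ny = apply_to_point(m, x, y)
                nx, ny = round(nx), round(ny)
                if 0 <= nx < width and 0 <= ny < height:
                    out[ny][nx] = layer[y][x]
    return out


combined = compose([scaling(2, 2), translation(1, 0)])  # scale, then translate
moved = affine([[5, 0, 0], [0, 0, 0]], translation(1, 1))
```

Forward mapping can leave holes when a layer is enlarged; practical renderers usually iterate over destination pixels with the inverse matrix and interpolate, which is omitted here for brevity.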

[0100] In one possible implementation, the layer attribute includes a masking attribute, and the target attribute value includes a target masking value of the masking attribute. Based on the target attribute value of the image layer, an image transformation is performed on the initial subtitle layer of the image layer to obtain the target subtitle layer of the image layer at the target time point. Specifically, this can be done by: determining a masking area in the initial subtitle layer based on the target masking value; and transforming the color values of non-transparent pixels located within the masking area in the initial subtitle layer to preset masking color values to obtain the target subtitle layer of the image layer at the target time point.

[0101] Here, the masking attribute indicates the masking element of the display content of the corresponding image layer. The masking attribute can be regarded as a special attribute that does not directly affect the subtitle to be displayed. The special attribute can be denoted as $S_{\text{spec}}$. The target mask value is the mask value of the masking attribute at the target time point and is used to determine the masking area of the subtitle to be displayed at the target time point. The masking area is a specific area in the initial subtitle layer whose display effect is controlled; for example, the color values of the non-transparent pixels in the masking area can be changed to the mask color value. The mask color value can be preset to a transparent color value or a non-transparent color value. This embodiment of the present disclosure does not limit this.

[0102] Based on this, image transformation includes mask transformation, which refers to transforming mask attributes. When the layer attribute includes a mask attribute (i.e., the mask attribute is not an empty set), all subtitle layers corresponding to the image layers need to undergo mask transformation. Based on the target mask value, the mask area is determined in the corresponding initial subtitle layer. Then, in the corresponding initial subtitle layer, the color values of the transparent pixels in the initial subtitle layer are kept unchanged to ensure that the display content of the corresponding image layer type remains unchanged. The color values of the non-transparent pixels located in the mask area are transformed into mask color values to obtain the target subtitle layer. This achieves accurate control of the display effect of the mask area. After determining the subtitle animation of the subtitle to be displayed through the initial subtitle layer and the target subtitle layer, during the display of the subtitle to be displayed, the mask transformation will cause the masked area of the display content in the subtitle to be displayed to gradually increase, decrease, or move. By processing part of the display content in the subtitle to be displayed through the dynamically changing mask area, the key parts of the subtitle can be highlighted, thereby increasing visual appeal, improving information transmission efficiency, and enhancing the viewing experience of the audience.

[0103] In one possible implementation, referring to Figure 5, which is a schematic diagram of a third optional process for determining the target subtitle layer according to an embodiment of the present disclosure, the masking attribute includes a color change ratio attribute, the target masking value includes a target color change ratio value of the color change ratio attribute, and the masking area is determined in the initial subtitle layer based on the target masking value. Specifically, this can be done by: obtaining the layer width of the target subtitle layer, determining a width threshold based on the product of the target color change ratio value and the layer width; and determining the area with a horizontal coordinate less than the width threshold in the initial subtitle layer as the masking area.

[0104] The mask transformation includes color mask transformation, the mask attributes include the color change ratio attribute corresponding to the color mask transformation, the target color change ratio value is the attribute value of the color change ratio attribute at the target time point, and the layer width of the target subtitle layer is specifically the distance between the left boundary and the right boundary of the target subtitle layer.

[0105] Based on this, the width threshold is determined by multiplying the target color change ratio value by the layer width. Then, the area in the initial subtitle layer whose horizontal coordinate is less than the width threshold is determined as the masking area, so the horizontal coordinate of every pixel in the masking area is less than the width threshold. The larger the target color change ratio value, the larger the width threshold and the masking area; conversely, the smaller the target color change ratio value, the smaller the width threshold and the masking area. Once the target color change ratio value of the subtitle to be displayed at the target time point is determined, the masking area follows directly from it, and the color values of the non-transparent pixels located in the masking area are transformed into the mask color value. This mask color value is usually a non-transparent color value, which produces the dynamic effect of the subtitle content changing color word by word and thereby achieves rich visual effects. It also sets off the parts of the subtitle that have not yet changed color, as well as the boundary between the colored and uncolored parts, thereby increasing visual appeal, improving information transmission efficiency, and enhancing the viewing experience of the audience.

[0106] Specifically, under normal circumstances, the x-coordinate of the pixel on the left edge of the initial subtitle layer can be set to 0, with the positive direction of the x-axis being horizontal to the right. Therefore, the x-coordinate of the pixel at other positions is greater than 0, and the masked area is located to the left of the non-masked area. Alternatively, the x-coordinate of the pixel on the left edge of the initial subtitle layer can be set to other values. Assuming that the x-coordinate of the pixel on the left edge is the first preset value, the first preset value can be regarded as the offset of the area. The product of the target color change ratio value and the layer width needs to be added to the first preset value to obtain the width threshold.

[0107] For example, by setting the color-changing ratio attribute, a karaoke effect can be added to the subtitle to be displayed, that is, at a certain display time, the first half of the subtitle to be displayed displays one color, while the second half of the subtitle to be displayed displays another color.

[0108] Taking an image transformation that only includes color masking as an example, the formula for determining the pixels in the target subtitle layer is:

$$f_{k,t}(x,y) = \begin{cases} v_{kf}, & f_{k,0}(x,y) \neq 0 \ \text{and}\ x < w \cdot C_{kf}(t) \\ f_{k,0}(x,y), & \text{otherwise} \end{cases}$$

[0109] Here, $f_{k,t}(x,y)$ is the color value of pixel (x, y) in the target subtitle layer corresponding to the k-th image layer at the target time point t, $f_{k,0}(x,y)$ is the color value of pixel (x, y) in the initial subtitle layer corresponding to the k-th image layer at the initial time point, (x, y) are the coordinates of the pixel, $v_{kf}$ is the mask color value, x is the horizontal coordinate of a pixel in the initial subtitle layer, w is the layer width, $C_{kf}(t)$ is the target color change ratio value, and $w \cdot C_{kf}(t)$ is the width threshold.
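A minimal sketch of the color masking step, assuming integer color values with 0 as transparent and the layer width taken as the number of pixel columns. The optional `x_left` parameter models the case where the left edge does not sit at coordinate 0; `color_mask`, `ratio`, `mask_color`, and `x_left` are hypothetical names.

```python
def color_mask(layer, ratio, mask_color, x_left=0):
    """Recolor non-transparent pixels whose x-coordinate is below the threshold.

    threshold = x_left + ratio * width, where width is the number of columns
    and x_left is the x-coordinate assigned to the leftmost column.
    """
    width = len(layer[0])
    threshold = x_left + ratio * width
    return [[mask_color if (p != 0 and x_left + x < threshold) else p
             for x, p in enumerate(row)]
            for row in layer]


layer = [[1, 1, 0, 1]]
half = color_mask(layer, 0.5, 8)   # left half recolored
full = color_mask(layer, 1.0, 8)   # whole line recolored
none = color_mask(layer, 0.0, 8)   # nothing recolored yet
```

Sweeping `ratio` from 0 to 1 over successive target time points gives the left-to-right karaoke fill, with the transparent pixel at index 2 untouched throughout.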

[0110] In one possible implementation, refer to Figure 6, which is a schematic diagram of a fourth optional process for determining the target subtitle layer provided by an embodiment of the present disclosure. The masking attributes include multiple masking coordinate attributes, and the target masking value includes the target masking coordinates of each masking coordinate attribute. Determining the masking area in the initial subtitle layer based on the target masking value can specifically be: determining the visible area in the initial subtitle layer according to the target masking coordinates; and determining the area outside the visible area in the initial subtitle layer as the masking area.

[0111] The masking transformation includes a visibility masking transformation, the masking attributes include the visibility masking attributes corresponding to the visibility masking transformation, the target masking coordinates are the attribute values of the masking coordinate attributes at the target time point, and the target masking coordinates are located on the boundary of the visible area.

[0112] Specifically, the shape of the visible area can be rectangular, circular, elliptical, etc., and this embodiment of the present disclosure is not limited thereto. Taking a rectangular shape as an example, the number of masking coordinate attributes can be two. One masking coordinate attribute has a target masking coordinate of the upper left corner of the visible area, and the other masking coordinate attribute has a target masking coordinate of the lower right corner of the visible area. In addition, the number of masking coordinate attributes can also be other numbers, and the target masking coordinate can also be the coordinates of other positions in the visible area. This embodiment of the present disclosure is not limited thereto.

[0113] Based on this, the visible area is determined in the corresponding initial subtitle layer from the target masking coordinates, and the area outside the visible area is determined as the masking area. The color values of non-transparent pixels within the masking area are transformed into the masking color value, which can be the transparent color value. This enables a dynamic effect in which the subtitle content is revealed word by word, which highlights the displayed part of the subtitle and enriches the visual effect, thereby improving information transmission efficiency and the viewing experience.

[0114] Taking an image transformation that only includes visibility masking as an example, the formula for determining the pixels in the target caption layer is:

[0115] In this formula, the left-hand side is the color value of pixel $(x, y)$ in the target subtitle layer of the k-th image layer at the target time point $t$; $f_{k,0}(x, y)$ is the color value of pixel $(x, y)$ in the initial subtitle layer of the k-th image layer at the initial time point, where $(x, y)$ are the pixel coordinates; $(x_{s,0}, y_{s,0})$ is the first target masking coordinate; and $(x_{e,0}, y_{e,0})$ is the second target masking coordinate. Here the visible area is a rectangle: the first target masking coordinate is located at its upper left corner, and the second target masking coordinate is located at its lower right corner. The transparent color value is 0.
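
A minimal Python sketch of the rectangular visibility mask of [0112] to [0115] is given below. The function name `apply_visibility_mask`, the list-of-rows layer representation, and the choice to treat both corner coordinates as inside the visible area are assumptions of this example, not details stated in the text.

```python
def apply_visibility_mask(layer, top_left, bottom_right, transparent=0):
    """Keep pixels inside the visible rectangle; make all others transparent.

    layer:        list of rows of integer color values (0 = transparent).
    top_left:     (x_s, y_s), first target masking coordinate.
    bottom_right: (x_e, y_e), second target masking coordinate.
    """
    x_s, y_s = top_left
    x_e, y_e = bottom_right
    result = []
    for y, row in enumerate(layer):
        new_row = []
        for x, color in enumerate(row):
            # Inclusive bounds are an assumption of this sketch.
            inside = x_s <= x <= x_e and y_s <= y <= y_e
            new_row.append(color if inside else transparent)
        result.append(new_row)
    return result


layer = [[7, 7, 7, 7], [7, 7, 7, 7], [7, 7, 7, 7]]
visible = apply_visibility_mask(layer, top_left=(1, 0), bottom_right=(2, 1))
print(visible)  # [[0, 7, 7, 0], [0, 7, 7, 0], [0, 0, 0, 0]]
```

Animating `bottom_right` so that the rectangle widens over time reveals the subtitle progressively, as described in [0113].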

[0116] In one possible implementation, referring to FIG7, FIG7 is a fifth optional flowchart of determining the target subtitle layer provided by the present disclosure embodiment. The image transformation may include one or more of color transformation, affine transformation and mask transformation. During the image transformation process, it is necessary to determine in sequence whether color transformation, affine transformation and mask transformation are needed, and perform the corresponding transformation only when needed, which can improve the flexibility and reliability of image transformation.

[0117] It should be noted that when the image transformation includes multiple transformations, the steps for determining the target subtitle layer need to be adjusted. The following is a detailed description using the example of an image transformation that simultaneously includes color transformation, affine transformation, and masking transformation.

[0118] First, the step of transforming the color values of the non-transparent pixels in the initial subtitle layer of the image layer into the target color values to obtain the target subtitle layer of the image layer at the target time point is adjusted to: transforming the color values of the non-transparent pixels in the initial subtitle layer of the image layer into the target color values to obtain the first subtitle layer of the image layer at the target time point.

[0119] Then, the step of performing an affine transformation on the initial subtitle layer of the image layer based on the affine transformation matrix to obtain the target subtitle layer of the image layer at the target time point is adjusted to: performing an affine transformation on the first subtitle layer of the image layer based on the affine transformation matrix to obtain the second subtitle layer of the image layer at the target time point.

[0120] Finally, the step of transforming, in the initial subtitle layer of the image layer, the color values of the non-transparent pixels located in the masking area into the preset masking color value to obtain the target subtitle layer of the image layer at the target time point is adjusted to: transforming, in the second subtitle layer of the image layer, the color values of the non-transparent pixels located in the masking area into the preset masking color value to obtain the target subtitle layer of the image layer at the target time point.
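
The ordering in [0116] to [0120] (color transformation, then affine transformation, then mask transformation, each applied only when needed) can be sketched as a small Python pipeline. The function `transform_layer` and the three toy step functions are hypothetical stand-ins chosen for this example; they are far simpler than the transformations the text describes and only show how each step consumes the previous step's output.

```python
def transform_layer(initial_layer, color=None, affine=None, mask=None):
    """Apply the optional steps in the fixed order color -> affine -> mask.

    Each step is a callable taking and returning a layer, or None when that
    transformation is not needed for this image layer.
    """
    layer = initial_layer
    if color is not None:
        layer = color(layer)    # yields the "first subtitle layer"
    if affine is not None:
        layer = affine(layer)   # yields the "second subtitle layer"
    if mask is not None:
        layer = mask(layer)     # yields the "target subtitle layer"
    return layer


# Toy stand-ins (assumptions for illustration only; 0 = transparent).
def recolor(layer):
    return [[3 if c != 0 else 0 for c in row] for row in layer]

def shift_right(layer):
    return [[0] + row[:-1] for row in layer]

def hide_last_column(layer):
    return [row[:-1] + [0] for row in layer]


full = transform_layer([[1, 1, 0]], recolor, shift_right, hide_last_column)
color_only = transform_layer([[1, 1, 0]], color=recolor)
print(full)        # [[0, 3, 0]]
print(color_only)  # [[3, 3, 0]]
```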

[0121] It is worth noting that, referring to Figure 8, which is an optional flowchart of the mask transformation provided in the embodiments of this disclosure, the mask transformation may include one or more of color mask transformation and visibility mask transformation. During the mask transformation process, it is necessary to determine in turn whether color mask transformation and visibility mask transformation are needed, and perform the corresponding transformation only when needed, which can improve the flexibility and reliability of mask transformation.

[0122] It should be noted that when the mask transformation includes multiple transformations, the steps for determining the target subtitle layer need to be further adjusted. The following is a detailed description using the example of mask transformation including both color mask transformation and visibility mask transformation.

[0123] First, during the color mask transformation, a first type of masking area is determined based on the target color change ratio value, and the masking color value corresponding to the color mask transformation can be set to the color value $\mathrm{Color}_{kf}$. The step of transforming, in the initial subtitle layer of the image layer, the color values of the non-transparent pixels located in the masking area into the preset masking color value to obtain the target subtitle layer of the image layer at the target time point is adjusted to: transforming, in the second subtitle layer of the image layer, the color values of the non-transparent pixels located in the first type of masking area into the color value $\mathrm{Color}_{kf}$ to obtain the third subtitle layer of the image layer at the target time point.

[0124] Then, during the visibility mask transformation, a second type of masking area is determined based on the target masking coordinates, and the masking color value corresponding to the visibility mask transformation can be set to the transparent color value. The step of transforming, in the initial subtitle layer of the image layer, the color values of the non-transparent pixels located in the masking area into the preset masking color value to obtain the target subtitle layer of the image layer at the target time point is adjusted to: transforming, in the third subtitle layer of the image layer, the color values of the non-transparent pixels located in the second type of masking area into the transparent color value to obtain the target subtitle layer of the image layer at the target time point.

[0125] Based on this, when the image transformation simultaneously includes color transformation, affine transformation, color masking transformation, and visibility masking transformation, the formula for determining the first subtitle layer can be as follows:

[0126] In this formula, the left-hand side is the first subtitle layer of the k-th image layer at the target time point $t$; $\mathrm{image}_{k,0}$ is the initial subtitle layer of the k-th image layer at the initial time point; $v_{k,t}$ is the target color value of the color attribute at the target time point $t$; and $\mathrm{color}()$ is the color transformation function, which transforms the color values of the non-transparent pixels in the initial subtitle layer into the target color value.

[0127] Furthermore, the color values of the pixels in the first subtitle layer can be specifically represented as follows:

[0128] In this formula, the left-hand side is the color value of pixel $(x, y)$ in the first subtitle layer of the k-th image layer at the target time point $t$; $f_{k,0}(x, y)$ is the color value of pixel $(x, y)$ in the initial subtitle layer of the k-th image layer at the initial time point; $f_{k,0}(x, y) \neq 0$ identifies non-transparent pixels, the transparent color value being 0; and $v_{k,t}$ is the target color value of the color attribute corresponding to the k-th image layer at the target time point $t$.
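
The per-pixel rule of [0128] (a pixel takes the target color value when its initial color value is not 0, and stays transparent otherwise) can be sketched in Python as follows. The function name `color_transform` and the list-of-rows layer representation are assumptions of this example.

```python
def color_transform(layer, target_color, transparent=0):
    """Set every non-transparent pixel to `target_color`; leave transparent pixels alone."""
    return [
        [target_color if color != transparent else color for color in row]
        for row in layer
    ]


initial_layer = [[0, 4, 4], [4, 0, 8]]
first_layer = color_transform(initial_layer, target_color=6)
print(first_layer)  # [[0, 6, 6], [6, 0, 6]]
```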

[0129] Then, the formula for determining the second subtitle layer can be as follows:

[0130] Here $V_{k,\Delta} = \{v_{i,t} - v_{i,0} \mid v_{i,t} \in V_{k,t},\ v_{i,0} \in V_{k,0}\}$. The left-hand side of the formula is the second subtitle layer of the k-th image layer at the target time point $t$, and the layer to which the function is applied is the first subtitle layer of the k-th image layer at the target time point $t$; $V_{k,\Delta}$ is the set of geometric changes of all geometric attributes corresponding to the k-th image layer; $V_{k,t}$ is the set of target geometric values of all geometric attributes corresponding to the k-th image layer; $V_{k,0}$ is the set of initial geometric values of all geometric attributes corresponding to the k-th image layer; $v_{i,t}$ is the target geometric value of the i-th geometric attribute; $v_{i,0}$ is the initial geometric value of the i-th geometric attribute; and $\mathrm{affine}()$ is the affine transformation function, which transforms the pixels in the first subtitle layer to new coordinates to obtain the second subtitle layer.

[0131] Furthermore, the pixels in the second subtitle layer can be specifically represented as follows:

[0132] In this formula, the left-hand side is the color value of pixel $(x', y')$ in the second subtitle layer of the k-th image layer at the target time point $t$, and the right-hand side is the color value of pixel $(x, y)$ in the first subtitle layer of the k-th image layer at the target time point $t$; this is equivalent to transforming pixel $(x, y)$ in the first subtitle layer to the new coordinates $(x', y')$. The affine transformation function constructs an affine transformation matrix from the affine transformation parameters and applies the matrix to the corresponding first subtitle layer.
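
One way to picture the affine step of [0130] to [0132] is the Python sketch below, which builds a 3x3 matrix from rotation, scale, shear, and translation parameters and moves each non-transparent pixel $(x, y)$ to $(x', y')$. Everything specific here is an assumption of the example: the text does not say how the geometric changes map to matrix parameters, in what order the elementary matrices are composed, or how pixels are resampled. The sketch uses forward mapping with rounding, which can leave gaps under scaling or rotation; an implementation that needs gap-free output would normally map destination pixels back to the source instead.

```python
import math


def matmul(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def affine_matrix(rotation=0.0, scale=(1.0, 1.0), shear=0.0, translation=(0.0, 0.0)):
    """Compose translate * rotate * shear * scale (the order is an assumption)."""
    cos_r, sin_r = math.cos(rotation), math.sin(rotation)
    s = [[scale[0], 0, 0], [0, scale[1], 0], [0, 0, 1]]
    h = [[1, shear, 0], [0, 1, 0], [0, 0, 1]]
    r = [[cos_r, -sin_r, 0], [sin_r, cos_r, 0], [0, 0, 1]]
    t = [[1, 0, translation[0]], [0, 1, translation[1]], [0, 0, 1]]
    m = s
    for step in (h, r, t):
        m = matmul(step, m)
    return m


def apply_affine(layer, m, transparent=0):
    """Move each non-transparent pixel (x, y) to its transformed position (x', y')."""
    height, width = len(layer), len(layer[0])
    out = [[transparent] * width for _ in range(height)]
    for y, row in enumerate(layer):
        for x, color in enumerate(row):
            if color == transparent:
                continue
            new_x = round(m[0][0] * x + m[0][1] * y + m[0][2])
            new_y = round(m[1][0] * x + m[1][1] * y + m[1][2])
            # Pixels that leave the canvas are dropped in this sketch.
            if 0 <= new_x < width and 0 <= new_y < height:
                out[new_y][new_x] = color
    return out


first_layer = [[1, 0, 0], [0, 0, 0]]
second_layer = apply_affine(first_layer, affine_matrix(translation=(1, 1)))
print(second_layer)  # [[0, 0, 0], [0, 1, 0]]
```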

[0133] Then, the formula for determining the pixels in the third subtitle layer can be as follows:

[0134] In this formula, the left-hand side is the color value of pixel $(x, y)$ in the third subtitle layer of the k-th image layer at the target time point $t$, computed from the color value of pixel $(x, y)$ in the second subtitle layer of the k-th image layer at the target time point $t$, where $(x, y)$ are the pixel coordinates; $\mathrm{Color}_{kf}$ is the masking color value; $x$ is the x-coordinate of a pixel in the second subtitle layer; $w$ is the layer width; $C_{kf}(t)$ is the target color change ratio value; and $w \cdot C_{kf}(t)$ is the width threshold.

[0135] Then, the specific formula for determining the pixels in the target subtitle layer can be:

[0136] In this formula, the left-hand side is the color value of pixel $(x, y)$ in the target subtitle layer of the k-th image layer at the target time point $t$, computed from the color value of pixel $(x, y)$ in the third subtitle layer of the k-th image layer at the target time point $t$, where $(x, y)$ are the pixel coordinates; $(x_{s,0}, y_{s,0})$ is the first target masking coordinate; and $(x_{e,0}, y_{e,0})$ is the second target masking coordinate. Here the visible area is a rectangle: the first target masking coordinate is located at its upper left corner, and the second target masking coordinate is located at its lower right corner. The transparent color value is 0.

[0137] In one possible implementation, obtaining the initial attribute values of each image layer in the subtitle image corresponding to the subtitle to be displayed may specifically include: obtaining the dynamic attribute set of the subtitle image corresponding to the subtitle to be displayed; dividing the dynamic attribute set into an equivalent attribute set and a non-equivalent attribute set; and, when the non-equivalent attribute set is empty, obtaining the initial attribute values of each image layer in the subtitle image corresponding to the subtitle to be displayed from the equivalent attribute set.

[0138] The dynamic attribute set includes hierarchical attributes used to achieve animation effects; the equivalent attribute set includes hierarchical attributes whose animation effects can be achieved through image transformation; and the non-equivalent attribute set includes hierarchical attributes whose animation effects cannot be achieved through image transformation. All hierarchical attributes in the dynamic attribute set are dynamic attributes; all hierarchical attributes in the equivalent attribute set are equivalent attributes; and all hierarchical attributes in the non-equivalent attribute set are non-equivalent attributes. The non-equivalent attribute set can be denoted as $S_{not}$.

[0139] For example, referring to FIG9, FIG9 is a schematic diagram of an optional grouping of dynamic attribute sets provided in an embodiment of the present disclosure.

[0140] The equivalent attribute set may include the aforementioned color attribute, geometric attribute, and masking attribute. The color attribute may include text color attribute, border color attribute, shadow color attribute, and background color attribute. The geometric attribute may include rotation attribute, scaling attribute, shear attribute, and position attribute. The masking attribute may include color change ratio attribute and masking coordinate attribute. The non-equivalent attribute set may specifically include character spacing attribute, border width attribute, shadow distance attribute, and edge blur attribute. In addition, the dynamic attribute set may also include other hierarchical attributes, which are not limited in this embodiment.

[0141] It is worth noting that when the color attributes of the subtitle image corresponding to the subtitle to be displayed are empty, that is, the text color attribute, border color attribute, shadow color attribute, and background color attribute are all empty, the subtitle image can be treated as a single overall layer, denoted $L_a$. In this case there is no need to split the subtitle image into multiple image layers, and the color of each type of display content in the subtitle image can be configured using the preset color values in the preset style. When the color attributes are not empty, that is, at least one of the text color attribute, border color attribute, shadow color attribute, and background color attribute is not empty, the subtitle image corresponding to the subtitle to be displayed must be split into multiple image layers. The list of image layers is therefore as follows:

[0142] where $H$ is the list of image layers; $L_k$ is one of the image layers; $L_a$ is the overall layer; $L_d$ is the image layer corresponding to the background of the subtitle image; $L_s$ is the image layer corresponding to the shadow of the subtitle image; $L_b$ is the image layer corresponding to the border of the subtitle image; $L_w$ is the image layer corresponding to the text of the subtitle image; $S_{line}$ denotes the color attributes; and $\emptyset$ is the empty set.

[0143] Based on this, when the non-equivalent attribute set is empty, the attribute values of the non-equivalent attributes do not change and are usually set to the preset attribute values in the preset style. The initial attribute values of each image layer in the subtitle image corresponding to the subtitle to be displayed are then obtained from the equivalent attribute set; these initial attribute values are attribute values of dynamic attributes. In the subsequent subtitle animation rendering process, only the dynamic effects corresponding to the equivalent attributes need to be added. Therefore, generating the target subtitle layer does not require text layout and drawing; the target subtitle layer is instead obtained by image transformation of the initial subtitle layer. This avoids frequent, computationally expensive text layout and drawing, which improves the rendering efficiency of the subtitle animation as well as its real-time performance and smoothness.

[0144] Furthermore, because non-equivalent attributes are closely tied to glyphs, the attribute values of the non-equivalent attributes change when the non-equivalent attribute set is not empty. Since image transformation cannot effectively handle dynamic changes in glyphs, it is not used in the subsequent subtitle animation rendering process in this case. Instead, the hierarchical attributes of the subtitle to be displayed are obtained at each display time point, and for each display time point, text layout and drawing are performed based on the subtitle text and the hierarchical attributes at that time point to obtain the real subtitle image at that time point. This is equivalent to laying out and drawing every frame of the subtitle animation anew. The real subtitle images at the display time points can then be used to generate a subtitle animation with dynamically changing glyphs, which ensures the quality of the subtitle animation.
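
The decision described in [0137] to [0144] can be sketched in Python as a partition of the dynamic attributes followed by a choice of rendering path. The attribute name strings and the return labels are hypothetical identifiers invented for this example (the text names the attributes but gives no identifiers), and treating any attribute outside the listed equivalent attributes as non-equivalent is a conservative assumption of the sketch, since the text leaves "other hierarchical attributes" unclassified.

```python
# Equivalent attributes listed in [0140]; the identifier strings are hypothetical.
EQUIVALENT_ATTRIBUTES = {
    "text_color", "border_color", "shadow_color", "background_color",  # color
    "rotation", "scaling", "shear", "position",                        # geometric
    "color_change_ratio", "masking_coordinate",                        # masking
}


def split_dynamic_attributes(dynamic_attributes):
    """Divide a set of dynamic attributes into (equivalent, non_equivalent)."""
    dynamic = set(dynamic_attributes)
    equivalent = dynamic & EQUIVALENT_ATTRIBUTES
    # Assumption: anything not known to be equivalent is treated as non-equivalent.
    non_equivalent = dynamic - EQUIVALENT_ATTRIBUTES
    return equivalent, non_equivalent


def choose_render_path(dynamic_attributes):
    """Use image transformation only when the non-equivalent set is empty."""
    _, non_equivalent = split_dynamic_attributes(dynamic_attributes)
    if not non_equivalent:
        return "image_transform"      # lay out text once, then transform per frame
    return "relayout_each_frame"      # glyph-related change: redraw every frame


print(choose_render_path({"text_color", "rotation"}))      # image_transform
print(choose_render_path({"text_color", "border_width"}))  # relayout_each_frame
```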

[0145] Specifically, the initial attribute values can be divided into dynamic attribute values and static attribute values. In addition to the dynamic attribute set of the subtitle to be displayed, the static attribute set of the subtitle to be displayed also needs to be obtained; all hierarchical attributes in the static attribute set are static attributes. When the static attribute set is empty, no initial attribute values need to be obtained from it. When the static attribute set is not empty, initial attribute values of each image layer in the subtitle image corresponding to the subtitle to be displayed are also obtained from the static attribute set; these initial attribute values are static attribute values.

[0146] In one possible implementation, the hierarchical attributes in the equivalent attribute set include the color attribute of each image layer. When the non-equivalent attribute set is empty, obtaining the initial attribute values of each image layer in the subtitle image corresponding to the subtitle to be displayed from the equivalent attribute set may specifically include: for each image layer, keeping the color attribute corresponding to the current image layer unchanged in the equivalent attribute set and adjusting the color attributes corresponding to the remaining image layers to transparent, to obtain the hierarchical attribute set corresponding to the current image layer; and obtaining the initial attribute values of each image layer in the subtitle image corresponding to the subtitle to be displayed from the respective hierarchical attribute sets.

[0147] Adjusting the color attribute to transparent is equivalent to adjusting the color value of the color attribute to a transparent color value. The color value can be represented numerically; for example, the transparent color value can be 0. Each image layer has a corresponding color attribute. For example, assuming there are four image layers, the first image layer corresponds to the background of the subtitle image, the second image layer corresponds to the shadow of the subtitle image, the third image layer corresponds to the border of the subtitle image, and the fourth image layer corresponds to the text of the subtitle image. Then, the color attribute corresponding to the first image layer can be the background color attribute, the color attribute corresponding to the second image layer can be the shadow color attribute, the color attribute corresponding to the third image layer can be the border color attribute, and the color attribute corresponding to the fourth image layer can be the text color attribute.

[0148] Based on this, when the non-equivalent attribute set is empty, a layer attribute set corresponding to each image layer is generated through the equivalent attribute set. Specifically, the color attribute corresponding to the current image layer is kept unchanged in the equivalent attribute set, and the color attributes corresponding to the remaining image layers in the equivalent attribute set are adjusted to transparent. Then, the initial attribute values of the image layer corresponding to the subtitle to be displayed are obtained from the layer attribute set, so that the initial attribute values of different image layers are only different in initial color value, while other attribute values are the same. Subsequently, when generating the initial subtitle layer, it is possible to retain only the color value of the display content of the current image layer type in the initial subtitle layer of each image layer, while the color values of other types of display content are transparent, which can ensure that the initial subtitle layer will only display the display content of the current image layer type.

[0149] Specifically, the hierarchical attribute set of the subtitle to be displayed may include a static attribute set and a dynamic attribute set. The static attribute set includes hierarchical attributes that cannot achieve animation effects, while the dynamic attribute set includes hierarchical attributes used to achieve animation effects. Taking the image layer $L_w$ corresponding to the text of the subtitle image as an example, the hierarchical attribute set corresponding to $L_w$ is constructed as follows: $P_w = (P - \{p_i \mid i \in N_{w-}\}) \cup \{p'_i = \langle 0, 0, t_s, t_e, 0 \rangle \mid i \in N_{w-}\}$

[0150] where $P_w$ is the hierarchical attribute set corresponding to image layer $L_w$; $P$ is the hierarchical attribute set; $N_{w-}$ is the set of identifiers of the color attributes corresponding to the image layers other than $L_w$, with $N_{w-} = \{2c, 3c, 4c\}$, where 2c is the identifier of the background color attribute, 3c is the identifier of the border color attribute, and 4c is the identifier of the shadow color attribute; $p_i$ is the i-th color attribute in $N_{w-}$, whose color value is represented by a number, with 0 being the transparent color value; and $p'_i$ is the adjusted i-th color attribute in $N_{w-}$. In the tuple $\langle 0, 0, t_s, t_e, 0 \rangle$, the first element indicates that the initial value of the i-th color attribute in $N_{w-}$ is the transparent color value; the second element indicates that its final value is the transparent color value; $t_s$ is the start time of the subtitle display; $t_e$ is the end time of the subtitle display; and the fifth element indicates that the color value of the i-th color attribute in $N_{w-}$ remains the transparent color value throughout the subtitle display.
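
The construction of $P_w$ in [0149] and [0150] can be sketched in Python with a dictionary keyed by attribute identifier. The function name `build_layer_attribute_set`, the identifiers `"1c"` (text color) and `"rot"`, and the contents of the non-transparent example tuples are hypothetical; only the identifiers 2c, 3c, 4c and the replacement tuple $\langle 0, 0, t_s, t_e, 0 \rangle$ come from the text.

```python
def build_layer_attribute_set(attributes, other_color_ids, t_start, t_end):
    """Build a per-layer attribute set from the full hierarchical attribute set P.

    attributes:      dict mapping attribute identifier -> attribute tuple (the set P).
    other_color_ids: identifiers of the color attributes of the *other* image
                     layers (the set N_w- when building P_w).
    """
    layer_set = dict(attributes)  # copy so that P itself is not modified
    for attr_id in other_color_ids:
        # Replace p_i with p'_i = <0, 0, t_s, t_e, 0>: transparent from start to end.
        layer_set[attr_id] = (0, 0, t_start, t_end, 0)
    return layer_set


# Hypothetical example values; "1c" and "rot" are identifiers invented for this sketch.
P = {
    "1c": (7, 9, 0.0, 2.0, 7),    # text color
    "2c": (3, 3, 0.0, 2.0, 3),    # background color
    "3c": (4, 4, 0.0, 2.0, 4),    # border color
    "4c": (5, 5, 0.0, 2.0, 5),    # shadow color
    "rot": (0, 90, 0.0, 2.0, 0),  # a geometric attribute
}
P_w = build_layer_attribute_set(P, other_color_ids={"2c", "3c", "4c"},
                                t_start=0.0, t_end=2.0)
print(P_w["1c"])  # (7, 9, 0.0, 2.0, 7)
print(P_w["2c"])  # (0, 0, 0.0, 2.0, 0)
```

Calling the same function with a different `other_color_ids` set would give the sets for the background, shadow, and border layers, so the resulting sets differ only in which color attribute is left non-transparent.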

[0151] It should be noted that, when constructing $P_w$, if the static attribute set is empty and the non-equivalent attribute set is also empty, $P$ is exactly the equivalent attribute set. The dynamic attribute for the text color is then retained from the equivalent attribute set, while the background color, shadow color, and border color attributes are treated as transparent static attributes. In this case $P_w$ does not include static attributes, whose values can be set to the default values in the preset style. When the static attribute set is not empty, however, the hierarchical attribute set $P$ includes static attributes, so $P_w$ also includes static attributes.

[0152] It is worth noting that, using construction formulas similar to that of $P_w$, a hierarchical attribute set $P_d$ can be constructed for the image layer $L_d$ corresponding to the background of the subtitle image, a hierarchical attribute set $P_s$ for the image layer $L_s$ corresponding to the shadow of the subtitle image, and a hierarchical attribute set $P_b$ for the image layer $L_b$ corresponding to the border of the subtitle image.

[0153] In one possible implementation, the hierarchical attribute set also includes the attribute change functions corresponding to the initial attribute values. Obtaining the target attribute values of each image layer may specifically include: obtaining the target time point; and inputting the target time point into the attribute change functions in each hierarchical attribute set for calculation to obtain the target attribute values of each image layer.

[0154] Each hierarchical attribute in the hierarchical attribute set has a corresponding attribute change function, which is the transition curve expression of the change in that hierarchical attribute. The independent variable of the attribute change function is the time point, and the dependent variable is the attribute value. For example, the attribute change function can be a linear function, a quadratic function, or an exponential function, which is not limited in this embodiment. The formula for determining the target attribute values is as follows: $V_{k,t} = \{v_{i,t} = c_i(t) \mid p_{i,k} \in P_k\}$

[0155] where $V_{k,t}$ is the set of target attribute values of all hierarchical attributes corresponding to the k-th image layer; $v_{i,t}$ is the target attribute value of the i-th hierarchical attribute in $P_k$ at the target time point $t$; $c_i(t)$ is the attribute change function of the i-th hierarchical attribute in $P_k$; $p_{i,k}$ is the i-th hierarchical attribute in $P_k$; and $P_k$ is the set of all hierarchical attributes corresponding to the k-th image layer.
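
The formula in [0154] amounts to evaluating every attribute change function $c_i$ at the target time point $t$. A minimal Python sketch follows; the function name `target_attribute_values`, the attribute names, and the two example change functions are assumptions made for illustration.

```python
def target_attribute_values(change_functions, t):
    """Evaluate each attribute change function c_i at time t, giving V_{k,t}.

    change_functions: dict mapping attribute name -> callable c_i(t).
    """
    return {name: change(t) for name, change in change_functions.items()}


# Hypothetical change functions for one image layer.
layer_functions = {
    "rotation": lambda t: 45.0 * t,                       # linear in time
    "color_change_ratio": lambda t: min(1.0, t / 2.0),    # linear, capped at 1
}
values = target_attribute_values(layer_functions, t=1.0)
print(values)  # {'rotation': 45.0, 'color_change_ratio': 0.5}
```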

[0156] Specifically, the attribute change function can be located within the tuple of the corresponding hierarchical attribute. The attribute change function can be determined in various ways. For example, a user can directly input a specific attribute change function. Alternatively, a user can input multiple sample data combinations, each of which includes a sample attribute value and the sample time point corresponding to that value, the sample time points being display time points of the subtitle; interpolation is then performed over the sample data combinations to determine the attribute change function. This disclosure does not limit the method of determining the attribute change function.
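
As one possible reading of the interpolation option in [0156], the following Python sketch builds an attribute change function from (sample time point, sample attribute value) pairs. Piecewise-linear interpolation, holding the end values constant outside the sampled range, and the requirement that sample time points be distinct are all assumptions of this example; the text only says that interpolation is performed.

```python
def interpolate_change_function(samples):
    """Return c(t) that linearly interpolates (time, value) samples.

    samples: iterable of (sample_time, sample_value) pairs with distinct times.
    """
    points = sorted(samples)

    def change(t):
        # Outside the sampled range, hold the nearest end value (an assumption).
        if t <= points[0][0]:
            return points[0][1]
        if t >= points[-1][0]:
            return points[-1][1]
        for (t0, v0), (t1, v1) in zip(points, points[1:]):
            if t0 <= t <= t1:
                return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

    return change


scale_over_time = interpolate_change_function([(0.0, 0.0), (2.0, 1.0), (4.0, 0.0)])
print(scale_over_time(1.0))  # 0.5
print(scale_over_time(3.0))  # 0.5
```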

[0157] Based on this, since each level attribute in the hierarchical attribute set has a corresponding attribute change function, by inputting the target time point into the attribute change function for calculation, the accurate target attribute value of each level attribute can be obtained, thereby improving the reliability of the image transformation process.

[0158] In one possible implementation, referring to Figure 10, which is an optional flowchart for determining target attribute values according to an embodiment of this disclosure, inputting the target time point into the attribute change functions in each hierarchical attribute set for calculation to obtain the target attribute values of each image layer may specifically include: inputting the target time point into the attribute change functions in each hierarchical attribute set for calculation to obtain reference attribute values of each image layer; inputting the subtitle text into a large language model for sentiment recognition to obtain the target sentiment information of the subtitle text; and concatenating the target sentiment information with the reference attribute values and inputting the result into a regression model to obtain the target attribute values of each image layer.

[0159] The target sentiment information characterizes the emotional atmosphere of the subtitle text, which may include, for example, happiness, sadness, disappointment, and surprise. The target sentiment information may be one of multiple sentiment markers, each of which indicates the corresponding sentiment information; for example, the sentiment markers may include 1, 2, and 3.

[0160] A large language model (LLM) is a deep learning model trained on large amounts of text data that can generate natural language text or understand the meaning of text. LLMs are typically built on the Transformer architecture, although recurrent neural networks (RNNs) and their variants, such as long short-term memory (LSTM) networks and gated recurrent units (GRUs), can also be used to capture contextual information in text sequences. Such models support tasks such as natural language text generation, language model evaluation, text classification, and sentiment analysis. In the field of natural language processing, LLMs have been widely applied in areas such as speech recognition, machine translation, automatic summarization, dialogue systems, and intelligent question answering.

[0161] Specifically, a large language model can be trained using multiple training samples related to emotion recognition tasks, enabling it to learn how to predict the emotion of input text and thus improve the prediction accuracy of target emotion information during the inference stage. Similarly, a regression model can be trained using multiple training samples related to attribute value prediction tasks, enabling it to learn the relationship between input and output and thus improve the prediction accuracy of target attribute values during the inference stage.

[0162] Based on this, the target time point is input into the attribute change function for calculation, which yields reference attribute values for each level of attributes. Then, the subtitle text is input into a large language model, which predicts the target sentiment information of the subtitle text. The target sentiment information and the reference attribute values are then concatenated and input into a regression model for regression. The regression model can predict the target attribute values for each image level. This is equivalent to optimizing the attribute values of the level attributes through the target sentiment information of the subtitle text, making the target attribute values more consistent with the emotional atmosphere of the subtitle text, and further enhancing the display effect and expressiveness of the subtitles.

[0163] For example, when the emotional atmosphere is happy, the regression model makes the color value in the target attribute value a color value used to indicate a warm color tone; when the emotional atmosphere is sad, the regression model makes the color value in the target attribute value a color value used to indicate a cool color tone; when the emotional atmosphere is surprised, the regression model adjusts the geometric value in the reference attribute value to obtain the target attribute value, increasing the amount of geometric change, which can better highlight the content of the subtitle to be displayed.

[0164] In one possible implementation, multiple image layers include a text layer, a border layer, a shadow layer, and a background layer. The target subtitle layers of each image layer are superimposed to generate a target subtitle image of the subtitle to be displayed at the target time point. Specifically, the target subtitle layers are superimposed sequentially in the order of background layer, shadow layer, border layer, and text layer to generate a target subtitle image of the subtitle to be displayed at the target time point.

[0165] Specifically, the order of background layer, shadow layer, border layer, and text layer is the stacking order. Multiple target subtitle layers are stacked in the order of background layer, shadow layer, border layer, and text layer, which can achieve the stacking of multiple target subtitle layers from bottom to top according to this stacking order.

[0166] The stacking order is used to indicate the display priority of each target subtitle layer. Since the display priority of the target subtitle layer located above is higher than that of the target subtitle layer located below, the target subtitle layer located above will cover the target subtitle layer located below. If the target subtitle layer located above is not completely transparent, then the target subtitle layer located above can partially or completely cover the target subtitle layer located below.

[0167] Based on this, the display content of the subtitle image corresponding to the subtitle to be displayed can include the text, border, shadow, and background of the subtitle. The display content corresponding to the text level is the text of the subtitle image, the display content corresponding to the border level is the border of the subtitle image, the display content corresponding to the shadow level is the shadow of the subtitle image, and the display content corresponding to the background level is the background of the subtitle image. Since the display priority of the text of the subtitle image is higher than that of the border, the display priority of the border is higher than that of the shadow, and the display priority of the shadow is higher than that of the background, it is necessary to stack them in the order of background level, shadow level, border level, and text level to properly display each type of display content, thereby achieving the expected visual effect of the subtitle and improving the readability of the subtitle to be displayed.
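
The bottom-to-top stacking described in [0164] to [0167] can be sketched as below. This is a minimal sketch that assumes straight-alpha RGBA pixels stored as nested lists and the standard source-over compositing operator; the source does not specify the blending formula, and the names `over` and `stack_layers` are hypothetical.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int, int, int]   # (R, G, B, A), each 0..255
Layer = List[List[Pixel]]           # rows of pixels

# Stacking order from bottom to top, as given in the text.
STACKING_ORDER = ("background", "shadow", "border", "text")


def over(top: Pixel, bottom: Pixel) -> Pixel:
    """Source-over compositing of one straight-alpha pixel onto another."""
    ta, ba = top[3] / 255.0, bottom[3] / 255.0
    out_a = ta + ba * (1.0 - ta)
    if out_a == 0.0:
        return (0, 0, 0, 0)
    r, g, b = (
        round((top[i] * ta + bottom[i] * ba * (1.0 - ta)) / out_a)
        for i in range(3)
    )
    return (r, g, b, round(out_a * 255))


def stack_layers(layers: Dict[str, Layer]) -> Layer:
    """Composite equally sized target subtitle layers in the stacking order."""
    first = layers[STACKING_ORDER[0]]
    height, width = len(first), len(first[0])
    result: Layer = [[(0, 0, 0, 0)] * width for _ in range(height)]
    for name in STACKING_ORDER:
        layer = layers[name]
        result = [
            [over(layer[y][x], result[y][x]) for x in range(width)]
            for y in range(height)
        ]
    return result
```

An opaque text pixel fully covers whatever lies beneath it, while a fully transparent one leaves the lower layers visible, which matches the display priority described above.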

[0168] As can be seen, the subtitle image generation method provided in this disclosure can be applied to a variety of scenarios.

[0169] For example, in a video processing scenario, the video to be processed and its subtitle information are obtained, and the subtitle information is processed using the subtitle image generation method provided in the embodiments of the present disclosure. Target subtitle images can thus be generated efficiently for each video frame of the video to be processed, and each target subtitle image is then added to the corresponding video frame to obtain the target video, which effectively improves the efficiency of adding subtitles to the video.

[0170] For example, in an audio processing scenario, the subtitle information of the audio to be processed is obtained, and the subtitle information is processed using the subtitle image generation method provided in the embodiments of the present disclosure. Target subtitle images can thus be generated efficiently for each playback time point of the audio to be processed. Each target subtitle image can then be used as the video frame at the corresponding playback time point, and the target subtitle images can be combined in the order of the playback time points to generate a subtitle video. This effectively improves the efficiency of generating subtitle videos, and subtitle videos can provide text information to people who have hearing impairments or who cannot listen to the audio.

[0171] The following details the complete process of generating subtitle images.

[0172] Referring to Figure 11, Figure 11 is a schematic diagram of an optional architecture of the subtitle image generation method provided in the embodiments of this disclosure.

[0173] First, obtain the subtitle information of the subtitle to be displayed, and determine whether the subtitle information contains dynamic attributes in its hierarchical attributes. If the subtitle information does not contain dynamic attributes in its hierarchical attributes, the text can be laid out and drawn according to the subtitle information to obtain a static image. Then, the static image is used as the subtitle image of the subtitle to be displayed and the subtitle image is output.

[0174] Then, when the hierarchical attributes of the subtitle information contain dynamic attributes, the set of dynamic attributes of the subtitle to be displayed can be obtained from the subtitle information. The set of dynamic attributes is divided into an equivalent attribute set and a non-equivalent attribute set. The equivalent attribute set includes hierarchical attributes used to achieve animation effects through image transformation, and the non-equivalent attribute set includes hierarchical attributes that cannot achieve animation effects through image transformation.
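
The split described in [0174] amounts to a set partition. The sketch below assumes a hypothetical list of attribute names that can be animated by image transformation; the source does not enumerate which attributes fall into which set, so `TRANSFORMABLE` and the example name `font_size` are illustrative only.

```python
from typing import Iterable, Set, Tuple

# Hypothetical names of attributes whose animation can be reproduced by
# transforming an already drawn layer (an assumption, not from the source).
TRANSFORMABLE = {"color", "position", "scale", "rotation", "mask_ratio"}


def split_dynamic_attributes(dynamic: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Return (equivalent attribute set, non-equivalent attribute set)."""
    dynamic_set = set(dynamic)
    equivalent = dynamic_set & TRANSFORMABLE
    non_equivalent = dynamic_set - TRANSFORMABLE
    return equivalent, non_equivalent


def can_use_image_transformation(dynamic: Iterable[str]) -> bool:
    """The transformation path applies when the non-equivalent set is empty."""
    _, non_equivalent = split_dynamic_attributes(dynamic)
    return not non_equivalent
```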

[0175] Then, when the non-equivalent attribute set is empty, for each image level, keep the color attribute corresponding to the current image level unchanged in the equivalent attribute set, and adjust the color attribute corresponding to the other image levels in the equivalent attribute set to transparent, so as to obtain the level attribute set corresponding to the current image level.
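
The per-layer attribute sets of [0175] can be sketched as follows, assuming the equivalent attribute set is a dictionary whose color attributes are keyed as `<layer>_color` (a hypothetical naming scheme) and that transparency is represented by an RGBA value with zero alpha.

```python
from typing import Any, Dict

LAYERS = ("text", "border", "shadow", "background")
TRANSPARENT = (0, 0, 0, 0)   # RGBA with zero alpha


def layer_attribute_sets(equivalent_set: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Build one level attribute set per image layer.

    For each layer, its own color attribute is kept unchanged and the color
    attributes of all other layers are set to transparent, so that drawing
    with that set renders only the current layer's content.
    """
    result: Dict[str, Dict[str, Any]] = {}
    for current in LAYERS:
        attrs = dict(equivalent_set)          # copy, leave the input untouched
        for other in LAYERS:
            if other != current:
                attrs[f"{other}_color"] = TRANSPARENT
        result[current] = attrs
    return result
```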

[0176] Then, the initial attribute values of each image layer in the subtitle image corresponding to the subtitle to be displayed are obtained from each layer attribute set. The image layer is divided based on the different types of display content in the subtitle image corresponding to the subtitle to be displayed. The initial attribute value is the attribute value of the layer attribute of the image layer at the initial time point. Multiple image layers include text layer, border layer, shadow layer and background layer, which realizes the image layer splitting of the subtitle to be displayed.

[0177] Then, the subtitle text to be displayed is obtained. For each image layer, text layout and drawing are performed based on the subtitle text and initial attribute values to obtain the initial subtitle layer of the image layer at the initial time point, thus realizing the initial frame layout and drawing.

[0178] Then, the target time point is obtained; the target time point is input into the attribute change function in the attribute set of each level for calculation to obtain the target attribute value of each image level, thus realizing dynamic attribute interpolation.
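
The source does not fix the form of the attribute change function. As one possible illustration, the sketch below assumes a linear function between two keyframes that is clamped outside the keyframe interval; `make_linear_change_function` is a hypothetical helper.

```python
from typing import Callable


def make_linear_change_function(
    t0: float, v0: float, t1: float, v1: float
) -> Callable[[float], float]:
    """Return f(t) that moves linearly from v0 at t0 to v1 at t1 (t0 < t1)."""

    def value_at(t: float) -> float:
        if t <= t0:
            return v0
        if t >= t1:
            return v1
        fraction = (t - t0) / (t1 - t0)
        return v0 + (v1 - v0) * fraction

    return value_at


# Example: a scale attribute growing from 1.0 to 3.0 over two seconds.
scale_at = make_linear_change_function(0.0, 1.0, 2.0, 3.0)
target_scale = scale_at(1.0)   # target attribute value at the target time point
```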

[0179] Then, the subtitle image transformation is performed, which can be divided into the following transformation cases:

[0180] When the layer attribute includes a color attribute, the target attribute value includes the target color value of the color attribute, and the attribute change amount includes the color change amount of the color attribute: if the color change amount indicates that the image layer has undergone a color change, the color values of the non-transparent pixels in the initial subtitle layer of the image layer are transformed into the target color value to obtain the target subtitle layer of the image layer at the target time point, where the target time point is a time point after the initial time point.
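
The color case can be sketched as below, assuming RGBA pixels in nested lists, a color change amount computed as the per-channel difference between target and initial color, and that each pixel's alpha is preserved so anti-aliased edges stay smooth. The alpha handling is an assumption; the source only states that non-transparent pixels take the target color value.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int, int]   # (R, G, B, A)
Layer = List[List[Pixel]]


def recolor_layer(
    layer: Layer,
    initial_rgb: Tuple[int, int, int],
    target_rgb: Tuple[int, int, int],
) -> Layer:
    """Set every non-transparent pixel to the target color, keeping its alpha."""
    color_change = tuple(t - i for t, i in zip(target_rgb, initial_rgb))
    if not any(color_change):
        return layer   # no color change: the initial layer is reused as is
    r, g, b = target_rgb
    return [
        [(r, g, b, px[3]) if px[3] > 0 else px for px in row]
        for row in layer
    ]
```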

[0181] Alternatively, when the layer attribute includes a geometric attribute and the attribute change amount includes the geometric change amount of the geometric attribute, affine transformation parameters are determined based on the geometric change amount, and an affine transformation matrix is constructed from the affine transformation parameters. An affine transformation is then applied to the initial subtitle layer of the image layer based on the affine transformation matrix to obtain the target subtitle layer of the image layer at the target time point.
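
As an illustration of constructing the affine transformation matrix, the sketch below assumes the geometric change amount consists of a translation, a uniform scale factor and a rotation angle, composed as scale and rotation about the origin followed by translation. The parameter set and the composition order are assumptions; the source does not specify them. Only point mapping is shown; resampling a whole layer would normally be delegated to an imaging library.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def affine_matrix(dx: float, dy: float, scale: float, angle_deg: float) -> Matrix:
    """Build a 3x3 homogeneous matrix: scale and rotate, then translate."""
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    return [
        [scale * c, -scale * s, dx],
        [scale * s, scale * c, dy],
        [0.0, 0.0, 1.0],
    ]


def transform_point(m: Matrix, x: float, y: float) -> Tuple[float, float]:
    """Map a point of the initial layer to its position in the target layer."""
    return (
        m[0][0] * x + m[0][1] * y + m[0][2],
        m[1][0] * x + m[1][1] * y + m[1][2],
    )
```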

[0182] Alternatively, when the layer attribute includes a masking attribute, the masking attribute includes a color-shifting ratio attribute, and the target masking value includes the target color-shifting ratio value of the color-shifting ratio attribute, the layer width of the target subtitle layer is obtained, and the width threshold is determined based on the product of the target color-shifting ratio value and the layer width; in the initial subtitle layer of the image layer, the area whose horizontal coordinate is less than the width threshold is determined as the masking area; and the color values of the non-transparent pixels located within the masking area of the initial subtitle layer are transformed into the preset masking color value to obtain the target subtitle layer of the image layer at the target time point.
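
The color-shifting ratio case resembles a karaoke-style wipe. The sketch below assumes the target layer has the same width as the initial layer, that pixels are RGBA tuples in nested lists, and that the masked pixels keep their alpha; `apply_ratio_mask` is a hypothetical name.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int, int]   # (R, G, B, A)
Layer = List[List[Pixel]]


def apply_ratio_mask(
    layer: Layer, ratio: float, mask_rgb: Tuple[int, int, int]
) -> Layer:
    """Recolor non-transparent pixels whose x coordinate is below ratio * width."""
    width = len(layer[0])
    width_threshold = ratio * width          # product of ratio and layer width
    r, g, b = mask_rgb
    return [
        [
            (r, g, b, px[3]) if (x < width_threshold and px[3] > 0) else px
            for x, px in enumerate(row)
        ]
        for row in layer
    ]
```

Raising the ratio from 0 to 1 over the duration of a line moves the boundary of the masking area from the left edge to the right edge of the layer.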

[0183] Alternatively, when the masking attribute includes multiple masking coordinate attributes and the target masking value includes the target masking coordinate of each masking coordinate attribute, the visible area is determined in the initial subtitle layer of the image layer according to the target masking coordinates, wherein the target masking coordinates are located on the boundary of the visible area; in the initial subtitle layer of the image layer, the area outside the visible area is determined as the masking area; and the color values of the non-transparent pixels located within the masking area of the initial subtitle layer are transformed into the preset masking color value to obtain the target subtitle layer of the image layer at the target time point.
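
The masking coordinate case can be sketched as a clipping operation. This sketch assumes four masking coordinates that bound an axis-aligned rectangular visible area with inclusive edges, and a preset masking color that is fully transparent by default; the source does not restrict the visible area to a rectangle.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int, int]   # (R, G, B, A)
Layer = List[List[Pixel]]


def apply_visible_area(
    layer: Layer,
    left: int,
    top: int,
    right: int,
    bottom: int,
    mask_color: Pixel = (0, 0, 0, 0),
) -> Layer:
    """Replace non-transparent pixels outside the visible rectangle."""
    result: Layer = []
    for y, row in enumerate(layer):
        new_row = []
        for x, px in enumerate(row):
            inside = left <= x <= right and top <= y <= bottom
            # Only pixels in the masking area (outside the visible area) change.
            new_row.append(px if inside or px[3] == 0 else mask_color)
        result.append(new_row)
    return result
```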

[0184] Then, multiple target subtitle layers are superimposed sequentially in the order of background layer, shadow layer, border layer, and text layer to generate the target subtitle image of the subtitle to be displayed at the target time point, thus realizing the subtitle layer fusion; finally, the target subtitle image is output.

[0185] Based on this, the initial attribute values of the layer attributes of each image layer at the initial time point are obtained, together with the subtitle text of the subtitle to be displayed. Since the image layers are divided based on the different types of display content in the subtitle image corresponding to the subtitle to be displayed, text layout and drawing can be performed based on the subtitle text and the initial attribute values to obtain the initial subtitle layer of each image layer at the initial time point, thereby dividing the subtitle into layers by display content. Then, based on the target attribute values of the layer attributes at the target time point, image transformation is performed on each initial subtitle layer to obtain the target subtitle layer at the target time point, and the multiple target subtitle layers are superimposed to generate the target subtitle image of the subtitle to be displayed at the target time point. Because the image transformation of each initial subtitle layer is controlled by its own target attribute values, each image layer, and therefore each type of display content in the subtitle to be displayed, can be controlled at a fine granularity, which enriches the animation effects of the subtitle. Furthermore, across the multiple frame images of the subtitle to be displayed, the initial subtitle layer can be regarded as a layer of the initial frame image displayed at the initial time point, and the target subtitle layer as a layer of the target frame image displayed at the target time point. During subtitle animation rendering, text layout and drawing need to be performed only when generating the initial subtitle layer; the target subtitle layer is instead obtained by image transformation of the initial subtitle layer. This avoids frequent, computationally expensive text layout and drawing operations, thereby improving the rendering efficiency of the subtitle animation and enhancing its real-time performance and smoothness.
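
The efficiency argument above comes down to drawing once and transforming per frame. The sketch below illustrates this with a hypothetical `CachedSubtitleRenderer` whose layout and transformation steps are passed in as callables; strings stand in for layer images so that the number of layout calls can be observed.

```python
from typing import Callable, Generic, Optional, TypeVar

L = TypeVar("L")   # type of a drawn subtitle layer


class CachedSubtitleRenderer(Generic[L]):
    """Run text layout and drawing once, then derive each frame by transformation."""

    def __init__(
        self,
        layout_and_draw: Callable[[], L],
        transform: Callable[[L, float], L],
    ) -> None:
        self._layout_and_draw = layout_and_draw
        self._transform = transform
        self._initial_layer: Optional[L] = None
        self.layout_calls = 0

    def frame_at(self, t: float) -> L:
        if self._initial_layer is None:
            # The expensive step happens only for the initial subtitle layer.
            self._initial_layer = self._layout_and_draw()
            self.layout_calls += 1
        return self._transform(self._initial_layer, t)


renderer = CachedSubtitleRenderer(
    layout_and_draw=lambda: "layer",
    transform=lambda layer, t: f"{layer}@{t}",
)
frames = [renderer.frame_at(t) for t in (0.0, 0.5, 1.0)]
```

After rendering the three frames, `renderer.layout_calls` is 1: layout and drawing ran once, and the later frames were produced by transformation alone.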

[0186] It is understood that although the steps in the above flowcharts are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated in this embodiment, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the above flowcharts may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages in other steps.

[0187] Referring to Figure 12, which is a schematic diagram of an optional structure of a subtitle image generation apparatus provided in an embodiment of the present disclosure, the subtitle image generation apparatus 1200 includes:

[0188] The acquisition module 1201 is used to acquire the initial attribute values of each image layer in the subtitle image corresponding to the subtitle to be displayed. The image layer is divided based on different types of display content in the subtitle image, and the initial attribute value is the attribute value of the layer attribute of the image layer at the initial time point.

[0189] The layout and drawing module 1202 is used to obtain the subtitle text to be displayed. For each image layer, it performs text layout and drawing based on the subtitle text and the initial attribute values of each image layer to obtain the initial subtitle layer of each image layer at the initial time point.

[0190] The image transformation module 1203 is used to obtain the target attribute values of each image layer. For each image layer, based on the target attribute values of the image layer, the initial subtitle layer of the image layer is transformed to obtain the target subtitle layer of the image layer at the target time point. The target attribute values are the attribute values of the layer attributes of the image layer at the target time point, and the target time point is the time point after the initial time point.

[0191] The image generation module 1204 is used to overlay the target subtitle layers of each image level to generate the target subtitle image of the subtitle to be displayed at the target time point.

[0192] Furthermore, the image transformation module 1203 described above is specifically used for:

[0193] Determine the amount of attribute change between the target attribute value and the initial attribute value of the image layer;

[0194] Based on the attribute change amount, an image transformation is performed on the initial subtitle layer of the image layer to obtain the target subtitle layer of the image layer at the target time point.

[0195] Furthermore, the hierarchical attributes include color attributes, the target attribute value includes the target color value of the color attribute, and the attribute change amount includes the color change amount of the color attribute. The image transformation module 1203 is specifically used for:

[0196] When the color change amount indicates that the image layer has undergone a color change, the color values of the non-transparent pixels in the initial subtitle layer are transformed into the target color value to obtain the target subtitle layer at the target time point.

[0197] Furthermore, the hierarchical attributes include geometric attributes, and the attribute changes include the geometric changes of the geometric attributes. Specifically, the image transformation module 1203 is used for:

[0198] The affine transformation parameters are determined based on the geometric changes, and the affine transformation matrix is constructed based on the affine transformation parameters.

[0199] Based on the affine transformation matrix, an affine transformation is performed on the initial subtitle layer to obtain the target subtitle layer at the target time point.

[0200] Furthermore, the hierarchical attributes include masking attributes, and the target attribute value includes the target masking value of the masking attribute. The image transformation module 1203 is specifically used for:

[0201] Based on the target masking value, determine the masking area in the initial subtitle layer;

[0202] In the initial subtitle layer, the color values of the non-transparent pixels located within the masking area are transformed to preset masking color values to obtain the target subtitle layer at the target time point.

[0203] Furthermore, the masking attribute includes a color-shifting ratio attribute, and the target masking value includes the target color-shifting ratio value of the color-shifting ratio attribute. The image transformation module 1203 is specifically used for:

[0204] Obtain the layer width of the target subtitle layer, and determine the width threshold based on the product of the target color-shifting ratio value and the layer width;

[0205] In the initial subtitle layer, determine the area whose horizontal coordinate is less than the width threshold as the masking area.

[0206] Furthermore, the masking attributes include multiple masking coordinate attributes, and the target masking value includes the target masking coordinates of each masking coordinate attribute. The image transformation module 1203 is specifically used for:

[0207] Based on each target masking coordinate, determine the visible area in the initial subtitle layer, where the target masking coordinates are located on the boundary of the visible area;

[0208] In the initial subtitle layer, determine the area outside the visible area as the masking area.

[0209] Furthermore, the aforementioned acquisition module 1201 is specifically used for:

[0210] Obtain the dynamic attribute set of the subtitle image corresponding to the subtitle to be displayed, and divide the dynamic attribute set into an equivalent attribute set and a non-equivalent attribute set. The equivalent attribute set includes hierarchical attributes used to achieve animation effects through image transformation, and the non-equivalent attribute set includes hierarchical attributes that cannot achieve animation effects through image transformation.

[0211] When the non-equivalent attribute set is empty, the initial attribute values of each image layer in the subtitle image corresponding to the subtitle to be displayed are obtained from the equivalent attribute set.

[0212] Furthermore, the hierarchical attributes in the equivalent attribute set include the color attributes of each image level, and the aforementioned acquisition module 1201 is specifically used for:

[0213] When the non-equivalent attribute set is empty, for each image level, the color attribute corresponding to the current image level is kept unchanged in the equivalent attribute set, and the color attribute corresponding to the other image levels in the equivalent attribute set is adjusted to transparent to obtain the level attribute set corresponding to the current image level.

[0214] Obtain the initial attribute values of the image layer in the subtitle image corresponding to the subtitle to be displayed from each layer attribute set.

[0215] Furthermore, the level attribute set also includes attribute change functions corresponding to the initial attribute values. Specifically, the image transformation module 1203 is used for:

[0216] Obtain the target time point;

[0217] The target time point is input into the attribute change function in the level attribute set of each image level for calculation, and the target attribute values of each image level are obtained.

[0218] Furthermore, the image transformation module 1203 described above is specifically used for:

[0219] The target time point is input into the attribute change function in the level attribute set of each image level for calculation, and the reference attribute values of each image level are obtained.

[0220] The subtitle text is input into a large language model for emotion recognition to obtain the target emotion information of the subtitle text.

[0221] The target emotion information and the reference attribute values are concatenated and input into the regression model for regression to obtain the target attribute values of each image level.

[0222] Furthermore, the multiple image layers include a text layer, a border layer, a shadow layer, and a background layer. The image generation module 1204 is specifically used for:

[0223] The target subtitle layers are stacked sequentially in the order of background layer, shadow layer, border layer, and text layer to generate the target subtitle image at the target time point.

[0224] The electronic device provided in this disclosure for executing the above-described subtitle image generation method can be a terminal. Referring to Figure 13, Figure 13 is a partial structural block diagram of the terminal provided in this disclosure. The terminal includes: a camera assembly 1310, a first memory 1320, an input unit 1330, a display unit 1340, a sensor 1350, an audio circuit 1360, a wireless fidelity (Wi-Fi) module 1370, a first processor 1380, and a first power supply 1390, etc. Those skilled in the art will understand that the terminal structure shown in Figure 13 does not constitute a limitation on the terminal, and the terminal may include more or fewer components than shown, or combine certain components, or have different component arrangements.

[0225] The camera assembly 1310 can be used to capture images or videos. Optionally, the camera assembly 1310 includes a front-facing camera and a rear-facing camera. Typically, the front-facing camera is located on the front panel of the terminal, and the rear-facing camera is located on the back of the terminal. In some embodiments, there are at least two rear-facing cameras, each of which is any one of a main camera, a depth-sensing camera, a wide-angle camera, and a telephoto camera, so as to achieve background blurring by fusing the main camera and the depth-sensing camera, panoramic shooting and VR (Virtual Reality) shooting by fusing the main camera and the wide-angle camera, or other fusion shooting functions.

[0226] The first memory 1320 can be used to store software programs and modules. The first processor 1380 executes various functional applications and data processing of the terminal by running the software programs and modules stored in the first memory 1320.

[0227] The input unit 1330 can be used to receive input numeric or character information, and to generate key signal inputs related to the terminal's settings and function control. Specifically, the input unit 1330 may include a touch panel 1331 and other input devices 1332.

[0228] The display unit 1340 can be used to display input or provided information, as well as various menus of the terminal. The display unit 1340 may include a display panel 1341.

[0229] The audio circuit 1360, a speaker 1361, and a microphone 1362 can provide an audio interface.

[0230] The first power supply 1390 can be AC power, DC power, a disposable battery, or a rechargeable battery.

[0231] The number of sensors 1350 can be one or more, and these sensors 1350 include, but are not limited to: accelerometers, gyroscopes, pressure sensors, optical sensors, etc.

[0232] An accelerometer can detect the magnitude of acceleration along the three coordinate axes of a coordinate system established by the terminal. For example, an accelerometer can be used to detect the components of gravitational acceleration along the three coordinate axes. The first processor 1380 can control the display unit 1340 to display the user interface in either a horizontal or vertical view based on the gravitational acceleration signal acquired by the accelerometer. The accelerometer can also be used for collecting motion data from games or other applications.

[0233] The gyroscope sensor can detect the terminal's orientation and rotation angle. It can work in conjunction with an accelerometer to collect 3D user movements on the terminal. Based on the data collected by the gyroscope sensor, the first processor 1380 can perform the following functions: motion sensing (e.g., changing the UI based on the user's tilt), image stabilization during shooting, game control, and inertial navigation.

[0234] The pressure sensor can be installed on the side bezel of the terminal and/or on the lower layer of the display unit 1340. When the pressure sensor is installed on the side bezel of the terminal, it can detect the user's grip signal on the terminal, and the first processor 1380 can perform left/right hand recognition or quick operation based on the grip signal collected by the pressure sensor. When the pressure sensor is installed on the lower layer of the display unit 1340, the first processor 1380 can control the operable controls on the UI based on the user's pressure operation on the display unit 1340. The operable controls include at least one of button controls, scroll bar controls, icon controls, and menu controls.

[0235] An optical sensor is used to collect ambient light intensity. In one embodiment, the first processor 1380 can control the display brightness of the display unit 1340 based on the ambient light intensity collected by the optical sensor. Specifically, when the ambient light intensity is high, the display brightness of the display unit 1340 is increased; when the ambient light intensity is low, the display brightness of the display unit 1340 is decreased. In another embodiment, the first processor 1380 can also dynamically adjust the shooting parameters of the camera assembly 1310 based on the ambient light intensity collected by the optical sensor.

[0236] In this embodiment, the first processor 1380 included in the terminal can execute the subtitle image generation method of the previous embodiment.

[0237] The electronic device for executing the above-described subtitle image generation method provided in this disclosure can also be a server. Referring to Figure 14, which is a partial structural block diagram of a server provided in this disclosure, the server can vary considerably due to different configurations or performance. It may include one or more second processors 1410 and second memories 1430, and one or more storage media 1440 (e.g., one or more mass storage devices) for storing application programs 1443 or data 1442. The second memories 1430 and storage media 1440 can be temporary or persistent storage. The program stored in the storage media 1440 may include one or more modules (not shown in the figure), each module may include a series of instruction operations on the server. Furthermore, the second processor 1410 may be configured to communicate with the storage media 1440 and execute a series of instruction operations in the storage media 1440 on the server.

[0238] The server may also include one or more second power supplies 1420, one or more wired or wireless network interfaces 1450, one or more input/output interfaces 1460, and/or one or more operating systems 1441, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.

[0239] The second processor 1410 in the server can be used to execute the subtitle image generation method.

[0240] This disclosure also provides a computer-readable storage medium for storing a computer program for executing the subtitle image generation methods of the foregoing embodiments.

[0241] This disclosure also provides a computer program product comprising a computer program stored in a computer-readable storage medium. A processor of a computer device reads the computer program from the computer-readable storage medium and executes the computer program, causing the computer device to perform the above-described subtitle image generation method.

[0242] The terms “first,” “second,” “third,” “fourth,” etc. (if present) in this disclosure and the foregoing drawings are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this disclosure described herein can be implemented, for example, in orders other than those illustrated or described herein. Furthermore, the terms “comprising” and “having,” and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatuses.

[0243] It should be understood that in this disclosure, "at least one item" means one or more, and "more than one" means two or more. "And/or" is used to describe the relationship between related objects, indicating that three relationships can exist. For example, "A and/or B" can represent three cases: only A exists, only B exists, and both A and B exist simultaneously, where A and B can be singular or plural. The character "/" generally indicates that the preceding and following related objects are in an "or" relationship. "At least one of the following" or similar expressions refer to any combination of these items, including any combination of single or plural items. For example, at least one of a, b, or c can represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c can be single or multiple.

[0244] It should be understood that in the description of the embodiments disclosed herein, "multiple" means two or more, "greater than", "less than", "exceeding" etc. are understood to exclude the number itself, and "above", "below", "within" etc. are understood to include the number itself.

[0245] In the several embodiments provided in this disclosure, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces, indirect coupling or communication connection between apparatuses or units, and may be electrical, mechanical, or other forms.

[0246] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0247] Furthermore, the functional units in the various embodiments of this disclosure can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.

[0248] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this disclosure, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods of the various embodiments of this disclosure. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0249] It should also be understood that the various implementation methods provided in this disclosure can be combined arbitrarily to achieve different technical effects.

[0250] The above provides a detailed description of the preferred embodiments of this disclosure. However, this disclosure is not limited to the above-described embodiments. Those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of this disclosure. All such equivalent modifications or substitutions are included within the scope defined by the claims of this disclosure.

Claims

1. A method for generating subtitle images, executed by an electronic device, comprising: obtaining an initial attribute value of each image layer in a subtitle image corresponding to a subtitle to be displayed, wherein the image layers are divided based on different types of display content in the subtitle image, and the initial attribute value is the attribute value of a layer attribute of the image layer at an initial time point; obtaining the subtitle text of the subtitle to be displayed, and, for each image layer, performing text layout and drawing based on the subtitle text and the initial attribute value of the image layer to obtain an initial subtitle layer of the image layer at the initial time point; obtaining a target attribute value of each image layer, and, for each image layer, performing image transformation on the initial subtitle layer of the image layer based on the target attribute value of the image layer to obtain a target subtitle layer of the image layer at a target time point, wherein the target attribute value is the attribute value of the layer attribute of the image layer at the target time point, and the target time point is a time point after the initial time point; and superimposing the target subtitle layers of the image layers to generate a target subtitle image of the subtitle to be displayed at the target time point.
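
As a non-limiting sketch of the data flow recited in claim 1, the following Python outline draws each image layer once at the initial time point and derives the layers at the target time point by image transformation only. The function name and the callables `draw`, `transform`, and `overlay` are hypothetical placeholders introduced for illustration; they are not defined by this disclosure.

```python
from typing import Any, Callable, Dict


def generate_subtitle_image(
    subtitle_text: str,
    initial_values: Dict[str, Any],  # per-layer attribute values at the initial time point
    target_values: Dict[str, Any],   # per-layer attribute values at the target time point
    draw: Callable[[str, Any], Any],            # hypothetical: text layout and drawing
    transform: Callable[[Any, Any, Any], Any],  # hypothetical: image transformation
    overlay: Callable[[Dict[str, Any]], Any],   # hypothetical: layer superposition
) -> Any:
    # Text layout and drawing run once per image layer.
    initial_layers = {
        name: draw(subtitle_text, value) for name, value in initial_values.items()
    }
    # Each target layer is derived from the already drawn initial layer.
    target_layers = {
        name: transform(layer, initial_values[name], target_values[name])
        for name, layer in initial_layers.items()
    }
    # The transformed layers are superimposed into one subtitle image.
    return overlay(target_layers)
```

In this sketch, producing a later frame only requires calling `transform` and `overlay` again with new target values; `draw` does not need to be repeated.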

2. The subtitle image generation method according to claim 1, wherein performing image transformation on the initial subtitle layer of the image layer based on the target attribute value of the image layer to obtain the target subtitle layer of the image layer at the target time point comprises: determining an attribute change amount between the target attribute value of the image layer and the initial attribute value of the image layer; and performing, based on the attribute change amount, image transformation on the initial subtitle layer of the image layer to obtain the target subtitle layer of the image layer at the target time point.

3. The subtitle image generation method according to claim 2, wherein the layer attribute includes a color attribute, the target attribute value includes a target color value of the color attribute, and the attribute change amount includes a color change amount of the color attribute; and performing image transformation on the initial subtitle layer of the image layer based on the attribute change amount to obtain the target subtitle layer of the image layer at the target time point comprises: when the color change amount indicates that the color of the image layer has changed, transforming the color values of the non-transparent pixels in the initial subtitle layer into the target color value to obtain the target subtitle layer of the image layer at the target time point.
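
The color transformation of claim 3 can be illustrated with a minimal Python sketch. It assumes, for illustration only, that a subtitle layer is a list of rows of `(R, G, B, A)` tuples with 8-bit channels and that a pixel is non-transparent when its alpha is greater than zero; the function name `recolor_layer` is hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int, int]  # (R, G, B, A), each 0-255 (assumed layout)
Layer = List[List[Pixel]]          # rows of pixels


def recolor_layer(
    layer: Layer,
    initial_rgb: Tuple[int, int, int],
    target_rgb: Tuple[int, int, int],
) -> Layer:
    """Return a copy of `layer` whose non-transparent pixels take `target_rgb`."""
    # A zero color change amount means no transformation is needed.
    if tuple(initial_rgb) == tuple(target_rgb):
        return [list(row) for row in layer]
    r, g, b = target_rgb
    out: Layer = []
    for row in layer:
        new_row = []
        for pr, pg, pb, a in row:
            # Only non-transparent pixels are recolored; alpha is preserved.
            new_row.append((r, g, b, a) if a > 0 else (pr, pg, pb, a))
        out.append(new_row)
    return out
```

Because alpha is left untouched, anti-aliased glyph edges drawn at the initial time point keep their shape after recoloring.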

4. The subtitle image generation method according to claim 2 or 3, wherein the layer attribute includes a geometric attribute, and the attribute change amount includes a geometric change amount of the geometric attribute; and performing image transformation on the initial subtitle layer of the image layer based on the attribute change amount to obtain the target subtitle layer of the image layer at the target time point comprises: determining affine transformation parameters based on the geometric change amount, and constructing an affine transformation matrix based on the affine transformation parameters; and performing, based on the affine transformation matrix, an affine transformation on the initial subtitle layer to obtain the target subtitle layer of the image layer at the target time point.
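
One possible way to construct the affine transformation matrix of claim 4 is sketched below in Python. The sketch assumes that the geometric attribute consists of a position, a uniform scale, and a rotation angle about the layer origin, and that the matrix is composed as translation after rotation after scaling; these choices, the dictionary keys, and the function names are illustrative assumptions rather than features of this disclosure.

```python
import math
from typing import Dict, List, Tuple


def affine_params(initial: Dict[str, float], target: Dict[str, float]) -> Dict[str, float]:
    """Derive affine parameters from the geometric change amount (assumed keys)."""
    return {
        "dx": target["x"] - initial["x"],
        "dy": target["y"] - initial["y"],
        "scale": target["scale"] / initial["scale"],
        "angle_deg": target["angle"] - initial["angle"],
    }


def affine_matrix(dx: float, dy: float, scale: float, angle_deg: float) -> List[List[float]]:
    """Build the 3x3 homogeneous matrix M = T * R * S."""
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    return [
        [c * scale, -s * scale, dx],
        [s * scale, c * scale, dy],
        [0.0, 0.0, 1.0],
    ]


def apply_affine(m: List[List[float]], x: float, y: float) -> Tuple[float, float]:
    """Map one point of the initial subtitle layer to its target position."""
    return (
        m[0][0] * x + m[0][1] * y + m[0][2],
        m[1][0] * x + m[1][1] * y + m[1][2],
    )
```

In practice the matrix would be passed to an image-warping routine that resamples the whole layer; `apply_affine` only shows the per-point mapping.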

5. The subtitle image generation method according to any one of claims 1 to 4, wherein the layer attribute includes a masking attribute, and the target attribute value includes a target masking value of the masking attribute; and performing image transformation on the initial subtitle layer of the image layer based on the target attribute value of the image layer to obtain the target subtitle layer of the image layer at the target time point comprises: determining a masking area in the initial subtitle layer based on the target masking value; and transforming, in the initial subtitle layer, the color values of the non-transparent pixels located within the masking area into a preset masking color value to obtain the target subtitle layer of the image layer at the target time point.

6. The subtitle image generation method according to claim 5, wherein the masking attribute includes a color change ratio attribute, and the target masking value includes a target color change ratio value of the color change ratio attribute; and determining the masking area in the initial subtitle layer based on the target masking value comprises: obtaining a layer width of the target subtitle layer, and determining a width threshold based on the product of the target color change ratio value and the layer width; and determining, in the initial subtitle layer, the area whose horizontal coordinate is less than the width threshold as the masking area.
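
The width-threshold masking of claims 5 and 6 resembles a progressive, karaoke-style recoloring and can be sketched as follows. The sketch assumes that a layer is a list of rows of `(R, G, B, A)` tuples, that the initial and target subtitle layers have the same width, and that the horizontal coordinate is the pixel column index; the function name `mask_by_ratio` is hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int, int]  # (R, G, B, A), assumed layout
Layer = List[List[Pixel]]


def mask_by_ratio(layer: Layer, ratio: float, mask_rgb: Tuple[int, int, int]) -> Layer:
    """Recolor non-transparent pixels whose column index is below ratio * width."""
    width = len(layer[0]) if layer else 0
    threshold = ratio * width  # width threshold = color change ratio x layer width
    r, g, b = mask_rgb
    out: Layer = []
    for row in layer:
        new_row = []
        for x, (pr, pg, pb, a) in enumerate(row):
            in_mask = x < threshold
            # Transparent pixels are left as they are, inside or outside the mask.
            new_row.append((r, g, b, a) if in_mask and a > 0 else (pr, pg, pb, a))
        out.append(new_row)
    return out
```

Sweeping `ratio` from 0 to 1 over successive target time points moves the color boundary from the left edge of the layer to the right edge.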

7. The subtitle image generation method according to claim 5, wherein the masking attribute includes a plurality of masking coordinate attributes, and the target masking value includes target masking coordinates of each of the masking coordinate attributes; and determining the masking area in the initial subtitle layer based on the target masking value comprises: determining a visible area in the initial subtitle layer based on the respective target masking coordinates, wherein the target masking coordinates are located on the boundary of the visible area; and determining, in the initial subtitle layer, the area outside the visible area as the masking area.
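
A minimal sketch of the coordinate-based masking of claim 7 is given below. It assumes, purely for illustration, that the visible area is the axis-aligned rectangle spanned by the target masking coordinates (inclusive of its boundary) and that the preset masking color value is fully transparent; the claim itself does not restrict the shape of the visible area or the masking color, and the function name is hypothetical.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int, int]  # (R, G, B, A), assumed layout
Layer = List[List[Pixel]]


def mask_outside_visible_area(
    layer: Layer,
    mask_coords: Sequence[Tuple[int, int]],
    mask_rgba: Pixel = (0, 0, 0, 0),  # assumed preset masking color: transparent
) -> Layer:
    """Replace non-transparent pixels outside the visible rectangle with mask_rgba."""
    xs = [x for x, _ in mask_coords]
    ys = [y for _, y in mask_coords]
    left, right, top, bottom = min(xs), max(xs), min(ys), max(ys)
    out: Layer = []
    for y, row in enumerate(layer):
        new_row = []
        for x, px in enumerate(row):
            visible = left <= x <= right and top <= y <= bottom
            # Pixels in the visible area, and transparent pixels, are unchanged.
            new_row.append(px if visible or px[3] == 0 else mask_rgba)
        out.append(new_row)
    return out
```

Moving the masking coordinates between target time points can produce reveal or wipe effects without redrawing the text.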

8. The subtitle image generation method according to any one of claims 1 to 7, wherein obtaining the initial attribute value of each image layer in the subtitle image corresponding to the subtitle to be displayed comprises: obtaining a dynamic attribute set of the subtitle image corresponding to the subtitle to be displayed, and dividing the dynamic attribute set into an equivalent attribute set and a non-equivalent attribute set, wherein the equivalent attribute set includes layer attributes whose animation effects can be achieved through image transformation, and the non-equivalent attribute set includes layer attributes whose animation effects cannot be achieved through image transformation; and when the non-equivalent attribute set is empty, obtaining, from the equivalent attribute set, the initial attribute value of each image layer in the subtitle image corresponding to the subtitle to be displayed.
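
The division of the dynamic attribute set in claim 8 can be sketched as a simple partition. Which attributes count as equivalent is supplied by the caller in this sketch; the attribute names used in the usage note (`color`, `position`, `font_size`) and the function names are hypothetical and are not taken from this disclosure.

```python
from typing import Any, Dict, Set, Tuple


def partition_dynamic_attributes(
    dynamic_attrs: Dict[str, Any],
    equivalent_names: Set[str],  # assumed: names achievable by image transformation
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split the dynamic attribute set into (equivalent, non_equivalent)."""
    equivalent = {k: v for k, v in dynamic_attrs.items() if k in equivalent_names}
    non_equivalent = {k: v for k, v in dynamic_attrs.items() if k not in equivalent_names}
    return equivalent, non_equivalent


def can_use_layer_transformation(
    dynamic_attrs: Dict[str, Any], equivalent_names: Set[str]
) -> bool:
    """True when the non-equivalent attribute set is empty."""
    _, non_equivalent = partition_dynamic_attributes(dynamic_attrs, equivalent_names)
    return not non_equivalent
```

For example, with `equivalent_names = {"color", "position"}`, a dynamic attribute set containing only `color` and `position` qualifies for the layer-transformation path, while one that also animates `font_size` does not under this assumed classification.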

9. The subtitle image generation method according to claim 8, wherein the layer attributes in the equivalent attribute set include a color attribute of each image layer; and obtaining, when the non-equivalent attribute set is empty, the initial attribute value of each image layer in the subtitle image corresponding to the subtitle to be displayed from the equivalent attribute set comprises: when the non-equivalent attribute set is empty, for each image layer, keeping the color attribute corresponding to the current image layer unchanged in the equivalent attribute set and adjusting the color attributes corresponding to the remaining image layers in the equivalent attribute set to transparent, to obtain a layer attribute set corresponding to the current image layer; and obtaining the initial attribute value of each image layer in the subtitle image corresponding to the subtitle to be displayed from the respective layer attribute sets.
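
The derivation of per-layer attribute sets in claim 9 can be sketched as follows. The sketch assumes that the equivalent attribute set is a mapping from layer name to a dictionary that holds a `color` entry as an `(R, G, B, A)` tuple; this data layout and the function name are illustrative assumptions.

```python
from typing import Any, Dict

TRANSPARENT = (0, 0, 0, 0)  # assumed representation of a transparent color


def build_layer_attribute_sets(
    equivalent_attrs: Dict[str, Dict[str, Any]],
) -> Dict[str, Dict[str, Dict[str, Any]]]:
    """For each layer, copy the equivalent set and make all other layers transparent."""
    result: Dict[str, Dict[str, Dict[str, Any]]] = {}
    for current in equivalent_attrs:
        layer_set: Dict[str, Dict[str, Any]] = {}
        for name, attrs in equivalent_attrs.items():
            copied = dict(attrs)  # the input set is not modified
            if name != current:
                copied["color"] = TRANSPARENT
            layer_set[name] = copied
        result[current] = layer_set
    return result
```

Drawing the subtitle once with each resulting set yields one image per layer in which only that layer's content is visible, while layout and geometry remain identical across layers.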

10. The subtitle image generation method according to claim 9, wherein each layer attribute set further includes an attribute change function corresponding to the initial attribute value, and obtaining the target attribute value of each image layer comprises: obtaining the target time point; and inputting the target time point into the attribute change function in each of the layer attribute sets for calculation to obtain the target attribute value of each image layer.
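
The evaluation of attribute change functions in claim 10 can be illustrated with a short sketch. It assumes, as one possible form only, that an attribute change function is a clamped linear interpolation between an initial value and an end value; the claim does not restrict the form of the function, and all names below are hypothetical.

```python
from typing import Callable, Dict

ChangeFn = Callable[[float], float]


def linear_change(initial_value: float, end_value: float, start_t: float, end_t: float) -> ChangeFn:
    """Return f(t) that interpolates linearly and clamps outside [start_t, end_t]."""
    def f(t: float) -> float:
        if t <= start_t:
            return initial_value
        if t >= end_t:
            return end_value
        progress = (t - start_t) / (end_t - start_t)
        return initial_value + (end_value - initial_value) * progress
    return f


def evaluate_target_values(
    change_fns: Dict[str, Dict[str, ChangeFn]], target_time: float
) -> Dict[str, Dict[str, float]]:
    """Input the target time point into every layer's attribute change functions."""
    return {
        layer: {attr: fn(target_time) for attr, fn in fns.items()}
        for layer, fns in change_fns.items()
    }
```

For instance, an opacity that changes from 0.0 to 1.0 over two seconds evaluates to 0.5 at a target time point of one second.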

11. The subtitle image generation method according to claim 10, wherein inputting the target time point into the attribute change function in each of the layer attribute sets for calculation to obtain the target attribute value of each image layer comprises: inputting the target time point into the attribute change function in each of the layer attribute sets for calculation to obtain a reference attribute value of each image layer; inputting the subtitle text into a large language model for sentiment recognition to obtain target sentiment information of the subtitle text; and concatenating the target sentiment information with the reference attribute values and inputting the result into a regression model for regression to obtain the target attribute value of each image layer.

12. The subtitle image generation method according to any one of claims 1 to 11, wherein the image layers include a text layer, a border layer, a shadow layer, and a background layer; and superimposing the target subtitle layers of the image layers to generate the target subtitle image of the subtitle to be displayed at the target time point comprises: superimposing the target subtitle layers sequentially in the order of the background layer, the shadow layer, the border layer, and the text layer to generate the target subtitle image of the subtitle to be displayed at the target time point.
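
The ordered superposition of claim 12 can be sketched with the standard source-over alpha compositing operator. The claim specifies only the stacking order; the use of source-over blending, the non-premultiplied `(R, G, B, A)` pixel layout, and the function names are assumptions made for this illustration.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int, int, int]  # non-premultiplied (R, G, B, A), 0-255 (assumed)
Layer = List[List[Pixel]]

LAYER_ORDER = ("background", "shadow", "border", "text")  # bottom to top


def over(top: Pixel, bottom: Pixel) -> Pixel:
    """Composite one pixel over another with the source-over operator."""
    a_t, a_b = top[3] / 255, bottom[3] / 255
    a_out = a_t + a_b * (1 - a_t)
    if a_out == 0:
        return (0, 0, 0, 0)
    rgb = tuple(
        round((top[i] * a_t + bottom[i] * a_b * (1 - a_t)) / a_out) for i in range(3)
    )
    return (rgb[0], rgb[1], rgb[2], round(a_out * 255))


def superimpose(layers: Dict[str, Layer]) -> Layer:
    """Stack equally sized layers in the order background, shadow, border, text."""
    height = len(layers[LAYER_ORDER[0]])
    width = len(layers[LAYER_ORDER[0]][0])
    canvas: Layer = [[(0, 0, 0, 0)] * width for _ in range(height)]
    for name in LAYER_ORDER:
        layer = layers[name]
        canvas = [
            [over(layer[y][x], canvas[y][x]) for x in range(width)]
            for y in range(height)
        ]
    return canvas
```

Because the text layer is composited last, its opaque pixels are never covered by the border, shadow, or background layers.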

13. A subtitle image generation apparatus, comprising: an acquisition module configured to obtain an initial attribute value of each image layer in a subtitle image corresponding to a subtitle to be displayed, wherein the image layers are divided based on different types of display content in the subtitle image, and the initial attribute value is the attribute value of a layer attribute of the image layer at an initial time point; a layout and drawing module configured to obtain the subtitle text of the subtitle to be displayed, and, for each image layer, perform text layout and drawing based on the subtitle text and the initial attribute value of the image layer to obtain an initial subtitle layer of the image layer at the initial time point; an image transformation module configured to obtain a target attribute value of each image layer, and, for each image layer, perform image transformation on the initial subtitle layer of the image layer based on the target attribute value of the image layer to obtain a target subtitle layer of the image layer at a target time point, wherein the target attribute value is the attribute value of the layer attribute of the image layer at the target time point, and the target time point is a time point after the initial time point; and an image generation module configured to superimpose the target subtitle layers of the image layers to generate a target subtitle image of the subtitle to be displayed at the target time point.

14. An electronic device comprising a memory and a processor, wherein the memory stores a computer program, and the processor, when executing the computer program, implements the subtitle image generation method according to any one of claims 1 to 12.

15. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the subtitle image generation method according to any one of claims 1 to 12.

16. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the subtitle image generation method according to any one of claims 1 to 12.