Album generation method and apparatus, and device, medium and product
By using a feature element recognition model to determine the popularity information of feature elements in an image, a story album with a sorted timeline is generated, which solves the problems of low user satisfaction and low hardware resource utilization in existing technologies and achieves more efficient electronic album generation.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- CHINA MOBILE INTERNET CO LTD
- Filing Date
- 2025-12-05
- Publication Date
- 2026-06-25
AI Technical Summary
In existing technologies, the generation methods of electronic story albums rely on the time when the images were captured, resulting in low user satisfaction and low utilization of hardware processing resources.
The feature element recognition model determines the feature elements and their popularity information in the image to be processed, and generates a sorted timeline based on this to build a story album.
Determining the image display timeline by the popularity of feature elements improves user satisfaction and optimizes the utilization of hardware processing resources.
[Figure: abstract drawing, CN2025140301_25062026_PF_FP_ABST]
Description
Album generation methods, devices, equipment, media and products
[0001] Cross-references to related applications
[0002] This application is based on and claims priority to Chinese Patent Application No. 202411864156.4, filed on December 17, 2024, entitled “Album Generation Method, Apparatus, Device, Medium and Product”, the entire contents of which are incorporated herein by reference.
Technical Field
[0003] This application belongs to the field of computer vision technology, and in particular relates to a method, apparatus, device, medium and product for generating photo albums.
Background Technology
[0004] In real life, story albums are no longer limited to traditional paper formats; they are increasingly presented digitally. Digital story albums aggregate a series of images stored on electronic devices, allowing users to quickly and easily view past memories.
[0005] However, in related technologies, the scene or shooting time of the images stored in the electronic device is usually identified, and a storyline for displaying the images is generated from the shooting time. The related images are then displayed according to that storyline to obtain an electronic story album. This approach results in low user satisfaction and low utilization of hardware processing resources.
Summary of the Invention
[0006] This application provides a method, apparatus, device, medium, and product for generating photo albums.
[0007] This application provides a method for generating a photo album, the method including:
[0008] Obtain multiple images to be processed for generating the story album;
[0009] For each image to be processed, a feature element recognition model is used to determine at least one feature element in the image and the element information of each feature element;
[0010] Determine the popularity information of each feature element in each image to be processed;
[0011] Generate a story album by sorting the multiple images to be processed according to a sorting timeline, where the sorting timeline is generated based on the popularity information of the target element in each image to be processed, and the target element is any one of the at least one feature element in that image.
[0012] This application provides a photo album generation device, the device comprising:
[0013] The acquisition module is configured to acquire multiple images to be processed for generating the story album;
[0014] The determination module is configured to determine at least one feature element in the image to be processed and the element information of each feature element using a feature element recognition model for each image to be processed.
[0015] The determination module is also configured to determine the popularity information of each feature element in each image to be processed;
[0016] The generation module is configured to generate a story album by sorting the multiple images to be processed according to a sorting timeline, where the sorting timeline is generated based on the popularity information of the target element in each image to be processed, and the target element is any one of the at least one feature element in that image.
[0017] This application provides an electronic device, including: a memory for storing computer program instructions; and a processor for reading and running the computer program instructions stored in the memory to execute the album generation method provided in any of the optional embodiments described above.
[0018] This application provides a computer storage medium storing computer program instructions, which, when executed by a processor, implement the album generation method provided in any of the optional embodiments described above.
[0019] This application provides a computer program product, which includes a computer program that, when executed by a processor, implements the album generation method provided in any of the optional embodiments described above.
[0020] In this embodiment, multiple images to be processed for generating a story album can be acquired. For each image, a feature element recognition model is used to determine at least one feature element and the element information of each feature element, and the popularity information of each feature element in each image is then determined. The multiple images are sorted according to a sorting timeline to generate the story album. Because this timeline can be generated based on the popularity information of the target element in each image (the target element being any one of the at least one feature element in the image), and popularity information characterizes the attention or popularity of a feature element, the timeline for displaying multiple images in the story album can be determined by the popularity of the feature elements in each image. This effectively improves user satisfaction and the utilization of hardware processing resources.
Attached Figure Description
[0021] To more clearly illustrate the technical solutions of the embodiments of this application, the accompanying drawings used in the embodiments of this application will be briefly introduced below. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0022] Figure 1 is a flowchart illustrating the first album generation method provided in this application embodiment;
[0023] Figure 2 is a schematic diagram of the structure of a feature element recognition model provided in an embodiment of this application;
[0024] Figure 3 is a flowchart illustrating the second album generation method provided in this application embodiment;
[0025] Figure 4 is a schematic diagram of the structure of a first sub-model provided in an embodiment of this application;
[0026] Figure 5 is a schematic diagram of the structure of a first attention mechanism model provided in an embodiment of this application;
[0027] Figure 6 is a schematic diagram of a convolution kernel provided in an embodiment of this application;
[0028] Figure 7 is a schematic diagram of a second attention mechanism model provided in an embodiment of this application;
[0029] Figure 8 is a flowchart illustrating a training method for a preset feature element recognition model provided in an embodiment of this application;
[0030] Figure 9 is a schematic diagram of the structure of a photo album generation device provided in an embodiment of this application;
[0031] Figure 10 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.
Detailed Implementation
[0032] The features and exemplary embodiments of various aspects of this application will be described in detail below. To make the objectives, technical solutions, and advantages of this application clearer, the application will be further described in detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are only intended to explain this application and not to limit it. For those skilled in the art, this application can be implemented without some of these specific details. The following description of the embodiments is merely to provide a better understanding of this application by illustrating examples.
[0033] It should be noted that, in this document, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that includes the element.
[0034] In this document, the term "and/or" merely describes an association between related objects and indicates that three relationships are possible. For example, "A and/or B" can represent three situations: A exists alone, A and B exist simultaneously, or B exists alone.
[0035] As mentioned in the background art, to address the existing problems, embodiments of this application provide a photo album generation method, apparatus, device, medium, and product. This method can acquire multiple images to be processed for generating a story photo album. For each image to be processed, a feature element recognition model is used to determine at least one feature element and the element information of each feature element. The popularity information of each feature element in each image is also determined. The multiple images to be processed are then sorted according to a timeline to generate the story photo album. Since the timeline can be generated based on the popularity information of target elements in each image to be processed, and popularity information can characterize the attention or popularity of feature elements, and the target element is any one of the at least one feature element in the image to be processed, the timeline for displaying multiple images in the story photo album can be determined by the popularity of feature elements in each image, effectively improving user satisfaction.
[0036] It should be noted that the execution entity of the album generation method provided in this application embodiment can be an album generation device, or a control module in the album generation device used to execute the album generation method. This application embodiment takes the execution of the album generation method by an album generation device as an example to illustrate the album generation method provided in this application embodiment.
[0037] The album generation method provided in this application will be described in detail below with reference to the accompanying drawings and specific embodiments.
[0038] Figure 1 is a flowchart illustrating a photo album generation method provided in an embodiment of this application.
[0039] As shown in Figure 1, the execution subject of this method can be a photo album generation device. Based on this, the method can specifically include the following steps:
[0040] S110, Obtain multiple images to be processed for generating the story album.
[0041] The aforementioned multiple images to be processed may be images selected or authorized by the user, and this embodiment of the application does not impose specific limitations on them.
[0042] Specifically, the album generation device can obtain multiple images to be processed for generating a story album from images stored in the album of an electronic device or from images stored in a cloud drive, selected or authorized by the user.
[0043] S120, for each image to be processed, at least one feature element in the image to be processed and the element information of each feature element are determined using a feature element recognition model.
[0044] The feature element recognition model can be a pre-trained model used to identify feature elements present in an image; however, this embodiment does not impose any specific limitations on it.
[0045] Furthermore, the aforementioned feature elements can be elements such as necklaces, watches, glasses, and mobile phones in a person image, landmarks and plants in a landscape image, or furniture and decorations in a lifestyle image. Correspondingly, the element information of a feature element can be related information about that feature element, such as its size and dimensions. This application embodiment does not impose specific limitations on the feature elements or on their element information.
[0046] In some embodiments, after acquiring a plurality of images to be processed for generating a story album, the album generation apparatus can, for each of the plurality of images to be processed, use a pre-trained feature element recognition model to identify the image to be processed, so as to obtain at least one feature element in the image to be processed and the element information of each feature element. In this way, at least one feature element in each of the plurality of images to be processed and the element information of each feature element can be obtained.
[0047] S130, determine the popularity information corresponding to each feature element in each image to be processed.
[0048] The popularity information of the aforementioned feature elements is used to characterize the attention or popularity of the feature element.
[0049] In some embodiments, after obtaining at least one feature element in each image to be processed and the element information of each feature element, the album generation device can, for each of the at least one feature element in the image to be processed, determine the popularity information of that feature element based on its element information. In this way, the popularity information of each of the at least one feature element in the image to be processed can be obtained.
[0050] S140: Generate a story album by sorting multiple images to be processed according to the sorting timeline.
[0051] In some embodiments, the sorting timeline described above can be generated based on the popularity information of the target element of each image to be processed. The target element of each image to be processed can be any one of at least one feature element of the image to be processed; for example, the target element can be the feature element with the highest popularity among at least one feature element in the corresponding image to be processed.
[0052] In some embodiments, after the popularity information of each of the at least one feature element of each image to be processed is obtained, a target element can be determined for each image to be processed from the at least one feature element in that image, based on the popularity information of each feature element in the image. This yields the target element corresponding to each of the multiple images to be processed, and a sorting timeline is generated based on the popularity information of those target elements. In this way, the album generation device can generate a story album by sorting the multiple images to be processed according to the sorting timeline.
[0053] In this embodiment, multiple images to be processed for generating a story album can be acquired. For each image, a feature element recognition model is used to determine at least one feature element and the element information of each feature element, and the popularity information of each feature element in each image is then determined. The multiple images are sorted according to a sorting timeline to generate the story album. Because this timeline can be generated based on the popularity information of the target element in each image (the target element being any one of the at least one feature element in the image), and popularity information characterizes the attention or popularity of a feature element, the timeline for displaying multiple images in the story album can be determined by the popularity of the feature elements in each image, effectively improving user satisfaction.
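Steps S130 and S140 can be illustrated with a minimal sketch. It assumes that popularity is a single numeric score per feature element, that the target element is the one with the highest popularity (the example given in [0051]), and that the timeline runs from most to least popular. The class names, field names, file names, and scores are hypothetical and are not taken from this application.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FeatureElement:
    name: str          # e.g. "watch" or "landmark"
    popularity: float  # hypothetical numeric popularity score


@dataclass
class ProcessedImage:
    path: str
    elements: List[FeatureElement]  # at least one feature element per image


def target_element(image: ProcessedImage) -> FeatureElement:
    """Pick the target element: here, the element with the highest popularity."""
    return max(image.elements, key=lambda e: e.popularity)


def build_sorted_timeline(images: List[ProcessedImage]) -> List[ProcessedImage]:
    """Order images by target-element popularity (descending order is an assumption)."""
    return sorted(images, key=lambda img: target_element(img).popularity, reverse=True)


images = [
    ProcessedImage("a.jpg", [FeatureElement("watch", 0.4), FeatureElement("glasses", 0.7)]),
    ProcessedImage("b.jpg", [FeatureElement("landmark", 0.9)]),
    ProcessedImage("c.jpg", [FeatureElement("plant", 0.2), FeatureElement("furniture", 0.5)]),
]
story_album = [img.path for img in build_sorted_timeline(images)]
# story_album is ["b.jpg", "a.jpg", "c.jpg"]
```

Any other rule for choosing the target element or for the ordering direction could be substituted by changing the two `key` functions.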
[0054] In the above embodiments, the album generation method provided by this application requires the use of a feature element recognition model to determine at least one feature element in the image to be processed and the element information of each feature element. In order to accurately obtain at least one feature element in the image to be processed and the element information of each feature element, it should be noted that the feature element recognition model provided by this application may include a first sub-model, a second sub-model, a third sub-model, and a fourth sub-model, as shown in Figure 2. Based on this, in some embodiments, as shown in Figure 3, the above S120 may specifically include the following steps:
[0055] S310, the first sub-model of the feature element recognition model is used to recognize the feature elements of the image to be processed, and stacked feature maps corresponding to at least one feature element in the image to be processed are obtained.
[0056] The first sub-model of the aforementioned feature element recognition model is used to extract high-level features in the image. It mainly outputs feature maps from certain intermediate levels of the network, which carry semantic information in the image such as the shape, texture, and color of objects.
[0057] In some embodiments, after acquiring multiple images to be processed, since each image to be processed may include at least one feature element, the album generation device can identify the feature elements in the image to be processed through the first sub-model of the feature element recognition model for each image to be processed, and obtain stacked feature maps corresponding to at least one feature element in the image to be processed.
[0058] S320, the second sub-model of the feature element recognition model is used to process the stacked feature map corresponding to each feature element, and to determine multiple first proposal regions corresponding to the feature element and the regional confidence of each first proposal region.
[0059] The second sub-model of the aforementioned feature element recognition model includes a Region Proposal Network (RPN). This second sub-model can be a model used to identify at least one feature element in the image to be processed, and this embodiment of the application does not specifically limit it.
[0060] In addition, the aforementioned first proposal region refers to a candidate region selected from the image to be processed that may contain feature elements, and the region confidence corresponding to each first proposal region represents the degree of authenticity of that first proposal region.
[0061] In some embodiments, after the first sub-model outputs the stacked feature map corresponding to each feature element, the album generation device can input the stacked feature map corresponding to each feature element into the second sub-model of the feature element recognition model. The second sub-model then processes the stacked feature map corresponding to each feature element to determine the multiple first proposal regions corresponding to the feature element and the region confidence of each first proposal region.
[0062] S330, the third sub-model of the feature element recognition model is used to adjust the multiple first proposal regions corresponding to each feature element to obtain multiple second proposal regions for the feature element.
[0063] The third sub-model of the aforementioned feature element recognition model can be a Region of Interest Align (RoIAlign) network, which is used to adjust the first proposal regions.
[0064] Furthermore, the aforementioned multiple second proposal regions correspond one-to-one with the aforementioned multiple first proposal regions, and this application embodiment does not impose specific limitations here.
[0065] In some embodiments, for each of the at least one feature elements, the album generation device can input multiple first proposal regions corresponding to the feature element into a third sub-model of the feature element recognition model, and then adjust the multiple first proposal regions of the feature element through the third sub-model of the feature element recognition model to obtain multiple second proposal regions of the feature element.
[0066] S340, the fourth sub-model of the feature element recognition model is used to identify the target image corresponding to each feature element and determine the element information of the feature element, thereby obtaining the element information of each of the at least one feature element in the image to be processed.
[0067] In some embodiments, the target image for each feature element is obtained by cropping the image to be processed based on the target proposal region corresponding to each feature element, wherein the target proposal region may be determined from a plurality of second proposal regions corresponding to the feature element.
[0068] In addition, the fourth sub-model mentioned above may include a semantic segmentation network for determining element information of feature elements.
[0069] In some embodiments, for each feature element in each image to be processed, the album generation device determines the target proposal region corresponding to the feature element from the plurality of second proposal regions of the feature element, and crops the target image from the image to be processed according to that target proposal region. In this way, the album generation device can identify the target image through the fourth sub-model of the feature element recognition model, determine the element information of the feature element, and obtain the element information of each of the at least one feature element in the image to be processed.
[0070] In this embodiment, after acquiring multiple images to be processed, each image to be processed is input into the feature element recognition model. The first sub-model, second sub-model, third sub-model, and fourth sub-model included in the feature element recognition model are used to recognize and process the images to be processed in sequence, so as to obtain more accurate recognition results. That is, at least one feature element in the image to be processed and the element information of each feature element can be accurately obtained.
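The cropping step in [0067] and [0069] can be sketched as follows. This text does not state how the target proposal region is chosen from the second proposal regions, so the sketch assumes it is the region with the highest region confidence, with confidences carried over from the first proposal regions through the one-to-one correspondence in [0064]. The box layout, function names, and toy values are hypothetical.

```python
from typing import List, Tuple

# (top, left, bottom, right) in pixels; bottom and right are exclusive.
Box = Tuple[int, int, int, int]


def pick_target_region(regions: List[Box], confidences: List[float]) -> Box:
    """Assumption: the target proposal region is the one with the highest confidence."""
    best_index = max(range(len(regions)), key=lambda i: confidences[i])
    return regions[best_index]


def crop(image: List[List[int]], box: Box) -> List[List[int]]:
    """Crop a row-major 2D image to the given box."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


# A 4x4 toy image whose pixel value encodes its position (10 * row + column).
image = [[10 * r + c for c in range(4)] for r in range(4)]
second_proposal_regions = [(0, 0, 2, 2), (1, 1, 3, 4), (2, 0, 4, 2)]
region_confidences = [0.35, 0.80, 0.55]

target_region = pick_target_region(second_proposal_regions, region_confidences)
target_image = crop(image, target_region)
# target_image is [[11, 12, 13], [21, 22, 23]]
```

The cropped `target_image` is what would be passed to the fourth sub-model to determine element information.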
[0071] Based on this, in some embodiments, the aforementioned first sub-model may include a first convolutional layer, a first attention mechanism model, and N convolutional kernels, as shown in Figure 4, in which the multiplication symbol represents channel-level multiplication. Based on this, S310 above can specifically include the following steps:
[0072] The first convolutional layer performs convolution processing on the image to be processed, resulting in the first feature map;
[0073] The first feature map is processed by the first attention mechanism model to obtain the second feature map;
[0074] The third feature map, which is obtained by multiplying the first feature map and the second feature map, is processed sequentially by the N convolution kernels to obtain the stacked feature map.
[0075] The first convolutional layer may include a 3×3 convolutional network, which is not specifically limited in this embodiment. Additionally, the second feature map may include the feature element positions of at least one feature element in the image to be processed.
[0076] In some embodiments, the album generation device can first perform convolution processing on the image to be processed through a first convolutional layer to obtain a first feature map, and then process the first feature map through a first attention mechanism model to obtain a second feature map. Since the second feature map may include the feature element position of at least one feature element, the third feature map corresponding to the feature element can be located in the feature map obtained by multiplying the first feature map and the second feature map based on the feature element position of each feature element. Then, the third feature map can be processed sequentially by N convolution kernels to obtain the stacked feature map corresponding to the feature element.
[0077] In this embodiment, by multiplying the outputs of the two channels of the first convolutional layer and the first attention mechanism model and inputting them into N convolutional kernels for convolution processing, not only can high-level features be obtained, but also the interference information of the entire model is reduced, the ability of the prediction head to extract and locate feature information is enhanced, and the performance of the entire network is improved.
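The data flow of the first sub-model described in [0072] to [0077] can be sketched as below. As a simplification, the sketch treats the whole element-wise product of the first and second feature maps as the third feature map and does not locate a per-element region inside it. The stage functions are toy stand-ins (an identity convolution, a constant attention map, and two trivial kernels), not the layers of this application, and all names are hypothetical.

```python
from typing import Callable, List, Sequence

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][column]
Stage = Callable[[FeatureMap], FeatureMap]


def map_values(fn: Callable[[float], float], fmap: FeatureMap) -> FeatureMap:
    """Apply fn to every value of a feature map."""
    return [[[fn(v) for v in row] for row in channel] for channel in fmap]


def multiply(a: FeatureMap, b: FeatureMap) -> FeatureMap:
    """Element-wise product of two feature maps of identical shape."""
    return [[[x * y for x, y in zip(ra, rb)] for ra, rb in zip(ca, cb)]
            for ca, cb in zip(a, b)]


def first_sub_model(image: FeatureMap, first_conv: Stage, attention: Stage,
                    kernels: Sequence[Stage]) -> FeatureMap:
    first_map = first_conv(image)                # first feature map
    second_map = attention(first_map)            # second feature map
    third_map = multiply(first_map, second_map)  # third feature map
    stacked = third_map
    for kernel in kernels:                       # N kernels applied in sequence
        stacked = kernel(stacked)
    return stacked


# Toy stand-ins for the real layers.
identity_conv: Stage = lambda m: m
constant_attention: Stage = lambda m: map_values(lambda _: 0.5, m)
toy_kernels = [
    lambda m: map_values(lambda v: 2.0 * v, m),
    lambda m: map_values(lambda v: v + 1.0, m),
]

stacked = first_sub_model([[[1.0, 2.0], [3.0, 4.0]]],
                          identity_conv, constant_attention, toy_kernels)
# stacked is [[[2.0, 3.0], [4.0, 5.0]]]
```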
[0078] In order to accurately describe the album generation method provided in the embodiments of this application, in some embodiments, the step of processing the first feature map through the first attention mechanism model to obtain the second feature map specifically includes:
[0079] The first feature map is transformed into a first sub-feature code and a second sub-feature code using the first attention mechanism model;
[0080] The first and second sub-feature codes are concatenated to obtain the total feature code, and the total feature code is processed using a convolution transformation function to obtain the fourth feature map;
[0081] The fourth feature map is decomposed according to different dimensional directions to obtain the first and second sub-feature maps;
[0082] The first feature map, the first target sub-feature map, and the second target sub-feature map are processed using the first attention mechanism to obtain the second feature map. The first target sub-feature map is determined based on the first sub-feature map, and the second target sub-feature map is determined based on the second sub-feature map.
[0083] In some embodiments, the album generation device can use a first attention mechanism model to transform a first feature map into a first sub-feature code and a second sub-feature code, and concatenate the first sub-feature code and the second sub-feature code to obtain a total feature code. The total feature code is then processed using a convolutional transformation function to obtain a fourth feature map. The fourth feature map can then be decomposed according to different dimensional directions to obtain a first sub-feature map and a second sub-feature map. A first target sub-feature map is determined based on the first sub-feature map, and a second target sub-feature map is determined based on the second sub-feature map. The first attention mechanism can then be used to process the first feature map, the first target sub-feature map, and the second target sub-feature map to obtain a second feature map.
[0084] In one example, the structure of the first attention mechanism model described above can be shown in Figure 5. When the first convolutional layer inputs the first feature map obtained into the first attention mechanism model, the first feature map $F \in \mathbb{R}^{C \times H \times W}$ can first be processed in the residual, where $\mathbb{R}$ represents the set of real numbers, $C$ represents the number of channels, $H$ represents the feature map height, and $W$ represents the feature map width. It should be noted that this first feature map can contain two-dimensional global pooling feature codes. Based on this, the first feature map can be decomposed into two parallel one-dimensional feature codes (i.e., the first sub-feature code and the second sub-feature code mentioned above, shown as X AVG Pool and Y AVG Pool in Figure 5). Next, the one-dimensional feature codes in the two directions are concatenated using Concat (i.e., to obtain the total feature code), and a 1×1 convolution transformation function (i.e., Conv in Figure 5) integrates their feature information to obtain $F' \in \mathbb{R}^{(C/r') \times 1 \times (W+H)}$ (i.e., the fourth feature map mentioned above), the intermediate feature map encoding spatial location information in the horizontal and vertical directions, where $r'$ is a scaling factor used to reduce the number of channels. Subsequently, $F'$ is decomposed into two separate feature maps along the spatial dimensions of the two directions, and each is passed through a 1×1 convolutional transformation function (Conv) followed by a Sigmoid activation function to obtain the final feature maps $G^h$ and $G^w$. Finally, the input feature map $F$ is multiplied element-wise with the feature maps $G^h$ and $G^w$ from the two directions, which carry the position information, to obtain the enhanced expression of the attention mechanism, $Y' = F \cdot G^h \cdot G^w$ (i.e., the second feature map).
[0085] In this embodiment, the first attention mechanism model can be used to strengthen or weaken the feature representation of a specific region, capture the long-term dependencies between network channels, and determine the accurate location of feature elements. Secondly, considering that the weight parameters can be changed by combining the feature elements of the convolutional layer in the residual, more weights are applied to important feature element channels and less weights are applied to other unimportant feature channels, thereby enhancing the global attention of the feature element recognition network and accurately obtaining the stacked features of each feature element, which facilitates the accurate acquisition of element information of the feature elements in the future.
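The attention steps in [0084] (directional average pooling, concatenation, a channel-reducing 1×1 convolution, a split, two further 1×1 convolutions with Sigmoid, and element-wise multiplication) can be sketched in plain Python. This is a minimal illustration that follows only the steps named in the text. The weight values are arbitrary rather than trained, the height-then-width concatenation order is an assumption, and every function name is hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]            # [channel][position]
FeatureMap = List[List[List[float]]]  # [channel][row][column]


def conv1x1(weights: Matrix, x: Matrix) -> Matrix:
    """1x1 convolution as channel mixing: weights is [out_ch][in_ch]."""
    positions = range(len(x[0]))
    return [[sum(w * x[c][p] for c, w in enumerate(row)) for p in positions]
            for row in weights]


def sigmoid(x: Matrix) -> Matrix:
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in x]


def directional_attention(f: FeatureMap, w_reduce: Matrix,
                          w_h: Matrix, w_w: Matrix) -> FeatureMap:
    height, width = len(f[0]), len(f[0][0])
    # Two one-dimensional feature codes: average over width and over height.
    pooled_h = [[sum(row) / width for row in ch] for ch in f]             # [C][H]
    pooled_w = [[sum(ch[r][k] for r in range(height)) / height
                 for k in range(width)] for ch in f]                      # [C][W]
    total_code = [ph + pw for ph, pw in zip(pooled_h, pooled_w)]          # [C][H+W]
    f_prime = conv1x1(w_reduce, total_code)                               # [C/r'][H+W]
    # Split the intermediate map back into the two directions.
    part_h = [row[:height] for row in f_prime]
    part_w = [row[height:] for row in f_prime]
    g_h = sigmoid(conv1x1(w_h, part_h))                                   # [C][H]
    g_w = sigmoid(conv1x1(w_w, part_w))                                   # [C][W]
    # Y' = F * G^h * G^w, broadcast over the missing axis of each gate.
    return [[[f[c][r][k] * g_h[c][r] * g_w[c][k] for k in range(width)]
             for r in range(height)] for c in range(len(f))]


# Toy input with C=2, H=2, W=3 and a scaling factor r'=2 (one reduced channel).
f = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    [[0.5, 0.5, 0.5], [1.5, 1.5, 1.5]],
]
y = directional_attention(f, w_reduce=[[0.5, 0.5]],
                          w_h=[[1.0], [-1.0]], w_w=[[0.2], [0.3]])
```

Because both gates lie strictly between 0 and 1, each output value is a scaled-down copy of the corresponding input value, which is how the mechanism strengthens or weakens particular rows and columns.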
[0086] In some embodiments, each convolutional kernel may include a depthwise separable convolutional layer, a second attention mechanism model, and a random deactivation layer; for each feature element in at least one feature element, the third feature map is processed sequentially by N convolutional kernels to obtain a stacked feature map corresponding to the feature element, including:
[0087] For the (i+1)th convolutional kernel among N convolutional kernels, the depthwise separable convolutional layer in the (i+1)th convolutional kernel is used to extract features from the output feature map of the i-th convolutional kernel to obtain the initial feature map corresponding to the (i+1)th convolutional kernel. The output feature map of the i-th convolutional kernel is obtained by processing the second feature map sequentially using the i-th convolutional kernel.
[0088] The initial feature map corresponding to the (i+1)th convolutional kernel is adjusted using the second attention mechanism model in the (i+1)th convolutional kernel, and the adjusted initial feature map corresponding to the (i+1)th convolutional kernel is processed using the random deactivation layer in the (i+1)th convolutional kernel to obtain the reference feature map corresponding to the (i+1)th convolutional kernel.
[0089] Based on the output feature map of the i-th convolutional kernel and the reference feature map corresponding to the (i+1)-th convolutional kernel, the output feature map of the (i+1)-th convolutional kernel is obtained. If i+1 is less than N, i = i+1 is updated, and the process returns to the step of using the depthwise separable convolutional layer in the (i+1)-th convolutional kernel to extract features from the output feature map of the i-th convolutional kernel and obtain the initial feature map corresponding to the (i+1)-th convolutional kernel, until i+j = N.
[0090] In some embodiments, since the first sub-model includes N convolutional kernels, the album generation device can, for the (i+1)th convolutional kernel, use the depthwise separable convolutional layer in the (i+1)th convolutional kernel to extract features from the output feature map of the i-th convolutional kernel and obtain the initial feature map corresponding to the (i+1)th convolutional kernel. The output feature map of the i-th convolutional kernel is obtained by processing the second feature map sequentially using the i-th convolutional kernel. The initial feature map corresponding to the (i+1)th convolutional kernel is adjusted using the second attention mechanism model in the (i+1)th convolutional kernel, and the random deactivation layer in the (i+1)th convolutional kernel then processes the adjusted initial feature map to obtain the reference feature map corresponding to the (i+1)th convolutional kernel. Based on the output feature map of the i-th convolutional kernel and the reference feature map corresponding to the (i+1)th convolutional kernel, the output feature map of the (i+1)th convolutional kernel is obtained. If i+1 is less than N, i is updated to i+1, and the process returns to the step of using the depthwise separable convolutional layer in the (i+1)th convolutional kernel to extract features from the output feature map of the i-th convolutional kernel to obtain the initial feature map corresponding to the (i+1)th convolutional kernel, until i+j = N. Here, j can be equal to 1.
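The iteration described in [0087] to [0090] can be sketched as a simple residual loop. This is a minimal illustration and not the applicant's implementation: `ConvKernel`, `stack_kernels`, and the toy stand-in layers are hypothetical names, feature maps are flattened to plain lists of floats, and the step that combines the i-th output with the (i+1)th reference feature map is assumed to be element-wise addition.

```python
from dataclasses import dataclass
from typing import Callable, List

FeatureMap = List[float]  # toy stand-in for a C x H x W tensor
Layer = Callable[[FeatureMap], FeatureMap]


@dataclass
class ConvKernel:
    """One of the N blocks: depthwise separable conv, attention, dropout."""
    depthwise_separable_conv: Layer
    attention: Layer
    dropout: Layer

    def __call__(self, previous_output: FeatureMap) -> FeatureMap:
        # Initial feature map of kernel i+1, extracted from the output of kernel i.
        initial = self.depthwise_separable_conv(previous_output)
        # Reference feature map: attention adjustment followed by random deactivation.
        reference = self.dropout(self.attention(initial))
        # Assumed combination: element-wise addition of input and reference.
        return [p + r for p, r in zip(previous_output, reference)]


def stack_kernels(feature_map: FeatureMap, kernels: List[ConvKernel]) -> FeatureMap:
    """Pass the feature map through kernels 1..N in order."""
    output = feature_map
    for kernel in kernels:
        output = kernel(output)
    return output


# Toy layers so the sketch runs; real layers would carry learned weights.
halve: Layer = lambda x: [0.5 * v for v in x]
identity: Layer = lambda x: list(x)

kernels = [ConvKernel(halve, identity, identity) for _ in range(2)]  # N = 2
stacked = stack_kernels([1.0, 2.0], kernels)
```

Because each kernel only consumes the previous kernel's output, the loop terminates after exactly N passes, which matches the stopping condition i+j = N with j = 1.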
[0091] In one example, if N=2, meaning the first sub-model includes only two convolutional kernels, namely convolutional kernel 1 and convolutional kernel 2, as shown in Figure 6, convolutional kernel 1 can be used to further extract high-level features corresponding to feature elements in the image to be processed. ⊕ indicates channel-level addition. This mechanism extracts features through a depthwise separable convolutional layer, adjusts the extracted data through a second attention mechanism model, and then reduces and expands the input features through a 1x1 convolution. Next, to improve the model's generalization ability, a random deactivation layer is added to further reduce the probability of overfitting. Finally, the outputs of the two channels of the depthwise separable convolutional layer and the random deactivation layer are added to obtain the final output of convolutional kernel 1. Convolutional kernel 2 can be used to reduce the complexity and number of parameters of the network, improving computational performance and running efficiency. First, the input features are reduced in dimensionality and expanded using 1x1 convolutions. Then, features are extracted using depthwise separable convolutional layers. Next, a second attention mechanism model is used to learn the correlation between the image and the question, which reduces the sensitivity to semantic differences between the image and the question, improving the accuracy and robustness of the task. Another 1x1 convolution is used to expand the feature dimension. A random deactivation layer is used to further reduce the probability of overfitting. Finally, the output of the 1x1 convolution and the output of the random deactivation layer are added together to obtain the final output of convolution kernel 2.
[0092] It should be noted that, after the data from the two channels that have undergone convolution and attention processing are multiplied and input into convolution kernel 1, the number of instances of convolution kernel 1 can be adjusted according to the actual application scenario. The adjustment principle is that image scenarios with many small feature elements should use more instances of convolution kernel 1 than image scenarios with few small feature elements.
[0093] Additionally, it should be noted that the structure of the second attention mechanism model in the above convolution kernel 2 can be shown in Figure 7. This second attention mechanism model is used to reduce the interference information of the entire model, enhance the ability of the prediction head to extract and locate feature information, and improve the performance of the entire network.
[0094] Let the input feature map be $F \in \mathbb{R}^{C \times H \times W}$, where $\mathbb{R}$ represents the set of real numbers, $C$ the number of channels, $H$ the feature map height, and $W$ the feature map width. The input feature map $F$ is encoded in parallel along the horizontal and vertical directions using pooling kernels of $(H, 1)$ and $(1, W)$ respectively, resulting in two independent one-dimensional feature codes, $Z^h \in \mathbb{R}^{C \times H \times 1}$ and $Z^w \in \mathbb{R}^{C \times 1 \times W}$. This captures long-range dependencies in one spatial direction while retaining accurate location information in the other spatial direction.
[0095] Next, $Z^h$ and $Z^w$ are concatenated along the channel dimension, and normalization and activation are performed using the weight-sharing convolution to obtain a pair of direction-aware and position-aware attention maps, as shown in formula (1): $B = \delta(F_1([Z^h, Z^w]))$ (1)
[0096] Here, $F_1$ is the weight-sharing convolution transformation function, whose output lies in $\mathbb{R}^{(C/r) \times H \times W}$, $\delta$ is the non-linear activation function, and $r$ is the scaling factor (the scaling ratio of the control block).
[0097] Then, along the channel dimension, $B$ is divided into two independent tensors, $B^h \in \mathbb{R}^{(C/r) \times H}$ and $B^w \in \mathbb{R}^{(C/r) \times W}$. Two different 1x1 convolutions, $J^h$ and $J^w$, are applied to $B^h$ and $B^w$ respectively to align the number of channels and enable perception of position and shape.
[0098] Finally, through multiplication, the output is obtained as shown in formula (2): $Y_1(i, j) = F(i, j) \times g^h(i) \times g^w(j)$ (2)
[0099] Here, $g^h = \alpha(J^h(B^h))$ and $g^w = \alpha(J^w(B^w))$.
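The computation in formulas (1) and (2) can be sketched in plain Python as follows. This is an illustrative sketch under several assumptions that the text does not fix: the pooling is taken to be average pooling, δ is taken to be ReLU, α is taken to be a sigmoid, the normalization step is omitted, and the two codes are joined along their spatial axis (length H + W) so that one shared transform can be applied and the result split back into parts of length H and W. The function name `directional_attention` and all weight values are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]
Tensor3 = List[List[List[float]]]  # indexed [channel][row][column]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product; a 1x1 convolution over channels reduces to this."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def directional_attention(f: Tensor3, f1: Matrix, j_h: Matrix, j_w: Matrix) -> Tensor3:
    channels, height, width = len(f), len(f[0]), len(f[0][0])
    # Directional pooling (average assumed): one code per row, one per column.
    z_h = [[sum(row) / width for row in f[c]] for c in range(channels)]      # C x H
    z_w = [[sum(f[c][i][j] for i in range(height)) / height
            for j in range(width)] for c in range(channels)]                 # C x W
    # Join both codes, apply the shared transform F1, then delta (ReLU assumed).
    z = [z_h[c] + z_w[c] for c in range(channels)]                           # C x (H+W)
    b = [[max(0.0, v) for v in row] for row in matmul(f1, z)]                # (C/r) x (H+W)
    # Split B into its two directional parts B_h and B_w.
    b_h = [row[:height] for row in b]
    b_w = [row[height:] for row in b]
    # J_h and J_w restore C channels; alpha is assumed to be a sigmoid.
    g_h = [[sigmoid(v) for v in row] for row in matmul(j_h, b_h)]            # C x H
    g_w = [[sigmoid(v) for v in row] for row in matmul(j_w, b_w)]            # C x W
    # Formula (2): Y1(i, j) = F(i, j) * g_h(i) * g_w(j), applied per channel.
    return [[[f[c][i][j] * g_h[c][i] * g_w[c][j] for j in range(width)]
             for i in range(height)] for c in range(channels)]


feature = [[[1.0, 2.0], [3.0, 4.0]],
           [[0.0, 1.0], [1.0, 0.0]]]   # C = 2, H = 2, W = 2
f1 = [[0.5, 0.5]]                       # C/r = 1, i.e. r = 2 (hypothetical weights)
j_h = [[1.0], [-1.0]]                   # C x (C/r)
j_w = [[0.0], [0.0]]
output = directional_attention(feature, f1, j_h, j_w)
```

Since both gates lie strictly between 0 and 1 under the sigmoid assumption, the output keeps the shape of F and only rescales each position by its row gate and column gate.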
[0100] In this embodiment, the second feature map can be iteratively processed by N convolutional kernels in the first sub-model to obtain a more accurate stacked feature map of feature elements, which makes it easier to identify the element information of each feature element more accurately in the future.
[0101] In some embodiments, the above-described S330 may specifically include the following steps:
[0102] For each of the multiple first proposal regions of each feature element, the third sub-model of the feature element recognition model is used to extract the regional location feature vector of the first proposal region from the first proposal region.
[0103] Based on the regional location feature vector of the first proposal region, the first proposal region is classified and bounding box regression is performed to obtain the second proposal region.
[0104] In some embodiments, after obtaining multiple first proposal regions for each feature element in the image to be processed, the album generation apparatus can extract a regional location feature vector from each of the multiple first proposal regions of each feature element using the third sub-model of the feature element recognition model. Based on the regional location feature vector, the apparatus can perform classification and bounding box regression on the first proposal region to obtain a second proposal region. In this way, a second proposal region corresponding to each of the multiple first proposal regions of the feature element can be obtained.
[0105] In this embodiment, for each of the multiple first proposal regions of each feature element, the regional location feature vector of the first proposal region can be obtained. Based on this vector, the first proposal region is classified and bounding box regression is performed, so that the first proposal region gradually approaches the true region of the feature element. This yields a more accurate second proposal region and improves the positioning accuracy of the feature element.
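As an illustration of the bounding box regression step, the sketch below applies predicted offsets to a first proposal region to obtain a second proposal region. The text does not state how the regression output is parameterized, so the centre and size offsets `(dx, dy, dw, dh)` commonly used in region-based detectors are assumed here; `Box` and `refine_box` are hypothetical names.

```python
import math
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    x1: float
    y1: float
    x2: float
    y2: float


def refine_box(first: Box, deltas: Tuple[float, float, float, float]) -> Box:
    """Apply predicted offsets to a first proposal region to get a second one."""
    dx, dy, dw, dh = deltas
    width, height = first.x2 - first.x1, first.y2 - first.y1
    cx, cy = first.x1 + 0.5 * width, first.y1 + 0.5 * height
    # Shift the centre in proportion to the box size; scale the size exponentially.
    new_cx, new_cy = cx + dx * width, cy + dy * height
    new_w, new_h = width * math.exp(dw), height * math.exp(dh)
    return Box(new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
               new_cx + 0.5 * new_w, new_cy + 0.5 * new_h)


# Offsets of zero leave the region unchanged; dx = 0.25 moves it right by a quarter width.
second = refine_box(Box(10.0, 10.0, 30.0, 50.0), (0.25, 0.0, 0.0, 0.0))
```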
[0106] In some embodiments, prior to S340, the album generation method described above may further include the following steps:
[0107] Based on the region confidence of each of the multiple second proposal regions corresponding to each feature element, at least two third proposal regions with region confidence greater than the confidence threshold are selected from the multiple second proposal regions.
[0108] By performing non-maximum suppression processing on the at least two third proposal regions, a target proposal region is determined from the at least two third proposal regions. The image to be processed is then cropped based on the target proposal region to obtain the target image corresponding to the feature element.
[0109] The confidence threshold mentioned above can be preset based on actual experience or circumstances, and this application embodiment does not impose specific limitations on it.
[0110] Since each second proposal region of each feature element is obtained by adjusting the corresponding first proposal region, the region confidence of each second proposal region is the region confidence of the corresponding first proposal region. Thus, the album generation device can, based on the region confidence of each second proposal region, filter the multiple second proposal regions to obtain at least two third proposal regions whose region confidence is greater than the confidence threshold. Furthermore, by performing non-maximum suppression processing on the at least two third proposal regions, a target proposal region can be determined from them. Based on the target proposal region, the image to be processed can be cropped to obtain the target image corresponding to the feature element.
[0111] In this embodiment, for each feature element, second proposal regions whose region confidence does not exceed the confidence threshold, and which may therefore be false detections, can be filtered out, thereby reducing the burden on subsequent hardware processing resources. Non-maximum suppression is then used to eliminate redundant third proposal regions, so that a unique target image corresponding to the feature element is obtained, repeated detection of the same object is avoided, and the accuracy and efficiency of detection are improved.
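The filtering, suppression, and cropping steps can be sketched as follows. This is a minimal illustration: the function names, the integer pixel box format, and the IoU threshold of 0.5 are assumptions, since the text does not give an overlap criterion for non-maximum suppression.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixel coordinates


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def select_target_regions(regions: List[Tuple[Box, float]],
                          confidence_threshold: float,
                          iou_threshold: float = 0.5) -> List[Box]:
    """Keep confident regions, then suppress overlapping lower-confidence ones."""
    # Third proposal regions: confidence above the threshold, best first.
    third = sorted((r for r in regions if r[1] > confidence_threshold),
                   key=lambda r: r[1], reverse=True)
    kept: List[Box] = []
    for box, _ in third:
        # A region survives only if it does not overlap a better one too much.
        if all(iou(box, k) <= iou_threshold for k in kept):
            kept.append(box)
    return kept


def crop(image: List[List[int]], box: Box) -> List[List[int]]:
    """Cut the target image out of the image to be processed."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]


second_regions = [((0, 0, 4, 4), 0.9), ((1, 0, 4, 4), 0.8), ((0, 0, 2, 2), 0.2)]
targets = select_target_regions(second_regions, confidence_threshold=0.5)
image = [[r * 10 + c for c in range(5)] for r in range(5)]
target_image = crop(image, targets[0])
```

In this example the third region is dropped by the confidence threshold, and the second is suppressed because it overlaps the highest-confidence region with an IoU of 0.75.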
[0112] It should be noted that the album generation method provided in this application requires a pre-trained feature element recognition model to process the image to be processed, so the feature element recognition model needs to be trained before step S120. The specific implementation of the training method for the feature element recognition model provided in this application is therefore described below with reference to the accompanying drawings.
[0113] Figure 8 is a flowchart illustrating a training method for a feature element recognition model provided in an embodiment of this application.
[0114] As shown in Figure 8, the execution entity of this method can be a photo album generation device. Based on this, the method can include the following steps:
[0115] S810, obtain the training sample set.
[0116] Before introducing this step, it should be noted that training the feature element recognition model requires multiple iterations that adjust its loss function value until the loss function value meets the training stopping condition, which yields the trained feature element recognition model. If only one image sample were input in each iteration, the sample size would be too small to train and adjust the preset feature element recognition model effectively. Therefore, multiple image samples are needed during training to iteratively train the preset feature element recognition model and obtain the trained model.
[0117] Therefore, the training sample set mentioned above may include multiple image samples and sample labels corresponding to each image sample. The sample labels corresponding to each image sample are used to characterize the element information of each feature element in at least one feature element in the image sample.
[0118] S820: For each image sample, the reference feature elements in the image sample are identified by the first preset sub-model of the preset feature element recognition model, so as to obtain the reference stacked feature map corresponding to at least one reference feature element in the image sample.
[0119] In some embodiments, the album generation device can input each image sample into a first preset sub-model of a preset feature element recognition model, and use the first preset sub-model of the preset feature element recognition model to identify reference feature elements in the image sample, thereby obtaining reference stacked feature maps corresponding to at least one reference feature element in the image sample.
[0120] S830, the reference stacked feature map corresponding to each reference feature element is processed by the second preset sub-model of the preset feature element recognition model to determine multiple first reference proposal regions corresponding to the reference feature element and the reference region confidence of each first reference proposal region.
[0121] In some embodiments, the reference region confidence of each first reference proposal region represents the degree of authenticity of that first reference proposal region.
[0122] In some embodiments, after obtaining the reference stacked feature map of each reference feature element, the reference stacked feature map can be input into the second preset sub-model of the preset feature element recognition model. The second preset sub-model processes the reference stacked feature map of each reference feature element to determine the multiple first reference proposal regions corresponding to the reference feature element and the reference region confidence of each first reference proposal region.
[0123] S840, the multiple first reference proposal regions of the reference feature element are adjusted by the third preset sub-model of the preset feature element recognition model to obtain multiple second reference proposal regions of the reference feature element.
[0124] In some embodiments, the album generation device can input multiple first reference proposal regions corresponding to each of the at least one reference feature elements into a third preset sub-model for each reference feature element, and adjust the multiple first reference proposal regions of the reference feature element respectively through the third preset sub-model of the preset feature element recognition model to obtain multiple second reference proposal regions of the reference feature element.
[0125] S850, the target reference image corresponding to each reference feature element is identified by the fourth preset sub-model of the preset feature element recognition model, the reference element information of the reference feature element is determined, and the reference element information of each reference feature element in at least one reference feature element in the image sample is obtained.
[0126] In some embodiments, the target reference image corresponding to each of the above reference feature elements is obtained by cropping the image sample based on the target reference proposal region of each reference feature element, and the target reference proposal region is determined from a plurality of second reference proposal regions corresponding to each reference feature element.
[0127] In some embodiments, the album generation device can determine a target reference proposal region based on multiple second reference proposal regions of each feature element, and further determine the corresponding target reference image. Then, the target reference image can be identified by the fourth preset sub-model of the preset feature element recognition model to determine the reference element information of the reference feature element, and obtain the reference element information of each reference feature element in at least one reference feature element in the image sample.
[0128] S860, the loss function value of the preset feature element recognition model is determined based on the reference element information corresponding to the target image sample and the sample label of the target image sample.
[0129] The target image sample is any one of multiple image samples.
[0130] In some embodiments, the album generation device can determine the loss function value of a preset feature element recognition model based on the reference element information corresponding to the target image sample and the sample label of the target image sample.
[0131] S870, based on the loss function value of the preset feature element recognition model, the preset feature element recognition model is trained using multiple image samples to obtain the trained feature element recognition model.
[0132] In some embodiments, to obtain a better trained feature element recognition model, the album generation device can adjust the model parameters of the preset feature element recognition model if the loss function value of the preset feature element recognition model does not meet the training stopping condition. Then, using multiple image samples, the device trains the preset feature element recognition model with the adjusted model parameters until the loss function value meets the training stopping condition, thus obtaining the trained feature element recognition model. The training stopping condition can be set based on practical experience or circumstances and is not specifically limited in this embodiment.
[0133] In this embodiment, a training sample set can be obtained, which includes multiple image samples and sample labels corresponding to each image sample. Based on this, for each image sample, the image sample can be identified sequentially using each sub-model in the preset feature element recognition model to obtain reference element information of each reference feature element in at least one reference feature element in the image sample. Based on the reference element information corresponding to the target image sample and the sample label of the target image sample, the loss function value of the preset feature element recognition model is determined. Then, based on the loss function value of the preset feature element recognition model, the preset feature element recognition model can be trained using multiple image samples to obtain a more accurate feature element recognition model.
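The control flow of S860 and S870 (compute the loss, check the stopping condition, adjust parameters, repeat) can be sketched with a deliberately tiny stand-in model. Everything specific here is an assumption made for illustration: the one-parameter linear model replaces the four-sub-model network, mean squared error replaces the unspecified loss function, gradient descent replaces the unspecified parameter adjustment, and the stopping condition is taken to be a loss threshold. The name `train_until_stop` is hypothetical.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (stand-in for an image sample, stand-in for its label)


def train_until_stop(samples: Sequence[Sample],
                     weight: float = 0.0,
                     learning_rate: float = 0.1,
                     loss_threshold: float = 1e-6,
                     max_iterations: int = 10_000) -> Tuple[float, float]:
    """Adjust the model parameter until the loss meets the stopping condition."""
    loss = float("inf")
    for _ in range(max_iterations):
        # S820-S850 collapsed: run the toy model (prediction = weight * x) on every sample.
        errors = [weight * x - label for x, label in samples]
        # S860: loss between the predictions and the sample labels (MSE assumed).
        loss = sum(e * e for e in errors) / len(samples)
        if loss <= loss_threshold:  # assumed training stopping condition
            break
        # S870: adjust the model parameter and train again with all samples.
        gradient = sum(2 * e * x for e, (x, _) in zip(errors, samples)) / len(samples)
        weight -= learning_rate * gradient
    return weight, loss


samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # labels follow label = 2 * x
trained_weight, final_loss = train_until_stop(samples)
```

Using all samples in every iteration mirrors the point made in [0116] that a single image sample per iteration is too little to adjust the model.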
[0134] In some embodiments, the element information of each feature element includes the number of sub-elements, element type, and element size; based on this, the above S130 may specifically include the following steps:
[0135] Obtain the correspondence between element information and popularity information. The correspondence between element information and popularity information includes the correspondence between the number of sub-elements, element type, element size and popularity information;
[0136] Based on the correspondence between the number of sub-elements, element type, element size and popularity information, the number of sub-elements, element type and element size of feature elements are matched to determine the popularity information of feature elements.
[0137] In some embodiments, the correspondence between the element information and the popularity information includes the correspondence between the number of sub-elements, element type, element size and popularity information.
[0138] In some embodiments, after obtaining the element information of each feature element in each image to be processed, the album generation device can obtain the correspondence between the element information and the popularity information. Since the element information of a feature element may include its number of sub-elements, element type, and element size, this correspondence includes the correspondence between the number of sub-elements, element type, element size and popularity information. Thus, the popularity information of the feature element can be determined by matching its number of sub-elements, element type, and element size against this correspondence.
[0139] In one example, the correspondence between the element information and the popularity information can include a first sub-relationship and a second sub-relationship. The first sub-relationship includes the correspondence between the number of sub-elements, element type, element size, and feature element name, as shown in Table 1 below. The second sub-relationship can include the correspondence between feature element name and popularity information, as shown in Table 2 below.
[0140] In this way, the number of sub-elements, element type and element size of the feature element can first be matched against the first sub-relationship to obtain the feature element name, and the feature element name can then be matched against the second sub-relationship to obtain the corresponding popularity information.
[0141] Table 1
[0142] Table 2
[0143] In this embodiment, the correspondence between element information and popularity information can be obtained, and based on the correspondence between element information and popularity information, the element information of feature elements can be matched to accurately determine the popularity information of feature elements.
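The two-stage matching through the first and second sub-relationships can be sketched as two dictionary lookups. The entries below are hypothetical placeholders and do not reproduce Table 1 or Table 2; representing element size as a category string and popularity as an integer are also assumptions, as is the name `lookup_popularity`.

```python
from typing import Dict, Optional, Tuple

ElementInfo = Tuple[int, str, str]  # (number of sub-elements, element type, element size)

# Hypothetical stand-in for the first sub-relationship: element info -> feature element name.
FIRST_SUB_RELATIONSHIP: Dict[ElementInfo, str] = {
    (1, "person", "large"): "portrait",
    (5, "person", "small"): "group photo",
    (1, "animal", "medium"): "pet",
}

# Hypothetical stand-in for the second sub-relationship: feature element name -> popularity.
SECOND_SUB_RELATIONSHIP: Dict[str, int] = {
    "portrait": 3,
    "group photo": 5,
    "pet": 4,
}


def lookup_popularity(info: ElementInfo) -> Optional[int]:
    """Element information -> feature element name -> popularity information."""
    name = FIRST_SUB_RELATIONSHIP.get(info)
    if name is None:
        return None  # no match; the text does not say how this case is handled
    return SECOND_SUB_RELATIONSHIP.get(name)


popularity = lookup_popularity((5, "person", "small"))
```

Splitting the correspondence in two lets the popularity values be updated per feature element name without touching the mapping from element information to names.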
[0144] In some embodiments, prior to S140, the album generation method described above further includes:
[0145] For each of the multiple images to be processed, a target element is determined from at least one feature element in the image to be processed, thus obtaining the target element corresponding to each of the multiple images to be processed.
[0146] Based on the popularity information of the target elements in each of the multiple images to be processed, a sorted timeline is generated according to a preset popularity order.
[0147] The preset popularity order can be pre-set based on actual experience or circumstances, and this application embodiment does not make specific limitations here.
[0148] In some embodiments, the album generation device can determine a target element from at least one feature element in each of a plurality of images to be processed, obtain the target element corresponding to each of the plurality of images to be processed, and then generate a sorted timeline based on the popularity information of the target element of each of the plurality of images to be processed, according to a preset popularity order.
[0149] It should be noted that in the process of determining the target element from at least one feature element in the image to be processed and obtaining the target elements corresponding to multiple images to be processed respectively, the feature element with the highest popularity can be determined as the target element, or the feature element selected by the user can be determined as the target element.
[0150] In this embodiment, target elements can be determined from each image to be processed, and a sorting timeline can be generated based on the popularity information of the target elements in each image to be processed, so as to provide a new album generation method, improve user satisfaction, and enhance the utilization of hardware processing resources.
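The generation of the sorted timeline can be sketched as follows, taking the option in which the feature element with the highest popularity is chosen as the target element. The function names, the image identifiers, the integer popularity values, and the use of descending order as the preset popularity order are all assumptions for illustration; each image is assumed to contain at least one feature element.

```python
from typing import Dict, List


def target_popularity(element_popularity: Dict[str, int]) -> int:
    """Popularity of the target element, taken as the most popular feature element."""
    return max(element_popularity.values())


def build_sorted_timeline(images: Dict[str, Dict[str, int]],
                          descending: bool = True) -> List[str]:
    """Order image identifiers by the popularity of their target elements."""
    return sorted(images, key=lambda name: target_popularity(images[name]),
                  reverse=descending)


# Hypothetical per-image popularity information of the detected feature elements.
images = {
    "img_001.jpg": {"pet": 4, "portrait": 3},
    "img_002.jpg": {"group photo": 5},
    "img_003.jpg": {"portrait": 3},
}
timeline = build_sorted_timeline(images)
```

If the user selects the target element instead, only `target_popularity` would need to change; the sorting step stays the same.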
[0151] Based on the same inventive concept, this application also provides an album generation device. The album generation device provided in this application will be described in detail with reference to Figure 9.
[0152] Figure 9 is a schematic diagram of the structure of an album generation device provided in an embodiment of this application.
[0153] As shown in Figure 9, the album generation device 900 may include:
[0154] The acquisition module 910 is configured to acquire multiple images to be processed for generating the story album;
[0155] The determination module 920 is configured to determine, for each image to be processed, at least one feature element in the image to be processed and the element information of each feature element using a feature element recognition model;
[0156] The determination module 920 is also configured to determine the popularity information of each feature element in each image to be processed;
[0157] The generation module 930 is configured to generate a story album by sorting multiple images to be processed according to a sorting timeline. The sorting timeline is generated based on the popularity information of target elements in each image to be processed. The target element is any one of at least one feature element in the image to be processed.
[0158] In some embodiments, the album generation device provided in this application may further include:
[0159] The recognition module is configured to use the first sub-model of the feature element recognition model to recognize the feature elements in the image to be processed, and obtain stacked feature maps corresponding to at least one feature element in the image to be processed.
[0160] The determination module is specifically configured to use the second sub-model of the feature element recognition model to process the stacked feature map corresponding to each feature element, determine multiple first proposal regions corresponding to the feature element and the region confidence of each first proposal region, and the region confidence of each first proposal region represents the authenticity of the first proposal region.
[0161] The adjustment module is configured to use the third sub-model of the feature element recognition model to adjust the multiple first proposal regions corresponding to each feature element, thereby obtaining multiple second proposal regions corresponding to the feature element.
[0162] The determination module is specifically configured to use the fourth sub-model of the feature element recognition model to recognize the target image corresponding to each feature element, determine the element information of the feature element, and obtain the element information of each feature element in at least one feature element in the image to be processed. The target image of each feature element is obtained by cropping the image to be processed based on the target proposal region corresponding to each feature element. The target proposal region is determined from multiple second proposal regions corresponding to the feature element.
[0163] In some embodiments, the first sub-model includes a first convolutional layer, a first attention mechanism model, and N convolutional kernels; the album generation device provided in this application embodiment may further include:
[0164] The convolution processing module is configured to perform convolution processing on the image to be processed through the first convolutional layer to obtain the first feature map;
[0165] The processing module is configured to process the first feature map through a first attention mechanism model to obtain a second feature map, wherein the second feature map includes the feature element position of at least one feature element in the image to be processed.
[0166] The processing module is also configured to process the third feature map sequentially with N convolution kernels for each feature element in at least one feature element to obtain a stacked feature map corresponding to the feature element. The third feature map is determined by multiplying the first feature map and the second feature map based on the feature element position of the feature element.
[0167] In some embodiments, the above processing module is specifically configured as follows:
[0168] The first feature map is transformed into a first sub-feature code and a second sub-feature code using the first attention mechanism model;
[0169] The first and second sub-feature codes are concatenated to obtain the total feature code, which is then processed using a convolutional transformation function to obtain the fourth feature map.
[0170] The fourth feature map is decomposed according to different dimensional directions to obtain the first and second sub-feature maps;
[0171] The first feature map, the first target sub-feature map, and the second target sub-feature map are processed using the first attention mechanism to obtain the second feature map. The first target sub-feature map is determined based on the first sub-feature map, and the second target sub-feature map is determined based on the second sub-feature map.
[0172] In some embodiments, each convolutional kernel includes a depthwise separable convolutional layer, a second attention mechanism model, and a random deactivation layer; the above processing module is specifically configured as follows:
[0173] For the (i+1)th convolutional kernel among N convolutional kernels, the depthwise separable convolutional layer in the (i+1)th convolutional kernel is used to extract features from the output feature map of the i-th convolutional kernel to obtain the initial feature map corresponding to the (i+1)th convolutional kernel. The output feature map of the i-th convolutional kernel is obtained by processing the second feature map sequentially using the i-th convolutional kernel.
[0174] The initial feature map corresponding to the (i+1)th convolutional kernel is adjusted using the second attention mechanism model in the (i+1)th convolutional kernel, and the adjusted initial feature map corresponding to the (i+1)th convolutional kernel is processed using the random deactivation layer in the (i+1)th convolutional kernel to obtain the reference feature map corresponding to the (i+1)th convolutional kernel.
[0175] Based on the output feature map of the i-th convolutional kernel and the reference feature map corresponding to the (i+1)-th convolutional kernel, the output feature map of the (i+1)-th convolutional kernel is obtained. If i+1 is less than N, i = i+1 is updated, and the process returns to the step of using the depthwise separable convolutional layer in the (i+1)-th convolutional kernel to extract features from the output feature map of the i-th convolutional kernel and obtain the initial feature map corresponding to the (i+1)-th convolutional kernel, until i+j = N.
[0176] In some embodiments, the album generation apparatus provided in this application further includes:
[0177] The extraction module is configured to extract, for each of the multiple first proposal regions corresponding to each feature element, the regional location feature vector of the first proposal region from the first proposal region using the third sub-model of the feature element recognition model.
[0178] The processing module is also configured to perform classification and bounding box regression on the first proposal region based on the regional location feature vector of the first proposal region to obtain the second proposal region.
[0179] In some embodiments, the album generation apparatus provided in this application includes:
[0180] The filtering module is configured to filter at least two third proposal regions from multiple second proposal regions based on the region confidence of each second proposal region corresponding to each feature element, where the region confidence is greater than a confidence threshold.
[0181] The non-maximum suppression module is configured to determine the target proposal region from at least two third proposal regions by performing non-maximum suppression processing on at least two third proposal regions, and crop the image to be processed based on the target proposal region to obtain the target image corresponding to the feature elements.
[0182] In some embodiments, the album generation apparatus provided in this application may further include a training module, which is specifically configured as follows:
[0183] Obtain a training sample set, which includes multiple image samples and a sample label corresponding to each image sample. The sample label corresponding to each image sample is used to characterize the element information of each feature element in at least one feature element in the image sample.
[0184] For each image sample, perform the following steps:
[0185] The first preset sub-model of the preset feature element recognition model is used to identify the reference feature elements in the image sample, and the reference stacked feature map corresponding to at least one reference feature element in the image sample is obtained.
[0186] The second preset sub-model of the preset feature element recognition model is used to process the reference stacked feature map corresponding to each reference feature element to determine multiple first reference proposal regions corresponding to the reference feature element and the reference region confidence of each first reference proposal region. The reference region confidence of each first reference proposal region represents the authenticity of the first reference proposal region.
[0187] The third preset sub-model of the preset feature element recognition model is used to adjust the multiple first reference proposal regions corresponding to each reference feature element to obtain multiple second reference proposal regions of the reference feature element.
[0188] The target reference image corresponding to each reference feature element is identified by the fourth preset sub-model of the preset feature element recognition model, and the reference element information of the reference feature element is determined. The reference element information of each reference feature element in at least one reference feature element in the image sample is obtained. The target reference image corresponding to each reference feature element is obtained by cropping the image sample based on the target reference proposal region corresponding to each reference feature element. The target reference proposal region is determined from multiple second reference proposal regions corresponding to the reference feature element.
[0189] Based on the reference element information corresponding to the target image sample and the sample label of the target image sample, the loss function value of the preset feature element recognition model is determined, where the target image sample is any one of multiple image samples;
[0190] Based on the loss function value of the preset feature element recognition model, the preset feature element recognition model is trained using multiple image samples to obtain the trained feature element recognition model.
[0191] In some embodiments, the element information of each feature element includes the number of sub-elements, element type, and element size of the feature element;
[0192] The acquisition module is also configured to acquire the correspondence between element information and popularity information. This correspondence maps the number of sub-elements, the element type, and the element size to popularity information.
[0193] The determination module is specifically configured to match the number of sub-elements, element type, and element size of the feature element against the correspondence between the number of sub-elements, element type, element size and popularity information, and thereby determine the popularity information of the feature element.
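One simple way to picture this matching step is a lookup table keyed by the three items of element information. The sketch below is only an illustration: the class name, the size buckets, the table entries and the numeric popularity values are all hypothetical, and the text does not say whether the correspondence is an exact-match table, a set of ranges, or something else.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class ElementInfo:
    sub_element_count: int
    element_type: str
    element_size: str  # assumed to be bucketed, e.g. "small" or "large"


# Hypothetical correspondence between element information and popularity.
Correspondence = Dict[Tuple[int, str, str], float]

EXAMPLE_CORRESPONDENCE: Correspondence = {
    (1, "person", "large"): 0.9,
    (3, "person", "small"): 0.7,
    (1, "landscape", "large"): 0.4,
}


def lookup_popularity(info: ElementInfo, table: Correspondence,
                      default: Optional[float] = None) -> Optional[float]:
    """Match the element information against the table; return default if absent."""
    key = (info.sub_element_count, info.element_type, info.element_size)
    return table.get(key, default)
```

Returning a default for unmatched element information is a design choice of this sketch, so that a feature element missing from the table does not interrupt album generation.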
[0194] In some embodiments, the determination module is further configured to determine, for each of the multiple images to be processed, a target element from the at least one feature element in the image, thereby obtaining the target element corresponding to each of the multiple images to be processed.
[0195] The generation module is also configured to generate the sorted timeline based on the popularity information of the target element in each of the multiple images to be processed, according to a preset popularity order.
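The generation of the sorted timeline can be illustrated with the following sketch. It assumes that popularity information is a single number per feature element, that the target element of an image is its most popular feature element, and that the preset popularity order is descending by default. These are assumptions made for illustration; the text only requires the target element to be any one of the feature elements in the image. The function names and the file names in the usage example are hypothetical.

```python
from typing import Dict, List


def pick_target_popularity(element_popularity: Dict[str, float]) -> float:
    """Assumed rule: the target element is the most popular feature element."""
    return max(element_popularity.values())


def build_sorted_timeline(images: Dict[str, Dict[str, float]],
                          descending: bool = True) -> List[str]:
    """Order image identifiers by the popularity of their target elements."""
    targets = {
        image_id: pick_target_popularity(elements)
        for image_id, elements in images.items()
    }
    return sorted(targets, key=targets.__getitem__, reverse=descending)


if __name__ == "__main__":
    album = {
        "a.jpg": {"cat": 0.4, "tree": 0.1},
        "b.jpg": {"cake": 0.9},
        "c.jpg": {"car": 0.6},
    }
    print(build_sorted_timeline(album))
```

Each image must contain at least one feature element, as the method requires; an image with an empty element dictionary would raise an error in this sketch.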
[0196] In this embodiment, multiple images to be processed for generating a story album can be acquired. For each image, a feature element recognition model is used to determine at least one feature element and the element information of each feature element, from which the popularity information of each feature element in each image is determined. The multiple images are then sorted according to a sorted timeline to generate the story album. This timeline is generated based on the popularity information of the target element in each image, where the target element is any one of the at least one feature element in the image and the popularity information characterizes the attention or popularity that a feature element receives. The order in which the multiple images are displayed in the story album is therefore determined by the popularity of the feature elements in each image, which effectively improves user satisfaction.
[0197] Each module in the album generation apparatus provided in the embodiments of this application can implement the method steps of the embodiments shown in Figure 1, Figure 2 or Figure 8 and achieve the corresponding technical effects. For the sake of brevity, they are not described again here.
[0198] Figure 10 shows a schematic diagram of the hardware structure of the electronic device provided in an embodiment of this application.
[0199] An electronic device may include a processor 1001 and a memory 1002 storing computer program instructions.
[0200] Specifically, the processor 1001 may include a central processing unit (CPU), an application specific integrated circuit (ASIC), or one or more integrated circuits that can be configured to implement the embodiments of this application.
[0201] Memory 1002 may include mass storage for data or instructions. By way of example and not limitation, memory 1002 may include a hard disk drive (HDD), floppy disk drive, flash memory, optical disk, magneto-optical disk, magnetic tape, or Universal Serial Bus (USB) drive, or a combination of two or more of these. Where appropriate, memory 1002 may include removable or non-removable (or fixed) media. Where appropriate, memory 1002 may be internal or external to the electronic device. In a particular embodiment, memory 1002 is non-volatile solid-state memory.
[0202] Memory may include read-only memory (ROM), random access memory (RAM), disk storage media devices, optical storage media devices, flash memory devices, and electrical, optical, or other physical / tangible memory storage devices. Therefore, typically, memory includes one or more tangible (non-transitory) computer-readable storage media (e.g., memory devices) encoded with software including computer-executable instructions, and when the software is executed (e.g., by one or more processors), it is operable to perform the operations described with reference to the method according to one aspect of this application.
[0203] The processor 1001 reads and executes computer program instructions stored in the memory 1002 to implement any of the album generation methods in the above embodiments.
[0204] In one example, the electronic device may also include a communication interface 1003 and a bus 1010. As shown in Figure 10, the processor 1001, memory 1002, and communication interface 1003 are connected via the bus 1010 and communicate with each other.
[0205] The communication interface 1003 is mainly used to realize communication between one or more of the modules, devices, units, and equipment in the embodiments of this application.
[0206] Bus 1010 includes hardware, software, or both, and couples the components of the electronic device to each other. By way of example and not limitation, the bus may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association Local Bus (VLB), or other suitable buses, or a combination of two or more of these. Where appropriate, bus 1010 may include one or more buses. Although specific buses are described and illustrated in the embodiments of this application, this application contemplates any suitable bus or interconnect.
[0207] In addition, in conjunction with the album generation method in the above embodiments, this application embodiment can provide a computer storage medium for implementation. The computer storage medium stores computer program instructions; when these computer program instructions are executed by a processor, they implement any of the album generation methods provided in this application embodiment.
[0208] This application also provides a computer program product, which includes a computer program that is executed by a processor to implement any of the album generation methods provided in this application.
[0209] It should be clarified that this application is not limited to the specific configurations and processes described above and shown in the figures. For the sake of brevity, detailed descriptions of known methods are omitted here. In the above embodiments, several specific steps are described and shown as examples. However, the method process of this application is not limited to the specific steps described and shown. Those skilled in the art can make various changes, modifications, and additions, or change the order of steps, after understanding the spirit of this application.
[0210] The functional blocks shown in the above block diagram can be implemented as hardware, software, firmware, or a combination thereof. When implemented in hardware, they can be, for example, electronic circuits, application-specific integrated circuits (ASICs), appropriate firmware, plug-ins, function cards, etc. When implemented in software, the elements of this application are programs or code segments used to perform the required tasks. Programs or code segments can be stored on machine-readable media or transmitted over a transmission medium or communication link via data signals carried on a carrier wave. "Machine-readable media" can include any medium capable of storing or transmitting information. Examples of machine-readable media include electronic circuits, semiconductor memory devices, ROM, flash memory, erasable ROM (EROM), floppy disks, compact disc read-only memory (CD-ROM), optical disks, hard disks, fiber optic media, radio frequency (RF) links, etc. Code segments can be downloaded via computer networks such as the Internet, intranets, etc.
[0211] It should also be noted that the exemplary embodiments mentioned in this application describe methods or systems based on a series of steps or apparatus. However, this application is not limited to the order of the above steps; that is, the steps can be performed in the order mentioned in the embodiments, or in a different order, or several steps can be performed simultaneously.
[0212] The aspects of this disclosure have been described above with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this disclosure. It should be understood that each block in the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable album generation apparatus to create a machine such that these instructions, executable via the processor of the computer or other programmable album generation apparatus, enable the implementation of the functions / actions specified in one or more blocks of the flowchart illustrations and / or block diagrams. Such a processor can be, but is not limited to, a general-purpose processor, a special-purpose processor, a special application processor, or a field-programmable logic circuit. It is also understood that each block in the block diagrams and / or flowchart illustrations, and combinations of blocks in the block diagrams and / or flowchart illustrations, can also be implemented by special-purpose hardware performing the specified functions or actions, or can be implemented by a combination of special-purpose hardware and computer instructions.
[0213] The above are merely specific embodiments of this application. Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working processes of the described systems, modules, and units can be referred to the corresponding processes in the foregoing method embodiments, and will not be repeated here. It should be understood that the protection scope of this application is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in this application, and these modifications or substitutions should all be covered within the protection scope of this application.
Claims
1. A method for generating a photo album, the method comprising: Obtain multiple images to be processed for generating the story album; For each of the images to be processed, a feature element recognition model is used to determine at least one feature element in the image to be processed and the element information of each feature element; Determine the heat information of each feature element in each of the images to be processed; A story album is generated by sorting the multiple images to be processed according to a sorting timeline. The sorting timeline is generated based on the heat information of target elements in each image to be processed. The target element is any one of at least one feature element in the image to be processed.
2. The method of claim 1, wherein, The step of using a feature element recognition model to determine at least one feature element in the image to be processed and the element information of each feature element includes: The first sub-model of the feature element recognition model is used to identify the feature elements in the image to be processed, thereby obtaining stacked feature maps corresponding to at least one feature element in the image to be processed. The second sub-model of the feature element recognition model is used to process the stacked feature map corresponding to each feature element to determine multiple first proposal regions corresponding to the feature element and the region confidence of each first proposal region. The region confidence of each first proposal region represents the authenticity of the first proposal region. The third sub-model of the feature element recognition model is used to adjust the multiple first proposal regions corresponding to each feature element to obtain multiple second proposal regions corresponding to the feature element. The fourth sub-model of the feature element recognition model is used to identify the target image corresponding to each feature element, determine the element information of the feature element, and obtain the element information of each feature element in at least one feature element in the image to be processed. The target image of each feature element is obtained by cropping the image to be processed based on the target proposal region corresponding to each feature element. The target proposal region is determined from multiple second proposal regions corresponding to the feature element.
3. The method of claim 2, wherein, The first sub-model includes a first convolutional layer, a first attention mechanism model, and N convolutional kernels; The step of using the first sub-model of the feature element recognition model to identify feature elements in the image to be processed, and obtaining stacked feature maps corresponding to at least one feature element in the image to be processed, includes: The first convolutional layer is used to perform convolution processing on the image to be processed to obtain a first feature map; The first feature map is processed by the first attention mechanism model to obtain a second feature map, wherein the second feature map includes the feature element position of at least one feature element in the image to be processed; For each feature element in the at least one feature element, the third feature map is processed sequentially by the N convolution kernels to obtain the stacked feature map corresponding to the feature element. The third feature map is determined by multiplying the first feature map and the second feature map based on the feature element position of the feature element.
4. The method of claim 3, wherein, The step of processing the first feature map using the first attention mechanism model to obtain the second feature map includes: The first feature map is transformed into a first sub-feature code and a second sub-feature code through the first attention mechanism model; The first sub-feature code and the second sub-feature code are concatenated to obtain the total feature code, and the total feature code is processed using a convolution transformation function to obtain the fourth feature map; The fourth feature map is decomposed according to different dimensional directions to obtain the first sub-feature map and the second sub-feature map; The first feature map, the first target sub-feature map, and the second target sub-feature map are processed using a first attention mechanism to obtain a second feature map. The first target sub-feature map is determined based on the first sub-feature map, and the second target sub-feature map is determined based on the second sub-feature map.
5. The method of claim 3, wherein, Each of the convolutional kernels includes a depthwise separable convolutional layer, a second attention mechanism model, and a random deactivation layer; the step of processing the third feature map sequentially through the N convolutional kernels for each of the at least one feature element to obtain the stacked feature map corresponding to the feature element includes: For the (i+1)-th convolutional kernel among the N convolutional kernels, feature extraction is performed on the output feature map of the i-th convolutional kernel using the depthwise separable convolutional layer in the (i+1)-th convolutional kernel, to obtain the initial feature map corresponding to the (i+1)-th convolutional kernel. The output feature map of the i-th convolutional kernel is obtained by processing the second feature map sequentially using the i-th convolutional kernel. The initial feature map corresponding to the (i+1)-th convolutional kernel is adjusted using the second attention mechanism model in the (i+1)-th convolutional kernel, and the adjusted initial feature map corresponding to the (i+1)-th convolutional kernel is processed using the random deactivation layer in the (i+1)-th convolutional kernel to obtain the reference feature map corresponding to the (i+1)-th convolutional kernel. Based on the output feature map of the i-th convolutional kernel and the reference feature map corresponding to the (i+1)-th convolutional kernel, the output feature map of the (i+1)-th convolutional kernel is obtained. If i+1 is less than N, i is updated to i+1, and the process returns to the step of using the depthwise separable convolutional layer in the (i+1)-th convolutional kernel to extract features from the output feature map of the i-th convolutional kernel to obtain the initial feature map corresponding to the (i+1)-th convolutional kernel, until i+1 = N.
6. The method of claim 2, wherein, The step of adjusting the multiple first proposal regions corresponding to each feature element using the third sub-model of the feature element recognition model to obtain multiple second proposal regions corresponding to the feature element includes: For each of the multiple first proposal regions corresponding to each feature element, the third sub-model of the feature element recognition model is used to extract the regional location feature vector of the first proposal region from the first proposal region. Based on the regional location feature vector of the first proposal region, classification and bounding box regression are performed on the first proposal region to obtain the second proposal region.
7. The method of claim 2, wherein, Before using the fourth sub-model of the feature element recognition model to recognize the target image corresponding to each feature element, determine the element information of the feature element, and obtain the element information of each feature element in at least one feature element in the image to be processed, the method further includes: Based on the region confidence of each of the multiple second proposal regions corresponding to each feature element, at least two third proposal regions whose region confidence is greater than a confidence threshold are selected from the multiple second proposal regions. By performing non-maximum suppression processing on the at least two third proposal regions, a target proposal region is determined from the at least two third proposal regions, and the image to be processed is cropped based on the target proposal region to obtain the target image corresponding to the feature element.
8. The method of claim 2, wherein, Before determining at least one feature element in the image to be processed and the element information of each feature element using the feature element recognition model, the method further includes: Obtain a training sample set, which includes multiple image samples and a sample label corresponding to each image sample. The sample label corresponding to each image sample is used to characterize the element information of each feature element in at least one feature element in the image sample. For each image sample, perform the following steps: The reference feature elements in the image sample are identified by using the first preset sub-model of the preset feature element recognition model, and reference stacked feature maps corresponding to at least one reference feature element in the image sample are obtained respectively. The reference stacked feature map corresponding to each reference feature element is processed using the second preset sub-model of the preset feature element recognition model to determine multiple first reference proposal regions corresponding to the reference feature element and the reference region confidence of each first reference proposal region. The reference region confidence of each first reference proposal region represents the authenticity of the first reference proposal region. The third preset sub-model of the preset feature element recognition model is used to adjust the multiple first reference proposal regions corresponding to each reference feature element to obtain multiple second reference proposal regions of the reference feature element. The target reference image corresponding to each reference feature element is identified by the fourth preset sub-model of the preset feature element recognition model, and the reference element information of the reference feature element is determined. The reference element information of each reference feature element in at least one reference feature element in the image sample is obtained. The target reference image corresponding to each reference feature element is obtained by cropping the image sample based on the target reference proposal region corresponding to each reference feature element. The target reference proposal region is determined from multiple second reference proposal regions corresponding to the reference feature element. Based on the reference element information corresponding to the target image sample and the sample label of the target image sample, the loss function value of the preset feature element recognition model is determined, wherein the target image sample is any one of multiple image samples; Based on the loss function value of the preset feature element recognition model, the preset feature element recognition model is trained using multiple image samples to obtain the trained feature element recognition model.
9. The method of claim 1, wherein, The element information of each feature element includes the number of sub-elements, element type, and element size of the feature element; Determining the heat information of each feature element in each of the images to be processed includes: Obtain the correspondence between element information and heat information, including the correspondence between the number of sub-elements, element type, element size and heat information; Based on the correspondence between the number of sub-elements, element type, element size and heat information, the number of sub-elements, element type and element size of the feature element are matched to determine the heat information of the feature element.
10. The method of claim 1, wherein, Before generating the story album by sorting the plurality of images to be processed according to a sorting timeline, the method further includes: For each of the plurality of images to be processed, a target element is determined from at least one feature element in the image to be processed, thereby obtaining the target element corresponding to the plurality of images to be processed respectively. Based on the heat information of the target elements in each of the multiple images to be processed, a sorted timeline is generated according to a preset heat order.
11. An album generation apparatus, the apparatus comprising: The acquisition module is configured to acquire multiple images to be processed for generating the story album; The determination module is configured to determine, for each image to be processed, at least one feature element in the image to be processed and the element information of each feature element using a feature element recognition model; The determination module is further configured to determine the heat information of each feature element in each of the images to be processed; The generation module is configured to generate a story album by sorting the plurality of images to be processed according to a sorting timeline, wherein the sorting timeline is generated based on the heat information of a target element in each image to be processed, and the target element is any one of at least one feature element in the image to be processed.
12. An electronic device, comprising: a processor and a memory storing computer program instructions; The processor reads and executes the computer program instructions to implement the album generation method as described in any one of claims 1-10.
13. A computer storage medium, wherein, The computer storage medium stores computer program instructions, which, when executed by a processor, implement the album generation method as described in any one of claims 1-10.
14. A computer program product, wherein, The computer program product includes a computer program that, when executed by a processor, implements the album generation method according to any one of claims 1-10.