Method, device, electronic device and storage medium for video summarization
By analyzing the density and content characteristics of user comments, the video is segmented, and video segment summaries are generated.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ALIBABA (CHINA) CO LTD
- Filing Date
- 2022-11-21
- Publication Date
- 2026-06-19
AI Technical Summary
Existing methods for video segmentation and summary generation are relatively simplistic and cannot closely align with users' viewing perspectives or flexibly adapt to changes in user comments.
By analyzing the density and content characteristics of user comments, the video is segmented, and a summary of the video segments is generated using the comment information.
It enables the generation of video segments and summaries that are closer to users' thoughts based on user comments, and the results can be flexibly adjusted as comments change, yielding diverse segmentation and summary results.
Smart Images

Figure CN115767207B_ABST
Description
Technical Field
[0001] This application relates to the field of video processing technology, and in particular to a method, apparatus, electronic device, and storage medium for video summarization.
Background Technology
[0002] Video segmentation and video summaries help users understand the overall content of a video from a macro perspective. Typically, video segmentation and summaries are based on an understanding of the video's content itself. This often results in relatively simplistic segmentation and summary outputs.
Summary of the Invention
[0003] This application provides a method, apparatus, electronic device, and storage medium for generating video summaries, which analyze videos from the viewer's perspective, thus better reflecting the user's thoughts. As user comments increase and change, the segmentation of the target video also changes accordingly, resulting in a richer summary.
[0004] In a first aspect, embodiments of this application provide a method for generating video summaries, which may include:
[0005] Determine the comment information for the target video;
[0006] By utilizing the density and content characteristics of comment information, the target video is segmented to obtain multiple video clips;
[0007] A summary of the video segment is generated using candidate comment information corresponding to the video segment; the candidate comment information is the comment information for the video segment.
[0008] In a second aspect, embodiments of this application provide a video summary generation apparatus, which may include:
[0009] The comment information determination module is used to determine comment information for the target video;
[0010] The video segmentation module is used to segment the target video by utilizing the density and content characteristics of comment information, resulting in multiple video clips;
[0011] The summary generation module is used to generate a summary of a video segment using candidate comment information corresponding to the video segment; the candidate comment information is the comment information for the video segment.
[0012] In a third aspect, embodiments of this application provide an electronic device, including a memory, a processor, and a computer program stored in the memory, wherein the processor, when executing the computer program, implements the method described in any of the above-mentioned embodiments.
[0013] In a fourth aspect, embodiments of this application provide a computer-readable storage medium storing a computer program, which, when executed by a processor, implements the method described in any of the above-mentioned embodiments.
[0014] Compared with the prior art, this application has the following advantages:
[0015] The comments on a target video can be multi-dimensional, reflecting user feedback. Therefore, according to the embodiments of this application, the target video can be analyzed from the viewer's perspective, more closely aligning with user thoughts. For example, some comments might be from fans of a male actor concentrated in the first time period, while others might be from fans of a female actor concentrated in the second. Based on this, the target video can be segmented according to different actors, and a summary of the corresponding video can be generated based on the content of the comments. For another example, in shots containing props, comments might evaluate the props themselves, while in shots containing actors, some comments might evaluate the actors' clothing or makeup. Based on this, the target video can be segmented according to different shot content, and a summary of the corresponding video can be generated based on the content of the comments. In other words, through the above process, the target video can be segmented based on the user's perspective, and the segmentation and summary results will vary depending on the different comments. Furthermore, as user comments increase and change, the segmentation of the target video will also change accordingly, making it more flexible.
[0016] The above description is only an overview of the technical solution of this application. So that the technical means of this application can be understood more clearly and implemented in accordance with the contents of the specification, and so that the above and other objects, features, and advantages of this application are more apparent, specific embodiments of this application are given below.
Attached Figure Description
[0017] In the accompanying drawings, unless otherwise specified, the same reference numerals throughout the various drawings denote the same or similar parts or elements. These drawings are not necessarily drawn to scale. It should be understood that these drawings depict only some embodiments according to this application and should not be construed as limiting the scope of this application.
[0018] Figure 1 is a schematic diagram of an application scenario for the video summary generation method provided in this application;
[0019] Figure 2 is the first flowchart of the video summary generation method according to an embodiment of this application;
[0020] Figure 3 is a flowchart of segmenting a target video according to an embodiment of this application;
[0021] Figure 4 is a flowchart of a method for determining the content features of comment information according to an embodiment of this application;
[0022] Figure 5 is the second flowchart of the video summary generation method according to an embodiment of this application;
[0023] Figure 6 is the third flowchart of the video summary generation method according to an embodiment of this application;
[0024] Figure 7 is the fourth flowchart of the video summary generation method according to an embodiment of this application;
[0025] Figure 8 is the first structural block diagram of the video summary generation apparatus according to this application;
[0026] Figure 9 is the second structural block diagram of the video summary generation apparatus according to this application; and
[0027] Figure 10 is a block diagram of an electronic device used to implement embodiments of this application.
Detailed Implementation
[0028] In the following description, only certain exemplary embodiments are briefly described. As those skilled in the art will recognize, the described embodiments can be modified in various ways without departing from the concept or scope of this application. Therefore, the drawings and description are considered to be exemplary in nature and not restrictive.
[0029] To facilitate understanding of the technical solutions of the embodiments of this application, the relevant technologies of the embodiments of this application are described below. The following relevant technologies are optional solutions and can be combined with the technical solutions of the embodiments of this application in any way, and all of them fall within the protection scope of the embodiments of this application.
[0030] First, the terms used in this application will be explained.
[0031] The Chinese multimodal pre-trained model (M6, Multi-Modality to Multi-Modality Multitask Mega-transformer) is based on the Transformer architecture and is pre-trained on multiple tasks. This pre-training enables the model to understand and generate both unimodal and multimodal data. The Chinese multimodal pre-trained model can be applied to a range of downstream applications, such as object description generation, visual question answering, and Chinese poetry generation.
[0032] Text summarization algorithms are methods for extracting key information from one or more information sources. Based on words and sentences in the information source, candidate content is generated using textual features such as word frequency and sentence position. Then, external semantic resources are used to select keywords and key sentences from the candidate content to generate the corresponding summary.
[0033] Time segmentation algorithms: These algorithms can divide a video into multiple video segments along the time dimension using specified rules. For example, the specified rules could be shot similarity rules, scene similarity rules, etc.
[0034] Figure 1 is a schematic diagram of an application scenario for implementing the method of the embodiments of this application. When a user watches a target video on a smart terminal such as a mobile phone or tablet, the user interacts with other users by posting comments (bullet comments). The comments from different users can therefore serve as a reference for segmenting the target video, that is, for generating time segments. For example, the (i-n)-th second to the i-th second of the video can be taken as a first time segment, and the i-th second to the (i+m)-th second as a second time segment, where i, n, and m are natural numbers. One basis for segmentation is the density of the comments (the number of comments per unit time). For instance, if the number of comments in a certain time period is several times the number in other time periods, the video segment corresponding to that time period can be determined to be a highlight segment, and the target video can be segmented accordingly. Alternatively, the content of the comments can be identified: when some comments discuss segment A of the target video while other comments discuss segment B, the content of the comments can be used as the basis for segmentation. Content identification can rely on natural language processing techniques, specifically on the textual features of the comments.
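The density-based highlight rule described above can be illustrated with a minimal sketch. It is only an illustration: the function name highlight_periods, the use of the median count as the baseline for "other time periods", and the default multiple of 2 are assumptions that do not appear in this application.

```python
from statistics import median

def highlight_periods(counts, multiple=2.0):
    """Return indices of time periods whose comment count is at least
    `multiple` times the baseline (assumed here: the median over all periods)."""
    if not counts:
        return []
    baseline = median(counts)
    if baseline <= 0:
        return []  # too few comments to define a meaningful baseline
    return [i for i, c in enumerate(counts) if c >= multiple * baseline]

# Hypothetical comment counts for four consecutive time periods.
counts = [3, 4, 12, 3]
print(highlight_periods(counts))
```

In this hypothetical input, only the third period stands out against the baseline, so it would be the candidate highlight segment.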
[0035] The target video can be segmented based on time segments to obtain multiple video clips. Furthermore, the comment information corresponding to each video clip can be analyzed to obtain a text summary for each clip. Ultimately, at least one video clip of the target video and its summary can be obtained. These video clips and their summaries can be used for previewing the target video. For example, at least one preview window can be pre-built; upon receiving a user's selection instruction for the target video, the preview window can display the video clip and its summary, facilitating quick previewing. Compared to segmenting the video based on its content, analyzing the video from the viewer's perspective is closer to the user's understanding. Moreover, as user comments increase and change, the segmentation of the target video can be adjusted accordingly, making it more flexible.
[0036] This application provides a method for video summary generation. Figure 2 is a flowchart of the video summary generation method according to an embodiment of this application. As shown in Figure 2, the method may include:
[0037] Step S201: Determine the comment information for the target video.
[0038] The execution entity in this application embodiment can be the cloud or a client. The target video can be a movie, a TV series, or another video, and it can also be live TV content or a short video. Determining the comment information of the target video can include distinguishing among the subtitles of the target video, the information bars corresponding to live TV broadcasts, and the comment information.
[0039] For example, typically, an information bar appears at the bottom of the target video. Subtitles appear at the bottom of the target video. Comments appear at the top of the target video in a left-to-right scrolling manner, or on the left side of the target video in a bottom-to-top scrolling manner, etc. Based on this, comment information for the target video can be determined through location recognition.
[0040] Step S202: Utilize the density and content characteristics of the comment information to segment the target video, resulting in multiple video clips.
[0041] The density of comment information can be determined based on the number of comments per unit time. For example, the unit time can be determined based on the length of the target video. For longer target videos such as movies or TV series, the unit time can be set to 1 minute, 2 minutes, etc. For shorter target videos such as short videos, the unit time can be 5 seconds, 10 seconds, etc. Based on the number of comments per unit time, the density of comment information appearing in different time periods can be determined.
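As an illustrative sketch of the density calculation described above, the comments can be counted per unit-time bucket. The function name comment_density and the representation of each comment by a timestamp in seconds are assumptions made for the example.

```python
import math
from collections import Counter

def comment_density(timestamps, unit_seconds, video_length):
    """Count the comments falling into each unit-time bucket of the video.

    timestamps   -- comment times in seconds from the start of the video
    unit_seconds -- length of the unit time (e.g. 60 for a movie, 10 for a short video)
    video_length -- length of the video in seconds
    """
    if unit_seconds <= 0:
        raise ValueError("unit_seconds must be positive")
    n_buckets = math.ceil(video_length / unit_seconds)
    # Comments outside the video's time range are ignored.
    counts = Counter(int(t // unit_seconds) for t in timestamps if 0 <= t < video_length)
    return [counts.get(i, 0) for i in range(n_buckets)]

# Hypothetical 30-second short video with a 10-second unit time.
density = comment_density([1.0, 2.5, 9.9, 10.0, 25.0], unit_seconds=10, video_length=30)
print(density)
```

Dividing each count by unit_seconds would give comments per second; the raw count per bucket is kept here because every bucket has the same length.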
[0042] The content features of comment information can be identified using natural language processing (NLP) techniques, resulting in a textual feature representation of the comment information. NLP techniques can include Chinese multimodal pre-trained models. By utilizing the text feature encoding capabilities of these models, the textual features of the comment information can be determined.
[0043] Based on a time-segmentation algorithm, segmentation nodes can be obtained by utilizing the density of comment information in different time periods and the content characteristics of the comment information. Based on these segmentation nodes, the target video can be segmented, ultimately dividing it into at least two video segments.
[0044] Step S203: Generate a summary of the video segment using the candidate comment information corresponding to the video segment; the candidate comment information is the comment information for the video segment.
[0045] For different video clips, comment information corresponding to the video clip can be selected as candidate comment information. For example, if the video clip is the segment between the 4th and 5th minute of the target video, then comment information appearing between the 4th and 5th minute can be used as candidate comment information.
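The selection of candidate comment information by time range can be sketched as follows. The Comment record and the half-open interval convention (start included, end excluded) are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Comment:
    time: float  # seconds from the start of the target video
    text: str

def candidate_comments(comments, start, end):
    """Return the comments posted within [start, end) of the target video."""
    return [c for c in comments if start <= c.time < end]

# Hypothetical comments around the segment between the 4th and 5th minute.
comments = [
    Comment(239.0, "before the segment"),
    Comment(240.0, "segment starts"),
    Comment(299.9, "still inside"),
    Comment(300.0, "next segment"),
]
selected = candidate_comments(comments, start=240, end=300)
print([c.text for c in selected])
```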
[0046] Keywords and key sentences are identified from the candidate comment information through filtering, deduplication, and content recognition. In the representation of these keywords and key sentences, t1 and t2 correspond to the start and end times of the video segment, respectively. Using a text summarization algorithm, based on the textual features of the comment information, a summary of the video segment is finally obtained.
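A minimal frequency-based sketch of the filtering, deduplication, and keyword and key-sentence steps is given below. It is not the summarization algorithm of this application: the function name summarize_comments, whitespace tokenization (a real system for Chinese text would need a word segmenter), and the scoring of sentences by average word frequency are all assumptions made for illustration.

```python
from collections import Counter

def summarize_comments(texts, n_keywords=2):
    """Return (keywords, key_sentence) for the comments of one video segment."""
    # Filter out empty comments and drop exact duplicates, keeping first-seen order.
    cleaned = list(dict.fromkeys(t.strip().lower() for t in texts if t.strip()))
    if not cleaned:
        return [], ""
    # Whitespace tokenization is a stand-in for a real word segmenter.
    freq = Counter(w for t in cleaned for w in t.split())
    keywords = [w for w, _ in freq.most_common(n_keywords)]
    # Key sentence: the comment whose words have the highest average frequency.
    key_sentence = max(
        cleaned, key=lambda t: sum(freq[w] for w in t.split()) / len(t.split())
    )
    return keywords, key_sentence

texts = ["the fight scene is great", "great fight", "great fight", "  ", "nice music"]
keywords, key_sentence = summarize_comments(texts)
print(keywords, key_sentence)
```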
[0047] Compared to segmenting videos based on content, this implementation analyzes the target video from the viewer's perspective, which is closer to the user's thinking. For example, fans of a male actor might focus their evaluations of the target video in the first time period, while fans of a female actor might focus their evaluations in the second time period. Based on this, when segmenting the target video, it can be done according to different actors, and corresponding video summaries can be generated based on the content of the evaluations. That is, through the above process, the target video can be segmented based on the user's perspective, and the segmentation and summary results will be diverse based on different comment information. Furthermore, as user comment information increases and changes, the segmentation of the target video will also change accordingly, thus becoming more flexible.
[0048] As shown in Figure 3, in one possible implementation, step S202, which segments the target video using the density and content features of the comment information to obtain multiple video segments, may include:
[0049] Step S301: Determine the segmentation nodes of the target video by utilizing the density and content characteristics of the comment information.
[0050] The time interval can be set according to the length of the target video, for example 10 seconds, 30 seconds, or 1 minute. The density and content characteristics of the comment information within each time interval are statistically analyzed to obtain statistical results, and these results are used to determine whether the comment information in adjacent time intervals is similar. If the statistical results of two adjacent time intervals are similar, the two intervals can be merged into a single time interval. Conversely, if they are dissimilar, the time point between them can be used as a segmentation node of the target video. For example, with a time interval of 10 seconds, statistical results are obtained for the first time interval (10 to 20 seconds) and the second time interval (20 to 30 seconds). If the two results are similar, the 10-to-30-second span can be merged into a single time interval. If the statistical results for the third time interval (30 to 40 seconds) are in turn similar to those of the second, the 10-to-40-second span can also be merged into a single time interval. If instead the third interval's results are not similar to those of the second, the 30th second can be used as a segmentation node of the target video.
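The merging of adjacent similar intervals can be sketched as follows. The function name merge_similar_intervals and the caller-supplied is_similar predicate are assumptions; in the example, the statistical result of each interval is reduced to a single comment count and two intervals are treated as similar when their counts differ by at most 3, which is purely illustrative.

```python
def merge_similar_intervals(stats, interval_seconds, is_similar, start=0):
    """Merge adjacent fixed-length intervals whose statistics are similar.

    Returns a list of (start_second, end_second) tuples; every internal
    boundary in the result is a segmentation node.
    """
    if not stats:
        return []
    merged = []
    seg_start = start
    for i in range(1, len(stats)):
        # Compare each interval with the one immediately before it.
        if not is_similar(stats[i - 1], stats[i]):
            boundary = start + i * interval_seconds
            merged.append((seg_start, boundary))
            seg_start = boundary
    merged.append((seg_start, start + len(stats) * interval_seconds))
    return merged

# Four 10-second intervals starting at the 10th second (hypothetical counts).
stats = [5, 6, 30, 28]
segments = merge_similar_intervals(
    stats, interval_seconds=10, is_similar=lambda a, b: abs(a - b) <= 3, start=10
)
print(segments)
```

With these hypothetical counts, the first two intervals merge, the third differs from the second, and the 30th second becomes the segmentation node, mirroring the example in the text.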
[0051] Step S302: Use the segmentation nodes to segment the target video to obtain multiple video segments.
[0052] By using the identified segmentation nodes of the target video, the target video can be segmented, thus dividing it into multiple video segments.
[0053] In one possible implementation, step S301, which involves determining the segmentation nodes of the target video using the density and content features of the comment information, may include:
[0054] Step S303: Utilize the density of comments in the target video during the i-th time period and the content features of the comments during the i-th time period to determine the time period features of the i-th time period, where i is a positive integer.
[0055] Based on the number of comments in the i-th time period, the density of comments in that time period can be obtained. This can be achieved by directly using numerical density values as the density feature; alternatively, the density can be represented as a vector to obtain the density feature for the i-th time period. The density feature of comments in the i-th time period can be expressed as: .
[0056] Furthermore, the comment information in the i-th time period can be filtered, deduplicated, and aggregated to obtain usable comment information. Text recognition technology can be used to obtain the content features of the comment information. The content features of the comment information in the i-th time period can be represented as follows: .
[0057] By concatenating the density feature and content feature of the comment information in the i-th time period, we can obtain the time period feature of the i-th time period. The time period feature of the i-th time period can be represented as... .
[0058] Step S304: If the difference between the time period feature of the i-th time period and the time period feature of the (i-1)-th time period is greater than the corresponding difference threshold, the time node between the i-th time period and the (i-1)-th time period is determined as the segmentation node; the i-th time period and the (i-1)-th time period are obtained by pre-segmenting the target video according to a predetermined strategy.
[0059] The difference between the time period feature of the i-th time period and that of the (i-1)-th time period can be determined by feature comparison. If the difference exceeds the corresponding difference threshold, the comment information in the two time periods differs significantly, which suggests that the content of the target video changes at the i-th time period. Therefore, the time node between the i-th and (i-1)-th time periods can be determined as a segmentation node.
[0060] The i-th and (i-1)-th time periods can be obtained by pre-segmenting the target video according to a predetermined strategy. The predetermined strategy can be to divide the video at one-minute intervals, or to divide it into time periods of random length between 10 and 30 seconds.
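Steps S303 and S304 can be sketched as below, under stated assumptions: the time period feature is formed by placing a scalar density value in front of the content feature vector, the difference is measured as Euclidean distance, and the periods are of equal length. None of these choices is specified by this application, and the function names are hypothetical.

```python
import math

def period_feature(density, content_vec):
    """S303: concatenate the density feature with the content feature vector."""
    return [float(density)] + list(content_vec)

def find_segmentation_nodes(features, period_seconds, threshold):
    """S304: return the boundary times (in seconds) at which the feature
    difference between adjacent time periods exceeds the threshold."""
    nodes = []
    for i in range(1, len(features)):
        # Euclidean distance is an assumed difference measure.
        if math.dist(features[i], features[i - 1]) > threshold:
            nodes.append(i * period_seconds)
    return nodes

# Three one-minute periods with hypothetical normalized densities and 2-d content features.
features = [
    period_feature(0.20, [1.0, 0.0]),
    period_feature(0.25, [1.0, 0.0]),
    period_feature(0.90, [0.0, 1.0]),
]
print(find_segmentation_nodes(features, period_seconds=60, threshold=0.5))
```

In practice the density value would likely need to be normalized to a scale comparable with the content vector so that neither part dominates the distance.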
[0061] As shown in Figure 4, in one possible implementation, the method for determining the content characteristics of comment information may include:
[0062] Step S401: Perform feature extraction processing on the word segments in the comment information to obtain the word feature vectors corresponding to the word segments.
[0063] The comment information can first be filtered, for example to remove invalid information. Next, comments with identical or similar content can be clustered. For example, comments with similar content such as "Great!", "Wonderful!", and "That was really interesting!" can be clustered into a group with the same content. After clustering, word segmentation can be performed on the comment information, and feature extraction is then performed on the resulting word segments to obtain the corresponding word feature vectors.
[0064] Step S402: Obtain the content feature vector of the comment information using word feature vectors.
[0065] By concatenating word feature vectors, we can obtain the content feature vector of a single comment. For example, during the concatenation process, natural language processing techniques can be used to concatenate the word feature vectors in an order consistent with Chinese expression.
[0066] Step S403: Perform average pooling on the content feature vectors of multiple comment messages to obtain the content features of the comment messages.
[0067] If there are multiple comments within the current time period, average pooling can be performed on the feature vectors of these multiple comments to obtain an average of the feature vectors. This average can then be used as the content feature of the comments within the current time period.
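Steps S401 to S403 can be sketched as follows. The hash-based word_vector function is only a toy stand-in for a learned text encoder such as the pre-trained model mentioned earlier, and the fixed sizes DIM and MAX_WORDS (with zero-padding and truncation so that all comment vectors have equal length before pooling) are assumptions not stated in this application.

```python
import hashlib

DIM = 4        # size of each word feature vector (assumed)
MAX_WORDS = 3  # comments are truncated or zero-padded to this many words (assumed)

def word_vector(word):
    """S401 (toy stand-in): derive DIM values in [0, 1) from a hash of the word."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return [b / 256 for b in digest[:DIM]]

def comment_vector(text):
    """S402: concatenate the word vectors of one comment in word order."""
    words = text.split()[:MAX_WORDS]
    vec = [x for w in words for x in word_vector(w)]
    return vec + [0.0] * (DIM * MAX_WORDS - len(vec))

def content_feature(texts):
    """S403: average-pool the comment vectors of one time period."""
    vectors = [comment_vector(t) for t in texts]
    if not vectors:
        return [0.0] * (DIM * MAX_WORDS)
    return [sum(col) / len(vectors) for col in zip(*vectors)]

feature = content_feature(["great scene", "so funny here today"])
print(len(feature))
```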
[0068] In one possible implementation, the method for determining the difference may include:
[0069] Assign a first initial weight to the density of comment information.
[0070] Assign a second initial weight to the content features of the comment information.
[0071] According to the specified rules, the first initial weight and the second initial weight are dynamically adjusted to obtain the dynamic adjustment result. The specified rules are determined based on the number of comment messages or the repetition of comment messages.
[0072] Based on the dynamic adjustment results, the differences between the time period characteristics of the i-th time period and the time period characteristics of the (i-1)-th time period are determined.
[0073] When comparing differences, both the density characteristics and content characteristics of the comment information can be considered simultaneously. Specifically, initial weights can be assigned to both. For example, the initial weights can each be 50%, or they can be set to 45% and 55% based on empirical values, etc. The specific values are not limited in the current implementation.
[0074] Furthermore, the initial weights can be dynamically adjusted. The basis for dynamically adjusting the initial weights can be the number of comments or the degree of repetition among them. For example, adjustments can be made based on the number of comments. If the number of comments is less than a corresponding threshold, the weight of the comment density feature can be reduced. As another example, adjustments can be made based on the content repetition of comments. If the proportion of repetitive content exceeds a corresponding threshold, the weight of the comment content feature can be reduced.
[0075] The dynamic adjustment results correspond to the adjustment results of the first initial weight of comment information density and the second initial weight of comment information content features. When comparing differences, the adjustment results can be used as coefficients. Comparisons based on these coefficients yield more objective results.
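The weighting scheme above can be sketched as follows. The thresholds, the fixed reduction step, the renormalization of the two weights so that they sum to 1, and the use of absolute difference and Euclidean distance for the two components are all assumptions made for illustration; this application does not limit the specific values or rules.

```python
import math

def adjust_weights(w_density, w_content, n_comments, repeat_ratio,
                   min_comments=20, max_repeat=0.5, step=0.2):
    """Reduce the density weight when comments are few and the content weight
    when comments are highly repetitive, then renormalize (assumed rule)."""
    if n_comments < min_comments:
        w_density = max(w_density - step, 0.0)
    if repeat_ratio > max_repeat:
        w_content = max(w_content - step, 0.0)
    total = w_density + w_content
    if total == 0:
        return 0.5, 0.5  # fall back to equal weights
    return w_density / total, w_content / total

def weighted_difference(density_a, density_b, content_a, content_b,
                        w_density, w_content):
    """Use the adjusted weights as coefficients of the two difference terms."""
    density_diff = abs(density_a - density_b)
    content_diff = math.dist(content_a, content_b)
    return w_density * density_diff + w_content * content_diff

# Few comments in the period, so the density weight is reduced.
w_d, w_c = adjust_weights(0.5, 0.5, n_comments=5, repeat_ratio=0.1)
print(w_d, w_c)
print(weighted_difference(0.2, 0.6, [1.0, 0.0], [0.0, 1.0], w_d, w_c))
```

The resulting weighted difference can then be compared against the difference threshold of step S304.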
[0076] As shown in Figure 5, in one possible implementation, the method may further include:
[0077] Step S501: Determine the score of the video segment based on the quality of the candidate comment information.
[0078] The quality of the candidate comments can be determined based on both their quantity and their content. The content of the candidate comments can be evaluated along several dimensions, including sentiment, content redundancy, and the number of users.
[0079] Taking the sentiment dimension as an example, based on text recognition technology, the sentiment classification of candidate comments can be determined. In a coarse-grained sense, sentiment classification can include positive, neutral, and negative sentiment. For example, phrases like "That's so interesting!" and "That's hilarious!" can be considered positive sentiment. Phrases like "It's okay!" and "It's acceptable!" can be considered neutral sentiment. Phrases like "That's so awkward!" and "That's boring!" can be considered negative sentiment.
[0080] The content redundancy dimension can be evaluated by clustering the candidate comment information. For example, a threshold N can be set. If the number of clusters is less than N, the content redundancy of the candidate comment information can be considered high, meaning that most candidate comment information contains the same content. If the number of clusters is not less than N, the content redundancy can be considered low, meaning that the candidate comment information is diverse.
[0081] The number of users can be considered the number of users who posted candidate comments. For example, in an extreme case, a candidate comment might be posted by a single user. In this case, the number of users could be considered very low. Conversely, if each candidate comment corresponds to a different user, then the number of users could be considered very high.
[0082] Based on the different dimensions mentioned above, the quality of the candidate comment information can be determined. Based on the quality of the candidate comment information, each video segment can be scored, thus obtaining a score for each video segment.
[0083] Step S502: Based on the scores of the video segments, identify the highlights of the target video.
[0084] The scores of video clips can be sorted, and a specified number of video clips can be selected as highlights of the target video.
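Steps S501 and S502 can be sketched as follows. The SegmentStats record and the scoring formula, which multiplies the comment count by a positive-sentiment ratio, a diversity factor derived from the cluster threshold N, and the ratio of distinct users to comments, are hypothetical; this application does not prescribe how the dimensions are combined.

```python
from dataclasses import dataclass

@dataclass
class SegmentStats:
    segment_id: int
    n_comments: int        # quantity of candidate comments
    positive_ratio: float  # share of comments classified as positive sentiment
    n_clusters: int        # number of clusters among the candidate comments
    n_users: int           # number of distinct users who posted them

def segment_score(s, cluster_threshold=3):
    """S501 (assumed formula): combine quantity, sentiment, redundancy, and users."""
    if s.n_comments == 0:
        return 0.0
    # Fewer clusters than the threshold N means high redundancy, so the score is halved.
    diversity = 1.0 if s.n_clusters >= cluster_threshold else 0.5
    user_spread = s.n_users / s.n_comments
    return s.n_comments * s.positive_ratio * diversity * user_spread

def pick_highlights(stats, k):
    """S502: sort segments by score and keep the top k as highlights."""
    ranked = sorted(stats, key=segment_score, reverse=True)
    return [s.segment_id for s in ranked[:k]]

stats = [
    SegmentStats(0, n_comments=100, positive_ratio=0.8, n_clusters=5, n_users=90),
    SegmentStats(1, n_comments=100, positive_ratio=0.8, n_clusters=1, n_users=10),
    SegmentStats(2, n_comments=10, positive_ratio=0.5, n_clusters=4, n_users=10),
]
print(pick_highlights(stats, k=2))
```

In this hypothetical data, segment 1 has as many comments as segment 0 but ranks last, because its comments are repetitive and come from few users.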
[0085] As shown in Figure 6, in one possible implementation, the method may further include:
[0086] Step S601: Associate the video clip with the target video.
[0087] For the same target video, multiple video segments can be obtained. Each video segment can be associated with the target video to indicate that each video segment is derived from the target video. The video segments can be displayed as a progress bar.
[0088] Step S602: Upon receiving a video clip display instruction, display the video clip in the video preview window of the target video.
[0089] Video clip display instructions can take the form of voice, actions, etc. For example, a voice instruction could be "Play a video clip from [video name]". An action instruction could be the selection of a target video clip through a user action. The display could include showing the video clip in the target video's preview window as a progress bar and displaying a summary of the video clip at a specified location.
[0090] The method for generating the video summary can be implemented as a local application (APP) on the user's terminal, as a functional module of an APP, or as a service provided by the cloud. In the cloud case, the user calls the corresponding interface of the service, uploads the information of interest to the cloud, and receives the results returned by the cloud, such as video clips or video summaries.
[0091] For example, several distributed computing nodes can be deployed in the cloud, each with processing resources such as computing and storage. In the cloud, multiple computing nodes can be organized to jointly provide one or more services within the video summarization method. For example, the service may include determining comment information for a target video; segmenting the target video using the density and content characteristics of the comment information to obtain multiple video segments; and generating a summary of the video segments using candidate comment information corresponding to the video segments, where the candidate comment information is comment information specific to the video segments. Of course, a single computing node can also provide one or more services. The cloud can provide this service by exposing a service interface, which users can call to use the corresponding service.
[0092] According to the solution provided in this embodiment of the application, the cloud can provide a service interface for the information recognition service, referred to as the target service interface. When a user needs to view a video summary, the user device calls the target service interface, thereby sending the cloud a request to invoke it. The cloud determines the computing node that responds to the request and uses the processing resources of that computing node to execute the steps provided in this embodiment of the application.
[0093] This application provides a method for video summary generation. Figure 7 is a flowchart of the video summary generation method according to an embodiment of this application. As shown in Figure 7, the method may include:
[0094] Step S701: Upon receiving a summary generation instruction for the target video, acquire multiple video segments of the target video; each video segment contains a summary; the video segment is obtained by segmenting the target video after determining the comment information for the target video, utilizing the density and content features of the comment information; the summary is generated using candidate comment information corresponding to the video segment; the candidate comment information is comment information specific to the video segment.
[0095] The execution entity in this application embodiment can be a client device such as a smartphone, television, or tablet. The target video can be a movie, a TV series, or another video, and it can also be live television content or a short video. The summary generation instruction can be issued by the user through actions, voice, or other means. Scenarios in which a summary generation instruction is issued can include video previewing, video summary creation, etc.
[0096] The comment information for the target video can be determined by distinguishing among the subtitles of the target video, the information bars corresponding to live TV broadcasts, and the comment information. Summary generation instructions can take the form of speech, actions, etc. For example, a speech instruction could be "Play a video clip from [video name]". An action instruction could be the selection of a target video clip for playback through a specific action.
[0097] The density of comment information can be determined based on the number of comment messages per unit time. For example, the unit time can be determined based on the length of the target video.
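The density calculation described above can be sketched as follows. This is a minimal illustration, not the implementation of this application: the function names, the use of seconds as the time base, and the heuristic for deriving the unit time from the video length (`target_windows`, `min_seconds`) are all assumptions made for the example.

```python
import math


def unit_time_for(video_length, target_windows=60, min_seconds=5):
    """Hypothetical heuristic: pick a unit time (seconds) so that the
    video is divided into roughly `target_windows` windows."""
    return max(min_seconds, math.ceil(video_length / target_windows))


def comment_density(timestamps, video_length, unit_seconds):
    """Return comments per second for each unit-time window.

    timestamps: playback positions (seconds) at which comments were posted.
    """
    if video_length <= 0 or unit_seconds <= 0:
        raise ValueError("video_length and unit_seconds must be positive")
    n_windows = math.ceil(video_length / unit_seconds)
    counts = [0] * n_windows
    for t in timestamps:
        if 0 <= t < video_length:  # ignore comments outside the video
            counts[int(t // unit_seconds)] += 1
    return [c / unit_seconds for c in counts]


densities = comment_density([1, 2, 3, 12, 25], video_length=30, unit_seconds=10)
```

Here three comments fall in the first 10-second window and one in each of the other two, so the first window has the highest density.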
[0098] The content features of comment information can be identified using natural language processing (NLP) techniques, resulting in a textual feature representation of the comment information. NLP techniques can include Chinese multimodal pre-trained models. By utilizing the text feature encoding capabilities of these models, the textual features of the comment information can be determined.
[0099] Based on a time-segmentation algorithm, segmentation nodes can be obtained by utilizing the density of comment information in different time periods and the content characteristics of the comment information. Based on these segmentation nodes, the target video can be segmented, ultimately dividing it into at least two video segments.
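As a rough sketch of how segmentation nodes might be derived, the example below compares each time period with the previous one and marks a node when the combined difference exceeds a threshold. The fixed weights, the normalization of density by its peak, and the use of cosine distance for content features are assumptions of this sketch; the application does not prescribe these choices.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; zero vectors are handled explicitly."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0 if norm_u == norm_v else 1.0
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (norm_u * norm_v)


def find_segmentation_nodes(densities, contents, period_seconds, threshold,
                            w_density=0.5, w_content=0.5):
    """Return the time nodes (seconds) between adjacent periods whose
    weighted feature difference exceeds `threshold`."""
    if len(densities) != len(contents):
        raise ValueError("one content feature is required per time period")
    if not densities:
        return []
    peak = max(densities) or 1.0  # avoid division by zero when no comments
    nodes = []
    for i in range(1, len(densities)):
        density_diff = abs(densities[i] - densities[i - 1]) / peak
        content_diff = cosine_distance(contents[i], contents[i - 1])
        if w_density * density_diff + w_content * content_diff > threshold:
            nodes.append(i * period_seconds)
    return nodes


def split_video(video_length, nodes):
    """Turn segmentation nodes into (start, end) video segments."""
    bounds = [0, *nodes, video_length]
    return list(zip(bounds[:-1], bounds[1:]))


nodes = find_segmentation_nodes(
    densities=[0.1, 0.1, 0.9, 0.9],
    contents=[[1, 0], [1, 0], [0, 1], [0, 1]],
    period_seconds=10,
    threshold=0.5,
)
segments = split_video(40, nodes)
```

In this toy input both the density and the topic change between the second and third periods, so a single node is placed at 20 seconds and the video is divided into two segments.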
[0100] For different video clips, corresponding comment information can be selected as candidate comment information. Based on filtering, deduplication, and content recognition of the candidate comment information, keywords and key sentences are identified. Using a text summarization algorithm, the keywords and key sentences are combined according to grammatical structure to finally obtain a summary of the video clip.
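The filtering, deduplication, and keyword steps can be illustrated with a simple frequency-based sketch. It is a stand-in under stated assumptions: the tokenizer handles English text only (Chinese comments would need a word segmenter), the stop-word list and `min_tokens` filter are hypothetical, and the final step of composing keywords and key sentences into a grammatical summary is not attempted; the function only returns the top keywords and one key sentence.

```python
import re
from collections import Counter

# Hypothetical stop-word list for the example only.
STOPWORDS = {"the", "a", "is", "this", "so", "and", "of", "to"}


def tokenize(text):
    """Lowercase English tokenizer; an assumption for illustration."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if w not in STOPWORDS]


def summarize_segment(comments, min_tokens=2, n_keywords=3):
    """Filter and deduplicate candidate comments, then return
    (keywords, key_sentence) for one video segment."""
    seen = set()
    candidates = []  # (original text, unique tokens in order)
    for text in comments:
        tokens = list(dict.fromkeys(tokenize(text)))
        key = " ".join(tokens)
        if len(tokens) < min_tokens or key in seen:
            continue  # drop very short or duplicate comments
        seen.add(key)
        candidates.append((text.strip(), tokens))
    if not candidates:
        return [], ""
    # Count in how many distinct comments each token appears.
    freq = Counter(t for _, tokens in candidates for t in tokens)
    keywords = [w for w, _ in freq.most_common(n_keywords)]
    # The key sentence is the comment whose tokens are most common on average.
    best = max(candidates, key=lambda c: sum(freq[t] for t in c[1]) / len(c[1]))
    return keywords, best[0]


keywords, key_sentence = summarize_segment([
    "The goal was amazing!",
    "the goal was amazing",   # duplicate after normalization
    "lol",                    # too short, filtered out
    "Striker goal",
    "What a goal by the striker",
])
```

A production system would more likely use a trained text summarization model for the last step; the sketch only shows where filtering, deduplication, and keyword identification sit in the flow.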
[0101] Step S702: Display the video clip and the corresponding summary in the video preview window of the target video.
[0102] The video preview window can be a pre-defined window, such as a pop-up or a picture-in-picture format. The presentation can include displaying video clips in the target video's preview window using a progress bar, and showing a summary of the video clip at a specified location.
[0103] Corresponding to the application scenarios and methods provided in the embodiments of this application, the embodiments of this application also provide a video summarization apparatus. Figure 8 is a structural block diagram of a video summarization apparatus according to an embodiment of this application. The video summarization apparatus may include:
[0104] The comment information determination module 801 is used to determine comment information for the target video.
[0105] The video segmentation module 802 is used to segment the target video by utilizing the density and content characteristics of the comment information to obtain multiple video segments;
[0106] The summary generation module 803 is used to generate a summary of the video segment using candidate comment information corresponding to the video segment; the candidate comment information is comment information for the video segment.
[0107] In one possible implementation, the video segmentation module 802 may include:
[0108] The segmentation node determination submodule is used to determine the segmentation nodes of the target video by utilizing the density and content characteristics of the comment information;
[0109] The video segmentation execution submodule is used to segment the target video using segmentation nodes to obtain multiple video clips.
[0110] In one possible implementation, the segmentation node determination submodule may include:
[0111] The time period feature determination unit is used to determine the time period features of the i-th time period by utilizing the density of the comment information of the target video in the i-th time period and the content features of the comment information in the i-th time period, where i is a positive integer;
[0112] The segmentation node determination execution unit is used to determine the time node between the i-th time period and the (i-1)-th time period as a segmentation node when the difference between the time period feature of the i-th time period and the time period feature of the (i-1)-th time period is greater than the corresponding difference threshold; the i-th time period and the (i-1)-th time period are obtained by pre-segmenting the target video according to a predetermined strategy.
[0113] In one possible implementation, the video segmentation module 802 may further include:
[0114] The word feature vector determination submodule is used to perform feature extraction on the segmented words in the comment information to obtain the word feature vector corresponding to each segmented word;
[0115] The content feature vector determination submodule is used to obtain the content feature vector of each piece of comment information using the word feature vectors;
[0116] The content feature determination submodule is used to perform average pooling on the content feature vectors of multiple pieces of comment information to obtain the content features of the comment information.
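The three submodules above can be sketched in a few lines. The example assumes that word feature vectors are already available as a lookup table (in practice they would come from a pre-trained model), and it assumes that a comment's content feature vector is the mean of its word vectors; the application only states that the comment vector is obtained "using" the word vectors, so that choice is illustrative. All names are hypothetical.

```python
def mean_pool(vectors):
    """Element-wise average of equally sized vectors (average pooling)."""
    if not vectors:
        raise ValueError("at least one vector is required")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def content_feature(comments, word_vectors, dim):
    """comments: list of comments, each a list of segmented words.
    word_vectors: mapping from word to its feature vector.
    Returns one pooled content feature for the whole set of comments."""
    comment_vectors = []
    for words in comments:
        known = [word_vectors[w] for w in words if w in word_vectors]
        if known:  # skip comments with no known words
            comment_vectors.append(mean_pool(known))
    if not comment_vectors:
        return [0.0] * dim
    return mean_pool(comment_vectors)


word_vectors = {"goal": [1.0, 0.0], "great": [0.0, 1.0], "save": [1.0, 1.0]}
feature = content_feature(
    [["great", "goal"], ["save"], ["unknown"]], word_vectors, dim=2
)
```

The first comment pools to [0.5, 0.5], the second to [1.0, 1.0], and the third is skipped, so the pooled content feature is [0.75, 0.75].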
[0117] In one possible implementation, the segmentation node determination execution unit may include:
[0118] The first initial weight allocation subunit is used to allocate a first initial weight to the density of comment information;
[0119] The second initial weight allocation subunit is used to allocate second initial weights to the content features of the comment information;
[0120] The weight dynamic adjustment subunit is used to dynamically adjust the first initial weight and the second initial weight according to the specified rules to obtain the dynamic adjustment result. The specified rules are determined based on the number of comment information or the repetition of comment information.
[0121] The difference determination subunit is used to determine the difference between the time period characteristics of the i-th time period and the time period characteristics of the (i-1)-th time period based on the dynamic adjustment results.
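A minimal sketch of the dynamic weight adjustment is given below. It follows the rule stated in the claims (the density weight decreases when there are too few comments, and the content weight decreases when too much of the content is repetitive), but the concrete thresholds, the decay factor, the exact-match definition of repetition, and the renormalization step are assumptions of this example.

```python
def adjust_weights(comments, w_density=0.5, w_content=0.5,
                   min_count=20, max_repeat_ratio=0.5, decay=0.5):
    """Return (density_weight, content_weight) for one time period.

    comments: list of comment strings in the period.
    The initial weights are lowered according to hypothetical rules and
    then renormalized so that they sum to 1.
    """
    n = len(comments)
    if n < min_count:
        # Few comments: density is a less reliable signal.
        w_density *= decay
    repeat_ratio = 1 - len(set(comments)) / n if n else 0.0
    if repeat_ratio > max_repeat_ratio:
        # Mostly repeated text: content features carry little information.
        w_content *= decay
    total = w_density + w_content
    if total == 0:
        return 0.5, 0.5
    return w_density / total, w_content / total


def period_difference(density_diff, content_diff, weights):
    """Weighted difference between two adjacent time periods."""
    w_density, w_content = weights
    return w_density * density_diff + w_content * content_diff


weights = adjust_weights(["nice"] * 30)  # many comments, all identical
diff = period_difference(density_diff=0.6, content_diff=0.3, weights=weights)
```

With thirty identical comments the content weight is halved, so after renormalization density contributes two thirds of the difference and content one third.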
[0122] In one possible implementation, the apparatus may also include:
[0123] The scoring module is used to determine the score of a video segment by utilizing the quality of the candidate comment information;
[0124] The highlight identification module is used to identify highlights of the target video based on the scores of the video segments.
[0125] In one possible implementation, the apparatus may also include:
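The scoring and highlight selection can be sketched as follows. This passage does not define how comment quality is measured, so the quality function here (based on word count and a `likes` field) is purely hypothetical, as are the `Comment` type and the `top_k` parameter.

```python
from dataclasses import dataclass


@dataclass
class Comment:
    text: str
    likes: int = 0  # hypothetical quality signal


def comment_quality(comment):
    """Hypothetical quality measure: longer and more-liked comments
    count as higher quality, each component capped at 1.0."""
    length_part = min(len(comment.text.split()), 10) / 10
    likes_part = min(comment.likes, 100) / 100
    return length_part + likes_part


def segment_score(comments):
    """Score of a video segment: total quality of its candidate comments."""
    return sum(comment_quality(c) for c in comments)


def pick_highlights(segments, top_k=1):
    """segments: mapping of (start, end) to a list of Comment objects.
    Returns the top_k highest-scoring segments in playback order."""
    ranked = sorted(segments, key=lambda s: segment_score(segments[s]),
                    reverse=True)
    return sorted(ranked[:top_k])


segments = {
    (0, 20): [Comment("ok", likes=1)],
    (20, 40): [
        Comment("what an incredible goal", likes=50),
        Comment("best scene of the episode", likes=80),
    ],
}
highlights = pick_highlights(segments)
```

The second segment collects longer, better-liked comments and is therefore selected as the highlight.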
[0126] The association module is used to associate video clips with target videos;
[0127] The display module is used to display video clips in the video preview window of the target video when a video clip display instruction is received.
[0128] Corresponding to the application scenarios and methods provided in the embodiments of this application, the embodiments of this application also provide a video summarization apparatus. Figure 9 is a structural block diagram of a video summarization apparatus according to an embodiment of this application. The video summarization apparatus may include:
[0129] The video summary generation module 901 is used to obtain multiple video segments of the target video upon receiving a summary generation instruction for the target video; each video segment contains a summary; the video segments are obtained by segmenting the target video after determining the comment information for the target video, using the density and content features of the comment information; the summaries are generated using candidate comment information corresponding to the video segments; the candidate comment information is comment information specific to the video segments.
[0130] The video summary display module 902 is used to display video clips and corresponding summaries in the video preview window of the target video.
[0131] The functions of each module in each device in the embodiments of this application can be found in the corresponding description in the above method, and they have corresponding beneficial effects, which will not be repeated here.
[0132] Figure 10 is a block diagram of an electronic device used to implement embodiments of this application. As shown in Figure 10, the electronic device includes a memory 1010 and a processor 1020. The memory 1010 stores a computer program that can run on the processor 1020. When the processor 1020 executes the computer program, it implements the method described in the above embodiments. The number of memories 1010 and processors 1020 can be one or more.
[0133] The electronic device also includes:
[0134] The communication interface 1030 is used to communicate with external devices and perform data exchange and transmission.
[0135] If the memory 1010, processor 1020, and communication interface 1030 are implemented independently, they can be interconnected via a bus to communicate with each other. This bus can be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, or an Extended Industry Standard Architecture (EISA) bus, etc. The bus can be divided into an address bus, a data bus, a control bus, etc. For ease of representation, the bus is shown as a single thick line in Figure 10, but this does not mean that there is only one bus or one type of bus.
[0136] Optionally, in a specific implementation, if the memory 1010, processor 1020 and communication interface 1030 are integrated on a single chip, then the memory 1010, processor 1020 and communication interface 1030 can communicate with each other through an internal interface.
[0137] This application provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the method provided in this application.
[0138] This application also provides a chip including a processor for calling and executing instructions stored in a memory, causing a communication device with the chip installed to perform the method provided in this application.
[0139] This application also provides a chip, including: an input interface, an output interface, a processor, and a memory. The input interface, output interface, processor, and memory are connected through an internal connection path. The processor is used to execute code in the memory. When the code is executed, the processor is used to execute the method provided in the application embodiment.
[0140] It should be understood that the aforementioned processor can be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor can be a microprocessor or any conventional processor. It is worth noting that the processor can be a processor supporting the Advanced RISC Machines (ARM) architecture.
[0141] Further, optionally, the aforementioned memory may include read-only memory and random access memory. The memory may be volatile memory or non-volatile memory, or may include both. Non-volatile memory may include read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. Volatile memory may include random access memory (RAM), which serves as an external cache. By way of example, but not limitation, many forms of RAM are available. Examples include Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced Synchronous DRAM (ESDRAM), Sync Link DRAM (SLDRAM), and Direct Rambus RAM (DR RAM).
[0142] In the above embodiments, implementation can be achieved, in whole or in part, through software, hardware, firmware, or any combination thereof. When implemented in software, it can be implemented, in whole or in part, as a computer program product. A computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions according to this application are generated. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions can be stored in a computer-readable storage medium or transferred from one computer-readable storage medium to another.
[0143] In the description of this specification, the references to terms such as "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., indicate that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of this application. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples. Moreover, without contradiction, those skilled in the art can combine and integrate the different embodiments or examples described in this specification, as well as the features of those different embodiments or examples.
[0144] Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one of that feature. In the description of this application, "a plurality of" means two or more, unless otherwise explicitly specified.
[0145] Any process or method described in the flowchart or otherwise herein can be understood as representing a module, segment, or portion of code comprising one or more executable instructions for implementing a particular logical function or process. Furthermore, the scope of the preferred embodiments of this application includes additional implementations in which functions may be performed not in the order shown or discussed, including substantially simultaneously or in reverse order depending on the functionality involved.
[0146] The logic and/or steps described in the flowchart or otherwise herein can, for example, be considered a sequenced list of executable instructions for implementing logical functions, and can be embodied in any computer-readable medium for use by, or in conjunction with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from the instruction execution system, apparatus, or device and execute them).
[0147] It should be understood that various parts of this application can be implemented using hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods can be implemented using software or firmware stored in memory and executed by a suitable instruction execution system. All or part of the steps of the methods in the above embodiments can be implemented by a program instructing related hardware, the program being stored in a computer-readable storage medium, which, when executed, includes one or a combination of the steps of the method embodiments.
[0148] Furthermore, the functional units in the various embodiments of this application can be integrated into a processing module, or each unit can exist physically separately, or two or more units can be integrated into a module. The integrated module can be implemented in hardware or as a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it can also be stored in a computer-readable storage medium. This storage medium can be a read-only memory, a disk, or an optical disk, etc.
[0149] The above description is merely an exemplary embodiment of this application, but the scope of protection of this application is not limited thereto. Any person skilled in the art can easily conceive of various variations or substitutions within the technical scope described in this application, and these should all be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
Claims
1. A method for generating a video summary, characterized in that the method comprises: determining comment information for a target video; segmenting the target video using the density and content features of the comment information to obtain multiple video segments, wherein multiple time periods of the target video are determined based on a preset time interval, and if the difference between the time period features of any adjacent time periods of the target video exceeds a corresponding difference threshold, the time node between those adjacent time periods is determined as a segmentation node for segmenting the target video; the time period feature of each time period is a weighted calculation value of the density feature and the content feature of the comment information within that time period, the weight of the density feature decreases when the number of pieces of comment information is less than a corresponding quantity threshold, and the weight of the content feature decreases when the proportion of repetitive content exceeds a corresponding proportion threshold; and generating a summary of the video segment using candidate comment information corresponding to the video segment, the candidate comment information being comment information for the video segment.
2. The method of claim 1, wherein segmenting the target video using the density and content features of the comment information to obtain multiple video segments includes: determining the segmentation nodes of the target video using the density and content features of the comment information; and segmenting the target video using the segmentation nodes to obtain the multiple video segments.
3. The method of claim 2, wherein determining the segmentation nodes of the target video using the density and content features of the comment information includes: determining the time period feature of the i-th time period using the density of the comment information of the target video in the i-th time period and the content features of the comment information in the i-th time period, where i is a positive integer; and if the difference between the time period feature of the i-th time period and the time period feature of the (i-1)-th time period is greater than the corresponding difference threshold, determining the time node between the i-th time period and the (i-1)-th time period as the segmentation node; wherein the i-th time period and the (i-1)-th time period are obtained by pre-segmenting the target video according to a predetermined strategy.
4. The method according to any one of claims 1 to 3, characterized in that determining the content features of the comment information includes: performing feature extraction on the segmented words in the comment information to obtain the word feature vector corresponding to each segmented word; obtaining the content feature vector of the comment information using the word feature vectors; and performing average pooling on the content feature vectors of multiple pieces of comment information to obtain the content features of the comment information.
5. The method of claim 1, further comprising: determining the score of the video segment using the quality of the candidate comment information; and determining the highlights of the target video based on the scores of the video segments.
6. The method of claim 1, further comprising: associating the video segment with the target video; and upon receiving a video segment display instruction, displaying the video segment in the video preview window of the target video.
7. A method for generating a video summary, characterized in that the method comprises: upon receiving a summary generation instruction for a target video, acquiring multiple video segments of the target video, wherein each video segment includes a summary; the video segments are obtained by segmenting the target video, after determining the comment information for the target video, using the density and content features of the comment information; the summary is generated using candidate comment information corresponding to the video segment, the candidate comment information being comment information for the video segment; wherein multiple time periods of the target video are determined based on a preset time interval, and if the difference between the time period features of any adjacent time periods of the target video is greater than the corresponding difference threshold, the time node between the adjacent time periods is determined as a segmentation node for segmenting the target video; the time period feature of each time period is a weighted calculation value of the density feature and the content feature of the comment information in that time period, the weight of the density feature decreases when the number of pieces of comment information is less than the corresponding quantity threshold, and the weight of the content feature decreases when the proportion of repetitive content exceeds the corresponding proportion threshold; and displaying the video segment and its corresponding summary in the video preview window of the target video.
8. An apparatus for generating a video summary, characterized in that the apparatus comprises: a comment information determination module, used to determine comment information for a target video; a video segmentation module, used to segment the target video using the density and content features of the comment information to obtain multiple video segments, wherein multiple time periods of the target video are determined based on a preset time interval, and if the difference between the time period features of any adjacent time periods of the target video exceeds a corresponding difference threshold, the time node between those adjacent time periods is determined as a segmentation node for segmenting the target video; the time period feature of each time period is a weighted calculation value of the density feature and the content feature of the comment information within that time period, the weight of the density feature decreases when the number of pieces of comment information is less than a corresponding quantity threshold, and the weight of the content feature decreases when the proportion of repetitive content exceeds a corresponding proportion threshold; and a summary generation module, used to generate a summary of the video segment using candidate comment information corresponding to the video segment, the candidate comment information being comment information for the video segment.
9. An apparatus for generating a video summary, characterized in that the apparatus comprises: a video summary generation module, used to obtain multiple video segments of a target video upon receiving a summary generation instruction for the target video, wherein each video segment includes a summary; the video segments are obtained by segmenting the target video, after determining the comment information for the target video, using the density and content features of the comment information; the summary is generated using candidate comment information corresponding to the video segment, the candidate comment information being comment information for the video segment; wherein multiple time periods of the target video are determined based on a preset time interval, and if the difference between the time period features of any adjacent time periods of the target video is greater than a corresponding difference threshold, the time node between the adjacent time periods is determined as a segmentation node for segmenting the target video; the time period feature of each time period is a weighted calculation value of the density feature and the content feature of the comment information in that time period, the weight of the density feature decreases when the number of pieces of comment information is less than a corresponding quantity threshold, and the weight of the content feature decreases when the proportion of repetitive content exceeds a corresponding proportion threshold; and a video summary display module, used to display the video segment and the summary corresponding to the video segment in the video preview window of the target video.
10. An electronic device comprising a memory, a processor, and a computer program stored in the memory, wherein the processor, when executing the computer program, implements the method of any one of claims 1-7.
11. A computer-readable storage medium storing a computer program that, when executed by a processor, implements the method of any one of claims 1-7.