Video title generating method and device
A technology for generating video titles, applied in the Internet field. It solves the problem of low efficiency in video title generation, and achieves the effects of improving generation efficiency, increasing the amount of reference information, and saving manpower and material resources.
Active Publication Date: 2018-11-16
SHENZHEN TENCENT NETWORK INFORMATION TECH CO LTD
AI-Extracted Technical Summary
Problems solved by technology
[0005] Embodiments of the present invention provide a video title generation method and dev...
Method used
Moreover, in the game video title generating device, scene information is obtained from the sound feature information and image feature information of the game video, and the generation module then generates the title of the game video from the scene information and the image feature information. This increases the amount of information available for reference when generating the title, so the generated title describes the main content of the game video more accurately, effectively improving its accuracy.
Likewise, the second acquisition module obtains scene information from the sound feature information and image feature information of the video, and the generation module generates the video title from the scene information and the image feature information, which increases the amount of reference information and allows the generated title to describe the main content of the video more accurately, effectively improving the accuracy of the generated video title.
The title generating method for game videos obtains scene information from the sound feature information and image feature information of the game video and then generates the title from the scene information and the image feature information, increasing the reference information available when the title is generated so that it describes the main content of the game video more accurately.
The video title generating method obtains scene information from the sound feature information and image feature information of the video and then generates the video title from the scene information and the image feature information, increasing the reference information available when the title is generated so that the generated title describes the main content of the video more accurately; the accuracy of the generated video title is therefore effectively improved.
Because the keywords recorded in each knowledge base are synonyms or near-synonyms, they carry essentially the same semantics, which makes the keyword screening process a controllable one. Filling the target title template with the screened keywords therefore yields video titles with fluent semantics and improves the readability of the generated titles.
Because the image content of consecutive image frames in a video changes continuously and adjacent frames in the temporal sequence differ only slightly, screening the image frames of the video and obtaining the image feature information of only the screened frames reduces the redundant information in the frames, which reduces the amount of data to be processed when generating the video title and speeds up title generation.
Abstract
The invention discloses a video title generating method and device, belonging to the technical field of the Internet. The method comprises the steps of: obtaining sound feature information and image feature information of a video; obtaining target scene information of the video based on the sound feature information and the image feature information, the target scene information being used to indicate the scene presented by the video; and generating the video title based on the target scene information and the image feature information. The method improves the efficiency of video title generation and is used for generating video titles according to the videos themselves.
Examples
- Experimental program(1)
Example Embodiment
[0036] In order to make the objectives, technical solutions, and advantages of the present application clearer, the following will further describe the embodiments of the present application in detail with reference to the accompanying drawings.
[0037] With the development of technology, more and more users obtain information by watching videos. To meet the needs of different users, service providers generally offer a large number of videos to watch. Before watching, users usually select the video they want from the many videos a service provider offers according to its title, so the video title has an important influence on the viewing rate of the video. For example, to better maintain the game ecosystem and generate greater user stickiness, game service providers produce a large number of game videos every day for users to watch; faced with so many game videos, users usually select the videos they want to watch according to their titles.
[0038] In related technologies, a method for generating a video title is usually: an operator watches the video, and after watching the video, determines the title of the video according to the content of the video. However, when the number of videos for which the title is to be generated is large, the generation efficiency of the video title is low.
[0039] In related technologies, video titles can also be generated through machine learning. For example, a Recurrent Neural Network (RNN), a Convolutional Neural Network (CNN) and an Attention model can be combined, and the combined model used to generate video titles. Figure 1 is a structural diagram of such a model, which includes a multi-level network (each dashed box in Figure 1 represents one level of the network). For each level, an image frame of the video is input to the CNN, which extracts the image features of the frame in the spatial dimension; these features are then input to the RNN, which extracts the image features of the frame in the temporal dimension and passes them to the RNNs in the next level and in the last level, thereby transferring image feature information. The RNN in the last level extracts, in the temporal dimension, the image features of the frame input to it by the CNN at that level and generates the title of the video based on those features together with the features passed to it by the RNNs of the other levels. In this deep-learning implementation, the video title is generated from image features by sampling from a preset word set according to the image features and splicing the sampled words into a title. However, because the sampling process is usually uncontrollable, the title formed from the sampled words is often an incoherent combination of words. In addition, since this implementation generates titles from image features only, some information in the video is lost, so the generated titles describe the main content of the video poorly; that is, their overall accuracy is low.
[0040] To this end, an embodiment of the present invention provides a method for generating a video title: the sound feature information and image feature information of a video are acquired, scene information describing the scene presented by the video is obtained from the sound feature information and the image feature information, and the video title is then generated from the scene information and the image feature information. Compared with the related technologies, video titles can be generated without an operator watching the video, which effectively improves the efficiency of video title generation. Moreover, because the method derives scene information from both the sound feature information and the image feature information and then generates the title from the scene information and the image feature information, the amount of information available for reference when generating the title is increased, so the generated title describes the main content of the video more accurately; the accuracy of the generated video title is therefore effectively improved.
[0041] Figure 2 is a flowchart of a method for generating a video title provided by an embodiment of the present invention. As shown in Figure 2, the method can include:
[0042] Step 201: Acquire sound feature information and image feature information of the video.
[0043] The sound feature information may be information used to describe the properties of a sound source, for example information describing the sounds of different firearms, vehicle sounds, and other sound categories. The image feature information may be information used to describe the content displayed in an image. For example, for an image frame of a game video, the image feature information may describe the killing hero, the killed hero, the kill type, whether the kill occurred under a defense tower, and the blood volume of the killed hero.
[0044] The implementation of step 201 includes two parts: obtaining the image feature information of the video, and obtaining the sound feature information of the video. The two parts are implemented as follows:
[0045] In the first part, the process of obtaining the image feature information of the video may include: obtaining the image feature information of the target image frame among the multiple image frames included in the video.
[0046] The target image frames may be every image frame included in the video, or image frames selected from the video's frames at intervals of a preset duration. The preset duration can be set according to actual needs; for example, taking the first image frame as the starting point, frames at an interval of 1 second (or 0.5 second) can be selected as the target image frames, as in the sketch below.
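As an illustration only, a minimal sketch of such interval-based frame sampling with OpenCV might look as follows; the file name and the 1-second interval are assumptions for the example, not values fixed by the embodiment.

```python
# Sketch: select target image frames at a preset interval (e.g. one per second).
import cv2

def sample_target_frames(video_path, interval_seconds=1.0):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0        # fall back if FPS is unavailable
    step = max(1, int(round(fps * interval_seconds)))
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:                      # keep the first frame, then one per interval
            frames.append(frame)
        index += 1
    cap.release()
    return frames

target_frames = sample_target_frames("game.mp4", interval_seconds=1.0)  # hypothetical file
```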
[0047] Moreover, the method for obtaining the image feature information of the target image frame may include: inputting the target image frame to the CNN to identify an image at a specific position in the target image frame through the CNN to obtain the image feature information of the target image frame.
[0048] The specific positions in the target image frame can be set according to actual needs. For example, for a game video of a ** game, the image feature information to be obtained is information describing the killing hero, the kill type, the killed hero, kills under the defense tower, and the blood volume of the killed hero. Figure 3 is a schematic diagram of a target image frame of such a game video. The avatar of the killing hero is usually displayed at the position shown by dashed box A1 in the image frame, the kill type at the position shown by dashed box A2, and the avatar of the killed hero at the position shown by dashed box A3. Therefore, for this target image frame, the specific positions can be the positions shown by dashed boxes A1, A2 and A3. Recognizing the image at the position of dashed box A1 through the CNN shows that the killing hero in the frame is Yu Ji; recognizing the image at the position of dashed box A2 shows that the kill type is a multi-kill (i.e. the triple kill shown in Figure 3); and recognizing the image at the position of dashed box A3 shows that the killed hero is Zhang Fei. The image feature information of the target image frame describing the killing hero, the killed hero and the kill type is thereby obtained.
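A hedged sketch of this region-based recognition is shown below. The box coordinates and the per-region classifiers are illustrative placeholders only, since the embodiment merely states that the positions are fixed by the game's layout and recognized by a CNN.

```python
# Sketch: crop fixed regions of a target frame and run a per-region recognizer on each.
import numpy as np

REGIONS = {                                   # hypothetical pixel boxes (x, y, w, h) for a 1280x720 frame
    "killer_avatar": (500, 80, 64, 64),       # dashed box A1
    "kill_type":     (580, 80, 120, 40),      # dashed box A2
    "victim_avatar": (716, 80, 64, 64),       # dashed box A3
}

def extract_image_features(frame, classifiers):
    """classifiers maps region name -> callable(crop) -> label (e.g. a trained CNN)."""
    features = {}
    for name, (x, y, w, h) in REGIONS.items():
        crop = frame[y:y + h, x:x + w]
        features[name] = classifiers[name](crop)
    return features

# Dummy classifiers stand in for trained CNNs in this sketch.
dummy = {name: (lambda crop, n=name: f"<{n}-label>") for name in REGIONS}
frame = np.zeros((720, 1280, 3), dtype=np.uint8)
print(extract_image_features(frame, dummy))
```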
[0049] In addition, the target image frame can also be recognized according to a preset display pattern in the image to obtain image feature information. For example, in the display layout of a ** game's video, a circle of size S1 is usually used to display a defense tower and a circle of size S2 is used to display the avatar of the killed hero. When obtaining image feature information, the CNN can therefore detect and recognize the circles of size S1 and size S2 to obtain information describing the defense tower and information describing the killed hero. After both are obtained, the distance between the defense tower and the killed hero can be computed to obtain information describing a kill under the defense tower. Likewise, since the blood volume is usually displayed above a hero's avatar (for example at the position shown by dashed box A4 in Figure 3), once the hero's information is obtained the display position of the blood volume can be determined and the image displayed there recognized, yielding information describing the blood volume of the killed hero.
[0050] Because the image content of consecutive image frames in a video changes continuously, and adjacent frames in the temporal sequence change only slightly, screening the image frames of the video and obtaining the image feature information of only the screened frames reduces the redundant information in the frames. Compared with obtaining the image feature information of every frame in the video, this reduces the amount of data to be processed when generating the video title and thus increases the speed of title generation.
[0051] For the second part, since the sound feature information can be characterized by the Mel cepstrum coefficient features of the sound information, and referring to Figure 4, the process of obtaining the sound feature information of the video may include:
[0052] Step 2011: Obtain Mel cepstrum coefficient characteristics of the sound information.
[0053] Mel-scale Frequency Cepstral Coefficients (MFCC) are cepstral parameters extracted in the Mel-scale frequency domain. By applying the following processing to the sound information (usually represented as a speech signal): pre-emphasis, framing (frame blocking), short-term energy calculation, windowing (Hamming window), fast Fourier transform (FFT) and triangular band-pass filtering, the Mel cepstrum coefficient features of the sound information can be obtained.
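For illustration, a minimal sketch of this MFCC step using librosa follows; librosa performs the framing, windowing, FFT and mel filter-bank stages internally, and the file name and n_mfcc=13 are assumptions for the example.

```python
# Sketch: extract MFCC features from the video's audio track.
import librosa

def extract_mfcc(audio_path, n_mfcc=13):
    y, sr = librosa.load(audio_path, sr=None)                 # load audio at its native sample rate
    y = librosa.effects.preemphasis(y)                        # pre-emphasis
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # shape: (n_mfcc, n_frames)
    return mfcc

mfcc_features = extract_mfcc("game_audio.wav")                # hypothetical file
```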
[0054] Optionally, the sound information may be all of the sound information of the video from its start time to its end time, or sound information selected from it. For example, the sound information may be the sound within a preset time period in the video, or the sound within a preset time period before the moment corresponding to the target image frame.
[0055] Step 2012: Classify the sound information based on the Mel cepstrum coefficient characteristics to obtain sound feature information.
[0056] After the Mel cepstrum coefficient features of the sound information are obtained, they can be input to an RNN, and a classifier (for example a softmax classifier or a support vector machine classifier) can be used to classify the sound information and obtain its sound feature information.
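A hedged sketch of such a classifier is given below: an RNN over the MFCC frames followed by a softmax layer. The GRU size and the sound classes are assumptions for illustration (the text also allows an SVM classifier instead).

```python
# Sketch: classify sound by running an RNN over MFCC frames and applying softmax.
import torch
import torch.nn as nn

class SoundClassifier(nn.Module):
    def __init__(self, n_mfcc=13, hidden=128, n_classes=4):
        super().__init__()
        self.rnn = nn.GRU(input_size=n_mfcc, hidden_size=hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, mfcc_seq):                 # mfcc_seq: (batch, time, n_mfcc)
        _, h = self.rnn(mfcc_seq)                # h: (1, batch, hidden)
        logits = self.fc(h[-1])
        return torch.softmax(logits, dim=-1)     # class probabilities

SOUND_CLASSES = ["gun_M", "gun_AK", "vehicle", "other"]   # hypothetical labels
model = SoundClassifier(n_classes=len(SOUND_CLASSES))
probs = model(torch.randn(1, 500, 13))                    # 500 MFCC frames, untrained demo input
print(SOUND_CLASSES[int(probs.argmax())])
```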
[0057] For example, referring to Figure 5, which shows a target image frame of a shooting game's video, the image content at the position shown by solid box B1 in the target image frame can be recognized through the CNN, yielding the kill type in the shooting game as a multi-kill (i.e. as shown in Figure 5). Correspondingly, the sound information of the 5 seconds before the moment corresponding to the target image frame is obtained and its Mel cepstrum coefficient features extracted; these features are then input to the RNN and the sound category is output through a softmax classifier, yielding sound feature information indicating that the sound is the gunshot of an M gun.
[0058] Step 202: Obtain target scene information of the video based on the sound feature information and the image feature information.
[0059] The target scene information is used to indicate the scene presented by the video. Referring to Figure 6, the implementation of step 202 may include:
[0060] Step 2021: Perform feature fusion on the sound feature information and the image feature information to obtain scene feature information.
[0061] The scene feature information is information used to distinguish different scenes. For example, it may be information used to distinguish scenes in the game such as a low-health kill scene, a kill-under-the-tower scene, a multi-kill scene, or a dragon-fighting scene. Optionally, referring to Figure 7, the implementation of step 2021 may include:
[0062] Step 2021a: Obtain the type of video.
[0063] The type of a video is used to distinguish different video content. For example, types of game videos may include gunfight game videos and real-time battle game videos; types of TV videos may include family drama videos, idol drama videos, and costume drama videos.
[0064] Generally, information indicating the video type is recorded in the related information of the video (for example, in the video's encoding information). When step 2021a is executed, this information can be read to obtain the video type.
[0065] Step 2021b, based on the type of the video, respectively determine the influence weights of the sound feature information and the image feature information on the scene information.
[0066] For different types of videos, sound feature information and image feature information influence the scene information to different degrees. For example, in a gunfight game video the sound feature information has a greater impact on the scene information, whereas in a real-time battle game video the image feature information has a greater impact. Therefore, before fusing the sound feature information and the image feature information, their influence weights on the scene information can be determined according to the type of the video, so that the fusion performed according to these weights yields scene feature information that better matches the video content, and in turn target scene information that better matches the video content.
[0067] Step 2021c: Perform feature fusion on the sound feature information and the image feature information according to the influence weight to obtain scene feature information.
[0068] After the influence weights of the sound feature information and the image feature information on the scene information are determined, the two can be feature-fused according to these weights to obtain the scene feature information. Optionally, the fusion may use an algorithm based on Bayesian decision theory, on sparse representation theory, or on deep learning theory; the embodiment of the present invention does not specifically restrict this.
[0069] For example, suppose the vector W1 is the acquired sound feature information, the vector W2 is the acquired image feature information, and the influence weights of the sound feature information and the image feature information on the scene information are a and b respectively. After feature fusion according to these weights, the scene feature information Z = a×W1 + b×W2 is obtained, as in the sketch below.
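A minimal sketch of this weighted fusion follows; the per-type weight values are illustrative assumptions, since the embodiment only states that the weights depend on the video type.

```python
# Sketch: weighted feature fusion Z = a*W1 + b*W2, with weights chosen by video type.
import numpy as np

FUSION_WEIGHTS = {                      # hypothetical (sound weight a, image weight b) per video type
    "gunfight_game":   (0.6, 0.4),
    "realtime_battle": (0.3, 0.7),
}

def fuse_features(sound_vec, image_vec, video_type):
    a, b = FUSION_WEIGHTS.get(video_type, (0.5, 0.5))
    return a * np.asarray(sound_vec) + b * np.asarray(image_vec)

scene_feature = fuse_features(np.ones(128), np.zeros(128), "gunfight_game")
```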
[0070] Step 2022: Obtain target scene information based on the scene feature information.
[0071] Optionally, a classifier model can be used to obtain the target scene information. Correspondingly, the implementation of step 2022 can include: inputting the scene feature information into a second classifier model, which determines the target scene information from multiple scene information according to the scene feature information. The multiple scene information may be scene information determined during the training of the second classifier model, and may include information indicating scenes such as a low-health kill scene, a kill-under-the-tower scene, a multi-kill scene, or a dragon-fighting scene. Optionally, the second classifier model may be a softmax classifier, a support vector machine classifier, or a similar classifier model.
[0072] As an example, suppose the sound feature information acquired in step 201 describes the sound as the gunshots of an M gun, and the acquired image feature information describes the content displayed in the image as a kill type of multi-kill. The sound feature information and image feature information are feature-fused, and the resulting scene feature information is input to the softmax classifier; after the classifier's computation, the target scene information obtained describes the scene as a multi-kill scene achieved with an M gun.
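The sketch below shows what the second classifier step could look like as a softmax layer over the fused scene feature vector; the scene list and layer sizes are assumptions for illustration, and the layer is untrained.

```python
# Sketch: map the fused scene feature vector to one of several predefined scene labels.
import torch
import torch.nn as nn

SCENES = ["low_health_kill", "kill_under_tower", "multi_kill", "dragon_fight"]
scene_classifier = nn.Sequential(nn.Linear(128, len(SCENES)), nn.Softmax(dim=-1))

scene_feature = torch.randn(1, 128)                    # output of the fusion step
target_scene = SCENES[int(scene_classifier(scene_feature).argmax())]
print(target_scene)
```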
[0073] Step 203: Obtain a target title template of the video based on the target scene information and the image feature information.
[0074] Generally, the same video content can be described by multiple expressions in different formats, and the same scene can be described by multiple expressions in different formats. Correspondingly, different video contents can be described by video titles with the same or similar formats. Therefore, after the target scene information is acquired, a title template can be selected from a plurality of preset title templates according to the target scene information and image feature information, that is, a format for describing the video content is selected.
[0075] Optionally, a classifier model can be used to obtain the target title template of the video. Correspondingly, the implementation of step 203 can include: inputting the target scene information and the image feature information into a first classifier model, which determines the target title template from multiple title templates according to the target scene information and the image feature information. The multiple title templates may be templates determined during the training of the first classifier model. For example, they may include: "(killed hero) (blood volume) (kill type) (killing hero)", "(killing hero) (movement feature) kills (number) people", "(killing hero) flashes a perfect combo to harvest a (number)-kill streak", "(killing hero) pushes the limit to take down (number) people", and so on. In this example, the content in parentheses consists of words to be filled in according to the video content.
[0076] Moreover, because the main content of the video is related not only to the image feature information and the sound feature information but also to the temporal order of the multiple image frames in the video, the second classifier model can be a model able to capture the timing information of the image frames. For example, the second classifier model may be a softmax classifier.
[0077] For example, suppose the acquired target scene information describes the scene as a multi-kill scene achieved with an M gun, and the acquired image feature information describes the kill type as multi-kill. After the target scene information and the image feature information are input into the softmax classifier, the target title template "(killing hero) (movement feature) kills (number) people" can be obtained.
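A hedged sketch of this template selection step follows; the templates paraphrase the examples above, and the scene-to-template mapping is a hypothetical stand-in for what the trained first classifier model would decide from the target scene information and the image feature information.

```python
# Sketch: select the target title template for the recognized scene.
TITLE_TEMPLATES = [
    "{killed_hero} {blood_volume} {kill_type} {killer_hero}",
    "{killer_hero} {movement} kills {count} people",
    "{killer_hero} flashes a perfect combo to harvest a {count}-kill streak",
]

SCENE_TO_TEMPLATE_INDEX = {"multi_kill": 1}        # hypothetical stand-in for the classifier

def select_template(target_scene, image_features):
    index = SCENE_TO_TEMPLATE_INDEX.get(target_scene, 0)
    return TITLE_TEMPLATES[index]

template = select_template("multi_kill", {"kill_type": "multi_kill"})
```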
[0078] Step 204: Obtain multiple target knowledge bases corresponding to the video based on the target scene information.
[0079] Keywords used to describe the scene information are recorded in each target knowledge base, and the multiple target knowledge bases are divided according to different scene features; that is, they record keywords describing the scene information from different angles. For example, for game videos the knowledge bases may include a hero knowledge base, a kill type knowledge base, a kill scene knowledge base, a kill blood-volume knowledge base, a movement feature knowledge base, a hero type knowledge base, and so on, each recording keywords that describe the scene information from a different angle.
[0080] In addition, since each piece of scene information can be characterized by multiple scene features, and each scene feature can correspond to a knowledge base, there is a correspondence among scene information, scene features and knowledge bases. This correspondence can be established in advance, so that when step 204 is performed it can be queried according to the target scene information to obtain the multiple target knowledge bases.
[0081] For example, suppose the target scene information can be characterized by scene features a, b, c, d and e. Querying the pre-established correspondence among scene information, scene features and knowledge bases according to the target scene information shows that scene feature a corresponds to the knowledge base of the hero Angela, scene feature b to the hero type knowledge base, scene feature c to the movement feature knowledge base, scene feature d to the output feature knowledge base, and scene feature e to the multi-kill scene knowledge base. The multiple target knowledge bases obtained are therefore the knowledge base of the hero Angela, the hero type knowledge base, the movement feature knowledge base, the output feature knowledge base and the multi-kill scene knowledge base, and together they describe the scene information from different angles.
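As an illustration, the correspondence lookup could be sketched as below; every name and keyword in it is a placeholder, not content of the embodiment's actual knowledge bases.

```python
# Sketch: query the pre-built correspondence among scene information, scene features and knowledge bases.
FEATURE_TO_KNOWLEDGE_BASE = {
    "scene_feature_a": {"name": "hero Angela",      "keywords": ["Loli Angela", "High Explosion Angela"]},
    "scene_feature_b": {"name": "hero type",        "keywords": ["mage"]},
    "scene_feature_c": {"name": "movement feature", "keywords": ["flexible", "continuous displacement"]},
    "scene_feature_d": {"name": "output feature",   "keywords": ["explosive output"]},
    "scene_feature_e": {"name": "multi-kill scene", "keywords": ["multi-kill"]},
}

SCENE_TO_FEATURES = {
    "multi_kill_scene": ["scene_feature_a", "scene_feature_b", "scene_feature_c",
                         "scene_feature_d", "scene_feature_e"],
}

def get_target_knowledge_bases(target_scene):
    """Return the knowledge bases corresponding to the scene features of the target scene."""
    return [FEATURE_TO_KNOWLEDGE_BASE[f] for f in SCENE_TO_FEATURES[target_scene]]

target_kbs = get_target_knowledge_bases("multi_kill_scene")
```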
[0082] It should be noted that the knowledge bases need to be established in advance, before step 204 is performed. They can be built by manual collection or by data mining. In addition, depending on the type of video for which titles are to be generated, the kinds of keywords recorded in the knowledge bases may differ. For example, the knowledge bases may also record attributes describing heroes, players, teams, and scenes; alternatively, they may record data obtained by analyzing the viewing rates of videos with the same attributes, which is not specifically limited in the embodiment of the present invention.
[0083] Step 205: Generate a title based on the target title template and multiple target knowledge bases.
[0084] Optionally, referring to Figure 8, the implementation of step 205 may include:
[0085] Step 2051: In each target knowledge base, obtain keywords for filling the target title template.
[0086] Since each knowledge base contains multiple keywords describing an attribute, and these keywords may be synonyms or near-synonyms, the keywords in each knowledge base are filtered after the multiple target knowledge bases corresponding to the video have been obtained, yielding the keywords used to fill the target title template.
[0087] Optionally, the filtering may be random selection; alternatively, according to the scene feature corresponding to a knowledge base, the knowledge base information and the scene feature information are both input into a classifier, which selects among the multiple keywords in the knowledge base to obtain the keywords used to fill the target title template.
[0088] For example, suppose the multiple target knowledge bases are the knowledge base of the hero Angela, the hero type knowledge base, the movement feature knowledge base, the output feature knowledge base and the multi-kill scene knowledge base. The keywords recorded in the knowledge base of the hero Angela include {Angela: {hero nicknames: {Loli Angela, High Explosion Angela, Strong Control Angela, ...}}}, the keywords recorded in the movement feature knowledge base include {flexible, continuous displacement, ...}, and the keywords recorded in the output feature knowledge base include {explosive output, ...}. After filtering the keywords recorded in these knowledge bases, the keyword from the knowledge base of the hero Angela used to fill the target title template is "Loli Angela", the keyword from the movement feature knowledge base is "continuous displacement", and the keyword from the output feature knowledge base is "explosive output", as in the sketch below.
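A minimal sketch of the filtering step, assuming random selection (one of the two options named above; the other feeds each knowledge base and its scene feature into a classifier). The knowledge base contents are placeholders.

```python
# Sketch: pick one keyword from each target knowledge base to fill the title template.
import random

TARGET_KNOWLEDGE_BASES = [
    {"name": "hero Angela",      "keywords": ["Loli Angela", "High Explosion Angela", "Strong Control Angela"]},
    {"name": "movement feature", "keywords": ["flexible", "continuous displacement"]},
    {"name": "output feature",   "keywords": ["explosive output"]},
]

def select_keywords(target_knowledge_bases):
    # Random selection among synonyms; a trained classifier could replace random.choice.
    return {kb["name"]: random.choice(kb["keywords"]) for kb in target_knowledge_bases}

selected = select_keywords(TARGET_KNOWLEDGE_BASES)
# e.g. {"hero Angela": "Loli Angela", "movement feature": "continuous displacement", ...}
```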
[0089] Since the multiple keywords recorded in each knowledge base are synonyms or near-synonyms with essentially the same semantics, the keyword filtering process is a controllable process, and filling the target title template with the filtered keywords produces a semantically fluent video title, enhancing the readability of the generated title.
[0090] Step 2052: Use keywords to fill the target title template to obtain the title.
[0091] After obtaining the keywords for filling the target title template, the keywords can be filled to the position of the corresponding attribute in the target title template according to the attributes of the keywords to obtain the title of the video.
[0092] For example, suppose the target title template obtained in step 203 is "(killing hero) (movement feature) kills (number) people"; the keyword obtained in step 2051 from the knowledge base of the hero Angela for filling the template is "Loli Angela", the keyword from the movement feature knowledge base is "continuous displacement", the keyword from the output feature knowledge base is "explosive output", and the kill type determined from the image feature information is multi-kill. When filling the target title template with the keywords, "Loli Angela" is filled into the killing-hero position, "continuous displacement" into the movement-feature position, and the multi-kill into the number position, yielding the title: "Loli Angela continuous displacement kills multiple people".
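The final filling step could be sketched as follows; the template and slot names mirror the illustrative example above and are not fixed by the embodiment.

```python
# Sketch: fill the selected title template with the selected keywords.
template = "{killer_hero} {movement} kills {count} people"
slots = {
    "killer_hero": "Loli Angela",
    "movement": "continuous displacement",
    "count": "multiple",          # derived from the multi-kill type in the image features
}
title = template.format(**slots)
print(title)                      # -> Loli Angela continuous displacement kills multiple people
```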
[0093] Optionally, the video title generation method provided by the embodiment of the present invention may be implemented by the model shown in Figure 9, which includes a multi-level network (each dashed box in Figure 9 represents one level of the network). For each level of the network, step 201 is performed on a target image frame: the frame is input to the CNN, which extracts its image features in the spatial dimension; those features are then input to the RNN, which extracts the frame's image features in the time dimension, yielding the image feature information of the target image frame. The image feature information of the target image frame is also input to the RNNs of the next level and of the last level to transfer the image feature information. At the same time, the sound feature information of the video's sound information is extracted. Step 202 is then executed on the image feature information and the sound feature information: the two are feature-fused to obtain scene feature information, which is input into a softmax classifier (not shown in Figure 9) to obtain the target scene information. Steps 203 and 204 are then performed according to the target scene information. In step 203, the target scene information and the image feature information are input into a softmax classifier (not shown in Figure 9) to obtain the target title template of the video. In step 204, the correspondence among scene information, scene features and knowledge bases is queried according to the target scene information to obtain the multiple target knowledge bases corresponding to it. Finally, step 205 is executed with the target title template obtained in step 203 and the target knowledge bases obtained in step 204: the target title template is filled with keywords recorded in the target knowledge bases to generate the video title.
[0094] In the model shown in Figure 9, inputting the image feature information obtained at each level into the RNNs of the next level and of the last level (that is, using the Attention model) reduces the attenuation of the image feature information, thereby improving the accuracy of the video title generated from that feature information.
[0095] In addition, acquiring both the image feature information and the sound feature information of the video increases the amount of information that can be referenced when generating the title, which improves the machine learning capability, so the video titles generated by the machine learning method describe the main content of the video more accurately; the accuracy of the generated titles is therefore effectively improved. Furthermore, feature information of other modalities besides image and sound can be obtained as needed, and the video title can be generated from the image feature information, the sound feature information and that additional modal feature information to further improve the accuracy of the generated title.
[0096] In summary, the video title generation method provided by the embodiment of the present invention obtains the sound feature information and image feature information of a video, obtains the scene information of the scene presented by the video from the sound feature information and the image feature information, and then generates the video title from the scene information and the image feature information. Compared with the related technologies, video titles can be generated without operators watching the videos, which effectively improves the efficiency of video title generation and saves the manpower and material resources needed to determine video titles.
[0097] Moreover, because the method obtains scene information from both the sound feature information and the image feature information of the video, and then generates the title from the scene information and the image feature information, the amount of information available for reference when generating the title is increased, so the generated title describes the main content of the video more accurately; the accuracy of the generated video title is therefore effectively improved.
[0098] It should be noted that the order of the steps of the video title generation method provided by the embodiment of the present invention can be adjusted appropriately, and steps can be added or removed as required. Any variation readily conceivable by a person skilled in the art within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention, and is not repeated here.
[0099] The embodiment of the present invention also provides a method for generating the title of a game video. As shown in Figure 10, the method can include:
[0100] Step 301: Acquire sound feature information and image feature information of the game video.
[0101] Step 302: Obtain target game scene information of the game video based on the sound feature information and the image feature information.
[0102] Among them, the target game scene information is used to indicate the game scene presented by the game video.
[0103] Step 303: Generate a title of the game video based on the target game scene information and the image feature information.
[0104] For the specific implementation of each of steps 301 to 303, reference can be made to the corresponding steps in the embodiment shown in Figure 2, which are not repeated in this embodiment of the present invention.
[0105] In summary, the method for generating the title of a game video provided by the embodiment of the present invention acquires the sound feature information and image feature information of the game video, obtains the scene information of the game scene presented by the game video from the sound feature information and the image feature information, and then generates the title of the game video from the scene information and the image feature information. Compared with the related technologies, the title of a game video can be generated without an operator watching it, which effectively improves the generation efficiency of game video titles and saves the manpower and material resources needed to determine them.
[0106] In addition, because the method obtains scene information from the sound feature information and image feature information of the game video and then generates the title from the scene information and the image feature information, the amount of information available for reference when generating the title is increased, so the generated game video title describes the main content of the game video more accurately; its accuracy is therefore effectively improved.
[0107] Figure 11 is a schematic structural diagram of a video title generating apparatus provided by an embodiment of the present invention. As shown in Figure 11, the apparatus 800 may include:
[0108] The first acquisition module 801 is used to acquire sound feature information and image feature information of the video.
[0109] The second acquisition module 802 is configured to acquire target scene information of the video based on the sound characteristic information and the image characteristic information, and the target scene information is used to indicate the scene presented by the video.
[0110] The generating module 803 is used to generate the title of the video based on the target scene information and the image feature information.
[0111] Optionally, the generating module 803 can be used for:
[0112] Based on the target scene information and image feature information, the target title template of the video is obtained.
[0113] Based on the target scene information, multiple target knowledge bases corresponding to the video are obtained, each target knowledge base records keywords used to describe the scene information, and the multiple target knowledge bases are divided based on different scene features.
[0114] Based on the target title template and multiple target knowledge bases, a title is generated.
[0115] Optionally, the process of generating the title based on the target title template and multiple target knowledge bases by the generating module 803 may include:
[0116] In each target knowledge base, obtain keywords for filling the target title template.
[0117] Use keywords to fill the target title template to get the title.
[0118] Optionally, the process by which the generating module 803 obtains the target title template of the video based on the target scene information and the image feature information may include:
[0119] The target scene information and the image feature information are input into the first classifier model, and the first classifier model determines the target title template from the multiple title templates according to the target scene information and the image feature information.
[0120] Based on the target scene information, obtain multiple target knowledge bases corresponding to the video, including:
[0121] Based on the target scene information, the corresponding relationship between the scene information and the knowledge base is queried to obtain multiple target knowledge bases.
[0122] Optionally, the process of the second acquiring module 802 acquiring the target scene information of the video based on the sound feature information and the image feature information may include:
[0123] Feature fusion is performed on sound feature information and image feature information to obtain scene feature information.
[0124] Based on the scene feature information, the target scene information is obtained.
[0125] Optionally, the process by which the second acquisition module 802 performs feature fusion on the sound feature information and the image feature information to obtain the scene feature information may include:
[0126] Get the type of video.
[0127] Based on the type of video, the influence weights of sound feature information and image feature information on scene information are determined respectively.
[0128] According to the influence weight, feature fusion is performed on the sound feature information and the image feature information to obtain the scene feature information.
[0129] Optionally, the process by which the second acquiring module 802 acquires the target scene information based on the scene feature information may include: inputting the scene feature information into the second classifier model, which determines the target scene information from multiple scene information according to the scene feature information.
[0130] Optionally, the process of acquiring the sound feature information of the video by the first acquiring module 801 may include:
[0131] Acquire sound information within a preset time period in the video.
[0132] Acquire the sound feature information of the sound information.
[0133] Optionally, the process of acquiring the sound feature information of the sound information by the first acquiring module 801 may include:
[0134] Get the Mel cepstrum coefficient feature of the sound information.
[0135] Based on the characteristics of Mel cepstrum coefficients, the sound information is classified to obtain sound characteristic information.
[0136] Optionally, the process of obtaining the image feature information of the video by the first obtaining module 801 may include: obtaining the image feature information of the target image frame from among the multiple image frames included in the video.
[0137] Optionally, the target image frame is an image frame selected every preset time length among the multiple image frames.
[0138] In summary, in the video title generating apparatus provided by the embodiment of the present invention, the first acquisition module acquires the sound feature information and image feature information of the video, the second acquisition module obtains the scene information of the scene presented by the video from the sound feature information and the image feature information, and the generation module generates the video title from the scene information and the image feature information. Compared with the related technologies, video titles can be generated without operators watching the videos, which effectively improves the efficiency of video title generation and saves the manpower and material resources needed to determine video titles.
[0139] In addition, because the second acquisition module obtains the scene information from the sound feature information and the image feature information of the video, and the generation module then generates the video title from the scene information and the image feature information, the amount of information available for reference when generating the title is increased, so the generated title describes the main content of the video more accurately and its accuracy is effectively improved.
[0140] Figure 12 is a schematic structural diagram of a game video title generating apparatus provided by an embodiment of the present invention. As shown in Figure 12, the apparatus 900 may include:
[0141] The first acquisition module 901 is configured to acquire sound feature information and image feature information of the game video.
[0142] The second acquiring module 902 is configured to acquire target game scene information of the game video based on the sound feature information and image feature information, and the target game scene information is used to indicate the game scene presented by the game video.
[0143] The generating module 903 is used to generate the title of the game video based on the target game scene information and image feature information.
[0144] In summary, in the game video title generating apparatus provided by the embodiment of the present invention, the first acquisition module acquires the sound feature information and image feature information of the game video, the second acquisition module obtains the scene information of the game scene presented by the game video from the sound feature information and the image feature information, and the generation module generates the title of the game video from the scene information and the image feature information. Compared with the related technologies, game video titles can be generated without operators watching the game videos, which effectively improves the generation efficiency of game video titles and saves the manpower and material resources needed to determine them.
[0145] In addition, because the scene information is obtained from the sound feature information and image feature information of the game video, and the generating module then generates the title of the game video from the scene information and the image feature information, the amount of information available for reference when generating the title is increased, so the generated game video title describes the main content of the game video more accurately; its accuracy is therefore effectively improved.
[0146] Regarding the device in the foregoing embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment of the method, and detailed description will not be given here.
[0147] Figure 13 shows a schematic structural diagram of a terminal 1300 provided by an exemplary embodiment of the present invention. The terminal 1300 can be a portable mobile terminal such as a smartphone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III) or MP4 player (Moving Picture Experts Group Audio Layer IV), a laptop computer or a desktop computer. The terminal 1300 may also be called user equipment, a portable terminal, a laptop terminal, a desktop terminal, or other names.
[0148] Generally, the terminal 1300 includes a processor 1301 and a memory 1302.
[0149] The processor 1301 may include one or more processing cores, for example a 4-core processor or an 8-core processor. The processor 1301 may be implemented in at least one hardware form among DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array) and PLA (Programmable Logic Array). The processor 1301 may also include a main processor and a coprocessor: the main processor, also called the CPU (Central Processing Unit), processes data in the awake state, while the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 1301 may be integrated with a GPU (Graphics Processing Unit), which renders and draws the content to be displayed on the display screen. In some embodiments, the processor 1301 may further include an AI (Artificial Intelligence) processor for handling computing operations related to machine learning.
[0150] The memory 1302 may include one or more computer-readable storage media, which may be non-transitory. The memory 1302 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices and flash memory storage devices. In some embodiments, the non-transitory computer-readable storage medium in the memory 1302 stores at least one instruction, which is executed by the processor 1301 to implement the video title generation method, or the game video title generation method, provided in the method embodiments of the present application.
[0151] In some embodiments, the terminal 1300 may optionally further include: a peripheral device interface 1303 and at least one peripheral device. The processor 1301, the memory 1302, and the peripheral device interface 1303 may be connected by a bus or a signal line. Each peripheral device can be connected to the peripheral device interface 1303 through a bus, a signal line, or a circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 1304, a display screen 1305, a camera component 1306, an audio circuit 1307, a positioning component 1308, and a power supply 1309.
[0152] The peripheral device interface 1303 may be used to connect at least one peripheral device related to I/O (Input/Output) to the processor 1301 and the memory 1302. In some embodiments, the processor 1301, the memory 1302, and the peripheral device interface 1303 are integrated on the same chip or circuit board. In some other embodiments, any one or both of the processor 1301, the memory 1302, and the peripheral device interface 1303 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
[0153] The radio frequency circuit 1304 is used for receiving and transmitting RF (Radio Frequency, radio frequency) signals, also called electromagnetic signals. The radio frequency circuit 1304 communicates with a communication network and other communication devices through electromagnetic signals. The radio frequency circuit 1304 converts electrical signals into electromagnetic signals for transmission, or converts received electromagnetic signals into electrical signals. Optionally, the radio frequency circuit 1304 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a user identity module card, and so on. The radio frequency circuit 1304 can communicate with other terminals through at least one wireless communication protocol. The wireless communication protocol includes but is not limited to: World Wide Web, Metropolitan Area Network, Intranet, various generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area network and/or WiFi (Wireless Fidelity, wireless fidelity) network. In some embodiments, the radio frequency circuit 1304 may further include a circuit related to NFC (Near Field Communication), which is not limited in this application.
[0154] The display screen 1305 is used to display the UI (User Interface), which can include graphics, text, icons, videos, and any combination thereof. When the display screen 1305 is a touch display screen, it also has the ability to collect touch signals on or above its surface; the touch signal may be input to the processor 1301 as a control signal for processing, and the display screen 1305 can then also provide virtual buttons and/or a virtual keyboard, also called soft buttons and/or a soft keyboard. In some embodiments there is one display screen 1305, provided on the front panel of the terminal 1300; in other embodiments there are at least two display screens 1305, respectively disposed on different surfaces of the terminal 1300 or in a folding design; in still other embodiments the display screen 1305 may be a flexible display screen disposed on a curved or folding surface of the terminal 1300. The display screen 1305 can even be set to a non-rectangular irregular shape, that is, a special-shaped screen, and can be made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).
[0155] The camera assembly 1306 is used to collect images or videos. Optionally, the camera assembly 1306 includes a front camera and a rear camera; generally, the front camera is set on the front panel of the terminal and the rear camera on its back. In some embodiments there are at least two rear cameras, each being one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize the background blur function, or the main camera and the wide-angle camera fused to realize panoramic shooting, VR (Virtual Reality) shooting, or other fused shooting functions. In some embodiments, the camera assembly 1306 may also include a flash, which can be a single-color-temperature flash or a dual-color-temperature flash; a dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash, and can be used for light compensation under different color temperatures.
[0156] The audio circuit 1307 may include a microphone and a speaker. The microphone is used to collect sound waves of the user and the environment, and convert the sound waves into electrical signals and input them to the processor 1301 for processing, or input to the radio frequency circuit 1304 to implement voice communication. For the purpose of stereo collection or noise reduction, there may be multiple microphones, which are respectively set in different parts of the terminal 1300. The microphone can also be an array microphone or an omnidirectional collection microphone. The speaker is used to convert the electrical signal from the processor 1301 or the radio frequency circuit 1304 into sound waves. The speaker can be a traditional thin-film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, it can not only convert the electrical signal into human audible sound waves, but also convert the electrical signal into human inaudible sound waves for distance measurement and other purposes. In some embodiments, the audio circuit 1307 may also include a headphone jack.
[0157] The positioning component 1308 is used to locate the current geographic location of the terminal 1300 to implement navigation or LBS (Location Based Service). The positioning component 1308 may be a positioning component based on the GPS (Global Positioning System) of the United States, the Beidou system of China, or the Galileo system of the European Union.
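For illustration only, the following sketch shows how an application on such a terminal might request location updates for navigation or LBS using the standard Android LocationManager API. It assumes the location permission has already been granted; the class name TerminalLocator and the update interval and distance values are illustrative assumptions.

```java
// Illustrative sketch: requesting location updates for an LBS feature.
// Assumes ACCESS_FINE_LOCATION has already been granted at runtime.
import android.content.Context;
import android.location.Location;
import android.location.LocationListener;
import android.location.LocationManager;
import android.os.Bundle;

public class TerminalLocator implements LocationListener {

    public void start(Context context) {
        LocationManager lm =
                (LocationManager) context.getSystemService(Context.LOCATION_SERVICE);
        // GPS provider chosen as an example; a network provider would also work.
        lm.requestLocationUpdates(LocationManager.GPS_PROVIDER,
                5000L,  // minimum time between updates, in milliseconds (example value)
                10f,    // minimum distance between updates, in meters (example value)
                this);
    }

    @Override
    public void onLocationChanged(Location location) {
        double latitude = location.getLatitude();
        double longitude = location.getLongitude();
        // Hand the coordinates to the navigation / LBS logic here.
    }

    @Override public void onStatusChanged(String provider, int status, Bundle extras) { }
    @Override public void onProviderEnabled(String provider) { }
    @Override public void onProviderDisabled(String provider) { }
}
```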
[0158] The power supply 1309 is used to supply power to the various components in the terminal 1300. The power supply 1309 may be an alternating current supply, a direct current supply, a disposable battery, or a rechargeable battery. When the power supply 1309 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. A wired rechargeable battery is a battery charged through a wired line, and a wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery can also be used to support fast charging technology.
[0159] In some embodiments, the terminal 1300 further includes one or more sensors 1310. The one or more sensors 1310 include, but are not limited to: an acceleration sensor 1311, a gyroscope sensor 1312, a pressure sensor 1313, a fingerprint sensor 1314, an optical sensor 1315, and a proximity sensor 1316.
[0160] The acceleration sensor 1311 can detect the magnitude of acceleration on the three coordinate axes of the coordinate system established by the terminal 1300. For example, the acceleration sensor 1311 can be used to detect the components of gravitational acceleration on the three coordinate axes. The processor 1301 may control the display screen 1305 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 1311. The acceleration sensor 1311 may also be used to collect game or user motion data.
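The following is a minimal sketch, assuming an Android-style terminal with the standard SensorManager API, of how gravity components collected by an acceleration sensor could drive the landscape/portrait switch described above; the class name OrientationByGravity and the axis-comparison rule are illustrative assumptions.

```java
// Illustrative sketch: switching between landscape and portrait based on the
// gravity components reported by the accelerometer.
import android.app.Activity;
import android.content.Context;
import android.content.pm.ActivityInfo;
import android.hardware.Sensor;
import android.hardware.SensorEvent;
import android.hardware.SensorEventListener;
import android.hardware.SensorManager;

public class OrientationByGravity implements SensorEventListener {

    private final Activity activity;

    public OrientationByGravity(Activity activity) {
        this.activity = activity;
        SensorManager sm =
                (SensorManager) activity.getSystemService(Context.SENSOR_SERVICE);
        sm.registerListener(this, sm.getDefaultSensor(Sensor.TYPE_ACCELEROMETER),
                SensorManager.SENSOR_DELAY_NORMAL);
    }

    @Override
    public void onSensorChanged(SensorEvent event) {
        float x = event.values[0];  // gravity component along the x axis
        float y = event.values[1];  // gravity component along the y axis
        // If gravity acts mainly along the x axis, the device is held sideways.
        if (Math.abs(x) > Math.abs(y)) {
            activity.setRequestedOrientation(ActivityInfo.SCREEN_ORIENTATION_SENSOR_LANDSCAPE);
        } else {
            activity.setRequestedOrientation(ActivityInfo.SCREEN_ORIENTATION_SENSOR_PORTRAIT);
        }
    }

    @Override
    public void onAccuracyChanged(Sensor sensor, int accuracy) { /* not needed here */ }
}
```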
[0161] The gyroscope sensor 1312 can detect the body direction and rotation angle of the terminal 1300, and the gyroscope sensor 1312 can cooperate with the acceleration sensor 1311 to collect the user's 3D actions on the terminal 1300. The processor 1301 can implement the following functions according to the data collected by the gyroscope sensor 1312: motion sensing (such as changing the UI according to the user's tilt operation), image stabilization during shooting, game control, and inertial navigation.
[0162] The pressure sensor 1313 may be arranged on the side frame of the terminal 1300 and/or on the lower layer of the display screen 1305. When the pressure sensor 1313 is arranged on the side frame of the terminal 1300, a holding signal of the user on the terminal 1300 can be detected, and the processor 1301 performs left/right hand recognition or shortcut operations according to the holding signal collected by the pressure sensor 1313. When the pressure sensor 1313 is arranged on the lower layer of the display screen 1305, the processor 1301 controls the operable controls on the UI according to the user's pressure operation on the display screen 1305. The operable controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.
[0163] The fingerprint sensor 1314 is used to collect the user's fingerprint. The processor 1301 identifies the user's identity according to the fingerprint collected by the fingerprint sensor 1314, or the fingerprint sensor 1314 identifies the user's identity according to the collected fingerprint. When it is recognized that the user's identity is a trusted identity, the processor 1301 authorizes the user to perform related sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, paying, and changing settings. The fingerprint sensor 1314 may be provided on the front, back or side of the terminal 1300. When a physical button or a manufacturer logo is provided on the terminal 1300, the fingerprint sensor 1314 can be integrated with the physical button or the manufacturer logo.
[0164] The optical sensor 1315 is used to collect the ambient light intensity. In one embodiment, the processor 1301 may control the display brightness of the display screen 1305 according to the ambient light intensity collected by the optical sensor 1315. Specifically, when the ambient light intensity is high, the display brightness of the display screen 1305 is increased; when the ambient light intensity is low, the display brightness of the display screen 1305 is reduced. In another embodiment, the processor 1301 may also dynamically adjust the shooting parameters of the camera assembly 1306 according to the ambient light intensity collected by the optical sensor 1315.
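A hedged sketch of the brightness adjustment described above, assuming an Android-style terminal: the ambient light reading is mapped to a window brightness value. The lux-to-brightness mapping and the class name BrightnessByAmbientLight are illustrative assumptions, not taken from this application.

```java
// Illustrative sketch: raising or lowering display brightness according to
// the ambient light intensity reported by the light sensor.
import android.app.Activity;
import android.content.Context;
import android.hardware.Sensor;
import android.hardware.SensorEvent;
import android.hardware.SensorEventListener;
import android.hardware.SensorManager;
import android.view.WindowManager;

public class BrightnessByAmbientLight implements SensorEventListener {

    private final Activity activity;

    public BrightnessByAmbientLight(Activity activity) {
        this.activity = activity;
        SensorManager sm =
                (SensorManager) activity.getSystemService(Context.SENSOR_SERVICE);
        sm.registerListener(this, sm.getDefaultSensor(Sensor.TYPE_LIGHT),
                SensorManager.SENSOR_DELAY_NORMAL);
    }

    @Override
    public void onSensorChanged(SensorEvent event) {
        float lux = event.values[0];  // ambient light intensity in lux
        // Map roughly 0-1000 lux to a 0.1-1.0 window brightness (example curve).
        float brightness = Math.max(0.1f, Math.min(1.0f, lux / 1000f));
        WindowManager.LayoutParams lp = activity.getWindow().getAttributes();
        lp.screenBrightness = brightness;  // brighter environment -> brighter screen
        activity.getWindow().setAttributes(lp);
    }

    @Override
    public void onAccuracyChanged(Sensor sensor, int accuracy) { }
}
```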
[0165] The proximity sensor 1316, also called a distance sensor, is usually arranged on the front panel of the terminal 1300. The proximity sensor 1316 is used to collect the distance between the user and the front of the terminal 1300. In one embodiment, when the proximity sensor 1316 detects that the distance between the user and the front of the terminal 1300 is gradually decreasing, the processor 1301 controls the display screen 1305 to switch from the bright-screen state to the off-screen state; when the proximity sensor 1316 detects that the distance between the user and the front of the terminal 1300 is gradually increasing, the processor 1301 controls the display screen 1305 to switch from the off-screen state to the bright-screen state.
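A minimal sketch, again assuming an Android-style terminal, of how proximity readings could trigger the screen-state switch described above; the ScreenController interface is a hypothetical placeholder for the platform-specific logic that actually turns the panel off and on.

```java
// Illustrative sketch: switching the screen state according to the distance
// reported by the proximity sensor on the front panel.
import android.content.Context;
import android.hardware.Sensor;
import android.hardware.SensorEvent;
import android.hardware.SensorEventListener;
import android.hardware.SensorManager;

public class ProximityScreenSwitch implements SensorEventListener {

    /** Hypothetical abstraction over the platform's screen on/off control. */
    public interface ScreenController {
        void turnScreenOff();
        void turnScreenOn();
    }

    private final ScreenController controller;
    private final float maxRange;

    public ProximityScreenSwitch(Context context, ScreenController controller) {
        this.controller = controller;
        SensorManager sm =
                (SensorManager) context.getSystemService(Context.SENSOR_SERVICE);
        Sensor proximity = sm.getDefaultSensor(Sensor.TYPE_PROXIMITY);
        this.maxRange = proximity.getMaximumRange();
        sm.registerListener(this, proximity, SensorManager.SENSOR_DELAY_NORMAL);
    }

    @Override
    public void onSensorChanged(SensorEvent event) {
        float distance = event.values[0];  // distance to the user, in centimeters
        if (distance < maxRange) {
            controller.turnScreenOff();    // user is close: enter the off-screen state
        } else {
            controller.turnScreenOn();     // user moved away: back to the bright-screen state
        }
    }

    @Override
    public void onAccuracyChanged(Sensor sensor, int accuracy) { }
}
```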
[0166] Those skilled in the art can understand that the structure shown in Figure 13 does not constitute a limitation on the terminal 1300, which may include more or fewer components than shown, combine some components, or adopt a different component arrangement.
[0167] The embodiment of the present invention also provides a computer-readable storage medium. The storage medium is a non-volatile storage medium and stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the video title generating method or the game video title generating method provided in the above embodiments of the present application.
[0168] The embodiment of the present invention also provides a computer program product. The computer program product stores instructions that, when run on a computer, enable the computer to execute the video title generating method or the game video title generating method provided by the embodiments of the present invention.
[0169] The embodiment of the present invention also provides a chip. The chip includes a programmable logic circuit and/or program instructions, and when the chip runs, it can execute the video title generating method or the game video title generating method provided by the embodiments of the present invention.
[0170] In the embodiments of the present invention, the qualifier "and/or" describes three possible relationships: "A and/or B" means that A exists alone, B exists alone, or A and B exist together.
[0171] Those of ordinary skill in the art can understand that all or part of the steps in the foregoing embodiments can be implemented by hardware, or by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium. The storage medium mentioned may be a read-only memory, a magnetic disk, an optical disk, or the like.
[0172] The above descriptions are only preferred embodiments of this application and are not intended to limit this application. Any modification, equivalent replacement, improvement, and the like made within the spirit and principle of this application shall be included within the protection scope of this application.