Video processing methods and related equipment
By segmenting and deduplicating video content based on semantic analysis, the method addresses storage and computing inefficiencies, enhancing video editing efficiency and quality.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Applications
- Current Assignee / Owner
- BEIJING ZITIAO NETWORK TECH CO LTD
- Filing Date
- 2024-08-13
- Publication Date
- 2026-07-02
AI Technical Summary
High-quality, long-length video footage consumes significant storage space and computing resources, leading to inefficiencies in video editing and increased workload.
Divide video into scene segments based on transition positions, determine text descriptions for each segment, and remove segments with overlapping semantics using semantic relationships to obtain a target video.
Reduces memory and computational requirements while ensuring complete content presentation, improving video editing efficiency and balancing presentation quality.
Abstract
Description
Technical Field
[0001] (Cross-reference to Related Applications) This application claims the priority of a Chinese patent application with application number 2023110221283 and title "Video Processing Method and Related Devices" filed on August 14, 2023. The entire content of the said application is incorporated herein by reference.
[0002] (Field of the Invention) The present disclosure relates to the field of computer technology, and particularly to video processing methods, devices, equipment, media, and program products.
Background Art
[0003] Video creation that relies on high-quality, long-duration footage typically occupies a large amount of storage space. Loading such footage into video editing software strains the storage space and computing power of the editing equipment and demands more computing resources, which reduces the efficiency with which users process video and increases the editing workload.
Summary of the Invention
Problems to be Solved by the Invention
[0004] The present disclosure proposes a video processing method, device, equipment, storage medium, and program product to address, to some extent, the technical problem of low efficiency in video editing.
Means for Solving the Problems
[0005] A first aspect of the present disclosure provides a video processing method which includes acquiring a video to be processed, dividing the video to be processed into a plurality of scene video segments based on scene transition positions, determining a text description corresponding to each of the scene video segments, and removing the scene video segments having overlapping semantics in the video to be processed based on semantic relationships between the text descriptions to obtain a target video.
[0006] A second aspect of the present disclosure provides a video processing apparatus comprising: an acquisition module for acquiring a video to be processed; a segmentation module for dividing the video to be processed into a plurality of scene video segments based on scene transition positions; a text module for determining a text description corresponding to each of the scene video segments; and a deduplication module for removing scene video segments having overlapping semantics in the video to be processed based on semantic relationships between the text descriptions to obtain a target video.
[0007] A third aspect of this disclosure provides an electronic device including memory, one or more processors, and one or more computer programs stored in the memory and executable on one or more processors, the programs including instructions for performing the method described in the first or second aspect.
[0008] In a fourth aspect of the present disclosure, a non-volatile computer-readable storage medium storing a computer program is provided; when the computer program is executed by one or more processors, it causes the processors to perform the method described in the first or second aspect.
[0009] In a fifth aspect of this disclosure, a computer program product is provided which includes a computer program instruction that, when executed on a computer, causes the computer to perform the method described in the first aspect.
Brief Description of the Drawings
[0010] To more clearly illustrate the technical solutions in this disclosure or related technologies, the accompanying drawings used in the descriptions of embodiments or related technologies are briefly introduced below. However, the accompanying drawings in the following descriptions are merely embodiments of this disclosure, and it will be apparent to those skilled in the art that other drawings can be obtained from these drawings without any creative effort.
[Figure 1] A schematic diagram of the video processing architecture of an embodiment of the disclosure.
[Figure 2] A schematic diagram of the hardware structure of an exemplary electronic device according to an embodiment of the present disclosure.
[Figure 3] A schematic flowchart illustrating the video processing method according to the embodiments of this disclosure.
[Figure 4] A schematic diagram of the video processing method according to an embodiment of the disclosure.
[Figure 5] A schematic diagram of a video processing device according to an embodiment of the disclosure.
Modes for Carrying Out the Invention
[0011] The purpose, technical solutions, and benefits of this disclosure will be described in more detail below, along with specific embodiments, with reference to the accompanying drawings, in order to make them clearer and easier to understand.
[0012] Unless otherwise defined, technical or scientific terms used in the embodiments of this disclosure have the ordinary meaning understood by those skilled in the art to which this disclosure belongs. Terms such as “first” and “second” used in the embodiments of this disclosure do not imply any order, quantity, or importance; they are used simply to distinguish different components. “Includes,” “comprises,” and similar terms mean that the element or object appearing before the term encompasses the elements or objects listed after the term and their equivalents, without excluding other elements or objects. Terms such as “connected” or “coupled” are not limited to physical or mechanical connections and may include electrical connections, whether direct or indirect. “Up,” “down,” “left,” “right,” etc., are used only to describe relative positions; if the absolute position of the object being described changes, the relative position may change accordingly.
[0013] Before using the technical solutions disclosed in each embodiment of this disclosure, please understand that, in accordance with applicable laws and regulations, it is necessary to notify users of the types, scope, and scenarios of use of personal information related to this disclosure and to obtain their consent in an appropriate manner.
[0014] For example, in response to receiving a voluntary request from a user, prompt information is sent to the user, explicitly informing the user that the requested operation requires the acquisition and use of the user's personal information. This allows the user to independently choose, based on the prompt information, whether or not to provide personal information to software or hardware such as electronic devices, applications, servers, or storage media that perform the operation of the technical solution of this disclosure.
[0015] In an optional but non-limiting embodiment, in response to receiving a voluntary request from the user, prompt information is sent to the user, for example, in the form of a pop-up window in which the prompt information is presented in text form. Furthermore, the pop-up window may include option controls for the user to select whether to "agree" or "disagree" to providing personal information to the electronic device.
[0016] The above notice and user authorization process are general in nature and do not limit the ways in which this disclosure may be implemented. Please understand that other methods that comply with applicable laws and regulations may be applied in how this disclosure may be implemented.
[0017] Figure 1 shows a schematic diagram of a video processing architecture according to an embodiment of the present disclosure. Referring to Figure 1, the video processing architecture 100 may include a server 110, a terminal 120, and a network 130 that provides communication links. The server 110 and the terminal 120 can be connected by a wired or wireless network 130. Here, the server 110 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, security services, and CDNs.
[0018] Terminal 120 can be implemented in hardware or software. For example, if terminal 120 is implemented in hardware, it may be any electronic device that has a display and supports page display, including but not limited to smartphones, tablet PCs, e-book readers, laptop computers, and desktop computers. If terminal 120 is implemented in software, it can be installed on any of the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (e.g., software or software modules for providing distributed services) or as a single piece of software or software module, without specific limitation.
[0019] The video processing method according to the embodiment of this application may be executed by terminal 120 or by server 110. Please understand that the number of terminals, networks, and servers in Figure 1 are merely examples and are not intended to limit their use. Any number of terminals, networks, and servers can be used depending on the implementation needs.
[0020] Figure 2 shows a schematic diagram of the hardware structure of an exemplary electronic device 200 according to an embodiment of the present disclosure. As shown in Figure 2, the electronic device 200 may include a processor 202, a memory 204, a network module 206, a peripheral interface 208, and a bus 210. Here, the processor 202, the memory 204, the network module 206, and the peripheral interface 208 communicate with each other internally within the electronic device 200 via the bus 210.
[0021] The processor 202 may be a central processing unit (CPU), a video processor, a neural network processor (NPU), a microcontroller (MCU), a programmable logic device, a digital signal processor (DSP), an application specific integrated circuit (ASIC), or one or more integrated circuits. The processor 202 may be used to execute functions related to the technologies described in this disclosure. In some embodiments, the processor 202 may further include multiple processors integrated into a single logic component. For example, as shown in FIG. 2, the processor 202 may include multiple processors 202a, 202b, and 202c.
[0022] The memory 204 may be configured to store data (e.g., instructions, computer code, etc.). As shown in FIG. 2, the data stored in the memory 204 may include program instructions (e.g., program instructions for implementing the video processing method of the embodiments of this disclosure) and data to be processed (e.g., the memory may store configuration files of other modules, etc.). The processor 202 can access the program instructions and data stored in the memory 204 and execute the program instructions to operate on the data to be processed. The memory 204 may include a volatile storage device or a non-volatile storage device. In some embodiments, the memory 204 may include a random access memory (RAM), a read only memory (ROM), an optical disk, a magnetic disk, a hard disk, a solid state drive (SSD), a flash memory, a memory stick, etc.
[0023] The network module 206 may be configured to provide communication between the electronic device 200 and other external devices via a network. This network may be any wired or wireless network capable of transmitting and receiving data. For example, this network may be a wired network, a local wireless network (e.g., Bluetooth®, WiFi®, Near Field Communication (NFC), etc.), a cellular network, the Internet, or a combination of the above. As can be understood, the type of network is not limited to the above specific examples. In some embodiments, the network module 206 may include any combination of any number of network interface controllers (NICs), radio frequency modules, transceivers, modems, routers, gateways, adapters, cellular network chips, etc.
[0024] The peripheral interface 208 may be configured to connect the electronic device 200 to one or more peripheral devices in order to realize input and output of information. For example, the peripheral devices may include input devices such as a keyboard, a mouse, a touch pad, a touch screen, a microphone, various sensors, etc. and output devices such as a display, a speaker, a vibrator, an indicator, etc.
[0025] The bus 210 may be configured to transmit information between various components of the electronic device 200 (e.g., the processor 202, the memory 204, the network module 206, and the peripheral interface 208), such as an internal bus (e.g., a processor-memory bus), an external bus (USB port, PCI-E bus), etc.
[0026] Although the architecture of the electronic device 200 shown above only includes the processor 202, memory 204, network module 206, peripheral interface 208, and bus 210, in a specific implementation, the architecture of the electronic device 200 may include other components necessary to achieve normal operation. A person skilled in the art will understand that the architecture of the electronic device 200 may include only the components necessary to implement the embodiments of this disclosure, and does not necessarily have to include all the components shown.
[0027] The development of recording equipment allows people to shoot video footage anytime, anywhere, and the advancement of large-capacity storage devices allows for increasingly longer shooting times. This high-quality, large-capacity, and lengthy video footage has undoubtedly presented significant obstacles and difficulties for video creators. Loading this footage into video editing software requires more computing resources, placing higher demands on editing equipment. Therefore, how to conserve the memory and computing resources consumed by video editing, how to improve video editing efficiency, and how to balance these against the presentation quality of the edited video are urgent technical problems that need to be addressed.
[0028] In this regard, embodiments of the present disclosure propose a video processing method and related equipment. The video to be processed is divided into multiple scene video segments by scene, and scene video segments with overlapping semantics are removed based on the semantic relationships between the segments. This ensures that the complete content of the video to be processed is still reflected while reducing duplicate pictures or content, saving memory and computing resources, improving video editing efficiency, and balancing these gains against the presentation quality of the edited video.
[0029] Specifically, transition detection is performed on the video to be processed to determine one or more scene transition positions. Based on these positions, the video can be cut into multiple different scene video segments video_scene_i, i = 1, 2, ..., n, where n is a positive integer. For each scene video segment, a text description text_scene_i (e.g., a digest corresponding to the scene video segment) is determined. Feature extraction is then performed on the text description text_scene_i to obtain the corresponding text feature feature_scene_i. Based on the semantic relationships between the text features feature_scene_i (i.e., the semantic relationships between the corresponding scene video segments), scene video segments with overlapping semantics are removed from the video to be processed, resulting in the final target video. This reduces duplicate pictures or content, saves memory and computing resources, and improves video editing efficiency.
[0030] Figure 3 shows an exemplary flowchart of a video processing method according to an embodiment of the present disclosure. As shown in Figure 3, the video processing method 300 may include the following steps.
[0031] In step S310, the video to be processed is obtained.
[0032] Here, the video to be processed may include video frames from different scenes, and the transitions between different scenes may include transition images connecting the two preceding and succeeding scene videos, such as fade-in / fade-out, hard cuts, rotation transformations, stretch transformations, etc. The video to be processed can be uploaded locally or retrieved over a network.
[0033] In some embodiments, method 300 may further include performing frame extraction on the video to be processed at a preset sampling frequency to obtain a sequence of video frames to be processed. The frame rate, FPS (frames per second), is the number of images contained in one second of video data. For example, if the video data contains 30 images per second, its FPS is 30. Generally, the frame rate of video data is between 30 and 60. When FPS is used to express the sampling frequency (the sampling frame rate), it instead denotes the number of images sampled per second. Specifically, if the sampling frequency is FPS = a, then a frames are extracted from the video to be processed per second. For example, if video data with a frame rate of 30 is sampled at FPS = 2, then 2 evenly spaced frames are selected from every 30 frames. Because video data generally has a relatively high frame rate, processing every frame is time-consuming. Downsampling the video data in this way reduces computational complexity and improves video processing efficiency and response speed.
[0034] In some embodiments, performing the subsequent processing on the sequence of video frames to be processed, instead of on the video to be processed itself, reduces the amount of data to be handled, saves computational cost, and improves processing efficiency. Specifically, frames can be extracted from a video to be processed with a duration of b seconds at a sampling frame rate a, i.e., a video frames are extracted per second in chronological order, yielding a video frame sequence X_a_b containing a*b frames. In the subsequent processing steps of embodiments of this disclosure, this video frame sequence, rather than the video to be processed, can serve as the data basis for removing segments with duplicate semantics.
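The sampling arithmetic in the two preceding paragraphs can be sketched as follows. This is only an illustrative sketch: the function name sample_frame_indices and the choice of evenly spaced offsets within each one-second window are assumptions, since the disclosure states only that a frames are taken per second.

```python
def sample_frame_indices(source_fps: int, duration_s: int, sample_fps: int) -> list[int]:
    """Return indices of the frames kept when sampling sample_fps frames per second.

    Hypothetical helper: source_fps is the frame rate of the video to be processed,
    duration_s is b (seconds), and sample_fps is the sampling frame rate a.
    """
    if sample_fps <= 0 or sample_fps > source_fps:
        raise ValueError("sample_fps must be between 1 and source_fps")
    indices = []
    for second in range(duration_s):
        base = second * source_fps  # index of the first frame in this one-second window
        for k in range(sample_fps):
            # Assumed policy: evenly spaced offsets inside the window.
            indices.append(base + (k * source_fps) // sample_fps)
    return indices


# A 2-second clip at 30 FPS sampled at FPS = 2 keeps a*b = 4 frames.
print(sample_frame_indices(30, 2, 2))
```

With a = 2 and b = 2 the sketch keeps frames 0, 15, 30 and 45, which matches the a*b frame count of the sequence X_a_b described above.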
[0035] In step S320, the video to be processed is divided into multiple scene video segments based on the scene transition position.
[0036] Here, a scene transition position refers to a transition point in the video. Each scene video segment corresponds to a single scene, and the picture within the segment remains stable, with virtually no change in video content.
[0037] In some embodiments, dividing the video to be processed into a plurality of scene video segments based on scene transition positions includes determining the scene transition probability of video frames in the video to be processed, determining video frames whose scene transition probability is greater than or equal to a transition threshold as the scene transition positions, and cutting the video to be processed at the scene transition positions to obtain the plurality of scene video segments.
[0038] Specifically, the video to be processed is input to a transition detection network to determine the scene transition positions, and the video is cut at these positions to obtain multiple scene video segments. For example, Figure 4 shows a schematic diagram of a video processing method according to an embodiment of the present disclosure. In Figure 4, frame extraction is performed on the video 410 to be processed to obtain a sequence of video frames to be processed, and the m video frames in this sequence (m is a positive integer) are sequentially input to the transition detection network 420 to obtain n scene video segments video_scene_1, ..., video_scene_n. Here, the transition detection network 420 performs transition detection on the m video frames by computing, for example, a transition score sk = f(xk), where xk is the k-th input video frame and f is the transition score function. This yields the transition scores S = [s1, s2, s3, ..., sm] of the m video frames, each expressing the probability that the corresponding video frame is a transition. A transition threshold threshold1 can be set; video frames whose transition score in S is greater than or equal to threshold1 are scene transition positions. For example, if s20 is greater than or equal to threshold1, then the video frame frame20 corresponding to s20 is a scene transition position. The sequence of video frames to be processed is then cut at the scene transition positions to obtain the scene video segments video_scene_1, ..., video_scene_n.
[0039] In some embodiments, the transition detection network can be obtained by training an initial neural network on transition detection training data. The transition detection training data may include training images and corresponding transition scores, where a higher transition score indicates a higher probability that the training image is a transition. Specifically, the initial neural network is trained with the training images as input-layer data and the corresponding transition scores as output-layer data. The network produces a transition estimation score for each training image, and the difference (or variance) between the transition estimation scores and the labeled transition scores is calculated to obtain a loss function. The weights of the initial neural network are adjusted to minimize the loss function, yielding a trained transition detection network.
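The thresholding step can be illustrated with a short sketch that takes already computed transition scores as input; the transition detection network itself is not modeled. The function name split_at_transitions is hypothetical, and dropping the transition frame from both neighboring segments is an assumption, because the disclosure does not state which segment, if any, a transition frame belongs to.

```python
def split_at_transitions(scores: list[float], threshold1: float) -> list[list[int]]:
    """Group frame indices into scene segments, cutting wherever score >= threshold1."""
    segments: list[list[int]] = []
    current: list[int] = []
    for k, score in enumerate(scores):
        if score >= threshold1:
            # Frame k is a scene transition position: close the running segment.
            # Assumption: the transition frame itself is not assigned to any segment.
            if current:
                segments.append(current)
            current = []
        else:
            current.append(k)
    if current:
        segments.append(current)
    return segments


# Illustrative scores S for m = 7 frames; frames 2 and 5 exceed threshold1 = 0.8.
S = [0.1, 0.2, 0.9, 0.1, 0.1, 0.95, 0.2]
print(split_at_transitions(S, 0.8))
```

Each inner list holds the frame indices of one scene video segment, so the example yields three segments from seven frames.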
[0040] In step S330, a text description corresponding to each scene video segment is determined.
[0041] Here, the text description corresponding to a scene video segment can refer to a representative text description that explains the main image content of that scene video segment, such as a scene video digest.
[0042] In some embodiments, determining a text description corresponding to each scene video segment includes: determining a plurality of video frame text descriptions corresponding to the video frames based on the image content of each video frame in the scene video segment; performing feature extraction on the plurality of video frame text descriptions to obtain a plurality of frame text features; calculating a frame text similarity between each frame text feature and the other frame text features; and determining the video frame text description corresponding to the frame text feature with the highest frame text similarity as the text description for the scene video segment.
[0043] Here, for each scene video segment, based on the video frame text description corresponding to each video frame in the scene video segment, the video frame text description with the highest similarity to the other video frame text descriptions can be selected as the video digest of the scene video segment. Specifically, the scene video segment is input into an image description network, which can be represented as h. If the output text description is denoted t, then t = h(x), where x is the input image. For example, if the input image x contains a dog and grass, the text description t may be "The dog is playing on the grass."
[0044] The image description network 430 can describe the video frames in each scene video segment and output a video frame text description corresponding to each video frame. Specifically, as shown in Figure 4, text descriptions text_scene_1, ..., text_scene_n corresponding to the n scene video segments video_scene_1, ..., video_scene_n can be obtained. For example, if scene video segment video_scene_1 contains K1 video frames, where K1 is a positive integer, the image description network 430 can output K1 video frame text descriptions corresponding to the K1 video frames. Feature extraction is performed on the K1 video frame text descriptions by a text feature network to obtain K1 frame text features. The text feature network converts input text into a fixed-length text feature, for example a floating-point feature vector of length 512. For each frame text feature, the similarity (e.g., cosine similarity) between this frame text feature and each of the other frame text features is calculated, yielding K1-1 similarities. The video frame text description corresponding to the frame text feature with the highest similarity is determined as the text description text_scene_1 for the scene video segment video_scene_1. Text descriptions for the other scene video segments can be obtained in the same way, which is not repeated here.
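The selection of a representative description can be sketched as below, starting from frame text features that are assumed to have been extracted already. The names cosine and pick_representative are hypothetical. The disclosure does not say how the K1-1 similarities of one frame are aggregated before the "highest" one is chosen; this sketch assumes their mean is used, which is only one possible reading.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pick_representative(descriptions: list[str], features: list[list[float]]) -> str:
    """Return the frame description whose feature is, on average, closest to the others."""
    if not descriptions or len(descriptions) != len(features):
        raise ValueError("need one feature vector per description")
    if len(descriptions) == 1:
        return descriptions[0]
    best_index, best_score = 0, float("-inf")
    for i, fi in enumerate(features):
        sims = [cosine(fi, fj) for j, fj in enumerate(features) if j != i]
        score = sum(sims) / len(sims)  # assumed aggregation: mean of the K1-1 similarities
        if score > best_score:
            best_index, best_score = i, score
    return descriptions[best_index]


# Toy 2-dimensional features standing in for the length-512 vectors of the text feature network.
texts = ["a dog on grass", "a dog playing on the grass", "a red car"]
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(pick_representative(texts, feats))
```

In the toy data the second description sits between the other two, so it has the highest mean similarity and is chosen as the segment digest.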
[0045] In step S340, the scene video segments having overlapping semantics in the video to be processed are removed based on the semantic relationships between the text descriptions to obtain the target video.
[0046] Here, semantic relevance can refer to the degree of similarity between image features and text features, i.e., cosine similarity. The semantic relevance between text descriptions corresponding to scene video segments can reflect the overall correlation between the main content of the scene video segments and the entire video being processed, as well as the segment correlation between scene video segments themselves.
[0047] In some embodiments, removing scene video segments having overlapping semantics in the video to be processed based on the semantic relationships between the text descriptions to obtain a target video includes determining the overall correlation between the scene video segments and the video to be processed and the segment correlation between the scene video segments based on the semantic relationships between the text descriptions, determining the retained segments and the segments to be removed in the scene video segments based on the overall correlation and the segment correlation, wherein the segments to be removed and at least one of the retained segments have overlapping semantics, and removing the segments to be removed from the video to be processed to obtain the target video.
[0048] Here, segment correlation allows precise determination of whether overlapping semantics exist between scene video segments, while overall correlation ensures the completeness of the content of the entire video to be processed. By combining the two, the segments with overlapping semantics that should be removed can be determined while still ensuring that the complete content is presented. Removing these segments from the video to be processed yields the target video. This reduces duplicate pictures or content, saves memory and computing resources, and improves video editing efficiency.
[0049] In some embodiments, determining the retained segments and the segments to be removed among the scene video segments based on the overall correlation and the segment correlation, wherein each segment to be removed has overlapping semantics with at least one of the retained segments, includes: designating the scene video segment with the minimum overall correlation as a retained segment in a retained segment set, and designating the scene video segments that do not have the minimum overall correlation as segments to be removed in a to-be-removed segment set; and repeating the following steps until a preset condition is met: calculating, based on the segment correlation, the semantic similarity between each segment to be removed in the to-be-removed segment set and the retained segment set; and, if the minimum semantic similarity is less than a similarity threshold, moving the segment to be removed that corresponds to this minimum value to the retained segment set. The preset condition includes the to-be-removed segment set being empty and/or the semantic similarity between every segment to be removed in the to-be-removed segment set and the retained segment set being greater than or equal to the similarity threshold.
[0050] In some embodiments, determining the overall correlation between the scene video segments and the video being processed, and the segment correlation between the scene video segments, based on the semantic relationships between the text descriptions, includes: feature extraction of each text description to obtain a corresponding text feature; calculating multiple text similarities between each text feature and other text features to obtain a segment correlation between the scene video segments; and obtaining an overall correlation between the scene video segments corresponding to each text feature and the video being processed, based on the average of the multiple text similarities.
[0051] In some embodiments, calculating the semantic similarity between each segment to be removed in the to-be-removed segment set and the retained segment set based on the segment correlation includes determining, for each segment to be removed, the maximum value of the segment correlation between that segment and each of the retained segments in the retained segment set as the semantic similarity between the segment to be removed and the retained segment set.
[0052] Specifically, as shown in Figure 4, semantic deduplication can be performed on n scene video segments video_scene_1, ..., video_scene_n based on an iterative deduplication policy. For example, the text descriptions text_scene_1, ..., text_scene_n of multiple scene video segments output by the image description network 430 can be input to the text feature network 440. The text feature network 440 can extract features from each text description text_scene_1, ..., text_scene_n to obtain the corresponding n text features feature_1, ..., feature_n.
[0053] For each text feature feature_i (i=1, ..., n), the text similarity between this text feature feature_i and other text features feature_j (j≠i) is calculated, yielding n-1 text similarities k1, ..., kn-1. By calculating the average of these n-1 text similarities Σ(k1+...+kn-1)k / (n-1), the overall correlation similarity_i between the scene video segment video_scene_i and the entire video frame sequence (or video being processed) can be obtained. Since there are n text features, n overall correlations can be obtained. Then, the scene video segment video_scene_p corresponding to the minimum value similarity_p = min(similarity_i) among the n overall correlations is set as segment K1 and placed in the reserved segment set K=[K1]. The other scene video segments are designated as R2, ..., Rn-1 and placed in the removal target set R=[R2, ..., Rn-1].
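As a worked illustration of the overall correlation, the sketch below averages the off-diagonal entries of each row of a pairwise text-similarity matrix and selects the segment with the smallest average as the initial retained segment. The function name overall_correlation and the 3x3 matrix values are made up for illustration and are not taken from the disclosure.

```python
def overall_correlation(sim: list[list[float]]) -> list[float]:
    """For each segment i, average its similarity to the other n-1 segments."""
    n = len(sim)
    if n < 2:
        raise ValueError("at least two segments are needed")
    return [sum(sim[i][j] for j in range(n) if j != i) / (n - 1) for i in range(n)]


# sim[i][j] is the text similarity between feature_i and feature_j (illustrative values).
sim = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.3],
    [0.2, 0.3, 1.0],
]
overall = overall_correlation(sim)
# The segment with the minimum overall correlation seeds the retained segment set.
seed = min(range(len(overall)), key=lambda i: overall[i])
print(overall, seed)
```

Here the third segment has the lowest average similarity (about 0.25), so it is the least redundant segment and becomes the first retained segment.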
[0054] The final retained segments and removed segments can be determined based on the following iterative deduplication policy. For each element Ry, y=1, ..., n-1 in the set of segments to be removed R, the text similarity_text is calculated between each element Ry and each element in the set of segments to be removed K. The maximum value of the text similarity_text is determined as the semantic similarity_Ry_K between element Ry and the set of segments to be removed K. In this way, the semantic similarities of all n-1 elements in the set of segments to be removed and the n-1 elements in the set of segments to be removed K can be obtained. The minimum value of this semantic similarity_Ry_K is compared with the preset similarity threshold threshold2. If this minimum value is smaller than the similarity threshold threshold2 (e.g., threshold2=0.8), the element corresponding to the minimum value is moved to the set of segments to be removed K and becomes a retained segment. The calculation is repeated for each element in the set of segments to be removed R based on the ascending iteration deduplication policy until the preset condition is met.
[0055] Specifically, the preset condition may include the reserved segment set K containing all n segments, i.e., the set of segments to be removed R being empty. The preset condition may further include the semantic similarity between every segment remaining in R and the reserved segment set K being not less than the similarity threshold threshold2; in this case, every element of R has a similarity to K of at least threshold2, and no scene video segment with low similarity remains. Once the preset condition is met, the segments left in R are scene video segments whose semantics overlap with at least one reserved segment in K. Furthermore, other video processing steps can be performed on the target video obtained after removing the scene video segments with overlapping semantics, for example, extracting highlight scenes of a specified length from the target video, thereby obtaining a video result in which every scene has a uniform length.
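The iterative policy of paragraphs [0054] and [0055] can be sketched as follows. This is a hedged illustration under stated assumptions: the input is a precomputed symmetric matrix of pairwise text similarities, segments are identified by their index, and the function name and signature are hypothetical. One element is moved per iteration, following the description in [0054].

```python
def iterative_dedup(sim, seed, threshold2=0.8):
    """Split segment indices into reserved (K) and removed (R) sets.

    sim        -- n x n matrix of pairwise text similarities (assumed given)
    seed       -- index of the segment with minimum overall correlation
    threshold2 -- similarity threshold (0.8 is the example value in the text)
    """
    reserved = [seed]
    to_remove = [i for i in range(len(sim)) if i != seed]
    while to_remove:  # first preset condition: R is empty
        # similarity_Ry_K: maximum similarity between Ry and any member of K
        set_sim = {y: max(sim[y][k] for k in reserved) for y in to_remove}
        y_min = min(to_remove, key=set_sim.__getitem__)
        # second preset condition: nothing left in R is dissimilar enough to K
        if set_sim[y_min] >= threshold2:
            break
        reserved.append(y_min)
        to_remove.remove(y_min)
    # sorted indices keep the reserved segments in their original order
    return sorted(reserved), to_remove
```

For example, if segments 0 and 1 are near-duplicates (similarity 0.9) and segments 2 and 3 are near-duplicates (similarity 0.85), seeding with segment 2 keeps segments 0 and 2 and leaves segments 1 and 3 in the set to be removed.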
[0056] The video processing method according to the embodiments of this disclosure ensures that the entire content of the video being processed is still reflected while reducing duplicate footage or content. This saves memory and computing resources, improves video editing efficiency, and balances these savings against the presentation quality of the edited video.
[0057] The methods of the embodiments of this disclosure can be performed by a single device, such as a single computer or server. The methods of the embodiments can also be applied to distributed scenarios in which multiple devices can cooperate to complete the process. In such distributed scenarios, one of the multiple devices may perform only one or more steps in the methods of the embodiments of this disclosure, while the multiple devices may interact with each other to complete the described method.
[0058] The above describes some embodiments of the present disclosure. Other embodiments are within the scope of the appended claims. In some cases, the operations or steps described in the claims may be performed in an order different from that in the embodiments above, and the desired results may still be achieved. Furthermore, the processes depicted in the accompanying drawings do not necessarily require that only a specific order or sequence shown be followed to achieve the desired results. In some embodiments, multitasking and parallel processing may be possible or advantageous.
[0059] Based on the same technical concept and corresponding to the methods of any embodiment described above, the present disclosure further provides a video processing apparatus. Referring to Figure 5, the video processing apparatus includes: an acquisition module for acquiring a video to be processed and a target time length; a segmentation module for dividing the video to be processed into a plurality of scene video segments based on scene transition positions; a text module for determining a text description corresponding to each of the scene video segments; and a deduplication module for removing scene video segments having overlapping semantics in the video to be processed based on the semantic relationships between the text descriptions to obtain a target video.
[0060] For the sake of clarity, the above apparatus is described in terms of separate modules divided by function. Of course, when implementing this disclosure, the functions of the modules can be realized in one or more pieces of software and / or hardware.
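One way to picture this module decomposition in software is sketched below. It is only an illustration under assumptions: the class name, the field names, and the representation of a video as a list of frames are all hypothetical, and each module is supplied as a callable so that any concrete implementation can be plugged in.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class VideoProcessingApparatus:
    """Hypothetical composition of the four modules as plain callables."""
    acquire: Callable[[str], list]                   # acquisition module
    segment: Callable[[list], List[list]]            # segmentation module
    describe: Callable[[list], str]                  # text module
    deduplicate: Callable[[List[str]], List[int]]    # deduplication module

    def process(self, source: str) -> list:
        video = self.acquire(source)
        segments = self.segment(video)
        descriptions = [self.describe(s) for s in segments]
        # the deduplication module returns the indices of segments to keep
        keep = self.deduplicate(descriptions)
        # concatenate the kept segments in their original order
        return [frame for i in sorted(keep) for frame in segments[i]]
```

Because the modules are independent callables, they could equally run in one process or be distributed across several devices.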
[0061] The apparatus of the above embodiment is used to implement the corresponding video processing method in any one of the above embodiments and has the beneficial effects of the embodiment of the corresponding method, which will not be described further here.
[0062] Based on the same technical concept and corresponding to the methods of any of the above embodiments, the present disclosure further provides a non-transitory computer-readable storage medium storing computer instructions, the computer instructions being used to cause a computer to execute the video processing method described in any one of the above embodiments.
[0063] The computer-readable media of this embodiment include non-volatile and volatile media and removable and non-removable media, and may store information by any method or technique. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, and any other non-transmission media that can be used to store information accessible by a computing device.
[0064] The computer instructions stored in the storage medium of the above embodiment are used to cause the computer to execute the video processing method described in any one of the above embodiments, and have the beneficial effects of the corresponding method embodiment, which will not be described further here.
[0065] Those skilled in the art will understand that the discussion of any of the above embodiments is merely illustrative and is not intended to limit the scope of this disclosure (including the claims) to these examples. Within the concept of this disclosure, technical features of the above embodiments or of different embodiments can also be combined, the steps may be carried out in any order, and there are many other variations of the different aspects of the above embodiments, which are not described in detail for the sake of brevity.
[0066] Furthermore, for the sake of simplifying the explanation and discussion, and to avoid obscuring the embodiments of this disclosure, well-known power / ground connections to integrated circuit (IC) chips and other components may or may not be illustrated. It should also be noted that, to avoid obscuring the embodiments of this disclosure, the apparatus may be shown in block diagram form, and the details of the embodiments relating to the apparatus in these block diagrams are highly dependent on the platform on which the embodiments of this disclosure are implemented (i.e., these details should be entirely within the comprehension of those skilled in the art). Where specific details (e.g., circuits) are described to illustrate the exemplary embodiments of this disclosure, it will be apparent to those skilled in the art that the embodiments of this disclosure can be implemented without these specific details, or with modifications to these details. Therefore, these descriptions should be considered explanatory rather than restrictive.
[0067] While this disclosure has been described in relation to specific embodiments thereof, many substitutions, modifications, and variations of these embodiments will be apparent to those skilled in the art based on the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) can be used with the embodiments discussed.
[0068] The embodiments of this disclosure are intended to cover all such substitutions, modifications, and variations that fall within the broad scope of the appended claims. Accordingly, any omissions, modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the embodiments of this disclosure should be included within the scope of protection of this disclosure.
Claims
1. A video processing method, comprising: obtaining a video to be processed; dividing the video to be processed into a plurality of scene video segments based on scene transition positions; determining a text description corresponding to each of the scene video segments; and removing, based on semantic relationships between the text descriptions, scene video segments having overlapping semantics in the video to be processed, thereby obtaining a target video.
2. The method according to claim 1, wherein removing, based on the semantic relationships between the text descriptions, the scene video segments having overlapping semantics in the video to be processed to obtain the target video comprises: determining, based on the semantic relationships between the text descriptions, an overall correlation between each scene video segment and the video to be processed and a segment correlation between the scene video segments; determining, based on the overall correlation and the segment correlation, reserved segments and segments to be removed among the scene video segments, each segment to be removed having overlapping semantics with at least one of the reserved segments; and removing the segments to be removed from the video to be processed to obtain the target video.
3. The method according to claim 2, wherein determining, based on the overall correlation and the segment correlation, the reserved segments and the segments to be removed among the scene video segments comprises: designating the scene video segment with the minimum overall correlation as a reserved segment in a reserved segment set, and designating the scene video segments that do not have the minimum overall correlation as segments to be removed in a set of segments to be removed; and repeating the following steps until a preset condition is met: calculating, based on the segment correlation, the semantic similarity between each segment to be removed in the set of segments to be removed and the reserved segment set; and moving the segments to be removed whose semantic similarity is less than a similarity threshold to the reserved segment set; wherein the preset condition includes that the set of segments to be removed is empty, and / or that the semantic similarity between every segment to be removed in the set of segments to be removed and the reserved segment set is not less than the similarity threshold.
4. The method according to claim 3, wherein determining, based on the semantic relationships between the text descriptions, the overall correlation between each scene video segment and the video to be processed and the segment correlation between the scene video segments comprises: performing feature extraction on each of the text descriptions to obtain corresponding text features; for each text feature, calculating a plurality of text similarities between that text feature and the other text features to obtain the segment correlation between the scene video segments; and obtaining, based on the average of the plurality of text similarities, the overall correlation between the scene video segment corresponding to each text feature and the video to be processed.
5. The method according to claim 4, wherein calculating, based on the segment correlation, the semantic similarity between each segment to be removed in the set of segments to be removed and the reserved segment set comprises: determining the maximum of the segment correlations between the segment to be removed and each reserved segment in the reserved segment set as the semantic similarity between the segment to be removed and the reserved segment set.
6. The method according to claim 1, wherein dividing the video to be processed into a plurality of scene video segments based on scene transition positions comprises: determining a scene switching probability for video frames in the video to be processed; determining a video frame whose scene switching probability is equal to or greater than a switching threshold as a scene transition position; and cutting the video to be processed at the scene transition positions to obtain the plurality of scene video segments.
7. The method according to claim 1, wherein determining a text description corresponding to each of the scene video segments comprises: determining, based on the image content of each video frame in the scene video segment, a plurality of video frame text descriptions corresponding to the video frames; performing feature extraction on the plurality of video frame text descriptions to obtain a plurality of frame text features; for each frame text feature, calculating the frame text similarity between that frame text feature and the other frame text features; and determining the video frame text description corresponding to the frame text feature with the highest frame text similarity as the text description of the scene video segment.
8. A video processing apparatus, comprising: an acquisition module for obtaining a video to be processed; a segmentation module for dividing the video to be processed into a plurality of scene video segments based on scene transition positions; a text module for determining a text description corresponding to each of the scene video segments; and a deduplication module for removing, based on semantic relationships between the text descriptions, scene video segments having overlapping semantics in the video to be processed to obtain a target video.
9. An electronic device comprising memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method according to any one of claims 1 to 7 when executing the program.
10. A non-transitory computer-readable storage medium in which computer instructions are stored, wherein the computer instructions are used to cause a computer to perform the method according to any one of claims 1 to 7.
11. A computer program product which is tangibly stored in a computer storage medium and includes computer-executable instructions, wherein the computer-executable instructions cause a device to perform the method described in any one of claims 1 to 7 when executed by the device.