A video intelligent clipping method and system based on multi-modal semantic analysis

The video intelligent editing method using multimodal semantic analysis collects audiovisual data to generate multimodal feature sets, constructs action states and causal logical relationships, solves the problem of fragmented video editing results, and achieves logically coherent and personalized video editing effects.

CN122248230A · Pending · Publication Date: 2026-06-19 · SHIJIAZHUANG YINGSHANG NETWORK TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: SHIJIAZHUANG YINGSHANG NETWORK TECHNOLOGY CO LTD
Filing Date: 2026-03-25
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing video editing technologies struggle to fully reflect the true semantic content of videos, resulting in fragmented editing results with poor logical coherence, making it difficult to meet the needs of batch and personalized video editing.

Method used

The method collects audiovisual data, generates a multimodal spatiotemporal basic feature set, constructs an action state temporal transition vector set and a causal logic topological association index table, and combines them with a hierarchical audiovisual saliency weight matrix. This models the spatiotemporal logical relationships and causal associations of video object actions, enabling intelligent logical editing.

Benefits of technology

It enhances the logical coherence and narrative integrity of video editing, meets users' personalized needs, is suitable for the automated processing of large-scale video content, and significantly improves editing efficiency and quality.

Abstract

This invention relates to the fields of video processing and artificial intelligence, specifically to a video intelligent editing method and system based on multimodal semantic analysis. The method includes the following steps: collecting audiovisual data and simultaneously extracting coordinate, color, and spectral information to generate a multimodal spatiotemporal feature set; filtering sequences that satisfy the displacement and direction conditions and integrating start and end states to obtain an action time-series vector set; establishing trigger and result connections by calculating time difference, spatial overlap rate, and semantic distance to generate a causal logic topology table; calculating saliency scores from the visual and auditory coefficients and marking parent-child relationships to generate a hierarchical weight matrix; and finally traversing nodes to lock start and end positions, then trimming and splicing the original data to establish an intelligent logic-edited video sequence. By fusing video images, audio, and time-series information, the invention achieves collaborative analysis of multimodal features. Compared with single-modal editing methods, this approach understands the semantic content of the video more comprehensively and accurately, reducing the probability of editing deviations.

Description

Technical Field

[0001] This invention relates to the fields of video processing and artificial intelligence technology, and in particular to a video intelligent editing method and system based on multimodal semantic analysis.

Background Technology

[0002] With the rapid development of video-related industries such as short videos and live streaming, the scale of video content production is constantly expanding, which puts forward higher requirements for video editing efficiency and quality. Traditional video editing methods mainly rely on manual operation. Editors need to watch the original video frame by frame and select effective segments, add transition effects, match background music, etc. based on their personal experience. This method is not only time-consuming and labor-intensive with low editing efficiency, but the editing effect also depends to a large extent on the professional level of the editor, making it difficult to meet the needs of batch and personalized video editing.

[0003] To reduce manual editing costs and improve editing efficiency, several intelligent video editing solutions have emerged in the current technology. These solutions typically generate editing decisions by analyzing single-modal data of the video, such as selecting segments based solely on scene change information in image frames or solely on speech recognition results in audio. However, single-modal data analysis methods cannot fully reflect the true semantic content of the video, easily leading to biased editing results. For example, editing based solely on image scene changes may miss segments containing important audio information but with minimal visual changes; while editing based solely on audio information may struggle to accurately identify key visual events in the video.

[0004] Furthermore, existing intelligent editing technologies often focus on identifying isolated highlight segments when processing video content, while paying insufficient attention to the spatiotemporal logical connections between video frames and the development of video objects' actions. Due to the lack of refined modeling of the starting, process, and ending states of object actions, and the lack of systematic analysis of the causal relationships between audiovisual information, existing technologies struggle to understand video content from a holistic semantic perspective. The resulting editing often presents as fragmented splicing of segments with weak logical connections and an incomplete narrative structure, failing to meet users' needs for logical coherence and personalized expression in videos.

[0005] Therefore, there is an urgent need for a video editing technology solution that can perform multimodal deep fusion analysis of video content and construct the spatiotemporal logical relationships and causal connections of video object actions, thereby achieving a more intelligent, coherent and user-friendly video editing solution.

Summary of the Invention

[0006] To address the technical problems existing in the prior art, embodiments of the present invention provide a video intelligent editing method and system based on multimodal semantic analysis.

[0007] To achieve the above objectives, the present invention adopts the following technical solution: a video intelligent editing method based on multimodal semantic analysis, comprising the following steps:

[0008] S1: Collect audiovisual data, simultaneously extract the target center coordinates, color histogram and frequency domain energy spectral density, associate timestamps and combine multi-dimensional information to generate a multimodal spatiotemporal basic feature set;

[0009] The audiovisual data consists of a sequence of frames and an audio track from the original video, and the multidimensional information includes spatial, visual, auditory, and temporal information.

[0010] S2: Call the multimodal spatiotemporal basic feature set, filter displacement vector and direction cosine compliance sequences, retain action nodes, integrate the preceding start and subsequent end state features to obtain the action state temporal transition vector set;

[0011] S3: Based on the action state time sequence transition vector set, calculate the timestamp difference to construct the trigger connection, and the spatial overlap rate and semantic distance to construct the result connection. After the threshold condition is met, generate the causal logic topology association index table.

[0012] S4: For the causal logic topological association index table, the saliency score is obtained by summing the visual coefficient derived from the bounding box and the auditory coefficient derived from the energy spectrum, and the difference in scores between nodes is compared with the threshold to mark the parent-child relationship, generating a hierarchical audiovisual saliency weight matrix.

[0013] S5: Traverse the parent nodes according to the hierarchical audiovisual saliency weight matrix, lock the start and end positions of the editing for the trigger and result connection, perform cropping and splicing on the original audiovisual data, and establish an intelligent logic editing video sequence.

[0014] As a further aspect of the present invention, the multimodal spatiotemporal basic feature set includes overall representation information of target behavior, cross-modal association description information, temporal continuity structure information, and basic semantic identifiers of content.

[0015] The action state time sequence transition vector set includes a description of the action start-end correlation, a description of the state change trend, and action integrity determination information;

[0016] The causal logic topology association index table includes event trigger pointing relationship, event result correspondence relationship, and logical dependency path identifier;

[0017] The hierarchical audiovisual saliency weight matrix includes saliency strength classification information, node importance differentiation information, and hierarchical priority labeling information;

[0018] The intelligent logic-edited video sequence includes an event mainline segment structure, content coherence expression units, and semantically consistent output content.

[0019] As a further aspect of the present invention, the audiovisual data is a sequence of frame images and an audio track of the original video;

[0020] The multidimensional information includes spatial, visual, auditory, and temporal information.

[0021] As a further aspect of the present invention, the displacement vector is the change vector of the center coordinates of the target object in adjacent video frames, which is used to quantify the amount and direction of the target's movement in space;

[0022] The spatial overlap rate is the intersection-over-union (IoU) ratio of the spatial bounding boxes of the last frame of the action and the termination node, which is used to evaluate the degree of spatial overlap between the two regions.

[0023] The visual bounding box is a rectangular box that surrounds the target object in the video frame, used to define the spatial position and range of the target in the image;

[0024] The auditory energy spectral coefficient is the energy spectral density of the audio signal in the frequency domain minus the background noise, and is used to quantify the perceptual saliency of the audio.

[0025] As a further aspect of the present invention, step S1 is performed as follows:

[0026] The original video frame image sequence is acquired, the pixel matrix of the single frame image is traversed and the pixel distribution set of the target object is extracted, the plane horizontal and vertical coordinate values of the geometric center are calculated based on the extreme values of the set boundary, the intensity distribution frequency values of the pixels in each color channel are counted, and the coordinate values and frequency values are concatenated by the row dimension according to the frame number to establish the target space visual attribute data group.

[0027] The audio track data stream accompanying the original video is acquired. A discrete Fourier transform is performed on the time-domain acoustic wave sampling points to map them to the frequency domain. The square value of the amplitude mode of each frequency component is calculated to characterize the signal power. The power value is accumulated along the frequency axis and the energy spectral density value is obtained to generate an audio frequency domain energy feature sequence.

[0028] The target space visual attribute data set and the audio frequency domain energy feature sequence are called, the absolute timestamp values in the video metadata are parsed and a time index is constructed, and the visual coordinate values, color distribution values and auditory energy values are retrieved and aligned based on the time index. The data of each dimension are combined to obtain a multimodal spatiotemporal basic feature set.
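As a rough illustration of the time-index alignment described above, the following Python sketch attaches to each video frame the most recent audio energy value at or before the frame timestamp. The function name `align_features`, the dictionary layout, and the nearest-preceding-window rule are assumptions made for illustration; the text only states that the values are retrieved and aligned based on the time index.

```python
from bisect import bisect_right

def align_features(frame_times, visual_rows, audio_times, audio_energy):
    """Join per-frame visual features with per-window audio energy by time.

    frame_times  : sorted frame timestamps in seconds
    visual_rows  : one visual feature record per frame
    audio_times  : sorted start times of the audio analysis windows
    audio_energy : one energy value per audio window
    """
    aligned = []
    for t, visual in zip(frame_times, visual_rows):
        # Index of the last audio window that starts at or before t
        # (assumed matching rule, clamped to the first window).
        k = max(bisect_right(audio_times, t) - 1, 0)
        aligned.append({"time": t, "visual": visual, "energy": audio_energy[k]})
    return aligned
```

Each returned record then carries the spatial, visual, auditory, and temporal values for one frame, which is the shape the later steps read from.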

[0029] As a further aspect of the present invention, step S2 is performed as follows:

[0030] The multimodal spatiotemporal basic feature set is invoked to extract the target center coordinate sequence of each frame, calculate the Euclidean distance between adjacent frame coordinates, compare the distance value with a preset static determination threshold, filter the frame index with the distance value greater than the static determination threshold and sort them by time to generate a non-static continuous dynamic frame sequence.

[0031] Based on the non-static continuous dynamic frame sequence, the displacement vectors of adjacent frames are calculated, the cosine value of the angle between continuous vectors is calculated using the vector dot product formula, and the cosine value is compared with the value of the preset direction consistency test threshold. The frame intervals where the cosine value is continuously higher than the test threshold are identified, and the frame index and associated vector direction data within the interval are extracted to establish a set of highly consistent action process segments.
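The two filters described above (a static determination threshold on the Euclidean distance, then a direction consistency threshold on the cosine between consecutive displacement vectors) can be sketched in Python as follows. This is a simplified illustration, not the patented procedure: it merges both filters into one pass over the original frame sequence, and the function name `consistent_segments` and the inclusive `(first_frame, last_frame)` output format are assumptions.

```python
import numpy as np

def consistent_segments(centers, static_thr, cos_thr):
    """Return inclusive (first_frame, last_frame) intervals of steady motion."""
    centers = np.asarray(centers, dtype=float)
    disp = np.diff(centers, axis=0)          # disp[i] goes from frame i to i+1
    dist = np.linalg.norm(disp, axis=1)      # Euclidean distance per step
    moving = dist > static_thr               # static determination test

    segments, start = [], None
    for i in range(len(disp) - 1):
        ok = bool(moving[i] and moving[i + 1])
        if ok:
            # Cosine of the angle between consecutive displacement vectors.
            cos = disp[i] @ disp[i + 1] / (dist[i] * dist[i + 1])
            ok = cos > cos_thr
        if ok:
            if start is None:
                start = i
        elif start is not None:
            segments.append((start, i + 1))  # frames start .. i+1
            start = None
    if start is not None:
        segments.append((start, len(disp)))
    return segments
```

For a target that rests, moves right for three frames, then turns upward, the sketch reports only the straight-line stretch as an action process segment.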

[0032] For the set of highly consistent action process segments, the adjacent frame indices of the first frame before the segment and the last frame after the segment are locked. Based on the indices, the corresponding visual and auditory feature data are retrieved from the multimodal spatiotemporal basic feature set. The features of the preceding, current, and following segments are concatenated dimensionally to obtain the action state temporal transition vector set.
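A minimal Python sketch of the concatenation step follows. The text does not say how the frames inside a segment are summarized, so averaging them is an assumption here, as are the function name `transition_vector` and the choice to skip segments that have no neighbouring frame on either side.

```python
import numpy as np

def transition_vector(features, segment):
    """Concatenate start-state, action, and end-state features for one segment.

    features : array of shape (n_frames, d), one feature row per frame
    segment  : inclusive (first_frame, last_frame) of the action process
    """
    first, last = segment
    if first - 1 < 0 or last + 1 >= len(features):
        return None  # no preceding or following frame to use as a state node
    features = np.asarray(features, dtype=float)
    before = features[first - 1]                    # preceding start state
    during = features[first:last + 1].mean(axis=0)  # assumed summary: mean
    after = features[last + 1]                      # subsequent end state
    return np.concatenate([before, during, after])
```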

[0033] As a further aspect of the present invention, step S3 is performed as follows:

[0034] Based on the action state time sequence transition vector set, the timestamp values of the start and action nodes are extracted, the absolute time difference value is calculated, the difference value is compared with the preset causal judgment threshold, and the node combination with the difference value less than the causal judgment threshold is selected to generate the trigger time sequence associated node pair.
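The trigger test above reduces to a threshold on an absolute time difference. A small Python sketch, in which the tuple layout and the name `trigger_pairs` are hypothetical:

```python
def trigger_pairs(nodes, causal_thr):
    """Keep (node_id, start_ts, action_ts) tuples whose absolute timestamp
    difference is below the causal judgment threshold (seconds)."""
    return [
        (node_id, start_ts, action_ts)
        for node_id, start_ts, action_ts in nodes
        if abs(action_ts - start_ts) < causal_thr
    ]
```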

[0035] For the triggering time-series associated node pair, retrieve the spatial boundary and semantic features of the last frame of the action and the termination node, calculate the region intersection-union ratio and the vector cosine distance, compare the calculated value with the overlap rate separation threshold and the distance continuity threshold, retain the connection relationship that meets the dual threshold conditions, and obtain the result logical verification candidate set.
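The two measures used in this step can be sketched in Python as below. The box format `(x1, y1, x2, y2)` and the function names are assumptions. The text does not state the direction of the two threshold comparisons, so `is_result_connection` assumes that a connection is kept when the overlap is at least the overlap threshold and the cosine distance is at most the distance threshold.

```python
import math

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm

def is_result_connection(box_a, box_b, sem_a, sem_b, overlap_thr, dist_thr):
    # Assumed comparison directions; the source only names the two thresholds.
    return (iou(box_a, box_b) >= overlap_thr
            and cosine_distance(sem_a, sem_b) <= dist_thr)
```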

[0036] Based on the result logical verification candidate set, the chain attributes are parsed, and the spatiotemporal nodes and connection relationships are mapped to graph structure vertices and directed edges, respectively. An index relationship reflecting the node evolution path is established, and a causal logical topological association index table is constructed.
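One possible in-memory form of such an index table is an adjacency list keyed by source node, sketched below in Python. The names `build_index_table` and `evolution_path`, the `(source, target, kind)` tuple layout, and the rule of following the first outgoing edge are illustrative assumptions rather than the structure defined by the invention.

```python
from collections import defaultdict

def build_index_table(connections):
    """Map each source node to its outgoing (target, kind) directed edges.

    connections: iterable of (source, target, kind), kind being for example
    "trigger" or "result".
    """
    table = defaultdict(list)
    for source, target, kind in connections:
        table[source].append((target, kind))
    return dict(table)

def evolution_path(table, start):
    """Follow the first outgoing edge from each node, stopping at a dead end
    or when a node repeats."""
    path, seen, node = [start], {start}, start
    while table.get(node):
        node = table[node][0][0]
        if node in seen:
            break
        seen.add(node)
        path.append(node)
    return path
```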

[0037] As a further aspect of the present invention, step S4 is performed as follows:

[0038] For the causal logic topological association index table node, the multimodal spatiotemporal basic feature set is called to calculate the ratio of the bounding box area to the full frame image area and the histogram channel variance as the contrast value. The product of the proportion and the contrast, as well as the difference between the energy spectral density and the background noise, are calculated to obtain the visual and auditory coefficients respectively, and generate the audiovisual independent perception coefficient vector.

[0039] Based on the audiovisual independent perception coefficient vector, the visual and auditory coefficients of each node are linearly summed to obtain a comprehensive score representing the saliency of the node. The score data are arranged in the order of node index to establish a comprehensive saliency score sequence.

[0040] The comprehensive saliency score sequence and the causal logic topology association index table are called to calculate the absolute value of the score difference between the topology connection node pairs. The absolute value is compared with the preset hierarchical isolation threshold. The connection relationship that meets the threshold condition is marked with parent-child attributes to obtain the hierarchical audiovisual saliency weight matrix.
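The scoring and marking in this step can be illustrated with a short Python sketch. The visual coefficient is the product of area ratio and contrast, the auditory coefficient is the energy minus the background noise, and the two are summed, as described above. Two points are assumptions: that a connection is marked when the absolute score difference exceeds the hierarchical isolation threshold, and that the higher-scoring node becomes the parent. The function names are hypothetical.

```python
def saliency_score(area_ratio, contrast, energy, noise):
    """Linear sum of the visual and auditory coefficients."""
    visual = area_ratio * contrast   # bounding box share times contrast
    auditory = energy - noise        # energy above the background noise
    return visual + auditory

def mark_parent_child(scores, edges, isolation_thr):
    """Return {(a, b): (parent, child)} for edges whose score gap is large."""
    marks = {}
    for a, b in edges:
        if abs(scores[a] - scores[b]) > isolation_thr:  # assumed direction
            marks[(a, b)] = (a, b) if scores[a] >= scores[b] else (b, a)
    return marks
```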

[0041] As a further aspect of the present invention, step S5 is performed as follows:

[0042] Based on the hierarchical audiovisual saliency weight matrix, the parent node hierarchical objects are traversed, the timestamp values of the mapping between the trigger connection attribute and the result connection attribute are extracted, the difference between the end timestamp and the start timestamp is calculated, and if the difference is greater than the preset minimum threshold for editing duration, the timestamp value is retained and multiplied by the video frame rate conversion coefficient to generate the editing timing positioning frame index interval.
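The duration check and the timestamp-to-frame conversion can be sketched as follows. The function name `frame_interval` is hypothetical, and rounding to the nearest frame is an assumption, since the text only says the timestamps are multiplied by the frame rate conversion coefficient.

```python
def frame_interval(start_s, end_s, fps, min_duration_s):
    """Convert a (start, end) time pair in seconds to a frame index interval.

    Returns None when the clip is not longer than the minimum editing duration.
    """
    if end_s - start_s <= min_duration_s:
        return None
    return int(round(start_s * fps)), int(round(end_s * fps))
```

For example, at 25 frames per second a clip from 2.0 s to 5.0 s maps to frames 50 through 125.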

[0043] The editing timing positioning frame index interval is called, the original video image sequence and audio track data are read, the start and end frame positions of the video stream and the start and end sampling point positions of the audio stream are located according to the index interval value, the RGB image matrix data and PCM audio sampling data in the corresponding interval are extracted, the synchronization deviation value between the image matrix timestamp and the audio sampling timestamp is detected, and if the deviation value is less than the synchronization fault tolerance threshold, the data is formatted and encoded to encapsulate, and the audiovisual segmented original data stream is established.
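A minimal Python sketch of locating the audio sampling point positions for a frame interval and checking the synchronization deviation is given below. Both function names and the rounding rule are assumptions for illustration; decoding and encoding of the actual RGB and PCM data are outside the scope of the sketch.

```python
def audio_sample_span(start_frame, end_frame, fps, sample_rate):
    """Map a video frame interval to audio sample positions (rounded)."""
    start = round(start_frame / fps * sample_rate)
    end = round(end_frame / fps * sample_rate)
    return start, end

def in_sync(video_ts, audio_ts, tolerance_s):
    """True when the audio/video timestamp deviation is below the tolerance."""
    return abs(video_ts - audio_ts) < tolerance_s
```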

[0044] For the original audiovisual segment data stream, perform time-series arrangement, calculate the difference value of color histogram features between the last frame and the first frame of adjacent data streams and the change amplitude value of the zero-crossing rate of the audio waveform. If the difference value is less than the visual abrupt change threshold and the change amplitude value is less than the auditory abrupt change threshold, perform linear splicing operation to establish an intelligent logic editing video sequence.
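The splice test at a segment boundary can be sketched in Python as below. The text does not specify how the histogram difference is measured, so the sum of absolute differences used here is an assumption, as are the function names. The zero-crossing rate is computed as the fraction of adjacent sample pairs whose signs differ.

```python
import numpy as np

def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs with a sign change."""
    x = np.asarray(samples, dtype=float)
    if len(x) < 2:
        return 0.0
    negative = np.signbit(x)
    return float(np.count_nonzero(negative[1:] != negative[:-1])) / (len(x) - 1)

def can_splice(hist_a, hist_b, audio_a, audio_b, visual_thr, audio_thr):
    """Check both abrupt-change thresholds at the boundary of two segments."""
    hist_diff = np.abs(np.asarray(hist_a, dtype=float)
                       - np.asarray(hist_b, dtype=float)).sum()  # assumed L1
    zcr_change = abs(zero_crossing_rate(audio_a) - zero_crossing_rate(audio_b))
    return bool(hist_diff < visual_thr and zcr_change < audio_thr)
```

Here `hist_a` would be the color histogram of the last frame of one segment and `hist_b` that of the first frame of the next, with `audio_a` and `audio_b` the audio samples on either side of the cut.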

[0045] A video intelligent editing system based on multimodal semantic analysis includes:

[0046] The multimodal feature acquisition module collects video frame sequences and audio track data, extracts the bounding box center coordinates, color histograms and audio energy spectral density, associates timestamps and combines spatial audiovisual information to generate a multimodal spatiotemporal basic feature set;

[0047] The action timing transfer module calls the multimodal spatiotemporal basic feature set, filters continuous frame sequences with displacement modulus greater than the static determination threshold, compares the cosine value of adjacent vector directions with the consistency check threshold, extracts action nodes and integrates start and end features to obtain the action state timing transfer vector set;

[0048] The causal topology association module calculates the difference in node timestamps based on the action state temporal transition vector set and compares it with the causal judgment threshold to establish a trigger connection. It performs comparisons with the spatial overlap rate and semantic feature cosine distance with the separation threshold and continuity threshold respectively to establish a result connection, and performs indexing and arrangement on the connected nodes to establish a causal logical topology association index table.

[0049] The audiovisual saliency weighting module calculates, for the nodes of the causal logic topology association index table, the product of the bounding box area ratio and the histogram contrast and the difference between the energy spectral density and the background noise, and sums them to obtain a saliency score. It then compares the score difference with the hierarchical isolation threshold to generate a hierarchical audiovisual saliency weight matrix.

[0050] The intelligent logic editing module traverses the parent node hierarchy according to the hierarchical audiovisual saliency weight matrix, locks the start and end frame positions of the editing, and performs cropping and splicing on the original video image and audio track to establish an intelligent logic edited video sequence.

[0051] Compared with the prior art, the advantages and positive effects of the present invention are as follows:

[0052] In this invention, by fusing video images, audio, and time-series information, collaborative analysis of multimodal features is achieved. Compared with single-modal editing methods, this approach can more comprehensively and accurately understand the semantic content of the video and reduce the probability of editing deviations.

[0053] By modeling the initial state, process, and termination state of video object actions, and constructing the spatiotemporal logic and causal relationships between action states, the editing decision-making process is transformed from isolated segment selection to a holistic analysis based on the action evolution process, effectively improving the logical coherence and narrative integrity of the editing results.

[0054] By introducing a hierarchical audiovisual saliency weight matrix, the importance of video segments can be quantitatively evaluated, which helps to highlight key content while ensuring semantic integrity, thereby improving the viewing experience and expressive focus of the generated video.

[0055] Furthermore, this invention can complete the video editing process without manual frame-by-frame intervention, making it suitable for automated processing of large-scale video content. It significantly improves the intelligence level and processing efficiency of video editing and, to a certain extent, meets users' personalized editing needs.

Attached Figure Description

[0056] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0057] Figure 1 is a schematic diagram of the steps of the present invention;

[0058] Figure 2 is a comparison diagram of the time alignment deviation of the multimodal features of the present invention;

[0059] Figure 3 is a comparison chart of visual feature differences between video stitching points according to the present invention;

[0060] Figure 4 is a system module diagram of the present invention.

Detailed Implementation

[0061] The technical solution of the present invention will now be described with reference to the accompanying drawings.

[0062] In embodiments of the present invention, words such as "exemplarily," "for example," etc., are used to indicate that something is an example, illustration, or description. Any embodiment or design described as "exemplary" in the present invention should not be construed as being more preferred or advantageous than other embodiments or designs. Specifically, the use of the word "exemplary" is intended to present the concept in a concrete manner. Furthermore, in embodiments of the present invention, the meaning expressed by "and / or" can be both, or either one.

[0063] In the embodiments of this invention, the terms "image" and "picture" may sometimes be used interchangeably. It should be noted that, without emphasizing the distinction between them, they convey the same meaning. Similarly, the terms "of," "corresponding (relevant)," and "corresponding" may sometimes be used interchangeably. It should be noted that, without emphasizing the distinction between them, they convey the same meaning.

[0064] In this embodiment of the invention, a subscripted form such as $W_1$ may sometimes be written in a non-subscript form such as W1. When the difference is not emphasized, the meaning they express is the same.

[0065] To make the technical problems, technical solutions and advantages of the present invention clearer, a detailed description will be given below in conjunction with the accompanying drawings and specific embodiments.

[0066] Referring to Figures 1-3, this invention provides a video intelligent editing method based on multimodal semantic analysis, comprising the following steps:

[0067] S1: Acquire the frame image sequence and audio track of the original video, extract the center coordinates of the bounding box of the target object and the color histogram of the pixel region, convert the audio into a frequency domain signal and extract the energy spectral density, synchronously associate the frame timestamp, combine spatial, visual, auditory and temporal information, and generate a multimodal spatiotemporal basic feature set;

[0068] S2: Call the multimodal spatiotemporal basic feature set, calculate the spatial displacement vector magnitude of the center coordinates of adjacent frames of the target, filter the continuous frame sequence with a magnitude greater than the static determination threshold, calculate the direction cosine value of adjacent vectors in the sequence, retain the segments with cosine values greater than the consistency test threshold as action process nodes, and extract their preceding start state node and subsequent end state node respectively, integrate the feature data of the three to obtain the action state temporal transition vector set;

[0069] S3: Based on the action state time sequence transition vector set, calculate the timestamp difference between the starting state and the action process node. If it is less than the causal determination threshold, establish a trigger connection. Calculate the spatial overlap rate and semantic feature cosine distance between the action process and the termination state node. If the overlap rate separation threshold and distance continuity threshold conditions are met, establish a result connection and construct a causal logic topology association index table.

[0070] S4: For nodes in the causal logic topological association index table, calculate the product of the bounding box area ratio and the color histogram contrast based on the feature set as the visual coefficient, calculate the difference between the energy spectral density and the background noise level as the auditory coefficient, sum them to obtain the comprehensive saliency score, calculate the difference in scores of connected nodes and compare it with the hierarchical isolation threshold to mark the parent-child relationship, and generate a hierarchical audiovisual saliency weight matrix.

[0071] S5: Traverse the parent node hierarchy objects according to the hierarchical audiovisual saliency weight matrix, lock the start and end frame positions of the clip based on its trigger connection and result connection, perform cropping and splicing on the original video image sequence and audio track, and establish an intelligent logic clipped video sequence.

[0072] The multimodal spatiotemporal basic feature set includes overall representation information of target behavior, cross-modal association description information, temporal continuity structure information, and basic semantic identifiers of content;

[0073] The action state time sequence transition vector set includes descriptions of action start-end correlation, descriptions of state change trends, and action integrity determination information;

[0074] The causal logic topology association index table includes event triggering relationships, event result correspondence relationships, and logical dependency path identifiers.

[0075] The hierarchical audiovisual saliency weight matrix includes saliency strength classification information, node importance differentiation information, and hierarchical priority labeling information;

[0076] The intelligent logic-edited video sequence includes an event mainline segment structure, content coherence expression units, and semantically consistent output content.

[0077] Referring to Figures 1-2, step S1 is performed as follows:

[0078] The original video frame image sequence is acquired, the pixel matrix of the single frame image is traversed and the pixel distribution set of the target object is extracted, the plane horizontal and vertical coordinate values of the geometric center are calculated based on the extreme values of the set boundary, the intensity distribution frequency values of the pixels in each color channel are counted, and the coordinate values and frequency values are concatenated by the row dimension according to the frame number to establish the target space visual attribute data group.

[0079] Take as an example an acquired original video frame image sequence with a resolution of 1920x1080 pixels that contains a moving red sphere. The sequence number of the video frame and the single-frame image data are read, and an empty set is initialized to store the target pixel coordinates, together with three count arrays of length 256 to record color intensity. A row traversal variable is incremented from 1 to 1080 and a column traversal variable is incremented from 1 to 1920, so that each coordinate is visited in turn. The system reads the red, green, and blue channel values of each pixel, sets the red channel threshold to 200 and the green and blue channel thresholds to 50, and applies a pixel filtering judgment based on these thresholds: if the red channel value exceeds 200 and the green and blue channel values are both below 50, the pixel is determined to be part of the target red sphere, and its coordinates are stored in the target pixel coordinate set. After the traversal is complete, the x-coordinates of all coordinates in the set are sorted in ascending order; the first value is taken as the minimum horizontal boundary value and the last value as the maximum horizontal boundary value. The y-coordinates are sorted in the same way; the first value is taken as the minimum vertical boundary value and the last value as the maximum vertical boundary value. The geometric center calculation formula is then applied, as follows:

[0080] $x_c = \dfrac{x_{\min} + x_{\max}}{2}, \quad y_c = \dfrac{y_{\min} + y_{\max}}{2}$,

[0081] where $(x_c, y_c)$ are the coordinates of the geometric center, $x_{\min}$ and $y_{\min}$ are the boundary minimum values, and $x_{\max}$ and $y_{\max}$ are the boundary maximum values;

[0082] Assumption:

[0083] Minimum value of horizontal boundary: $x_{\min} = 800$;

[0084] Maximum value of horizontal boundary: $x_{\max} = 1000$;

[0085] Minimum value of vertical boundary: $y_{\min} = 400$;

[0086] Maximum value of vertical boundary: $y_{\max} = 600$;

[0087] Substitute the boundary values of the horizontal and vertical coordinates into the formula to calculate the horizontal and vertical coordinates of the geometric center, respectively.

[0088] Geometric center abscissa: $x_c = \dfrac{800 + 1000}{2} = 900$;

[0089] Geometric center ordinate: $y_c = \dfrac{400 + 600}{2} = 500$;

[0090] The results show that the abscissa of the geometric center is 900 and the ordinate of the geometric center is 500;

[0091] The advantage of the formula is that the arithmetic mean of the boundary extrema replaces a moment calculation over all pixels, which significantly reduces the computational cost of image processing and improves the real-time performance of the system on high-frame-rate video while maintaining positioning accuracy.

[0092] Next, the pixels in the set are traversed again, and the red-channel intensity value of each pixel is read. For example, if the intensity value is 255, the entry at index 255 of the red count array is incremented by 1. After the counting is complete, the frequency calculation logic is executed: the count for a given intensity level is divided by the total number of pixels in the set. If the count at index 255 is 500 and the set contains 1000 pixels, the distribution frequency of that intensity is 0.5. The same statistical operation is performed on the green and blue channels. The calculated horizontal and vertical coordinate values, 900 and 500, are placed at the beginning of the feature vector, followed in sequence by the 256 frequency values of each of the red, green, and blue channels. The feature vector is associated with the current frame number and stored to establish the target space visual attribute data set.
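
The construction of this feature vector may be sketched as follows. This is a minimal sketch; the function names and the representation of pixels as `(r, g, b)` tuples are assumptions for illustration.

```python
def channel_frequencies(values):
    """Relative frequency of each of the 256 intensity levels in one channel."""
    counts = [0] * 256
    for v in values:
        counts[v] += 1
    total = len(values)
    if total == 0:
        return [0.0] * 256  # assumption: an empty target set yields all-zero frequencies
    return [c / total for c in counts]


def build_feature_vector(center, pixels):
    """Center coordinates first, then 256 frequencies each for R, G and B."""
    vector = [center[0], center[1]]
    for channel in range(3):
        vector.extend(channel_frequencies([p[channel] for p in pixels]))
    return vector
```

The resulting vector has 2 + 3 x 256 = 770 entries per frame.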

[0093] The audio track data stream accompanying the original video is acquired. A discrete Fourier transform is performed on the time-domain acoustic wave sampling points to map them to the frequency domain. The squared magnitude of each frequency component is calculated to characterize the signal power. The power values are accumulated along the frequency axis to obtain the energy spectral density value, and an audio frequency domain energy feature sequence is generated.

[0094] For the audio track data stream accompanying the original video, with a sampling rate of 44100 Hz, the time-domain window size is set to 1024 sampling points. The data stream is segmented with a window step of 512 sampling points to obtain the values of the 1024 time-domain acoustic sampling points within the specified time window. A discrete Fourier transform is performed on this set of discrete data to convert the time-domain signal into a frequency-domain complex sequence. The complex number corresponding to the tenth component on the frequency axis is obtained. Assuming that this complex number has a real part of 3 and an imaginary part of 4, the signal power calculation logic is executed: the real part and the imaginary part are each squared and the squares are summed (9 + 16), giving a value of 25 that represents the signal power of the frequency component. All frequency components within the window, from 0 Hz to the cutoff frequency of 22050 Hz, are traversed, and the above power calculation is repeated to obtain the corresponding set of power values. The energy spectral density accumulator is initialized to 0, and the energy spectral density integration logic is executed: the power values of all frequency components in the set are added to the accumulator in sequence. If the final accumulated result is 50000, then 50000 is the energy spectral density value of the current time window. The values of successive windows are arranged in chronological order to generate the audio frequency domain energy feature sequence.
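
The windowed power accumulation may be sketched as follows. A direct (quadratic-time) discrete Fourier transform is used here only to keep the sketch self-contained; a practical system would presumably use a fast Fourier transform. The function names are assumptions for illustration.

```python
import cmath


def power(component):
    """Squared magnitude of one complex frequency component."""
    return component.real ** 2 + component.imag ** 2


def dft(samples):
    """Direct discrete Fourier transform of a list of real samples."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def window_energy(samples):
    """Accumulate power over bins 0..N/2, i.e. 0 Hz up to half the sampling rate."""
    spectrum = dft(samples)
    half = len(samples) // 2
    return sum(power(c) for c in spectrum[: half + 1])


def energy_sequence(stream, window=1024, hop=512):
    """One energy value per window, in chronological order."""
    return [
        window_energy(stream[start:start + window])
        for start in range(0, len(stream) - window + 1, hop)
    ]
```

For the component with real part 3 and imaginary part 4, `power(complex(3, 4))` gives 25, as in the example above.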

[0095] The target space visual attribute data set and audio frequency domain energy feature sequence are called, the absolute timestamp value in the video metadata is parsed and a time index is constructed, and the visual coordinate value, color distribution value and auditory energy value are retrieved and aligned based on the time index. The data of each dimension are combined to obtain a multimodal spatiotemporal basic feature set.

[0096] For the target space visual attribute data set and the audio frequency domain energy feature sequence, the frame rate recorded in the video metadata is read as 30 frames per second. The absolute timestamp value corresponding to the specified video frame is extracted, assumed here to be 0.333 seconds. The start timestamp value corresponding to the specified data point in the audio frequency domain energy feature sequence is read, assumed here to be 0.330 seconds. The allowable deviation threshold for time alignment is set to 0.02 seconds; this threshold is chosen with reference to half of the video frame duration of 0.033 seconds, to prevent misalignment. The time synchronization judgment logic is executed: the absolute value of the difference between the video timestamp and the audio timestamp is calculated, yielding 0.003 seconds, and this difference is compared with the deviation threshold of 0.02 seconds. If the difference is less than the threshold, the two are determined to belong to the same spatiotemporal segment. The geometric center coordinates (900, 500) and the color distribution frequency sequence of the corresponding frame are extracted from the video data set, and the energy spectral density (50000) of the corresponding data point is extracted from the audio sequence. A multimodal feature vector container is constructed: the visual coordinates and color distribution values form the first part of the vector, and the auditory energy value (50000) is concatenated to the end of the vector to form a joint feature vector with expanded dimensions. The above retrieval, difference calculation, judgment, and concatenation operations are repeated for each frame in the video sequence and its adjacent audio window to obtain the multimodal spatiotemporal basic feature set.
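
The alignment and concatenation step may be sketched as follows. Selecting the audio window whose start time is nearest to the frame timestamp is an assumption of this sketch (the text refers only to the "adjacent" window), as are the function name and the tuple layouts.

```python
def align_features(video_frames, audio_windows, tolerance=0.02):
    """Join each frame's visual features with the energy of its nearest audio window.

    video_frames: list of (timestamp_seconds, visual_feature_list)
    audio_windows: list of (start_timestamp_seconds, energy_value)
    Frames with no audio window closer than `tolerance` are skipped.
    """
    if not audio_windows:
        return []
    joint = []
    for v_time, visual in video_frames:
        # Assumption: the "adjacent" window is the one with the closest start time
        a_time, energy = min(audio_windows, key=lambda w: abs(w[0] - v_time))
        if abs(v_time - a_time) < tolerance:
            joint.append(list(visual) + [energy])
    return joint
```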

[0097] Referring to Figure 1, the specific steps of S2 are as follows:

[0098] The multimodal spatiotemporal basic feature set is called to extract the target center coordinate sequence of each frame, calculate the Euclidean distance between the coordinates of adjacent frames, compare the distance value with the preset static judgment threshold, filter the frame index with the distance value greater than the static judgment threshold and sort them by time to generate a non-static continuous dynamic frame sequence.

[0099] The multimodal spatiotemporal basic feature set is invoked, and an empty integer list is initialized to store the time indices of non-static frames. A loop variable is set to iterate through the sequence numbers of all video frames in the feature set. For the frame currently being iterated, the planar geometric center coordinates of the target object are parsed from the feature set, assumed here to be an x-coordinate of 900 pixels and a y-coordinate of 500 pixels. The geometric center coordinates of the next adjacent frame are parsed at the same time, assumed here to be an x-coordinate of 903 pixels and a y-coordinate of 504 pixels. The Euclidean distance calculation logic is then executed: the differences in the x-coordinates and in the y-coordinates are calculated separately, the squares of the two differences are added together, and the square root is taken. Specifically, 903 minus 900 equals 3, and 504 minus 500 equals 4; squaring and adding gives 25, and taking the square root gives a Euclidean distance of 5. A static judgment threshold parameter is set, calibrated according to the pixel density of the acquisition device and the amplitude of environmental micro-vibrations. Assuming the threshold is set to 2.5 pixels, the calculated Euclidean distance of 5 is compared with the static judgment threshold of 2.5. If the distance is greater than the threshold, it is determined that the target in the frame has undergone substantial displacement, and the index value of the frame is stored in the integer list. If the distance is less than or equal to the threshold, the frame is regarded as background noise or a static state and is ignored. After all frames have been traversed, the index values in the list are sorted in ascending order of timestamp to generate the non-static continuous dynamic frame sequence.
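
The static-frame filtering may be sketched as follows. The function name and the use of list positions as frame indices are assumptions for illustration.

```python
import math


def dynamic_frame_indices(centers, static_threshold=2.5):
    """Indices of frames whose target moves more than the threshold by the next frame.

    centers: list of (x, y) geometric centers, ordered by frame index.
    """
    moving = []
    for i in range(len(centers) - 1):
        (x0, y0), (x1, y1) = centers[i], centers[i + 1]
        distance = math.hypot(x1 - x0, y1 - y0)  # Euclidean distance
        if distance > static_threshold:
            moving.append(i)
    return moving  # already in ascending time order
```

For the centers (900, 500) and (903, 504) the distance is 5, which exceeds 2.5, so the first frame is kept.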

[0100] Based on the non-static continuous dynamic frame sequence, the displacement vectors of adjacent frames are calculated. The cosine value of the angle between continuous vectors is calculated using the vector dot product formula. The cosine value is compared with the value of the preset direction consistency test threshold. The frame interval where the cosine value is continuously higher than the test threshold is identified. The frame index and associated vector direction data within the interval are extracted to establish a set of highly consistent action process segments.

[0101] Based on the non-static continuous dynamic frame sequence, a sliding window is used to read adjacent time indices in sequence, extracting the target center coordinates corresponding to the current index and the coordinates corresponding to the next index. The displacement vector at the current moment is calculated by subtracting the corresponding coordinate components, and the displacement vector for the next pair of adjacent moments is calculated in the same way. Assuming the horizontal component is 4 and the vertical component is 3 at the current moment, and the horizontal component is 3 and the vertical component is 4 at the adjacent moment, the magnitude of each vector is calculated as the square root of the sum of the squares of its components, giving 5 in both cases. The cosine of the angle between the vectors is then calculated by dividing the dot product of the two vectors by the product of the two magnitudes. Specifically, the product of the horizontal components (12) and the product of the vertical components (12) are added to give a dot product of 24, which is divided by the product of the magnitudes, 25, to obtain a cosine value of 0.96 for the angle between the consecutive vectors. A direction consistency test threshold is set according to the tolerance range for the smoothness of the motion trajectory. Assuming the corresponding angle is 30 degrees, the cosine threshold is 0.866. The calculated cosine value of 0.96 is compared with the test threshold of 0.866. If the cosine value remains greater than the threshold over consecutive frames, these frames are determined to belong to the same continuous motion process. The start index and end index of the interval over which the cosine value remains above the threshold are identified and recorded. Assuming the interval runs from index 10 to index 50, the frame sequence and the corresponding vector direction data within this index interval are extracted to establish a set of highly consistent action process segments.
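
The direction consistency test may be sketched as follows. The function names, the treatment of a zero-length displacement as inconsistent, and the reporting of intervals as positions in the input list are assumptions of this sketch.

```python
import math


def cosine_between(u, v):
    """Cosine of the angle between two 2-D displacement vectors."""
    norm = math.hypot(*u) * math.hypot(*v)
    if norm == 0:
        return 0.0  # assumption: a zero displacement is treated as not consistent
    return (u[0] * v[0] + u[1] * v[1]) / norm


def consistent_runs(centers, threshold=0.866):
    """Return (start, end) position intervals of direction-consistent motion.

    centers: list of (x, y) for the dynamic frames, in time order.
    Each cosine compares displacement i with displacement i + 1, so a run of
    cosines from a to b covers the positions a .. b + 2.
    """
    disp = [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(centers, centers[1:])]
    flags = [cosine_between(u, v) > threshold for u, v in zip(disp, disp[1:])]
    runs, start = [], None
    for i, ok in enumerate(flags + [False]):  # sentinel closes a trailing run
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            runs.append((start, i + 1))
            start = None
    return runs
```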

[0102] For a set of highly consistent action process segments, the indices of adjacent frames before the first frame and after the last frame of the segment are locked. Based on the indices, the corresponding visual and auditory feature data are retrieved from the multimodal spatiotemporal basic feature set. The features of the preceding, current, and following segments are concatenated to obtain a set of action state temporal transition vectors.

[0103] For each independent segment in the set of highly consistent action process segments, the first frame index (10) and the last frame index (50) of the segment are located. The number of preceding and following frames sampled to capture the action context is set. Assuming a sampling number of 3 frames, the index sequence preceding the first frame is calculated as 7, 8, and 9, and the index sequence following the last frame is calculated as 51, 52, and 53. Based on these index values, data retrieval requests are issued one by one to the multimodal spatiotemporal basic feature set, and the visual geometric center coordinates, color distribution frequency values, and audio frequency domain energy spectral density value corresponding to each index are extracted. For example, the coordinates (880, 480) and the audio energy value 48000 are read for index 7. All feature data from indices 10 to 50 are read, and the feature data from indices 51 to 53 are read. The feature sequence splicing operation is then performed in chronological order: the feature vectors of the preceding 3 frames form the head of the matrix, the feature vectors of the 41 frames of the current segment form the middle, and the feature vectors of the following 3 frames form the tail. The data are spliced in order along the row dimension to form a complete feature sequence containing 47 time steps. This operation is performed on all segments in the segment set to obtain the action state temporal transition vector set.
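
The context splicing may be sketched as follows. The function name, the dictionary keyed by frame index, and the skipping of indices that are absent from the feature set (for example near the start of the video) are assumptions of this sketch.

```python
def splice_with_context(features, first, last, context=3):
    """Concatenate preceding, current and following feature rows in time order.

    features: dict mapping frame index -> feature vector.
    Assumption: indices missing from `features` are skipped.
    """
    before = range(first - context, first)
    current = range(first, last + 1)
    after = range(last + 1, last + 1 + context)
    indices = [i for i in (*before, *current, *after) if i in features]
    return [features[i] for i in indices]
```

For a segment spanning indices 10 to 50 with 3 context frames on each side, the result has 3 + 41 + 3 = 47 rows.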

[0104] Referring to Figure 1, the specific steps of S3 are as follows:

[0105] Based on the action state time sequence transition vector set, the timestamp values of the start and action nodes are extracted, the absolute time difference value is calculated, the difference value is compared with the preset causal judgment threshold, and the node combination with the difference value less than the causal judgment threshold is selected to generate the trigger time sequence associated node pair.

[0106] Based on the action state temporal transition vector set, a temporary container is initialized to store potential causal relationship nodes. A loop traversal pointer is set to sequentially read a specified action sequence and its associated subsequent state nodes in the vector set. The action start timestamp and action end or state change timestamp are extracted from the feature header of the action sequence and the feature of the potential result node, respectively. Assuming the read values are 12.5 seconds and 12.8 seconds in the video, respectively, a temporal difference operation is then performed to calculate the absolute value of the difference between the end timestamp and the start timestamp, yielding a result of 0.3 seconds. A causal determination threshold parameter is set, which is based on the statistical value of the average delay time from action to result in the physical world. Set a threshold, for example, 1.5 seconds. Compare the calculated difference value of 0.3 seconds with the causal determination threshold of 1.5 seconds. Execute the threshold filtering logic. If the difference value is less than the threshold, it indicates that the two time points have a close causal relationship in terms of time sequence. Then, form a tuple with the index ID of the current action sequence and the index ID of the result node. If the difference value is greater than or equal to the threshold, treat it as an independent event and exclude it from the current associated process. Traverse all possible combinations of time nodes in the vector set and repeat the above extraction, difference calculation and threshold filtering steps. Store all tuples that meet the conditions in a list in chronological order to generate triggering time sequence associated node pairs.
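
The temporal threshold filtering may be sketched as follows. The function name and the tuple layouts for action sequences and result nodes are assumptions for illustration.

```python
def causal_pairs(actions, results, threshold=1.5):
    """Pair each action with result nodes that occur within `threshold` seconds.

    actions: list of (action_id, start_timestamp_seconds)
    results: list of (node_id, state_change_timestamp_seconds)
    Returns (action_id, node_id) tuples in chronological order of the action.
    """
    pairs = []
    for a_id, a_time in actions:
        for r_id, r_time in results:
            if abs(r_time - a_time) < threshold:
                pairs.append((a_time, a_id, r_id))
    pairs.sort()  # chronological by action start time
    return [(a_id, r_id) for _, a_id, r_id in pairs]
```

With an action at 12.5 seconds and a state change at 12.8 seconds, the difference of 0.3 seconds is below 1.5 seconds, so the pair is retained.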

[0107] For the triggering time-series associated node pairs, retrieve the spatial boundary and semantic features of the last frame of the action and the termination node, calculate the region intersection-union ratio and the vector cosine distance, compare the calculated value with the overlap rate separation threshold and the distance continuity threshold, retain the connection relationship that meets the dual threshold conditions, and obtain the result logical verification candidate set.

[0108] For each index pair among the triggering time-series associated node pairs, the spatial boundary coordinates and a 128-dimensional semantic feature vector of the last frame of the action segment are retrieved from the multimodal feature library. The spatial boundary coordinates and semantic feature vector at the starting time of the result node are retrieved at the same time. Here, the spatial boundary of the last frame is assumed to span the coordinates (100, 100) to (200, 200), and the spatial boundary at the starting time of the result node is assumed to span (150, 150) to (250, 250). The intersection-over-union (IoU) calculation logic is then executed. First, the overlapping area of the two rectangles is calculated as the product of a width of 50 and a height of 50, giving 2500. Then the union area of the two rectangles is calculated as the sum of the areas of the two rectangles minus the overlapping area, giving 17500. Finally, the overlapping area is divided by the union area to obtain a value of 0.14. At the same time, the cosine distance calculation logic is executed: the cosine similarity of the two semantic vectors is calculated and subtracted from 1 to obtain a distance value of 0.2. The overlap rate separation threshold is set to 0.05 and the distance continuity threshold is set to 0.3. The calculated IoU value of 0.14 is greater than the threshold of 0.05, and the cosine distance value of 0.2 is less than the threshold of 0.3. Only when both comparisons are satisfied, that is, when there is both spatial interaction and semantic continuity, is the node pair determined to have a strong logical association. The node pairs that pass this dual verification, together with their intermediate calculation parameters, are retained to obtain the result logical verification candidate set.
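
The dual-threshold verification may be sketched as follows. The function names and the box layout `(x1, y1, x2, y2)` are assumptions for illustration, and short vectors stand in for the 128-dimensional semantic features.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(w, 0) * max(h, 0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0


def cosine_distance(u, v):
    """One minus the cosine similarity of two equal-length vectors."""
    dot = sum(p * q for p, q in zip(u, v))
    norm = math.sqrt(sum(p * p for p in u)) * math.sqrt(sum(q * q for q in v))
    if norm == 0:
        return 1.0  # assumption: a zero vector is treated as maximally distant
    return 1.0 - dot / norm


def passes_dual_check(box_a, box_b, vec_a, vec_b, iou_min=0.05, dist_max=0.3):
    """True when the boxes overlap enough and the semantics are close enough."""
    return iou(box_a, box_b) > iou_min and cosine_distance(vec_a, vec_b) < dist_max
```

For the two boxes in the example, `iou` returns 2500 / 17500, approximately 0.14.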

[0109] Based on the result logical verification candidate set, the chain attributes are parsed, and the spatiotemporal nodes and connection relationships are mapped to graph structure vertices and directed edges, respectively. An index relationship reflecting the evolution path of the nodes is established, and a causal logic topological association index table is constructed.

[0110] For the data items in the result logical verification candidate set, a directed graph data structure object is initialized, and each verified node pair in the candidate set is traversed. The attribute information of the action node, which serves as the source node, is parsed, including its time interval, spatial center trajectory, and semantic category label; the node is mapped to a vertex in the graph structure and allocated a unique memory address pointer. The attribute information of the result node, which serves as the target node, is parsed and likewise mapped to a vertex in the graph structure. A directed edge object is created to connect the source node to the target node. The weight aggregation calculation logic is executed: the IoU value and the cosine distance value calculated in the previous steps are weighted and normalized to obtain a weight value of 0.85. In the adjacency matrix of the graph database, the cell whose row is the source node index and whose column is the target node index is set to this weight value. If there are multiple connection paths, a linked list structure is built to store the attributes of the multiple edges. After all candidate node pairs have been mapped, a connectivity traversal is performed on the graph structure, and a complete evolution path index from the initial action to the final result is established. All vertex information, edge weights, and topological connection relationships are serialized and stored, and the causal logic topological association index table is constructed.
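
The graph construction and path traversal may be sketched as follows. Because the text does not give the weighting and normalization formula, the edge weight is taken as an input here; the adjacency-list dictionary (instead of an adjacency matrix) and the function names are likewise assumptions of this sketch.

```python
def build_causal_graph(candidates):
    """Build an adjacency list from (source_id, target_id, weight) tuples."""
    graph = {}
    for src, dst, weight in candidates:
        graph.setdefault(src, []).append((dst, weight))
        graph.setdefault(dst, [])  # make sure the target exists as a vertex
    return graph


def evolution_paths(graph):
    """List paths from every vertex with no incoming edge to every terminal vertex."""
    targets = {dst for edges in graph.values() for dst, _ in edges}
    roots = [node for node in graph if node not in targets]
    paths = []

    def walk(node, path):
        if not graph[node]:
            paths.append(path)
            return
        for dst, _ in graph[node]:
            if dst not in path:  # guard against cycles
                walk(dst, path + [dst])

    for root in roots:
        walk(root, [root])
    return paths
```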

[0111] Referring to Figure 1, the specific steps of S4 are as follows:

[0112] For the causal logic topological association index table node, the multimodal spatiotemporal basic feature set is called to calculate the ratio of the bounding box area to the full frame image area and the histogram channel variance as the contrast value. The product of the proportion and the contrast, as well as the difference between the energy spectral density and the background noise, are calculated to obtain the visual and auditory coefficients respectively, and generate the audiovisual independent perceptual coefficient vector.

[0113] For each graph structure node recorded in the causal logic topological association index table, a traversal program is initiated to access the metadata associated with the node in sequence. From the multimodal spatiotemporal basic feature set, the video frame number corresponding to the node and the spatial bounding box coordinates of the target object within the frame are parsed, and the top-left and bottom-right coordinates are read from the coordinate parameters. Assuming the top-left coordinates are (200, 300) and the bottom-right coordinates are (600, 500), the area ratio calculation logic is executed. First, the product of the width and height of the target region is calculated to obtain a target area of 80,000 square pixels. Then the total area of 2,073,600 square pixels is calculated from the full-frame resolution. The target area is divided by the full-frame area to obtain a ratio of 0.0386. Next, the image data in the feature set is called, and the pixel matrix within the bounding box is extracted. The red, green, and blue color channels are separated, and the arithmetic mean of the brightness of all pixels in each channel is calculated. For each pixel, the squared difference between its brightness and the mean is calculated; all squared differences are summed and divided by the total number of pixels to obtain the variance of each channel. The contrast calculation logic is then executed, taking the average of the three channel variances as the histogram channel variance contrast value. Assuming the histogram channel variance contrast value is 1500, the visual coefficient calculation logic is executed, multiplying the area ratio by the contrast value (0.0386 multiplied by 1500) to obtain a visual coefficient of 57.9. Subsequently, based on the timestamp information associated with the node, the corresponding energy spectral density value of 65000 is retrieved from the audio feature sequence. Assuming the background noise baseline parameter is 5000, the auditory coefficient calculation logic is executed, calculating the difference between the energy spectral density and the background noise to obtain 60000. Finally, the calculated visual coefficient of 57.9 and the auditory coefficient of 60000 are combined in sequence to generate the audiovisual independent perceptual coefficient vector.
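
The visual and auditory coefficient calculations may be sketched as follows. The function names and the box layout `(x1, y1, x2, y2)` are assumptions for illustration.

```python
def area_ratio(box, frame_size=(1920, 1080)):
    """Bounding-box area divided by the full-frame area."""
    x1, y1, x2, y2 = box
    return ((x2 - x1) * (y2 - y1)) / (frame_size[0] * frame_size[1])


def channel_variance_contrast(pixels):
    """Mean of the per-channel population variances of (r, g, b) pixels."""
    n = len(pixels)
    variances = []
    for channel in range(3):
        values = [p[channel] for p in pixels]
        mean = sum(values) / n
        variances.append(sum((v - mean) ** 2 for v in values) / n)
    return sum(variances) / 3


def perceptual_vector(ratio, contrast, energy, noise_floor):
    """(visual coefficient, auditory coefficient) for one node."""
    return (ratio * contrast, energy - noise_floor)
```

With a ratio of 0.0386, a contrast of 1500, an energy of 65000, and a noise baseline of 5000, the vector is approximately (57.9, 60000).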

[0114] Based on the independent visual and auditory perception coefficient vector, the visual and auditory coefficients of each node are linearly summed to obtain a comprehensive score representing the significance of the node. The score data are arranged in the order of node index to establish a comprehensive significance score sequence.

[0115] Based on the audiovisual independent perceptual coefficient vector, the visual and auditory coefficients of each node are linearly summed. A floating-point array is initialized to store the calculation results. Visual and auditory weight coefficients are set to balance the difference in magnitude between the two modalities and are adjusted according to the application scenario. For example, considering that the visual coefficient is on the order of tens and the auditory coefficient is on the order of tens of thousands, the visual weight is set to 100 to amplify the numerical magnitude of the visual features, and the auditory weight is set to 0.01 to compress the numerical magnitude of the auditory features. The vector generated in the previous steps is read, and the linear weighted fusion calculation logic is executed, multiplying the visual coefficient by the visual weight and the auditory coefficient by the auditory weight and summing the two products. Specifically, 57.9 is multiplied by 100 to give 5790, and 60000 is multiplied by 0.01 to give 600; the sum gives a comprehensive score of 6390. This weighted summation is repeated for each node in the index table. For example, a comprehensive score of 4500 is calculated for another node. All calculated comprehensive scores are arranged according to the original ID order of the nodes in the topological index table, ensuring that each array index corresponds one-to-one with a node ID. After the traversal, the array structure containing the scores of all nodes is finalized to establish the comprehensive saliency score sequence.
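
The weighted fusion may be sketched as follows. The function name is an assumption, and the coefficient pair (40, 50000) used to reproduce the second score of 4500 is a hypothetical input chosen for illustration; the text gives only the resulting score.

```python
def score_sequence(vectors, w_visual=100.0, w_auditory=0.01):
    """Comprehensive score per node; list position corresponds to node ID.

    vectors: list of (visual_coefficient, auditory_coefficient), ordered by node ID.
    """
    return [visual * w_visual + auditory * w_auditory for visual, auditory in vectors]


# Node 0 uses the values from the text; node 1 uses hypothetical coefficients.
scores = score_sequence([(57.9, 60000), (40.0, 50000)])
```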

[0116] Call the comprehensive saliency score sequence and the causal logic topological association index table, calculate the absolute value of the score difference between the topological connection node pairs, compare the absolute value with the preset hierarchical isolation threshold, mark the parent-child attributes of the connection relationship that meets the threshold condition, and obtain the hierarchical audiovisual saliency weight matrix.

[0117] The comprehensive saliency score sequence and the causal logic topological association index table are invoked. A two-dimensional matrix structure is initialized to store hierarchical weight information. Each directed edge in the index table is traversed, and the start node and end node of the current edge are identified. Assuming the start node ID is 0 and the end node ID is 1, the comprehensive score of the start node (6390) and the comprehensive score of the end node (4500) are extracted from the score sequence by index addressing. The saliency hierarchy difference calculation logic is executed, calculating the absolute value of the difference between the two scores, which gives 1890. The hierarchical isolation threshold parameter is set to 1000, a criterion chosen according to the statistical distribution of the score sequence. The calculated absolute value of 1890 is compared with the preset hierarchical isolation threshold of 1000. If the absolute value is greater than the threshold, it is determined that there is a significant hierarchical span between the two nodes. The parent-child relationship is determined according to the scores: the node with the higher score is marked with the parent attribute and the node with the lower score is marked with the child attribute. The difference of 1890 is written to the corresponding coordinate position of the weight matrix as the weight value, and the parent-child hierarchical identifier is added. If the absolute value is less than or equal to the threshold, only the difference is written and no hierarchical attribute is marked, and the connection is regarded as a same-level association. After all connection relationships have been traversed and the matrix has been filled, the hierarchical audiovisual saliency weight matrix is obtained.
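
The matrix filling and parent-child marking may be sketched as follows. The function name and the separate dictionary recording which node is the parent of each marked edge are assumptions of this sketch; the scores of 4400 used in the check are hypothetical.

```python
def hierarchy_matrix(scores, edges, threshold=1000.0):
    """Fill score differences per edge and mark parent nodes across the threshold.

    scores: list of comprehensive scores indexed by node ID
    edges: list of (start_node_id, end_node_id) directed edges
    Returns (weights, parents) where parents maps an edge to its parent node ID.
    """
    n = len(scores)
    weights = [[0.0] * n for _ in range(n)]
    parents = {}
    for src, dst in edges:
        diff = abs(scores[src] - scores[dst])
        weights[src][dst] = diff
        if diff > threshold:
            # The higher-scoring node is the parent, the other the child
            parents[(src, dst)] = src if scores[src] > scores[dst] else dst
    return weights, parents
```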

[0118] Referring to Figure 1 and Figure 3, the specific steps of S5 are as follows:

[0119] Based on the hierarchical audiovisual saliency weight matrix, the parent node hierarchical objects are traversed, the timestamp values of the mapping between the trigger connection attribute and the result connection attribute are extracted, the difference between the end timestamp and the start timestamp is calculated, and if the difference is greater than the preset minimum threshold of the editing duration, the timestamp value is retained and multiplied by the video frame rate conversion coefficient to generate the editing time sequence positioning frame index interval.

[0120] Based on the hierarchical audiovisual saliency weight matrix, a traversal pointer is initialized to scan all objects marked as parent nodes in the matrix. The unique identifier ID of the parent node object currently being traversed is read, and the metadata storage area is indexed by this ID. The start action timestamp value in the trigger connection attribute corresponding to the node and the termination state timestamp value mapped by the result connection attribute are extracted. Assuming the start action timestamp is 15.200 seconds in the video and the termination state timestamp is 22.500 seconds in the video, the editing duration calculation logic is executed, that is, the difference between the termination timestamp and the start timestamp is calculated as 7.300 seconds. The minimum threshold parameter for editing duration is set to 2.0 seconds, and the calculated time difference is compared with this minimum threshold. If the difference is less than the threshold, the segment is discarded. If the difference is greater than or equal to the threshold, the time segment is considered valid. The frame rate conversion coefficient of the original video is read as 60 frames per second. The frame index mapping logic is executed, multiplying each timestamp value by the frame rate coefficient and rounding the result. Specifically, the starting timestamp 15.200 is multiplied by 60 to obtain the starting frame index 912, and the ending timestamp 22.500 is multiplied by 60 to obtain the ending frame index 1350. A closed interval containing the starting value 912 and the ending value 1350 is constructed. The above extraction, calculation, and judgment process is repeated for all parent node objects in the matrix that meet the conditions. All generated intervals are stored in a dynamic array to generate the editing time sequence positioning frame index interval.
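
The duration check and frame index mapping may be sketched as follows. The function name is an assumption for illustration.

```python
def clip_frame_interval(start_s, end_s, fps=60, min_duration=2.0):
    """Map a (start, end) time span in seconds to a closed frame index interval.

    Returns None when the span is shorter than the minimum editing duration.
    """
    if end_s - start_s < min_duration:
        return None  # segment discarded
    # Multiply by the frame rate and round to the nearest frame index
    return (round(start_s * fps), round(end_s * fps))
```

For the span from 15.200 to 22.500 seconds at 60 frames per second, this yields the interval (912, 1350).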

[0121] The editing time sequence positioning frame index interval is called, and the original video image sequence and audio track data are read. Based on the index interval values, the start and end frame positions in the video stream and the start and end sampling point positions in the audio stream are located, and the RGB image matrix data and PCM audio sampling data in the corresponding interval are extracted. The synchronization deviation value between the image matrix timestamp and the audio sampling timestamp is detected; if the deviation value is less than the synchronization fault tolerance threshold, the data is encoded and encapsulated in the target format, and the original data stream of audiovisual segments is established.

[0122] The data of the specified interval within the editing time sequence positioning frame index interval is called. Assuming the index range is 912 to 1350, file stream read handles are established pointing to the original video image sequence file and the corresponding audio track data file stored on the hard drive. Based on the starting frame index 912, byte offset addressing is performed in the video stream to locate the storage start address of the image data of that frame, and reading continues until the end address of the image data corresponding to the ending frame index 1350. This RGB image matrix data, containing 439 frames, is loaded into the memory buffer. At the same time, based on the ratio of the audio sampling rate to the video frame rate (48000 Hz divided by 60 frames per second equals 800 sampling points per frame), the audio sampling point positioning logic is executed, calculating the starting sampling point as 729600 and the ending sampling point as 1080000, and the PCM audio sampling data in the corresponding sampling point interval is extracted. The hardware acquisition timestamp of 15.205 seconds recorded at the beginning of video frame 912 and the timestamp of 15.200 seconds associated with audio sampling point 729600 are read. The audiovisual synchronization deviation calculation logic is executed, and the absolute value of the difference between the two is calculated as 0.005 seconds. The synchronization fault tolerance threshold is set to 0.04 seconds, and the deviation value of 0.005 seconds is compared with this threshold. If the deviation value is less than the threshold, the audio and video are well synchronized: the RGB matrix is compressed and encoded using an H.264 encoder, the PCM data is compressed using an AAC encoder, and both are encapsulated into an MP4 container format. If the deviation value exceeds the threshold, the audio data is corrected by a microsecond-level time-domain shift before encapsulation. The operation is repeated for all index intervals to establish the original data stream of audiovisual segments.
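
The sampling point positioning and synchronization check may be sketched as follows. The function names are assumptions, and the integer division assumes that the audio sampling rate is an exact multiple of the frame rate, as it is in the example.

```python
def audio_sample_range(start_frame, end_frame, sample_rate=48000, fps=60):
    """Audio sample positions corresponding to a frame index interval.

    Assumption: sample_rate is an exact multiple of fps.
    """
    samples_per_frame = sample_rate // fps  # 800 in the example
    return start_frame * samples_per_frame, end_frame * samples_per_frame


def is_synchronized(video_ts, audio_ts, tolerance=0.04):
    """True when the audiovisual timestamp deviation is below the tolerance."""
    return abs(video_ts - audio_ts) < tolerance
```

For frames 912 to 1350 this gives sampling points 729600 to 1080000, and the timestamps 15.205 and 15.200 seconds differ by 0.005 seconds, which is within the 0.04 second tolerance.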

[0123] For the original data stream of audiovisual segments, perform temporal arrangement, calculate the difference value of color histogram features between the last frame and the first frame of adjacent data streams and the change amplitude of the zero-crossing rate of audio waveform. If the difference value is less than the visual abrupt threshold and the change amplitude value is less than the auditory abrupt threshold, perform linear splicing operation to establish an intelligent logic editing video sequence.

[0124] For the original audiovisual segment data stream containing multiple independent video segment files, the absolute start time in each segment's metadata is read, and a quicksort algorithm rearranges the segments in chronological order, yielding sorted preceding and succeeding segments. The color histogram feature vector of the last frame of the preceding segment is extracted; this vector contains 256 gray-level statistical values. Histogram feature difference calculation logic is performed between this vector and the color histogram feature vector of the first frame of the succeeding segment. The difference between the two vectors is calculated using the chi-square distance or the Bhattacharyya distance, with a result of 0.15 in this example, and the visual abrupt change threshold is set to 0.4. Simultaneously, the zero-crossing rate of the audio waveform in the last 50 milliseconds of the preceding segment (120 times per second) is extracted and compared with the zero-crossing rate in the first 50 milliseconds of the succeeding segment (130 times per second). The audio jump amplitude calculation logic is executed, and the absolute value of the difference between the two zero-crossing rates is calculated to be 10. The auditory abrupt change threshold is set to 30. The visual difference value of 0.15 is compared with the visual abrupt change threshold of 0.4, and the audio change amplitude of 10 is compared with the auditory abrupt change threshold of 30. If both values are below their thresholds, the two segments are continuous in audiovisual perception with no abrupt jump, and their binary data streams are spliced directly and linearly. If either condition is not met, a 0.5-second fade-in and fade-out transition is inserted at the seam for smoothing. All connection points between adjacent segments are traversed and processed in this way. Finally, a complete single video file is merged and output, establishing an intelligent logic video editing sequence.
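
The seam decision described above can be sketched, purely for illustration, as follows. The function names are hypothetical. The symmetric chi-square form with a factor of 0.5 is one common convention and is an assumption here, since the description does not fix the normalization; the zero-crossing rate is expressed in crossings per second to match the example.

```python
from typing import Sequence


def chi_square_distance(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Symmetric chi-square distance between two histograms of equal length.

    Assumed convention: 0.5 * sum((a - b)^2 / (a + b)), skipping empty bins.
    """
    total = 0.0
    for a, b in zip(h1, h2):
        if a + b > 0:
            total += (a - b) ** 2 / (a + b)
    return 0.5 * total


def zero_crossing_rate(samples: Sequence[float], sample_rate_hz: int) -> float:
    """Sign changes per second over the given window of PCM samples."""
    crossings = sum(
        1 for prev, cur in zip(samples, samples[1:]) if (prev < 0) != (cur < 0)
    )
    # Window duration is len(samples) / sample_rate_hz seconds.
    return crossings * sample_rate_hz / len(samples)


def seam_action(
    hist_difference: float,
    zcr_preceding: float,
    zcr_succeeding: float,
    visual_threshold: float = 0.4,
    auditory_threshold: float = 30.0,
) -> str:
    """Return "splice" when both differences are below their thresholds,
    otherwise "crossfade" (the 0.5-second fade transition)."""
    zcr_jump = abs(zcr_preceding - zcr_succeeding)
    if hist_difference < visual_threshold and zcr_jump < auditory_threshold:
        return "splice"
    return "crossfade"


# Values taken from the worked example in the paragraph above.
decision = seam_action(hist_difference=0.15, zcr_preceding=120.0, zcr_succeeding=130.0)
```

For the example values (histogram difference 0.15, zero-crossing rates 120 and 130), both comparisons pass and the two segments would be spliced directly.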

[0125] Referring to Figure 4, a video intelligent editing system based on multimodal semantic analysis includes:

[0126] The multimodal feature acquisition module collects video frame sequences and audio track data, extracts the bounding box center coordinates, color histograms and audio energy spectral density, associates timestamps and combines spatial audiovisual information to generate a multimodal spatiotemporal basic feature set;

[0127] The action temporal transfer module calls the multimodal spatiotemporal basic feature set, filters continuous frame sequences whose displacement modulus is greater than the static determination threshold, compares the cosine value of adjacent vector directions with the consistency check threshold, extracts action nodes and integrates start and end features to obtain the action state temporal transition vector set;

[0128] The causal topology association module calculates the difference in node timestamps based on the action state temporal transition vector set and compares it with the causal judgment threshold to establish trigger connections, compares the spatial overlap rate and the semantic feature cosine distance with the separation threshold and the continuity threshold, respectively, to establish result connections, and indexes and arranges the connected nodes to establish a causal logical topology association index table;

[0129] The audiovisual saliency weighting module calculates, for the nodes of the causal logic topological association index table, the product of the bounding box proportion and the histogram contrast and the difference between the energy spectral density and the background noise to obtain a saliency score, and then compares the score difference with the hierarchical isolation threshold to generate a hierarchical audiovisual saliency weighting matrix;

[0130] The intelligent logic editing module traverses the parent node hierarchy based on the hierarchical audiovisual saliency weight matrix, locks the start and end frame positions of the editing, and performs cropping and splicing on the original video images and audio tracks to establish an intelligent logic edited video sequence.
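
As a non-limiting illustration of the filtering performed by the action temporal transfer module, the following sketch keeps only those displacement steps whose modulus exceeds the static determination threshold and whose direction agrees with the previous step. The function names and the threshold values used in the usage lines are hypothetical; the description does not specify concrete values for these thresholds.

```python
import math

Point = tuple[float, float]
Vector = tuple[float, float]


def displacement_vectors(centers: list[Point]) -> list[Vector]:
    """Change of the target center coordinates between adjacent frames."""
    return [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(centers, centers[1:])]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine of the angle between two vectors via the dot product formula."""
    norm_u, norm_v = math.hypot(*u), math.hypot(*v)
    if norm_u == 0 or norm_v == 0:
        return 0.0  # Direction is undefined for a zero vector.
    return (u[0] * v[0] + u[1] * v[1]) / (norm_u * norm_v)


def consistent_motion_steps(
    centers: list[Point], static_threshold: float, consistency_threshold: float
) -> list[int]:
    """Indices i of steps (frame i to frame i + 1) that are non-static and
    whose direction is consistent with the preceding non-static step."""
    vectors = displacement_vectors(centers)
    kept = []
    for i in range(1, len(vectors)):
        prev, cur = vectors[i - 1], vectors[i]
        moving = (
            math.hypot(*prev) > static_threshold
            and math.hypot(*cur) > static_threshold
        )
        if moving and cosine(prev, cur) > consistency_threshold:
            kept.append(i)
    return kept


# Hypothetical track: static, then steady motion to the right, then static.
track = [(0, 0), (0, 0), (5, 0), (10, 0), (15, 1), (15, 1)]
steps = consistent_motion_steps(track, static_threshold=1.0, consistency_threshold=0.9)
```

Contiguous runs of the returned indices would correspond to the highly consistent action process segments from which action nodes are extracted.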

[0131] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in the present invention should be included within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.

Claims

1. A video intelligent editing method based on multimodal semantic analysis, characterized in that it includes the following steps:
S1: Collect audiovisual data, simultaneously extract the target center coordinates, color histogram and frequency domain energy spectral density, associate timestamps and combine multi-dimensional information to generate a multimodal spatiotemporal basic feature set;
S2: Call the multimodal spatiotemporal basic feature set, filter for sequences whose displacement vectors and direction cosines satisfy the thresholds, retain action nodes, and integrate the preceding start and subsequent end state features to obtain the action state temporal transition vector set;
S3: Based on the action state temporal transition vector set, calculate the timestamp difference to construct the trigger connection, and calculate the spatial overlap rate and semantic distance to construct the result connection; after the threshold conditions are met, generate the causal logic topological association index table;
S4: For the causal logic topological association index table, obtain the saliency score by summing the visual bounding box and the auditory energy spectral coefficient, compare the difference in node scores with the threshold to mark the parent-child relationship, and generate a hierarchical audiovisual saliency weight matrix;
S5: Traverse the parent nodes according to the hierarchical audiovisual saliency weight matrix, lock the start and end positions of the editing for the trigger and result connections, perform cropping and splicing on the original audiovisual data, and establish an intelligent logic editing video sequence.

2. The video intelligent editing method based on multimodal semantic analysis according to claim 1, characterized in that the multimodal spatiotemporal basic feature set includes overall representation information of target behavior, cross-modal association description information, temporal continuity structure information, and basic semantic identifiers of content; the action state temporal transition vector set includes a description of the action start-end correlation, a description of the state change trend, and action integrity determination information; the causal logic topological association index table includes event trigger pointing relationships, event result correspondence relationships, and logical dependency path identifiers; the hierarchical audiovisual saliency weight matrix includes saliency strength classification information, node importance differentiation information, and hierarchical priority labeling information; and the intelligent logic editing video sequence includes an event mainline segment structure, content coherence expression units, and semantically consistent output content.

3. The video intelligent editing method based on multimodal semantic analysis according to claim 1, characterized in that the audiovisual data consists of a sequence of frames from the original video and an audio track, and the multi-dimensional information includes spatial, visual, auditory, and temporal information.

4. The video intelligent editing method based on multimodal semantic analysis according to claim 1, characterized in that the displacement vector is the change vector of the center coordinates of the target object in adjacent video frames, used to quantify the amount and direction of the target's movement in space; the spatial overlap rate is the region intersection-over-union ratio of the spatial bounding boxes of the last frame of the action and the termination node, used to evaluate the degree of spatial overlap between the two regions; the visual bounding box is a rectangular box that surrounds the target object in the video frame, used to define the spatial position and range of the target in the image; and the auditory energy spectral coefficient is the energy spectral density of the audio signal in the frequency domain minus the background noise, used to quantify the perceptual saliency of the audio.

5. The video intelligent editing method based on multimodal semantic analysis according to claim 1, characterized in that the steps for obtaining S1 are as follows:
The original video frame image sequence is acquired, the pixel matrix of each single frame image is traversed and the pixel distribution set of the target object is extracted, the horizontal and vertical plane coordinate values of the geometric center are calculated from the extreme values of the set boundary, the intensity distribution frequency values of the pixels in each color channel are counted, and the coordinate values and frequency values are concatenated along the row dimension by frame number to establish the target space visual attribute data set;
The audio track data stream accompanying the original video is acquired, a discrete Fourier transform is performed on the time-domain acoustic wave sampling points to map them to the frequency domain, the squared magnitude of each frequency component is calculated to characterize the signal power, and the power values are accumulated along the frequency axis to obtain the energy spectral density value and generate an audio frequency domain energy feature sequence;
The target space visual attribute data set and the audio frequency domain energy feature sequence are called, the absolute timestamp values in the video metadata are parsed and a time index is constructed, the visual coordinate values, color distribution values and auditory energy values are retrieved and aligned based on the time index, and the data of each dimension are combined to obtain a multimodal spatiotemporal basic feature set.

6. The video intelligent editing method based on multimodal semantic analysis according to claim 1, characterized in that the steps for obtaining S2 are as follows:
The multimodal spatiotemporal basic feature set is invoked to extract the target center coordinate sequence of each frame, the Euclidean distance between adjacent frame coordinates is calculated, the distance value is compared with a preset static determination threshold, and the frame indices whose distance value is greater than the static determination threshold are selected and sorted by time to generate a non-static continuous dynamic frame sequence;
Based on the non-static continuous dynamic frame sequence, the displacement vectors of adjacent frames are calculated, the cosine value of the angle between consecutive vectors is calculated using the vector dot product formula, the cosine value is compared with a preset direction consistency test threshold, the frame intervals in which the cosine value stays above the test threshold are identified, and the frame indices and associated vector direction data within those intervals are extracted to establish a set of highly consistent action process segments;
For the set of highly consistent action process segments, the indices of the frame immediately before the first frame of each segment and the frame immediately after its last frame are locked, the corresponding visual and auditory feature data are retrieved from the multimodal spatiotemporal basic feature set based on those indices, and the features of the preceding, current, and following segments are concatenated along the feature dimension to obtain the action state temporal transition vector set.

7. The video intelligent editing method based on multimodal semantic analysis according to claim 1, characterized in that the steps for obtaining S3 are as follows:
Based on the action state temporal transition vector set, the timestamp values of the start and action nodes are extracted, the absolute time difference value is calculated, the difference value is compared with the preset causal judgment threshold, and the node combinations whose difference value is less than the causal judgment threshold are selected to generate the trigger temporal association node pairs;
For the trigger temporal association node pairs, the spatial boundary and semantic features of the last frame of the action and the termination node are retrieved, the region intersection-over-union ratio and the vector cosine distance are calculated, the calculated values are compared with the overlap rate separation threshold and the distance continuity threshold, and the connection relationships that meet both threshold conditions are retained to obtain the result logical verification candidate set;
Based on the result logical verification candidate set, the chain attributes are parsed, the spatiotemporal nodes and connection relationships are mapped to graph structure vertices and directed edges, respectively, an index relationship reflecting the node evolution path is established, and a causal logic topological association index table is constructed.

8. The video intelligent editing method based on multimodal semantic analysis according to claim 1, characterized in that the steps for obtaining S4 are as follows:
For each node of the causal logic topological association index table, the multimodal spatiotemporal basic feature set is called to calculate the ratio of the bounding box area to the full frame image area as the proportion value and the histogram channel variance as the contrast value, the product of the proportion and the contrast and the difference between the energy spectral density and the background noise are calculated to obtain the visual and auditory coefficients respectively, and the audiovisual independent perception coefficient vector is generated;
Based on the audiovisual independent perception coefficient vector, the visual and auditory coefficients of each node are linearly summed to obtain a comprehensive score representing the saliency of the node, and the score data are arranged in node index order to establish a comprehensive saliency score sequence;
The comprehensive saliency score sequence and the causal logic topological association index table are called to calculate the absolute value of the score difference between each pair of topologically connected nodes, the absolute value is compared with the preset hierarchical isolation threshold, and the connection relationships that meet the threshold condition are marked with parent-child attributes to obtain the hierarchical audiovisual saliency weight matrix.

9. The video intelligent editing method based on multimodal semantic analysis according to claim 1, characterized in that the steps for obtaining S5 are as follows:
Based on the hierarchical audiovisual saliency weight matrix, the parent node hierarchical objects are traversed, the timestamp values mapped by the trigger connection attribute and the result connection attribute are extracted, the difference between the end timestamp and the start timestamp is calculated, and if the difference is greater than the preset minimum editing duration threshold, the timestamp values are retained and multiplied by the video frame rate conversion coefficient to generate the editing timing positioning frame index interval;
The editing timing positioning frame index interval is called, the original video image sequence and audio track data are read, the start and end frame positions of the video stream and the start and end sampling point positions of the audio stream are located according to the index interval value, the RGB image matrix data and PCM audio sampling data in the corresponding interval are extracted, the synchronization deviation value between the image matrix timestamp and the audio sampling timestamp is detected, and if the deviation value is less than the synchronization fault tolerance threshold, the data is format-encoded and encapsulated to establish the original audiovisual segment data stream;
For the original audiovisual segment data stream, time-series arrangement is performed, the color histogram feature difference between the last frame and the first frame of adjacent data streams and the change amplitude of the zero-crossing rate of the audio waveform are calculated, and if the difference value is less than the visual abrupt change threshold and the change amplitude is less than the auditory abrupt change threshold, a linear splicing operation is performed to establish an intelligent logic editing video sequence.

10. A video intelligent editing system based on multimodal semantic analysis, characterized in that it is used to implement the video intelligent editing method based on multimodal semantic analysis according to any one of claims 1-9, the system comprising:
The multimodal feature acquisition module, which collects video frame sequences and audio track data, extracts the bounding box center coordinates, color histograms and audio energy spectral density, associates timestamps and combines spatial audiovisual information to generate a multimodal spatiotemporal basic feature set;
The action temporal transfer module, which calls the multimodal spatiotemporal basic feature set, filters continuous frame sequences whose displacement modulus is greater than the static determination threshold, compares the cosine value of adjacent vector directions with the consistency check threshold, extracts action nodes and integrates start and end features to obtain the action state temporal transition vector set;
The causal topology association module, which calculates the difference in node timestamps based on the action state temporal transition vector set and compares it with the causal judgment threshold to establish a trigger connection, compares the spatial overlap rate and the semantic feature cosine distance with the separation threshold and the continuity threshold, respectively, to establish a result connection, and indexes and arranges the connected nodes to establish a causal logical topology association index table;
The audiovisual saliency weighting module, which calculates, for the nodes of the causal logic topological association index table, the product of the bounding box proportion and the histogram contrast and the difference between the energy spectral density and the background noise to obtain a saliency score, and then compares the score difference with the hierarchical isolation threshold to generate a hierarchical audiovisual saliency weighting matrix;
The intelligent logic editing module, which traverses the parent node hierarchy according to the hierarchical audiovisual saliency weight matrix, locks the start and end frame positions of the editing, and performs cropping and splicing on the original video image and audio track to establish an intelligent logic edited video sequence.