Audio-visual video semantic parsing method and system suitable for cross-media information retrieval

By constructing an audiovisual feature distance matrix and using a graph matching algorithm to generate a minimum distance matching sequence, the time deviation and audio track replacement problems introduced by hardware acquisition in cross-media information retrieval systems are solved, thereby improving the accuracy and anti-interference ability of retrieval.

CN122240879A · Pending · Publication Date: 2026-06-19 · SHAANXI GUOBO ZHENGTONG INFORMATION TECH CO LTD +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SHAANXI GUOBO ZHENGTONG INFORMATION TECH CO LTD
Filing Date
2026-05-19
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing cross-media information retrieval systems suffer from inaccuracies when dealing with audio-visual time discrepancies and audio track replacements introduced by hardware acquisition, making it difficult to effectively match video frames and audio slices.

Method used

By extracting video frame sequences and audio slice sequences of equal length, mapping them to a shared feature space, constructing an audiovisual feature distance matrix, and using a graph matching algorithm to generate a minimum distance matching sequence, the storage path is determined by combining the audio-visual feature matching score, thus eliminating time deviation interference caused by hardware acquisition.

Benefits of technology

It effectively eliminates minor audio-visual delays caused by hardware acquisition, accurately quantifies semantic errors caused by manual replacement of irrelevant background music, improves the accuracy of cross-media retrieval, and prevents cross-contamination of non-same-source data.

✦ Generated by Eureka AI based on patent content.

Figure CN122240879A_ABST

Abstract

This application relates to a semantic parsing method and system for audiovisual videos suitable for cross-media information retrieval. The method includes: extracting video frame sequences and audio slice sequences of equal length from the audiovisual video to be processed; mapping the video frame sequences and audio slice sequences to a shared feature space; constructing an audiovisual feature distance matrix based on the feature differences and temporal deviations between the video frame sequences and audio slice sequences; transforming the audiovisual feature distance matrix into an undirected complete bipartite graph; solving it with a graph matching algorithm to generate a minimum distance matching sequence; determining the audiovisual feature matching score based on the degree of temporal inversion of the minimum distance matching sequence and the values in the spatial cost matrix corresponding to each node of the minimum distance matching sequence; and determining the feature storage path of the video frame sequences and audio slice sequences based on the result of comparing the audiovisual feature matching score with a preset judgment benchmark. This application can improve the accuracy of cross-media retrieval.

Description

Technical Field

[0001] This application relates to the field of cross-media information retrieval technology, and in particular to an audiovisual video semantic parsing method and system applicable to cross-media information retrieval.

Background Technology

[0002] Cross-media information retrieval systems are primarily used for the management and querying of massive media assets. Their typical function is to enable cross-modal interactive queries, such as users inputting text commands to search for videos containing specific scenes. Current conventional implementations extract the video's visual and audio features separately, then concatenate them strictly by their shared time sequence number, producing a joint feature vector that is stored in the retrieval database. However, in actual media asset processing, this mechanism, which relies on absolute time sequence alignment, has shortcomings. Hardware-introduced audio-visual timing discrepancies and misalignment issues such as audio track replacement can all interfere with the accuracy of cross-media retrieval.

Summary of the Invention

[0003] To address the aforementioned technical problems, the purpose of this application is to provide a semantic parsing method and system for audiovisual video suitable for cross-media information retrieval. The specific technical solution adopted is as follows. In a first aspect, a semantic parsing method for audiovisual videos suitable for cross-media information retrieval is provided, the method comprising: extracting video frame sequences and audio slice sequences of equal length from the audiovisual video to be processed, and mapping the video frame sequences and audio slice sequences to a shared feature space; constructing an audiovisual feature distance matrix based on the feature differences and temporal deviations between the video frame sequence and the audio slice sequence; transforming the audiovisual feature distance matrix into an undirected complete bipartite graph, and solving it with a graph matching algorithm to generate a minimum distance matching sequence; determining an audio-visual feature matching score based on the degree of temporal inversion of the minimum distance matching sequence and the values in a spatial cost matrix corresponding to each node of the minimum distance matching sequence, the spatial cost matrix being determined from the feature differences between the video frame sequence and the audio slice sequence; and determining the feature storage paths of the video frame sequence and the audio slice sequence based on the result of comparing the audio-visual feature matching score with a preset judgment benchmark.

[0004] Optionally, extracting video frame sequences and audio slice sequences of equal length from the audiovisual video to be processed, and mapping the video frame sequences and audio slice sequences to a shared feature space, includes: for the video data stream of the audiovisual video to be processed within a preset time window, extracting the video frame sequence according to the frame rate of the video data stream, and determining the total number of video frames contained in the video frame sequence as the baseline timing node number; for the audio data stream of the audiovisual video to be processed within the same preset time window, determining the time interval of the audio slices according to the ratio between the preset time window and the baseline timing node number, and slicing the audio data stream at equal intervals according to this time interval to obtain the audio slice sequence; extracting the original features of the video frame sequence and of the audio slice sequence respectively; and calling pre-trained linear mapping layers to reduce the dimensionality of the original features and project them into a shared feature space of the same dimension.

[0005] Optionally, constructing an audiovisual feature distance matrix based on the feature differences and temporal deviations between the video frame sequence and the audio slice sequence includes: iterating through each video feature vector in the video frame sequence and each audio feature vector in the audio slice sequence, obtaining the spatial distance cost value between each video feature vector and each audio feature vector based on their inner product value in the shared feature space, and constructing a spatial cost matrix from the corresponding spatial distance cost values; calculating the ratio between the absolute difference of the time sequence number of each video frame and the time sequence number of each audio slice, and the baseline timing node number, and determining the time deviation between each video frame and each audio slice based on the product of this ratio and a preset timing offset constant; and calculating the sum of the spatial distance cost and the time deviation corresponding to the same pair of video frame and audio slice to obtain the matrix element value at the corresponding position in the audiovisual feature distance matrix.

[0006] Optionally, the spatial distance cost between each video feature vector and each audio feature vector is obtained based on the inner product value of each video feature vector and each audio feature vector in the shared feature space, including: The video feature vectors and audio feature vectors mapped to the shared feature space are then subjected to L2 norm normalization. Calculate the inner product value between each normalized video feature vector and each audio feature vector, and multiply the inner product value between each video feature vector and each audio feature vector by a preset value to obtain the corresponding first value. Based on the square root of the difference between the preset value and the corresponding first value, obtain the spatial distance cost value between each video feature vector and each audio feature vector.

[0007] Optionally, transforming the audiovisual feature distance matrix into an undirected complete bipartite graph and solving it with a graph matching algorithm to generate the minimum distance matching sequence includes: mapping each row index of the audiovisual feature distance matrix to a video vertex, mapping each column index to an audio vertex, and using each matrix element value as the weight of the edge connecting the corresponding video vertex and audio vertex, thereby constructing the undirected complete bipartite graph; solving, based on the Hungarian algorithm and with the optimization objective of minimizing the sum of the weights of the selected edges, for a matching scheme containing multiple vertex pairs, where the graph matching algorithm includes the Hungarian algorithm and each vertex pair contains one video vertex and one audio vertex; and, following the time-order index of the video frame corresponding to each video vertex, sequentially extracting the time sequence number of the audio vertex matched to each video vertex in the matching scheme and reorganizing these time sequence numbers into a one-dimensional array to obtain the minimum distance matching sequence. The position index of the minimum distance matching sequence indicates the time sequence number of the video frame corresponding to the video vertex, and the value stored at that position indicates the time sequence number of the audio slice corresponding to the audio vertex matched with that video vertex.
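
The matching step described above can be sketched in Python as follows. This is a minimal illustration, not the applicant's implementation: it uses a textbook potential-based form of the Hungarian algorithm, and the function name `min_distance_matching` and the sample matrix are assumptions made for the example. The input is a square distance matrix (rows are video frames, columns are audio slices), and the result is a list whose position index is the video frame order and whose stored value is the matched audio slice number (0-based here).

```python
def min_distance_matching(cost):
    """Hungarian algorithm on a square cost matrix (hypothetical helper).

    Returns match_seq where match_seq[i] is the column (audio slice)
    assigned to row i (video frame), minimizing the total edge weight.
    """
    n = len(cost)
    inf = float("inf")
    u = [0.0] * (n + 1)        # row potentials (1-based)
    v = [0.0] * (n + 1)        # column potentials (1-based)
    col_owner = [0] * (n + 1)  # col_owner[j] = row matched to column j, 0 = free
    way = [0] * (n + 1)        # back-pointers along the augmenting path
    for i in range(1, n + 1):
        col_owner[0] = i
        j0 = 0
        minv = [inf] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0 = col_owner[j0]
            delta = inf
            j1 = 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j] = cur
                        way[j] = j0
                    if minv[j] < delta:
                        delta = minv[j]
                        j1 = j
            for j in range(n + 1):
                if used[j]:
                    u[col_owner[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if col_owner[j0] == 0:
                break
        # Flip the augmenting path back to the virtual column 0.
        while True:
            j1 = way[j0]
            col_owner[j0] = col_owner[j1]
            j0 = j1
            if j0 == 0:
                break
    match_seq = [0] * n
    for j in range(1, n + 1):
        match_seq[col_owner[j] - 1] = j - 1
    return match_seq


# Illustrative 3 x 3 distance matrix (values are made up for the example).
demo_cost = [[4.0, 1.0, 3.0],
             [2.0, 0.0, 5.0],
             [3.0, 2.0, 2.0]]
demo_match = min_distance_matching(demo_cost)
demo_total = sum(demo_cost[i][j] for i, j in enumerate(demo_match))
print(demo_match, demo_total)
```

For the window sizes mentioned in this document (for example 75 nodes), the cubic running time of this approach is negligible.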

[0008] Optionally, the length of the minimum distance matching sequence is equal to the number of reference time nodes.

[0009] Optionally, determining the audio-visual feature matching score based on the degree of temporal inversion of the minimum distance matching sequence and the values in the spatial cost matrix corresponding to each node of the minimum distance matching sequence includes: counting the total number of inversion events in the minimum distance matching sequence; determining the temporal inversion ratio as the ratio between the total number of inversion events and the number of inversion combinations corresponding to the minimum distance matching sequence; and determining the audio-visual feature matching score based on the temporal inversion ratio and the values in the spatial cost matrix corresponding to each node of the minimum distance matching sequence. An inversion event is a pair of positions in the minimum distance matching sequence for which the audio slice number matched by the video frame with the earlier position index is greater than the audio slice number matched by the video frame with the later position index. The number of inversion combinations is the length of the minimum distance matching sequence multiplied by that length minus one, divided by two.
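
As a rough illustration of the inversion statistic defined above, the following Python sketch counts inversion events by direct pairwise comparison and divides by N(N-1)/2. The function name `temporal_inversion_ratio` is an assumption for the example, and the quadratic loop is chosen for clarity rather than speed.

```python
def temporal_inversion_ratio(match_seq):
    """Fraction of position pairs (p, q), p < q, with match_seq[p] > match_seq[q]."""
    n = len(match_seq)
    if n < 2:
        # Mirrors the boundary check in the description: the ratio is undefined for N < 2.
        raise ValueError("matching sequence must contain at least 2 nodes")
    inversions = sum(
        1
        for p in range(n)
        for q in range(p + 1, n)
        if match_seq[p] > match_seq[q]
    )
    combinations = n * (n - 1) / 2
    return inversions / combinations


print(temporal_inversion_ratio([0, 1, 2, 3]))  # fully ordered sequence
print(temporal_inversion_ratio([3, 2, 1, 0]))  # fully reversed sequence
```

A fully ordered sequence yields a ratio of 0 and a fully reversed sequence yields 1, so the ratio is already normalized before it is fused with the distance term.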

[0010] Optionally, determining the audio-visual feature matching score based on the temporal inversion ratio of the minimum distance matching sequence and the values in the spatial cost matrix corresponding to each node of the minimum distance matching sequence includes: extracting, based on the position indexes and stored audio sequence numbers in the minimum distance matching sequence, the matrix element values at the corresponding positions in the spatial cost matrix, and calculating the arithmetic mean of all extracted matrix element values to obtain the mean matching feature distance; performing weighted fusion of the temporal inversion ratio and the mean matching feature distance; and applying exponential decay to the weighted fusion result to obtain the audio-visual feature matching score.
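
One possible reading of this scoring step is sketched below in Python. The text above does not give the fusion weights or the exact decay function, so the weights `w_inv` and `w_dist` and the form `exp(-fused)` are assumptions chosen only to make the example concrete, as is the function name `matching_score`.

```python
import math


def matching_score(match_seq, spatial_costs, inversion_ratio, w_inv=0.5, w_dist=0.5):
    """Fuse the inversion ratio with the mean matched spatial cost, then decay.

    w_inv, w_dist and the exp(-x) decay are illustrative assumptions.
    """
    n = len(match_seq)
    # Look up the spatial cost at (video index i, matched audio index match_seq[i]).
    mean_distance = sum(spatial_costs[i][match_seq[i]] for i in range(n)) / n
    fused = w_inv * inversion_ratio + w_dist * mean_distance
    return math.exp(-fused)


demo_costs = [[0.0, 1.0],
              [1.0, 0.0]]
aligned_score = matching_score([0, 1], demo_costs, inversion_ratio=0.0)
swapped_score = matching_score([1, 0], demo_costs, inversion_ratio=1.0)
print(aligned_score, swapped_score)
```

Under these assumptions the score lies in (0, 1] and decreases as either the inversion ratio or the mean distance grows, which matches the later rule that a score below the benchmark routes features to the separate sub-libraries.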

[0011] Optionally, based on the comparison results between the audio-visual feature matching score and a preset judgment benchmark, the feature storage paths of the video frame sequence and the audio slice sequence are determined, including: When the audio-visual feature matching score is less than the preset judgment benchmark, the video feature vector corresponding to the video frame sequence is stored in the visual retrieval sub-library, and the audio feature vector corresponding to the audio slice sequence is stored in the audio retrieval sub-library. When the audio-visual feature matching score is not less than the preset judgment benchmark, the video feature vector corresponding to each video frame in the video frame sequence is concatenated with the audio feature vector corresponding to the matched audio slice according to the minimum distance matching sequence to generate a fused feature sequence. The fused feature sequence is then subjected to global average pooling, and the joint feature vector obtained after global average pooling is stored in the joint retrieval library.
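
The routing rule above can be illustrated with the short Python sketch below. The dictionary keys standing in for the three libraries and the function name `route_features` are hypothetical. Feature vectors are plain lists, concatenation is list concatenation, and global average pooling is taken as the mean over the fused sequence.

```python
def route_features(score, benchmark, video_feats, audio_feats, match_seq):
    """Decide where features go based on the matching score (illustrative only)."""
    if score < benchmark:
        # Mismatch suspected: keep the modalities apart.
        return {"visual_sub_library": video_feats, "audio_sub_library": audio_feats}
    # Concatenate each video vector with the audio vector it was matched to.
    fused_sequence = [video_feats[i] + audio_feats[j] for i, j in enumerate(match_seq)]
    n = len(fused_sequence)
    # Global average pooling over the time axis gives one joint vector.
    joint_vector = [sum(column) / n for column in zip(*fused_sequence)]
    return {"joint_library": joint_vector}


demo_video = [[1.0, 2.0], [3.0, 4.0]]
demo_audio = [[10.0], [20.0]]
joint = route_features(0.9, 0.5, demo_video, demo_audio, match_seq=[1, 0])
split = route_features(0.2, 0.5, demo_video, demo_audio, match_seq=[1, 0])
print(joint, split)
```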

[0012] In a second aspect, a semantic parsing system for audiovisual videos suitable for cross-media information retrieval is provided, the system comprising: an extraction module, used to extract video frame sequences and audio slice sequences of equal length from the audiovisual video to be processed, and to map the video frame sequences and audio slice sequences to a shared feature space; a construction module, used to construct an audiovisual feature distance matrix based on the feature differences and temporal deviations between the video frame sequence and the audio slice sequence; a solution module, used to transform the audiovisual feature distance matrix into an undirected complete bipartite graph and solve it with a graph matching algorithm to generate the minimum distance matching sequence; a determination module, used to determine the audio-visual feature matching score based on the degree of temporal inversion of the minimum distance matching sequence and the values in the spatial cost matrix corresponding to each node of the minimum distance matching sequence, the spatial cost matrix being determined from the feature differences between the video frame sequence and the audio slice sequence; and a comparison module, used to determine the feature storage paths of the video frame sequence and the audio slice sequence based on the result of comparing the audio-visual feature matching score with the preset judgment benchmark.

[0013] Based on common knowledge in the field, the above-mentioned preferred conditions can be combined arbitrarily to obtain various preferred embodiments of this application.

[0014] This application offers the following advantages: It constructs an audiovisual feature distance matrix that integrates feature differences and temporal deviations, effectively eliminating numerical interference from minor audio-visual delays caused by hardware acquisition while abandoning temporal monotonicity constraints, thus avoiding misjudgments of native same-source videos. It extracts the temporal inversion degree of the minimum distance matching sequence, accurately quantifying the degree of global semantic distortion caused by manually replacing the audio with irrelevant background music. Based on the result of comparing the audio-visual feature matching score with the preset judgment benchmark, it determines the feature storage paths of the video frame sequence and the audio slice sequence, preventing cross-contamination of the cross-modal joint retrieval library by non-same-source data and improving the accuracy of cross-media retrieval.

Attached Figure Description

[0015] To more clearly illustrate the technical solutions and advantages in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0016] Figure 1 is a flowchart of an audiovisual video semantic parsing method applicable to cross-media information retrieval in one embodiment; Figure 2 is a schematic diagram of the structure of an audiovisual video semantic parsing system applicable to cross-media information retrieval in one embodiment; Figure 3 is a schematic diagram of the structure of an electronic device in one embodiment.

Detailed Implementation

[0017] To further illustrate the technical means and effects adopted by this application to achieve the intended purpose of the invention, the following, in conjunction with the accompanying drawings and preferred embodiments, details the specific implementation, structure, features, and effects of an audiovisual video semantic parsing method and system suitable for cross-media information retrieval proposed in this application. In the following description, different "one embodiment" or "another embodiment" do not necessarily refer to the same embodiment. Furthermore, specific features, structures, or characteristics in one or more embodiments can be combined in any suitable form.

[0018] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application pertains.

[0019] The following, with reference to the accompanying drawings, details a specific scheme for an audiovisual video semantic parsing method applicable to cross-media information retrieval provided in this application. As shown in Figure 1, the method includes:

S11. Extract video frame sequences and audio slice sequences of equal length from the audiovisual video to be processed, and map the video frame sequences and audio slice sequences to a shared feature space.

[0020] In the audiovisual video to be processed, there is an inherent numerical difference between the frame rate of the video frame and the sampling frequency of the audio stream. If these two types of data are extracted separately according to the default frequency, two sets of data sequences with unequal numbers of nodes will be generated, making it impossible for subsequent steps to perform feature combination and traversal calculations at the matrix dimension. Therefore, based on a fixed time window and the number of frames, the audio segmentation step size is derived in reverse to ensure that the two features are strictly aligned in absolute quantity.

[0021] In one embodiment, extracting a video frame sequence and an audio slice sequence of equal length from the audiovisual video to be processed, and mapping the video frame sequence and the audio slice sequence to a shared feature space, includes: for the video data stream of the audiovisual video to be processed within a preset time window, extracting the video frame sequence according to the frame rate of the video data stream, and determining the total number of video frames contained in the video frame sequence as the baseline timing node number; for the audio data stream of the audiovisual video to be processed within the same preset time window, determining the time interval of the audio slices according to the ratio between the preset time window and the baseline timing node number, and slicing the audio data stream at equal intervals according to this time interval to obtain the audio slice sequence; extracting the original features of the video frame sequence and of the audio slice sequence respectively; and calling pre-trained linear mapping layers to reduce the dimensionality of the original features and project them into a shared feature space of the same dimension.

[0022] The system receives the input audio-visual video to be processed, calls the multimedia demultiplexer to perform a separation operation on it, and then sets a fixed time window length, for example, the time window length is set to 3.0 seconds, so as to extract the video data stream and audio data stream within the time window from the audio-visual video to be processed.

[0023] For the video data stream within the time window, a segmentation operation is performed according to the inherent frame rate of the audiovisual video segment (e.g., 25 frames/second) to extract a set of discrete video frame sequences. The total number of video frames in the video frame sequence is counted, and this total is determined as the baseline timing node number $N$. With a time window length of 3.0 seconds and a frame rate of 25 frames per second, the baseline timing node number $N$ is 75.

[0024] To prevent program crashes and deadlocks caused by a zero denominator in subsequent calculations (such as computing the inversion combination ratio or the arithmetic mean), a boundary check is performed once the baseline timing node number $N$ has been obtained, to determine whether $N \geq 2$. Specifically: when $N = 1$, the single video frame and single audio slice are extracted, single-modal pooling is performed, and each is stored in its respective database. When $N = 0$ (e.g., the window is shorter than one frame or data is lost), the data stream of this time window is discarded, or an exception is thrown to skip the current processing and end this round of operation.

[0025] When the check result is $N \geq 2$, the subsequent processing flow is triggered. The total length of the time window $T$ is divided by the baseline timing node number $N$ to obtain the time interval of the audio slice, $\Delta t = T / N$, and the audio data stream is sliced at equal intervals of $\Delta t$ to generate the audio slice sequence. This reverse proportional slicing ensures that the total number of elements in both the generated video frame sequence and the audio slice sequence equals the baseline timing node number $N$.
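
The equal-length slicing and the boundary check can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function name `plan_audio_slices` is hypothetical, the frame count is approximated as the rounded product of window length and frame rate rather than counted from a decoded stream, and the handling of N = 1 (single-modal pooling) is not shown.

```python
def plan_audio_slices(window_seconds, frame_rate):
    """Return (n, dt, boundaries) for one time window, or None when n < 2."""
    # n: baseline timing node number, i.e. the number of video frames in the window.
    n = int(round(window_seconds * frame_rate))
    if n < 2:
        # n == 0: the window is dropped; n == 1: single-modal path (not shown here).
        return None
    dt = window_seconds / n  # audio slice interval derived from the frame count
    boundaries = [(k * dt, (k + 1) * dt) for k in range(n)]
    return n, dt, boundaries


plan = plan_audio_slices(3.0, 25.0)
node_count, slice_interval, slice_bounds = plan
print(node_count, slice_interval, len(slice_bounds))
```

With the 3.0 second window and 25 frames per second used in this embodiment, the sketch yields 75 slices of 0.04 seconds each, so the audio slice sequence has the same length as the video frame sequence.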

[0026] Define $i$ as the natural ascending time index of each video frame in the video frame sequence, where $i \in \{1, 2, \ldots, N\}$. Define $j$ as the natural ascending time index of each audio slice in the audio slice sequence, where $j \in \{1, 2, \ldots, N\}$. The indexed video frame sequence is represented as $\{f_1, f_2, \ldots, f_i, \ldots, f_N\}$, where $f_i$ represents the $i$-th video frame. The indexed audio slice sequence is represented as $\{s_1, s_2, \ldots, s_j, \ldots, s_N\}$, where $s_j$ represents the $j$-th audio slice. It should be noted that all subsequent indices $i$ and $j$ follow the definitions given here: $i$ always points to the $i$-th video frame and its associated visual features, and $j$ always points to the $j$-th audio slice and its associated audio features.

[0027] Load the pre-trained unimodal visual encoder and unimodal audio encoder. Input all video frames in the video frame sequence into the unimodal visual encoder one by one to extract the original visual features. Input all audio slices in the audio slice sequence into the unimodal audio encoder one by one to extract the original acoustic features. The original features include original visual features and original acoustic features.

[0028] After extracting the original features, two independent linear mapping layers are invoked. To ensure the validity of the subsequent Euclidean distance calculation, these two linear mapping layers are not randomly initialized, but are pre-trained jointly using a large-scale audio-video dataset and multimodal contrastive learning (for example, with the InfoNCE loss function). Using these two pre-trained linear mapping layers, dimensionality reduction projection operations are performed on the original visual and acoustic features respectively, to uniformly transform the heterogeneous features of the two modalities into a shared feature space with the same dimension, ensuring they are at the same semantic scale. This shared dimension is denoted as $d$.

[0029] S12. Construct an audiovisual feature distance matrix based on the feature differences and time deviations between the video frame sequence and the audio slice sequence.

[0030] In one embodiment, constructing an audiovisual feature distance matrix based on the feature differences and temporal deviations between the video frame sequence and the audio slice sequence includes: traversing each video feature vector in the video frame sequence and each audio feature vector in the audio slice sequence, and obtaining the spatial distance cost between each video feature vector and each audio feature vector based on their inner product value in the shared feature space; calculating the ratio between the absolute difference of the time sequence number of each video frame and the time sequence number of each audio slice, and the baseline timing node number, and determining the time deviation between each video frame and each audio slice based on the product of this ratio and a preset timing offset constant; and calculating the sum of the spatial distance cost and the time deviation corresponding to the same pair of video frame and audio slice to obtain the matrix element value at the corresponding position in the audiovisual feature distance matrix.

[0031] Specifically, obtaining the spatial distance cost between each video feature vector and each audio feature vector based on their inner product value in the shared feature space includes: performing L2 norm normalization on the video feature vectors and audio feature vectors mapped to the shared feature space; calculating the inner product value between each normalized video feature vector and each normalized audio feature vector, and multiplying this inner product value by a preset value to obtain the corresponding first value; obtaining the spatial distance cost value between each video feature vector and each audio feature vector as the square root of the difference between the preset value and the corresponding first value; and constructing a spatial cost matrix from the corresponding spatial distance cost values. The row index of the spatial cost matrix is the time sequence number of the video frame, the column index is the time sequence number of the audio slice, and the matrix element value is the corresponding spatial distance cost value.

[0032] The preset value can be set according to the actual situation, for example, 2.

[0033] After the original visual features and original acoustic features are transformed into the $d$-dimensional shared feature space, L2 norm normalization is performed on all $2N$ feature vectors, constraining the modulus of every feature vector to 1. This yields the video feature vector sequence corresponding to the video frame sequence and the audio feature vector sequence corresponding to the audio slice sequence. For every pairing of the $N$ video features with the $N$ audio features, a nested loop is used to calculate the spatial distance between each pair of nodes. The outer loop pointer $i$ is incremented from $1$ to $N$ to extract the normalized video feature vectors $v_i$ in sequence. The inner loop pointer $j$ is incremented from $1$ to $N$ to extract the normalized audio feature vectors $a_j$ in sequence. The value of loop pointer $i$ corresponds to the natural time sequence number of the $i$-th video frame, and the value of loop pointer $j$ corresponds to the natural time sequence number of the $j$-th audio slice.

[0034] In each iteration of the nested loop, the following two mathematical operations are performed to generate each element value $D_{i,j}$ of the audiovisual feature distance matrix. First, for video feature vector $v_i$ and audio feature vector $a_j$, the inner product $\langle v_i, a_j \rangle$ is calculated using the dot product. Because of the preceding L2 normalization, this inner product is necessarily restricted to the closed interval $[-1, 1]$. The inner product is multiplied by the preset value to obtain the corresponding first value, and the square root of the difference between the preset value and the first value gives the spatial distance value $S_{i,j}$ between video feature vector $v_i$ and audio feature vector $a_j$. With the preset value set to 2, the calculation formula is:

$$S_{i,j} = \sqrt{2 - 2\,\langle v_i, a_j \rangle}$$

The spatial distance cost obtained in this way is limited to the closed interval $[0, 2]$. The closer $S_{i,j}$ is to 2, the greater the semantic difference between the video feature vector and the audio feature vector with the corresponding sequence numbers.
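
The spatial distance computation can be sketched in plain Python as follows. This is an illustration only: the function names are hypothetical, feature vectors are assumed to be non-zero lists of floats, and the preset value defaults to 2 as in the example given in this description. With that value the cost equals the Euclidean distance between the two unit vectors.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def spatial_cost_matrix(video_feats, audio_feats, preset=2.0):
    """S[i][j] = sqrt(preset - preset * <v_i, a_j>) on L2-normalized vectors."""
    video_unit = [l2_normalize(v) for v in video_feats]
    audio_unit = [l2_normalize(a) for a in audio_feats]
    costs = []
    for v in video_unit:          # outer loop over video frames (index i)
        row = []
        for a in audio_unit:      # inner loop over audio slices (index j)
            inner = sum(p * q for p, q in zip(v, a))
            # max() guards against tiny negative values caused by rounding.
            row.append(math.sqrt(max(0.0, preset - preset * inner)))
        costs.append(row)
    return costs


demo_spatial = spatial_cost_matrix([[1.0, 0.0], [0.0, 2.0]],
                                   [[3.0, 0.0], [0.0, -1.0]])
print(demo_spatial)
```

In the example, identical directions give a cost of 0, orthogonal directions give the square root of 2, and opposite directions give the maximum cost of 2.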

[0035] If the audiovisual video to be processed contains long periods of still images or continuous, steady background noise (such as white noise), the spatial distance values $S_{i,j}$ will contain a large number of extremely small, completely equal values. When the subsequent graph matching algorithm processes these equally weighted feature pairs, it has no basis for choosing among the tied candidates and outputs effectively random time sequence numbers, severely interfering with the subsequent counting of inversion events used to detect audio track tampering. To eliminate this problem, a time deviation based on the time span is added to each calculated spatial distance value $S_{i,j}$.

[0036] Specifically, the ratio between the absolute difference of the time sequence number $i$ of video frame $f_i$ and the time sequence number $j$ of audio slice $s_j$, and the baseline timing node number $N$, is calculated, and the time deviation between video frame $f_i$ and audio slice $s_j$ is determined based on the product of this ratio and the preset timing offset constant $\lambda$. The expression for the time deviation is:

$$P_{i,j} = \lambda \cdot \frac{|i - j|}{N}$$

Here, the absolute difference term $|i - j|$ represents the number of time steps spanned between video frame $f_i$ and audio slice $s_j$. The proportion term $|i - j| / N$ represents the relative proportion of this span to the total time window length and is a continuous floating-point number in the range $[0, 1)$. $\lambda$ represents the system's preset timing offset constant, which is set to an empirical value in an embodiment. The larger the time deviation $P_{i,j}$, the farther apart video feature vector $v_i$ and audio feature vector $a_j$ are on the actual timeline, even if they are numerically similar.

[0037] The sum of the spatial distance cost and the time deviation corresponding to the same pair of video frame and audio slice is calculated to obtain the matrix element value at the corresponding position in the audiovisual feature distance matrix. The calculation formula for the matrix element value is: $D_{i,j} = S_{i,j} + T_{i,j}$, where $D_{i,j}$ is the matrix element value in row $i$, column $j$ of the audiovisual feature distance matrix, $S_{i,j}$ is the spatial distance value between video feature vector $v_i$ and audio feature vector $a_j$, and $T_{i,j}$ is the time deviation between video frame $i$ and audio slice $j$.

[0038] Adding the time deviation as a penalty cost ensures that the generated matrix element value $D_{i,j}$ not only reflects the semantic content difference between feature vectors but also includes a weak numerical penalty for matches that span time. When two audio slices are semantically identical with respect to the same video frame, the audio slice that is farther away in time has a larger $|i - j|$ and is therefore assigned a higher penalty value. This correction guides the subsequent algorithm to prefer the node with the closest time sequence number when faced with identical static steady-state signals, eliminating the risk of out-of-order output caused by stationary segments.
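A small worked example may make the tie-breaking effect concrete. The numbers are purely illustrative assumptions: a window with $N = 4$ reference time nodes, a timing bias constant $\lambda = 0.01$, and one video frame (index 2) whose spatial distance $S$ to audio slices 2 and 4 is identical at 0.1. $T$ denotes the time deviation and $D$ the resulting matrix element value.

```latex
T_{2,2} = 0.01 \cdot \frac{|2 - 2|}{4} = 0, \qquad
T_{2,4} = 0.01 \cdot \frac{|2 - 4|}{4} = 0.005
\\[4pt]
D_{2,2} = 0.1 + 0 = 0.1 \;<\; D_{2,4} = 0.1 + 0.005 = 0.105
```

The two candidates are no longer tied, so a minimum-weight matching prefers the temporally closer slice, while the penalty remains small compared with genuine semantic differences.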

[0039] After all nested loop traversal calculations are complete, the calculated matrix element values $D_{i,j}$ are assembled, according to their outer and inner loop index positions, into a two-dimensional numerical array of dimension $N \times N$. This two-dimensional numerical array is defined as the audiovisual feature distance matrix. In the audiovisual feature distance matrix, the row index $i$ represents the temporal ordering of the video frames, the column index $j$ represents the temporal ordering of the audio slices, and the matrix element value $D_{i,j}$ represents the comprehensive matching value between audiovisual nodes.
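The nested-loop construction described above can be sketched in plain Python as follows. This is a minimal illustration, not the applicant's implementation: the function names, the default `preset=2.0`, and the default `bias=0.01` are assumptions chosen for the example, and indices are zero-based for convenience.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def build_distance_matrices(video_feats, audio_feats, preset=2.0, bias=0.01):
    """Return (spatial_cost, distance) as N x N nested lists.

    preset and bias are illustrative assumptions, not values from the text.
    """
    n = len(video_feats)
    if n != len(audio_feats):
        raise ValueError("video and audio sequences must have equal length")
    spatial = [[0.0] * n for _ in range(n)]
    distance = [[0.0] * n for _ in range(n)]
    for i, v in enumerate(video_feats):          # outer loop: video frames
        for j, a in enumerate(audio_feats):      # inner loop: audio slices
            inner = sum(x * y for x, y in zip(v, a))
            # Clamp at zero to absorb floating-point round-off before sqrt.
            s = math.sqrt(max(0.0, preset - preset * inner))
            t = bias * abs(i - j) / n            # time deviation
            spatial[i][j] = s                    # spatial cost only
            distance[i][j] = s + t               # spatial cost plus time deviation
    return spatial, distance

# Toy input: three already-aligned frames and slices in a 2-D shared space.
video = [l2_normalize(v) for v in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])]
audio = [l2_normalize(a) for a in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])]
spatial_cost, dist = build_distance_matrices(video, audio)
```

Keeping the spatial cost matrix separate from the combined distance matrix mirrors the text, which later reads matched values back from the spatial cost matrix rather than from the biased one.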

[0040] S13. Transform the audiovisual feature distance matrix into an undirected complete bipartite graph, solve it using a graph matching algorithm, and generate the minimum distance matching sequence.

[0041] When determining whether audio and video match, existing cross-media retrieval systems generally extract the visual and audio features located at the same natural time sequence number (that is, the video frame and the audio slice with equal indices) and calculate the distance between them. Given how multimedia data is processed in practice, this direct alignment along the physical timeline has a fatal flaw: real-world video capture devices and processing links typically introduce audio-visual time delays of several milliseconds to tens of milliseconds. If a one-to-one correspondence between time nodes is forced, two drastically different phenomena cannot be distinguished: a slight device-induced capture delay in the original audio, and the creator's complete replacement of the audio with non-original background music. Under this strongly bound calculation method, both produce large spatial differences because of node misalignment.

[0042] To eliminate the time misalignment interference caused by hardware acquisition delays, in one embodiment, transforming the audiovisual feature distance matrix into an undirected complete bipartite graph and solving it with a graph matching algorithm to generate the minimum distance matching sequence includes: mapping each row index of the audiovisual feature distance matrix to a video vertex, mapping each column index of the audiovisual feature distance matrix to an audio vertex, and using each matrix element value in the audiovisual feature distance matrix as the weight of the edge connecting the corresponding video vertex and audio vertex, to construct the undirected complete bipartite graph; solving with the Hungarian algorithm, with the optimization objective of selecting the set of edges that minimizes the sum of the weights of the selected edges, to obtain a matching scheme containing multiple vertex pairs, wherein the graph matching algorithm includes the Hungarian algorithm and each vertex pair contains one video vertex and one audio vertex; and, following the time-order index of the video frame corresponding to each video vertex, sequentially extracting the time sequence number of the audio vertex matched to each video vertex in the matching scheme and reorganizing these time sequence numbers into a one-dimensional array to obtain the minimum distance matching sequence. The position index of the minimum distance matching sequence indicates the time sequence number of the video frame corresponding to the video vertex, and the value stored in the minimum distance matching sequence indicates the time sequence number of the audio slice corresponding to the audio vertex that matches the video vertex.

[0043] In one embodiment, the length of the minimum distance matching sequence is equal to the number of reference time nodes.

[0044] After the audiovisual feature distance matrix is obtained, each row index $i$ of the audiovisual feature distance matrix is mapped to a video vertex, so that the set of row indices of the audiovisual feature distance matrix is mapped to the set of video vertices, and each column index $j$ of the audiovisual feature distance matrix is mapped to an audio vertex, so that the set of column indices of the audiovisual feature distance matrix is mapped to the set of audio vertices.

[0045] An undirected edge is established between every vertex in the video vertex set and every vertex in the audio vertex set. For any undirected edge connecting the $i$-th video vertex and the $j$-th audio vertex, the matrix element value $D_{i,j}$ located in row $i$, column $j$ of the audiovisual feature distance matrix is extracted and used as the weight of that undirected edge, thereby constructing the topology of the undirected complete bipartite graph.

[0046] The classic Hungarian algorithm from graph theory is called to process the undirected complete bipartite graph. The Hungarian algorithm is a combinatorial optimization procedure designed for finding a minimum-weight perfect matching in a bipartite graph. When the Hungarian algorithm is executed, the natural temporal index attribute carried by the vertices is ignored (that is, the relative order of $i$ and $j$ is ignored). Its sole optimization objective is to select $N$ edges from all undirected edges such that every video vertex is connected to exactly one audio vertex and every audio vertex to exactly one video vertex (a one-to-one perfect matching), and the total weight of these $N$ selected edges reaches the global minimum.

[0047] After the solver is executed, the Hungarian algorithm outputs a one-to-one mapping between the $N$ video vertices and the $N$ audio vertices, that is, a matching scheme containing $N$ vertex pairs. The matching scheme is traversed in the natural temporal order of the video vertices (with index $i$ increasing from the first to the last time sequence number), the time sequence number of the audio vertex matched to each video vertex is extracted in turn, and these time sequence numbers are reassembled into a one-dimensional array of length $N$. The reassembled one-dimensional array is defined as the minimum distance matching sequence $P$. It follows that the length $L$ of the minimum distance matching sequence equals the number of reference time nodes $N$. The position index $i$ of the minimum distance matching sequence indicates the time sequence number of the video frame corresponding to video vertex $i$, arranged in ascending order of natural time; the element value $P[i]$ stored at position index $i$ indicates the time sequence number of the audio slice corresponding to the audio vertex that the Hungarian algorithm matched to video vertex $i$. Video vertex $i$ and the matched audio vertex form a vertex pair.
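The matching step can be illustrated with a compact pure-Python version of the Hungarian (Kuhn-Munkres) algorithm in its potentials form. This is a sketch under stated assumptions rather than the applicant's code: the function name `hungarian`, the zero-based indices, and the toy cost matrix are all hypothetical. In practice a library routine such as SciPy's `scipy.optimize.linear_sum_assignment` solves the same assignment problem.

```python
def hungarian(cost):
    """Minimum-weight perfect matching on a square cost matrix.

    Returns a list p where p[i] is the column (audio index) assigned
    to row i (video index), i.e. the minimum distance matching sequence.
    """
    n = len(cost)
    inf = float("inf")
    u = [0.0] * (n + 1)       # row potentials (1-based)
    v = [0.0] * (n + 1)       # column potentials (1-based)
    match = [0] * (n + 1)     # match[j] = row matched to column j, 0 = free
    way = [0] * (n + 1)       # back-pointers along the augmenting path
    for i in range(1, n + 1):
        match[0] = i
        j0 = 0
        minv = [inf] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0 = match[j0]
            delta = inf
            j1 = 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j] = cur
                        way[j] = j0
                    if minv[j] < delta:
                        delta = minv[j]
                        j1 = j
            for j in range(n + 1):
                if used[j]:
                    u[match[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if match[j0] == 0:    # reached a free column: path found
                break
        while j0:                 # flip the matching along the path
            j1 = way[j0]
            match[j0] = match[j1]
            j0 = j1
    sequence = [0] * n
    for j in range(1, n + 1):
        sequence[match[j] - 1] = j - 1   # position = video index, value = audio index
    return sequence

# Toy distance matrix in which frames 1 and 2 best match slices 2 and 1.
cost = [
    [0.1, 1.0, 1.2],
    [1.0, 0.2, 0.1],
    [1.2, 0.1, 0.3],
]
matching = hungarian(cost)
```

The returned list follows the convention in the text: its position index is the video time sequence number and its stored value is the matched audio time sequence number.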

[0048] S14. Determine the audio-visual feature matching score based on the degree of temporal inversion in the minimum distance matching sequence and the values in the spatial cost matrix corresponding to each node of the minimum distance matching sequence.

[0049] The audio-visual feature matching score represents the degree of matching between the video frame sequence and the audio slice sequence, that is, the confidence level of homology between the video images and audio content in the audiovisual video to be processed. A higher audio-visual feature matching score indicates that the audio and video are more likely to be natively associated, while a lower score indicates that the audio track is more likely to have been replaced.

[0050] In one embodiment, determining the audio-visual feature matching score based on the degree of temporal inversion in the minimum distance matching sequence and the values in the spatial cost matrix corresponding to each node of the minimum distance matching sequence includes: counting the total number of inversion events in the minimum distance matching sequence; determining the temporal inversion ratio as the ratio between the total number of inversion events and the number of inversion combinations corresponding to the minimum distance matching sequence; and determining the audio-visual feature matching score based on the temporal inversion ratio of the minimum distance matching sequence and the values in the spatial cost matrix corresponding to each node of the minimum distance matching sequence. An inversion event is an occurrence in the minimum distance matching sequence where the audio slice number matched by the video frame with the earlier position index is greater than the audio slice number matched by the video frame with the later position index. The number of inversion combinations corresponding to the minimum distance matching sequence is calculated as the product of the length of the minimum distance matching sequence and that length minus one, divided by two.

[0051] Specifically, determining the audio-visual feature matching score based on the temporal inversion ratio of the minimum distance matching sequence and the values in the spatial cost matrix corresponding to each node of the minimum distance matching sequence includes: based on the position indexes and the stored audio sequence numbers in the minimum distance matching sequence, extracting the matrix element values at the corresponding positions from the spatial cost matrix, and calculating the arithmetic mean of all extracted matrix element values to obtain the mean matching feature distance; performing a weighted fusion of the temporal inversion ratio and the mean matching feature distance; and applying exponential decay to the weighted fusion result to obtain the audio-visual feature matching score.

[0052] To count the total number of inversion events in the minimum distance matching sequence, a nested loop with two pointers performs the inversion-pair counting operation. A first traversal pointer $i$ runs over every position index except the last, and a second traversal pointer $j$ runs from $i + 1$ to the last position index. Because $j$ is always greater than $i$, this nesting ensures that the $j$-th video frame occurs strictly later on the natural timeline than the $i$-th video frame. In each nested comparison, the time sequence number $P[i]$ of the audio slice stored at position index $i$ and the time sequence number $P[j]$ of the audio slice stored at position index $j$ are extracted from the minimum distance matching sequence. If $P[i] > P[j]$, the video frame that occurs later (the $j$-th) has been matched to an earlier audio slice than the video frame that occurs earlier (the $i$-th), and this temporal reversal is recorded as an inversion event. After all nested loop traversals are complete, the accumulated total number of inversion events is $K$.

[0053] For the length $L$ of the current minimum distance matching sequence, the maximum number of inversion combinations that can theoretically exist is calculated; this is the number of inversion combinations corresponding to the minimum distance matching sequence, and its expression is $\frac{L(L-1)}{2}$. The temporal inversion ratio is determined by the ratio between the total number of inversion events and this number of inversion combinations. The calculation formula is: $R = \frac{K}{L(L-1)/2}$, where $R$ is the temporal inversion ratio, $K$ is the total number of inversion events in the minimum distance matching sequence, and $L$ is the length of the minimum distance matching sequence. Because the divisor is the theoretical upper limit of $K$, the temporal inversion ratio $R$ always lies within the closed interval $[0, 1]$. The larger $R$ is, the more often the algorithm had to break temporal order to find similar features, which in turn indicates a greater possibility that the audio of the audiovisual video to be processed has been replaced with non-original background music.
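The two-pointer inversion count and the resulting ratio can be sketched as follows. The function names are hypothetical, and returning 0.0 for a sequence shorter than two elements is an assumption added to avoid division by zero; the text does not address that case.

```python
def count_inversions(sequence):
    """Count pairs (i, j) with i < j whose matched audio numbers are reversed."""
    total = 0
    n = len(sequence)
    for i in range(n - 1):             # first traversal pointer
        for j in range(i + 1, n):      # second pointer, always after i
            if sequence[i] > sequence[j]:
                total += 1             # later frame matched an earlier slice
    return total

def inversion_ratio(sequence):
    """Inversion count divided by the number of index pairs L * (L - 1) / 2."""
    n = len(sequence)
    pairs = n * (n - 1) // 2
    if pairs == 0:
        return 0.0   # assumption: a sequence with fewer than two items has no disorder
    return count_inversions(sequence) / pairs

in_order = inversion_ratio([0, 1, 2, 3])      # perfectly ordered matching
one_swap = inversion_ratio([0, 2, 1, 3])      # a single adjacent swap
reversed_all = inversion_ratio([3, 2, 1, 0])  # fully reversed matching
```

The quadratic loop follows the description in the text. For long sequences a merge-sort based count would give the same result in O(L log L) time, which is a possible substitution rather than something the text prescribes.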

[0054] The position index $i$ and the audio sequence number $P[i]$ stored at it are extracted from the minimum distance matching sequence. With $i$ increasing from the first to the last position index, the following lookup is executed repeatedly: for each $i$, the pair $(i, P[i])$ serves as a two-dimensional coordinate into the spatial cost matrix, locating row $i$, column $P[i]$, and the matrix element value $S_{i,P[i]}$ at that position is extracted. $S_{i,P[i]}$ characterizes the minimum spatial distance cost between video feature vector $v_i$ and its assigned optimal audio feature vector $a_{P[i]}$ in the optimal matching state with time constraints removed.

[0055] The arithmetic mean of all extracted matrix element values is calculated to obtain the mean matching feature distance. The calculation formula is: $\bar{S} = \frac{1}{N} \sum_{i} S_{i,P[i]}$, where the sum runs over all $N$ position indices, $N$ is the number of reference time nodes, $\bar{S}$ is the mean matching feature distance, and $S_{i,P[i]}$ is the matrix element value in row $i$, column $P[i]$ of the spatial cost matrix, that is, the spatial distance value between the $i$-th video feature vector and the $P[i]$-th audio feature vector. $\bar{S}$ is a single scalar value that characterizes the lower bound of the cross-modal numerical difference. The larger $\bar{S}$ is, the greater the objective difference between audio and video in semantic space even after the temporal constraints are removed and the optimal combination is found, which means that the content itself is largely unrelated.
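The lookup and averaging described above amount to a few lines of Python. This is a minimal sketch: the function name and the toy values are hypothetical, and indices are zero-based.

```python
def mean_matching_distance(spatial_cost, sequence):
    """Average the spatial cost at (i, sequence[i]) over all position indices."""
    n = len(sequence)
    # Read from the spatial cost matrix, which excludes the time deviation term.
    picked = [spatial_cost[i][sequence[i]] for i in range(n)]
    return sum(picked) / n

# Toy spatial cost matrix and a matching that swaps the last two slices.
toy_spatial = [
    [0.1, 1.0, 1.2],
    [1.0, 0.2, 0.1],
    [1.2, 0.1, 0.3],
]
toy_matching = [0, 2, 1]
mean_dist = mean_matching_distance(toy_spatial, toy_matching)
```

Reading from the spatial cost matrix rather than the combined distance matrix keeps the tie-breaking time deviation out of the semantic difference measure.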

[0056] A preset spatial distance weighting factor $\beta$ is obtained; as an empirical parameter for an example, $\beta$ takes a preset value. The theoretical upper limit of the mean matching feature distance $\bar{S}$ is $\sqrt{2c}$, where $c$ is the preset value used in the spatial distance calculation, while the theoretical upper limit of the temporal inversion ratio $R$ is 1. The product coefficient $\beta$ proportionally compresses the numerical influence range of $\bar{S}$ to $[0, 1]$, which ensures that the temporal misalignment component and the spatial difference component contribute equally to the final score. A weighted fusion of the temporal inversion ratio and the mean matching feature distance is then performed, and an exponential decay operation is applied to the weighted fusion result to obtain the audio-visual feature matching score $Q$. The calculation formula is: $Q = e^{-(R + \beta \bar{S})}$, where $e^{-(\cdot)}$ is an exponentially decaying function with the natural constant $e$ as its base, and $R + \beta \bar{S}$ is the total penalty value, the linear sum of the temporal out-of-order penalty (the temporal inversion ratio) and the spatial difference penalty (the weighted mean matching feature distance). Because the maximum value of $R$ is 1 and the maximum value of $\beta \bar{S}$ is 1, the minimum value of the negative exponent term is $-2$. This exponential decay mapping ensures that the generated audio-visual feature matching score $Q$ is a continuous floating-point number within the closed interval $[e^{-2}, 1]$. The closer the audio-visual feature matching score is to 1 (that is, the smaller the total penalty value), the higher the objective homology between the video image and the accompanying audio in terms of temporal coordination and spatial semantic similarity; the closer the audio-visual feature matching score is to $e^{-2}$, the greater the likelihood that the content has been tampered with or replaced.
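The fusion and exponential decay can be sketched as a single function. The default `weight=0.5` is an illustrative assumption: it is the value that would compress a mean distance with upper limit 2 into the range 0 to 1. The text states only that the weighting factor is a preset empirical parameter.

```python
import math

def matching_score(inv_ratio, mean_dist, weight=0.5):
    """Exponential decay of the total penalty.

    inv_ratio: temporal inversion ratio in [0, 1].
    mean_dist: mean matching feature distance.
    weight:    spatial distance weighting factor (assumed value).
    """
    penalty = inv_ratio + weight * mean_dist   # linear sum of both penalties
    return math.exp(-penalty)

best_case = matching_score(0.0, 0.0)    # no disorder, identical features
worst_case = matching_score(1.0, 2.0)   # full reversal, maximal distance
typical = matching_score(1 / 6, 0.1)    # one swap, small residual distance
```

Because the exponential is monotonically decreasing, any increase in either penalty lowers the score, so a single threshold on the score bounds both kinds of mismatch at once.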

[0057] S15. Based on the comparison results of the audio-visual feature matching scores and the preset judgment criteria, determine the feature storage paths of the video frame sequence and the audio slice sequence.

[0058] The preset judgment benchmark is obtained as follows. A pre-recorded set of historical video samples confirmed to carry their original audio is extracted. Using the aforementioned processing flow, the audio-visual feature matching scores of all video samples in the historical video sample set are calculated. All audio-visual feature matching scores are then sorted from low to high, and the score at a preset percentile is extracted. This percentile value is determined as the judgment benchmark. Using a statistical quantile as the benchmark ensures that the vast majority of real, original audio-visual videos score above it.
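A threshold of this kind could be derived as sketched below. Two assumptions are made for illustration: the percentile passed in is a free parameter (the text does not fix it here), and the nearest-rank definition of a percentile is used, which is one of several common conventions. The function name and sample scores are hypothetical.

```python
import math

def percentile_benchmark(scores, percentile):
    """Nearest-rank percentile of the scores, sorted from low to high.

    percentile is a number in (0, 100]; its value is an assumed parameter.
    """
    if not scores:
        raise ValueError("at least one historical score is required")
    ordered = sorted(scores)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical scores of samples known to carry their original audio.
historical = [0.9, 0.5, 0.7, 0.8, 0.6]
benchmark = percentile_benchmark(historical, 20)
```

Choosing a low percentile keeps most genuine samples above the benchmark, at the cost of admitting some borderline replaced-audio videos. The right trade-off depends on the sample set.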

[0059] In one embodiment, the feature storage paths of the video frame sequence and the audio slice sequence are determined based on the comparison results between the audio-visual feature matching score and a preset judgment benchmark, including: When the audio-visual feature matching score is less than the preset judgment benchmark, the video feature vector corresponding to the video frame sequence is stored in the visual retrieval sub-library, and the audio feature vector corresponding to the audio slice sequence is stored in the audio retrieval sub-library. When the audio-visual feature matching score is not less than the preset judgment benchmark, the video feature vector corresponding to each video frame in the video frame sequence is concatenated with the audio feature vector corresponding to the matched audio slice according to the minimum distance matching sequence to generate a fused feature sequence. The fused feature sequence is flattened in the time dimension to obtain a joint feature vector. Alternatively, the fused feature sequence is input into a bidirectional long short-term memory network to extract the hidden state at the last moment as the joint feature vector, and the joint feature vector is stored in the joint retrieval library.

[0060] 1. When the audio-visual feature matching score $Q$ is less than the preset judgment benchmark, the video within the time window exhibits significant tampering attributes in its underlying feature distribution and temporal arrangement, and it is determined to be a non-original audiovisual video with a replaced audio track. The video feature vector sequence corresponding to the video frame sequence and the audio feature vector sequence corresponding to the audio slice sequence, generated after dimensionality reduction and normalization, are retrieved. Specifically: ①. For the video feature vector sequence, global average pooling is performed along the natural time dimension index $i$. That is, the values of each feature dimension over the $N$ video frames are summed and averaged, so that the sequence data of length $N$ is compressed into a single one-dimensional feature vector without temporal dimension information, which is defined as the pure video feature vector $F_v$. The pure video feature vector $F_v$ is stored in the single-modal visual retrieval sub-library of the underlying storage system.

[0061] ②. For the audio feature vector sequence, the global average pooling operation described above is performed independently along the natural time dimension index $j$ to obtain a single one-dimensional feature vector, which is defined as the pure audio feature vector $F_a$. The pure audio feature vector $F_a$ is stored in the single-modal audio retrieval sub-library of the underlying storage system.

[0062] 2. When the audio-visual feature matching score $Q$ is not less than the preset judgment benchmark, the video and audio within the time window maintain continuity and consistency in temporal coordination and spatial values. The video within the time window is determined to be a natively associated audiovisual video, and the cross-modal joint library building branch is triggered.

[0063] The system retrieves the video feature vector sequence corresponding to the video frame sequence after dimensionality reduction and normalization, and simultaneously retrieves the minimum distance matching sequence $P$. Concatenation then follows the calculated globally optimal correspondence rather than mechanically relying on the natural time sequence. The system iterates over the time sequence index $i$, covering all $N$ position indices. During each traversal step, the following concatenation operations are executed: the video feature vector $v_i$ with time sequence number $i$ is extracted from the video feature vector sequence; the audio time sequence number $P[i]$ stored at position index $i$ of the minimum distance matching sequence is read; the audio feature vector $a_{P[i]}$ at position index $P[i]$ is extracted from the audio feature vector sequence; and the video feature vector $v_i$ and its corresponding optimal audio feature vector $a_{P[i]}$ are concatenated element-wise along the feature dimension.

[0064] After all traversal and element concatenation operations are complete, a fused feature sequence of length $N$ is generated, whose feature dimension is doubled by the concatenation. In the fused feature sequence, the slight audio-visual time misalignment caused by hardware acquisition has been corrected by the topology matching algorithm.

[0065] Global average pooling is performed on the fused feature sequence along the time dimension index $i$: the average of each feature dimension over all $N$ nodes is taken, compressing the fused feature sequence into a single one-dimensional feature vector, which is defined as the joint feature vector $F_{va}$. The joint feature vector $F_{va}$ is stored in the cross-media joint retrieval library built by the underlying system.
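The two storage branches can be sketched together as follows. This is a minimal illustration using the global-average-pooling variant only. The function names and the dictionary keys `visual`, `audio`, and `joint` are hypothetical stand-ins for the three retrieval libraries, and indices are zero-based.

```python
def global_average_pool(sequence):
    """Average each feature dimension over the time axis."""
    n = len(sequence)
    dim = len(sequence[0])
    return [sum(vec[k] for vec in sequence) / n for k in range(dim)]

def route_features(video_feats, audio_feats, matching, score, benchmark):
    """Return a dict mapping a (hypothetical) library name to the vector to store."""
    if score < benchmark:
        # Replaced audio track: pool and store each modality separately.
        return {
            "visual": global_average_pool(video_feats),
            "audio": global_average_pool(audio_feats),
        }
    # Natively associated: concatenate each frame with its matched slice,
    # following the matching sequence rather than the natural time order.
    fused = [video_feats[i] + audio_feats[matching[i]]
             for i in range(len(matching))]
    return {"joint": global_average_pool(fused)}

video_seq = [[1.0, 0.0], [0.0, 1.0]]
audio_seq = [[2.0, 2.0], [4.0, 4.0]]
separate = route_features(video_seq, audio_seq, [1, 0], score=0.2, benchmark=0.5)
joint = route_features(video_seq, audio_seq, [1, 0], score=0.8, benchmark=0.5)
```

With average pooling, the order of concatenated pairs does not change the joint vector. The matching sequence would matter for the order-sensitive alternative mentioned in the text, where the fused sequence is fed to a bidirectional long short-term memory network.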

[0066] It should be understood that, although the steps in the flowchart of Figure 1 are shown sequentially as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless otherwise specified herein, there is no strict order in which these steps are executed, and they can be performed in other orders. At least some of the steps in Figure 1 may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these sub-steps or stages is not necessarily sequential; they can be executed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

[0067] This application also provides an audiovisual video semantic parsing system suitable for cross-media information retrieval. As shown in Figure 2, the system includes: an extraction module 21, used to extract video frame sequences and audio slice sequences of equal length from the audiovisual video to be processed, and to map the video frame sequences and audio slice sequences to a shared feature space; a module 22, used to construct an audiovisual feature distance matrix based on the feature differences and temporal deviations between the video frame sequence and the audio slice sequence; a solving module 23, used to transform the audiovisual feature distance matrix into an undirected complete bipartite graph and solve it with the graph matching algorithm to generate the minimum distance matching sequence; a determination module 24, used to determine the audio-visual feature matching score based on the degree of temporal inversion of the minimum distance matching sequence and the values in the spatial cost matrix corresponding to each node of the minimum distance matching sequence, the spatial cost matrix being determined based on the feature differences between the video frame sequence and the audio slice sequence; and a comparison module 25, used to determine the feature storage paths of the video frame sequence and the audio slice sequence based on the comparison result between the audio-visual feature matching score and the preset judgment benchmark.

[0068] For the system embodiments, since they basically correspond to the method embodiments, the relevant parts can be referred to in the description of the method embodiments. The system embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this application according to actual needs.

[0069] Figure 3 is a schematic diagram of the structure of an electronic device according to an example embodiment of this application. The electronic device includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the method described in any of the above embodiments. The electronic device 30 shown in Figure 3 is merely an example and should not impose any limitation on the functionality and scope of use of the embodiments of this application.

[0070] As shown in Figure 3, the electronic device 30 can take the form of a general-purpose computing device, such as a server device. The components of the electronic device 30 may include, but are not limited to: at least one processor 31, at least one memory 32, and a bus 33 connecting different system components (including memory 32 and processor 31).

[0071] Bus 33 includes a data bus, an address bus, and a control bus.

[0072] The memory 32 may include volatile memory, such as random access memory 321 and / or cache memory 322, and may further include read-only memory 323.

[0073] The memory 32 may also include a program tool 325 (or utility) having a set (at least one) program module 324, such program module 324 including but not limited to: an operating system, one or more application programs, other program modules and program data, each or some combination of these examples may include an implementation of a network environment.

[0074] The processor 31 executes various functional applications and data processing, such as the methods provided in any of the above embodiments, by running computer programs stored in the memory 32.

[0075] Electronic device 30 can also communicate with one or more external devices 34 (e.g., keyboard, pointing device, etc.). This communication can be made through input / output interface 35. Furthermore, electronic device 30 can also communicate with one or more networks (e.g., local area network (LAN), wide area network (WAN), and / or public network, such as the Internet) via network adapter 36. As shown, network adapter 36 communicates with other modules of electronic device 30 via bus 33. It should be understood that, although not shown in the figure, other hardware and / or software modules can be used in conjunction with electronic device 30, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID (disk array) systems, tape drives, and data backup storage systems.

[0076] It should be noted that although several units / modules or sub-units / modules of the electronic device have been mentioned in the detailed description above, this division is merely exemplary and not mandatory. In fact, according to the embodiments of this application, the features and functions of two or more units / modules described above can be embodied in one unit / module. Conversely, the features and functions of one unit / module described above can be further divided and embodied by multiple units / modules.

[0077] This application also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the method provided in any of the above embodiments.

[0078] More specifically, the readable storage medium may include, but is not limited to: a portable disk, a hard disk, random access memory, read-only memory, erasable programmable read-only memory, an optical storage device, a magnetic storage device, or any suitable combination thereof.

[0079] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. Any references to memory, storage, databases, or other media used in the embodiments provided in this application can include non-volatile and / or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

[0080] This application also provides a computer program product, including a computer program that, when executed by a processor, implements the method described in any of the above embodiments.

[0081] The program code for executing the computer program product of this application can be written in any combination of one or more programming languages. The program code can be executed entirely on the user device, partially on the user device, as a standalone software package, partially on the user device and partially on a remote device, or entirely on a remote device.

[0082] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

[0083] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of the application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these modifications and improvements all fall within the protection scope of this application.

[0084] The various embodiments in this specification are described in a progressive manner. The same or similar parts between the various embodiments can be referred to each other. Each embodiment focuses on describing the differences from other embodiments.

Claims

1. A semantic parsing method for audiovisual videos suitable for cross-media information retrieval, characterized in that the method includes: extracting a video frame sequence and an audio slice sequence of equal length from the audiovisual video to be processed, and mapping the video frame sequence and the audio slice sequence to a shared feature space; constructing an audiovisual feature distance matrix based on the feature differences and temporal deviations between the video frame sequence and the audio slice sequence; transforming the audiovisual feature distance matrix into an undirected complete bipartite graph, and solving it with a graph matching algorithm to generate a minimum distance matching sequence; determining an audiovisual feature matching score based on the degree of temporal inversion of the minimum distance matching sequence and the values in a spatial cost matrix that correspond to each node of the minimum distance matching sequence, wherein the spatial cost matrix is determined based on the feature differences between the video frame sequence and the audio slice sequence; and determining the feature storage paths of the video frame sequence and the audio slice sequence based on the result of comparing the audiovisual feature matching score with a preset judgment benchmark.

2. The audiovisual video semantic parsing method applicable to cross-media information retrieval as described in claim 1, characterized in that the step of extracting a video frame sequence and an audio slice sequence of equal length from the audiovisual video to be processed, and mapping the video frame sequence and the audio slice sequence to a shared feature space, includes: for the video data stream of the audiovisual video to be processed within a preset time window, extracting the video frame sequence according to the frame rate of the video data stream, and taking the total number of video frames contained in the video frame sequence as the number of reference time nodes; for the audio data stream of the audiovisual video to be processed within the same preset time window, determining the time interval of each audio slice according to the ratio of the preset time window to the number of reference time nodes, and slicing the audio data stream at equal intervals according to that time interval to obtain the audio slice sequence; extracting the original features of the video frame sequence and of the audio slice sequence respectively; and calling a pre-trained linear mapping layer to reduce the dimensionality of the original features and project them into a shared feature space of the same dimension.
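As a hedged illustration of the equal-length sampling in claim 2, the sketch below derives the number of reference time nodes from the frame rate and the window length, then cuts the same window into that many equal audio slices. The function name `plan_slices`, the rounding of the frame count, and the example values are assumptions made for illustration; the claim does not fix them.

```python
def plan_slices(window_s: float, fps: float):
    """Return (n_nodes, slice_bounds) for one preset time window.

    n_nodes is the number of video frames in the window, assumed here to be
    round(window_s * fps). slice_bounds holds one (start, end) pair in seconds
    per audio slice, so the audio sequence has the same length as the video
    frame sequence.
    """
    n_nodes = round(window_s * fps)
    if n_nodes <= 0:
        raise ValueError("the window must contain at least one video frame")
    interval = window_s / n_nodes  # slice duration = window / node count
    bounds = [(i * interval, (i + 1) * interval) for i in range(n_nodes)]
    return n_nodes, bounds


# Hypothetical example: a 2-second window sampled at 25 frames per second.
n_nodes, bounds = plan_slices(window_s=2.0, fps=25.0)
```

Tying the slice interval to the frame count is what makes the later distance matrix square, which a one-to-one matching requires.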

3. The audiovisual video semantic parsing method applicable to cross-media information retrieval as described in claim 1, characterized in that the step of constructing an audiovisual feature distance matrix based on the feature differences and temporal deviations between the video frame sequence and the audio slice sequence includes: traversing each video feature vector in the video frame sequence and each audio feature vector in the audio slice sequence; obtaining, based on the inner product value of each video feature vector and each audio feature vector in the shared feature space, the spatial distance cost value between each video feature vector and each audio feature vector, and constructing a spatial cost matrix from the corresponding spatial distance cost values; calculating the ratio of the absolute difference between the time sequence number of each video frame and the time sequence number of each audio slice to the number of reference time nodes, and determining the time deviation between each video frame and each audio slice based on the product of that ratio and a preset timing offset constant; and summing the spatial distance cost value and the time deviation corresponding to the same pair of video frame and audio slice to obtain the value of the matrix element at the corresponding position in the audiovisual feature distance matrix.
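The construction in claim 3 can be sketched as follows, assuming the spatial cost matrix is already available as a square list of lists and that time sequence numbers are the zero-based row and column indices. The name `build_distance_matrix`, the parameter `offset_const` (standing in for the preset timing offset constant), and the sample values are illustrative assumptions.

```python
def build_distance_matrix(spatial_cost, offset_const):
    """Add a time deviation term to every entry of a square spatial cost matrix.

    Entry (i, j) becomes spatial_cost[i][j] + offset_const * |i - j| / n,
    where n is the number of reference time nodes (the matrix size).
    """
    n = len(spatial_cost)
    return [
        [spatial_cost[i][j] + offset_const * abs(i - j) / n for j in range(n)]
        for i in range(n)
    ]


# Hypothetical 2 x 2 spatial cost matrix and timing offset constant.
spatial_cost = [[0.1, 0.9], [0.8, 0.2]]
dist = build_distance_matrix(spatial_cost, offset_const=0.5)
```

Pairs on the diagonal keep their spatial cost unchanged, while pairs that are far apart in time are penalized in proportion to their index gap.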

4. The audiovisual video semantic parsing method applicable to cross-media information retrieval as described in claim 3, characterized in that the step of obtaining the spatial distance cost value between each video feature vector and each audio feature vector based on the inner product value of each video feature vector and each audio feature vector in the shared feature space includes: applying L2-norm normalization to the video feature vectors and audio feature vectors mapped to the shared feature space; calculating the inner product value between each normalized video feature vector and each normalized audio feature vector, and multiplying that inner product value by a preset value to obtain the corresponding first value; and obtaining the spatial distance cost value between each video feature vector and each audio feature vector based on the square root of the difference between the preset value and the corresponding first value.
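A minimal sketch of the cost in claim 4 follows. The claim leaves the preset value open; the sketch assumes it is 2, in which case the square root of 2 minus 2 times the inner product equals the Euclidean distance between two unit vectors. The function names and the clamp against small negative rounding errors are also assumptions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def spatial_cost(video_vec, audio_vec, preset=2.0):
    """Return sqrt(preset - preset * <v, a>) for the normalized vectors.

    preset=2.0 is an assumed value, not one stated in the claim.
    """
    v = l2_normalize(video_vec)
    a = l2_normalize(audio_vec)
    inner = sum(x * y for x, y in zip(v, a))
    first_value = preset * inner
    # Clamp at zero so floating-point rounding cannot produce a negative radicand.
    return math.sqrt(max(preset - first_value, 0.0))


same = spatial_cost([3.0, 4.0], [3.0, 4.0])       # identical directions
orthogonal = spatial_cost([1.0, 0.0], [0.0, 1.0])  # unrelated directions
opposite = spatial_cost([1.0, 0.0], [-1.0, 0.0])   # opposite directions
```

Under the assumed preset of 2, the cost ranges from 0 for identical directions to 2 for opposite ones.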

5. The audiovisual video semantic parsing method applicable to cross-media information retrieval as described in claim 1, characterized in that the step of transforming the audiovisual feature distance matrix into an undirected complete bipartite graph and solving it with a graph matching algorithm to generate a minimum distance matching sequence includes: mapping each row index of the audiovisual feature distance matrix to a video vertex and each column index of the audiovisual feature distance matrix to an audio vertex, and using the value of each matrix element in the audiovisual feature distance matrix as the weight of the edge connecting the corresponding video vertex and audio vertex, to construct the undirected complete bipartite graph; solving, based on the Hungarian algorithm and with the optimization objective of minimizing the sum of the weights of the selected edges, for a matching scheme containing multiple vertex pairs, wherein the graph matching algorithm includes the Hungarian algorithm and each vertex pair contains one video vertex and one audio vertex; and, in the order of the time sequence numbers of the video frames corresponding to the video vertices, sequentially extracting the time sequence number of the audio vertex matched to each video vertex in the matching scheme, and arranging these time sequence numbers into a one-dimensional array to obtain the minimum distance matching sequence; wherein the position index of the minimum distance matching sequence indicates the time sequence number of the video frame corresponding to the video vertex, and the value stored at that position indicates the time sequence number of the audio slice corresponding to the audio vertex matched to that video vertex.
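The matching step in claim 5 can be illustrated with a compact pure-Python version of the Hungarian (Kuhn-Munkres) method in its row and column potentials form. This is a sketch under stated assumptions, not the applicant's implementation: the function name `min_distance_matching`, the zero-based indices, and the 3 x 3 sample matrix are all illustrative.

```python
def min_distance_matching(dist):
    """Solve the square assignment problem and return the matching sequence.

    The result is a list `match` where match[i] is the audio slice index
    assigned to video frame i, chosen so that the sum of dist[i][match[i]]
    is minimal.
    """
    n = len(dist)
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials (index 0 is a dummy)
    v = [0.0] * (n + 1)   # column potentials (index 0 is a dummy)
    p = [0] * (n + 1)     # p[j] = row currently matched to column j, 0 if none
    way = [0] * (n + 1)   # way[j] = previous column on the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0 = p[j0]
            delta = inf
            j1 = 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = dist[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j] = cur
                        way[j] = j0
                    if minv[j] < delta:
                        delta = minv[j]
                        j1 = j
            # Shift potentials so that at least one new edge becomes tight.
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the matching along the augmenting path that was found.
        while j0 != 0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    match = [0] * n
    for j in range(1, n + 1):
        match[p[j] - 1] = j - 1
    return match


# Hypothetical audiovisual feature distance matrix (rows: video, columns: audio).
dist = [[4, 1, 3], [2, 0, 5], [3, 2, 2]]
match = min_distance_matching(dist)
total = sum(dist[i][j] for i, j in enumerate(match))
```

For this sample matrix the minimum total weight is 5, reached by the sequence `[1, 0, 2]`: video frame 0 pairs with audio slice 1, frame 1 with slice 0, and frame 2 with slice 2.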

6. The audiovisual video semantic parsing method applicable to cross-media information retrieval as described in claim 1, characterized in that the length of the minimum distance matching sequence is equal to the number of reference time nodes.

7. The audiovisual video semantic parsing method applicable to cross-media information retrieval as described in claim 1, characterized in that the step of determining the audiovisual feature matching score based on the degree of temporal inversion of the minimum distance matching sequence and the values in the spatial cost matrix that correspond to each node of the minimum distance matching sequence includes: counting the total number of inversion events in the minimum distance matching sequence, and determining the temporal inversion ratio by the ratio of the total number of inversion events to the number of inversion combinations corresponding to the minimum distance matching sequence, wherein an inversion event is a pair of positions in the minimum distance matching sequence at which the audio slice number matched by the video frame with the earlier position index is greater than the audio slice number matched by the video frame with the later position index, and the number of inversion combinations is calculated as the product of the length of the minimum distance matching sequence and that length minus one, divided by two; and determining the audiovisual feature matching score based on the temporal inversion ratio of the minimum distance matching sequence and the values in the spatial cost matrix that correspond to each node of the minimum distance matching sequence.
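The temporal inversion ratio of claim 7 can be sketched as a direct pair count. The function name `temporal_inversion_ratio`, the quadratic-time loop, and the choice to return 0.0 for sequences shorter than two elements are assumptions made for illustration.

```python
def temporal_inversion_ratio(match):
    """Return inversions / (n * (n - 1) / 2) for a matching sequence.

    An inversion is a pair of positions i < j with match[i] > match[j],
    meaning an earlier video frame was matched to a later audio slice.
    """
    n = len(match)
    pairs = n * (n - 1) // 2  # number of position pairs that could be inverted
    if pairs == 0:
        return 0.0  # assumed convention for sequences of length 0 or 1
    inversions = sum(
        1 for i in range(n) for j in range(i + 1, n) if match[i] > match[j]
    )
    return inversions / pairs


in_order = temporal_inversion_ratio([0, 1, 2, 3])   # no pair is inverted
reversed_ = temporal_inversion_ratio([3, 2, 1, 0])  # every pair is inverted
one_swap = temporal_inversion_ratio([1, 0, 2])      # one of three pairs
```

A ratio near 0 indicates that the matching preserves temporal order, while a ratio near 1 indicates that the audio order is almost fully reversed relative to the video.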

8. The audiovisual video semantic parsing method applicable to cross-media information retrieval as described in claim 7, characterized in that the step of determining the audiovisual feature matching score based on the temporal inversion ratio of the minimum distance matching sequence and the values in the spatial cost matrix that correspond to each node of the minimum distance matching sequence includes: extracting, based on the position indexes and the stored audio sequence numbers in the minimum distance matching sequence, the matrix element values at the corresponding positions of the spatial cost matrix, and calculating the arithmetic mean of all extracted matrix element values to obtain the mean matching feature distance; performing weighted fusion of the temporal inversion ratio and the mean matching feature distance; and applying exponential decay to the weighted fusion result to obtain the audiovisual feature matching score.
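One possible reading of the scoring in claim 8 is sketched below. The claim does not give the fusion weights or the exact decay function, so the weights `w_time` and `w_dist`, their default values, and the form `exp(-x)` are assumptions, as is the function name `matching_score`.

```python
import math


def matching_score(match, spatial_cost, inversion_ratio, w_time=0.5, w_dist=0.5):
    """Fuse the inversion ratio with the mean matched distance, then decay.

    match[i] is the audio slice index matched to video frame i, and
    spatial_cost[i][j] is the spatial distance cost of that pair.
    """
    mean_dist = sum(spatial_cost[i][j] for i, j in enumerate(match)) / len(match)
    fused = w_time * inversion_ratio + w_dist * mean_dist  # weighted fusion
    return math.exp(-fused)  # assumed decay: larger penalty gives lower score


# Hypothetical spatial cost matrix with a perfect in-order match and a swapped one.
cost = [[0.0, 1.0], [1.0, 0.0]]
perfect = matching_score([0, 1], cost, inversion_ratio=0.0)
swapped = matching_score([1, 0], cost, inversion_ratio=1.0)
```

With this form the score lies in (0, 1] and equals 1 only when the matched pairs have zero spatial cost and no temporal inversions.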

9. The audiovisual video semantic parsing method applicable to cross-media information retrieval as described in claim 1, characterized in that the step of determining the feature storage paths of the video frame sequence and the audio slice sequence based on the result of comparing the audiovisual feature matching score with the preset judgment benchmark includes: when the audiovisual feature matching score is less than the preset judgment benchmark, storing the video feature vectors corresponding to the video frame sequence in the visual retrieval sub-library and the audio feature vectors corresponding to the audio slice sequence in the audio retrieval sub-library; and when the audiovisual feature matching score is not less than the preset judgment benchmark, concatenating, according to the minimum distance matching sequence, the video feature vector corresponding to each video frame in the video frame sequence with the audio feature vector corresponding to its matched audio slice to generate a fused feature sequence, performing global average pooling on the fused feature sequence, and storing the joint feature vector obtained after global average pooling in the joint retrieval library.
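The routing rule of claim 9 can be sketched with plain lists standing in for the three retrieval libraries. The dictionary keys `visual`, `audio`, and `joint`, the function name `route_features`, and the interpretation of global average pooling as a mean over the time axis are assumptions for illustration.

```python
def route_features(score, benchmark, video_vecs, audio_vecs, match, store):
    """Store features separately or as one pooled joint vector.

    Returns "separate" when score < benchmark, otherwise "joint".
    """
    if score < benchmark:
        # Likely not same-source: keep the two modalities apart.
        store["visual"].extend(video_vecs)
        store["audio"].extend(audio_vecs)
        return "separate"
    # Concatenate each video vector with its matched audio vector.
    fused = [video_vecs[i] + audio_vecs[j] for i, j in enumerate(match)]
    dim = len(fused[0])
    # Global average pooling, assumed here to average over the time axis.
    pooled = [sum(row[k] for row in fused) / len(fused) for k in range(dim)]
    store["joint"].append(pooled)
    return "joint"


# Hypothetical two-frame example with two-dimensional features.
video_vecs = [[1.0, 0.0], [3.0, 0.0]]
audio_vecs = [[0.0, 2.0], [0.0, 4.0]]
store = {"visual": [], "audio": [], "joint": []}
high = route_features(0.9, 0.5, video_vecs, audio_vecs, [1, 0], store)
low = route_features(0.1, 0.5, video_vecs, audio_vecs, [1, 0], store)
```

Keeping low-scoring pairs out of the joint library is what prevents a replaced audio track from contaminating joint retrieval results.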

10. An audiovisual video semantic parsing system suitable for cross-media information retrieval, characterized in that the system includes: an extraction module, used to extract a video frame sequence and an audio slice sequence of equal length from the audiovisual video to be processed, and to map the video frame sequence and the audio slice sequence to a shared feature space; a construction module, used to construct an audiovisual feature distance matrix based on the feature differences and temporal deviations between the video frame sequence and the audio slice sequence; a solution module, used to transform the audiovisual feature distance matrix into an undirected complete bipartite graph and solve it with a graph matching algorithm to generate a minimum distance matching sequence; a determination module, used to determine an audiovisual feature matching score based on the degree of temporal inversion of the minimum distance matching sequence and the values in a spatial cost matrix that correspond to each node of the minimum distance matching sequence, wherein the spatial cost matrix is determined based on the feature differences between the video frame sequence and the audio slice sequence; and a comparison module, used to determine the feature storage paths of the video frame sequence and the audio slice sequence based on the result of comparing the audiovisual feature matching score with a preset judgment benchmark.