Intelligent management method for multimedia archives based on artificial intelligence

By employing multi-dimensional feature analysis, cross-modal association algorithms, and dynamic knowledge graphs, the problems of independent and static storage of modal data in multimedia archive management have been solved, enabling dynamic management and full-cycle adaptation of multimedia archives.

CN122309767A · Pending · Publication Date: 2026-06-30 · SICHUAN SHENSHI TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SICHUAN SHENSHI TECHNOLOGY CO LTD
Filing Date
2026-05-21
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

In existing multimedia archive management, a single-dimensional information parsing and processing mode is often adopted, resulting in independent modal data, lack of deep interaction, inability to unify semantic integration, static storage structure that cannot adapt to dynamic changes in archives, and rigid management methods that are difficult to adapt to diverse scenarios.

Method used

An AI-based multimedia archive intelligent management method is adopted. It generates a primary structured description through multi-dimensional feature analysis, uses an improved cross-modal association algorithm for feature interaction and alignment, constructs a dynamic knowledge graph, and combines an archive lifecycle status tracking model to generate an intelligent management operation sequence.

Benefits of technology

It achieves deep integration of multi-dimensional features, dynamically monitors changes in archive status, adapts to the full lifecycle management of multimedia archives, enhances the linkage performance of multi-source heterogeneous data, and adapts to diverse management and control scenarios.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122309767A_ABST
Patent Text Reader

Abstract

This invention relates to the field of intelligent archives management technology, specifically to an artificial intelligence-based intelligent management method for multimedia archives. The method includes: collecting raw input data from multimedia archives; performing multi-dimensional feature analysis through visual element recognition, audio content transcription, and text semantic extraction; and outputting a preliminary structured description. An improved cross-modal association algorithm incorporating a multi-layer attention mechanism is introduced to achieve interactive alignment of visual, audio, and text features, generating a fused semantic index. Based on the index, a dynamic knowledge graph is constructed with entities as nodes and relationships as edges. Evolutionary analysis is conducted using an archive lifecycle state tracking model to obtain real-time state vectors and state transition probabilities. A customized management operation sequence is generated by combining a pre-set management strategy rule base. This method effectively achieves deep fusion of multi-modal archive information, enabling dynamic and refined intelligent management and control of archives.

Description

Technical Field

[0001] This invention relates to the field of intelligent archives management technology, and in particular to a multimedia archives intelligent management method based on artificial intelligence.

Background Technology

[0002] The conventional management model for multimedia archives often adopts a single-dimensional parsing and processing mode, in which visual content recognition, audio-to-text conversion, and text content extraction are each completed independently. Archive information is organized by simple labeling, and the data of each modality remain independent of one another. Conventional cross-modal processing methods perform only shallow data splicing and lack logic for deep feature interaction, which makes it difficult to integrate the semantics of multi-source heterogeneous multimedia archive information in a unified way.

[0003] Traditional archival storage and management systems are mostly statically built, recording only basic physical information of archives. They cannot uncover the semantic, spatiotemporal, and logical relationships between archival entities, making it difficult to form a dynamically adjustable relational architecture. In the daily management and control of archives, it is impossible to track the entire lifecycle of changes in archives, capture the dynamic changes in archival relationships, and quantify changes in archival status. Management and execution methods are rigid and singular, making it difficult to adapt to the diverse management and control scenarios of multimedia archives.

[0004] Multimedia archives are characterized by complex data types and rich information dimensions. Single parsing and shallow fusion methods can lead to information fragmentation. Static archive storage structures cannot adapt to the dynamic changes that occur during long-term use of archives. It is necessary to optimize the multimodal feature fusion and matching methods, strengthen the dynamic monitoring capabilities of archives throughout their entire lifecycle, and make up for the shortcomings of traditional management models in terms of semantic association and dynamic analysis.

Summary of the Invention

[0005] The purpose of this invention is to address the shortcomings of existing technologies by proposing an intelligent multimedia archive management method based on artificial intelligence.

[0006] To achieve the above objectives, the present invention adopts the following technical solution: an intelligent multimedia archive management method based on artificial intelligence, comprising: receiving raw input data of a multimedia archive, performing multi-dimensional feature analysis on the raw input data, and generating a primary structured description of the multimedia archive, where the multi-dimensional feature analysis includes visual element recognition, audio content transcription, and text semantic extraction; invoking an improved cross-modal association algorithm to process the primary structured description and generate a fused semantic index for the multimedia archive, where the improved cross-modal association algorithm constructs the interaction and alignment between visual, audio, and text features based on a multi-layer attention mechanism; constructing a dynamic knowledge graph of the multimedia archive based on the fused semantic index, where the dynamic knowledge graph uses archive entities as nodes and the semantic, spatiotemporal, and logical relationships between entities as edges; performing state evolution analysis on the dynamic knowledge graph using an archive lifecycle state tracking model to generate real-time state vectors and state transition probabilities of the archive; and combining the real-time state vector, the state transition probability, and a predefined management strategy rule base to generate an intelligent management operation sequence for a specific multimedia archive.

[0007] As a further aspect of the present invention, the step of receiving the original input data of the multimedia file and performing multi-dimensional feature parsing on the original input data to generate a preliminary structured description of the multimedia file includes: For raw input data of image or video type, multi-scale visual features are extracted by pre-trained deep convolutional neural network, and visual elements and their spatial relationships are identified by object detection and scene recognition technology to generate visual element recognition results. For raw audio input data, the speech recognition engine converts its content into text, and acoustic feature analysis identifies audio tracks, speakers, background sounds and emotional tendencies to generate audio content transcription and acoustic analysis results. For raw text input data, natural language processing techniques are used to perform word segmentation, part-of-speech tagging, named entity recognition, and dependency parsing to extract key entities, themes, and sentiments, and generate text semantic extraction results. The visual element recognition results, audio content transcription and acoustic analysis results, and text semantic extraction results of the same multimedia file are aligned and associated according to timestamps or logical chapters, and packaged to generate a primary structured description file of the multimedia file.

[0008] As a further aspect of the present invention, the step of invoking an improved cross-modal association algorithm to process the primary structured description and generate a fused semantic index for the multimedia archive includes: The visual feature vector set, audio feature vector set, and text feature vector set are separated from the primary structured description; The visual feature vector set, audio feature vector set, and text feature vector set are simultaneously input into the multi-head cross-modal attention layer of the improved cross-modal association algorithm; In the multi-head cross-modal attention layer, each modality's feature vector set serves in turn as the query vectors, and an attention weight distribution is calculated against the feature vector sets of all other modalities, which serve as the key vectors and value vectors; Based on the attention weight distribution, the feature vector sets from different modalities are weighted, fused, and aggregated to generate a set of intermediate feature representations containing cross-modal interaction information; The intermediate feature representation is subjected to dimensionality reduction and normalization to form a fusion feature vector with a unified dimension, and the fusion feature vector is bound to the original multimedia archive identifier as the fused semantic index.
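The steps in [0008] can be illustrated with a minimal sketch, under several assumptions that are not in the patent text: a single attention head stands in for the multi-head layer, no learned projection matrices are used (keys and values are the raw feature vectors), all three modalities share one feature dimension, and mean pooling followed by L2 normalization stands in for the dimensionality reduction and normalization step. The function names `cross_attend` and `build_fused_index` are hypothetical.

```python
import math


def softmax(row):
    # Numerically stable softmax over one row of attention scores.
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross_attend(queries, keys_values):
    """Scaled dot-product attention; keys and values are the same vectors here."""
    scale = math.sqrt(len(queries[0]))
    weights = [softmax([dot(q, k) / scale for k in keys_values]) for q in queries]
    dim = len(keys_values[0])
    attended = [
        [sum(w * kv[j] for w, kv in zip(row, keys_values)) for j in range(dim)]
        for row in weights
    ]
    return attended, weights


def mean_pool(vectors):
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def build_fused_index(archive_id, visual, audio, text):
    """Each modality queries the other two; results are pooled and normalized."""
    modalities = {"visual": visual, "audio": audio, "text": text}
    pooled = []
    for name, feats in modalities.items():
        others = [vec for other, fs in modalities.items() if other != name for vec in fs]
        attended, _ = cross_attend(feats, others)
        # Add the attended cross-modal information to the modality's own features.
        enriched = [[x + y for x, y in zip(f, a)] for f, a in zip(feats, attended)]
        pooled.append(mean_pool(enriched))
    merged = mean_pool(pooled)  # assumed stand-in for a learned reduction
    norm = math.sqrt(dot(merged, merged)) or 1.0
    return {"archive_id": archive_id, "vector": [x / norm for x in merged]}
```

The returned dictionary binds the unit-length fusion vector to the archive identifier, which mirrors the last step of the paragraph. A trained system would replace the pooling with learned projections.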

[0009] As a further aspect of the present invention, the improved cross-modal association algorithm constructs the interaction and alignment between visual, audio, and text features based on a multi-layer attention mechanism, including: The improved cross-modal association algorithm comprises at least three sequential multi-layer attention mechanism processing stages; In the first multi-layer attention mechanism processing stage, the visual feature vector set is used as the dominant query to calculate the attention distribution of visual features on audio features and text features, and aggregate relevant information to generate a visually enhanced contextual representation. In the second multi-layer attention mechanism processing stage, the audio feature vector set is used as the dominant query to calculate the attention distribution of audio features on the visually enhanced contextual representation and the text features, and aggregate relevant information to generate an audiovisually enhanced contextual representation. In the third multi-layer attention mechanism processing stage, the text feature vector set is used as the dominant query to calculate the attention distribution of text features on the visually enhanced contextual representation and the audiovisually enhanced contextual representation, and aggregate relevant information to generate the final cross-modal fusion representation. Each processing stage includes a residual connection and layer normalization operation to ensure the stability of the information flow and the convergence of the training process.
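The three sequential stages in [0009] can be sketched as follows. This is an illustration only: it assumes single-head attention without learned weights, a shared feature dimension across modalities, and a layer normalization without learned gain or bias. The names `stage` and `three_stage_fusion` are hypothetical.

```python
import math


def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, context):
    """Scaled dot-product attention of each query over the context vectors."""
    scale = math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) / scale for c in context]
        weights = softmax(scores)
        outputs.append(
            [sum(w * c[j] for w, c in zip(weights, context)) for j in range(len(q))]
        )
    return outputs


def layer_norm(vec, eps=1e-5):
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def stage(queries, context):
    """One processing stage: attention, residual connection, layer normalization."""
    attended = attend(queries, context)
    return [
        layer_norm([q + a for q, a in zip(q_vec, a_vec)])
        for q_vec, a_vec in zip(queries, attended)
    ]


def three_stage_fusion(visual, audio, text):
    # Stage 1: visual features query audio and text features.
    visual_ctx = stage(visual, audio + text)
    # Stage 2: audio features query the visually enhanced context and text features.
    audiovisual_ctx = stage(audio, visual_ctx + text)
    # Stage 3: text features query both enhanced contexts.
    return stage(text, visual_ctx + audiovisual_ctx)
```

The output has one vector per text feature because the text set is the dominant query in the final stage.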

[0010] As a further aspect of the present invention, a dynamic knowledge graph of multimedia archives is constructed based on the fused semantic index, including: The fused semantic index is parsed to identify named entities, event elements, spatiotemporal markers, and attribute descriptions contained in the fused semantic index, forming a list of candidate entities and relationships; Based on a predefined domain ontology model, the candidate entities and relation list are classified, disambiguated, and normalized to determine standardized archive entity nodes and standardized relation labels. Based on the standardized relation tags, directed relation edges are established between the standardized archive entity nodes. The attributes of the relation edges include relation type, confidence level, and relation generation basis extracted from the fused semantic index. Each standardized archival entity node is supplemented with its timestamp, spatial location information, and source archive identifier in the multimedia archive, forming a dynamic knowledge graph with spatiotemporal and source context; The dynamic knowledge graph is stored in the form of a graph database, and a globally unique identifier is assigned to each node and edge.
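The node and edge attributes listed in [0010] can be illustrated with a small in-memory structure. This is a sketch, not a graph database: the class and field names are hypothetical, and UUIDs are assumed as the globally unique identifiers.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


def new_id() -> str:
    # Assumed identifier scheme; the patent only requires global uniqueness.
    return str(uuid.uuid4())


@dataclass
class EntityNode:
    label: str
    entity_type: str
    timestamp: str
    source_archive_id: str
    location: Optional[str] = None
    node_id: str = field(default_factory=new_id)


@dataclass
class RelationEdge:
    source_id: str
    target_id: str
    relation_type: str
    confidence: float
    basis: str  # evidence extracted from the fused semantic index
    edge_id: str = field(default_factory=new_id)


class DynamicKnowledgeGraph:
    """Directed graph keyed by node and edge identifiers."""

    def __init__(self):
        self.nodes = {}
        self.edges = {}

    def add_node(self, node):
        self.nodes[node.node_id] = node
        return node.node_id

    def add_edge(self, edge):
        if edge.source_id not in self.nodes or edge.target_id not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges[edge.edge_id] = edge
        return edge.edge_id

    def outgoing(self, node_id):
        return [e for e in self.edges.values() if e.source_id == node_id]
```

Keeping the timestamp, location, and source archive identifier on each node is what lets later snapshots of the graph be compared over time.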

[0011] As a further aspect of the present invention, the step of performing state evolution analysis on the dynamic knowledge graph through the archive lifecycle state tracking model to generate real-time state vectors and state transition probabilities of the archive includes: Extract the historical snapshot sequence of each archive entity node and its associated relation edges over time from the dynamic knowledge graph; The historical snapshot sequence is input into the archive lifecycle status tracking model, which is a time-series prediction model based on graph neural networks; In the archive lifecycle state tracking model, the graph structure in each historical snapshot is encoded into a low-dimensional dense graph state vector through a graph embedding layer; The graph state vectors of continuous time steps are input into the temporal modeling layer to capture the evolution patterns and trends of the graph structure over time. At the end of the temporal modeling layer, the model outputs a prediction of the state vector of each archive entity node at a future preset time point, which serves as its real-time state vector. It also outputs the probability distribution of the archive entity node's transition from the current state to other states, which serves as the state transition probability.

[0012] As a further aspect of the present invention, by combining the real-time state vector, state transition probability, and predefined management strategy rule base, an intelligent management operation sequence for a specific multimedia file is generated, including: The key state dimension values of the archive entity nodes are parsed from the real-time state vector. The key state dimension values include popularity value, integrity measure, relevance and risk score. The key state dimension values and the state transition probabilities are input into the policy matching engine; The policy matching engine traverses the predefined management strategy rule base, where each rule consists of a state condition, a transition probability condition, and a sequence of operations to be executed. A rule is activated when the key state dimension values satisfy the state condition of that rule and the state transition probability satisfies its transition probability condition. The policy matching engine integrates all the operation sequences to be executed corresponding to the activated rules into an ordered, conflict-free intelligent management operation sequence according to the preset conflict resolution and priority sorting logic.
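The rule activation described in [0012] can be sketched as a traversal over rules that each carry two predicates. The rule names, thresholds, state keys, and operation names below are hypothetical examples, not values from the patent.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    name: str
    priority: int
    state_condition: Callable[[dict], bool]       # tested on key state dimension values
    transition_condition: Callable[[dict], bool]  # tested on state transition probabilities
    operations: List[str]


def activated_rules(rules, state, transitions):
    """Return the rules whose state and transition conditions both hold."""
    return [
        rule
        for rule in rules
        if rule.state_condition(state) and rule.transition_condition(transitions)
    ]


# Hypothetical rule base for illustration only.
EXAMPLE_RULES = [
    Rule(
        "cold_storage",
        1,
        lambda s: s["popularity"] < 0.2,
        lambda t: t.get("dormant", 0.0) > 0.6,
        ["migrate_to_cold"],
    ),
    Rule(
        "risk_lockdown",
        9,
        lambda s: s["risk"] > 0.8,
        lambda t: True,
        ["restrict_access", "notify_custodian"],
    ),
]
```

With a state of low popularity and low risk and a high probability of becoming dormant, only the `cold_storage` rule in this example is activated.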

[0013] As a further aspect of the present invention, the policy matching engine integrates all the operation sequences to be executed corresponding to the activated rules into an ordered, conflict-free intelligent management operation sequence according to a preset conflict resolution and priority sorting logic, including: Compare the operations suggested by different activation rules and identify mutually exclusive or contradictory operation pairs in terms of operation objects, operation types, or operation parameters; For mutually exclusive or contradictory operation pairs, the decision is made based on the preset priority value in the source rule, retaining the operation corresponding to the high priority rule and removing or modifying the operation corresponding to the low priority rule. For operations that have the same target and compatible operation types, their operation parameters are merged or the optimal value is taken. All operations that have undergone conflict resolution are sorted chronologically according to the urgency of the archive entities they affect and the logical dependencies between the operations. The sorted list of operations is encapsulated into a sequence of intelligent management operations that can be scheduled and executed by the document management system.
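A minimal sketch of the conflict resolution in [0013], under stated assumptions: operations with the same target and type are merged by taking the maximum of each numeric field (one possible reading of "the optimal value"), mutually exclusive types are listed in a hypothetical table, and the final order uses urgency only. Ordering by logical dependencies between operations is left out of this sketch. All names are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical table of operation types that cannot both apply to one target.
EXCLUSIVE_TYPES = {frozenset({"migrate_to_cold", "promote_to_hot"})}


@dataclass(frozen=True)
class Operation:
    target: str
    op_type: str
    priority: int  # priority of the rule that produced the operation
    urgency: int
    param: int = 0


def merge_compatible(ops):
    """Merge operations with the same target and type, keeping the maximum fields."""
    merged = {}
    for op in ops:
        key = (op.target, op.op_type)
        if key in merged:
            old = merged[key]
            op = Operation(
                op.target,
                op.op_type,
                max(old.priority, op.priority),
                max(old.urgency, op.urgency),
                max(old.param, op.param),
            )
        merged[key] = op
    return list(merged.values())


def conflicts(a, b):
    return a.target == b.target and frozenset({a.op_type, b.op_type}) in EXCLUSIVE_TYPES


def build_sequence(ops):
    """Keep the higher-priority side of each conflict, then order by urgency."""
    kept = []
    for op in sorted(merge_compatible(ops), key=lambda o: -o.priority):
        if not any(conflicts(op, existing) for existing in kept):
            kept.append(op)
    return sorted(kept, key=lambda o: -o.urgency)
```

Processing candidates in descending priority means a lower-priority operation is dropped as soon as it clashes with one already kept.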

[0014] As a further aspect of the present invention, in the multi-head cross-modal attention layer, the step in which each modality's feature vector set serves as the query vectors and an attention weight distribution is calculated against the feature vector sets of all other modalities as key vectors and value vectors includes: Extract the corresponding visual query vector, audio query vector, and text query vector from the visual feature vector set, audio feature vector set, and text feature vector set, respectively; The visual query vector is calculated together with the audio key vector and audio value vector from the audio feature vector set and the text key vector and text value vector from the text feature vector set to generate the attention weight distribution of the visual modality on the audio modality and the text modality. The audio query vector is calculated together with the visual key vector and visual value vector from the visual feature vector set and the text key vector and text value vector from the text feature vector set to generate the attention weight distribution of the audio modality on the visual modality and the text modality. The text query vector is calculated together with the visual key vector and visual value vector from the visual feature vector set and the audio key vector and audio value vector from the audio feature vector set to generate the attention weight distribution of the text modality on the visual modality and the audio modality. The attention weight distribution calculated for the feature vector set of each modality is used to guide the weighted fusion of information from the feature vector sets of other modalities.

[0015] As a further aspect of the present invention, the fused semantic index is parsed to identify named entities, event elements, spatiotemporal markers, and attribute descriptions contained in the fused semantic index, forming a candidate entity and relationship list, including: The fusion feature vector contained in the fusion semantic index is decoded to restore the structured description fragment; Run a named entity recognition model on the structured description fragment to identify personal names, place names, organization names, and domain-specific terms in the fragment as candidate named entities; An event extraction model is run on the structured description fragment to identify the action, participant, time, and location elements in the fragment and combine them into candidate event elements. A spatiotemporal expression parser is run on the structured description fragment to identify and standardize the explicit or implicit time points, time periods, geographic coordinates, and spatial range information in the fragment as candidate spatiotemporal markers; The dependency parsing and semantic role labeling tool is run on the structured description fragment to extract phrases describing entity attributes or states as candidate attribute descriptions; All identified candidate named entities, candidate event elements, candidate spatiotemporal markers, and candidate attribute descriptions, along with their co-occurrence relationships, referential relationships, and syntactic dependencies, are organized into a structured list of candidate entities and relationships.

[0016] Compared with the prior art, the advantages and positive effects of the present invention are as follows: A multi-layered attention mechanism is integrated into the cross-modal association algorithm architecture to build an interaction structure for three heterogeneous features: visual, audio, and text, achieving precise alignment and deep association of different feature types. It abandons the shallow stitching of conventional data, establishing interconnected links between multi-dimensional data to deeply integrate the structured data obtained from multi-dimensional feature parsing, unifying and integrating scattered archival information content, and constructing an integrated semantic indexing system. This reduces information fragmentation caused by independent processing of heterogeneous information, improves the semantic association logic between multi-modal data, fully preserves the original features of multimedia archives across all dimensions, strengthens the collaborative performance capabilities of multi-source heterogeneous data, and achieves deep fusion of different types of archival data.

[0017] The archival lifecycle status tracking model continuously analyzes the evolution of the dynamic knowledge graph, updating its overall architecture and internal logic in sync with the real-time changes in archival entity nodes and various relationships. It presents the real-time operational status of archives in a quantitative form, reflects the dynamic trends of archival status based on state transition probabilities, and records the entire lifecycle of archival circulation changes digitally. Breaking free from the limitations of static information recording, it updates the relationships between entities following the actual circulation process of archives, synchronously adjusting the logical arrangement of the knowledge graph to align with the objective changing patterns of the entire archival lifecycle. It adapts to dynamic management logic based on quantitative data, and accommodates various dynamic changes that occur during the daily circulation of multimedia archives.

Attached Figure Description

[0018] Figure 1 is a flowchart of the intelligent multimedia archive management method based on artificial intelligence as described in this invention; Figure 2 is a flowchart for generating a fused semantic index; Figure 3 is a flowchart illustrating the multi-layer attention mechanism processing of the improved cross-modal association algorithm.

Detailed Implementation

[0019] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention.

[0020] In the description of this invention, it should be understood that the terms "length," "width," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," and "outer," etc., indicating orientation or positional relationships, are based on the orientation or positional relationships shown in the accompanying drawings and are only for the convenience of describing the invention and simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation, and therefore should not be construed as a limitation of the invention. Furthermore, in the description of this invention, "a plurality of" means two or more, unless otherwise explicitly specified.

[0021] See Figure 1. The overall implementation scheme of the AI-based intelligent management method for multimedia archives is as follows: The system receives the raw input data of multimedia archives and performs multi-dimensional feature analysis on the raw input data. The multi-dimensional feature analysis includes visual element recognition, audio content transcription, and text semantic extraction. A preliminary structured description of the multimedia archive is generated through multi-dimensional feature analysis. The system calls an improved cross-modal association algorithm to process the preliminary structured description. The improved cross-modal association algorithm constructs the interaction and alignment between visual, audio, and text features based on a multi-layer attention mechanism, generating a fused semantic index for the multimedia archive. Based on the fused semantic index, the system constructs a dynamic knowledge graph for the multimedia archive. The dynamic knowledge graph uses archive entities as nodes and semantic, spatiotemporal, and logical relationships between entities as edges. The system performs state evolution analysis on the dynamic knowledge graph through an archive lifecycle state tracking model, generating real-time state vectors and state transition probabilities for the archives. The system combines the real-time state vectors, state transition probabilities, and a predefined management strategy rule base to generate an intelligent management operation sequence for a specific multimedia archive.

[0022] In one embodiment of the present invention, for raw input data of image or video type, multi-scale visual features are extracted using a pre-trained deep convolutional neural network, and visual elements and their spatial relationships are identified using object detection and scene recognition technologies. The visual element recognition results include the identified objects, scene categories, and their spatial location information. For raw input data of audio type, the content is converted into text using a speech recognition engine, and audio tracks, speakers, background sounds, and sentiment are identified through acoustic feature analysis. The audio content transcription and acoustic analysis results include the transcribed text, separated audio tracks, identified speaker identity, background sound category, and sentiment label. For raw input data of text type, word segmentation, part-of-speech tagging, named entity recognition, and dependency parsing are performed using natural language processing technologies to extract key entities, topics, and sentiment. The text semantic extraction results include word segmentation lists, entity lists, topic distribution, and sentiment polarity. The visual element recognition results, audio content transcription and acoustic analysis results, and text semantic extraction results of the same multimedia file are aligned and associated according to timestamps or logical chapters. The alignment operation matches data from different sources based on a unified timeline or chapter markers, and packages them to generate a primary structured description file of the multimedia file. The primary structured description file encapsulates all aligned parsing results in a structured data format.

[0023] In practice, the process of receiving raw input data from multimedia files and performing multi-dimensional feature parsing to generate a preliminary structured description of the multimedia files involves the collaborative work of multiple technical components. For raw input data of image or video types, multi-scale visual features are extracted using a pre-trained deep convolutional neural network. This deep convolutional neural network, for example, employs a residual network model trained on a large image dataset. The multi-layered structure of this model can progressively extract hierarchical feature maps from the input data, from low-level edges to high-level semantics. Then, object detection and scene recognition technologies are used to identify the visual elements and their spatial relationships. The object detection technology uses a single-stage detector based on anchor boxes to scan the feature maps, outputting the bounding box coordinates and class probabilities of each target of interest in the image. The scene recognition technology models the contextual information of the entire image through a global pooling layer and a fully connected classifier, thereby predicting the scene category to which the image belongs. The visual element recognition results are finally organized into structured data, recording the category, confidence level, location coordinates, and overall scene label of each detected target. The spatial relationships between targets can be inferred by parsing the relative geometric positions of the bounding boxes.
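The last sentence of [0023], inferring spatial relationships from the relative geometry of bounding boxes, can be sketched as below. The relation labels and the decision order are assumptions made for illustration; boxes are assumed to be `(x_min, y_min, x_max, y_max)` in image coordinates with y increasing downward.

```python
def spatial_relation(box_a, box_b):
    """Describe where box_a sits relative to box_b."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Containment is checked first so a nested box is not reported as an overlap.
    if ax0 >= bx0 and ay0 >= by0 and ax1 <= bx1 and ay1 <= by1:
        return "inside"
    if min(ax1, bx1) > max(ax0, bx0) and min(ay1, by1) > max(ay0, by0):
        return "overlaps"
    # For disjoint boxes, compare centers along the dominant axis.
    dx = (bx0 + bx1) / 2 - (ax0 + ax1) / 2
    dy = (by0 + by1) / 2 - (ay0 + ay1) / 2
    if abs(dx) >= abs(dy):
        return "left_of" if dx > 0 else "right_of"
    return "above" if dy > 0 else "below"
```

For example, `spatial_relation((0, 0, 10, 10), (20, 0, 30, 10))` yields `"left_of"` because the second box's center lies to the right of the first.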

[0024] In some embodiments, when processing video data, the system samples keyframes of the video stream at equal intervals or based on shot transitions. A deep convolutional neural network and object detection model are independently applied to each sampled frame, and detected targets with similar appearances and motion trajectories in consecutive frames are associated to form cross-frame target trajectories. Scene recognition can be performed on keyframes. Both object detection results and scene recognition results are annotated with precise timestamps. Spatial relationships of visual elements are calculated between different targets within the same frame. A feature vector of a video scene can also be generated by aggregating the features of the main targets detected in all keyframes within the scene; the calculation can be expressed as a weighted average of the feature vectors of each target, with notation independent of that used in other embodiments. For raw audio input data, the content is converted into text by a speech recognition engine. The speech recognition engine employs an end-to-end automatic speech recognition model based on connectionist temporal classification (CTC) loss, which directly maps the Mel-spectrogram feature sequence of the audio to the corresponding character or word sequence. Acoustic feature analysis identifies the audio track, speaker, background sound, and emotional tendency. The acoustic feature analysis module extracts acoustic feature sequences, including Mel-frequency cepstral coefficients, spectral centroid, and zero-crossing rate, from the raw audio waveform. Speaker recognition uses a deep neural network-based voiceprint embedding extraction and clustering method. Background sound recognition classifies non-speech segments using a pre-trained sound event classification model. Emotional tendency analysis models the acoustic features of speech segments using a classifier to predict their emotional category label. The audio content transcription and acoustic analysis results are integrated into a structured output that includes timestamped transcribed text, a speaker identification tag sequence, a background sound category tag sequence, and a sentiment score.
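The weighted average mentioned for video scene feature vectors in [0024] can be written out as a short sketch. The choice of weights is an assumption; detection confidence or on-screen duration would be plausible candidates, but the patent text does not specify them. The function name `scene_feature` is hypothetical.

```python
def scene_feature(target_features, weights):
    """Weighted average of per-target feature vectors within one scene."""
    if len(target_features) != len(weights):
        raise ValueError("one weight is required per target feature vector")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(target_features[0])
    return [
        sum(w * feat[j] for w, feat in zip(weights, target_features)) / total
        for j in range(dim)
    ]
```

Dividing by the weight total keeps the scene vector on the same scale as the individual target vectors regardless of how many targets the scene contains.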

[0025] In practical implementation, speech recognition and acoustic feature analysis can be performed in parallel or sequentially. When using the sequential method, the system first performs voice activity detection to separate speech segments from non-speech segments. Speech segments are sent to the speech recognition engine and speaker recognition module, while non-speech segments are sent to the background sound recognition module. Timestamp information is accurately recorded to ensure that each transcribed text segment, each identified speaker fragment, and each background sound is aligned with the timeline of the original audio. The generated audio content transcription and acoustic analysis result files organize data with time intervals as the basic unit. For raw text input data, natural language processing techniques are used for word segmentation, part-of-speech tagging, named entity recognition, and dependency parsing. Word segmentation and part-of-speech tagging employ a word segmenter and part-of-speech tagger based on a combination of dictionary and statistical models. Named entity recognition uses a sequence labeling model that combines bidirectional long short-term memory networks with conditional random fields. Dependency parsing uses a graph-based neural network parser. Key entities, topics, and sentiment are then extracted. Key entities are obtained directly from the results of named entity recognition. Topic extraction uses a Latent Dirichlet Allocation (LDA) model to obtain the document's topic distribution. Sentiment analysis calculates the text's sentiment polarity using either rule-based methods built on a sentiment dictionary or classification methods built on a pre-trained language model. The text semantic extraction results are organized into a data structure containing the original text, a word segmentation list, a part-of-speech tag sequence, a named entity list, a dependency syntax tree, a topic distribution vector, and a sentiment score. In some embodiments, the raw input data of the text type may come from a text document accompanying an archive, a subtitle file, or text generated from audio transcription. The system selects the corresponding natural language processing model based on the encoding format and language type of the input data. For documents containing multiple paragraphs, processing can be performed paragraph by paragraph, preserving the structural information between paragraphs. The text semantic extraction results record the positional offset of each identified entity and extracted topic in the document.

[0026] In practice, the visual element recognition results, audio content transcription and acoustic analysis results, and text semantic extraction results of the same multimedia file are aligned and associated according to timestamps or logical chapters. For video files and their accompanying audio, both the visual element recognition results and the audio content transcription results carry precise timestamps. The system associates the visual targets, transcribed text statements, identified speakers, and background sounds that fall within the same time point or time interval. For multimedia files organized by logical chapters, the system groups and packages the visual, audio, and text analysis results belonging to the same chapter according to preset chapter tags. The association operation is achieved by establishing a cross-modal index table. For example, in the primary structured description file, the identifier of a video frame can be linked to all visual targets detected within that frame, and simultaneously linked to the audio transcription text fragments and acoustic tags corresponding to that frame's time point. The aligned results are then packaged to generate the primary structured description file for the multimedia archive. The primary structured description file adopts the Extensible Markup Language (XML) or JavaScript Object Notation (JSON) format. The file is divided into a metadata area and a content data area. The content data area is organized according to the timeline or logical chapter sequence. Each data unit encapsulates the aligned structured data obtained from multi-dimensional feature parsing.
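The timestamp-based association described above can be illustrated with a minimal sketch. The field names (`frame_id`, `visual_targets`, `transcript`, and so on), the sample records, and the archive identifier `demo-001` are hypothetical and chosen only for illustration; the specification does not prescribe a concrete schema.

```python
import json

def build_index(frames, transcript, sounds):
    """Link each video frame to the audio records active at its timestamp."""
    units = []
    for frame in frames:
        t = frame["time"]
        # Transcript segments whose half-open interval [start, end) contains t.
        active = [s for s in transcript if s["start"] <= t < s["end"]]
        units.append({
            "frame_id": frame["id"],
            "time": t,
            "visual_targets": frame["targets"],
            "transcript": [s["text"] for s in active],
            "speakers": sorted({s["speaker"] for s in active}),
            "background": [b["label"] for b in sounds
                           if b["start"] <= t < b["end"]],
        })
    return units

# Hypothetical analysis results for one short video file.
frames = [
    {"id": "f0", "time": 0.5, "targets": ["person"]},
    {"id": "f1", "time": 2.5, "targets": ["person", "desk"]},
]
transcript = [
    {"start": 0.0, "end": 2.0, "speaker": "S1", "text": "Welcome to the archive."},
    {"start": 2.0, "end": 4.0, "speaker": "S2", "text": "Thank you."},
]
sounds = [{"start": 0.0, "end": 4.0, "label": "office ambience"}]

# Metadata area plus a content data area ordered along the timeline.
description = {
    "metadata": {"archive_id": "demo-001"},
    "content": build_index(frames, transcript, sounds),
}
print(json.dumps(description, indent=2))
```

Each entry of `content` plays the role of one data unit of the content data area; a chapter-based variant would group by a chapter tag instead of a time point.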

[0027] In one embodiment of the present invention, see Figure 2. The visual feature vector set, audio feature vector set, and text feature vector set are separated from the primary structured description. This separation operation is performed by reading the feature data stored in the primary structured description file. These three sets are then simultaneously input into the multi-head cross-modal attention layer of the improved cross-modal association algorithm. In this multi-head cross-modal attention layer, each modality's feature vector set is used as a query vector, and its attention weight distribution is calculated with the feature vector sets of all other modalities as key and value vectors. The corresponding visual query vector, audio query vector, and text query vector are extracted from the visual, audio, and text feature vector sets, respectively. The visual query vector is then used in conjunction with the audio key and value vectors from the audio feature vector set and the text key and value vectors from the text feature vector set to generate the attention weight distribution of the visual modality on the audio and text modalities. The audio query vector is computed with visual key vectors and visual value vectors from the visual feature vector set and text key vectors and text value vectors from the text feature vector set to generate an attention weight distribution for the audio modality on the visual and text modalities. The text query vector is then computed with visual key vectors and visual value vectors from the visual feature vector set and audio key vectors and audio value vectors from the audio feature vector set to generate an attention weight distribution for the text modality on the visual and audio modalities. The attention weight distribution computed for each modality's feature vector set guides the weighted fusion of information from feature vector sets of other modalities. 
Based on the attention weight distribution, weighted fusion and information aggregation are performed on feature vector sets from different modalities to generate a set of intermediate feature representations containing cross-modal interaction information. The intermediate feature representations are then subjected to dimensionality reduction and normalization to form a unified-dimensional fused feature vector, which is then bound to the original multimedia file identifier as the fused semantic index.

[0028] In practice, visual feature vector sets, audio feature vector sets, and text feature vector sets are separated from the primary structured description file of the multimedia archive. This separation is accomplished by parsing the data structure of the primary structured description file. The visual feature vector set originates from multi-scale visual features extracted by a pre-trained deep convolutional neural network, potentially existing as multi-layer feature maps or global description vectors. The audio feature vector set originates from Mel-spectrum feature sequences, speaker embedding vectors, or sound event classification vectors extracted by the acoustic feature analysis module. The text feature vector set originates from word embedding sequences, sentence embedding vectors, or named entity encoding vectors output by the natural language processing model. These three sets are simultaneously input into the multi-head cross-modal attention layer of the improved cross-modal association algorithm. Before input, each feature vector set passes through an independent linear projection layer to map features from different modalities to a shared latent space dimension, ensuring that subsequent attention calculations are performed in the same vector space.

[0029] In the multi-head cross-modal attention layer, for each modality's feature vector set, the attention weight distribution is calculated using itself as the query vector and the feature vector sets of all other modalities as key and value vectors. This calculation involves defining learnable query, key, and value linear transformation matrices for each modality. Specifically, visual query vectors, audio query vectors, and text query vectors are extracted from the visual feature vector set, audio feature vector set, and text feature vector set, respectively. This extraction is achieved by applying a query linear transformation to each feature vector set. The visual query vector is then used to calculate the audio key and value vectors from the audio feature vector set, and the text key and value vectors from the text feature vector set. The audio key and value vectors are obtained by applying key and value linear transformations to the audio feature vector set, and the text key and value vectors are obtained by applying key and value linear transformations to the text feature vector set. This process generates the attention weight distribution of the visual modality on the audio and text modalities. The audio query vector is computed with visual key vectors and visual value vectors from the visual feature vector set, and text key vectors and text value vectors from the text feature vector set to generate the attention weight distribution of the audio modality on the visual and text modalities. Similarly, the text query vector is computed with visual key vectors and visual value vectors from the visual feature vector set, and audio key vectors and audio value vectors from the audio feature vector set to generate the attention weight distribution of the text modality on the visual and audio modalities.

[0030] Based on the calculated attention weight distribution, weighted fusion and information aggregation are performed on feature vector sets from different modalities. The aggregation operation involves weighted summation of value vectors from other modalities using attention weights. The attention weight distribution calculated for each modal feature vector set guides the weighted fusion of information from feature vector sets from other modalities. For example, the final cross-modal contextual representation of visual features is obtained by combining three terms: the sum of the audio value vectors weighted by the visual-to-audio attention weights, the sum of the text value vectors weighted by the visual-to-text attention weights, and the original visual feature vector. This process is executed in parallel multiple times along the dimension of the attention heads, each using different linear transformation parameters to capture cross-modal association patterns in different subspaces. The outputs of all attention heads are concatenated and passed through an output linear projection layer to generate a set of intermediate feature representations containing cross-modal interaction information.

[0031] In some embodiments, the calculation of attention weight distribution can employ a scaled dot product attention mechanism. The formula for calculating the attention score between the visual modality query vector and the audio modality key vector is as follows:

$$S_{v \rightarrow a} = \frac{Q_v K_a^{\top}}{\sqrt{d_k}}$$

[0032] where $Q_v$ represents the visual query vector matrix, $K_a$ represents the audio key vector matrix, $d_k$ is the dimension of the key vectors, used to scale the dot product result, and $S_{v \rightarrow a}$ is the calculated score matrix. After normalization using the softmax function, the attention weight distribution of the visual modality on the audio modality is obtained. This weight distribution is then used to weight the audio value vector matrix. The intermediate feature representations undergo dimensionality reduction and normalization. Dimensionality reduction can be achieved by using a fully connected layer to map the concatenated high-dimensional vector to a lower dimension, and normalization is performed by layer normalization. A unified fused feature vector is formed, and this fused feature vector is bound to the original multimedia file identifier as the fused semantic index. This binding operation is achieved by establishing a mapping relationship between the unique multimedia file identifier and the fused feature vector in the index database.
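As a minimal sketch of the scaled dot-product computation described above, the following pure-Python example scores two visual query vectors against three audio key vectors, normalizes the scores with softmax, and uses the resulting weights to sum the audio value vectors. The names `Q_v`, `K_a`, and `V_a` and all numeric values are illustrative assumptions; learned linear projections and multiple heads are omitted.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) and the weighted sum of the values."""
    d_k = len(keys[0])
    weights = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights.append(softmax(scores))
    width = len(values[0])
    context = [[sum(w * v[j] for w, v in zip(row, values)) for j in range(width)]
               for row in weights]
    return weights, context

# Toy data: 2 visual queries, 3 audio keys/values (assumed, already projected).
Q_v = [[1.0, 0.0], [0.0, 1.0]]
K_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

weights, context = cross_attention(Q_v, K_a, V_a)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and expresses how strongly one visual query attends to each audio position; `context` holds the corresponding aggregated audio information.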

[0033] In some embodiments, to address the issue of inconsistent feature sequence lengths, longer feature sequences can be max-pooled or self-attention pooled before attention computation to obtain fixed-length vector representations. It is understood that the design of multi-head cross-modal attention layers allows the model to simultaneously focus on multiple different interactive aspects from different modal information; for example, one attention head might focus on aligning a visual object with the text words describing it, while another attention head might focus on associating sound events with actions occurring in a video. Optionally, after generating intermediate feature representations, a cross-modal contrastive learning loss function can be introduced to force fused representations from different modalities within the same multimedia archive to be closer to each other in the latent space, while representations from different multimedia archives are further apart, thereby enhancing the discriminative power of the fused semantic index. In a specific implementation, the generated fused semantic index is a fixed-length dense floating-point vector that encapsulates the semantic information of the multimedia archive across visual, audio, and text modalities and the complex relationships between them. It is understood that this fused semantic index, as a unified semantic representation of the multimedia archive, can support content-based precise retrieval, similarity comparison, and high-level semantic understanding tasks. Optionally, the system can attach a lightweight metadata header to each fused semantic index to record the model version, feature dimensions, and timestamp information used to generate the index, so as to ensure the version traceability and compatibility of the index.

[0034] In one embodiment of the invention, the improved cross-modal association algorithm comprises at least three sequential multi-layer attention mechanism processing stages. See Figure 3. In the first multi-layer attention mechanism processing stage, the visual feature vector set is used as the dominant query. The attention distribution of visual features on audio and text features is calculated. The calculation of the attention distribution uses the visual feature vector set as the query, and the audio and text feature vector sets as keys and values. Relevant information is aggregated, and the aggregation operation performs a weighted summation of the value vectors according to attention weights to generate a visually enhanced contextual representation. In the second multi-layer attention mechanism processing stage, the audio feature vector set is used as the dominant query. The attention distribution of audio features on the visually enhanced contextual representation and text features is calculated. The calculation of the attention distribution uses the audio feature vector set as the query, and the visually enhanced contextual representation and text feature vector set as keys and values. Relevant information is aggregated to generate an audiovisual enhanced contextual representation. In the third multi-layer attention mechanism processing stage, the text feature vector set is used as the dominant query. The attention distribution of text features on the visually enhanced contextual representation and the audiovisual enhanced contextual representation is calculated. The calculation of the attention distribution uses the text feature vector set as the query, and the visually enhanced contextual representation and audiovisual enhanced contextual representation as keys and values. Relevant information is aggregated to generate the final cross-modal fusion representation. 
Each processing stage includes a residual connection and a layer normalization operation. The residual connection adds the input of the processing stage to the output of the cross-modal attention module, and the layer normalization operation normalizes the result after addition.

[0035] In its implementation, the improved cross-modal association algorithm comprises at least three sequential, multi-layered attention mechanism processing stages. Each stage receives the output of the previous stage and generates an enhanced contextual representation. Taking an input containing a set of visual feature vectors, an audio feature vector set, and a text feature vector set as an example, the first stage uses the visual feature vector set as the dominant query, calculating the attention distribution of visual features on the original audio and text features. In this stage, attention scores are calculated for the visual query vector, audio key vector, and text key vector, respectively. These attention scores are normalized to form weights, which are used to perform a weighted summation of the audio and text value vectors. The aggregated cross-modal information is then fused with the original visual feature vectors to generate a visually enhanced contextual representation. This process can be formally represented as a context enhancement operation for each visual feature element.

[0036] In its implementation, the second multi-layer attention mechanism processing stage uses the audio feature vector set as the dominant query, calculating the attention distribution of audio features on the visually enhanced contextual representation and text features. The input to this stage includes the visually enhanced contextual representation output from the first stage, the original audio feature vector set, and the original text feature vector set. The audio query vector interacts with the key and value vectors from the visually enhanced contextual representation and the original text feature vector set. The calculated attention weights are used to retrieve relevant information from the visual and text modalities. The aggregated information is then fused with the original audio features to generate the audiovisually enhanced contextual representation. In this stage, the text features still use their original representation, so that the original text information is preserved.

[0037] In its implementation, the third multi-layer attention mechanism processing stage uses the text feature vector set as the dominant query, calculating the attention distribution of text features on the context representations of visual enhancement and audiovisual enhancement. The input to this stage includes the context representations of visual and audiovisual enhancement, as well as the original text feature vector set. The text query vector is used in conjunction with the key and value vectors from the outputs of the first two stages to extract information from both the visual context enhanced by audio information and the audio context enhanced by visual information. Finally, this information is aggregated to generate a cross-modal fusion representation containing all three modalities of deep interaction information. The data flow and core operations of different processing stages can be summarized in a table, see Table 1.

[0038] Table 1: Data Flow in the Processing Stages of the Multi-Level Attention Mechanism

[0039] Each processing stage includes a residual connection and a layer normalization operation. The residual connection adds, element-wise, the original input or pre-projection representation of the query that dominates the current processing stage to the output of the cross-modal attention weighted summation. The layer normalization operation standardizes the summed result to stabilize the scale of intermediate features. The output of this operation, $O^{(i)}$, can be represented as:

$$O^{(i)} = \mathrm{LayerNorm}\left(X^{(i)} + \mathrm{MultiHeadCrossAttn}\left(Q^{(i)}, K^{(i)}, V^{(i)}\right)\right)$$

[0040] where $i$ indicates the processing stage, $X^{(i)}$ is the input of the residual connection, and $Q^{(i)}$, $K^{(i)}$, and $V^{(i)}$ represent the query, key, and value vectors for this stage, respectively. `MultiHeadCrossAttn` represents the multi-head cross-modal attention computation function, and `LayerNorm` represents the layer normalization function. Residual connections and layer normalization operations help alleviate the vanishing gradient problem in deep networks and make multi-stage training processes more stable.
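The three sequential stages, each with a residual connection followed by layer normalization, can be sketched as follows. This is a simplified illustration under stated assumptions: a single attention head, identity projections (the memory vectors act as both keys and values), no learnable parameters, and toy 4-dimensional features; the helper names `attend`, `layer_norm`, and `stage` are hypothetical.

```python
import math

def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, memory):
    """Single-head attention; memory vectors serve as both keys and values."""
    d = len(memory[0])
    out = []
    for q in queries:
        w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                     for k in memory])
        out.append([sum(wi * m[j] for wi, m in zip(w, memory)) for j in range(d)])
    return out

def layer_norm(vec, eps=1e-5):
    """Normalize one vector to zero mean and (approximately) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

def stage(queries, memory):
    """One processing stage: LayerNorm(query input + attention output)."""
    attended = attend(queries, memory)
    return [layer_norm([x + a for x, a in zip(q, att)])
            for q, att in zip(queries, attended)]

# Toy feature sets already mapped to a shared 4-dimensional space (assumed).
visual = [[0.2, 0.1, 0.7, 0.0]]
audio = [[0.5, 0.3, 0.1, 0.1], [0.0, 0.9, 0.0, 0.1]]
text = [[0.1, 0.1, 0.1, 0.7]]

visual_enh = stage(visual, audio + text)             # stage 1: visual query
audiovisual_enh = stage(audio, visual_enh + text)    # stage 2: audio query
fused = stage(text, visual_enh + audiovisual_enh)    # stage 3: text query
print(fused)
```

The list concatenations (`audio + text` and so on) stand in for the key and value sources of each stage, which makes the sequential dependency between stages explicit.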

[0041] In some embodiments, the multi-head cross-modal attention computation in each processing stage can include an independent number of attention heads and projection dimensions. For example, the first stage may have 4 attention heads, the second stage 6 attention heads, and the third stage 8 attention heads. This sequential, modality-driven, progressive fusion structure allows the model to establish intermodal alignments step-by-step and with focus, rather than processing all complex interactions of all modalities at once. Optionally, relative position encoding or modality type embedding can be introduced in the attention computation of each processing stage to help the model distinguish which processing stage and modality the information comes from. In some embodiments, the visually enhanced contextual representation generated in the first stage, in addition to being passed to the second stage, can also be directly passed to the third stage via a skip connection, serving as a source of keys and values along with the output of the second stage. This provides richer contextual information for the final fusion. It is understood that the three stages are sequentially dependent, with the output of the previous stage being the key input for the computation of the next stage. This design forces the model to establish an ordered, hierarchical cross-modal understanding path. Optionally, within each processing stage, in addition to the cross-modal attention layer, a feedforward neural network layer can be inserted. This layer performs a non-linear transformation on the output of the attention layer to enhance the expressive power of the model. The feedforward neural network layer is also followed by residual connections and layer normalization operations.

[0042] In one embodiment of the present invention, the fused semantic index is parsed to identify named entities, event elements, spatiotemporal markers, and attribute descriptions contained in the fused semantic index, forming a candidate entity and relationship list. The fused feature vectors contained in the fused semantic index are decoded using a pre-trained decoder neural network to reconstruct a structured description fragment. A named entity recognition model is run on the structured description fragment to identify personal names, place names, organization names, and domain-specific terms as candidate named entities. An event extraction model is run on the structured description fragment to identify actions, participants, time, and location elements, combining them into candidate event elements. A spatiotemporal expression parser is run on the structured description fragment to identify and standardize explicitly stated or implicit time points, time periods, geographical coordinates, and spatial range information as candidate spatiotemporal markers. A dependency parsing and semantic role labeling tool is run on the structured description fragment to extract phrases describing entity attributes or states as candidate attribute descriptions. All identified candidate named entities, candidate event elements, candidate spatiotemporal markers, and candidate attribute descriptions, along with their co-occurrence relationships, referential relationships, and syntactic dependencies, are organized into a structured list of candidate entities and relationships. Based on a predefined domain ontology model, this list of candidate entities and relationships is classified, disambiguated, and normalized to determine standardized archive entity nodes and standardized relationship labels. Based on these standardized relationship labels, directed relationship edges are established between the standardized archive entity nodes. 
The attributes of these relationship edges include relationship type, confidence level, and the relationship generation basis extracted from the fused semantic index. Each standardized archive entity node is supplemented with its timestamp, spatial location information, and source archive identifier within the multimedia archive, forming a dynamic knowledge graph with spatiotemporal and source context. This dynamic knowledge graph is stored in a graph database, and each node and edge is assigned a globally unique identifier.

[0043] The historical snapshot sequence of each archival entity node and its associated relation edges over time is extracted from the dynamic knowledge graph. This historical snapshot sequence is obtained by sampling the dynamic knowledge graph at consecutive time points. The historical snapshot sequence is input into an archival lifecycle state tracking model, which is a temporal prediction model based on a graph neural network. In this model, a graph embedding layer encodes the graph structure in each historical snapshot into a low-dimensional, dense graph state vector. The graph embedding layer uses a graph neural network to process the node and edge information of each snapshot. The graph state vectors at consecutive time steps are input into a temporal modeling layer, which uses a recurrent neural network or a transformer model to capture the evolution patterns and trends of the graph state vector sequence. At the end of the temporal modeling layer, the model outputs a prediction of the state vector of each archival entity node at a future preset time point, as its real-time state vector, and outputs the probability distribution of the archival entity node's transition from its current state to other states, as the state transition probability.

[0044] In practice, the fused semantic index is parsed to identify the named entities, event elements, spatiotemporal markers, and attribute descriptions contained within it. The fused feature vector contained in the fused semantic index is a high-dimensional dense vector, which needs to be decoded by a pre-trained decoder network. The decoder network is usually a multilayer perceptron or deconvolutional network, which maps the fused feature vector back to a structured text or symbol sequence. This sequence is the reconstructed structured description fragment, which is a compact and coherent text summary or semantic representation of the original multimedia file content. A named entity recognition model is run on the structured description fragment. The named entity recognition model adopts a sequence labeling architecture that combines a Bidirectional Encoder Representations from Transformers (BERT) model with a conditional random field. This model can identify names of people, places, organizations, and domain-specific terms in the fragment. All identified entities are collected and labeled with their types as candidate named entities. An event extraction model is run on the structured description fragment. The event extraction model is built on a pre-trained language model and identifies the action trigger words, participants, and time and location elements of each event in the fragment through sequence labeling or reading comprehension. These identified elements are combined according to predefined event patterns to form structured candidate event elements.

[0045] A spatiotemporal expression parser is run on the structured description fragment. This parser incorporates regular expression rules and a time and space common sense library, enabling it to identify and standardize explicit or implicit time points, time periods, geographic coordinates, and spatial range information within the fragment. For example, "Autumn 2023" is parsed into a standardized time interval "September 1, 2023 to November 30, 2023," and "Chengdong Development Zone" is parsed into a geographic coordinate range or associated with a standardized geographic region identifier. The parsing results serve as candidate spatiotemporal markers. A dependency parsing and semantic role labeling tool is then run on the structured description fragment. This tool performs syntactic parsing to obtain the modification and dependency relationships between words and labels the semantic roles of core predicates, such as agent, patient, time, and location. From this, phrases describing entity attributes or states are extracted, such as "wearing a red coat" as a candidate attribute description for entity "Person A," and "running quickly" as a candidate state description for entity "Person A." All identified candidate named entities, candidate event elements, candidate spatiotemporal markers, and candidate attribute descriptions, along with their co-occurrence relationships, referential relationships, and syntactic dependencies, are organized into a structured list of candidate entities and relationships, as shown in Table 2. The list can be organized in tabular form, with each row recording detailed information about an entity or relationship.
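The normalization of a seasonal expression into a standardized time interval can be sketched with a small rule-based parser. Only the autumn boundaries (September 1 to November 30) come from the example above; the spring and summer boundaries, the regular expression, and the function name `parse_season` are assumptions made for illustration, and winter is omitted because it spans two calendar years.

```python
import re
from datetime import date

# Season boundaries as (month, day) pairs. Autumn follows the example in the
# text; spring and summer are assumed by analogy.
SEASONS = {
    "spring": ((3, 1), (5, 31)),
    "summer": ((6, 1), (8, 31)),
    "autumn": ((9, 1), (11, 30)),
}

PATTERN = re.compile(r"\b(spring|summer|autumn)\s+(\d{4})\b", re.IGNORECASE)

def parse_season(expression):
    """Return (start_date, end_date) for a 'Season YYYY' phrase, else None."""
    match = PATTERN.search(expression)
    if match is None:
        return None
    (start_month, start_day), (end_month, end_day) = SEASONS[match.group(1).lower()]
    year = int(match.group(2))
    return date(year, start_month, start_day), date(year, end_month, end_day)

start, end = parse_season("Filmed in Autumn 2023 near the river")
print(start.isoformat(), end.isoformat())
```

A fuller parser would combine many such rules with a common sense library for implicit expressions, as the paragraph above describes.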

[0046] Table 2: List of Candidate Entities and Relationships

[0047] Based on a predefined domain ontology model, candidate entities and relation lists are classified, disambiguated, and normalized. The domain ontology model defines standard entity types, relation types, and attribute constraints within a specific domain (e.g., government affairs, healthcare, film and television). Classification maps candidate entities and relations to categories defined in the ontology model; disambiguation resolves the issue of entities with the same name referring to different objects; and normalization unifies inconsistent entity names into standard names, for example, normalizing "Zhang Laosan" and "Sange" to "Zhang San." This process determines standardized archival entity nodes and standardized relation labels. Based on these standardized relation labels, directed relation edges are established between standardized archival entity nodes. The attributes of these relation edges include relation type, confidence level, and relation generation basis extracted from the fused semantic index. The confidence level comes from the predicted probability of upstream tools such as the named entity recognition model and event extraction model, and the relation generation basis can be linked back to the original text in the original structured description fragment. For each standardized archival entity node, its occurrence timestamp, spatial location information, and source archive identifier within the multimedia archive are supplemented. The timestamp is obtained from candidate spatiotemporal markers, the spatial location information is obtained from geocoding results, and the source archive identifier records which specific multimedia archive file the entity originates from, thus forming a dynamic knowledge graph with spatiotemporal and source context. This dynamic knowledge graph is stored in the form of a graph database, with each node and edge assigned a globally unique identifier. 
The graph database uses a native graph database, whose storage structure directly reflects the graph model of nodes, edges, and their attributes, supporting efficient graph traversal and relation lookup.
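The normalization and edge-building steps can be illustrated with an in-memory sketch. It is not a graph database: the alias table, the class names `EntityNode`, `RelationEdge`, and `KnowledgeGraph`, the relation label `located_in`, and the sample values are all hypothetical, and a repeated entity simply keeps the context of its first occurrence.

```python
import uuid
from dataclasses import dataclass, field

# Hypothetical alias table standing in for ontology-driven normalization.
ALIASES = {"Zhang Laosan": "Zhang San", "Sange": "Zhang San"}

def new_id():
    """Generate a globally unique identifier for a node or an edge."""
    return str(uuid.uuid4())

@dataclass
class EntityNode:
    name: str
    entity_type: str
    timestamp: str
    location: str
    source_archive: str
    uid: str = field(default_factory=new_id)

@dataclass
class RelationEdge:
    source_uid: str
    target_uid: str
    relation_type: str
    confidence: float
    basis: str
    uid: str = field(default_factory=new_id)

class KnowledgeGraph:
    def __init__(self):
        self.nodes = {}   # standardized name -> EntityNode
        self.edges = []   # directed RelationEdge objects

    def add_entity(self, raw_name, entity_type, timestamp, location, source_archive):
        """Normalize the name and reuse the node if it already exists."""
        name = ALIASES.get(raw_name, raw_name)
        if name not in self.nodes:
            self.nodes[name] = EntityNode(name, entity_type, timestamp,
                                          location, source_archive)
        return self.nodes[name]

    def add_relation(self, source, target, relation_type, confidence, basis):
        """Create a directed edge from source to target."""
        edge = RelationEdge(source.uid, target.uid, relation_type, confidence, basis)
        self.edges.append(edge)
        return edge

graph = KnowledgeGraph()
a = graph.add_entity("Zhang Laosan", "Person", "2023-09-01",
                     "Chengdong Development Zone", "archive-001")
b = graph.add_entity("Sange", "Person", "2023-09-02",
                     "Chengdong Development Zone", "archive-002")
site = graph.add_entity("Chengdong Development Zone", "Place", "2023-09-01",
                        "Chengdong Development Zone", "archive-001")
edge = graph.add_relation(a, site, "located_in", 0.92,
                          "Zhang Laosan appeared in Chengdong Development Zone")
print(len(graph.nodes), len(graph.edges))
```

Both aliases resolve to the same node, so the graph holds two nodes and one edge; a native graph database would persist the same node, edge, and attribute model.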

[0048] In some embodiments, the archive lifecycle state tracking model is constructed based on a temporal prediction model using a graph neural network, employing an encoder-predictor architecture. Historical snapshot sequences of the changes in each archive entity node and its associated relation edges over time are extracted from the dynamic knowledge graph. This extraction is performed by querying the graph state in the graph database at fixed time intervals (e.g., daily, weekly). Each snapshot is a subgraph of the knowledge graph at a specific time point, containing all active entity nodes, relation edges, and their attributes at that time. The historical snapshot sequences are input into the archive lifecycle state tracking model. In this model, the graph structure in each historical snapshot is encoded into a low-dimensional, dense graph state vector through a graph embedding layer. The graph embedding layer can employ a graph convolutional network, a graph attention network, or a graph transformer, and its operation can be formalized as follows:

$$h_t = \mathrm{GNN}(X_t, A_t)$$

[0049] where $X_t$ is the feature matrix of all nodes in the knowledge graph at time step $t$, $A_t$ is the adjacency matrix at time step $t$, $\mathrm{GNN}(\cdot)$ represents the graph neural network function, and $h_t$ is the encoded graph state vector. The graph state vectors at consecutive time steps are input into the temporal modeling layer. This layer can employ recurrent neural networks, long short-term memory networks, gated recurrent units, or Transformer encoders to capture the evolution patterns and trends inherent in the graph state vector sequence. At the end of the temporal modeling layer, the model outputs a prediction of the state vector for each archive entity node at a future preset time point, serving as its real-time state vector. It also outputs the probability distribution of the archive entity node's transition from its current state to other predefined states (such as "active," "archived," or "destroyed"), serving as the state transition probability. The real-time state vector is a multi-dimensional vector, and its different dimensions can be interpreted as quantitative indicators such as node popularity, integrity, and association strength.

[0050] Optionally, the encoding of the graph state vector can include not only structural information but also node attributes, relationship types, and timestamp information as part of the initial node features. It can be understood that by combining the modeling ability of graph neural networks for spatial (graph structure) dependencies with the modeling ability of sequence models for temporal dependencies, the archive lifecycle state tracking model can learn patterns from the evolutionary history of the dynamic knowledge graph and predict the future state of nodes. In some embodiments, the loss function used for model training can combine the mean squared error of state vector prediction and the cross-entropy loss of state transition probability prediction to simultaneously optimize regression and classification tasks. Optionally, for newly emerging archive entity nodes lacking historical snapshots, an inductive graph neural network method based on their initial attributes and the states of their network neighbors can be used to generate their initial state vector estimates.
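The combined training objective mentioned above can be sketched as a weighted sum of a regression term and a classification term. The weighting factor `weight`, the function names, and the sample values are assumptions; the text does not state how the two terms are balanced.

```python
import math

def mse(predicted, target):
    """Mean squared error between two state vectors of equal length."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)

def cross_entropy(probabilities, true_index, eps=1e-12):
    """Negative log-likelihood of the true state under the predicted distribution."""
    return -math.log(probabilities[true_index] + eps)

def combined_loss(state_pred, state_true, transition_probs, true_state, weight=1.0):
    """Regression loss plus weighted classification loss (weight is assumed)."""
    return mse(state_pred, state_true) + weight * cross_entropy(transition_probs, true_state)

# States indexed as 0 = "active", 1 = "archived", 2 = "destroyed".
good = combined_loss([0.8, 0.9], [0.8, 0.9], [0.9, 0.05, 0.05], true_state=0)
bad = combined_loss([0.1, 0.2], [0.8, 0.9], [0.1, 0.45, 0.45], true_state=0)
print(round(good, 4), round(bad, 4))
```

A prediction that matches both the state vector and the true transition yields a smaller loss than one that misses both, which is the behavior the joint objective is meant to encourage.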

[0051] In one embodiment of the present invention, key state dimension values of archive entity nodes are parsed from the real-time state vector. These key state dimension values include popularity, integrity metrics, relevance, and risk scores. The parsing operation is implemented by reading specific dimension values from the state vector. The key state dimension values and the state transition probabilities are input into a strategy matching engine. The strategy matching engine traverses a predefined management strategy rule base. Each rule in the management strategy rule base consists of a state condition, a transition probability condition, and a sequence of operations to be executed. When the key state dimension value satisfies the state condition of a rule, and the state transition probability satisfies the transition probability condition of the rule, the rule is activated. The strategy matching engine integrates all the sequences of operations to be executed corresponding to the activated rules into an ordered, conflict-free intelligent management operation sequence according to a preset conflict resolution and priority sorting logic. The operations suggested by different activated rules are compared to identify mutually exclusive or contradictory operation pairs in terms of operation object, operation type, or operation parameters. For mutually exclusive or contradictory operation pairs, a decision is made based on the preset priority values in their source rules, retaining the operation corresponding to the higher priority rule and removing or modifying the operation corresponding to the lower priority rule. For operations with the same target and compatible operation types, their operation parameters are merged or optimized. All operations that have undergone conflict resolution are then sorted chronologically according to the urgency of the affected archive entities and the logical dependencies between operations. The sorted list of operations is then encapsulated into an intelligent management operation sequence that can be scheduled and executed by the archive management system.

[0052] In practice, key state dimension values of archive entity nodes are parsed from the real-time state vector output by the archive lifecycle state tracking model. The real-time state vector is a multi-dimensional numerical vector, with each dimension corresponding to a predefined state indicator. The parsing operation is achieved by reading the values of specific dimensions within the vector. Key state dimension values include popularity, integrity metric, relevance, and risk score. Popularity reflects the quantified frequency with which an archive entity is accessed, cited, or associated over a period of time. The integrity metric assesses the completeness of the information related to the archive entity. Relevance characterizes how tightly the archive entity is connected to other entities in the dynamic knowledge graph. The risk score is a quantified value of potential risk calculated from factors such as the sensitivity of the entity's content, its preservation status, and compliance. The parsed key state dimension values, along with the state transition probabilities output by the archive lifecycle state tracking model, are input into the strategy matching engine. The state transition probability is a probability distribution vector representing the probability that an archive entity node will migrate from its current state to each other predefined state.
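Reading key state dimension values out of the real-time state vector can be sketched as a fixed index lookup. The index layout in `STATE_DIMENSIONS` and the `KeyState` type are hypothetical; the text does not specify which vector position holds which indicator.

```python
from dataclasses import dataclass

# Hypothetical layout: which position of the state vector holds which indicator.
STATE_DIMENSIONS = {"popularity": 0, "integrity": 1, "relevance": 2, "risk_score": 3}

@dataclass(frozen=True)
class KeyState:
    popularity: float
    integrity: float
    relevance: float
    risk_score: float

def parse_key_state(state_vector):
    """Read the predefined dimensions; any further dimensions are ignored."""
    return KeyState(**{name: state_vector[index]
                       for name, index in STATE_DIMENSIONS.items()})

key_state = parse_key_state([0.8, 0.9, 0.4, 0.1, 0.33])
```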

[0053] In practice, the strategy matching engine traverses a predefined management strategy rule base, which is a structured database or rule file. Each rule consists of a state condition, a transition probability condition, and a sequence of operations to be executed. The state condition is a logical expression over one or more key state dimension values, such as "popularity > 0.7 and risk score < 0.3". The transition probability condition is a logical expression over the state transition probability, such as "probability of migrating to the 'pending destruction' state > 0.6". The sequence of operations to be executed is an ordered list of atomic operations, each defining a specific management action for the archive entity, such as "create backup copy", "add access permissions", "initiate security level review", and "migrate to low-speed storage". When the key state dimension values satisfy the state condition of a rule, and the state transition probability satisfies the transition probability condition of the same rule, the rule is activated by the strategy matching engine. The strategy matching engine integrates all the sequences of operations to be executed corresponding to the activated rules into an ordered, conflict-free intelligent management operation sequence according to a preset conflict resolution and priority sorting logic. The integration process must handle potential conflicts and overlaps between the operations suggested by different rules.
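The rule structure and activation test described above can be sketched as follows. The `Rule` type, the rule names, and the pairing of each example condition with an always-true counterpart are assumptions made for illustration; the thresholds and operation names are the examples given in the text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Rule:
    name: str
    state_condition: Callable[[Dict[str, float]], bool]
    transition_condition: Callable[[Dict[str, float]], bool]
    operations: List[str]

def match_rules(rules, key_state, transition_probs):
    """A rule is activated only when both of its conditions hold."""
    return [rule for rule in rules
            if rule.state_condition(key_state)
            and rule.transition_condition(transition_probs)]

rules = [
    Rule("hot_low_risk",
         lambda s: s["popularity"] > 0.7 and s["risk_score"] < 0.3,
         lambda p: True,  # hypothetical: no transition constraint
         ["create backup copy"]),
    Rule("likely_destruction",
         lambda s: True,  # hypothetical: no state constraint
         lambda p: p.get("pending destruction", 0.0) > 0.6,
         ["initiate security level review"]),
]

activated = match_rules(
    rules,
    {"popularity": 0.8, "risk_score": 0.1},
    {"active": 0.8, "pending destruction": 0.2},
)
```

Conditions are modeled as callables here for brevity; a rule base stored as a database or rule file would instead need a small expression parser or a declarative condition format.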

[0054] In some embodiments, the operations suggested by different activated rules are compared to identify mutually exclusive or contradictory operation pairs in terms of operation object, operation type, or operation parameters. Mutually exclusive operation objects refer to two operations that target the same archive entity but have opposite goals; for example, operation A is "add access permissions," and operation B is "restrict all access." Contradictory operation types may manifest as exclusive occupation of the same system resources; for example, operation C is "start full encryption," and operation D is "start content analysis," both of which may consume significant computing resources. Contradictory operation parameters manifest as different settings for the same attribute; for example, operation E sets the number of storage copies to 3, and operation F sets the number of storage copies to 1. For mutually exclusive or contradictory operation pairs, a decision is made based on the preset priority values in their source rules. Each rule in the management strategy rule base has a numeric priority field; a higher value indicates a higher priority. The decision logic is to retain the operation corresponding to the higher-priority rule and to remove or modify the operation corresponding to the lower-priority rule. Removing an operation means directly discarding the conflicting operation suggested by the lower-priority rule. Modifying an operation may involve adjusting its parameters to be compatible with the higher-priority operation, for example by adopting the stricter parameter value.
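Priority-based conflict resolution of the "remove" kind can be sketched as below. The `Operation` type, the `MUTUALLY_EXCLUSIVE` table, and the target identifier `arc-1` are hypothetical; only the operation names and the rule that a higher priority value wins come from the text. The sketch covers mutually exclusive action pairs on the same target and does not attempt the "modify" option.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Operation:
    target: str    # identifier of the affected archive entity
    action: str
    priority: int  # inherited from the source rule; higher wins

# Hypothetical table of action pairs that cannot both apply to one target.
MUTUALLY_EXCLUSIVE = {
    frozenset({"add access permissions", "restrict all access"}),
}

def conflicts(a, b):
    return (a.target == b.target
            and frozenset({a.action, b.action}) in MUTUALLY_EXCLUSIVE)

def resolve_conflicts(operations):
    """Visit operations from highest to lowest priority and keep each one
    unless it conflicts with an operation that was already kept."""
    kept = []
    for op in sorted(operations, key=lambda o: -o.priority):
        if not any(conflicts(op, other) for other in kept):
            kept.append(op)
    return kept

resolved = resolve_conflicts([
    Operation("arc-1", "add access permissions", priority=5),
    Operation("arc-1", "restrict all access", priority=9),
    Operation("arc-1", "create backup copy", priority=1),
])
```

Because `sorted` is stable, operations with equal priority are considered in their input order, so a tie is resolved in favor of the operation listed first.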

[0055] In some embodiments, for operations with the same target and compatible operation types, their operation parameters are merged or optimized. Compatible operations are operations that can be executed sequentially or in parallel without conflict. For example, operation G "Generate Text Summary" and operation H "Generate Keyframe Preview" can be merged into a group of tasks to be executed in parallel. When multiple operations set the same parameter, optimizing the value may mean taking the maximum value, the minimum value, or a priority value defined according to business logic. For example, for the "Number of Copies to Save" parameter, if one rule suggests 2 and another suggests 5, the maximum value of 5 can be taken to ensure data security. All operations that have undergone conflict resolution are sorted chronologically according to the urgency of the archive entities they affect and the logical dependencies between operations. The urgency can be calculated from the state conditions in the triggering rules; for example, the higher the risk score or the greater the probability of migrating to a high-risk state, the higher the urgency. A logical dependency means that an operation can only be executed after another operation has completed. For example, the "Data Decryption" operation can only be executed after the "Integrity Verification" operation has completed successfully. The sorting logic can be formalized as a topological sorting problem on a directed acyclic graph, where nodes represent operations and edges represent dependencies. The sorted list of operations is encapsulated into an intelligent management operation sequence that can be scheduled and executed by the archive management system. The encapsulation format can be a JSON or XML document that specifies, for each operation in the sequence, the execution order, target entity, operation type, specific parameters, and expected preconditions and postconditions.
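The dependency-aware ordering and the JSON encapsulation can be sketched with Kahn's algorithm for topological sorting, using a heap so that the most urgent ready operation is scheduled first. The function names, the urgency values, and the JSON field names are assumptions; the sketch also omits the parameters, preconditions, and postconditions that the text says the document can carry.

```python
import heapq
import json

def order_operations(urgency, dependencies):
    """Topologically sort operations (Kahn's algorithm). Among operations
    whose prerequisites are all done, the one with higher urgency goes first.

    urgency:      dict mapping operation name -> urgency (higher = sooner)
    dependencies: list of (prerequisite, dependent) name pairs
    """
    indegree = {name: 0 for name in urgency}
    successors = {name: [] for name in urgency}
    for before, after in dependencies:
        successors[before].append(after)
        indegree[after] += 1
    ready = [(-urgency[name], name) for name in urgency if indegree[name] == 0]
    heapq.heapify(ready)
    ordered = []
    while ready:
        _, name = heapq.heappop(ready)
        ordered.append(name)
        for nxt in successors[name]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                heapq.heappush(ready, (-urgency[nxt], nxt))
    if len(ordered) != len(urgency):
        raise ValueError("dependency cycle detected")
    return ordered

def encapsulate(ordered, target_entity):
    """Serialize the ordered operations as a JSON document (hypothetical schema)."""
    return json.dumps({
        "target_entity": target_entity,
        "sequence": [{"order": i + 1, "operation": name}
                     for i, name in enumerate(ordered)],
    }, indent=2)

ordered = order_operations(
    {"data decryption": 0.9, "integrity verification": 0.2,
     "generate text summary": 0.5},
    [("integrity verification", "data decryption")],
)
document = encapsulate(ordered, "arc-1")
```

Note that "data decryption" runs last despite having the highest urgency, because its prerequisite "integrity verification" must complete first; urgency only breaks ties among operations that are ready at the same time.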

[0056] The above are merely preferred embodiments of the present invention and are not intended to limit the present invention in any other way. Any person skilled in the art may make changes or modifications to the above-disclosed technical content to create equivalent embodiments that can be applied to other fields. However, any simple modifications, equivalent changes, and modifications made to the above embodiments based on the technical essence of the present invention without departing from the scope of the present invention shall still fall within the protection scope of the present invention.

Claims

1. An artificial intelligence-based intelligent management method for multimedia archives, characterized in that it comprises: receiving raw input data of a multimedia archive, performing multi-dimensional feature analysis on the raw input data, and generating a primary structured description of the multimedia archive, wherein the multi-dimensional feature analysis includes visual element recognition, audio content transcription, and text semantic extraction; invoking an improved cross-modal association algorithm to process the primary structured description and generate a fused semantic index of the multimedia archive, wherein the improved cross-modal association algorithm constructs the interaction and alignment between visual, audio, and text features based on a multi-layer attention mechanism; constructing a dynamic knowledge graph of the multimedia archive based on the fused semantic index, wherein the dynamic knowledge graph uses archive entities as nodes and the semantic, spatiotemporal, and logical relationships between entities as edges; performing state evolution analysis on the dynamic knowledge graph using an archive lifecycle state tracking model to generate real-time state vectors and state transition probabilities of the archive; and generating an intelligent management operation sequence for a specific multimedia archive by combining the real-time state vector, the state transition probability, and a predefined management strategy rule base.

2. The intelligent multimedia archive management method based on artificial intelligence according to claim 1, characterized in that receiving raw input data of a multimedia archive, performing multi-dimensional feature analysis on the raw input data, and generating a primary structured description of the multimedia archive includes: for raw input data of image or video type, extracting multi-scale visual features with a pre-trained deep convolutional neural network, and identifying visual elements and their spatial relationships through object detection and scene recognition to generate visual element recognition results; for raw input data of audio type, converting its content into text with a speech recognition engine, and identifying audio tracks, speakers, background sounds, and emotional tendencies through acoustic feature analysis to generate audio content transcription and acoustic analysis results; for raw input data of text type, performing word segmentation, part-of-speech tagging, named entity recognition, and dependency parsing with natural language processing techniques to extract key entities, themes, and sentiments and generate text semantic extraction results; and aligning and associating the visual element recognition results, the audio content transcription and acoustic analysis results, and the text semantic extraction results of the same multimedia archive according to timestamps or logical chapters, and packaging them to generate a primary structured description file of the multimedia archive.

3. The intelligent multimedia archive management method based on artificial intelligence according to claim 1, characterized in that invoking the improved cross-modal association algorithm to process the primary structured description and generate a fused semantic index of the multimedia archive includes: separating the visual feature vector set, the audio feature vector set, and the text feature vector set from the primary structured description; inputting the visual feature vector set, the audio feature vector set, and the text feature vector set simultaneously into the multi-head cross-modal attention layer of the improved cross-modal association algorithm; in the multi-head cross-modal attention layer, for the feature vector set of each modality, calculating the attention weight distribution with that set as the query vectors and the feature vector sets of all other modalities as the key vectors and value vectors; based on the attention weight distribution, weighting, fusing, and aggregating the feature vector sets from the different modalities to generate a set of intermediate feature representations containing cross-modal interaction information; and subjecting the intermediate feature representations to dimensionality reduction and normalization to form a fused feature vector with a unified dimension, and binding the fused feature vector to the original multimedia archive identifier as the fused semantic index.

4. The intelligent multimedia archive management method based on artificial intelligence according to claim 3, characterized in that the improved cross-modal association algorithm constructing the interaction and alignment between visual, audio, and text features based on a multi-layer attention mechanism includes: the improved cross-modal association algorithm comprises at least three sequential multi-layer attention mechanism processing stages; in the first processing stage, the visual feature vector set is used as the dominant query to calculate the attention distribution of the visual features over the audio features and the text features, and the relevant information is aggregated to generate a visually enhanced contextual representation; in the second processing stage, the audio feature vector set is used as the dominant query to calculate the attention distribution of the audio features over the visually enhanced contextual representation and the text features, and the relevant information is aggregated to generate an audiovisually enhanced contextual representation; in the third processing stage, the text feature vector set is used as the dominant query to calculate the attention distribution of the text features over the visually enhanced contextual representation and the audiovisually enhanced contextual representation, and the relevant information is aggregated to generate the final cross-modal fusion representation; and each processing stage includes a residual connection and a layer normalization operation to ensure the stability of the information flow and the convergence of the training process.

5. The intelligent multimedia archive management method based on artificial intelligence according to claim 1, characterized in that, Based on the fused semantic index, a dynamic knowledge graph of the multimedia archive is constructed, including: The fused semantic index is parsed to identify named entities, event elements, spatiotemporal markers, and attribute descriptions contained in the fused semantic index, forming a list of candidate entities and relationships; Based on a predefined domain ontology model, the candidate entities and relation list are classified, disambiguated, and normalized to determine standardized archive entity nodes and standardized relation labels. Based on the standardized relation tags, directed relation edges are established between the standardized archive entity nodes. The attributes of the relation edges include relation type, confidence level, and relation generation basis extracted from the fused semantic index. Each standardized archival entity node is supplemented with its timestamp, spatial location information, and source archive identifier in the multimedia archive, forming a dynamic knowledge graph with spatiotemporal and source context; The dynamic knowledge graph is stored in the form of a graph database, and a globally unique identifier is assigned to each node and edge.

6. The intelligent multimedia archive management method based on artificial intelligence according to claim 5, characterized in that, The step of using the archive lifecycle state tracking model to perform state evolution analysis on the dynamic knowledge graph and generate real-time state vectors and state transition probabilities for the archives includes: Extract the historical snapshot sequence of each archive entity node and its associated relation edges over time from the dynamic knowledge graph; The historical snapshot sequence is input into the archive lifecycle status tracking model, which is a time-series prediction model based on graph neural networks; In the archive lifecycle state tracking model, the graph structure in each historical snapshot is encoded into a low-dimensional dense graph state vector through a graph embedding layer; The graph state vectors of continuous time steps are input into the temporal modeling layer to capture the evolution patterns and trends of the graph structure over time. At the end of the temporal modeling layer, the model outputs a prediction of the state vector of each archive entity node at a future preset time point, which serves as its real-time state vector. It also outputs the probability distribution of the archive entity node's transition from the current state to other states, which serves as the state transition probability.

7. The intelligent multimedia archive management method based on artificial intelligence according to claim 6, characterized in that generating an intelligent management operation sequence for a specific multimedia archive by combining the real-time state vector, the state transition probability, and the predefined management strategy rule base includes: parsing the key state dimension values of the archive entity nodes from the real-time state vector, wherein the key state dimension values include popularity, integrity metric, relevance, and risk score; inputting the key state dimension values and the state transition probabilities into the strategy matching engine; traversing, by the strategy matching engine, the predefined management strategy rule base, wherein each rule consists of a state condition, a transition probability condition, and a sequence of operations to be executed; activating a rule when the key state dimension values satisfy the state condition of the rule and the state transition probability satisfies the transition probability condition of the rule; and integrating, by the strategy matching engine, all the operation sequences to be executed corresponding to the activated rules into an ordered, conflict-free intelligent management operation sequence according to the preset conflict resolution and priority sorting logic.

8. The intelligent multimedia archive management method based on artificial intelligence according to claim 7, characterized in that the strategy matching engine integrating all the operation sequences to be executed corresponding to the activated rules into an ordered, conflict-free intelligent management operation sequence according to a preset conflict resolution and priority sorting logic includes: comparing the operations suggested by different activated rules and identifying mutually exclusive or contradictory operation pairs in terms of operation object, operation type, or operation parameters; for mutually exclusive or contradictory operation pairs, making a decision based on the priority values preset in their source rules, retaining the operation corresponding to the higher-priority rule and removing or modifying the operation corresponding to the lower-priority rule; for operations that have the same target and compatible operation types, merging their operation parameters or taking the optimal value; sorting all operations that have undergone conflict resolution chronologically according to the urgency of the archive entities they affect and the logical dependencies between the operations; and encapsulating the sorted list of operations into an intelligent management operation sequence that can be scheduled and executed by the archive management system.

9. The intelligent multimedia archive management method based on artificial intelligence according to claim 3, characterized in that, In the multi-head cross-modal attention layer, for each modality's feature vector set, the attention weight distribution is calculated as follows: The feature vector set of each modality is used as the query vector, and the feature vector sets of all other modalities are used as the key and value vectors. Extract the corresponding visual query vector, audio query vector, and text query vector from the visual feature vector set, audio feature vector set, and text feature vector set, respectively; The visual query vector is calculated together with the audio key vector and audio value vector from the audio feature vector set and the text key vector and text value vector from the text feature vector set to generate the attention weight distribution of the visual modality on the audio modality and the text modality. The audio query vector is calculated together with the visual key vector and visual value vector from the visual feature vector set and the text key vector and text value vector from the text feature vector set to generate the attention weight distribution of the audio modality on the visual modality and the text modality. The text query vector is calculated together with the visual key vector and visual value vector from the visual feature vector set and the audio key vector and audio value vector from the audio feature vector set to generate the attention weight distribution of the text modality on the visual modality and the audio modality. The attention weight distribution calculated for the feature vector set of each modality is used to guide the weighted fusion of information from the feature vector sets of other modalities.

10. The intelligent multimedia archive management method based on artificial intelligence according to claim 5, characterized in that, The fused semantic index is parsed to identify named entities, event elements, spatiotemporal markers, and attribute descriptions contained within it, forming a candidate entity and relationship list, including: The fusion feature vector contained in the fusion semantic index is decoded to restore the structured description fragment; Run a named entity recognition model on the structured description fragment to identify personal names, place names, organization names, and domain-specific terms in the fragment as candidate named entities; An event extraction model is run on the structured description fragment to identify the action, participant, time, and location elements in the fragment and combine them into candidate event elements. A spatiotemporal expression parser is run on the structured description fragment to identify and standardize the explicit or implicit time points, time periods, geographic coordinates, and spatial range information in the fragment as candidate spatiotemporal markers; The dependency parsing and semantic role labeling tool is run on the structured description fragment to extract phrases describing entity attributes or states as candidate attribute descriptions; All identified candidate named entities, candidate event elements, candidate spatiotemporal markers, and candidate attribute descriptions, along with their co-occurrence relationships, referential relationships, and syntactic dependencies, are organized into a structured list of candidate entities and relationships.