Method and apparatus for searching multimedia files
By generating structured descriptions of multimedia files and using multi-dimensional feature fusion, the problem of low accuracy in searching multimedia files in electronic devices is solved, enabling fast and accurate file retrieval and improving operational efficiency.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- VIVO MOBILE COMM CO LTD
- Filing Date
- 2026-03-09
- Publication Date
- 2026-06-19
Abstract
Description
Technical Field
[0001] This application belongs to the field of multimedia file processing technology and specifically relates to a multimedia file search method and apparatus.
Background Technology
[0002] As users take photos and record videos more frequently, the number of multimedia files on electronic devices keeps growing, making it increasingly difficult for users to find the multimedia files they need. Current electronic devices have relatively low accuracy in searching for multimedia files, requiring users to manually open each file one by one for confirmation, a cumbersome and inefficient process.
Summary of the Invention
[0003] The purpose of this application is to provide a multimedia file search method and apparatus that can quickly and accurately find the multimedia files needed by the user.
[0004] In a first aspect, embodiments of this application provide a multimedia file retrieval method, including: receiving a first input; and, in response to the first input, outputting a target multimedia file. The target multimedia file is obtained by matching the retrieval element corresponding to the first input with the structured description information in a multimedia file library. The multimedia file library includes multiple multimedia files and the structured description information corresponding to each multimedia file. The structured description information is determined based on the multidimensional features of the multimedia file. The multidimensional features include at least two of the following: voice user features, environmental features, and semantic features.
[0005] In a second aspect, embodiments of this application provide a multimedia file retrieval device, including: a receiving module, used to receive a first input; and a display module, used to output a target multimedia file in response to the first input. The target multimedia file is obtained by matching the retrieval element corresponding to the first input with the structured description information in a multimedia file library. The multimedia file library includes multiple multimedia files and the structured description information corresponding to each multimedia file. The structured description information is determined based on the multidimensional features of the multimedia file. The multidimensional features include at least two of the following: voice user features, environmental features, and semantic features.
[0006] In a third aspect, embodiments of this application provide an electronic device including a processor and a memory, the memory storing a program or instructions executable on the processor, the program or instructions, when executed by the processor, implementing the steps of the method as described in the first aspect.
[0007] In a fourth aspect, embodiments of this application provide a readable storage medium on which a program or instructions are stored, which, when executed by a processor, implement the steps of the method as described in the first aspect.
[0008] In a fifth aspect, embodiments of this application provide a chip, which includes a processor and a communication interface, the communication interface and the processor being coupled together, the processor being used to run programs or instructions to implement the steps of the method as described in the first aspect.
[0009] In a sixth aspect, embodiments of this application provide a computer program product stored in a storage medium, which, when executed by at least one processor, implements the steps of the method described in the first aspect.
[0010] The embodiments of this application pre-extract at least two of the following categories of features from each multimedia file in the multimedia file library: semantic features, environmental features, and voice user features. Through complementary fusion of multi-dimensional features, the structured description information corresponding to each multimedia file can be accurately extracted. After receiving the user's first input, the search element corresponding to the first input can be directly matched with the structured description information corresponding to each multimedia file, eliminating the need to perform repeated processing on each multimedia file. This improves the accuracy and efficiency of multimedia file retrieval.
Attached Figure Description
[0011] Figure 1 is a flowchart of a multimedia file search method provided in an embodiment of this application;
Figure 2 is a flowchart of another multimedia file search method provided in an embodiment of this application;
Figure 3 is a schematic diagram of the structure of a semantic encoder provided in an embodiment of this application;
Figure 4 is a flowchart of another multimedia file search method provided in an embodiment of this application;
Figure 5 is a schematic diagram of the structure of a voice user encoder provided in an embodiment of this application;
Figure 6 is a flowchart of another multimedia file search method provided in an embodiment of this application;
Figure 7 is a schematic diagram of the process of determining structured description information provided in an embodiment of this application;
Figure 8 is a schematic diagram of the decoding process of a decoder provided in an embodiment of this application;
Figure 9 is a schematic diagram of a multimedia file playback window provided in an embodiment of this application;
Figure 10 is a schematic diagram of the structure of a multimedia file search device provided in an embodiment of this application;
Figure 11 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application;
Figure 12 is a schematic diagram of the hardware structure of an electronic device provided in an embodiment of this application.
Detailed Implementation
[0012] The technical solutions of the embodiments of this application will be clearly described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by those skilled in the art based on the embodiments of this application are within the scope of protection of this application.
[0013] The terms "first," "second," etc., used in this application's specification are used to distinguish similar objects and not to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that embodiments of this application can be implemented in orders other than those illustrated or described herein, and the objects distinguished by "first," "second," etc., are generally of the same class, without limiting the number of objects; for example, a first object can be one or more. Furthermore, in the specification, "and / or" indicates at least one of the connected objects, and the character " / " generally indicates that the preceding and following objects have an "or" relationship.
[0014] Currently, most multimedia file search functions in electronic devices suffer from limited modal perception and shallow semantic understanding. In particular, when faced with rich environmental sounds such as ocean waves and traffic noise, non-semantic acoustic features such as tone and melody, or abstract time scene descriptions, it is difficult to retrieve multimedia files that meet the user's needs.
[0015] For example, when a user enters a complex natural language command in the photo album search bar, such as "a clip of singing 'Happy Birthday' at last year's birthday party," the existing system can only perform simple optical character recognition (OCR) or automatic speech recognition (ASR) for matching. It cannot associate the visual scene of "birthday party" with the content of "singing 'Happy Birthday,'" resulting in low accuracy of search results from electronic devices. Consequently, users have to click on each audio and video clip individually to confirm, which is cumbersome.
[0016] As another example, a user might want to find videos based on sound features, such as searching for "videos with running water sounds" or "dance videos with only music." However, because the videos lack dialogue, the automatic speech recognition (ASR) results are empty. The electronic device returns no results, forcing the user to manually click on each video thumbnail to confirm, a cumbersome process.
[0017] This results in users having to spend a lot of time to obtain the multimedia files they want, which is inefficient.
[0018] Therefore, this application provides a multimedia file search method and apparatus that can quickly and accurately find the multimedia files needed by the user.
[0019] The multimedia file retrieval method and apparatus provided in this application will be described in detail below with reference to the accompanying drawings and through specific embodiments and application scenarios.
[0020] The multimedia file retrieval method provided in this application can be applied to scenarios such as personal media resource management, film and video production, and conference and interview analysis, which require selecting the multimedia files needed by the user from a large number of multimedia files. The method provided in this application can quickly and accurately select the multimedia files needed by the user from a large number of multimedia files.
[0021] Figure 1 is a flowchart of a multimedia file retrieval method provided in an embodiment of this application, which may include the following steps S110-S120.
[0022] S110: The electronic device receives a first input.
[0023] S120: The electronic device, in response to the first input, outputs a target multimedia file. The target multimedia file is obtained by matching the search element corresponding to the first input with the structured description information in the multimedia file library. The multimedia file library includes multiple multimedia files and the structured description information corresponding to each multimedia file. The structured description information is determined based on the multidimensional features of the multimedia file. The multidimensional features include at least two of the following: voice user features, environmental features, and semantic features.
[0024] The embodiments of this application pre-extract at least two of the following categories of features from each multimedia file in the multimedia file library: semantic features, environmental features, and voice user features. Through complementary fusion of multi-dimensional features, the structured description information corresponding to each multimedia file can be accurately extracted. After receiving the user's first input, the search element corresponding to the first input can be directly matched with the structured description information corresponding to each multimedia file, eliminating the need to perform repeated processing on each multimedia file. This improves the accuracy and efficiency of multimedia file retrieval.
[0025] The above steps are explained in detail below: In S110, the electronic device can be a smart device with multimedia file processing and display functions. In some embodiments of this application, the electronic device can be a mobile phone, tablet computer, laptop computer, or other terminal.
[0026] The first input is used by the user to search for multimedia files; that is, when a user wants to search for a specific multimedia file, they can use the first input. This first input can be text entered by the user via a soft keyboard or hard keyboard, or it can be voice input. When the user inputs voice, the electronic device can convert the received voice into text. This first input can be a complete sentence, such as "a snippet of my sister singing 'Happy Birthday' at my birthday party last year," or it can be multiple keywords, such as "last year," "my sister," "birthday party," and "Happy Birthday."
[0027] For example, if a user inputs the text "a clip of my sister singing 'Happy Birthday' at my birthday party last year", it means that the target multimedia file the user wants to query is: "a clip of my sister singing 'Happy Birthday' at my birthday party last year".
[0028] In S120, the search elements can be elements included in the first input that assist the electronic device in finding specific multimedia files. For example, they can include one or more keywords from the first input. Taking "a clip of my sister singing the birthday song at my birthday party last year" as an example, the corresponding search elements can include "last year", "my sister", "birthday party", and "birthday song".
[0029] A multimedia file library can be a database containing multiple multimedia files. For example, it could be an electronic device's photo album or gallery, a folder containing a specific multimedia file on the electronic device, or a database stored on another electronic device. In this case, the electronic device needs to establish a communication connection with other electronic devices and have permission to access the multimedia file library. For instance, when a mobile phone connects to a laptop, the mobile phone can access the laptop's photo album. In this case, the mobile phone can preprocess the multimedia files in the laptop's photo album to obtain structured description information for each multimedia file. These multimedia files can include, but are not limited to, audio, video, images, and documents.
[0030] The multimedia file library may include, but is not limited to, "a clip of my sister singing 'Happy Birthday' at my birthday party last year," "a video with the sound of flowing water filmed during a trip to A Mountain in May," "a video of a hot pot dinner filmed in C Town, B Mountain," "a workout video of arm exercises," "a video of washing vegetables," "a video of a baby crying," meeting audio, PDF documents, Word documents, images, etc. In some embodiments of this application, the multimedia file may be audio or video.
[0031] The structured description information here is a collection of data with clearly defined fields and formats extracted and organized from the original multimedia files. By establishing structured description information for each multimedia file, electronic devices can quickly and accurately find the target multimedia file that corresponds to the user's first input.
[0032] For example, for a video of a younger sister celebrating her birthday, the final structured description information could be in the following form: "Timestamp": "20:30:15-20:31:00"; "Speaker identity": "younger sister"; "Audio content": "Wow, thank you everyone! I'll make a wish first... (closes eyes and pauses)... I hope everyone is healthy and happy! Whew (blowing sound)"; "Environment type": "A cozy restaurant private room with the birthday song playing in the background."
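As an illustration only, the example record above could be held in a small typed structure and serialized for storage. The class name `StructuredDescription` and its field names are assumptions introduced for this sketch; the application does not prescribe a storage format.

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class StructuredDescription:
    """One structured description record (field names are illustrative)."""
    timestamp: str          # time span covered by this record
    speaker_identity: str   # derived from voice user features
    audio_content: str      # derived from semantic features
    environment_type: str   # derived from environmental features


record = StructuredDescription(
    timestamp="20:30:15-20:31:00",
    speaker_identity="younger sister",
    audio_content="Wow, thank you everyone! I'll make a wish first...",
    environment_type="A cozy restaurant private room with the birthday song "
                     "playing in the background.",
)

# Serialize the record so it can be stored alongside the multimedia file.
serialized = json.dumps(asdict(record), ensure_ascii=False)
```

Keeping the fields explicit is what lets a later search step compare a search element against a specific field, such as the speaker identity, rather than against free text.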
[0033] This structured descriptive information is generated based on multidimensional features extracted from multimedia files. These multidimensional features can include at least two of the following: voice user features, environmental features, and semantic features. Voice user features can be features that characterize the voice user's identity, such as the voice user's voiceprint features, language accent features, gender, and age features. Here, the voice user can be the user who made the speech.
[0034] Environmental features characterize the environment in which the multimedia file is located. These may include, but are not limited to, scene acoustic features, specific event category features, and temporal structure features. Scene acoustic features reflect the overall acoustic environment and spatial characteristics during multimedia file recording; for example, they may include, but are not limited to, indoor, outdoor, enclosed, or open spaces. Specific event category features identify specific non-human sound events occurring in the multimedia file; these may include, but are not limited to, knocking sounds, rain sounds, alarm sounds, wind sounds, and musical instrument sounds. Temporal structure features identify the temporal distribution and change patterns of environmental events; these may include, but are not limited to, continuous background noise, sudden transient sounds, and regular sounds.
[0035] Semantic features are used to represent the content of speech, indicating what the voice user said.
[0036] Voice user features, environmental features, and semantic features can be encoded using different encoders. By extracting multidimensional features, each multimedia file can be understood more deeply, resulting in a more accurate output of structured descriptions of each multimedia file. This, in turn, allows for a more precise identification of multimedia files that meet the user's needs.
[0037] In this embodiment, after receiving the first input, the electronic device can parse the first input to determine the search element, and then match the search element with the structured description information corresponding to each multimedia file to obtain the corresponding target multimedia file. For example, the multimedia file corresponding to the structured description information with a similarity greater than or equal to the similarity threshold of the search element can be used as the target multimedia file.
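The threshold-based matching described above can be sketched as follows. This is a minimal sketch under stated assumptions: the similarity measure (the fraction of search elements whose words all appear in the description), the function names, the threshold of 0.5, and the sample library are all hypothetical, since the application does not specify how similarity is computed.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(search_elements, description):
    """Fraction of search elements whose tokens all occur in the description."""
    if not search_elements:
        return 0.0
    description_tokens = tokenize(" ".join(description.values()))
    hits = sum(
        1 for element in search_elements
        if tokenize(element) <= description_tokens
    )
    return hits / len(search_elements)


def search(search_elements, library, threshold=0.5):
    """Return the files whose description similarity meets the threshold."""
    return [
        name
        for name, description in library.items()
        if similarity(search_elements, description) >= threshold
    ]


# Hypothetical pre-computed structured descriptions, keyed by file name.
library = {
    "birthday.mp4": {
        "speaker_identity": "younger sister",
        "audio_content": "happy birthday to you",
        "environment_type": "restaurant private room, birthday party",
    },
    "waterfall.mp4": {
        "speaker_identity": "",
        "audio_content": "",
        "environment_type": "outdoor, flowing water sound",
    },
}

results = search(["sister", "birthday party", "happy birthday"], library)
```

Because the second file is matched through its environment field alone, a query such as `search(["flowing water"], library)` still finds it even though its speech transcript is empty.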
[0038] This application embodiment can pre-process multimedia files in the multimedia file library, extract multi-dimensional features of the multimedia files, deeply understand and analyze these features, output and store the structured description information corresponding to each multimedia file, and subsequently, based on the user's initial input, quickly and accurately find the multimedia files that meet the user's needs. Moreover, since the structured description information of each multimedia file has been pre-stored, it can meet the user's search needs in offline mode.
[0039] The process of generating structured description information is explained below with reference to Figure 2. Figure 2 is a flowchart of another multimedia file search method provided in an embodiment of this application. The difference between Figure 2 and Figure 1 is that Figure 2 also includes S210-S240.
[0040] S210: The electronic device extracts the basic features of each multimedia file in the multimedia file library.
[0041] The basic features here are also called general features. These basic features can include various types of information, such as voice user information, environmental information, and audio content information, providing a foundation for subsequently obtaining features from different dimensions. For example, the basic features of each multimedia file can be extracted using a general feature extraction model.
[0042] A general feature extraction model can be, for example, the self-supervised learning framework wav2vec. Taking multimedia files, including audio and video, as an example, the main task of this framework is to transform the original, continuous audio and video waveform signals into discrete, context-rich latent speech representations.
[0043] For example, this general feature extraction model may include a feature encoding module, a quantization module, and a context network. The feature encoding module compresses the original audio / video waveform data and converts it into a low-dimensional, temporally continuous latent representation. This feature encoding module may include a multi-layer convolutional neural network, which can downsample, reducing the computational load of subsequent modules, while simultaneously extracting low-level acoustic features, achieving a preliminary abstraction from continuous temporal signals to local acoustic feature vectors.
[0044] The quantization module maps the continuous latent representation (Z) output by the feature encoder into a discrete, finite space to obtain the quantized representation (Q). This process "discretizes" continuous speech features into discrete units similar to phonemes or words, enhancing the robustness of the model and providing a clear "target" for subsequent contrastive learning.
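To make the idea of mapping continuous latents to a finite set of units concrete, the sketch below uses a nearest-neighbour codebook lookup. This is a swapped-in simplification: the application does not state how the quantization module selects a discrete unit, and the codebook, the latent values, and the function name `quantize` are hypothetical.

```python
def quantize(latents, codebook):
    """Map each continuous latent vector to its nearest codebook entry.

    Returns the chosen codebook indices and the quantized vectors.
    """
    indices = []
    for z in latents:
        # Squared Euclidean distance from z to every codebook entry.
        distances = [
            sum((zi - ci) ** 2 for zi, ci in zip(z, entry))
            for entry in codebook
        ]
        indices.append(distances.index(min(distances)))
    quantized = [codebook[i] for i in indices]
    return indices, quantized


# Hypothetical 2-D codebook with three entries, and three latent frames Z.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
Z = [[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7]]
indices, Q = quantize(Z, codebook)
```

Each frame of Z is replaced by one of a fixed number of entries, which is what gives later contrastive training a discrete target to predict.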
[0045] The context network receives the latent representation (Z) after random masking and uses the Transformer encoder to learn its contextual dependencies, generating a rich, context-aware representation (C). Leveraging the global modeling capabilities of the Transformer, the model captures long-range contextual dependencies throughout the audio sequence, focusing not only on the current moment but also on the preceding and following speech segments to understand the current sound. For example, it infers currently ambiguous phonemes from the context. This step outputs a contextual representation rich in contextual information.
[0046] wav2vec uses contrastive predictive coding or similar self-supervised learning objectives, requiring no annotations. This module is pre-trained on massive amounts of unannotated speech data. Through this pre-training, the general feature extraction module "learns" the general rules of human language and sound. Therefore, the features it extracts not only contain semantic information of speech but also implicitly include the speaker's timbre and non-linguistic acoustic texture information, providing a foundation for subsequent feature extraction across different dimensions.
[0047] S220: The electronic device encodes the basic features of each multimedia file according to different dimensions to obtain the multidimensional features of each multimedia file.
[0048] Understandably, the basic features extracted by the general feature extraction module are usually mixed. If it is necessary to accurately obtain the structured description information of each multimedia file, it is also necessary to extract the required information from the basic features.
[0049] For example, different encoders can be used to encode the basic features of multimedia files, resulting in features of different dimensions. For instance, the basic features can be input into a semantic encoder, which can then extract semantic features of the speech. Examples include what song is being sung, chat content containing birthday wishes, or discussions about a particular mountain.
[0050] For example, basic features are input into the Diarization encoder (speech user encoder), which can extract frame-level speech user features. These speech user features can include voiceprint information to distinguish different speech users. For example, even when singing, the Diarization encoder can distinguish which part is the voice of a "sister" and which part is the voice of a "mother".
[0051] For example, basic features are input into an event encoder, which can then extract environmental features. These environmental features can include the category and characteristics of ambient sounds. In a "waterfall" video, the background sound is the sound of flowing water; the event encoder will extract features representing the "flowing water sound."
[0052] For example, the basic features obtained above can be input into different encoders, and features of different dimensions can be extracted by different encoders, providing a reliable foundation for a deeper understanding of each multimedia file.
[0053] For example, the semantic encoder and the context encoder have similar structures and differ only in their parameters.
[0054] Taking a semantic encoder as an example, Figure 3 gives an example of the structure of a semantic encoder. As shown in Figure 3, the semantic encoder 301 includes, from input to output, a first feedforward layer 302, a multi-head self-attention layer 303, a convolutional layer 304, a second feedforward layer 305, and a normalization layer 306.
[0055] The first feedforward layer 302 is used to perform nonlinear transformation on the input basic features 300 to enhance the representation capability of the features.
[0056] The multi-head self-attention layer 303 is used to compute the correlation between different time steps in the audio sequence. Multiple attention heads compute in parallel, with each head focusing on different semantic relationships. The output of the multi-head self-attention layer 303 incorporates features that fuse global contextual information and retain their shape.
[0057] Transformer's self-attention is good at capturing global relationships, but its ability to model local sequence patterns, such as the pronunciation process of phonemes and local changes in acoustic features, is relatively weak. Convolutional layer 304 is specifically designed to enhance the ability to extract local acoustic patterns. For example, convolutional layer 304 can use one-dimensional convolution (along the time dimension) to capture local features of different spans through convolutional kernels of different sizes, without changing the shape.
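The shape-preserving behaviour of the one-dimensional convolution can be illustrated with a minimal sketch. It assumes zero padding, odd kernel sizes, and a single feature channel; the averaging kernels are placeholders for weights that would be learned, and `conv1d_same` is a hypothetical helper name.

```python
def conv1d_same(sequence, kernel):
    """Slide a kernel along the time axis, zero-padding to keep the length.

    As in most deep-learning libraries, the kernel is applied without
    flipping. An odd kernel length is assumed so padding is symmetric.
    """
    half = len(kernel) // 2
    padded = [0.0] * half + list(sequence) + [0.0] * half
    return [
        sum(k * padded[t + j] for j, k in enumerate(kernel))
        for t in range(len(sequence))
    ]


frames = [1.0, 2.0, 3.0, 4.0, 5.0]  # one feature channel over 5 time steps
short_span = conv1d_same(frames, [1 / 3] * 3)  # kernel size 3: narrow context
long_span = conv1d_same(frames, [0.2] * 5)     # kernel size 5: wider context
```

Both outputs have the same length as the input, so they can be passed to the following layer unchanged while covering different local spans.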
[0058] The second feedforward layer 305 is used for nonlinear transformation, which further abstracts and refines the features processed by the multi-head self-attention layer 303 and the convolutional layer 304. After integrating global and local information, the second feedforward layer 305 performs nonlinear transformation again to prepare for the output.
[0059] Normalization layer 306 is used to normalize the entire output, resulting in semantic features 307. The shape of the normalized features remains unchanged. Through the above structure, semantic encoder 301 can effectively extract hierarchical semantic features 307 from audio: including both global semantic context captured through self-attention and local acoustic details enhanced by convolution, and the final output is rich in word-level information that can be used for subsequent decoding.
[0060] S230: The electronic device performs cross-fusion processing on the multidimensional features to obtain the processed multidimensional features.
[0061] Considering that different features are easily interfered with by other features—for example, ambient sound is noise for semantic recognition, and human voice is noise for environmental detection—this embodiment can perform cross-fusion processing on multi-dimensional features to suppress noise and improve noise resistance in order to solve the noise interference problem in a single task. For example, the environmental features output by the environmental encoder can be used as a suppression signal to suppress non-human voice interference in semantic features, making the semantic features purer and improving the noise resistance of speech recognition.
[0062] Similarly, the semantic features output by the semantic encoder can be used as suppression signals to suppress human voice interference in environmental features, making environmental features more focused on the background event itself and avoiding misjudging human voice as environmental sound.
[0063] By cross-fusing multi-dimensional features, the noise interference problem in a single task can be effectively solved, and the purity and robustness of features in each dimension can be improved. This allows the system to adapt to complex and ever-changing acoustic scenarios, providing a solid data foundation for achieving high-precision, fine-grained semantic retrieval of multimedia files.
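One simple way to picture a suppression signal is an element-wise gate. The sketch below is purely illustrative: the application does not disclose the fusion operator, so the sigmoid gate, the assumption that both feature vectors share the same dimensions, the `strength` parameter, and the sample values are all hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def suppress(target, interference, strength=1.0):
    """Attenuate each target dimension where the interfering feature is strong.

    The gate 1 - sigmoid(strength * interference) lies in (0, 1), so the
    dimensions with the strongest interference are scaled down the most.
    """
    return [
        t * (1.0 - sigmoid(strength * i))
        for t, i in zip(target, interference)
    ]


semantic = [0.8, 0.5, 0.9]        # hypothetical semantic feature vector
environmental = [4.0, -4.0, 0.0]  # hypothetical environmental feature vector

# Each feature type is used as the suppression signal for the other.
purified_semantic = suppress(semantic, environmental)
purified_environmental = suppress(environmental, semantic)
```

In this toy case the first semantic dimension, where the environmental activation is highest, is reduced the most, which mirrors the intent of suppressing non-human-voice interference.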
[0064] S240: The electronic device decodes the processed multidimensional features to obtain the structured description information corresponding to each multimedia file.
[0065] For example, a decoder can be used to decode the processed multidimensional features to obtain the structured description information corresponding to each multimedia file. This embodiment does not limit the specific structure of the decoder; for example, an autoregressive large text model can be used to decode the processed multidimensional features.
[0066] This application first extracts the basic features of multimedia files, and then extracts features from different dimensions. Through these multi-dimensional features, a complete content profile of the multimedia file can be constructed, achieving a multi-faceted perception of the multimedia file and providing a rich data foundation for subsequent accurate retrieval. Then, the multi-dimensional features are cross-fused, effectively solving the noise interference problem in a single task and improving the purity of each dimension's features. This results in the final output structured description information possessing multi-dimensionality, high purity, and strong correlation, significantly improving the search accuracy and recall rate of multimedia files.
[0067] Taking the case where the multidimensional features include voice user features as an example, Figure 4 is a flowchart of another multimedia file search method provided in an embodiment of this application. As shown in Figure 4, the difference from Figure 2 is that S220 in Figure 2 can be further refined into S410-S440 in Figure 4.
[0068] S410: The electronic device performs multi-level encoding on the basic features of the multimedia file to obtain multiple levels of voice user features corresponding to the multimedia file.
[0069] For example, basic features can be sequentially input into multiple feature extraction modules such as SE-Res2Block, and encoded by SE-Res2Block to obtain multiple levels of speech user features. Each SE-Res2Block outputs one level of speech user feature.
[0070] In some embodiments of this application, Figure 5 provides an exemplary schematic diagram of a speech user encoder, which employs an improved ECAPA-TDNN model. As shown in Figure 5, the speech user encoder 501 includes an initial convolution module 502, a feature extraction module 503, an adaptive weight aggregation module 504, a feature transformation module 505, an attention statistical pooling module 506, and a fully connected mapping module 507. Figure 5 takes a case involving three SE-Res2Blocks as an example.
[0071] The initial convolution module 502 is used to perform a linear transformation on the basic features of the input, mapping them to a dimension suitable for subsequent processing.
[0072] The feature extraction module 503 is used to extract speech user features at different levels of abstraction, layer by layer. Figure 5 takes three cascaded feature extraction modules as an example, namely feature extraction module 1, feature extraction module 2, and feature extraction module 3. Feature extraction module 1 focuses on extracting shallow acoustic features such as pitch, formants, and vocal tract length information; feature extraction module 2 focuses on extracting mid-level articulation features such as phoneme transitions and articulation habits; feature extraction module 3 focuses on extracting high-level abstract features such as global speaker style and identity semantic information.
[0073] In this example, all three feature extraction modules adopt the SE-Res2Block structure. The specific structure and internal processing of SE-Res2Block can be found in relevant technologies and will not be elaborated here.
[0074] The adaptive weight aggregation module 504 is mainly used to dynamically weight and fuse the features output by the three feature extraction modules, replacing the traditional static splicing.
[0075] The feature transformation module 505 is mainly used to perform nonlinear transformation and dimension adjustment on the fused high-order features, and further fuse the spliced multi-level information through one-dimensional convolution to prepare suitable feature representations for the subsequent attention statistical pooling module 506.
[0076] The attention statistical pooling module 506 is mainly used to aggregate variable-length frame-level feature sequences into fixed-length global feature vectors.
[0077] The fully connected mapping module 507 is mainly used to map the pooled high-dimensional features to the final speaker embedding space, and output the final speaker features 508.
[0078] The specific processing details of the initial convolution module 502, feature extraction module 503, feature transformation module 505, attention statistical pooling module 506, and fully connected mapping module 507 can be found in relevant technologies and will not be repeated here.
[0079] S420: The electronic device determines the level weight corresponding to each level of voice user characteristics based on multiple levels of voice user characteristics.
[0080] This step is mainly implemented through the adaptive weight aggregation module 504. Specifically, the hierarchical weights corresponding to each level of speech user features can be determined based on multiple levels of speech user features.
[0081] For example, the above S420 may include the following steps: determining the global average value of the voice user features at each level; performing linear mapping on the global average values corresponding to the voice user features at each level to obtain the initial weights corresponding to the voice user features at each level; and normalizing the initial weights corresponding to the voice user features at each level to obtain the level weights corresponding to the voice user features at each level.
[0082] The following example illustrates the process of determining the hierarchical weights corresponding to the voice user features at each level.
[0083] The features output by the three feature extraction modules are denoted $F_1$, $F_2$ and $F_3$. For example, $F_1$, $F_2$ and $F_3$ are all vectors with dimension 1 and length 3.
[0084] In some embodiments, a global average of the voice user features at each level can be calculated separately. For example, $g_1 = \mathrm{mean}(F_1)$, $g_2 = \mathrm{mean}(F_2)$, $g_3 = \mathrm{mean}(F_3)$.
[0085] Here, $\mathrm{mean}(\cdot)$ represents calculating the global average, and $g_1$, $g_2$ and $g_3$ are the global averages corresponding to $F_1$, $F_2$ and $F_3$, respectively.
[0086] For example, a linear mapping can be performed on the global average value corresponding to each level of voice user features to obtain the initial weights corresponding to each level of voice user features. For example, $s = gW + b$, where $g = [g_1, g_2, g_3]$, $W$ represents a 3x3 linear transformation matrix, and $b$ represents a 1x3 bias matrix.
[0087] For example, for given $W$ and $b$, the above formula yields $s = [0.186, 0.296, 0.02]$. That is, the initial weights corresponding to the voice user features at each level are 0.186, 0.296, and 0.02, respectively.
[0088] For example, a normalization function such as softmax can be used to normalize each initial weight to obtain the hierarchical weights corresponding to the speech user features at each level.
[0089] For example, $\alpha_1 = \exp(s_1) / \sum_{j} \exp(s_j)$, j=1,2,3; similarly, $\alpha_2$ and $\alpha_3$ can be obtained. That is, the hierarchical weights corresponding to the voice user features at each level are $\alpha_1$, $\alpha_2$ and $\alpha_3$.
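The three steps of S420 (global average, linear mapping, normalization) can be illustrated with a minimal Python sketch. It assumes a row-vector linear mapping and softmax as the normalization function; the feature values, the matrix `W` and the bias `b` below are hypothetical and chosen only for illustration, and the function names are not taken from this application.

```python
import math

def global_average(feature):
    """Mean of all elements of one level's feature vector."""
    return sum(feature) / len(feature)

def linear_map(g, W, b):
    """Row-vector linear mapping s = g W + b (g: 1x3, W: 3x3, b: 1x3)."""
    return [sum(g[k] * W[k][j] for k in range(len(g))) + b[j]
            for j in range(len(b))]

def softmax(scores):
    """Normalize scores into positive weights that sum to 1 (assumed normalizer)."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def level_weights(features, W, b):
    """Global average per level, then linear mapping, then normalization."""
    g = [global_average(f) for f in features]
    return softmax(linear_map(g, W, b))

# Hypothetical inputs: three levels, each a vector of length 3.
features = [[0.2, 0.4, 0.6], [0.1, 0.3, 0.5], [0.3, 0.3, 0.3]]
W = [[0.5, 0.1, 0.0], [0.0, 0.5, 0.1], [0.1, 0.0, 0.5]]
b = [0.0, 0.1, -0.1]
weights = level_weights(features, W, b)

# Normalizing the initial weights quoted in the text with softmax.
quoted = softmax([0.186, 0.296, 0.02])
```

Under the softmax assumption, the quoted initial weights 0.186, 0.296 and 0.02 normalize to roughly 0.34, 0.38 and 0.29, so the level with the largest initial weight also receives the largest level weight.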
[0090] The embodiments of this application can dynamically evaluate the importance of features at different levels based on complex factors such as speech duration, noise environment, and speaking style. This effectively overcomes the inherent equal weight assumption of traditional static splicing methods, enhances the model's adaptability to different speech lengths and complex environments, reduces the interference of redundant features on speech user features, and lays a solid foundation for subsequent multi-dimensional feature fusion and accurate retrieval.
[0091] S430: The electronic device performs weighted fusion of the voice user features at each level and the corresponding level weights of the voice user features at each level to obtain the first fused feature.
[0092] For example, after obtaining the hierarchical weights corresponding to the voice user features at each level, the voice user features at each level can be fused based on the hierarchical weights to obtain the first fused feature.
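[0093] For example, $F_{\mathrm{fused}} = \mathrm{Concat}(\alpha_1 F_1, \alpha_2 F_2, \alpha_3 F_3)$, where $\mathrm{Concat}(\cdot)$ represents splicing, $\alpha_1$, $\alpha_2$ and $\alpha_3$ represent the hierarchical weights corresponding to the voice user features at each level, $F_1$, $F_2$ and $F_3$ represent the voice user features at each level, and $F_{\mathrm{fused}}$ represents the first fused feature.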
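As a rough illustration of S430, the sketch below scales each level's feature vector by its level weight and then splices the results. The function name `weighted_concat` and the numeric values are hypothetical; only the scale-then-splice idea comes from the text.

```python
def weighted_concat(features, weights):
    """Scale each level's features by its level weight, then splice them."""
    if len(features) != len(weights):
        raise ValueError("one weight is required per feature level")
    fused = []
    for feature, weight in zip(features, weights):
        fused.extend(weight * x for x in feature)
    return fused

# Hypothetical values: three levels of length-3 features and their weights.
features = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
weights = [0.5, 0.25, 0.25]
fused = weighted_concat(features, weights)
```

Unlike static splicing, which would pass all three levels through unchanged, a level with a small weight contributes proportionally smaller values to the first fused feature.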
[0094] S440: The electronic device performs mapping processing on the first fusion feature to obtain the voice user feature corresponding to the multimedia file.
[0095] For example, the first fused feature can be sequentially input into the feature transformation module 505, the attention statistical pooling module 506, and the fully connected mapping module 507 to obtain the speech user features corresponding to the multimedia file. The processing procedure for each multimedia file is similar. In practical applications, multiple multimedia files can be processed in parallel, which can improve processing efficiency.
[0096] The embodiments of this application can dynamically evaluate the importance of features at different levels based on complex factors such as speech duration, noise environment, and speaking style. This effectively overcomes the inherent equal weight assumption of traditional static splicing methods, enhances the model's adaptability to different speech lengths and complex environments, reduces the interference of redundant features on speech user features, and significantly improves the discriminative power and robustness of speech user features.
[0097] Taking multidimensional features including voice user features, environmental features, and semantic features as an example, Figure 6 provides an exemplary flowchart of a multimedia file search method. The difference between Figure 6 and Figure 2 is that S230 in Figure 2 can be further refined into S610-S640 in Figure 6.
[0098] S610: The electronic device generates a first attention weight based on environmental features and a second attention weight based on semantic features.
[0099] It is understandable that when recognizing semantic features, ambient sound is noise relative to semantic recognition. Similarly, when recognizing ambient sound, human voice is noise relative to ambient sound. In order to effectively solve the noise interference problem in a single task, for example, a first attention weight can be generated based on the environmental features and a second attention weight can be generated based on the semantic features. The first attention weight and the second attention weight can be values between 0 and 1.
[0100] The process of determining the first attention weight and the second attention weight is similar. Taking the process of determining the first attention weight as an example, for instance, $A_e = \sigma(W_e E + b_e)$, where $W_e$ represents a linear transformation matrix, $b_e$ represents the bias matrix, $E$ is the raw output of the Event encoder, i.e., the raw environmental features, $A_e$ represents the first attention weight, and $\sigma(\cdot)$ maps each element to a value between 0 and 1 (for example, the sigmoid function).
[0101] Taking the process of determining the second attention weight as an example, exemplarily, $A_s = \sigma(W_s S + b_s)$, where $W_s$ represents a linear transformation matrix, $b_s$ represents the bias matrix, $S$ represents the raw output of the Semantic encoder, i.e., the raw semantic features, and $A_s$ represents the second attention weight.
[0102] S620: The electronic device calibrates the semantic features according to the first attention weight to obtain calibrated semantic features, and calibrates the environmental features according to the second attention weight to obtain calibrated environmental features.
[0103] For example, semantic features can be calibrated based on a first attention weight and environmental features can be calibrated based on a second attention weight, thereby suppressing the interference of environmental sounds on semantic recognition and the interference of human voices on environmental sound recognition, and improving the purity of features.
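[0104] For example, $S' = A_e \odot S$ and $E' = A_s \odot E$, where $\odot$ represents multiplying corresponding elements, $A_e$ and $A_s$ represent the first attention weight and the second attention weight, $S$ and $E$ represent the semantic features and the environmental features, $S'$ represents the calibrated semantic features, and $E'$ represents the calibrated environmental features.
[0105] Given values of the attention weight and of the features to be calibrated, the calibrated features can be obtained according to the above formula.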
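The cross-calibration of S610 and S620 can be sketched as follows. This is a minimal illustration that assumes a sigmoid is used to keep the attention weights between 0 and 1; the matrices, biases and feature values are hypothetical, as are the function names.

```python
import math

def sigmoid(x):
    """Map a real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def attention_weight(feature, W, b):
    """Attention weight sigmoid(W x + b), one value in (0, 1) per output element."""
    return [sigmoid(sum(W[i][k] * feature[k] for k in range(len(feature))) + b[i])
            for i in range(len(b))]

def calibrate(feature, gate):
    """Multiply corresponding elements of a feature and an attention weight."""
    return [g * x for g, x in zip(gate, feature)]

def cross_calibrate(semantic, environment, W_e, b_e, W_s, b_s):
    """Environmental features gate semantic features, and vice versa."""
    a_env = attention_weight(environment, W_e, b_e)  # first attention weight
    a_sem = attention_weight(semantic, W_s, b_s)     # second attention weight
    return calibrate(semantic, a_env), calibrate(environment, a_sem)

# Hypothetical two-element features with identity mappings and zero biases.
identity = [[1.0, 0.0], [0.0, 1.0]]
zero_bias = [0.0, 0.0]
semantic = [1.0, -2.0]
environment = [0.0, 2.0]
cal_sem, cal_env = cross_calibrate(
    semantic, environment, identity, zero_bias, identity, zero_bias)
```

Because every attention weight lies strictly between 0 and 1, calibration can only attenuate an element, never amplify it or flip its sign.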
[0106] S630: The electronic device performs fusion processing on the calibrated semantic features and voice user features to obtain fused semantic features.
[0107] This embodiment does not limit the specific fusion method of the calibrated semantic features and voice user features. For example, the calibrated semantic features and voice user features can be fused using a weighted method, in which a weight is used to balance the calibrated semantic features against the voice user features output by the Diarization encoder, and the result is the semantic features after fusion, that is, the final semantic features obtained.
[0108] Given example values of the calibrated semantic features, the voice user features and the weight, the fused semantic features can be obtained accordingly.
[0109] For example, the calibrated semantic features and speech user features can be directly concatenated as the fused semantic features. For example, the calibrated semantic features and speech user features can also be interactively fused using an attention mechanism. For example, the calibrated semantic features and speech user features can also be input into a multilayer perceptron (MLP) and fused nonlinearly using the MLP's nonlinear activation function. A multilayer perceptron is a feedforward artificial neural network model composed of multiple neuron layers.
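The weighted fusion option can be sketched as below. The exact weighting formula is not reproduced in this text, so the convex combination `lam * s + (1 - lam) * v` is an assumption made for illustration, as are the function name, the requirement that both features share one length, and the example values.

```python
def weighted_fusion(calibrated_semantic, voice_user, lam):
    """Blend two equal-length features; lam balances semantic against voice user."""
    if len(calibrated_semantic) != len(voice_user):
        raise ValueError("features must have the same length to be blended")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam is assumed to lie between 0 and 1")
    return [lam * s + (1.0 - lam) * v
            for s, v in zip(calibrated_semantic, voice_user)]

# Hypothetical values: lam = 0.75 favors the calibrated semantic features.
fused_semantic = weighted_fusion([1.0, 0.0], [0.0, 1.0], 0.75)
```

A larger `lam` keeps the fused semantic features closer to the spoken content, while a smaller one emphasizes who is speaking; direct splicing or an attention mechanism, also mentioned above, would avoid the equal-length requirement.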
[0110] By fusing calibrated semantic features and voice user features, enhanced semantic features can be obtained, realizing the binding of voice user identity with voice content. This not only provides downstream decoders with rich, ready-to-use, and unambiguous input, significantly improving the generation quality of structured descriptive information, but also fundamentally supports accurate responses to complex query conditions such as "person + content".
[0111] S640: The electronic device determines the voice user features, calibrated environmental features, and fused semantic features as the processed multidimensional features.
[0112] For example, the voice user features, calibrated environmental features, and fused semantic features can be used as processed multidimensional features. That is, the voice user features, calibrated environmental features, and fused semantic features can be decoded to obtain the structured description information corresponding to each multimedia file.
[0113] For example, as shown in Figure 7, the electronic device can input the basic features 300 extracted by the general feature extraction module into the semantic encoder 301, the speech user encoder 501, and the environment encoder 701 respectively. Semantic features are extracted by the semantic encoder 301, voice user features are extracted by the speech user encoder 501, and environmental features are extracted by the environment encoder 701. The environmental features are then used to perform calibration 705 on the semantic features to obtain the calibrated semantic features. On this basis, the calibrated semantic features and the voice user features undergo fusion 706 to obtain the fused semantic features. Similarly, the semantic features are used to perform calibration 705 on the environmental features to obtain the calibrated environmental features. Thus, the processed multidimensional features are obtained: the voice user features, the calibrated environmental features, and the fused semantic features. The processed multidimensional features are input into decoder 707 for decoding to obtain structured description information.
[0114] This embodiment calibrates semantic features based on environmental features and simultaneously calibrates environmental features based on semantic features, effectively solving the problem of noise interference in a single task and improving the purity of environmental and semantic features. At the same time, it fuses the calibrated semantic features with voice user features, realizing the binding of voice user identity and speaking content, significantly improving the generation quality of structured description information, and thus improving the retrieval accuracy and recall rate of multimedia files.
[0115] In order to obtain the structured description information corresponding to each multimedia file, in some embodiments of this application, the above-mentioned S240 may include the following steps: fusing the processed multidimensional features to obtain the second fused feature; and inputting the second fused feature and prompt words into the decoder for decoding to obtain the structured description information corresponding to each multimedia file; wherein, the structured description information includes at least one of the following: timestamp, voice user identity information, environment type information, and audio content information.
[0116] For example, the processed multidimensional features can be directly concatenated to obtain the second fused feature. Alternatively, the processed multidimensional features can be fused using an attention mechanism to obtain the second fused feature.
[0117] This embodiment does not limit the specific structure of the decoder. For example, the large text model Qwen3-0.6B can be used as the decoder. Qwen3-0.6B is a model that has been pre-trained with a large amount of text.
[0118] The prompts here are used to provide contextual information to the decoder, constraining the decoder's output within the expected structured scope. Exemplarily, these prompts may include, but are not limited to, role definition information, output format requirements, task description information, and field definition information. For example, in some embodiments of this application, the prompts may be in the following form: You are an intelligent audio content analysis assistant. Based on the given audio features, please generate a structured JSON description, including timestamp, voice user, voice content, and environmental information fields.
[0119] The output format is as follows: { "timestamp": "start time - end time", "speaker": "speaker's name or role", "content": "A transcription of the spoken content, which may include descriptions such as modal particles and pauses.", "environment": "Background environment description" } The audio features have been encoded and used as context input; please output JSON directly.
[0120] Among them, "timestamp", "speaker", "content", and "environment" represent the timestamp field, the voice user field, the voice content field, and the environment information field, respectively.
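To make the role of the prompt words and the four fields concrete, the following sketch assembles a prompt of the form described above and checks that a decoder output contains every required field. The helper names `build_prompt` and `parse_description` are hypothetical, and the validation step is an illustrative addition rather than part of the described method.

```python
import json

REQUIRED_FIELDS = ("timestamp", "speaker", "content", "environment")

def build_prompt():
    """Assemble role definition, task description and output format."""
    output_format = {
        "timestamp": "start time - end time",
        "speaker": "speaker's name or role",
        "content": "A transcription of the spoken content",
        "environment": "Background environment description",
    }
    return (
        "You are an intelligent audio content analysis assistant. "
        "Based on the given audio features, please generate a structured JSON "
        "description, including timestamp, voice user, voice content, and "
        "environmental information fields.\n"
        "The output format is as follows:\n"
        + json.dumps(output_format, indent=2)
        + "\nThe audio features have been encoded and used as context input; "
        "please output JSON directly."
    )

def parse_description(raw):
    """Parse decoder output and ensure all four fields are present."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("structured description must be a JSON object")
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))
    return {name: str(data[name]) for name in REQUIRED_FIELDS}

prompt = build_prompt()
```

Rejecting outputs with missing fields keeps malformed descriptions out of the multimedia file library, where they could otherwise cause search misses.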
[0121] Taking Qwen3-0.6B as the decoder as an example, the specific decoding process can be seen in Figure 8. Decoder 707 performs autoregressive decoding on the input prompt word 802 and the second fused feature 803. Autoregression is the standard way for GPT-like models to generate text, similar to a word chain game: the model outputs one word at a time and feeds that word back to itself as the basis for generating the next word. In other words, the output of the previous time step becomes the input of the next time step, forming a closed feedback loop.
[0122] Taking the aforementioned prompt word as an example, initially, the decoder 707 generates "{" based on the prompt word 802 and the second fusion feature 803. Then, "{" and the original input (prompt word 802 and the second fusion feature 803) are used as new input to generate "timestamp". Afterward, "{" and "timestamp" are used as a known sequence, along with the original input, as new input to continue generating the next word, until a complete JSON structure is generated. Thus, structured description information for each multimedia file can be obtained. In some embodiments of this application, the structured description information includes timestamps, voice user identity, environment type, and audio content.
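The autoregressive loop described above can be sketched with a toy stand-in for the decoder. The `scripted_model` function below simply replays a fixed token list and is purely hypothetical; a real decoder such as Qwen3-0.6B would instead predict each token from the prompt words, the second fused feature and the tokens generated so far.

```python
def greedy_decode(next_token, prompt, fused_feature, max_steps=64, end_token="<eos>"):
    """Generate tokens one at a time, feeding earlier outputs back as input."""
    generated = []
    for _ in range(max_steps):
        # The original input stays fixed; only the known sequence grows.
        token = next_token(prompt, fused_feature, generated)
        if token == end_token:
            break
        generated.append(token)
    return generated

# Toy stand-in for the decoder: replays a script based on what was generated.
SCRIPT = ["{", '"timestamp"', ":", '"00:00-00:05"', "}"]

def scripted_model(prompt, fused_feature, generated):
    """Return the next scripted token, or the end token when the script is done."""
    if len(generated) < len(SCRIPT):
        return SCRIPT[len(generated)]
    return "<eos>"

tokens = greedy_decode(scripted_model, "prompt words", [0.1, 0.2])
text = "".join(tokens)
```

The `max_steps` bound is a practical safeguard added in this sketch so that generation terminates even if the model never emits the end token.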
[0123] In accordance with the requirements of the above-mentioned prompt words, in some embodiments of this application, the final output structured description information may be in the following form: { "timestamp": "20:30:15-20:31:00", "speaker": "little sister", "content": "Wow, thank you everyone! I'll make a wish first... (closes eyes and pauses)... I hope everyone can be healthy and happy! *sigh*", "environment": "A cozy private dining room with a birthday song playing in the background." } This embodiment achieves alignment and interaction of the processed multidimensional features at the feature level by fusing the processed multidimensional features, forming a complete information package. This avoids generation errors caused by feature fragmentation. Based on this, combined with the decoding of prompt words, high-quality structured description information can be output, which can be directly used as a high-quality index for multimedia files, effectively supporting accurate and fast response to complex natural language queries.
[0124] In some embodiments of this application, taking a multimedia file including video as an example, the above S120 may include the following steps: In response to the first input, the target multimedia file is displayed in the multimedia file playback window; the multimedia file playback window also includes a first control and a progress bar component, the first control is used to control the playback or pause of the target multimedia file, and the playback point of the progress bar component is located at the starting position of the content corresponding to the search element in the target multimedia file.
[0125] The multimedia file playback window is used to display the found target multimedia file. For example, as shown in Figure 9, after receiving the first input, the electronic device can match the search elements corresponding to the first input with the structured description information of each multimedia file in the pre-stored multimedia file library to obtain the target multimedia file 903 corresponding to the first input, and display the target multimedia file 903 through the multimedia file playback window 901 for user use.
[0126] In some embodiments of this application, as shown in Figure 9, the multimedia file playback window 901 may also include a search area 902, where the user can enter a first input such as "Help me find the video clip of my sister's birthday last year" or "A video with the sound of flowing water that I took in May when I went to A Mountain." Alternatively, the user can directly input their search request via voice, and the electronic device can convert the voice into text and display it in the search area 902 for the user's confirmation. The user can modify the text information displayed in the search area 902 as needed.
[0127] In some embodiments of this application, the multimedia file playback window 901 may further include a first control 904 and a progress bar component 905. The first control 904 is used to control the playback and pause of the target multimedia file 903. The progress bar component 905 is used to display the playback progress of the target multimedia file 903 in real time. Considering that the target multimedia file 903 corresponding to the user's first input may only be a part of a certain multimedia file, in order to facilitate the user's viewing or use of the target multimedia file 903, this embodiment can place the playback point 906 of the progress bar component 905 at the beginning position of the content corresponding to the search element contained in the target multimedia file 903. That is, when the user clicks the first control 904, the electronic device can directly play the target multimedia file 903 corresponding to the first input, skipping the beginning position of the entire video. In this way, the user no longer needs to search manually, improving the efficiency of finding multimedia files.
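A simplified sketch of matching search elements against structured description information and deriving the playback point is given below. The keyword-overlap scoring, the library layout, the file names and the assumption that the timestamp field holds `HH:MM:SS` offsets within the file are all hypothetical simplifications for illustration.

```python
def to_seconds(clock):
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in clock.strip().split(":"))
    return hours * 3600 + minutes * 60 + seconds

def playback_start(timestamp):
    """Return the start of a 'start-end' timestamp, in seconds."""
    start, _, _end = timestamp.partition("-")
    return to_seconds(start)

def find_target(search_elements, library):
    """Return the entry whose description mentions the most search elements."""
    def score(entry):
        text = " ".join(str(v) for v in entry["description"].values()).lower()
        return sum(1 for element in search_elements if element.lower() in text)
    best = max(library, key=score, default=None)
    if best is None or score(best) == 0:
        return None  # nothing in the library matches
    return best

# Hypothetical library with two entries.
library = [
    {"file": "birthday.mp4",
     "description": {"timestamp": "00:01:30-00:02:15",
                     "speaker": "little sister",
                     "content": "I'll make a wish first",
                     "environment": "private dining room, birthday song"}},
    {"file": "hike.mp4",
     "description": {"timestamp": "00:00:10-00:00:40",
                     "speaker": "unknown",
                     "content": "",
                     "environment": "mountain stream, flowing water"}},
]
target = find_target(["sister", "birthday"], library)
start_seconds = playback_start(target["description"]["timestamp"])
```

Here the playback point would be placed 90 seconds into `birthday.mp4`, so pressing the first control starts playback at the matched content instead of at the beginning of the file.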
[0128] For example, the progress bar component 905 can be located to the right of the first control 904 and extend horizontally. The left side of the progress bar component 905 can be displayed as a gray line to indicate the length of the played or buffered file, and the right side can be displayed as a white line to indicate the length of the currently unplayed multimedia file.
[0129] This embodiment can not only display the target multimedia file, but also a first control for controlling the playback and pause of the multimedia file, as well as a progress bar component. The playback point of the progress bar component is directly located at the beginning of the target multimedia file. In this way, users can conveniently perform target operations on the target multimedia file, improving operational efficiency.
[0130] This application's embodiments effectively solve the problems of poor search result accuracy and relevance in existing solutions, providing users with accurate multimedia file search services. Furthermore, this solution can be implemented offline, making it more widely applicable.
[0131] It should be noted that the multimedia file search method provided in this application embodiment can be executed by a multimedia file search device or a processing module within that device for executing the multimedia file search method. This application embodiment uses the execution of the multimedia file search method by a multimedia file search device as an example to illustrate the multimedia file search device provided in this application embodiment.
[0132] Figure 10 is a schematic diagram of a multimedia file search device provided in an embodiment of this application.
[0133] As shown in Figure 10, the multimedia file search device 1000 may include: a receiving module 1001, used to receive the first input; and an output module 1002, used to output a target multimedia file in response to the first input. The target multimedia file is obtained by matching the retrieval element corresponding to the first input with the structured description information in the multimedia file library. The multimedia file library includes multiple multimedia files and the structured description information corresponding to each multimedia file. The structured description information is determined based on the multidimensional features of the multimedia file. The multidimensional features include at least two of the following options: voice user features, environmental features, and semantic features.
[0134] This application embodiment pre-extracts at least two categories of semantic features, environmental features, and voice user features from each multimedia file in the multimedia file library. Through complementary fusion of multi-dimensional features, the structured description information corresponding to each multimedia file can be accurately extracted. After receiving the user's first input, the search element corresponding to the first input can be directly matched with the structured description information corresponding to each multimedia file, eliminating the need to perform repeated processing for each multimedia file. This improves the accuracy and efficiency of multimedia file retrieval.
[0135] In some possible implementations of the embodiments of this application, the multimedia file search device 1000 may further include: Extraction module, used for: Before receiving the first input, the receiving module 1001 extracts the basic features of each multimedia file in the multimedia file library; The encoding module is used to encode the basic features of each multimedia file according to different dimensions, so as to obtain the multi-dimensional features of each multimedia file; The processing module is used to perform cross-fusion processing on multidimensional features to obtain processed multidimensional features. The decoding module is used to decode the processed multidimensional features to obtain the structured description information corresponding to each multimedia file.
[0136] In some possible implementations of the embodiments of this application, the multidimensional features include voice user features; The encoding module includes: The encoding unit is used to encode the basic features of a multimedia file at multiple levels to obtain multiple levels of speech user features corresponding to the multimedia file. The determining unit is used to determine the hierarchical weight corresponding to each level of voice user feature based on multiple levels of voice user features. The fusion unit is used to perform weighted fusion of the voice user features at each level and the corresponding hierarchical weights of the voice user features at each level to obtain the first fused feature; The processing module is also used to map the first fused feature to obtain the voice user features corresponding to the multimedia file.
[0137] In some possible implementations of the embodiments of this application, the determining unit is specifically used for: determining the global average value of the voice user features at each level; and performing linear mapping on the global average values corresponding to the voice user features at each level to obtain the initial weights corresponding to the voice user features at each level. The processing module is also used to normalize the initial weights corresponding to the voice user features at each level, so as to obtain the level weights corresponding to the voice user features at each level.
[0138] In some possible implementations of the embodiments of this application, the processing module is further configured to fuse the processed multidimensional features to obtain a second fused feature; The decoding module is specifically used for: Decoding is performed based on the second fusion feature and prompt words to obtain the structured description information corresponding to each multimedia file; wherein, the structured description information includes at least one of the following: timestamp, voice user identity information, environment type information, and audio content information.
[0139] The multimedia file retrieval device in this application embodiment can be a device or a component in an electronic device, such as an integrated circuit or a chip. For example, the electronic device can be a mobile phone, tablet computer, laptop computer, palmtop computer, in-vehicle electronic device, mobile internet device (MID), augmented reality (AR) / virtual reality (VR) device, robot, wearable device, ultra-mobile personal computer (UMPC), netbook, or personal digital assistant (PDA), etc. It can also be a server, network attached storage (NAS), personal computer (PC), television (TV), ATM, or self-service machine, etc. This application embodiment does not specifically limit the device.
[0140] The electronic device in this application embodiment can be a terminal with an operating system. The operating system can be Android, iOS, or other possible operating systems; this application embodiment does not specifically limit the specific operating system.
[0141] The multimedia file search device provided in this application embodiment can implement the various processes in the multimedia file search method embodiments of Figures 1 to 9 and can achieve the same technical effect. To avoid repetition, they will not be described again here.
[0142] As shown in Figure 11, this application embodiment also provides an electronic device 1100, including a processor 1101 and a memory 1102. The memory 1102 stores programs or instructions that can run on the processor 1101. When the program or instructions are executed by the processor 1101, they implement the various steps of the above-described multimedia file search method embodiment and can achieve the same technical effect. To avoid repetition, they will not be described again here.
[0143] It should be noted that the electronic devices in the embodiments of this application include the mobile terminals and non-mobile terminals mentioned above.
[0144] Figure 12 is a schematic diagram of the hardware structure of an electronic device provided in an embodiment of this application.
[0145] The electronic device 1200 includes, but is not limited to, components such as: radio frequency unit 1201, network module 1202, audio output unit 1203, input unit 1204, sensor 1205, display unit 1206, user input unit 1207, interface unit 1208, memory 1209, and processor 1210.
[0146] Those skilled in the art will understand that the electronic device 1200 may also include a power supply (such as a battery) for supplying power to various components. The power supply may be logically connected to the processor 1210 through a power management system, thereby enabling functions such as managing charging, discharging, and power consumption through the power management system. The structure of the electronic device 1200 shown in Figure 12 does not constitute a limitation on the electronic device 1200. The electronic device 1200 may include more or fewer components than shown, or combine certain components, or have different component arrangements, which will not be described in detail here.
[0147] The user input unit 1207 is used to receive the first input; Processor 1210 is configured to output a target multimedia file in response to a first input; The target multimedia file is obtained by matching the retrieval element corresponding to the first input with the structured description information in the multimedia file library. The multimedia file library includes multiple multimedia files and the structured description information corresponding to each multimedia file. The structured description information is determined based on the multidimensional features of the multimedia file. The multidimensional features include at least two of the following options: voice user features, environmental features, and semantic features.
[0148] This application embodiment pre-extracts at least two categories of semantic features, environmental features, and voice user features from each multimedia file in the multimedia file library. Through complementary fusion of multi-dimensional features, the structured description information corresponding to each multimedia file can be accurately extracted. After receiving the user's first input, the search element corresponding to the first input can be directly matched with the structured description information corresponding to each multimedia file, eliminating the need to perform repeated processing on each multimedia file. This improves the accuracy and efficiency of multimedia file retrieval.
[0149] In some possible implementations of embodiments of this application, the processor 1210 is specifically used for: Before the user input unit 1207 receives the first input, the basic features of each multimedia file in the multimedia file library are extracted; The basic features of each multimedia file are encoded according to different dimensions to obtain the multidimensional features of each multimedia file. The multidimensional features are cross-fused to obtain the processed multidimensional features. The processed multidimensional features are decoded to obtain the structured description information corresponding to each multimedia file.
[0150] In some possible implementations of the embodiments of this application, the multidimensional features include voice user features; Processor 1210, specifically used for: Multi-level encoding is performed on the basic features of multimedia files to obtain multiple levels of voice user features; The hierarchical weight corresponding to each level of voice user feature is determined based on multiple levels of voice user features; The voice user features at each level and the corresponding level weights of each voice user feature are weighted and fused to obtain the first fused feature; The first fusion feature is mapped to obtain the voice user features corresponding to the multimedia file.
[0151] In some possible implementations of embodiments of this application, the processor 1210 is specifically used for: Determine the global average value of the voice user features for each level; Linear mapping is performed on the global average values corresponding to the speech user features at each level to obtain the initial weights corresponding to the speech user features at each level; The initial weights corresponding to the voice user features at each level are normalized to obtain the level weights corresponding to the voice user features at each level.
[0152] In some possible implementations of the embodiments of this application, the processor 1210 is specifically configured to: perform fusion processing on the processed multidimensional features to obtain a second fused feature; and input the second fused feature and prompt words into a decoder for decoding to obtain the structured description information corresponding to each multimedia file, wherein the structured description information includes at least one of the following: timestamp, voice user identity information, environment type information, and audio content information.
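The data flow of this decoding step can be sketched as below. The sketch makes several assumptions that are not stated in this application: the second fused feature is formed by simple concatenation, the decoder returns a JSON string, and the key names are hypothetical labels for the four listed items. `stub_decoder` is a placeholder that returns canned text in place of a trained model.

```python
import json

# Hypothetical key names for the four items listed in the paragraph above.
EXPECTED_KEYS = ("timestamp", "speaker_identity", "environment_type", "audio_content")


def fuse(processed_features):
    """Second fused feature, formed here by concatenation (an assumption)."""
    fused = []
    for name in sorted(processed_features):  # fixed order keeps the layout reproducible
        fused.extend(processed_features[name])
    return fused


def describe(processed_features, prompt, decoder):
    """Fuse the features, call the decoder, and keep only the expected fields."""
    raw = decoder(fuse(processed_features), prompt)
    parsed = json.loads(raw)
    return {key: parsed[key] for key in EXPECTED_KEYS if key in parsed}


def stub_decoder(fused_feature, prompt):
    """Placeholder for a trained decoder; returns a canned JSON string."""
    return json.dumps({
        "timestamp": "00:00:00-00:01:30",
        "speaker_identity": "speaker_1",
        "environment_type": "meeting room",
        "audio_content": "project schedule discussion",
        "debug_feature_length": len(fused_feature),
    })


features = {"semantic": [0.1, 0.2], "environment": [0.3], "voice_user": [0.4, 0.5]}
prompt = "Describe the recording as JSON with timestamp, speaker, environment and content."
description = describe(features, prompt, stub_decoder)
```

Filtering the decoder output to the expected keys keeps any extra text from leaking into the stored description, and a description holding only some of the keys remains valid because the paragraph requires at least one of the four items.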
[0153] It should be understood that, in this embodiment, the input unit 1204 may include a graphics processing unit (GPU) 12041 and a microphone 12042. The GPU 12041 processes image data of still images or videos obtained by an image capture device (such as a camera) in video capture mode or image capture mode. The display unit 1206 may include a display panel 12061, which may be configured in the form of a liquid crystal display, an organic light-emitting diode, or the like. The user input unit 1207 includes at least one of a touch panel 12071 and other input devices 12072. The touch panel 12071, also called a touch screen, may include a touch detection device and a touch controller. Other input devices 12072 may include, but are not limited to, physical keyboards, function keys (such as volume control buttons and power buttons), trackballs, mice, and joysticks, which will not be described in detail here.
[0154] The memory 1209 can be used to store software programs and various data. The memory 1209 may primarily include a first storage area for storing programs or instructions and a second storage area for storing data. The first storage area may store the operating system and the application programs or instructions required for at least one function (such as sound playback or image playback). Furthermore, the memory 1209 may include volatile memory or non-volatile memory, or both. The non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. The volatile memory may be random access memory (RAM), static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDRSDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchlink dynamic random access memory (SLDRAM), or direct Rambus random access memory (DRRAM). The memory 1209 in this embodiment includes, but is not limited to, these and any other suitable types of memory.
[0155] The processor 1210 may include one or more processing units. Optionally, the processor 1210 integrates an application processor and a modem processor, wherein the application processor mainly handles operations involving the operating system, user interface, and applications, and the modem processor, such as a baseband processor, mainly handles wireless communication signals. It is understood that the modem processor may alternatively not be integrated into the processor 1210.
[0156] This application also provides a readable storage medium storing a program or instructions. When the program or instructions are executed by a processor, they implement the various processes of the above-described multimedia file search method embodiments and achieve the same technical effects. To avoid repetition, they will not be described again here.
[0157] The processor is the processor in the electronic device described in the above embodiments. The readable storage medium includes a computer-readable storage medium, such as a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
[0158] This application embodiment also provides a chip, which includes a processor and a communication interface. The communication interface and the processor are coupled. The processor is used to run programs or instructions to implement the various processes of the above-described multimedia file search method embodiments and can achieve the same technical effect. To avoid repetition, it will not be described again here.
[0159] It should be understood that the chip mentioned in the embodiments of this application may also be referred to as a system-level chip, system chip, chip system, or system-on-a-chip, etc.
[0160] This application provides a computer program product that is stored in a storage medium and executed by at least one processor to implement the various processes of the multimedia file search method embodiment described above, and can achieve the same technical effect. To avoid repetition, it will not be described again here.
[0161] It should be noted that, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element. Furthermore, it should be noted that the scope of the methods and apparatuses in the embodiments of this application is not limited to performing functions in the order shown or discussed, but may also include performing functions substantially simultaneously or in the reverse order, depending on the functions involved. For example, the described methods may be performed in a different order than described, and various steps may be added, omitted, or combined. Additionally, features described with reference to certain examples may be combined in other examples.
[0162] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms. Of course, they can also be implemented by hardware, but in many cases the former is a better implementation method. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the related technology, can be embodied in the form of a computer software product. This computer software product is stored in a storage medium (such as ROM / RAM, magnetic disk, optical disk) and includes several instructions to cause a terminal (which may be a mobile phone, computer, server, or network device, etc.) to execute the methods described in the various embodiments of this application.
[0163] The embodiments of this application have been described above with reference to the accompanying drawings. However, this application is not limited to the specific embodiments described above, which are merely illustrative and not restrictive. Under the guidance of this application, those skilled in the art can devise many other forms without departing from the spirit of this application and the scope of the claims, and all of these forms fall within the protection scope of this application.
Claims
1. A multimedia file search method, characterized by comprising: receiving a first input; and outputting a target multimedia file in response to the first input; wherein the target multimedia file is obtained by matching the retrieval element corresponding to the first input with the structured description information in a multimedia file library; the multimedia file library includes multiple multimedia files and the structured description information corresponding to each multimedia file; the structured description information is determined based on the multidimensional features of the multimedia file; and the multidimensional features include at least two of the following: voice user features, environmental features, and semantic features.
2. The method of claim 1, wherein, before receiving the first input, the method further includes: extracting the basic features of each multimedia file in the multimedia file library; encoding the basic features of each multimedia file according to different dimensions to obtain the multidimensional features of each multimedia file; performing cross-fusion processing on the multidimensional features to obtain processed multidimensional features; and decoding the processed multidimensional features to obtain the structured description information corresponding to each multimedia file.
3. The method of claim 2, wherein the multidimensional features include voice user features, and encoding the basic features of each multimedia file according to different dimensions to obtain the multidimensional features of each multimedia file includes: performing multi-level encoding on the basic features of the multimedia file to obtain voice user features at multiple levels corresponding to the multimedia file; determining, based on the voice user features at the multiple levels, the level weight corresponding to the voice user features at each level; performing weighted fusion of the voice user features at each level with their corresponding level weights to obtain a first fused feature; and performing mapping processing on the first fused feature to obtain the voice user features corresponding to the multimedia file.
4. The method of claim 3, wherein determining, based on the voice user features at the multiple levels, the level weight corresponding to the voice user features at each level includes: determining the global average value of the voice user features at each level; performing linear mapping on the global average value corresponding to the voice user features at each level to obtain the initial weights corresponding to the voice user features at each level; and normalizing the initial weights corresponding to the voice user features at each level to obtain the level weights corresponding to the voice user features at each level.
5. The method according to any one of claims 2-4, wherein decoding the processed multidimensional features to obtain the structured description information corresponding to each multimedia file includes: performing fusion processing on the processed multidimensional features to obtain a second fused feature; and inputting the second fused feature and prompt words into a decoder for decoding to obtain the structured description information corresponding to each multimedia file; wherein the structured description information includes at least one of the following: timestamp, voice user identity information, environment type information, and audio content information.
6. A multimedia file search apparatus, characterized by comprising: a receiving module, used to receive a first input; and an output module, used to output a target multimedia file in response to the first input; wherein the target multimedia file is obtained by matching the retrieval element corresponding to the first input with the structured description information in a multimedia file library; the multimedia file library includes multiple multimedia files and the structured description information corresponding to each multimedia file; the structured description information is determined based on the multidimensional features of the multimedia file; and the multidimensional features include at least two of the following: voice user features, environmental features, and semantic features.
7. The apparatus of claim 6, wherein the apparatus further includes: an extraction module, used to extract the basic features of each multimedia file in the multimedia file library before the receiving module receives the first input; an encoding module, used to encode the basic features of each multimedia file according to different dimensions to obtain the multidimensional features of each multimedia file; a processing module, used to perform cross-fusion processing on the multidimensional features to obtain processed multidimensional features; and a decoding module, used to decode the processed multidimensional features to obtain the structured description information corresponding to each multimedia file.
8. The apparatus of claim 7, wherein the multidimensional features include voice user features, and the encoding module includes: an encoding unit, used to perform multi-level encoding on the basic features of the multimedia file to obtain voice user features at multiple levels corresponding to the multimedia file; a determining unit, used to determine, based on the voice user features at the multiple levels, the level weight corresponding to the voice user features at each level; and a fusion unit, used to perform weighted fusion of the voice user features at each level with their corresponding level weights to obtain a first fused feature; and wherein the processing module is further configured to perform mapping processing on the first fused feature to obtain the voice user features corresponding to the multimedia file.
9. The apparatus of claim 8, wherein the determining unit is specifically used to: determine the global average value of the voice user features at each level; and perform linear mapping on the global average value corresponding to the voice user features at each level to obtain the initial weights corresponding to the voice user features at each level; and wherein the processing module is further configured to normalize the initial weights corresponding to the voice user features at each level to obtain the level weights corresponding to the voice user features at each level.
10. The apparatus of any one of claims 7-9, wherein the processing module is further configured to perform fusion processing on the processed multidimensional features to obtain a second fused feature; and the decoding module is specifically used to obtain, based on the second fused feature and prompt words, the structured description information corresponding to each multimedia file; wherein the structured description information includes at least one of the following: timestamp, voice user identity information, environment type information, and audio content information.