An advertisement audio perceptual feature extraction and vectorization method based on a large model

What is AI technical title?
AI technical title is built by PatSnap AI team. It summarizes the technical point description of the patent document.
By constructing feature encoding mapping relationships and structured prompt words, and using large models to analyze advertising audio, we have achieved efficient and low-cost extraction of perceptual features of advertising audio, solving the problem of high manpower and time costs in existing technologies, and improving the efficiency and consistency of feature extraction.

CN122201334APending Publication Date: 2026-06-12BEIHANG UNIV +1

View PDF 0 Cites 0 Cited by

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: BEIHANG UNIV
Filing Date: 2026-02-11
Publication Date: 2026-06-12

Application Information

Patent Timeline

11 Feb 2026

Application

12 Jun 2026

Publication

CN122201334A

IPC: G10L25/03; G10L25/27; G06Q30/0241

AI Tagging

Application Domain

Speech analysis Commerce

Explore More Agents

Novelty Search
Search existing technologies and assess novelty
↗
FTO
Analyze whether a product may infringe others' patents
↗
Design FTO
Check prior-design risk for exterior design
↗
Drafting
Draft patent application text based on a technical solution
↗
Find Solutions with TRIZ
Generate feasible solution to solve your technical challenge
↗

Similar Technology Patents

Sound source isolating device
WO2026126309A1Speech analysis
Method for Driving Face of Virtual Image, Electronic Device, and Non-Transitory Readable Storage Medium
US20260162670A1Character and pattern recognition Animation
Wireless audio system and method for wirelessly communicating audio information
US20260161344A1Headphones for stereophonic communicationSpeech analysis
Methods, apparatus, and systems for enabling adaptive prediction and quantization in frequency domain predictors
WO2026122426A1Speech analysisCode conversion
Systems, devices, and methods for generating vocal data
US20260171102A1Speech analysisElectrostatic transducer microphones

Get free access to AI patent search and analysis

Check patentability, review prior art and ask IP Agent with full patent context.

AI Technical Summary

⚠Technical Problem

Existing technologies struggle to efficiently and cost-effectively extract complex features from advertising audio, resulting in high manpower and time costs and long data preparation cycles, making it difficult to meet the demand for high-quality audio feature data for advertising effectiveness evaluation and optimization.

⚗Method used

By pre-constructing feature encoding mapping relationships, combining them with a large model to build structured prompt words, using the large model to analyze the target audio and output encoded values, and finally integrating them into a feature vector, the automated extraction of perceptual features of advertising audio is achieved.

🎯Benefits of technology

Complex advertising audio features can be automatically extracted without training a dedicated model, significantly reducing labor and time costs and improving feature extraction efficiency and consistency.

✦ Generated by Eureka AI based on patent content.

Smart Images

Figure CN122201334A_ABST

Patent Text Reader

Abstract

The application relates to a large model-based advertisement audio perception feature extraction and vectorization method. The method comprises the following steps: obtaining a pre-constructed feature code mapping relationship for advertisement audio; constructing corresponding structured prompt words for advertisement audio perception features to be extracted based on the feature code mapping relationship; inputting target audio and the structured prompt words into a large model to analyze advertisement audio perception features of the target audio by using the large model, and outputting code values of the advertisement audio perception features in a format specified by the prompt words, the target audio being an audio file corresponding to a feature category of the advertisement audio perception features to be extracted; and in the case that all code values of the advertisement audio perception features of the target audio are obtained by using the large model, integrating all the code values into a feature vector to represent complete advertisement audio perception features of the target audio. The application solves the technical problem that the prior art cannot efficiently and at low cost realize extraction of complex advertisement audio features.

Need to check novelty before this filing date? Find Prior Art

Description

Technical Field

[0001] This application relates to the field of advertising effectiveness evaluation and content generation technology, and in particular to a method for extracting and vectorizing advertising audio perception features based on a large model. Background Technology

[0002] In the field of advertising effectiveness evaluation and content generation, audio, as an important component of advertising, contains rich information, and the quality of its feature extraction plays a crucial role in the training effect of advertising effectiveness prediction models. With the deepening of business needs, in order to accurately predict advertising conversion effects, in addition to basic acoustic attributes such as volume and pitch that can be automatically extracted by signal processing algorithms, it is also necessary to introduce more complex higher-order semantic and perceptual features such as emotional valence, arousal, specific marketing rhetoric, and textual metaphors.

[0003] Currently, existing technologies for labeling and extracting complex features from advertising audio have significant limitations. One approach relies entirely on manual labeling, where professionals manually judge and label data according to pre-defined rules. While this can handle complex features, the labor and time costs increase linearly with the amount of data, resulting in low efficiency, difficulty in handling massive amounts of advertising materials, and long data preparation cycles that severely lag behind business needs. Another approach uses dedicated supervised learning models for automated labeling. However, this method requires training the model with a large amount of high-quality manually labeled data, failing to fundamentally solve the problems of difficult and costly data labeling. Furthermore, it suffers from poor scalability; adding new analysis dimensions requires collecting new data and training a new model, leading to long development cycles, insufficient flexibility, limited generalization ability, and difficulty in understanding complex contextual semantics. Therefore, existing technologies struggle to efficiently and cost-effectively extract complex features from advertising audio, failing to meet the demand for high-quality audio feature data for advertising effectiveness evaluation and optimization.

[0004] There is currently no effective solution to the problem that existing technologies cannot efficiently and cost-effectively extract the complex features of advertising audio. Summary of the Invention

[0005] This application provides a method for extracting and vectorizing perceptual features of advertising audio based on a large model, in order to solve the technical problem that existing technologies are unable to efficiently and cost-effectively extract complex features of advertising audio.

[0006] According to one aspect of the embodiments of this application, this application provides a method for extracting and vectorizing advertising audio perception features based on a large model, including: obtaining a pre-constructed feature encoding mapping relationship for advertising audio; constructing corresponding structured prompt words for the advertising audio perception features to be extracted based on the feature encoding mapping relationship; inputting the target audio and the structured prompt words into a large model to analyze the advertising audio perception features of the target audio using the large model, and outputting the encoded values of the advertising audio perception features according to the format specified by the prompt words, wherein the target audio is an audio file corresponding to the feature category of the advertising audio perception features to be extracted; and, after obtaining the encoded values of all advertising audio perception features of the target audio using the large model, integrating all encoded values into a feature vector to represent the complete advertising audio perception features of the target audio.

[0007] Optionally, constructing corresponding structured prompts for the extracted advertising audio perception features based on the feature encoding mapping relationship includes: obtaining a system prompt template and a user prompt template; filling the feature categories of the extracted advertising audio perception features into the system prompt template to obtain system prompts, wherein the system prompts are used to establish the role setting, global behavioral norms, and task objectives of the large model; finding the feature name, feature description, encoding method, and output format of the extracted advertising audio perception features from the feature encoding mapping relationship and filling them into the user prompt template to obtain user prompts, wherein the user prompts are used to determine the analysis task instructions for the advertising audio perception features; and combining the system prompts and user prompts to obtain the structured prompts constructed for the extracted advertising audio perception features.

[0008] Optionally, constructing the corresponding structured prompt words also includes: determining the feature extraction mode of the large model; when the feature extraction mode of the large model is a single feature extraction mode, constructing a structured prompt word for each advertising audio perception feature to be extracted, so that the large model analyzes only one advertising audio perception feature of an advertising audio segment in a single inference; when the feature extraction mode of the large model is a multi-feature combination mode, determining multiple advertising audio perception features to be extracted selected by the user, and constructing a structured prompt word for the multiple advertising audio perception features to be extracted selected by the user in a single inference, so that the large model analyzes the multiple advertising audio perception features to be extracted selected by the user in a single inference.

[0009] Optionally, constructing the corresponding structured prompt words further includes: performing duration detection on the target audio; if the duration of the target audio exceeds a preset duration threshold, slicing the target audio into multiple audio segments according to a preset time interval; constructing corresponding initial prompt words for each audio segment; calling the large model to generate a global semantic summary of the target audio, and embedding the global semantic summary into the initial prompt words of each audio segment to obtain structured prompt words for each audio segment. The global semantic summary includes the overall content theme, emotional tone, and core information elements of the target audio, and is used as a context to enhance the large model's understanding of individual audio segments.

[0010] Optionally, before inputting the target audio and structured prompts into the large model, the method further includes obtaining the target audio in the following manner: extracting audio from the advertisement video to be analyzed to obtain the original advertisement audio; separating human voice and background sound from the original advertisement audio to obtain human voice audio file and background sound audio file; and using the human voice audio file or background sound audio file as the target audio.

[0011] Optionally, inputting the target audio and structured prompts into the large model includes: if the feature category of the advertising audio perception feature to be extracted is advertising human voice feature, inputting the human voice audio file and the corresponding structured prompts into the large model; if the feature category of the advertising audio perception feature to be extracted is advertising music feature or advertising sound effect feature, inputting the background sound audio file and the corresponding structured prompts into the large model; if the feature category of the advertising audio perception feature to be extracted is advertising text feature, performing speech recognition on the human voice audio file, and inputting the speech recognition result, the human voice audio file, and the corresponding structured prompts into the large model.

[0012] Optionally, integrating all encoded values into a feature vector includes: performing format and value range checks on each encoded value to determine whether the encoded value conforms to the preset output format and the value range specified in the feature encoding mapping relationship; if there are abnormal encoded values with abnormal format or values exceeding the value range, the large model is called again to obtain the encoded value of the corresponding advertising audio perception feature until a valid encoded value cannot be obtained after retrying, and then the default value is filled in according to the preset rules; the verified encoded values are normalized to map the values of the count-type numerical features to the preset standard range; the normalized encoded values are arranged in an orderly manner according to the preset fixed feature order in the feature encoding mapping relationship to form a feature vector representing the complete advertising audio perception feature of the target audio.

[0013] According to another aspect of the embodiments of this application, this application provides an apparatus for extracting and vectorizing advertising audio perception features based on a large model, comprising: an acquisition module for acquiring a pre-constructed feature encoding mapping relationship for advertising audio; a construction module for constructing corresponding structured prompt words for the advertising audio perception features to be extracted based on the feature encoding mapping relationship; an analysis module for inputting the target audio and the structured prompt words into a large model to analyze the advertising audio perception features of the target audio using the large model, and outputting the encoded values of the advertising audio perception features according to the format specified by the prompt words, wherein the target audio is an audio file corresponding to the feature category of the advertising audio perception features to be extracted; and a vectorization module for integrating all encoded values of the advertising audio perception features of the target audio into a feature vector after obtaining the encoded values of all advertising audio perception features of the target audio using the large model, so as to represent the complete advertising audio perception features of the target audio.

[0014] According to another aspect of the embodiments of this application, this application provides an electronic device, including a memory, a processor, a communication interface and a communication bus. The memory stores a computer program that can run on the processor. The memory and the processor communicate with each other through the communication bus and the communication interface. When the processor executes the computer program, it implements the steps of the above method.

[0015] According to another aspect of the embodiments of this application, this application also provides a computer-readable medium having processor-executable non-volatile program code that causes the processor to perform the above-described method.

[0016] Compared with related technologies, the technical solutions provided in this application have the following advantages: This application provides a method for extracting and vectorizing perceptual features of advertising audio based on a large model, including: obtaining a pre-constructed feature encoding mapping relationship for advertising audio; constructing corresponding structured prompt words for the perceptual features of the advertising audio to be extracted based on the feature encoding mapping relationship; inputting the target audio and structured prompt words into a large model to analyze the perceptual features of the target audio using the large model, and outputting the encoded values of the perceptual features of the advertising audio according to the format specified by the prompt words, wherein the target audio is an audio file corresponding to the feature category of the perceptual features of the advertising audio to be extracted; after obtaining the encoded values of all perceptual features of the target audio using the large model, integrating all encoded values into a feature vector to represent the complete perceptual features of the target audio. This application pre-constructs an encoding mapping relationship for perceptual features of advertising audio, constructs structured prompt words based on this mapping relationship, performs feature analysis on the target audio matching the feature category using a large model and outputs standardized encoded values, and finally integrates them into a complete feature vector. This method can automatically extract complex perceptual features of advertising audio without training a dedicated model, significantly reducing manual and time costs, improving feature extraction efficiency and consistency, thereby solving the technical problem that existing technologies cannot efficiently and cost-effectively extract complex features of advertising audio. Attached Figure Description

[0017] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this application and, together with the description, serve to explain the principles of this application.

[0018] To more clearly illustrate the technical solutions in the embodiments of this application or related technologies, the accompanying drawings used in the description of the embodiments or related technologies will be briefly introduced below. Obviously, those skilled in the art can obtain other drawings based on these drawings without creative effort.

[0019] Figure 1 This is a schematic diagram of the hardware environment for an optional large-model-based method for extracting and vectorizing advertising audio perception features according to an embodiment of this application. Figure 2 This is a schematic diagram of an optional method for extracting and vectorizing advertising audio perception features based on a large model, according to an embodiment of this application. Figure 3 This is a block diagram of an optional large-model-based advertising audio perception feature extraction and vectorization device according to an embodiment of this application; Figure 4 This is a schematic diagram of an optional electronic device structure provided in an embodiment of this application. Detailed Implementation

[0020] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0021] In the following description, the use of suffixes such as "module," "part," or "unit" to denote elements is solely for the purpose of illustration and has no specific meaning in itself. Therefore, "module" and "part" may be used interchangeably.

[0022] To address the problems mentioned in the background art, according to one aspect of the embodiments of this application, an embodiment of a method for extracting and vectorizing advertising audio perception features based on a large model is provided.

[0023] Optionally, in the embodiments of this application, the above-described method for extracting and vectorizing advertising audio perception features based on a large model can be applied to, for example... Figure 1 The hardware environment shown consists of terminal 101 and server 103. Figure 1 As shown, server 103 is connected to terminal 101 via a network and can be used to provide services to the terminal or clients installed on the terminal. Database 105 can be set up on the server or independently of the server to provide data storage services for server 103. The network mentioned above includes, but is not limited to, wide area network, metropolitan area network or local area network. Terminal 101 includes, but is not limited to, PC, mobile phone, tablet computer, etc.

[0024] The advertising audio perception feature extraction and vectorization method based on a large model in this application embodiment can be executed by server 103, or it can be jointly executed by server 103 and terminal 101, such as... Figure 2 As shown, the method may include the following steps: Step S202: Obtain the feature encoding mapping relationship pre-constructed for the advertising audio; Step S204: Construct corresponding structured prompt words for the perceptual features of the advertisement audio to be extracted based on the feature encoding mapping relationship; Step S206: Input the target audio and structured prompt words into the large model to analyze the advertising audio perception features of the target audio using the large model, and output the encoded values of the advertising audio perception features according to the format specified by the prompt words. The target audio is an audio file corresponding to the feature category of the advertising audio perception features to be extracted. Step S208: After obtaining the encoded values of all advertising audio perception features of the target audio using a large model, integrate all encoded values into a feature vector to represent the complete advertising audio perception features of the target audio.

[0025] In this embodiment, the feature encoding mapping relationship is a set of standardized mapping rules pre-constructed for various features of advertising audio. It includes information such as feature category, feature name, feature description, encoding method, output format, and feature order, and is used to convert unstructured advertising audio perceptual features into computer-recognizable structured codes. The feature encoding mapping relationship needs to clearly define the advertising audio perceptual features and their encoding rules. The advertising audio perceptual features and encoding rules are described in detail below.

[0026] The perceptual features of advertising audio can be flexibly set according to actual needs. This application embodiment divides the perceptual features of advertising audio to be extracted into four categories: human voice features, music features, sound effect features, and text features in the advertising audio.

[0027] The characteristics of advertising voice features include the voice actor's gender, age, pitch, pitch variation, volume, volume variation, tone style, and emotional arousal. For example, the encoding rules for advertising voice features, such as "gender," first analyze whether a voice exists in a given audio segment. If it does, it is encoded according to the speaker's gender: 0 for male, 1 for female, and "no voice" for the absence of a voice. Continuous numerical features are discretized. For continuous values such as pitch, pitch variation, volume, and volume variation, threshold ranges are set to convert them into three category labels: "low," "medium," and "high."

[0028] The characteristics of advertising music include rhythm, style, and emotional valence. For example, the coding rules for advertising music features, taking "style" as an example, firstly, it's necessary to analyze whether background music exists in the advertising audio; if so, the style type is further identified and assigned a corresponding code. The rhythm feature, being a continuous value, needs to be converted into three category labels: "slow," "medium," and "fast" by setting a threshold.

[0029] Advertising sound effects features include scene sound effects and special effects sound effects. Given an audio segment, it is necessary to analyze and identify whether it contains sound effects, and to count the frequency of different types of sound effects.

[0030] Advertising text characteristics include brand name, distinctive wording, sentence length, text function, marketing focus, text style, and text emotional valence. Taking "brand name" as an example, we first need to analyze whether advertising text exists in the advertising audio; if so, we need to count the number of times the brand name is mentioned.

[0031] As a preferred embodiment, this application provides an example of the encoding rules for the aforementioned advertising voice features, advertising music features, advertising sound effect features, advertising text features, and their encoding rules, as shown in Table 1: Table 1

[0032] In this embodiment, the structured prompts are text instructions built based on preset templates, containing explicit instructions and constraints. They consist of system prompts and user prompts, and are used to guide the large model to accurately execute the task of analyzing and encoding advertising audio perception features. The target audio is an audio file that matches the feature category of the advertising audio perception features to be extracted. It is the direct object of the large model's feature analysis and can be a human voice audio file, background sound audio file, etc. It can be a complete audio file or an audio segment. The large model is an artificial intelligence model with multimodal understanding and cognitive capabilities, capable of receiving audio and text input, performing semantic analysis, feature recognition, and reasoning, and outputting results that meet the requirements. The feature vector is an ordered array formed by arranging all the advertising audio perception feature encoding values of the target audio in a preset order, used to comprehensively and structurally represent the complete advertising audio perception features of the target audio.

[0033] In step S202, the feature encoding mapping relationship is the foundation of the entire method. It is pre-constructed by technical personnel based on the business needs of advertising audio analysis, covering the core feature dimensions of advertising audio and clarifying the definition, encoding rules, output format, and arrangement order of each type of feature. When step S202 is executed, the system directly reads the feature encoding mapping relationship from the database or configuration file, providing a basis for subsequent structured prompt word construction and feature encoding value integration.

[0034] In step S204, structured prompts are the core instructions that guide the large model to accurately complete feature extraction. During the construction process, the feature categories of the perceptual features of the advertisement audio to be extracted can be identified first. Then, based on the commonalities and characteristics of these features, and according to the feature definitions, encoding rules, and output formats in the feature encoding mapping relationship, structured prompts containing core elements such as role settings, task objectives, feature descriptions, encoding requirements, and output formats can be constructed. This enables the large model to clearly understand the analysis tasks, judgment criteria, and output requirements that need to be performed.

[0035] In step S206, the target audio corresponding to the category of features to be extracted must first be determined. This ensures that the input audio file accurately carries the features to be analyzed; for example, when extracting human voice features, the target audio should be a human voice audio file. Then, the target audio file and the constructed structured prompts are input into the large model via an API interface. Upon receiving the input, the large model activates its multimodal understanding mechanism. On one hand, it parses the task instructions and judgment criteria in the structured prompts; on the other hand, it performs acoustic analysis and semantic recognition on the target audio. Combining the encoding rules in the prompts, it judges and encodes the features to be extracted in the target audio, and finally outputs the encoded value corresponding to each feature according to the format specified by the prompts, such as JSON format. In this way, the multimodal understanding capabilities of the large model, combined with modular and hierarchical structured prompt construction technology, can be used to achieve automatic analysis and annotation of advertising audio content.

[0036] In step S208, after the large model completes the analysis of all features to be extracted and outputs the corresponding encoded values, the system summarizes all encoded values. The integration process also strictly follows the preset fixed feature order in the feature encoding mapping relationship, arranging the encoded values of each feature sequentially to form an ordered feature vector. This feature vector completely covers all the key advertising audio perception features of the target audio, realizing the transformation of unstructured advertising audio perception features into structured data, facilitating direct use by downstream systems such as subsequent advertising effect prediction models. If some features fail to obtain encoded values, exception handling can be performed first, followed by the integration operation to ensure the integrity of the feature vector.

[0037] Through steps S202 to S208, this application pre-constructs an advertising audio perception feature encoding mapping relationship, constructs structured prompt words based on this mapping relationship, performs feature analysis on the target audio matching feature categories using a large model, and outputs standardized encoding values, which are finally integrated into a complete feature vector. This allows for the automatic extraction of complex advertising audio perception features without the need to train a dedicated model, significantly reducing manual and time costs, and improving feature extraction efficiency and consistency. This solves the technical problem that existing technologies struggle to efficiently and cost-effectively extract complex features from advertising audio.

[0038] In an optional embodiment, constructing corresponding structured prompts for the perceptual features of the advertisement audio to be extracted based on the feature encoding mapping relationship includes: Step 1: Obtain the system prompt word template and the user prompt word template; Step 2: Fill the feature categories of the advertising audio perception features to be extracted into the system prompt word template to obtain system prompt words. The system prompt words are used to establish the role setting, global behavior norms and task objectives of the large model. Step 3: Find the feature name, feature description, encoding method and output format of the advertising audio perception feature to be extracted from the feature encoding mapping relationship and fill them into the user prompt word template to obtain the user prompt word. The user prompt word is used to determine the analysis task instruction for the advertising audio perception feature. Step 4: Combine the system prompts and user prompts to obtain structured prompts for the perceptual features of the advertisement audio to be extracted.

[0039] To guide the large model in accurately and systematically outputting the required features, this application employs a modular and hierarchical structured cue word construction technique. During feature extraction, specific information can be dynamically filled in for each feature to be extracted based on a predefined structured cue word template, generating the final instructions for executing the analysis task.

[0040] The following sections provide detailed explanations of the system prompt template and system prompt words.

[0041] In this embodiment, the system prompt template is a pre-defined fixed text framework used to define the roles and behavioral norms of a large model. It includes placeholders for core elements such as role settings, global constraints, and task objectives, and can generate personalized system prompts by filling in specific information. Specifically, the system prompt template can be: "You are to play the role of an advertising audio research expert, skilled at identifying structured [feature categories] from advertising audio. You will hear an audio excerpt, and you need to analyze and label the given audio content based on the following feature definitions and classification criteria. Each feature has a clear meaning and preset value options. Based on the audio and its description, determine the best matching option and output in JSON format, including the reasoning, confidence level, and final encoded value. Do not output any Markdown tags or other explanatory text."

[0042] The [feature category] field is the content that needs to be filled in.

[0043] In this embodiment, the system prompt word is a text instruction formed by filling the feature category of the feature to be extracted into the system prompt word template. It is used to establish the role of the large model in the task of extracting perceptual features from advertising audio (e.g., an advertising audio research expert), global behavioral norms (e.g., based on the definition and classification criteria of the following features), and core task objectives (e.g., analyzing and labeling the given audio content, i.e., outputting encoded values). Taking the feature category of extracting advertising human voice features as an example, filling the advertising human voice features into the corresponding placeholders in the above system prompt word template yields the system prompt word: "Please play the role of an advertising audio research expert, skilled at identifying structured vocal features in advertising audio. You will hear an audio excerpt, and you need to analyze and label the given audio content based on the following feature definitions and classification criteria. Each feature has a clear meaning and preset value options. Please determine the best matching option based on the audio and its description, and output the result in JSON format, including the reasoning, confidence level, and final encoded value. Do not output any Markdown tags or other explanatory text."

[0044] In this embodiment, the large model is explicitly designated as an audio analysis expert, requiring it to strictly adhere to preset rules in performing feature extraction tasks. Regarding output constraints, to ensure the interpretability and usability of the results, system prompts can mandate that the model explicitly generate intermediate reasoning processes—that is, a detailed explanation of the judgment basis—before outputting the final feature encoding. Simultaneously, it can also require the output of the prediction confidence level for the judgment result, allowing the system to filter low-quality data accordingly. Furthermore, the prompts can specify that the model must strictly follow a preset output format (such as JSON), prohibiting the addition of any irrelevant explanatory text. The system prompts can be entered at the initial stage of interaction and remain effective throughout all subsequent analysis tasks.

[0045] The following sections provide detailed explanations of user prompt templates and user prompts.

[0046] In this embodiment, the user prompt template is a pre-defined fixed text framework used to clearly define specific analysis task instructions. It includes placeholders for task-related elements such as feature names, feature descriptions, encoding methods, and output formats, and is used to generate user prompts for a specific feature extraction task. Specifically, the user prompt template can be: "Context information: [speech recognition result or global summary]".

[0047] Feature name: [Special certificate name].

[0048] Definition: [Feature description].

[0049] Value option: [Feature encoding].

[0050] Output format example: <[Expected output format]>.

[0051] Among them, [Special Certificate Name], [Feature Description], [Feature Code], and [Expected Output Format] are the contents that need to be filled in according to the feature code mapping relationship.

[0052] In this embodiment, the user prompt word is a text instruction formed by filling the user prompt word template with specific information about the features to be extracted. It clarifies the feature details, judgment criteria, encoding requirements, and output format that the large model needs to analyze, ensuring that the large model accurately executes the specific feature extraction task. Taking the gender of the voice actor in an advertising audio as an example, the feature name, feature description, feature code, and expected output format of the voice actor's gender are found from the feature encoding mapping relationship and filled into the user prompt word template to obtain the user prompt word: Feature name: Gender.

[0053] Definition: The gender of the voice actor in an advertising audio clip.

[0054] Value options: No voice = -1, Male = 0, Female = 1.

[0055] Output format example: {"Reasoning basis": "Brief analysis process...", "Feature value": [Enter only the numeric code], "Confidence level: [a decimal number between 0 and 1]"

[0056] In this embodiment, to enhance the model's ability to judge complex scenes, multimodal contextual information can be optionally added to the user prompts to assist the model's judgment and improve inference performance. To compensate for the information limitations of a single audio modality, the text results after speech recognition can be integrated into the prompts. Simultaneously, when analyzing long audio segments, the large model can first generate a global summary of the entire audio segment and inject it as prior knowledge, effectively preventing feature misjudgment due to missing segment information.

[0057] Furthermore, to further improve accuracy, user prompts can incorporate a few-shot learning mechanism, dynamically embedding 2-5 representative "audio segment-feature standard ground truth" pairs as reference examples to assist the model in aligning feature boundaries. The specific structure and rules of the prompts can also be flexibly adjusted or supplemented according to actual business needs.

[0058] This application constructs structured prompts by using template-based filling and combination, ensuring the standardization, consistency and completeness of the prompts. This avoids problems such as information omissions and logical confusion that may occur when manually writing prompts, reduces the complexity of prompt construction, and enables large models to clearly understand role positioning and task requirements, thereby improving the accuracy and efficiency of feature extraction.

[0059] In an optional embodiment, constructing the corresponding structured prompt words further includes: Step 1: Determine the feature extraction mode for the large model; Step 21: When the feature extraction mode of the large model is single feature extraction mode, construct a structured prompt word for each advertising audio perception feature to be extracted, so that the large model can analyze only one advertising audio perception feature of an advertising audio segment in a single inference. Step 22: When the feature extraction mode of the large model is a multi-feature combination mode, determine the multiple perceptual features of the advertising audio selected by the user to be extracted, and construct a structured prompt word for the multiple perceptual features of the advertising audio selected by the user to be extracted, so that the large model can analyze the multiple perceptual features of the advertising audio selected by the user in a single inference.

[0060] In this embodiment, the feature extraction mode refers to the method by which the large model performs the task of extracting perceptual features from advertising audio. It is divided into single-feature extraction mode and multi-feature combination mode, which can be flexibly selected according to the number of features, feature relevance, and business efficiency requirements. Single-feature extraction mode refers to the extraction mode where the large model analyzes only one perceptual feature of an advertising audio segment in a single inference, allowing the large model to focus on a single feature and improve the accuracy of feature judgment. Multi-feature combination mode is the extraction mode where the large model analyzes multiple perceptual features of advertising audio in a single inference. Users can decide which perceptual features of advertising audio need the model to analyze simultaneously, or directly select any major category such as advertising voice features, advertising music features, advertising sound effect features, or advertising text features, and the model extracts features from multiple perceptual features of advertising audio under a certain major category together.

[0061] In this embodiment, the selection of the feature extraction mode needs to comprehensively consider business requirements. If the number of features to be extracted is small and the requirement for feature extraction accuracy is extremely high, then the single feature extraction mode is selected; if the number of features to be extracted is large, some features are highly correlated, and there are clear requirements for processing efficiency, then the multi-feature combination mode is selected. The determination process can be automatically determined by the system according to preset rules, or it can be manually specified by the user to ensure that the mode selection matches the actual needs.

[0062] When the single-feature extraction mode is selected, the structured prompt word construction process described in the above embodiments is executed separately for each feature to be extracted. That is, for each feature, the system prompt word template and the user prompt word template are obtained separately, and information such as the category, name, description, encoding method, and output format of the feature are filled in to generate an independent structured prompt word. Each prompt word corresponds to only one feature extraction task, so that the large model can focus all computing resources and attention on the feature in a single inference process, avoid judgment errors caused by interference from multiple features, and maximize the accuracy of feature extraction.

[0063] When selecting the multi-feature combination mode, feature correlation clustering is performed first. This involves grouping highly correlated features into the same category based on the feature categories and attributes defined in the feature encoding mapping relationship. After clustering, a structured prompt word is constructed for each feature category. This involves filling the system prompt word template with the category name of the feature, and sequentially filling the user prompt word template with the names, descriptions, encoding methods, and unified output formats of all features under that category, generating a structured prompt word covering all related features of that category. During a single inference iteration of the large model, all features within the category are analyzed simultaneously based on this prompt word, outputting the encoded values corresponding to multiple features. This reduces the number of inference iterations and improves overall processing efficiency while ensuring analytical accuracy.

[0064] This application achieves a dynamic balance between accuracy and efficiency by flexibly selecting feature extraction modes and constructing corresponding structured prompt words. This is possible in scenarios with high accuracy requirements by using a single feature mode to ensure extraction accuracy, and in scenarios with high efficiency requirements by using a multi-feature combination mode to improve processing speed.

[0065] In an optional embodiment, constructing the corresponding structured prompt words further includes: Step 1: Detect the duration of the target audio. Step 2: If the duration of the target audio exceeds a preset duration threshold, slice the target audio at preset time intervals to obtain multiple audio segments. Step 3: Construct initial prompts for each audio segment; Step 4: Call the large model to generate a global semantic summary of the target audio, and embed the global semantic summary into the initial prompt words of each audio segment to obtain the structured prompt words of each audio segment. The global semantic summary includes the overall content theme, emotional tone and core information elements of the target audio. The global semantic summary is used as a context to enhance the large model's understanding of individual audio segments.

[0066] In this embodiment, the preset duration threshold is a pre-set standard for determining whether the target audio needs to be sliced. It is determined based on the large model's audio processing capabilities, feature extraction accuracy requirements, and business scenarios, and is used to avoid incomplete analysis or low inference efficiency due to excessively long audio. Audio slicing refers to the operation of dividing target audio exceeding the preset threshold into multiple consecutive short audio segments according to a preset time interval. Initial prompts are pre-constructed prompts for each audio segment, containing the feature extraction task instructions corresponding to that segment, but without incorporating global information; they are generated solely based on the segment's own analysis needs. The global semantic summary is a comprehensive descriptive information generated by the large model after analyzing the complete target audio. It covers the overall content theme (e.g., product promotion, brand advertising), emotional tone (e.g., positive, serious), and core information elements (e.g., key product selling points, brand name), providing contextual support for the analysis of individual segments.

[0067] In this embodiment, the system obtains the specific duration information of the target audio and compares it with a preset duration threshold to determine whether the target audio is a long-duration audio. The core purpose of duration detection is to identify audio that may cause bias in the analysis of large models. Long-duration audio contains a large amount of information and the scene may change. If analyzed directly as a whole, the large model may miss local features or misjudge feature associations. Therefore, duration detection is needed to filter out audio that requires special processing. When the duration of the target audio exceeds the preset threshold, the slicing process is initiated. During the slicing process, the system uniformly divides the target audio according to a set time interval (such as 1 second, 3 seconds, etc.), generating multiple continuous audio segments, and records the timestamp information of each segment. The timestamp information is used to associate the feature vector with the audio time dimension in the future.

[0068] For each segmented audio fragment, initial prompts are constructed according to the structured prompt construction process described in the above embodiments. The construction of initial prompts needs to consider the feature extraction requirements corresponding to that fragment, filling in information such as feature category, name, description, encoding method, and output format based on the feature encoding mapping relationship, so that the initial prompts can guide the large model to analyze the local features of that fragment. Since each fragment is part of a complete audio file, the initial prompts only focus on the feature analysis of the fragment itself, and the relationship between the fragment and the overall audio is not considered at this stage.

[0069] To guide the model to a more comprehensive understanding of audio content and enhance its understanding of local content, a large model can be invoked to generate a global semantic summary of the target audio. This global semantic summary is then embedded into the initial prompts for each audio segment. Specifically, the complete target audio can be input into the large model, which is guided by preset prompts to analyze the audio as a whole and generate a global semantic summary. This summary must cover the core information of the audio, providing contextual reference for the analysis of local segments. Subsequently, the generated global semantic summary is used as contextual information and embedded into the initial prompts for each audio segment, forming the final structured prompts. The embedded prompts contain both analysis instructions for local segment features and the semantic background of the overall audio, guiding the large model to make judgments based on global information when analyzing local segments. This avoids feature misjudgment caused by isolated segment information and improves the accuracy of feature extraction from long audio clips.

[0070] This application solves the problem of insufficient overall analysis accuracy of long audio by slicing long target audio and embedding global semantic summaries into segment prompts. It also ensures the accuracy of local segment analysis through context enhancement, thus achieving efficient and accurate extraction of complex features from long audio.

[0071] In an optional embodiment, before inputting the target audio and structured cue words into the large model, the method further includes obtaining the target audio in the following manner: Step 1: Extract the audio from the ad video to be analyzed to obtain the original ad audio; Step 2: Separate the human voice from the background sound in the original advertising audio to obtain human voice audio files and background sound audio files; Step 3: Select either a human voice audio file or a background sound audio file as the target audio.

[0072] In this embodiment, the original advertising audio is an unprocessed audio file directly extracted from the advertising video to be analyzed, containing all the sound information of the advertisement. Voice and background sound separation refers to the operation of separating the voice portion from the background sound portion (including background music, scene sound effects, special effects, etc.) in the original advertising audio using audio separation technology. The purpose is to obtain clean voice and background sound audio files, facilitating precise analysis for different types of features. The voice audio file, obtained after voice and background sound separation, contains only the voice of the voice actor in the advertisement and is the core data source for extracting the voice and text features of the advertisement. The background sound audio file, obtained after voice and background sound separation, contains all sounds in the advertisement except for the voice and is the core data source for extracting the music and sound effect features of the advertisement.

[0073] In this embodiment, the advertisement video to be analyzed can be a video file of various formats. The system uses mature audio extraction technology to separate the audio track from the video file and generate the original advertisement audio file. The original advertisement audio contains multiple sound components, and different components correspond to different types of advertisement audio perception features. For example, human voice corresponds to human voice features and text features, and background sound corresponds to music features and sound effect features. This application can use a professional audio separation algorithm to accurately separate the human voice and background sound in the original advertisement audio, remove background sound interference in the human voice and human voice interference in the background sound, and obtain a clean human voice audio file and background sound audio file. According to the category of the advertisement audio perception feature to be extracted, the corresponding audio file is selected as the target audio. If the feature to be extracted is the advertisement human voice feature or the advertisement text feature, the human voice audio file is selected as the target audio because it can accurately carry the feature information related to human voice. If the feature to be extracted is the advertisement music feature or the advertisement sound effect feature, the background sound audio file is selected as the target audio because it can completely contain the feature information related to background sound.

[0074] In this embodiment, by extracting audio from advertising videos and separating human voices from background sounds, a clean and accurate target audio data source is provided for the extraction of different categories of features. This avoids feature misjudgment caused by mutual interference between different sound components, lays the foundation for accurate analysis of subsequent large models, and improves the reliability of overall feature extraction.

[0075] In an optional embodiment, inputting the target audio and structured cue words into the large model includes: When the feature category of the advertising audio perception features to be extracted is advertising human voice features, the human voice audio file and the corresponding structured prompt words are input into the large model; When the feature category of the advertising audio perception features to be extracted is advertising music features or advertising sound effect features, the background audio file and the corresponding structured cue words are input into the large model; When the feature category of the advertising audio perception features to be extracted is advertising text features, speech recognition is performed on the human voice audio file, and the speech recognition results, human voice audio file and corresponding structured prompt words are input into the large model.

[0076] In this embodiment of the application, the voice features of the advertisement are features related to the voice of the voice actor in the advertisement, such as gender, age, pitch, volume, tone style, and emotional arousal of the voice, which reflect the voice actor's voice attributes and expression state.

[0077] If the feature category of the perceptual features to be extracted from the advertising audio is advertising voice features, then all the features to be extracted are related to the voice actor's voice attributes, and the voice audio file is the most accurate data source. Therefore, the voice audio file, along with structured prompts constructed based on the advertising voice features, is input into the large model. After receiving the input, the large model focuses on the acoustic attribute analysis of the voice audio file, and, combined with the encoding rules in the prompts, judges and encodes voice features such as gender, pitch, and tone style, and outputs the corresponding feature encoding values.

[0078] In this embodiment, the advertising music features are those related to the background music in the advertisement, such as music rhythm, music style, and music emotional valence, reflecting the attributes and emotional tone of the background music. The advertising sound effect features are those related to scene sound effects and special effects sound effects in the advertisement, such as the presence and frequency of scene sound effects and the presence and frequency of special effects sound effects, used to enhance the expressiveness and atmosphere of the advertisement.

[0079] If the feature category of the advertising audio perception features to be extracted is advertising music features or advertising sound effect features, then all the features to be extracted are related to the background sound of the advertisement, and the background sound audio file can completely carry relevant information. Therefore, the background sound audio file, along with the structured prompts constructed for the advertising music features or advertising sound effect features, is input into the large model. The large model analyzes the background sound audio file, identifies the rhythm, style, and emotional tone of the background music, as well as the presence and frequency of scene sound effects and special effects, and outputs the corresponding feature encoding values according to the prompt requirements.

[0080] In this embodiment of the application, the advertising text features are features related to the speech-to-text in the advertisement, such as brand name, distinctive words, sentence length, text function, marketing focus, etc., which reflect the content attributes and marketing intent of the advertising text.

[0081] If the feature category of the advertising audio perceptual features to be extracted is advertising text features, and the extraction of text features requires based on the speech content in the advertisement, then speech recognition must first be performed on the human voice audio file to convert the speech into analyzable text information, i.e., the speech recognition result. Subsequently, the speech recognition result, the human voice audio file, and structured prompts constructed based on the advertising text features are input into the large model. The speech recognition result provides direct evidence for text feature analysis, the human voice audio file helps the large model understand the context and emotion of the text, and the structured prompts clarify the analysis standards and coding requirements for text features. The large model combines these three elements for comprehensive analysis, identifying features such as brand name, marketing focus, and distinctive wording in the text, and outputting the corresponding encoded values.

[0082] This application configures input data and structured prompts in a targeted manner according to the attributes of different feature categories, ensuring that the input information acquired by the large model is highly compatible with the features to be extracted, and effectively improving the accuracy of various feature extractions.

[0083] In an optional embodiment, integrating all encoded values into a feature vector includes: Step 1: Perform format verification and value range verification on each encoded value to determine whether the encoded value conforms to the preset output format and the value range specified in the feature encoding mapping relationship; Step 2: If there are abnormal encoded values with abnormal format or values that are outside the range, the large model is called again to obtain the encoded value of the corresponding advertising audio perception feature. If a valid encoded value cannot be obtained after retrying, the default value is filled in according to the preset rules. Step 3: Normalize the verified encoded values to map the values of the count-type numerical features to a preset standard range. Step 4: Arrange the normalized encoded values in an ordered manner according to the preset fixed feature order in the feature encoding mapping relationship to form a feature vector that represents the complete perceptual features of the target audio.

[0084] In this embodiment, format verification refers to checking the format of the encoded values output by the large model to determine whether they conform to the output format specified in the structured prompt words, and excluding encoded values with incorrect formats. Value range verification refers to checking the numerical range of the encoded values output by the large model to determine whether they are within the value range specified in the feature encoding mapping relationship, and excluding outliers that exceed a reasonable range. Default values are pre-set standard values used to fill in invalid encoded values, ensuring the integrity of the feature vector. Normalization processing refers to standardizing the verified encoded values, mapping the values of count-type numerical features (such as the number of times sound effects appear, the number of times brand names appear, etc.) to a preset standard range (such as [0,1]), eliminating the dimensional differences between different features, and facilitating direct use by downstream models.

[0085] In this embodiment, the system reads each feature encoding value output by the large model one by one. First, it performs format verification, checking whether the data type and structure of the encoding value are consistent with the output format specified by the prompt words. Then, it performs value range verification, checking whether the encoding value is within the value range corresponding to the feature by referring to the feature encoding mapping relationship. Through this double verification, valid encoding values with correct format and reasonable values are filtered out, while abnormal encoding values with abnormal format or values outside the range are marked. For marked abnormal encoding values, the system initiates a retry mechanism, re-inputting the corresponding target audio and structured prompt words into the large model and requesting a re-analysis and output of the feature's encoding value. The number of retries can be set according to actual needs. If the encoding value obtained after the retry passes the verification, it is taken as a valid encoding value; if a valid encoding value cannot be obtained after the retry, a default value is filled according to preset rules, such as setting a fixed default value based on the feature type, or using the average encoding value of the feature for similar audio as the default value.

[0086] Considering that the encoded values of different features may have different dimensions, such as the frequency of sound effects appearing (0, 1, 2, 3) and the frequency of brand names appearing (0, 1, 2), direct integration would affect the analysis effect of downstream models. Therefore, the encoded values that pass the verification need to be normalized. That is, for count-type numerical features, a standardization algorithm (such as min-max normalization) is used to map their values to a preset standard range (such as [0,1]). After normalization, all feature encoded values are at a uniform magnitude, eliminating dimension interference and improving the usability of feature vectors. In the embodiments of this application, the count-type numerical features include the frequency of scene sound effects, the frequency of special effects sound effects, and the frequency of brand names appearing.

[0087] The feature encoding mapping predefines a fixed order for all features, such as a broad category order of voice features → music features → sound effect features → text features, followed by prioritization within each category. The system arranges all normalized valid encoded values sequentially according to this fixed order, forming an ordered feature vector. This feature vector fully and structurally represents all the key perceptual features of the target audio for advertising, and can be directly input into downstream systems such as advertising effectiveness prediction models, providing data support for advertising effectiveness evaluation and optimization.

[0088] This application ensures the integrity, accuracy, and standardization of feature vectors by performing double verification, anomaly handling, normalization, and ordered arrangement of encoded values, eliminating the impact of invalid data and dimensional differences, and enabling feature vectors to be directly adapted to downstream application scenarios.

[0089] This application pre-constructs a mapping relationship for perceptual features of advertising audio, constructs structured prompts based on this mapping relationship, performs feature analysis on target audio matching feature categories using a large model, and outputs standardized encoded values. Finally, these are integrated into a complete feature vector. This allows for the automatic extraction of complex perceptual features of advertising audio without the need to train a dedicated model, significantly reducing manual and time costs and improving feature extraction efficiency and consistency. This solves the technical problem that existing technologies struggle to efficiently and cost-effectively extract complex features from advertising audio.

[0090] This application can also employ a cascaded combination of multiple models. First, a speech recognition model is used to transcribe audio into text; then, an audio classification model or signal processing algorithm is used to extract physical acoustic features; finally, the speech recognition text and acoustic feature data are input into a large language model, which performs comprehensive semantic understanding and reasoning, and completes the final feature vector integration. Alternatively, multiple dedicated models with single functions can be used to form a pipeline to collaboratively complete comprehensive feature annotation.

[0091] This application directly leverages the powerful generalization and cognitive capabilities of a general-purpose large model, which can replace most of the traditional manual annotation work, thereby saving high manpower and management costs. At the same time, this solution eliminates the need to collect data separately for each sub-feature to train a dedicated model, effectively addressing the pain points of expensive data annotation and difficult dedicated model construction in traditional solutions, significantly lowering the implementation threshold.

[0092] Machine inference is far faster than human auditory judgment, and the large model strictly adheres to pre-defined encoding standards for output, completely eliminating judgment errors caused by fatigue and subjective emotional fluctuations in manual annotation, ensuring a high degree of consistency in feature data. The automated processing significantly improves annotation efficiency, enabling rapid response to the processing needs of massive amounts of audio data and drastically shortening the data preparation cycle for downstream model training.

[0093] When new analytical features are needed, modifications to the configuration file and prompts are sufficient; the model can be quickly adapted to new annotation tasks without retraining, enabling rapid response to changing business requirements. Furthermore, automated batch processing of multiple audio files and multi-dimensional features allows for efficient and automatic extraction and integration of these features.

[0094] According to another aspect of the embodiments of this application, such as Figure 3 As shown, a device for extracting and vectorizing perceptual features of advertising audio based on a large model is provided, including: The acquisition module 301 is used to acquire the feature encoding mapping relationship pre-built for the advertising audio; Module 303 is used to construct corresponding structured prompt words for the perceptual features of the advertisement audio to be extracted based on the feature encoding mapping relationship; Analysis module 305 is used to input the target audio and structured prompt words into the large model, so as to use the large model to analyze the advertising audio perception features of the target audio, and output the encoded value of the advertising audio perception features according to the format specified by the prompt words. The target audio is an audio file corresponding to the feature category of the advertising audio perception features to be extracted. The vectorization module 307 is used to integrate all encoded values into a feature vector to represent the complete advertising audio perception features of the target audio, given that the encoded values of all advertising audio perception features of the target audio have been obtained using a large model.

[0095] It should be noted that the acquisition module 301 in this embodiment can be used to execute step S202 in this application embodiment, the construction module 303 in this embodiment can be used to execute step S204 in this application embodiment, the analysis module 305 in this embodiment can be used to execute step S206 in this application embodiment, and the vectorization module 307 in this embodiment can be used to execute step S208 in this application embodiment.

[0096] It should be noted that the examples and application scenarios implemented by the above modules and corresponding steps are the same, but are not limited to the content disclosed in the above embodiments. It should also be noted that the above modules, as part of a device, can operate in environments such as... Figure 1 The hardware environment shown can be implemented either through software or through hardware.

[0097] Optionally, this construction module is specifically used for: obtaining system prompt word templates and user prompt word templates; filling the feature categories of the advertising audio perception features to be extracted into the system prompt word templates to obtain system prompt words, wherein the system prompt words are used to establish the role settings, global behavioral norms, and task objectives of the large model; finding the feature names, feature descriptions, encoding methods, and output formats of the advertising audio perception features to be extracted from the feature encoding mapping relationship and filling them into the user prompt word templates to obtain user prompt words, wherein the user prompt words are used to determine the analysis task instructions for the advertising audio perception features; and combining the system prompt words and user prompt words to obtain structured prompt words constructed for the advertising audio perception features to be extracted.

[0098] Optionally, this building module is also used to: determine the feature extraction mode of the large model; when the feature extraction mode of the large model is a single feature extraction mode, construct a structured cue word for each advertising audio perception feature to be extracted, so that the large model analyzes only one advertising audio perception feature in a single inference; when the feature extraction mode of the large model is a multi-feature combination mode, cluster the advertising audio perception features to be extracted according to feature correlation, and construct a structured cue word for each category of advertising audio perception features, so that the large model analyzes multiple related features of a category in a single inference.

[0099] Optionally, the building module is further configured to: perform duration detection on the target audio; if the duration of the target audio exceeds a preset duration threshold, slice the target audio into multiple audio segments according to a preset time interval; construct corresponding initial prompt words for each audio segment; call the large model to generate a global semantic summary of the target audio, and embed the global semantic summary into the initial prompt words of each audio segment to obtain structured prompt words for each audio segment. The global semantic summary includes the overall content theme, emotional tone, and core information elements of the target audio, and is used as a context-enhancing feature for the large model to understand individual audio segments.

[0100] Optionally, the device for extracting and vectorizing perceptual features of advertising audio based on a large model further includes an audio acquisition module, specifically used for: extracting audio from the advertising video to be analyzed to obtain the original advertising audio; separating human voice and background sound from the original advertising audio to obtain human voice audio file and background sound audio file; and using the human voice audio file or background sound audio file as the target audio.

[0101] Optionally, this analysis module is specifically used for: when the feature category of the advertising audio perception feature to be extracted is advertising human voice feature, inputting the human voice audio file and the corresponding structured prompt words into the large model; when the feature category of the advertising audio perception feature to be extracted is advertising music feature or advertising sound effect feature, inputting the background sound audio file and the corresponding structured prompt words into the large model; when the feature category of the advertising audio perception feature to be extracted is advertising text feature, performing speech recognition on the human voice audio file, and inputting the speech recognition result, the human voice audio file, and the corresponding structured prompt words into the large model.

[0102] Optionally, this vectorization module is specifically used for: performing format verification and value range verification on each encoded value to determine whether the encoded value conforms to the preset output format and the value range specified in the feature encoding mapping relationship; if there are abnormal encoded values with abnormal format or exceeding the value range, the large model is called again to obtain the encoded value of the corresponding advertising audio perception feature, until a valid encoded value cannot be obtained after retrying, and the default value is filled in according to the preset rules; the verified encoded values are normalized to map the values of numerical statistical features to the preset standard range; the normalized encoded values are arranged in an orderly manner according to the preset fixed feature order in the feature encoding mapping relationship to form a feature vector representing the complete advertising audio perception feature of the target audio.

[0103] According to another aspect of the embodiments of this application, this application provides an electronic device, such as... Figure 4 As shown, the system includes a memory 401, a processor 403, a communication interface 405, and a communication bus 407. The memory 401 stores a computer program that can run on the processor 403. The memory 401 and the processor 403 communicate through the communication interface 405 and the communication bus 407. When the processor 403 executes the computer program, it implements the steps of the above method.

[0104] The memory and processor in the aforementioned electronic devices communicate with each other via a communication bus and a communication interface. The communication bus can be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. This communication bus can be divided into an address bus, a data bus, a control bus, etc.

[0105] The memory may include random access memory (RAM) or non-volatile memory, such as at least one disk storage device. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.

[0106] The processors mentioned above can be general-purpose processors, including central processing units (CPUs), network processors (NPs), etc.; they can also be digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.

[0107] According to another aspect of the embodiments of this application, a computer program product or computer program is also provided, which includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes the computer instructions, causing the computer device to perform the steps of any of the above embodiments.

[0108] Optionally, in embodiments of this application, the computer-readable medium is configured to store program code for the processor to perform the following steps: Obtain the pre-built feature encoding mapping relationship for the advertising audio; Based on the feature encoding mapping relationship, construct corresponding structured prompt words for the perceptual features of the advertisement audio to be extracted; The target audio and structured cue words are input into a large model to analyze the advertising audio perception features of the target audio and output the encoded values of the advertising audio perception features in accordance with the format specified by the cue words. The target audio is an audio file corresponding to the feature category of the advertising audio perception features to be extracted. Having obtained the encoded values of all advertising audio perceptual features of the target audio using a large model, all encoded values are integrated into a feature vector to represent the complete advertising audio perceptual features of the target audio.

[0109] Optionally, specific examples in this embodiment can refer to the examples described in the above embodiments, and will not be repeated here.

[0110] In specific implementation, the embodiments of this application can be referred to the above embodiments and have corresponding technical effects.

[0111] It is understood that the embodiments described herein can be implemented in hardware, software, firmware, middleware, microcode, or a combination thereof. For hardware implementation, the processing unit can be implemented in one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), general-purpose processors, controllers, microcontrollers, microprocessors, other electronic units for performing the functions described herein, or combinations thereof.

[0112] For software implementation, the techniques described herein can be implemented by units that perform the functions described herein. The software code can be stored in memory and executed by a processor. The memory can be implemented in the processor or external to the processor.

[0113] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

[0114] Those skilled in the art will understand that, for the sake of convenience and brevity, the specific working processes of the systems, devices, and units described above can be referred to the corresponding processes in the foregoing method embodiments, and will not be repeated here.

[0115] In the embodiments provided in this application, it should be understood that the disclosed apparatus and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. For instance, the division of modules is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple modules or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or other forms.

[0116] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0117] In addition, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit.

[0118] If the aforementioned function is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the embodiments of this application, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, ROM, RAM, magnetic disks, or optical disks. It should be noted that in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. In the absence of further restrictions, an element defined by the phrase "comprising a..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0119] The above description is merely a specific embodiment of this application, enabling those skilled in the art to understand or implement this application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of this application. Therefore, this application is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features claimed herein.

Claims

1. A method for extracting and vectorizing perceptual features of advertising audio based on a large model, characterized in that, include: Obtain the pre-built feature encoding mapping relationship for the advertising audio; Based on the feature encoding mapping relationship, construct corresponding structured prompt words for the perceptual features of the advertisement audio to be extracted; The target audio and the structured prompt words are input into a large model to analyze the advertising audio perception features of the target audio using the large model, and the encoded values of the advertising audio perception features are output according to the format specified by the prompt words. The target audio is an audio file corresponding to the feature category of the advertising audio perception features to be extracted. Having obtained the encoded values of all advertising audio perceptual features of the target audio using the large model, all the encoded values are integrated into a feature vector to characterize the complete advertising audio perceptual features of the target audio.

2. The method according to claim 1, characterized in that, The process of constructing corresponding structured prompts based on the feature encoding mapping relationship for the perceptual features of the advertisement audio to be extracted includes: Get system prompt word templates and user prompt word templates; The feature categories of the advertising audio perception features to be extracted are filled into the system prompt word template to obtain system prompt words, wherein the system prompt words are used to establish the role setting, global behavior norms and task objectives of the large model; The feature name, feature description, encoding method, and output format of the advertising audio perception feature to be extracted are found from the feature encoding mapping relationship and filled into the user prompt word template to obtain the user prompt word. The user prompt word is used to determine the analysis task instruction for the advertising audio perception feature. The system prompts and user prompts are combined to obtain the structured prompts constructed for the perceptual features of the advertising audio to be extracted.

3. The method according to claim 2, characterized in that, The construction of the corresponding structured prompt words also includes: Determine the feature extraction mode of the large model; When the feature extraction mode of the large model is a single feature extraction mode, a structured prompt word is constructed for each of the advertising audio perception features to be extracted, so that the large model analyzes only one advertising audio perception feature of an advertising audio segment in a single inference. When the feature extraction mode of the large model is a multi-feature combination mode, multiple perceptual features of the advertising audio selected by the user to be extracted are determined, and a structured prompt word is constructed for the multiple perceptual features of the advertising audio selected by the user to be extracted, so that the large model can perform analysis on the multiple perceptual features of the advertising audio selected by the user in a single inference.

4. The method according to claim 2, characterized in that, The construction of the corresponding structured prompt words also includes: The duration of the target audio is detected; If the duration of the target audio exceeds a preset duration threshold, the target audio is sliced according to a preset time interval to obtain multiple audio segments; Develop corresponding initial prompt words for each audio segment; The large model is invoked to generate a global semantic summary of the target audio, and the global semantic summary is embedded in the initial prompt words of each audio segment to obtain the structured prompt words of each audio segment. The global semantic summary includes the overall content theme, emotional tone and core information elements of the target audio. The global semantic summary is used as context to enhance the large model's understanding of individual audio segments.

5. The method according to claim 1, characterized in that, Before inputting the target audio and the structured prompts into the large model, the method further includes obtaining the target audio in the following manner: The audio of the advertisement video to be analyzed is extracted to obtain the original advertisement audio; The original advertising audio is separated into human voice and background sound to obtain human voice audio file and background sound audio file; The human voice audio file or the background sound audio file is used as the target audio.

6. The method according to claim 5, characterized in that, The process of inputting the target audio and the structured cue words into the large model includes: If the feature category of the perceptual features of the advertising audio to be extracted is advertising human voice features, the human voice audio file and the corresponding structured prompt words are input into the large model; If the feature category of the advertising audio perception feature to be extracted is advertising music feature or advertising sound effect feature, the background sound audio file and the corresponding structured prompt words are input into the large model; If the feature category of the perceptual features of the advertising audio to be extracted is advertising text features, then speech recognition is performed on the human voice audio file, and the speech recognition result, the human voice audio file, and the corresponding structured prompt words are input into the large model.

7. The method according to claim 1, characterized in that, The step of integrating all the encoded values into a feature vector includes: Each encoded value is subjected to format verification and value range verification to determine whether the encoded value conforms to the preset output format and the value range specified in the feature encoding mapping relationship; If there are abnormal encoded values with incorrect format or values that are outside the range, the large model will be called again to obtain the encoded value of the corresponding advertising audio perception feature. If a valid encoded value cannot be obtained after retrying, the default value will be filled in according to the preset rules. The coded values that pass the verification are normalized to map the values of the count-type numerical features to a preset standard range. The normalized encoded values are arranged in an ordered manner according to the preset fixed feature order in the feature encoding mapping relationship to form a feature vector that represents the complete advertising audio perception features of the target audio.

8. A device for extracting and vectorizing perceptual features of advertising audio based on a large model, characterized in that, include: The acquisition module is used to acquire the pre-built feature encoding mapping relationship for the advertising audio; The construction module is used to construct corresponding structured prompt words for the perceptual features of the advertisement audio to be extracted based on the feature encoding mapping relationship; The analysis module is used to input the target audio and the structured prompt words into a large model, so as to use the large model to analyze the advertising audio perception features of the target audio, and output the encoded values of the advertising audio perception features in accordance with the format specified by the prompt words, wherein the target audio is an audio file corresponding to the feature category of the advertising audio perception features to be extracted; The vectorization module is used to integrate all the encoded values of the advertising audio perception features of the target audio into a feature vector, which represents the complete advertising audio perception features of the target audio, after obtaining the encoded values of all advertising audio perception features of the target audio using the large model.

9. An electronic device comprising a memory, a processor, a communication interface, and a communication bus, wherein the memory stores a computer program executable on the processor, and the memory and the processor communicate via the communication bus and the communication interface, characterized in that... When the processor executes the computer program, it implements the advertising audio perception feature extraction and vectorization method based on a large model as described in any one of claims 1 to 7.

10. A computer-readable medium having processor-executable non-volatile program code, characterized in that, The program code causes the processor to execute the advertising audio perception feature extraction and vectorization method based on a large model as described in any one of claims 1 to 7.