Audio representation learning method based on multi-layer temporal pooling

A learning method and temporal pooling technology, applied to audio data retrieval, audio data clustering/classification, speech analysis, etc., which addresses the lack of techniques that can flexibly and efficiently capture temporal dynamic information, with the effect of improving performance.

Publication Date: 2019-10-15 (Inactive)
Owner: HARBIN INST OF TECH

AI Technical Summary

Problems solved by technology

[0004] In order to solve the existing lack of a feature representation technique that can flexibly and efficiently capture the temporal dynamic information of audio samples of arbitrary duration, the present invention provides an audio representation learning method based on multi-layer temporal pooling.



Examples


Specific Embodiment 1

[0019] Specific Embodiment 1: This embodiment is described with reference to Figure 1. The audio representation learning method based on multi-layer temporal pooling given in this embodiment specifically includes the following steps:

[0020] Step 1. Extract the spectral features of each audio sample in the training set and of the audio to be represented, and divide each spectral feature into equal-length segments, obtaining the segment-level time-frequency feature set of the training set and the segment-level time-frequency feature set of the audio to be represented. Train the CNN with the segment-level time-frequency feature set of the training set (the time-frequency features serve as the input of the CNN, and the CNN outputs a representation vector for each segment-level time-frequency feature), and then use the trained CNN as a segment-level feature extractor to extract the segment-level feature rep...
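
The excerpt of Step 1 is truncated before the later steps. Purely as an illustration of how the trained CNN acts as a segment-level feature extractor, the following Python sketch applies an already-trained model to the equal-length segments of one audio sample; the function name, the use of PyTorch, and the choice to take the network output directly as the representation are assumptions of this sketch, not details from the patent.

```python
import numpy as np
import torch

def extract_segment_representations(cnn: torch.nn.Module,
                                     segments: np.ndarray) -> np.ndarray:
    """Apply a trained segment-level CNN to every equal-length segment.

    segments: (num_segments, n_mels, frames_per_segment), e.g. the
    Log-Mel segments of one audio sample. Returns one representation
    vector per segment, to be fed to the temporal pooling network.
    """
    cnn.eval()
    with torch.no_grad():
        x = torch.from_numpy(segments).float().unsqueeze(1)  # add channel dim
        reps = cnn(x)                                         # (num_segments, dim)
    return reps.cpu().numpy()
```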

Specific Embodiment 2

[0022] Specific Embodiment 2: This embodiment differs from Embodiment 1 in that Step 1 specifically includes the following steps:

[0023] Step 1.1. Extract segment-level time-frequency features from the training set and the audio to be represented:

[0024] For each audio sample, extract its 120-dimensional logarithmic Mel energy spectrum (Log-Mel); then cut the logarithmic Mel energy spectrum of each sample into multiple equal-length Log-Mel segments, i.e., segment-level time-frequency features, with 50% overlap between adjacent segments. This yields the segment-level time-frequency feature set of the training set and the segment-level time-frequency feature set of the audio to be represented.
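
A minimal sketch of Step 1.1 using librosa with its default STFT settings: the 120 Mel bands and the 50% overlap between adjacent segments come from the text, while the segment length in frames (frames_per_segment) is a placeholder not specified in this excerpt.

```python
import librosa
import numpy as np

def logmel_segments(path: str, n_mels: int = 120,
                    frames_per_segment: int = 128) -> np.ndarray:
    """Extract a 120-dim Log-Mel spectrogram and cut it into equal-length
    segments with 50% overlap between adjacent segments (Step 1.1).
    Assumes the audio is long enough to yield at least one segment."""
    y, sr = librosa.load(path, sr=None)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    logmel = librosa.power_to_db(mel)            # (n_mels, total_frames)

    hop = frames_per_segment // 2                # 50% overlap
    segments = [logmel[:, s:s + frames_per_segment]
                for s in range(0, logmel.shape[1] - frames_per_segment + 1, hop)]
    return np.stack(segments)                    # (num_segments, n_mels, frames_per_segment)
```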

[0025] Step 1.2. CNN training:

[0026] Normalize the segments in the segment-level time-frequency feature set of the training set obtained in Step 1. Specifically, compute the mean and standard deviation of the segments in the set, and t...
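
The paragraph is cut off after "mean and standard deviation". Assuming the usual standardization (subtract the training-set mean and divide by its standard deviation), a sketch of this normalization could look like the following; computing the statistics per Mel bin is an assumption of the sketch, not a detail from the patent.

```python
import numpy as np

def fit_normalizer(train_segments: np.ndarray):
    """Compute mean and std over the training-set segments (Step 1.2).
    train_segments: (num_segments, n_mels, frames_per_segment)."""
    mean = train_segments.mean(axis=(0, 2), keepdims=True)   # one value per Mel bin (assumed)
    std = train_segments.std(axis=(0, 2), keepdims=True) + 1e-8
    return mean, std

def normalize(segments: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    """Apply the training-set statistics to any segment set."""
    return (segments - mean) / std
```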

Specific Embodiment 3

[0030] Specific Embodiment 3: This embodiment differs from Embodiment 2 in that the CNN comprises two convolutional layers (Conv), three pooling layers (Pool), three fully connected layers (FC), and an output layer. Batch normalization is applied after each convolutional layer and each fully connected layer, and the last (third) pooling layer performs full-time-domain (complete time domain) pooling. The CNN adopts the leaky rectified linear activation function (Leaky ReLU) to retain the negative part of each layer's output vector, so that the network preserves as much dynamic information as possible. The specific network architecture and training parameter settings are shown in Table 1:

[0031] Table 1. CNN network architecture and training parameter settings

[0032]

[0033] As shown in Table 1, each Conv layer (convolutional layer) and FC layer (fully connected laye...
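
Since Table 1 itself is not reproduced in this excerpt, the following PyTorch sketch only mirrors the structure stated in Embodiment 3: two convolutional layers, three pooling layers with the last one pooling over the complete time axis, three fully connected layers plus an output layer, batch normalization after each Conv/FC layer, and Leaky ReLU activations. All channel counts, kernel sizes, and the number of output classes are placeholders, not the values from Table 1.

```python
import torch
import torch.nn as nn

class SegmentCNN(nn.Module):
    """Structure of Embodiment 3; all layer sizes are illustrative placeholders."""

    def __init__(self, n_mels: int = 120, n_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),   # Conv 1
            nn.BatchNorm2d(32),
            nn.LeakyReLU(0.1),
            nn.MaxPool2d((2, 2)),                         # Pool 1
            nn.Conv2d(32, 64, kernel_size=3, padding=1),  # Conv 2
            nn.BatchNorm2d(64),
            nn.LeakyReLU(0.1),
            nn.MaxPool2d((2, 2)),                         # Pool 2
        )
        self.classifier = nn.Sequential(
            nn.Linear(64 * (n_mels // 4), 512),           # FC 1
            nn.BatchNorm1d(512),
            nn.LeakyReLU(0.1),
            nn.Linear(512, 256),                          # FC 2
            nn.BatchNorm1d(256),
            nn.LeakyReLU(0.1),
            nn.Linear(256, 128),                          # FC 3
            nn.BatchNorm1d(128),
            nn.LeakyReLU(0.1),
            nn.Linear(128, n_classes),                    # output layer
        )

    def forward(self, x):                                 # x: (B, 1, n_mels, frames)
        h = self.features(x)
        # Pool 3: full-time-domain pooling collapses the entire time axis.
        h = h.mean(dim=3)                                 # (B, 64, n_mels // 4)
        h = h.flatten(1)
        return self.classifier(h)
```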



Abstract

The invention provides an audio representation learning method based on multi-layer temporal pooling, and belongs to the technical field of audio classification. The method comprises the following steps: first, extract the spectral features of each audio sample in a training set and of the audio to be represented, and cut each spectral feature into equal-length segments; train a CNN with the segment-level time-frequency feature set of the training set, and then use the trained CNN to extract the segment-level feature representation of the audio to be represented; feed the extracted segment-level feature representation into a multi-layer temporal pooling network, in which each temporal pooling layer sequentially performs a nonlinear feature mapping and a temporal encoding operation on its input, and the network finally outputs a representation vector of the audio to be represented. The invention solves the prior-art problem of lacking a feature representation technique that can flexibly and efficiently capture the temporal dynamic information of audio samples of arbitrary duration, and can be used to obtain robust, high-level audio representations.
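
The abstract specifies that each temporal pooling layer performs a nonlinear feature mapping followed by a temporal encoding operation, and that the network finally yields a single representation vector for the whole audio. The concrete mapping and encoding operations are not given in this excerpt, so the sketch below uses placeholders (a linear layer with Leaky ReLU as the mapping, windowed mean pooling as the encoding) purely to illustrate the layer-by-layer structure.

```python
import torch
import torch.nn as nn

class TemporalPoolingLayer(nn.Module):
    """One layer: nonlinear feature mapping followed by temporal encoding.
    Both operations are placeholders for the ones defined in the patent."""

    def __init__(self, in_dim: int, out_dim: int, window: int = 2):
        super().__init__()
        self.mapping = nn.Sequential(nn.Linear(in_dim, out_dim), nn.LeakyReLU(0.1))
        self.window = window

    def forward(self, x):                       # x: (batch, time, in_dim)
        h = self.mapping(x)                     # nonlinear feature mapping
        b, t, d = h.shape
        if t < self.window:                     # very short input: pool everything
            return h.mean(dim=1, keepdim=True)
        t = (t // self.window) * self.window    # drop a ragged tail, if any
        h = h[:, :t].reshape(b, t // self.window, self.window, d)
        return h.mean(dim=2)                    # temporal encoding: pool each window


class MultiLayerTemporalPooling(nn.Module):
    """Stack of temporal pooling layers; the final step collapses the
    remaining time axis into one audio-level representation vector."""

    def __init__(self, dims=(128, 128, 128)):
        super().__init__()
        self.layers = nn.ModuleList(
            [TemporalPoolingLayer(dims[i], dims[i + 1]) for i in range(len(dims) - 1)])

    def forward(self, segment_reps):            # (batch, num_segments, dim)
        h = segment_reps
        for layer in self.layers:
            h = layer(h)
        return h.mean(dim=1)                    # final audio-level vector
```

Stacking the segment-level representations produced by the CNN into a (batch, num_segments, dim) tensor and passing them through such a network yields one fixed-length vector per audio clip, regardless of the clip's duration.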

Description

Technical Field

[0001] The invention relates to an audio representation learning method and belongs to the technical field of audio classification.

Background Technique

[0002] With modern society's heavy dependence on Internet technology and the rapid development of multimedia technology, massive amounts of audio data flood people's daily life and work. How to automatically and efficiently process, classify, and recognize large amounts of audio data has become an urgent and challenging research topic. As a human-computer interaction technology involving sound signal processing, audio classification is widely used in various artificial intelligence fields, such as content-based audio retrieval, robust speech recognition, intelligent security monitoring, and unmanned driving. Audio data classification aims to classify and identify different sound data according to predefined semantics; common semantics include background sounds in acoustic scenes such as airports, subw...

Claims


Application Information

Patent Type & Authority: Application (China)
IPC(8): G06F16/65; G06F16/683; G10L15/02; G10L15/06; G10L25/30
CPC: G10L15/02; G10L15/063; G10L25/30; G06F16/65; G06F16/683
Inventors: 韩纪庆, 张力文, 郑铁然, 郑贵滨
Owner: HARBIN INST OF TECH